DASlab  ·  Harvard University

LegoAI

Composable, production-ready building blocks for large-scale AI training:
mix and match parallelism strategies like Lego bricks.

In collaboration with  Meta AI  ·  ICLR 2025

Overview

Training a large language model today means orchestrating thousands of GPUs across a massive design space of parallelism strategies, memory optimizations, and hardware configurations. One wrong choice can waste millions of dollars in compute. LegoAI is our research program at DASlab that tackles this challenge head-on by building composable, modular tools — just like Lego bricks, you assemble only the pieces you need.

The program currently includes two systems: TorchTitan, a production-grade distributed training framework that makes it easy to combine different parallelism techniques in any configuration, and TorchSim, a fast simulator that lets you predict how a training run will perform — without ever touching a GPU cluster.

Together, these tools dramatically reduce the barrier to training frontier AI models at scale, enabling researchers to explore, iterate, and innovate faster.

Why "Lego"?

Modern AI systems are too rigid. We build general-purpose composable primitives that snap together seamlessly — giving researchers the flexibility to explore the full design space without re-engineering the stack.

Highlights

TorchTitan delivers dramatic speedups across model scales, measured on NVIDIA H100 GPUs training Llama 3.1 models.

- 65% throughput gain · Llama 3.1 8B on 128 GPUs
- 30% throughput gain · Llama 3.1 405B on 512 GPUs
- 15× checkpoint speedup · distributed checkpointing
- 4D parallelism · DP + TP + PP + CP, fully composable
- 262K-token context · long-context training on 8 GPUs
- 0 GPUs for TorchSim · simulation on a commodity CPU only

TorchTitan: Production-Ready LLM Pretraining

Training the next generation of LLMs requires combining multiple parallelism techniques simultaneously. TorchTitan is an open-source, PyTorch-native framework that makes this composable and easy.

4D Parallelism

Compose data, tensor, pipeline, and context parallelism in any combination.

Float8 Training

Hardware-software co-design for maximum throughput on modern accelerators.

Distributed Checkpointing

Efficient fault-tolerant checkpointing that reduces overhead by up to 15×.

Flight Recorder

Built-in debugging tools to diagnose distributed training failures at scale.

Existing training systems are often monolithic — once you pick a configuration, you are locked in. TorchTitan was designed from the ground up around 4D parallelism: the seamless combination of Data Parallel (DP), Tensor Parallel (TP), Pipeline Parallel (PP), and Context Parallel (CP) in any configuration you choose. This composability means that the same codebase that trains an 8-billion-parameter model on 8 GPUs can scale — without modification — to a 405-billion-parameter model on 512 GPUs.
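In practice, this composition is driven by a declarative config file rather than code changes: you pick a degree for each parallelism dimension and their product gives the total GPU count. A hypothetical sketch in TorchTitan's TOML style (section and key names are illustrative assumptions; see the train_configs directory in the repository for the exact schema):

```toml
# Illustrative TorchTitan-style config; key names vary by version.

[model]
name = "llama3"
flavor = "8B"

[parallelism]
data_parallel_shard_degree = 8   # FSDP-style sharded data parallel
tensor_parallel_degree = 4       # intra-node tensor parallel
pipeline_parallel_degree = 2     # inter-node pipeline stages
context_parallel_degree = 2      # shards the sequence dimension
# total GPUs = 8 * 4 * 2 * 2 = 128
```

Changing the scale of a run then amounts to editing these degrees, not rewriting model code.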

TorchTitan also integrates hardware-software co-designed optimizations, including Float8 mixed-precision training and Asynchronous Tensor Parallel (AsyncTP), which overlaps computation and communication to maximize hardware utilization. Production features like distributed checkpointing and a built-in flight recorder for debugging make TorchTitan ready for real workloads — not just research demos.

TorchSim: Navigate the Design Space Without a GPU

TorchTitan exposes a vast configuration space. TorchSim lets you explore it entirely on a commodity CPU — no GPU cluster required.

The Problem

The design space spans parallelism strategies, memory optimizations, and precision settings; benchmarking thousands of configurations empirically would burn enormous amounts of GPU time and money before a single model ever trains.

Our Solution

TorchSim simulates a full distributed training run — runtime and memory — by mirroring the real PyTorch execution engine with FakeTensors and FakeProcessGroups, all on an ordinary laptop.

Rich Insights

Beyond a single number, TorchSim breaks down runtime into exposed vs. overlapped communication, and memory into parameters, activations, gradients, and optimizer states — at per-module granularity.

TorchSim is built around the insight that you only need tensor metadata — shapes, dtypes, and device placement — to faithfully simulate a training step. By running under PyTorch's FakeTensorMode, TorchSim executes your exact training script (forward pass, backward pass, optimizer step) without any actual data. It intercepts every operator dispatch and synchronization primitive, building a precise timeline of compute and communication that it simulates using hardware-aware cost models. The result: evaluate thousands of training configurations in minutes and arrive at your GPU cluster with confidence.

Publications

TorchTitan: One-stop PyTorch Native Solution for Production Ready LLM Pre-training
Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos
International Conference on Learning Representations (ICLR), 2025
Sanket Purandare, under the supervision of Stratos Idreos
Doctoral Dissertation, Harvard University, 2025
Sanket Purandare, Emma Yang, Andrew Zhao, Qitong Wang, Wei Feng, Alban Desmaison, Andrew Gu, Tianyu Liu, Less Wright, Gokul Nadathur, Stratos Idreos
ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 2025

People