3× Faster Multi-Model Training — a novel compiler that maximizes GPU utilization
through orchestration of fusion, swapping, and memory optimization.
Modern GPUs are severely underutilized (~50%) during neural network training — and it gets worse as GPUs get faster.
Deep learning workflows are ubiquitous but expensive. Training BERT — a language model with 340 million parameters — takes 79 hours on 64 high-end GPUs, costs approximately $12,000 USD, and emits as much carbon as six US cars over their lifetimes. Real-world workflows multiply this cost: neural architecture search, hyperparameter tuning, and ensemble learning all require training many models.
State-of-the-art approaches tackle low GPU utilization via two strategies: (1) increasing the mini-batch size to boost data parallelism, and (2) horizontal fusion — fusing identical operators across concurrently trained models into a single operator. However, both strategies are limited by GPU memory: fusing just four state-of-the-art models exceeds the memory capacity of an NVIDIA A100 GPU.
Existing memory optimization techniques — tensor swapping (offloading tensors to host memory) and tensor recomputation (discarding and recomputing feature maps) — do not scale to concurrent multi-model training. Tensor swapping causes massive computation stalls as model sizes grow, while tensor recomputation adds redundant compute. Directly applying these techniques to horizontally fused models can cause up to a 50% slowdown.
Our insight: training performance is a function of a three-way trade-off between compute utilization, peak memory consumption, and the degree of independence between operations. μ-TWO is a novel compiler that automatically navigates this trade-off for any given set of models and target GPU, generating tailored training schedules that achieve up to 3× speedup.
Scalable concurrent training across diverse models, hardware, and applications —
from vision to NLP to recommendation systems.
μ-TWO navigates the three-way trade-off to find the optimal balance
for any given set of models and target GPU.
A seven-stage compilation pipeline that automatically generates tailored training schedules for any given set of models and target GPU.
1. Partition the input model array into sub-arrays to balance fusion and independence.
2. Fuse operators within each sub-array to saturate GPU compute.
3. Derive forward and backward computational graphs with data dependencies.
4. Collect static analysis, run-time, and memory usage statistics.
5. Decide swap vs. recompute, multiplex operations, and simulate memory.
6. Merge graph pairs with scheduling hints for swap, offload, and recompute.
7. Execute merged graphs using CUDA streams for zero-stall training.
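At a very high level, the seven stages could be sketched as a chain of passes. Every function name and data structure below is an illustrative stand-in, not μ-TWO's actual API:

```python
# Hypothetical sketch of the seven-stage compilation pipeline. The real
# system builds on FX, AOT Autograd, vmap, and CUDA Graphs rather than
# these toy stand-ins; names and structures here are assumptions.

def partition(models, sub_array_size=2):
    """Stage 1: split the model array into sub-arrays."""
    return [models[i:i + sub_array_size]
            for i in range(0, len(models), sub_array_size)]

def fuse(sub_array):
    """Stage 2: horizontally fuse identical operators within a sub-array."""
    return {"fused_models": sub_array}

def trace(fused):
    """Stage 3: derive forward/backward graphs with data dependencies."""
    return {"forward": fused, "backward": fused}

def profile(graphs):
    """Stage 4: collect static-analysis, run-time, and memory statistics."""
    return {"peak_memory_per_graph": [1.0 for _ in graphs]}

def schedule(graphs, stats, memory_limit):
    """Stage 5: choose swap vs. recompute, multiplex, simulate memory."""
    return {"fits": sum(stats["peak_memory_per_graph"]) <= memory_limit}

def merge(graphs, plan):
    """Stage 6: merge graph pairs and attach scheduling hints."""
    return {"graphs": graphs, "plan": plan}

def compile_schedule(models, memory_limit):
    """Stages 1-6; stage 7 would execute the result on CUDA streams."""
    sub_arrays = partition(models)
    fused = [fuse(sa) for sa in sub_arrays]
    graphs = [trace(f) for f in fused]
    stats = profile(graphs)
    plan = schedule(graphs, stats, memory_limit)
    return merge(graphs, plan)
```

The point of the sketch is the data flow: each stage consumes the previous stage's output, so a decision made early (sub-array size) shapes everything downstream.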
The core design choice in μ-TWO is sub-array fusion. Rather than fusing all models together (maximum compute but no overlap opportunities and high memory) or keeping them all separate (maximum independence but poor compute utilization), μ-TWO partitions the model array into sub-arrays and fuses within each.
This unlocks the key insight: the forward and backward passes of different sub-arrays are independent. μ-TWO multiplexes backward pass operations from one sub-array with forward pass operations from another, overlapping any necessary data swaps with useful compute — so GPUs never have to wait for data.
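Horizontal fusion itself can be illustrated with a small NumPy sketch (a CPU stand-in for the GPU kernels): stacking the weights of N identical single-layer models turns N separate matmuls into one batched matmul. The shapes and the einsum spelling are illustrative assumptions, not the operators μ-TWO actually fuses:

```python
import numpy as np

def separate_forward(weights, x):
    """N separate matmuls -- one kernel launch per model."""
    return [w @ x for w in weights]

def fused_forward(stacked_weights, x):
    """One batched matmul over stacked weights -- a single fused kernel."""
    return np.einsum("nij,jk->nik", stacked_weights, x)

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 8)) for _ in range(3)]  # 3 identical models
x = rng.standard_normal((8, 5))                            # shared input batch

fused = fused_forward(np.stack(weights), x)
for i, y in enumerate(separate_forward(weights, x)):
    assert np.allclose(fused[i], y)  # fused output matches per-model outputs
```

One large batched operation saturates the GPU far better than many small independent launches, which is why fusion lifts compute utilization — at the cost of multiplying memory, which is where the sub-array partitioning comes in.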
The Scheduler uses a greedy policy that selects tensors for swapping or recomputation based on their inactive time (how long they sit idle in memory) and recompute ratio (memory saved per unit of recomputation time). A memory simulator validates every decision against the GPU memory limit, ensuring the generated schedule is always feasible.
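A toy version of such a greedy policy might look like the following. The field names, bandwidth model, and thresholds are illustrative assumptions, not μ-TWO's actual implementation:

```python
# Toy greedy swap-vs-recompute policy with a simulated memory check.
# All names and cost models here are hypothetical stand-ins.

def plan_memory(tensors, memory_limit, bandwidth=1.0):
    """Greedily free tensors until simulated peak memory fits the limit.

    tensors: dicts with name, size (MB), inactive_time (ms), and
    recompute_time (ms) for each candidate feature map.
    """
    # Prefer tensors that sit idle longest: their swaps are easiest to
    # overlap with compute from another sub-array.
    candidates = sorted(tensors, key=lambda t: t["inactive_time"], reverse=True)
    peak = sum(t["size"] for t in tensors)  # simulated peak memory
    decisions = {}
    for t in candidates:
        if peak <= memory_limit:
            break  # memory simulator says the schedule is feasible
        swap_time = t["size"] / bandwidth
        if t["inactive_time"] >= swap_time:
            decisions[t["name"]] = "swap"        # transfer hides behind compute
        elif t["size"] / t["recompute_time"] > 1.0:
            decisions[t["name"]] = "recompute"   # cheap memory per compute unit
        else:
            continue                             # keeping it resident is cheaper
        peak -= t["size"]
    return decisions, peak
```

In this sketch, a long-idle tensor is swapped (its transfer can hide behind compute from another sub-array), a short-idle but cheap-to-rebuild tensor is recomputed, and the simulated peak is re-checked after every choice — mirroring the feasibility guarantee described above.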
Balances compute saturation with overlap opportunities.
Overlaps swaps with forward ops from other sub-arrays.
Only 4 iterations to collect all necessary statistics.
Built with FX, AOT Autograd, vmap, and CUDA Graphs.
μ-TWO is the only approach that simultaneously achieves high compute utilization, high memory utilization, and out-of-memory support for multi-model training.
| Technique | Out-of-Memory Support | Compute Overhead | Stalling |
|---|---|---|---|
| vDNN (Swapping) | ✓ | None | High |
| Checkmate (Recomputation) | ✓ | High | None |
| Capuchin (Hybrid) | ✓ | High | Low |
| HFTA (Fusion only) | ✗ | None | None |
| μ-TWO | ✓ | Low | None |
vs. HFTA (state-of-the-art): HFTA achieves good compute utilization through horizontal fusion but offers no memory optimization. When models do not fit in GPU memory, HFTA must train subsets sequentially. μ-TWO enables concurrent training of 3–5× more models on the same GPU with up to 3× speedup.
vs. HFTA-Capuchin: Naively applying state-of-the-art memory optimization (Capuchin) to HFTA degrades performance as the number of models grows, due to excessive recomputation. μ-TWO's sub-array fusion and intelligent multiplexing overlap swaps 2–3× more effectively, with less than half the recomputation overhead.
| Feature | HFTA | μ-TWO |
|---|---|---|
| Out-of-memory support | ✗ | ✓ |
| Large mini-batch size | ✗ | ✓ |
| Large model size | ✗ | ✓ |
| Large number of models | ✗ | ✓ |
| High memory utilization | ✗ | ✓ |
| High compute utilization | ✓ | ✓ |