3× Faster Multi-Model Training — a novel compiler that maximizes GPU utilization
through orchestration of fusion, swapping, and memory optimization.
Modern GPUs are severely underutilized (~50%) during neural network training — and it gets worse as GPUs get faster.
Deep learning workflows are ubiquitous but expensive. Training BERT — a language model with 340 million parameters — takes 79 hours on 64 high-end GPUs, costs approximately $12,000 USD, and emits as much carbon as six US cars over their lifetimes. Real-world workflows multiply this cost: neural architecture search, hyperparameter tuning, and ensemble learning all require training many models.
State-of-the-art approaches tackle low GPU utilization via two strategies: (1) increasing the mini-batch size to boost data parallelism, and (2) horizontal fusion — fusing identical operators across concurrently trained models into a single operator. However, both strategies are limited by GPU memory: fusing just four state-of-the-art models exceeds the memory capacity of an NVIDIA A100 GPU.
Existing memory optimization techniques — tensor swapping (offloading tensors to host memory) and tensor recomputation (discarding and recomputing feature maps) — do not scale to concurrent multi-model training. Tensor swapping causes massive computation stalls as model sizes grow, while tensor recomputation adds redundant compute. Directly applying these techniques to horizontally fused models can cause up to a 50% slowdown.
Our insight: training performance is a function of a three-way trade-off between compute utilization, peak memory consumption, and the degree of independence between operations. μ-TWO is a novel compiler that automatically navigates this trade-off for any given set of models and target GPU, generating tailored training schedules that achieve up to 3× speedup.
Scalable concurrent training across diverse models, hardware, and applications —
from vision to NLP to recommendation systems.
μ-TWO navigates the three-way trade-off to find the optimal balance
for any given set of models and target GPU.
A seven-stage compilation pipeline that automatically generates tailored training schedules for any given set of models and target GPU.
1. Partition the input model array into sub-arrays to balance fusion and independence.
2. Fuse operators within each sub-array to saturate GPU compute.
3. Derive forward and backward computational graphs with data dependencies.
4. Collect static analysis, run-time, and memory usage statistics.
5. Decide swap vs. recompute, multiplex operations, and simulate memory.
6. Merge graph pairs with scheduling hints for swap, offload, and recompute.
7. Execute merged graphs using CUDA streams for zero-stall training.
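At a very high level, the seven stages could be sketched as a chain of passes. Every function name and data structure below is an illustrative stand-in, not μ-TWO's actual API:

```python
# Hypothetical sketch of the seven-stage compilation pipeline. The real
# system builds on FX, AOT Autograd, vmap, and CUDA Graphs rather than
# these toy stand-ins; names and structures here are assumptions.

def partition(models, sub_array_size=2):
    """Stage 1: split the model array into sub-arrays."""
    return [models[i:i + sub_array_size]
            for i in range(0, len(models), sub_array_size)]

def fuse(sub_array):
    """Stage 2: horizontally fuse identical operators within a sub-array."""
    return {"fused_models": sub_array}

def trace(fused):
    """Stage 3: derive forward/backward graphs with data dependencies."""
    return {"forward": fused, "backward": fused}

def profile(graphs):
    """Stage 4: collect static-analysis, run-time, and memory statistics."""
    return {"peak_memory_per_graph": [1.0 for _ in graphs]}

def schedule(graphs, stats, memory_limit):
    """Stage 5: choose swap vs. recompute, multiplex, simulate memory."""
    return {"fits": sum(stats["peak_memory_per_graph"]) <= memory_limit}

def merge(graphs, plan):
    """Stage 6: merge graph pairs and attach scheduling hints."""
    return {"graphs": graphs, "plan": plan}

def compile_schedule(models, memory_limit):
    """Stages 1-6; stage 7 would execute the result on CUDA streams."""
    sub_arrays = partition(models)
    fused = [fuse(sa) for sa in sub_arrays]
    graphs = [trace(f) for f in fused]
    stats = profile(graphs)
    plan = schedule(graphs, stats, memory_limit)
    return merge(graphs, plan)
```

The point of the sketch is the data flow: each stage consumes the previous stage's output, so a decision made early (sub-array size) shapes everything downstream.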
The core design choice in μ-TWO is sub-array fusion. Rather than fusing all models together (maximum compute but no overlap opportunities and high memory) or keeping them all separate (maximum independence but poor compute utilization), μ-TWO partitions the model array into sub-arrays and fuses within each.
This unlocks the key insight: the forward and backward passes of different sub-arrays are independent. μ-TWO multiplexes backward pass operations from one sub-array with forward pass operations from another, overlapping any necessary data swaps with useful compute — so GPUs never have to wait for data.
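Horizontal fusion itself can be illustrated with a small NumPy sketch (a CPU stand-in for the GPU kernels): stacking the weights of N identical single-layer models turns N separate matmuls into one batched matmul. The shapes and the einsum spelling are illustrative assumptions, not the operators μ-TWO actually fuses:

```python
import numpy as np

def separate_forward(weights, x):
    """N separate matmuls -- one kernel launch per model."""
    return [w @ x for w in weights]

def fused_forward(stacked_weights, x):
    """One batched matmul over stacked weights -- a single fused kernel."""
    return np.einsum("nij,jk->nik", stacked_weights, x)

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 8)) for _ in range(3)]  # 3 identical models
x = rng.standard_normal((8, 5))                            # shared input batch

fused = fused_forward(np.stack(weights), x)
for i, y in enumerate(separate_forward(weights, x)):
    assert np.allclose(fused[i], y)  # fused output matches per-model outputs
```

One large batched operation saturates the GPU far better than many small independent launches, which is why fusion lifts compute utilization — at the cost of multiplying memory, which is where the sub-array partitioning comes in.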
The Scheduler uses a greedy policy that selects tensors for swapping or recomputation based on their inactive time (how long they sit idle in memory) and recompute ratio (memory saved per unit of recomputation time). A memory simulator validates every decision against the GPU memory limit, ensuring the generated schedule is always feasible.
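A toy version of such a greedy policy might look like the following. The field names, bandwidth model, and thresholds are illustrative assumptions, not μ-TWO's actual implementation:

```python
# Toy greedy swap-vs-recompute policy with a simulated memory check.
# All names and cost models here are hypothetical stand-ins.

def plan_memory(tensors, memory_limit, bandwidth=1.0):
    """Greedily free tensors until simulated peak memory fits the limit.

    tensors: dicts with name, size (MB), inactive_time (ms), and
    recompute_time (ms) for each candidate feature map.
    """
    # Prefer tensors that sit idle longest: their swaps are easiest to
    # overlap with compute from another sub-array.
    candidates = sorted(tensors, key=lambda t: t["inactive_time"], reverse=True)
    peak = sum(t["size"] for t in tensors)  # simulated peak memory
    decisions = {}
    for t in candidates:
        if peak <= memory_limit:
            break  # memory simulator says the schedule is feasible
        swap_time = t["size"] / bandwidth
        if t["inactive_time"] >= swap_time:
            decisions[t["name"]] = "swap"        # transfer hides behind compute
        elif t["size"] / t["recompute_time"] > 1.0:
            decisions[t["name"]] = "recompute"   # cheap memory per compute unit
        else:
            continue                             # keeping it resident is cheaper
        peak -= t["size"]
    return decisions, peak
```

In this sketch, a long-idle tensor is swapped (its transfer can hide behind compute from another sub-array), a short-idle but cheap-to-rebuild tensor is recomputed, and the simulated peak is re-checked after every choice — mirroring the feasibility guarantee described above.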
Balances compute saturation with overlap opportunities.
Overlaps swaps with forward ops from other sub-arrays.
Only 4 iterations to collect all necessary statistics.
Built with FX, AOT Autograd, vmap, and CUDA Graphs.
μ-TWO is the only approach that simultaneously achieves high compute utilization, high memory utilization, and out-of-memory support for multi-model training.
| Technique | Out-of-Memory Support | Compute Overhead | Stalling |
|---|---|---|---|
| vDNN (Swapping) | ✓ | None | High |
| Checkmate (Recomputation) | ✓ | High | None |
| Capuchin (Hybrid) | ✓ | High | Low |
| HFTA (Fusion only) | ✗ | None | None |
| μ-TWO | ✓ | Low | None |
vs. HFTA (state-of-the-art): HFTA achieves good compute utilization through horizontal fusion but offers no memory optimization. When models do not fit in GPU memory, HFTA must train subsets sequentially. μ-TWO enables concurrent training of 3–5× more models on the same GPU with up to 3× speedup.
vs. HFTA-Capuchin: Naively applying state-of-the-art memory optimization (Capuchin) to HFTA degrades performance as the number of models grows, due to excessive recomputation. μ-TWO's sub-array fusion and intelligent multiplexing overlap swaps 2–3× more effectively, with less than half the recomputation overhead.
| Feature | HFTA | μ-TWO |
|---|---|---|
| Out-of-memory support | ✗ | ✓ |
| Large mini-batch size | ✗ | ✓ |
| Large model size | ✗ | ✓ |
| Large number of models | ✗ | ✓ |
| High memory utilization | ✗ | ✓ |
| High compute utilization | ✓ | ✓ |