20-04 โ Distributed Training Mathematics
Phase: 20 โ Training & Fine-tuning Mathematics Subject: 20-04 Prerequisites: 20-03 (Batch Size and Gradient Accumulation), 14-03 (SGD Variants), 17-05 (Attention Mathematics โ memory analysis), 09-04 (Spectral Theorem and Quadratic Forms โ for communication analysis), 15-01 (Floating Point Arithmetic โ precision tradeoffs) Next subject: 20-05 โ Instruction Tuning (SFT)
Learning Objectives
By the end of this subject, you will be able to:
- Derive the all-reduce communication pattern and compute its bandwidth cost as a function of parameter count and GPU count
- Analyze the three ZeRO stages mathematically โ deriving memory savings for optimizer states, gradients, and parameters
- Compute the memory footprint of pipeline parallelism with K micro-batches and derive the bubble overhead formula
- Compare tensor parallelism (intra-layer) vs pipeline parallelism (inter-layer) vs data parallelism in terms of communication volume and scalability
- Design a 3D parallel training configuration given model size, GPU count, and interconnect bandwidth constraints
Core Content
1. Why Distributed Training?
A single GPU is insufficient for modern LLMs:
| Model | Parameters | fp16 Size | Adam States (fp32) | Total (~) |
|---|---|---|---|---|
| Llama 7B | 7B | 14 GB | 56 GB | 70 GB |
| Llama 70B | 70B | 140 GB | 560 GB | 700 GB |
| GPT-3 175B | 175B | 350 GB | 1.4 TB | 1.75 TB |
| Llama 3 405B | 405B | 810 GB | 3.24 TB | 4.05 TB |
An A100 80GB can't even hold a 70B model's weights + Adam states. We MUST distribute across GPUs.
The three dimensions of distribution:
- Data Parallelism (DP): Split the batch across GPUs; each has a full model copy
- Model Parallelism: Split the MODEL across GPUs
- Tensor Parallelism (TP): Split individual layers (e.g., each GPU holds part of each weight matrix)
- Pipeline Parallelism (PP): Split layers across GPUs (GPU 0 has layers 0-7, GPU 1 has layers 8-15, etc.)
- ZeRO (Zero Redundancy Optimizer): Shard optimizer states, gradients, and parameters across DP GPUs
โ ๏ธ THIS IS CRITICAL โ Modern LLM training combines ALL of these (3D parallelism). Understanding the mathematics of each allows you to compute optimal GPU allocation.
2. Data Parallelism
Setup: N GPUs, each with a full model copy. Each GPU processes B/N samples per step.
Algorithm:
For each GPU k in parallel:
1. Forward pass on local micro-batch
2. Backward pass โ local gradients g_k
3. ALL-REDUCE: g = (1/N) ฮฃ g_k (average gradients across all GPUs)
4. Optimizer step: ฮธ โ ฮธ โ ฮทยทg (identical step on all GPUs)
All-reduce mathematics: The all-reduce operation computes the sum (or average) of values across all GPUs and distributes the result. The most efficient algorithm is the ring all-reduce:
- N GPUs arranged in a logical ring
- Data (gradients) of size D bytes
- Each GPU sends D/N bytes to its right neighbor and receives D/N from its left, Nโ1 times
Communication per GPU = 2 ยท (Nโ1)/N ยท D โ 2D bytes (for large N)
(The factor of 2 is for send + receive.)
The time for an all-reduce on a ring:
$T_allreduce = 2(Nโ1)/N ยท D / B_interconnect $
where B_interconnect is the interconnect bandwidth (e.g., 600 GB/s for NVLink on A100).
For N = 8, D = 14 GB (7B fp16 params): T โ 2ยท7/8ยท14GB/600GB/s โ 2ยท0.875ยท0.0233 โ 0.041 sec = 41 ms per all-reduce.
This is acceptable โ the forward/backward pass takes much longer (~500msโ1s per micro-batch).
3. ZeRO: Zero Redundancy Optimizer
The key insight: in standard data parallelism, each GPU stores a FULL copy of optimizer states and gradients โ massive redundancy. ZeRO shards these across DP GPUs.
ZeRO Stage 1: Optimizer State Partitioning
Each GPU holds 1/N of the optimizer states (m and v for Adam). After the all-reduce of gradients, each GPU updates only its partition of parameters, then all-gathers the updated parameters.
Memory savings:
$Without ZeRO: params (2 bytes) + gradients (2) + opt_states (8) = 12 bytes/param With ZeRO-1: params (2) + gradients (2) + opt_states (8/N) = 4 + 8/N bytes/param $
For N=8: 4 + 1 = 5 bytes/param โ 2.4ร reduction.
ZeRO Stage 2: Gradient Partitioning
Additionally shard gradients (each GPU only stores gradients for its parameter partition). After backward, gradients are reduced-scattered (each GPU ends up with only its partition).
Memory savings:
$With ZeRO-2: params (2) + gradients (2/N) + opt_states (8/N) = 2 + 10/N bytes/param $
For N=8: 2 + 1.25 = 3.25 bytes/param โ 3.7ร reduction vs no ZeRO.
ZeRO Stage 3: Parameter Partitioning
Additionally shard the parameters themselves. Each GPU only holds 1/N of the parameters at any time. During forward/backward, parameters are all-gathered as needed and discarded immediately after use.
Memory savings:
$With ZeRO-3: params (2/N) + gradients (2/N) + opt_states (8/N) = 12/N bytes/param $
For N=8: 1.5 bytes/param โ 8ร reduction. Now a 70B model (12 bytes/param ร 70B = 840 GB without ZeRO) needs only 105 GB across 8 GPUs โ ~13 GB per GPU.
Communication tradeoff: ZeRO-3 adds all-gather communication for parameters during forward/backward. This roughly matches the original data-parallel all-reduce cost.
| Stage | Per-GPU Memory (bytes/param) | Communication Overhead |
|---|---|---|
| No ZeRO | 12 | 1ร all-reduce |
| ZeRO-1 | 4 + 8/N | 1ร all-reduce + scatter |
| ZeRO-2 | 2 + 10/N | 1ร reduce-scatter + all-gather |
| ZeRO-3 | 12/N | 1ร reduce-scatter + 2ร all-gather |
4. Tensor Parallelism (Megatron-LM Style)
Split individual transformer layers across GPUs. For a weight matrix W โ โ^{dรd}:
Column-parallel (for Q, K, V projections): Split W into [Wโ | Wโ] vertically. GPU 0 has Wโ, GPU 1 has Wโ. Each computes partial output, then the outputs are concatenated.
Row-parallel (for the output projection after attention): Split W into [Wโ; Wโ] horizontally. Each GPU computes with its row partition, then all-reduce sums the partial results.
Communication: Each transformer block requires 2 all-reduces in the forward pass and 2 in backward. For L layers:
$Total communication per step = 4L ยท (comm per all-reduce) $
For a 32-layer transformer with d=4096, each all-reduce is ~2ยทdยฒยท2 bytes (fp16) โ 64 MB. 4L = 128 all-reduces โ ~8.2 GB per step. This is expensive โ tensor parallelism is bandwidth-hungry.
When to use: NVLink-connected GPUs within a single node (high bandwidth, low latency). Tensor parallelism doesn't scale well beyond 8 GPUs due to quadratic communication growth.
5. Pipeline Parallelism
Split layers across GPUs: GPU 0 handles layers 0โ7, GPU 1 handles layers 8โ15, etc. Activations flow forward; gradients flow backward.
The bubble problem: GPUs are idle during pipeline startup and teardown. With K micro-batches in a pipeline of P stages:
$Bubble time = (Pโ1) / (Pโ1 + K) fraction of total time $
Derivation: Total steps to process K micro-batches through P stages: forward takes K+Pโ1 time units, backward takes another K+Pโ1. Total = 2(K+Pโ1). Productive work = 2PK units (P stages ร K micro-batches ร 2 passes). Bubble = total โ productive = 2(K+Pโ1) โ 2PK... wait, the productive work is done when ALL stages are busy. Let me be more precise:
- Forward: stage p processes micro-batch k at time k+pโ1
- Total forward time = K+Pโ1 (from first micro-batch entering stage 0 to last exiting stage Pโ1)
- Each stage does K units of work, so total work = KยทP
- Forward bubble = (K+Pโ1) โ K = Pโ1 "idle" steps distributed across stages
Same for backward. Total bubble = 2(Pโ1) steps per micro-batch cycle.
$Bubble fraction = 2(Pโ1) / 2(K+Pโ1) = (Pโ1)/(K+Pโ1) $
As K โ โ, bubble fraction โ 0. In practice, K is limited by memory (activations for K micro-batches must be stored). Scheduling schemes like 1F1B (one-forward-one-backward) reduce the activation memory peak.
Communication: Only activations and gradients at pipeline boundaries. Much less communication than tensor parallelism.
6. 3D Parallelism: Putting It All Together
Modern LLM training combines all three:
$Total GPUs = N_dp ร N_tp ร N_pp $
where: - N_dp = data-parallel replicas (with ZeRO) - N_tp = tensor-parallel size (within a node, NVLink) - N_pp = pipeline-parallel stages (across nodes)
Example: Llama 70B on 128 GPUs - N_tp = 8 (within 8-GPU NVLink nodes) - N_pp = 4 (4 pipeline stages) - N_dp = 128/(8ร4) = 4
Each pipeline stage = 8 GPUs running tensor parallelism. 4 such groups form the pipeline. 4 replicas of this entire setup for data parallelism.
Memory: Each GPU stores 1/(N_tpยทN_dp) of model + ZeRO sharding. Communication: intra-node NVLink for TP, inter-node InfiniBand for PP/DP.
7. Communication/Computation Overlap
A key optimization: overlap communication with computation. During backward pass, as soon as layer L's gradients are computed, their all-reduce can begin while layers Lโ1, Lโ2, ... continue backward. This hides communication latency.
Theoretical maximum overlap: If T_comm โค T_comp (communication time โค computation time for that layer), communication is completely hidden. For transformer layers, this is typically achievable for data parallelism with NVLink but challenging for slower interconnects.
Worked Examples
Example 1: ZeRO Memory Calculation
Problem: A 13B parameter model in fp16 with Adam optimizer (fp32 states). Compute the per-GPU memory for: (a) No ZeRO, 1 GPU (b) ZeRO-2, 8 GPUs (c) ZeRO-3, 8 GPUs
Solution:
$params (fp16) = 13B ร 2 = 26 GB gradients (fp16) = 13B ร 2 = 26 GB opt_states (fp32, m+v) = 13B ร 4 ร 2 = 104 GB $
(a) No ZeRO, 1 GPU: 26 + 26 + 104 = 156 GB. Cannot fit on A100 80GB!
(b) ZeRO-2, 8 GPUs: - params per GPU: 26 GB (full copy) - gradients per GPU: 26/8 = 3.25 GB - opt_states per GPU: 104/8 = 13 GB - Total: 26 + 3.25 + 13 = 42.25 GB per GPU โ (fits on 80GB)
(c) ZeRO-3, 8 GPUs: - params per GPU: 26/8 = 3.25 GB - gradients per GPU: 26/8 = 3.25 GB - opt_states per GPU: 104/8 = 13 GB - Total: 3.25 + 3.25 + 13 = 19.5 GB per GPU โ (very comfortable)
Example 2: Pipeline Bubble Calculation
Problem: A pipeline with P=8 stages processes K=32 micro-batches. What fraction of time is spent in bubbles?
Solution:
Using the bubble fraction formula:
$Bubble fraction = (Pโ1) / (K+Pโ1) = 7 / (32+8โ1) = 7/39 โ 0.179 = 17.9% $
This means 17.9% of GPU time is idle. To reduce bubble to <5%, solve:
(Pโ1)/(K+Pโ1) < 0.05
7/(K+7) < 0.05
7 < 0.05(K+7)
140 < K+7
K > 133
With K=133 micro-batches, bubble is <5%. But this requires storing 133ร the activation memory โ likely too much. In practice, bubble overhead of 15-25% is accepted.
Example 3: 3D Parallelism Configuration
Problem: You have 256 A100 GPUs (32 nodes ร 8 GPUs, NVLink within node, 200 GB/s InfiniBand between nodes). Model: 175B params. Design a 3D parallel configuration and verify memory fits (each GPU has 80 GB).
Solution:
Tensor parallelism within node: N_tp = 8 (full node) Pipeline parallelism: N_pp = 4 (across 4 nodes) Data parallelism: N_dp = 256/(8ร4) = 8
Memory per GPU with ZeRO-2 (shard optimizer states + gradients across DP dim):
params (fp16): 175B ร 2 / 8(TP) = 43.75 GB per TP group... no, TP shares params across GPUs within the same layer.
Actually for TP: each GPU in a TP group holds a FRACTION of each weight matrix.
For column-parallel: each GPU holds 1/N_tp of the matrix width.
For row-parallel: each GPU holds 1/N_tp of the matrix height.
On average, each TP GPU holds ~1/N_tp of parameters. With PP=4, each pipeline stage gets 1/4 of layers:
With TP=8, each GPU in a TP group stores roughly 1/8 of each layer's parameters. PP=4 means each pipeline stage has 1/4 of the layers. So params per GPU: 350 GB / 8 / 4 = 10.94 GB.
Plus optimizer states (fp32): ZeRO-2 shards across DP=8, so 175B ร 8 bytes / (PP=4 ร TP=8 ร DP=8) = 175B ร 8 / 256 โ 5.47 GB per GPU.
Plus gradients: 175B ร 2 / (4ร8ร8) = 350 / 256 โ 1.37 GB per GPU.
Total per GPU: 10.94 + 5.47 + 1.37 = 17.78 GB + activations (~5-10 GB) โ 25 GB. Fits easily in 80 GB.
This configuration is reasonable for 175B on 256 A100s.
Quiz
Q1: What does the concept of Data parallelism primarily refer to in this subject?
A) A visual representation of Data parallelism B) A computational error related to Data parallelism C) The definition and application of Data parallelism D) A historical anecdote about Data parallelism
Correct: C)
- If you chose A: This is incorrect. Data parallelism is defined as: the definition and application of data parallelism. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Data parallelism is defined as: the definition and application of data parallelism. The other options describe different aspects that are not the primary focus.
- If you chose C: Data parallelism is defined as: the definition and application of data parallelism. The other options describe different aspects that are not the primary focus. Correct!
- If you chose D: This is incorrect. Data parallelism is defined as: the definition and application of data parallelism. The other options describe different aspects that are not the primary focus.
Q2: What is the primary purpose of Tensor parallelism?
A) It is used only in advanced research contexts B) It replaces all other methods in this domain C) It is primarily a historical notation system D) It is used to tensor parallelism in mathematical analysis
Correct: D)
- If you chose A: This is incorrect. Tensor parallelism serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. Tensor parallelism serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. Tensor parallelism serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: Tensor parallelism serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
Q3: Which statement about Pipeline parallelism is TRUE?
A) Pipeline parallelism is not related to this subject B) Pipeline parallelism is a fundamental concept covered in this subject C) Pipeline parallelism is an advanced topic beyond this subject's scope D) Pipeline parallelism is mentioned only as a historical footnote
Correct: B)
- If you chose A: This is incorrect. Pipeline parallelism is a fundamental concept covered in this subject. This subject covers Pipeline parallelism as part of its core content.
- If you chose B: Pipeline parallelism is a fundamental concept covered in this subject. This subject covers Pipeline parallelism as part of its core content. Correct!
- If you chose C: This is incorrect. Pipeline parallelism is a fundamental concept covered in this subject. This subject covers Pipeline parallelism as part of its core content.
- If you chose D: This is incorrect. Pipeline parallelism is a fundamental concept covered in this subject. This subject covers Pipeline parallelism as part of its core content.
Q4: Based on the worked examples in this subject, what is the correct result?
A) An unrelated numerical value B) - N GPUs arranged in a logical ring C) The inverse of the correct answer D) A different result from a common mistake
Correct: B)
- If you chose A: This is incorrect. The worked examples show that the result is - N GPUs arranged in a logical ring. The other options represent common errors.
- If you chose B: The worked examples show that the result is - N GPUs arranged in a logical ring. The other options represent common errors. Correct!
- If you chose C: This is incorrect. The worked examples show that the result is - N GPUs arranged in a logical ring. The other options represent common errors.
- If you chose D: This is incorrect. The worked examples show that the result is - N GPUs arranged in a logical ring. The other options represent common errors.
Q5: How are Pipeline parallelism and Ring all-reduce related?
A) Pipeline parallelism and Ring all-reduce are completely unrelated topics B) Pipeline parallelism and Ring all-reduce are closely related concepts C) Pipeline parallelism is the inverse of Ring all-reduce D) Pipeline parallelism is a special case of Ring all-reduce
Correct: B)
- If you chose A: This is incorrect. Both Pipeline parallelism and Ring all-reduce are covered in this subject as interconnected topics.
- If you chose B: Both Pipeline parallelism and Ring all-reduce are covered in this subject as interconnected topics. Correct!
- If you chose C: This is incorrect. Both Pipeline parallelism and Ring all-reduce are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Pipeline parallelism and Ring all-reduce are covered in this subject as interconnected topics.
Q6: What is a common pitfall when working with Bubble fraction?
A) The main error with Bubble fraction is using it when it is not needed B) Bubble fraction has no common misconceptions C) Bubble fraction is always computed the same way in all contexts D) A common mistake is confusing Bubble fraction with a similar concept
Correct: D)
- If you chose A: This is incorrect. Students often confuse Bubble fraction with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse Bubble fraction with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse Bubble fraction with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: Students often confuse Bubble fraction with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
Q7: When should you apply NVLink?
A) Avoid NVLink unless explicitly instructed B) Apply NVLink to solve problems in this subject's domain C) NVLink is not practically useful D) Use NVLink only in pure mathematics contexts
Correct: B)
- If you chose A: This is incorrect. NVLink is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: NVLink is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose C: This is incorrect. NVLink is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: This is incorrect. NVLink is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
Problem 1
A ring all-reduce of D=200 MB among N=16 GPUs with interconnect bandwidth 100 GB/s. Compute the time.
Answer
T = 2(Nโ1)/N ยท D / B = 2ยท15/16 ยท 200MB / 100GB/s = 1.875 ยท 2ร10^{-3} s = 3.75 ms. Very fast โ all-reduce is efficient for moderately-sized data on fast interconnects.Problem 2
Derive why ZeRO-3 with N GPUs gives exactly 12/N bytes per parameter memory (for fp16 params, fp32 Adam).
Answer
Without ZeRO: fp16 params (2 bytes) + fp16 gradients (2 bytes) + fp32 m (4 bytes) + fp32 v (4 bytes) = 12 bytes/param. With ZeRO-3, each of these 12 bytes is sharded across N GPUs: 12/N bytes/param per GPU. This is the theoretical lower bound for data-parallel training โ no redundancy remains.Problem 3
You train with TP=4, PP=2. The forward pass of one micro-batch takes 100ms, backward takes 200ms. All-reduce for TP takes 10ms per occurrence (2 in forward, 2 in backward). PP activation transfer takes 5ms per boundary. Compute the step time for K=8 micro-batches with 1F1B scheduling.
Answer
Per-layer time in a single GPU (accounting for TP communication): - Forward: 100ms (compute) + 2ร10ms (TP all-reduce) = 120ms - Backward: 200ms + 2ร10ms = 220ms - Total per micro-batch per pipeline stage: 340ms With 1F1B scheduling across P=2 stages and K=8 micro-batches: - Total time โ (K + P โ 1) ร max(T_fwd, T_bwd) = (8+1) ร 220ms = 1980ms โ 2 sec Plus PP activation transfers: roughly 2K ร 5ms = 80ms. Total โ 2.06 sec. Effective per-micro-batch time = 2.06/8 = 0.258 sec. Compare to single GPU: 0.34 sec. Speedup = 0.34/0.258 โ 1.32ร from PP=2 (bubble overhead limits efficiency).Problem 4
Show that for K โซ P, the pipeline bubble fraction approaches 0. What limits K in practice?
Answer
Bubble fraction = (Pโ1)/(K+Pโ1). As K โ โ, this โ 0. The limit: activation memory scales as O(K) in the naive schedule โ each micro-batch's activations are stored until the corresponding backward pass. For a 175B model with 100 micro-batches, this could require terabytes of memory. 1F1B scheduling reduces the peak to O(P) by scheduling backward passes sooner, at the cost of slightly more complex orchestration.Problem 5
Compute the total bytes communicated per step for: (a) pure data parallelism (N_dp=64, D=20GB gradients), (b) ZeRO-3 with N_dp=64 (all-gathers for params + reduce-scatter for grads), and (c) tensor parallelism (N_tp=8, L=96 layers, d=8192). Which is most communication-intensive?
Answer
(a) DP all-reduce: 2ยท(Nโ1)/NยทD โ 2ยท63/64ยท20 = 39.4 GB per GPU (sent+received). (b) ZeRO-3: params all-gathered during fwd+bwd (2ร all-gather of 20/N=0.3125 GB each โ ~1.25 GB) + gradient reduce-scatter (~0.625 GB) โ 1.9 GB per GPU. Much less than DP! (c) TP: per layer, 4 all-reduces (2 fwd + 2 bwd). Each all-reduce is 2ยท(N_tpโ1)/N_tp ยท dยฒยท2 bytes โ 2ยท7/8ยท8192ยฒยท2 โ 2ยท0.875ยท134M โ 235 MB per layer. For L=96: 96ยท235MBยท4 โ 90.2 GB per GPU per step. Tensor parallelism has BY FAR the most communication โ this is why it's only used within NVLink-connected nodes.Summary
- Data parallelism splits the batch across GPUs; each has a full model copy; gradients are all-reduced โ communication cost is ~2D per GPU
- ZeRO shards optimizer states (Stage 1), gradients (Stage 2), and parameters (Stage 3) across DP GPUs, reducing memory by up to 12/N bytes/param
- Tensor parallelism splits individual layers across GPUs โ communication-intensive but enables very large models within a single node
- Pipeline parallelism splits layers across GPUs โ introduces a bubble of (Pโ1)/(K+Pโ1) idle time, reduced by larger K or 1F1B scheduling
- 3D parallelism combines all three: TP within nodes (NVLink), PP across nodes, DP for scale โ this is how 405B+ parameter models are trained
Pitfalls
- Using tensor parallelism across slow interconnects. TP requires 4 all-reduces per transformer layer, totaling ~90 GB of communication per step for a typical large model. This is only viable over NVLink (~600 GB/s). Placing TP across InfiniBand-connected nodes (200 GB/s or slower) makes communication the bottleneck, often exceeding computation time. TP belongs within a node; use PP or DP across nodes.
- Forgetting that pipeline parallelism introduces a bubble. Naive pipeline scheduling wastes (Pโ1)/(K+Pโ1) of GPU time to idle bubbles. With P=8 stages and K=32 micro-batches, that's 17.9% idle time. The bubble asymptotically approaches 0 as K โ โ, but K is limited by activation memory. Use 1F1B scheduling to reduce peak activation memory from O(K) to O(P), enabling larger K and lower bubble overhead.
- Assuming ZeRO-3 is always better than ZeRO-2. ZeRO-3 shards parameters, adding all-gather communication during both forward and backward passes. For smaller models (e.g., <13B parameters) where ZeRO-2 already fits comfortably in GPU memory, ZeRO-3's extra communication overhead can reduce throughput without providing a meaningful memory advantage. Choose the ZeRO stage based on actual memory constraints.
- Not overlapping communication with computation. Modern training frameworks overlap gradient all-reduce with the backward pass: layer L's gradients are communicated while layer L-1 computes backward. Misconfiguring the framework (e.g., using blocking instead of async communication) can serialize these, effectively doubling step time. Always verify that gradient communication is properly overlapped.
- Scaling data parallelism without ZeRO. Pure data parallelism stores full optimizer states (12 bytes/param) on every GPU. For a 70B model with 64 GPUs, that's 64 ร 840 GB = 53.8 TB of redundant storage. ZeRO-1 or ZeRO-2 eliminates this redundancy, often enabling 2-4ร larger models or batch sizes on the same hardware. Pure DP without ZeRO is only viable for very small models.
Key Terms
| Term | Definition |
|---|---|
| Data parallelism | Split batch across GPUs; each has full model copy; gradients all-reduced โ scales to hundreds of GPUs |
| Tensor parallelism | Split individual layers across GPUs โ communication-intensive (~90 GB/step), NVLink-only |
| Pipeline parallelism | Split layers across GPUs; bubble = (Pโ1)/(K+Pโ1) idle time; uses 1F1B scheduling to reduce activation memory |
| ZeRO | Zero Redundancy Optimizer โ shards optimizer states (S1), gradients (S2), parameters (S3) across DP GPUs |
| ZeRO-3 | Shards everything: 12/N bytes/param per GPU โ theoretical lower bound for data-parallel training |
| 3D parallelism | Combines TP (intra-node NVLink) + PP (inter-node) + DP (replicas) โ how 405B+ models are trained |
| Ring all-reduce | Communication cost ~2D bytes per GPU for data D โ efficient, scales with N |
| 1F1B scheduling | One-forward-one-backward: interleaves passes to reduce peak activation memory from O(K) to O(P) |
| Bubble fraction | (Pโ1)/(K+Pโ1) โ idle GPU time; asymptotically 0 as K โ โ but limited by activation memory |
| NVLink | ~600 GB/s GPU-to-GPU interconnect within a node โ 3ร faster than InfiniBand |
Next Steps
Continue to 20-05 โ Instruction Tuning (SFT) to learn how pre-trained LLMs are fine-tuned on instruction-response pairs, including the critical mathematical detail of masking prompt tokens during loss computation.