📐 Concept diagram

20-04 — Distributed Training Mathematics

Phase: 20 — Training & Fine-tuning Mathematics Subject: 20-04 Prerequisites: 20-03 (Batch Size and Gradient Accumulation), 14-03 (SGD Variants), 17-05 (Attention Mathematics — memory analysis), 09-04 (Spectral Theorem and Quadratic Forms — for communication analysis), 15-01 (Floating Point Arithmetic — precision tradeoffs) Next subject: 20-05 — Instruction Tuning (SFT)

Learning Objectives

By the end of this subject, you will be able to:

Derive the all-reduce communication pattern and compute its bandwidth cost as a function of parameter count and GPU count
Analyze the three ZeRO stages mathematically — deriving memory savings for optimizer states, gradients, and parameters
Compute the memory footprint of pipeline parallelism with K micro-batches and derive the bubble overhead formula
Compare tensor parallelism (intra-layer) vs pipeline parallelism (inter-layer) vs data parallelism in terms of communication volume and scalability
Design a 3D parallel training configuration given model size, GPU count, and interconnect bandwidth constraints

Core Content

1. Why Distributed Training?

A single GPU is insufficient for modern LLMs:

Model	Parameters	fp16 Size	Adam States (fp32)	Total (~)
Llama 7B	7B	14 GB	56 GB	70 GB
Llama 70B	70B	140 GB	560 GB	700 GB
GPT-3 175B	175B	350 GB	1.4 TB	1.75 TB
Llama 3 405B	405B	810 GB	3.24 TB	4.05 TB

An A100 80GB can't even hold a 70B model's weights + Adam states. We MUST distribute across GPUs.

The three dimensions of distribution:

Data Parallelism (DP): Split the batch across GPUs; each has a full model copy
Model Parallelism: Split the MODEL across GPUs
Tensor Parallelism (TP): Split individual layers (e.g., each GPU holds part of each weight matrix)
Pipeline Parallelism (PP): Split layers across GPUs (GPU 0 has layers 0-7, GPU 1 has layers 8-15, etc.)
ZeRO (Zero Redundancy Optimizer): Shard optimizer states, gradients, and parameters across DP GPUs

⚠️ THIS IS CRITICAL — Modern LLM training combines ALL of these (3D parallelism). Understanding the mathematics of each allows you to compute optimal GPU allocation.

2. Data Parallelism

Setup: N GPUs, each with a full model copy. Each GPU processes B/N samples per step.

Algorithm:

For each GPU k in parallel:
    1. Forward pass on local micro-batch
    2. Backward pass → local gradients g_k
    3. ALL-REDUCE: g = (1/N) Σ g_k   (average gradients across all GPUs)
    4. Optimizer step: θ ← θ − η·g    (identical step on all GPUs)

All-reduce mathematics: The all-reduce operation computes the sum (or average) of values across all GPUs and distributes the result. The most efficient algorithm is the ring all-reduce:

N GPUs arranged in a logical ring
Data (gradients) of size D bytes
Each GPU sends D/N bytes to its right neighbor and receives D/N from its left, N−1 times

Communication per GPU = 2 · (N−1)/N · D ≈ 2D bytes (for large N)

(The factor of 2 is for send + receive.)

The time for an all-reduce on a ring:

$T_allreduce = 2(N−1)/N · D / B_interconnect
$

where B_interconnect is the interconnect bandwidth (e.g., 600 GB/s for NVLink on A100).

For N = 8, D = 14 GB (7B fp16 params): T ≈ 2·7/8·14GB/600GB/s ≈ 2·0.875·0.0233 ≈ 0.041 sec = 41 ms per all-reduce.

This is acceptable — the forward/backward pass takes much longer (~500ms–1s per micro-batch).

3. ZeRO: Zero Redundancy Optimizer

The key insight: in standard data parallelism, each GPU stores a FULL copy of optimizer states and gradients — massive redundancy. ZeRO shards these across DP GPUs.

ZeRO Stage 1: Optimizer State Partitioning

Each GPU holds 1/N of the optimizer states (m and v for Adam). After the all-reduce of gradients, each GPU updates only its partition of parameters, then all-gathers the updated parameters.

Memory savings:

$Without ZeRO:    params (2 bytes) + gradients (2) + opt_states (8) = 12 bytes/param
With ZeRO-1:    params (2) + gradients (2) + opt_states (8/N) = 4 + 8/N bytes/param
$

For N=8: 4 + 1 = 5 bytes/param → 2.4× reduction.

ZeRO Stage 2: Gradient Partitioning

Additionally shard gradients (each GPU only stores gradients for its parameter partition). After backward, gradients are reduced-scattered (each GPU ends up with only its partition).

Memory savings:

$With ZeRO-2:    params (2) + gradients (2/N) + opt_states (8/N) = 2 + 10/N bytes/param
$

For N=8: 2 + 1.25 = 3.25 bytes/param → 3.7× reduction vs no ZeRO.

ZeRO Stage 3: Parameter Partitioning

Additionally shard the parameters themselves. Each GPU only holds 1/N of the parameters at any time. During forward/backward, parameters are all-gathered as needed and discarded immediately after use.

Memory savings:

$With ZeRO-3:    params (2/N) + gradients (2/N) + opt_states (8/N) = 12/N bytes/param
$

For N=8: 1.5 bytes/param → 8× reduction. Now a 70B model (12 bytes/param × 70B = 840 GB without ZeRO) needs only 105 GB across 8 GPUs — ~13 GB per GPU.

Communication tradeoff: ZeRO-3 adds all-gather communication for parameters during forward/backward. This roughly matches the original data-parallel all-reduce cost.

Stage	Per-GPU Memory (bytes/param)	Communication Overhead
No ZeRO	12	1× all-reduce
ZeRO-1	4 + 8/N	1× all-reduce + scatter
ZeRO-2	2 + 10/N	1× reduce-scatter + all-gather
ZeRO-3	12/N	1× reduce-scatter + 2× all-gather

4. Tensor Parallelism (Megatron-LM Style)

Split individual transformer layers across GPUs. For a weight matrix W ∈ ℝ^{d×d}:

Column-parallel (for Q, K, V projections): Split W into [W₁ | W₂] vertically. GPU 0 has W₁, GPU 1 has W₂. Each computes partial output, then the outputs are concatenated.

Row-parallel (for the output projection after attention): Split W into [W₁; W₂] horizontally. Each GPU computes with its row partition, then all-reduce sums the partial results.

Communication: Each transformer block requires 2 all-reduces in the forward pass and 2 in backward. For L layers:

$Total communication per step = 4L · (comm per all-reduce)
$

For a 32-layer transformer with d=4096, each all-reduce is ~2·d²·2 bytes (fp16) ≈ 64 MB. 4L = 128 all-reduces → ~8.2 GB per step. This is expensive — tensor parallelism is bandwidth-hungry.

When to use: NVLink-connected GPUs within a single node (high bandwidth, low latency). Tensor parallelism doesn't scale well beyond 8 GPUs due to quadratic communication growth.

5. Pipeline Parallelism

Split layers across GPUs: GPU 0 handles layers 0–7, GPU 1 handles layers 8–15, etc. Activations flow forward; gradients flow backward.

The bubble problem: GPUs are idle during pipeline startup and teardown. With K micro-batches in a pipeline of P stages:

$Bubble time = (P−1) / (P−1 + K)  fraction of total time
$

Derivation: Total steps to process K micro-batches through P stages: forward takes K+P−1 time units, backward takes another K+P−1. Total = 2(K+P−1). Productive work = 2PK units (P stages × K micro-batches × 2 passes). Bubble = total − productive = 2(K+P−1) − 2PK... wait, the productive work is done when ALL stages are busy. Let me be more precise:

Forward: stage p processes micro-batch k at time k+p−1
Total forward time = K+P−1 (from first micro-batch entering stage 0 to last exiting stage P−1)
Each stage does K units of work, so total work = K·P
Forward bubble = (K+P−1) − K = P−1 "idle" steps distributed across stages

Same for backward. Total bubble = 2(P−1) steps per micro-batch cycle.

$Bubble fraction = 2(P−1) / 2(K+P−1) = (P−1)/(K+P−1)
$

As K → ∞, bubble fraction → 0. In practice, K is limited by memory (activations for K micro-batches must be stored). Scheduling schemes like 1F1B (one-forward-one-backward) reduce the activation memory peak.

Communication: Only activations and gradients at pipeline boundaries. Much less communication than tensor parallelism.

6. 3D Parallelism: Putting It All Together

Modern LLM training combines all three:

$Total GPUs = N_dp × N_tp × N_pp
$

where: - N_dp = data-parallel replicas (with ZeRO) - N_tp = tensor-parallel size (within a node, NVLink) - N_pp = pipeline-parallel stages (across nodes)

Example: Llama 70B on 128 GPUs - N_tp = 8 (within 8-GPU NVLink nodes) - N_pp = 4 (4 pipeline stages) - N_dp = 128/(8×4) = 4

Each pipeline stage = 8 GPUs running tensor parallelism. 4 such groups form the pipeline. 4 replicas of this entire setup for data parallelism.

Memory: Each GPU stores 1/(N_tp·N_dp) of model + ZeRO sharding. Communication: intra-node NVLink for TP, inter-node InfiniBand for PP/DP.

7. Communication/Computation Overlap

A key optimization: overlap communication with computation. During backward pass, as soon as layer L's gradients are computed, their all-reduce can begin while layers L−1, L−2, ... continue backward. This hides communication latency.

Theoretical maximum overlap: If T_comm ≤ T_comp (communication time ≤ computation time for that layer), communication is completely hidden. For transformer layers, this is typically achievable for data parallelism with NVLink but challenging for slower interconnects.

Worked Examples

Example 1: ZeRO Memory Calculation

Problem: A 13B parameter model in fp16 with Adam optimizer (fp32 states). Compute the per-GPU memory for: (a) No ZeRO, 1 GPU (b) ZeRO-2, 8 GPUs (c) ZeRO-3, 8 GPUs

Solution:

$params (fp16) = 13B × 2 = 26 GB
gradients (fp16) = 13B × 2 = 26 GB
opt_states (fp32, m+v) = 13B × 4 × 2 = 104 GB
$

(a) No ZeRO, 1 GPU: 26 + 26 + 104 = 156 GB. Cannot fit on A100 80GB!

(b) ZeRO-2, 8 GPUs: - params per GPU: 26 GB (full copy) - gradients per GPU: 26/8 = 3.25 GB - opt_states per GPU: 104/8 = 13 GB - Total: 26 + 3.25 + 13 = 42.25 GB per GPU ✓ (fits on 80GB)

(c) ZeRO-3, 8 GPUs: - params per GPU: 26/8 = 3.25 GB - gradients per GPU: 26/8 = 3.25 GB - opt_states per GPU: 104/8 = 13 GB - Total: 3.25 + 3.25 + 13 = 19.5 GB per GPU ✓ (very comfortable)

Example 2: Pipeline Bubble Calculation

Problem: A pipeline with P=8 stages processes K=32 micro-batches. What fraction of time is spent in bubbles?

Solution:

Using the bubble fraction formula:

$Bubble fraction = (P−1) / (K+P−1) = 7 / (32+8−1) = 7/39 ≈ 0.179 = 17.9%
$

This means 17.9% of GPU time is idle. To reduce bubble to <5%, solve:

(P−1)/(K+P−1) < 0.05
7/(K+7) < 0.05
7 < 0.05(K+7)
140 < K+7
K > 133

With K=133 micro-batches, bubble is <5%. But this requires storing 133× the activation memory — likely too much. In practice, bubble overhead of 15-25% is accepted.

Example 3: 3D Parallelism Configuration

Problem: You have 256 A100 GPUs (32 nodes × 8 GPUs, NVLink within node, 200 GB/s InfiniBand between nodes). Model: 175B params. Design a 3D parallel configuration and verify memory fits (each GPU has 80 GB).

Solution:

Tensor parallelism within node: N_tp = 8 (full node) Pipeline parallelism: N_pp = 4 (across 4 nodes) Data parallelism: N_dp = 256/(8×4) = 8

Memory per GPU with ZeRO-2 (shard optimizer states + gradients across DP dim):

params (fp16): 175B × 2 / 8(TP) = 43.75 GB per TP group... no, TP shares params across GPUs within the same layer.

Actually for TP: each GPU in a TP group holds a FRACTION of each weight matrix.
For column-parallel: each GPU holds 1/N_tp of the matrix width.
For row-parallel: each GPU holds 1/N_tp of the matrix height.

On average, each TP GPU holds ~1/N_tp of parameters. With PP=4, each pipeline stage gets 1/4 of layers:

With TP=8, each GPU in a TP group stores roughly 1/8 of each layer's parameters. PP=4 means each pipeline stage has 1/4 of the layers. So params per GPU: 350 GB / 8 / 4 = 10.94 GB.

Plus optimizer states (fp32): ZeRO-2 shards across DP=8, so 175B × 8 bytes / (PP=4 × TP=8 × DP=8) = 175B × 8 / 256 ≈ 5.47 GB per GPU.

Plus gradients: 175B × 2 / (4×8×8) = 350 / 256 ≈ 1.37 GB per GPU.

Total per GPU: 10.94 + 5.47 + 1.37 = 17.78 GB + activations (~5-10 GB) ≈ 25 GB. Fits easily in 80 GB.

This configuration is reasonable for 175B on 256 A100s.

Quiz

Q1: What does the concept of Data parallelism primarily refer to in this subject?

A) A visual representation of Data parallelism B) A computational error related to Data parallelism C) The definition and application of Data parallelism D) A historical anecdote about Data parallelism

Correct: C)

If you chose A: This is incorrect. Data parallelism is defined as: the definition and application of data parallelism. The other options describe different aspects that are not the primary focus.
If you chose B: This is incorrect. Data parallelism is defined as: the definition and application of data parallelism. The other options describe different aspects that are not the primary focus.
If you chose C: Data parallelism is defined as: the definition and application of data parallelism. The other options describe different aspects that are not the primary focus. Correct!
If you chose D: This is incorrect. Data parallelism is defined as: the definition and application of data parallelism. The other options describe different aspects that are not the primary focus.

Q2: What is the primary purpose of Tensor parallelism?

A) It is used only in advanced research contexts B) It replaces all other methods in this domain C) It is primarily a historical notation system D) It is used to tensor parallelism in mathematical analysis

Correct: D)

If you chose A: This is incorrect. Tensor parallelism serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose B: This is incorrect. Tensor parallelism serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose C: This is incorrect. Tensor parallelism serves the purpose described in the correct answer. The other options misrepresent its role.
If you chose D: Tensor parallelism serves the purpose described in the correct answer. The other options misrepresent its role. Correct!

Q3: Which statement about Pipeline parallelism is TRUE?

A) Pipeline parallelism is not related to this subject B) Pipeline parallelism is a fundamental concept covered in this subject C) Pipeline parallelism is an advanced topic beyond this subject's scope D) Pipeline parallelism is mentioned only as a historical footnote

Correct: B)

If you chose A: This is incorrect. Pipeline parallelism is a fundamental concept covered in this subject. This subject covers Pipeline parallelism as part of its core content.
If you chose B: Pipeline parallelism is a fundamental concept covered in this subject. This subject covers Pipeline parallelism as part of its core content. Correct!
If you chose C: This is incorrect. Pipeline parallelism is a fundamental concept covered in this subject. This subject covers Pipeline parallelism as part of its core content.
If you chose D: This is incorrect. Pipeline parallelism is a fundamental concept covered in this subject. This subject covers Pipeline parallelism as part of its core content.

Q4: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) - N GPUs arranged in a logical ring C) The inverse of the correct answer D) A different result from a common mistake

Correct: B)

If you chose A: This is incorrect. The worked examples show that the result is - N GPUs arranged in a logical ring. The other options represent common errors.
If you chose B: The worked examples show that the result is - N GPUs arranged in a logical ring. The other options represent common errors. Correct!
If you chose C: This is incorrect. The worked examples show that the result is - N GPUs arranged in a logical ring. The other options represent common errors.
If you chose D: This is incorrect. The worked examples show that the result is - N GPUs arranged in a logical ring. The other options represent common errors.

Q5: How are Pipeline parallelism and Ring all-reduce related?

A) Pipeline parallelism and Ring all-reduce are completely unrelated topics B) Pipeline parallelism and Ring all-reduce are closely related concepts C) Pipeline parallelism is the inverse of Ring all-reduce D) Pipeline parallelism is a special case of Ring all-reduce

Correct: B)

If you chose A: This is incorrect. Both Pipeline parallelism and Ring all-reduce are covered in this subject as interconnected topics.
If you chose B: Both Pipeline parallelism and Ring all-reduce are covered in this subject as interconnected topics. Correct!
If you chose C: This is incorrect. Both Pipeline parallelism and Ring all-reduce are covered in this subject as interconnected topics.
If you chose D: This is incorrect. Both Pipeline parallelism and Ring all-reduce are covered in this subject as interconnected topics.

Q6: What is a common pitfall when working with Bubble fraction?

A) The main error with Bubble fraction is using it when it is not needed B) Bubble fraction has no common misconceptions C) Bubble fraction is always computed the same way in all contexts D) A common mistake is confusing Bubble fraction with a similar concept

Correct: D)

If you chose A: This is incorrect. Students often confuse Bubble fraction with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose B: This is incorrect. Students often confuse Bubble fraction with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose C: This is incorrect. Students often confuse Bubble fraction with similar-sounding or related concepts. Pay attention to the precise definitions.
If you chose D: Students often confuse Bubble fraction with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!

Q7: When should you apply NVLink?

A) Avoid NVLink unless explicitly instructed B) Apply NVLink to solve problems in this subject's domain C) NVLink is not practically useful D) Use NVLink only in pure mathematics contexts

Correct: B)

If you chose A: This is incorrect. NVLink is a practical tool used throughout this subject to solve relevant problems.
If you chose B: NVLink is a practical tool used throughout this subject to solve relevant problems. Correct!
If you chose C: This is incorrect. NVLink is a practical tool used throughout this subject to solve relevant problems.
If you chose D: This is incorrect. NVLink is a practical tool used throughout this subject to solve relevant problems.

Practice Problems

Problem 1

A ring all-reduce of D=200 MB among N=16 GPUs with interconnect bandwidth 100 GB/s. Compute the time.

Answer

T = 2(N−1)/N · D / B = 2·15/16 · 200MB / 100GB/s = 1.875 · 2×10^{-3} s = 3.75 ms. Very fast — all-reduce is efficient for moderately-sized data on fast interconnects.

Problem 2

Derive why ZeRO-3 with N GPUs gives exactly 12/N bytes per parameter memory (for fp16 params, fp32 Adam).

Answer

Without ZeRO: fp16 params (2 bytes) + fp16 gradients (2 bytes) + fp32 m (4 bytes) + fp32 v (4 bytes) = 12 bytes/param. With ZeRO-3, each of these 12 bytes is sharded across N GPUs: 12/N bytes/param per GPU. This is the theoretical lower bound for data-parallel training — no redundancy remains.

Problem 3

You train with TP=4, PP=2. The forward pass of one micro-batch takes 100ms, backward takes 200ms. All-reduce for TP takes 10ms per occurrence (2 in forward, 2 in backward). PP activation transfer takes 5ms per boundary. Compute the step time for K=8 micro-batches with 1F1B scheduling.

Answer

Per-layer time in a single GPU (accounting for TP communication): - Forward: 100ms (compute) + 2×10ms (TP all-reduce) = 120ms - Backward: 200ms + 2×10ms = 220ms - Total per micro-batch per pipeline stage: 340ms With 1F1B scheduling across P=2 stages and K=8 micro-batches: - Total time ≈ (K + P − 1) × max(T_fwd, T_bwd) = (8+1) × 220ms = 1980ms ≈ 2 sec Plus PP activation transfers: roughly 2K × 5ms = 80ms. Total ≈ 2.06 sec. Effective per-micro-batch time = 2.06/8 = 0.258 sec. Compare to single GPU: 0.34 sec. Speedup = 0.34/0.258 ≈ 1.32× from PP=2 (bubble overhead limits efficiency).

Problem 4

Show that for K ≫ P, the pipeline bubble fraction approaches 0. What limits K in practice?

Answer

Bubble fraction = (P−1)/(K+P−1). As K → ∞, this → 0. The limit: activation memory scales as O(K) in the naive schedule — each micro-batch's activations are stored until the corresponding backward pass. For a 175B model with 100 micro-batches, this could require terabytes of memory. 1F1B scheduling reduces the peak to O(P) by scheduling backward passes sooner, at the cost of slightly more complex orchestration.

Problem 5

Compute the total bytes communicated per step for: (a) pure data parallelism (N_dp=64, D=20GB gradients), (b) ZeRO-3 with N_dp=64 (all-gathers for params + reduce-scatter for grads), and (c) tensor parallelism (N_tp=8, L=96 layers, d=8192). Which is most communication-intensive?

Answer

(a) DP all-reduce: 2·(N−1)/N·D ≈ 2·63/64·20 = 39.4 GB per GPU (sent+received). (b) ZeRO-3: params all-gathered during fwd+bwd (2× all-gather of 20/N=0.3125 GB each → ~1.25 GB) + gradient reduce-scatter (~0.625 GB) ≈ 1.9 GB per GPU. Much less than DP! (c) TP: per layer, 4 all-reduces (2 fwd + 2 bwd). Each all-reduce is 2·(N_tp−1)/N_tp · d²·2 bytes ≈ 2·7/8·8192²·2 ≈ 2·0.875·134M ≈ 235 MB per layer. For L=96: 96·235MB·4 ≈ 90.2 GB per GPU per step. Tensor parallelism has BY FAR the most communication — this is why it's only used within NVLink-connected nodes.

Summary

Data parallelism splits the batch across GPUs; each has a full model copy; gradients are all-reduced — communication cost is ~2D per GPU
ZeRO shards optimizer states (Stage 1), gradients (Stage 2), and parameters (Stage 3) across DP GPUs, reducing memory by up to 12/N bytes/param
Tensor parallelism splits individual layers across GPUs — communication-intensive but enables very large models within a single node
Pipeline parallelism splits layers across GPUs — introduces a bubble of (P−1)/(K+P−1) idle time, reduced by larger K or 1F1B scheduling
3D parallelism combines all three: TP within nodes (NVLink), PP across nodes, DP for scale — this is how 405B+ parameter models are trained

Pitfalls

Using tensor parallelism across slow interconnects. TP requires 4 all-reduces per transformer layer, totaling ~90 GB of communication per step for a typical large model. This is only viable over NVLink (~600 GB/s). Placing TP across InfiniBand-connected nodes (200 GB/s or slower) makes communication the bottleneck, often exceeding computation time. TP belongs within a node; use PP or DP across nodes.
Forgetting that pipeline parallelism introduces a bubble. Naive pipeline scheduling wastes (P−1)/(K+P−1) of GPU time to idle bubbles. With P=8 stages and K=32 micro-batches, that's 17.9% idle time. The bubble asymptotically approaches 0 as K → ∞, but K is limited by activation memory. Use 1F1B scheduling to reduce peak activation memory from O(K) to O(P), enabling larger K and lower bubble overhead.
Assuming ZeRO-3 is always better than ZeRO-2. ZeRO-3 shards parameters, adding all-gather communication during both forward and backward passes. For smaller models (e.g., <13B parameters) where ZeRO-2 already fits comfortably in GPU memory, ZeRO-3's extra communication overhead can reduce throughput without providing a meaningful memory advantage. Choose the ZeRO stage based on actual memory constraints.
Not overlapping communication with computation. Modern training frameworks overlap gradient all-reduce with the backward pass: layer L's gradients are communicated while layer L-1 computes backward. Misconfiguring the framework (e.g., using blocking instead of async communication) can serialize these, effectively doubling step time. Always verify that gradient communication is properly overlapped.
Scaling data parallelism without ZeRO. Pure data parallelism stores full optimizer states (12 bytes/param) on every GPU. For a 70B model with 64 GPUs, that's 64 × 840 GB = 53.8 TB of redundant storage. ZeRO-1 or ZeRO-2 eliminates this redundancy, often enabling 2-4× larger models or batch sizes on the same hardware. Pure DP without ZeRO is only viable for very small models.

Key Terms

Term	Definition
Data parallelism	Split batch across GPUs; each has full model copy; gradients all-reduced — scales to hundreds of GPUs
Tensor parallelism	Split individual layers across GPUs — communication-intensive (~90 GB/step), NVLink-only
Pipeline parallelism	Split layers across GPUs; bubble = (P−1)/(K+P−1) idle time; uses 1F1B scheduling to reduce activation memory
ZeRO	Zero Redundancy Optimizer — shards optimizer states (S1), gradients (S2), parameters (S3) across DP GPUs
ZeRO-3	Shards everything: 12/N bytes/param per GPU — theoretical lower bound for data-parallel training
3D parallelism	Combines TP (intra-node NVLink) + PP (inter-node) + DP (replicas) — how 405B+ models are trained
Ring all-reduce	Communication cost ~2D bytes per GPU for data D — efficient, scales with N
1F1B scheduling	One-forward-one-backward: interleaves passes to reduce peak activation memory from O(K) to O(P)
Bubble fraction	(P−1)/(K+P−1) — idle GPU time; asymptotically 0 as K → ∞ but limited by activation memory
NVLink	~600 GB/s GPU-to-GPU interconnect within a node — 3× faster than InfiniBand

Next Steps

Continue to 20-05 — Instruction Tuning (SFT) to learn how pre-trained LLMs are fine-tuned on instruction-response pairs, including the critical mathematical detail of masking prompt tokens during loss computation.

Progress

Phases

20-04 — Distributed Training Mathematics

Learning Objectives

Core Content

1. Why Distributed Training?

2. Data Parallelism

3. ZeRO: Zero Redundancy Optimizer

ZeRO Stage 1: Optimizer State Partitioning

ZeRO Stage 2: Gradient Partitioning

ZeRO Stage 3: Parameter Partitioning

4. Tensor Parallelism (Megatron-LM Style)

5. Pipeline Parallelism

6. 3D Parallelism: Putting It All Together

7. Communication/Computation Overlap

Worked Examples

Example 1: ZeRO Memory Calculation

Example 2: Pipeline Bubble Calculation

Example 3: 3D Parallelism Configuration

Quiz

Practice Problems

Problem 1

Problem 2

Problem 3

Problem 4

Problem 5

Summary

Pitfalls

Key Terms

Next Steps