Math graphic
๐Ÿ“ Concept diagram

20-04 โ€” Distributed Training Mathematics

Phase: 20 โ€” Training & Fine-tuning Mathematics Subject: 20-04 Prerequisites: 20-03 (Batch Size and Gradient Accumulation), 14-03 (SGD Variants), 17-05 (Attention Mathematics โ€” memory analysis), 09-04 (Spectral Theorem and Quadratic Forms โ€” for communication analysis), 15-01 (Floating Point Arithmetic โ€” precision tradeoffs) Next subject: 20-05 โ€” Instruction Tuning (SFT)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the all-reduce communication pattern and compute its bandwidth cost as a function of parameter count and GPU count
  2. Analyze the three ZeRO stages mathematically โ€” deriving memory savings for optimizer states, gradients, and parameters
  3. Compute the memory footprint of pipeline parallelism with K micro-batches and derive the bubble overhead formula
  4. Compare tensor parallelism (intra-layer) vs pipeline parallelism (inter-layer) vs data parallelism in terms of communication volume and scalability
  5. Design a 3D parallel training configuration given model size, GPU count, and interconnect bandwidth constraints

Core Content

1. Why Distributed Training?

A single GPU is insufficient for modern LLMs:

Model Parameters fp16 Size Adam States (fp32) Total (~)
Llama 7B 7B 14 GB 56 GB 70 GB
Llama 70B 70B 140 GB 560 GB 700 GB
GPT-3 175B 175B 350 GB 1.4 TB 1.75 TB
Llama 3 405B 405B 810 GB 3.24 TB 4.05 TB

An A100 80GB can't even hold a 70B model's weights + Adam states. We MUST distribute across GPUs.

The three dimensions of distribution:

  1. Data Parallelism (DP): Split the batch across GPUs; each has a full model copy
  2. Model Parallelism: Split the MODEL across GPUs
  3. Tensor Parallelism (TP): Split individual layers (e.g., each GPU holds part of each weight matrix)
  4. Pipeline Parallelism (PP): Split layers across GPUs (GPU 0 has layers 0-7, GPU 1 has layers 8-15, etc.)
  5. ZeRO (Zero Redundancy Optimizer): Shard optimizer states, gradients, and parameters across DP GPUs

โš ๏ธ THIS IS CRITICAL โ€” Modern LLM training combines ALL of these (3D parallelism). Understanding the mathematics of each allows you to compute optimal GPU allocation.


2. Data Parallelism

Setup: N GPUs, each with a full model copy. Each GPU processes B/N samples per step.

Algorithm:

For each GPU k in parallel:
    1. Forward pass on local micro-batch
    2. Backward pass โ†’ local gradients g_k
    3. ALL-REDUCE: g = (1/N) ฮฃ g_k   (average gradients across all GPUs)
    4. Optimizer step: ฮธ โ† ฮธ โˆ’ ฮทยทg    (identical step on all GPUs)

All-reduce mathematics: The all-reduce operation computes the sum (or average) of values across all GPUs and distributes the result. The most efficient algorithm is the ring all-reduce:

Communication per GPU = 2 ยท (Nโˆ’1)/N ยท D โ‰ˆ 2D bytes (for large N)

(The factor of 2 is for send + receive.)

The time for an all-reduce on a ring:

$T_allreduce = 2(Nโˆ’1)/N ยท D / B_interconnect
$

where B_interconnect is the interconnect bandwidth (e.g., 600 GB/s for NVLink on A100).

For N = 8, D = 14 GB (7B fp16 params): T โ‰ˆ 2ยท7/8ยท14GB/600GB/s โ‰ˆ 2ยท0.875ยท0.0233 โ‰ˆ 0.041 sec = 41 ms per all-reduce.

This is acceptable โ€” the forward/backward pass takes much longer (~500msโ€“1s per micro-batch).


3. ZeRO: Zero Redundancy Optimizer

The key insight: in standard data parallelism, each GPU stores a FULL copy of optimizer states and gradients โ€” massive redundancy. ZeRO shards these across DP GPUs.

ZeRO Stage 1: Optimizer State Partitioning

Each GPU holds 1/N of the optimizer states (m and v for Adam). After the all-reduce of gradients, each GPU updates only its partition of parameters, then all-gathers the updated parameters.

Memory savings:

$Without ZeRO:    params (2 bytes) + gradients (2) + opt_states (8) = 12 bytes/param
With ZeRO-1:    params (2) + gradients (2) + opt_states (8/N) = 4 + 8/N bytes/param
$

For N=8: 4 + 1 = 5 bytes/param โ†’ 2.4ร— reduction.

ZeRO Stage 2: Gradient Partitioning

Additionally shard gradients (each GPU only stores gradients for its parameter partition). After backward, gradients are reduced-scattered (each GPU ends up with only its partition).

Memory savings:

$With ZeRO-2:    params (2) + gradients (2/N) + opt_states (8/N) = 2 + 10/N bytes/param
$

For N=8: 2 + 1.25 = 3.25 bytes/param โ†’ 3.7ร— reduction vs no ZeRO.

ZeRO Stage 3: Parameter Partitioning

Additionally shard the parameters themselves. Each GPU only holds 1/N of the parameters at any time. During forward/backward, parameters are all-gathered as needed and discarded immediately after use.

Memory savings:

$With ZeRO-3:    params (2/N) + gradients (2/N) + opt_states (8/N) = 12/N bytes/param
$

For N=8: 1.5 bytes/param โ†’ 8ร— reduction. Now a 70B model (12 bytes/param ร— 70B = 840 GB without ZeRO) needs only 105 GB across 8 GPUs โ€” ~13 GB per GPU.

Communication tradeoff: ZeRO-3 adds all-gather communication for parameters during forward/backward. This roughly matches the original data-parallel all-reduce cost.

Stage Per-GPU Memory (bytes/param) Communication Overhead
No ZeRO 12 1ร— all-reduce
ZeRO-1 4 + 8/N 1ร— all-reduce + scatter
ZeRO-2 2 + 10/N 1ร— reduce-scatter + all-gather
ZeRO-3 12/N 1ร— reduce-scatter + 2ร— all-gather

4. Tensor Parallelism (Megatron-LM Style)

Split individual transformer layers across GPUs. For a weight matrix W โˆˆ โ„^{dร—d}:

Column-parallel (for Q, K, V projections): Split W into [Wโ‚ | Wโ‚‚] vertically. GPU 0 has Wโ‚, GPU 1 has Wโ‚‚. Each computes partial output, then the outputs are concatenated.

Row-parallel (for the output projection after attention): Split W into [Wโ‚; Wโ‚‚] horizontally. Each GPU computes with its row partition, then all-reduce sums the partial results.

Communication: Each transformer block requires 2 all-reduces in the forward pass and 2 in backward. For L layers:

$Total communication per step = 4L ยท (comm per all-reduce)
$

For a 32-layer transformer with d=4096, each all-reduce is ~2ยทdยฒยท2 bytes (fp16) โ‰ˆ 64 MB. 4L = 128 all-reduces โ†’ ~8.2 GB per step. This is expensive โ€” tensor parallelism is bandwidth-hungry.

When to use: NVLink-connected GPUs within a single node (high bandwidth, low latency). Tensor parallelism doesn't scale well beyond 8 GPUs due to quadratic communication growth.


5. Pipeline Parallelism

Split layers across GPUs: GPU 0 handles layers 0โ€“7, GPU 1 handles layers 8โ€“15, etc. Activations flow forward; gradients flow backward.

The bubble problem: GPUs are idle during pipeline startup and teardown. With K micro-batches in a pipeline of P stages:

$Bubble time = (Pโˆ’1) / (Pโˆ’1 + K)  fraction of total time
$

Derivation: Total steps to process K micro-batches through P stages: forward takes K+Pโˆ’1 time units, backward takes another K+Pโˆ’1. Total = 2(K+Pโˆ’1). Productive work = 2PK units (P stages ร— K micro-batches ร— 2 passes). Bubble = total โˆ’ productive = 2(K+Pโˆ’1) โˆ’ 2PK... wait, the productive work is done when ALL stages are busy. Let me be more precise:

Same for backward. Total bubble = 2(Pโˆ’1) steps per micro-batch cycle.

$Bubble fraction = 2(Pโˆ’1) / 2(K+Pโˆ’1) = (Pโˆ’1)/(K+Pโˆ’1)
$

As K โ†’ โˆž, bubble fraction โ†’ 0. In practice, K is limited by memory (activations for K micro-batches must be stored). Scheduling schemes like 1F1B (one-forward-one-backward) reduce the activation memory peak.

Communication: Only activations and gradients at pipeline boundaries. Much less communication than tensor parallelism.


6. 3D Parallelism: Putting It All Together

Modern LLM training combines all three:

$Total GPUs = N_dp ร— N_tp ร— N_pp
$

where: - N_dp = data-parallel replicas (with ZeRO) - N_tp = tensor-parallel size (within a node, NVLink) - N_pp = pipeline-parallel stages (across nodes)

Example: Llama 70B on 128 GPUs - N_tp = 8 (within 8-GPU NVLink nodes) - N_pp = 4 (4 pipeline stages) - N_dp = 128/(8ร—4) = 4

Each pipeline stage = 8 GPUs running tensor parallelism. 4 such groups form the pipeline. 4 replicas of this entire setup for data parallelism.

Memory: Each GPU stores 1/(N_tpยทN_dp) of model + ZeRO sharding. Communication: intra-node NVLink for TP, inter-node InfiniBand for PP/DP.


7. Communication/Computation Overlap

A key optimization: overlap communication with computation. During backward pass, as soon as layer L's gradients are computed, their all-reduce can begin while layers Lโˆ’1, Lโˆ’2, ... continue backward. This hides communication latency.

Theoretical maximum overlap: If T_comm โ‰ค T_comp (communication time โ‰ค computation time for that layer), communication is completely hidden. For transformer layers, this is typically achievable for data parallelism with NVLink but challenging for slower interconnects.


Worked Examples

Example 1: ZeRO Memory Calculation

Problem: A 13B parameter model in fp16 with Adam optimizer (fp32 states). Compute the per-GPU memory for: (a) No ZeRO, 1 GPU (b) ZeRO-2, 8 GPUs (c) ZeRO-3, 8 GPUs

Solution:

$params (fp16) = 13B ร— 2 = 26 GB
gradients (fp16) = 13B ร— 2 = 26 GB
opt_states (fp32, m+v) = 13B ร— 4 ร— 2 = 104 GB
$

(a) No ZeRO, 1 GPU: 26 + 26 + 104 = 156 GB. Cannot fit on A100 80GB!

(b) ZeRO-2, 8 GPUs: - params per GPU: 26 GB (full copy) - gradients per GPU: 26/8 = 3.25 GB - opt_states per GPU: 104/8 = 13 GB - Total: 26 + 3.25 + 13 = 42.25 GB per GPU โœ“ (fits on 80GB)

(c) ZeRO-3, 8 GPUs: - params per GPU: 26/8 = 3.25 GB - gradients per GPU: 26/8 = 3.25 GB - opt_states per GPU: 104/8 = 13 GB - Total: 3.25 + 3.25 + 13 = 19.5 GB per GPU โœ“ (very comfortable)


Example 2: Pipeline Bubble Calculation

Problem: A pipeline with P=8 stages processes K=32 micro-batches. What fraction of time is spent in bubbles?

Solution:

Using the bubble fraction formula:

$Bubble fraction = (Pโˆ’1) / (K+Pโˆ’1) = 7 / (32+8โˆ’1) = 7/39 โ‰ˆ 0.179 = 17.9%
$

This means 17.9% of GPU time is idle. To reduce bubble to <5%, solve:

(Pโˆ’1)/(K+Pโˆ’1) < 0.05
7/(K+7) < 0.05
7 < 0.05(K+7)
140 < K+7
K > 133

With K=133 micro-batches, bubble is <5%. But this requires storing 133ร— the activation memory โ€” likely too much. In practice, bubble overhead of 15-25% is accepted.


Example 3: 3D Parallelism Configuration

Problem: You have 256 A100 GPUs (32 nodes ร— 8 GPUs, NVLink within node, 200 GB/s InfiniBand between nodes). Model: 175B params. Design a 3D parallel configuration and verify memory fits (each GPU has 80 GB).

Solution:

Tensor parallelism within node: N_tp = 8 (full node) Pipeline parallelism: N_pp = 4 (across 4 nodes) Data parallelism: N_dp = 256/(8ร—4) = 8

Memory per GPU with ZeRO-2 (shard optimizer states + gradients across DP dim):

params (fp16): 175B ร— 2 / 8(TP) = 43.75 GB per TP group... no, TP shares params across GPUs within the same layer.

Actually for TP: each GPU in a TP group holds a FRACTION of each weight matrix.
For column-parallel: each GPU holds 1/N_tp of the matrix width.
For row-parallel: each GPU holds 1/N_tp of the matrix height.

On average, each TP GPU holds ~1/N_tp of parameters. With PP=4, each pipeline stage gets 1/4 of layers:

With TP=8, each GPU in a TP group stores roughly 1/8 of each layer's parameters. PP=4 means each pipeline stage has 1/4 of the layers. So params per GPU: 350 GB / 8 / 4 = 10.94 GB.

Plus optimizer states (fp32): ZeRO-2 shards across DP=8, so 175B ร— 8 bytes / (PP=4 ร— TP=8 ร— DP=8) = 175B ร— 8 / 256 โ‰ˆ 5.47 GB per GPU.

Plus gradients: 175B ร— 2 / (4ร—8ร—8) = 350 / 256 โ‰ˆ 1.37 GB per GPU.

Total per GPU: 10.94 + 5.47 + 1.37 = 17.78 GB + activations (~5-10 GB) โ‰ˆ 25 GB. Fits easily in 80 GB.

This configuration is reasonable for 175B on 256 A100s.


Quiz

Q1: What does the concept of Data parallelism primarily refer to in this subject?

A) A visual representation of Data parallelism B) A computational error related to Data parallelism C) The definition and application of Data parallelism D) A historical anecdote about Data parallelism

Correct: C)

Q2: What is the primary purpose of Tensor parallelism?

A) It is used only in advanced research contexts B) It replaces all other methods in this domain C) It is primarily a historical notation system D) It is used to tensor parallelism in mathematical analysis

Correct: D)

Q3: Which statement about Pipeline parallelism is TRUE?

A) Pipeline parallelism is not related to this subject B) Pipeline parallelism is a fundamental concept covered in this subject C) Pipeline parallelism is an advanced topic beyond this subject's scope D) Pipeline parallelism is mentioned only as a historical footnote

Correct: B)

Q4: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) - N GPUs arranged in a logical ring C) The inverse of the correct answer D) A different result from a common mistake

Correct: B)

Q5: How are Pipeline parallelism and Ring all-reduce related?

A) Pipeline parallelism and Ring all-reduce are completely unrelated topics B) Pipeline parallelism and Ring all-reduce are closely related concepts C) Pipeline parallelism is the inverse of Ring all-reduce D) Pipeline parallelism is a special case of Ring all-reduce

Correct: B)

Q6: What is a common pitfall when working with Bubble fraction?

A) The main error with Bubble fraction is using it when it is not needed B) Bubble fraction has no common misconceptions C) Bubble fraction is always computed the same way in all contexts D) A common mistake is confusing Bubble fraction with a similar concept

Correct: D)

Q7: When should you apply NVLink?

A) Avoid NVLink unless explicitly instructed B) Apply NVLink to solve problems in this subject's domain C) NVLink is not practically useful D) Use NVLink only in pure mathematics contexts

Correct: B)

Practice Problems

Problem 1

A ring all-reduce of D=200 MB among N=16 GPUs with interconnect bandwidth 100 GB/s. Compute the time.

Answer T = 2(Nโˆ’1)/N ยท D / B = 2ยท15/16 ยท 200MB / 100GB/s = 1.875 ยท 2ร—10^{-3} s = 3.75 ms. Very fast โ€” all-reduce is efficient for moderately-sized data on fast interconnects.

Problem 2

Derive why ZeRO-3 with N GPUs gives exactly 12/N bytes per parameter memory (for fp16 params, fp32 Adam).

Answer Without ZeRO: fp16 params (2 bytes) + fp16 gradients (2 bytes) + fp32 m (4 bytes) + fp32 v (4 bytes) = 12 bytes/param. With ZeRO-3, each of these 12 bytes is sharded across N GPUs: 12/N bytes/param per GPU. This is the theoretical lower bound for data-parallel training โ€” no redundancy remains.

Problem 3

You train with TP=4, PP=2. The forward pass of one micro-batch takes 100ms, backward takes 200ms. All-reduce for TP takes 10ms per occurrence (2 in forward, 2 in backward). PP activation transfer takes 5ms per boundary. Compute the step time for K=8 micro-batches with 1F1B scheduling.

Answer Per-layer time in a single GPU (accounting for TP communication): - Forward: 100ms (compute) + 2ร—10ms (TP all-reduce) = 120ms - Backward: 200ms + 2ร—10ms = 220ms - Total per micro-batch per pipeline stage: 340ms With 1F1B scheduling across P=2 stages and K=8 micro-batches: - Total time โ‰ˆ (K + P โˆ’ 1) ร— max(T_fwd, T_bwd) = (8+1) ร— 220ms = 1980ms โ‰ˆ 2 sec Plus PP activation transfers: roughly 2K ร— 5ms = 80ms. Total โ‰ˆ 2.06 sec. Effective per-micro-batch time = 2.06/8 = 0.258 sec. Compare to single GPU: 0.34 sec. Speedup = 0.34/0.258 โ‰ˆ 1.32ร— from PP=2 (bubble overhead limits efficiency).

Problem 4

Show that for K โ‰ซ P, the pipeline bubble fraction approaches 0. What limits K in practice?

Answer Bubble fraction = (Pโˆ’1)/(K+Pโˆ’1). As K โ†’ โˆž, this โ†’ 0. The limit: activation memory scales as O(K) in the naive schedule โ€” each micro-batch's activations are stored until the corresponding backward pass. For a 175B model with 100 micro-batches, this could require terabytes of memory. 1F1B scheduling reduces the peak to O(P) by scheduling backward passes sooner, at the cost of slightly more complex orchestration.

Problem 5

Compute the total bytes communicated per step for: (a) pure data parallelism (N_dp=64, D=20GB gradients), (b) ZeRO-3 with N_dp=64 (all-gathers for params + reduce-scatter for grads), and (c) tensor parallelism (N_tp=8, L=96 layers, d=8192). Which is most communication-intensive?

Answer (a) DP all-reduce: 2ยท(Nโˆ’1)/NยทD โ‰ˆ 2ยท63/64ยท20 = 39.4 GB per GPU (sent+received). (b) ZeRO-3: params all-gathered during fwd+bwd (2ร— all-gather of 20/N=0.3125 GB each โ†’ ~1.25 GB) + gradient reduce-scatter (~0.625 GB) โ‰ˆ 1.9 GB per GPU. Much less than DP! (c) TP: per layer, 4 all-reduces (2 fwd + 2 bwd). Each all-reduce is 2ยท(N_tpโˆ’1)/N_tp ยท dยฒยท2 bytes โ‰ˆ 2ยท7/8ยท8192ยฒยท2 โ‰ˆ 2ยท0.875ยท134M โ‰ˆ 235 MB per layer. For L=96: 96ยท235MBยท4 โ‰ˆ 90.2 GB per GPU per step. Tensor parallelism has BY FAR the most communication โ€” this is why it's only used within NVLink-connected nodes.

Summary

  1. Data parallelism splits the batch across GPUs; each has a full model copy; gradients are all-reduced โ€” communication cost is ~2D per GPU
  2. ZeRO shards optimizer states (Stage 1), gradients (Stage 2), and parameters (Stage 3) across DP GPUs, reducing memory by up to 12/N bytes/param
  3. Tensor parallelism splits individual layers across GPUs โ€” communication-intensive but enables very large models within a single node
  4. Pipeline parallelism splits layers across GPUs โ€” introduces a bubble of (Pโˆ’1)/(K+Pโˆ’1) idle time, reduced by larger K or 1F1B scheduling
  5. 3D parallelism combines all three: TP within nodes (NVLink), PP across nodes, DP for scale โ€” this is how 405B+ parameter models are trained

Pitfalls


Key Terms

Term Definition
Data parallelism Split batch across GPUs; each has full model copy; gradients all-reduced โ€” scales to hundreds of GPUs
Tensor parallelism Split individual layers across GPUs โ€” communication-intensive (~90 GB/step), NVLink-only
Pipeline parallelism Split layers across GPUs; bubble = (Pโˆ’1)/(K+Pโˆ’1) idle time; uses 1F1B scheduling to reduce activation memory
ZeRO Zero Redundancy Optimizer โ€” shards optimizer states (S1), gradients (S2), parameters (S3) across DP GPUs
ZeRO-3 Shards everything: 12/N bytes/param per GPU โ€” theoretical lower bound for data-parallel training
3D parallelism Combines TP (intra-node NVLink) + PP (inter-node) + DP (replicas) โ€” how 405B+ models are trained
Ring all-reduce Communication cost ~2D bytes per GPU for data D โ€” efficient, scales with N
1F1B scheduling One-forward-one-backward: interleaves passes to reduce peak activation memory from O(K) to O(P)
Bubble fraction (Pโˆ’1)/(K+Pโˆ’1) โ€” idle GPU time; asymptotically 0 as K โ†’ โˆž but limited by activation memory
NVLink ~600 GB/s GPU-to-GPU interconnect within a node โ€” 3ร— faster than InfiniBand

Next Steps

Continue to 20-05 โ€” Instruction Tuning (SFT) to learn how pre-trained LLMs are fine-tuned on instruction-response pairs, including the critical mathematical detail of masking prompt tokens during loss computation.