Math graphic
📐 Concept diagram

20-03 — Batch Size and Gradient Accumulation

Phase: 20 — Training & Fine-tuning Mathematics Subject: 20-03 Prerequisites: 20-01 (Learning Rate Schedules), 14-02 (Gradient Descent), 14-03 (SGD Variants), 15-04 (Backpropagation Algorithm), 10-06 (Expectation of Discrete RVs — variance properties), 17-05 (Attention Mathematics — memory analysis) Next subject: 20-04 — Distributed Training Mathematics


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the gradient accumulation algorithm and prove its mathematical equivalence to large-batch training for models without batch-dependent operations
  2. Compute the gradient noise scale B_noise = tr(Σ)/G^T G and explain its role in determining the critical batch size
  3. Derive the linear scaling rule (η ∝ B) from the perspective of gradient variance and step equivalence
  4. Analyze the tradeoff space of batch size: sample efficiency vs computational efficiency vs generalization
  5. Calculate the memory savings of gradient accumulation and determine the maximum effective batch size given GPU memory constraints

Core Content

1. The Problem: GPU Memory vs Batch Size

Modern LLMs are enormous. A 7B parameter model in fp16 occupies ~14 GB just for parameters. Training adds: - Optimizer states (Adam: 2× model size in fp32 = 8 bytes/param for m and v) - Gradients (2 bytes/param in fp16 or 4 in fp32) - Activations (depends on batch size, sequence length, hidden dimension)

A single A100 80GB GPU can barely fit a 7B model with batch size 1. But optimal training often requires effective batch sizes of 4M+ tokens. How?

Answer: Gradient accumulation — run multiple forward/backward passes with micro-batches, ACCUMULATE the gradients, and only update weights after K accumulation steps.


2. Gradient Accumulation: Algorithm and Proof

Algorithm:

For each accumulation step k = 1, 2, ..., K:
    1. Sample micro-batch B_k of size B_micro
    2. Forward pass: compute loss L(θ; B_k)
    3. Backward pass: compute ∇L(θ; B_k) and ACCUMULATE (sum, don't zero)
    4. (Optional) scale gradients by 1/K now, or scale LR later

After K steps:
    5. Optimizer step: θ ← θ − η · (Σ ∇L_k)   [or scaled appropriately]
    6. Zero gradients

Proof of equivalence to large-batch training (for SGD):

The loss for a batch B of size B_total is:

$L(θ; B) = (1/B_total) Σ_{x∈B} ℓ(θ; x)
$

(The 1/B_total factor is typically the MEAN loss.)

The gradient is:

$g_B = (1/B_total) Σ_{x∈B} ∇ℓ(θ; x)
$

Now split B into K micro-batches B₁, ..., B_K each of size B_micro = B_total/K. For each micro-batch:

$g_k = (1/B_micro) Σ_{x∈B_k} ∇ℓ(θ; x)
$

If we SUM these micro-batch gradients (without the 1/B_micro factor, i.e., we sum the raw per-sample gradients), we get the total gradient:

$Σ_{k=1}^K g_k · B_micro = Σ_{k=1}^K Σ_{x∈B_k} ∇ℓ(θ; x) = Σ_{x∈B} ∇ℓ(θ; x)
$

Now divide by B_total to get the mean gradient:

$g_effective = (1/B_total) Σ_{x∈B} ∇ℓ(θ; x)
$

This is EXACTLY the gradient you'd get from a single forward/backward pass on the full batch B. So gradient accumulation with K micro-batches of size B_micro = B_total/K is mathematically identical to training with batch size B_total. ✓

For optimizers with momentum (Adam, SGD+Momentum): The equivalence holds because the optimizer state update depends only on the ACCUMULATED gradient, not on intermediate steps. As long as the optimizer step happens AFTER all K accumulation steps, the result is identical to a single large-batch step.

Exception: BatchNorm and similar operations that compute statistics WITHIN a batch. For these, the micro-batch statistics differ from full-batch statistics. In practice, LLMs use LayerNorm (per-sample, not per-batch), so this issue is sidestepped entirely.

⚠️ THIS IS CRITICAL — Gradient accumulation is how virtually all LLMs achieve large effective batch sizes. A model that fits batch size 4 on one GPU can simulate batch size 1024, 2048, or larger by accumulating steps. Combined with distributed training, this scales to millions of tokens per step.


3. The Linear Scaling Rule: Derivation

When we increase effective batch size by factor k, how should we adjust the learning rate?

Gradient variance argument: For per-sample gradients with mean G and covariance Σ:

$g_B = (1/B) Σ_{i=1}^B g_i      where E[g_i] = G, Cov(g_i) = Σ
E[g_B] = G
Cov(g_B) = Σ / B
$

The signal-to-noise ratio scales as:

$SNR = ||G||² / (tr(Σ)/B) = B · ||G||² / tr(Σ)  ∝ B
$

Larger batch = less noise = more reliable gradient direction. With a B× more reliable gradient, we can take a B× larger step and expect the same "safety" as taking B small noisy steps.

Step equivalence argument: Consider two strategies over the same data:

Strategy A: K SGD steps with batch size B and learning rate η:

$E[Δθ_A] = −η · Σ_{k=1}^K E[g_B^{(k)}] = −η·K·G
$

Strategy B: 1 step with batch size K·B and learning rate η':

$E[Δθ_B] = −η' · E[g_{K·B}] = −η'·G
$

Setting them equal: η' = η·K. So η scales linearly with batch size.

The rule:

$η_new = η_old · B_new / B_old
$

Caveat (critical batch size): This rule has an upper limit. See Section 5.


4. Gradient Noise Scale and Critical Batch Size

The linear scaling rule works up to a point. Beyond the critical batch size B_crit, further increasing batch size provides diminishing returns — you can't keep scaling η proportionally without causing instability.

Definition (McCandlish et al., 2018): The gradient noise scale:

$B_noise = tr(Σ) / G^T G
$

where: - Σ = Cov(g_i) — the per-sample gradient covariance matrix - G = E[g_i] — the true gradient (expected over the data distribution)

Interpretation: B_noise is the batch size at which the gradient variance equals the squared signal strength. When B ≪ B_noise, adding more data reduces noise meaningfully. When B ≫ B_noise, the gradient is already "signal-dominated" and further noise reduction is marginal.

Critical batch size: B_crit ≈ B_noise. Training with B > B_crit wastes computation — you'd learn faster by training more steps with B ≈ B_crit than fewer steps with B ≫ B_crit.

Computing B_noise in practice: Since we can't compute Σ and G exactly, we estimate from training:

B_noise ≈ B · (tr(Σ̂_B) / ||g_B||²}

where Σ̂_B is the empirical covariance of micro-batch gradients across accumulation steps.

Empirical values: For language modeling, B_noise is typically in the range of 100K–1M tokens. For a model with sequence length 2048, that's roughly 50–500 sequences.


5. Memory Analysis of Gradient Accumulation

Why does gradient accumulation save memory? Because activations are NOT accumulated — they're freed after each micro-batch's backward pass.

Memory components per micro-batch (batch size B_micro): - Activations: O(L · B_micro · S · d) where S = sequence length - Gradients (accumulated): O(d_model²) — stored in fp32

Memory comparison:

Strategy Peak Activation Memory Gradient Memory Total
Full batch B O(L·B·S·d) O(d²) HUGE
Accumulate K×B_micro O(L·B_micro·S·d) O(d²) ~1/K of full batch

For B=1024, B_micro=4, K=256: peak activation memory is 256× smaller.

Maximum effective batch size given memory M:

$B_effective_max = K_max · B_micro
$

where K_max is limited by throughput (more steps = more wall-clock time), not memory.


6. Tradeoffs: Small vs Large Batch Size

Small Batch (e.g., B = 32–256)

Pros: - Lower memory per step - More "noisy" gradient → can escape sharp minima → often better generalization - More optimizer steps per epoch → faster progress in terms of parameter updates

Cons: - Lower GPU utilization (many small matrix multiplies) - High variance in gradient → slower convergence per unit of data processed - Poor scaling with distributed training (communication overhead per step)

Large Batch (e.g., B = 1024–4096, or effective B = 2M–8M tokens)

Pros: - High GPU utilization - Better scaling with distributed training (amortizes communication) - Smoother training curves - Can use larger learning rates (linear scaling)

Cons: - Higher memory (mitigated by gradient accumulation) - May converge to sharper minima → worse generalization (the "generalization gap") - Fewer optimizer steps per epoch → potentially slower convergence

The sharp vs flat minima hypothesis (Keskar et al., 2017): Large-batch methods tend to converge to sharp minima — narrow basins in the loss landscape. Small-batch methods, through their noisy exploration, tend to find flatter minima that generalize better.


7. LLM Training Configurations

Real-world configurations from published models:

Model Batch Size (tokens) Sequence Length Sequences/Batch LR
GPT-3 175B 3.2M 2048 ~1560 6e-5 → 1.2e-4
Llama 7B 4M 2048 ~2000 3e-4
Llama 2 70B 4M 4096 ~1000 1.5e-4
Chinchilla 70B 1.5M–2M 2048 ~750–1000 1e-4
Pythia 12B 2M 2048 ~1000 1.2e-4

The trend: effective batch sizes of 1–4M tokens, achieved with micro-batch sizes of 4–16 sequences per GPU and gradient accumulation across many steps.


8. Gradient Accumulation and Distributed Training

In distributed data-parallel training with N GPUs, each GPU processes B_micro sequences. After backward(), gradients are all-reduced across GPUs. The total effective batch is:

$B_effective = N · B_micro_per_gpu · K_accumulation_steps
$

Gradient accumulation is typically done LOCALLY (on each GPU, before all-reduce) to amortize communication. You accumulate for K steps, then all-reduce once, then optimizer step. This reduces communication by factor K.


Worked Examples

Example 1: Computing Effective Batch Size

Problem: You have 8 GPUs, each fitting batch size 2 (sequences). You use 128 gradient accumulation steps. Sequence length is 2048 tokens. Compute the effective batch size in sequences and tokens.

Solution:

$B_effective_sequences = 8 · 2 · 128 = 2048 sequences
B_effective_tokens = 2048 · 2048 = 4,194,304 tokens ≈ 4.19M tokens
$

This is in the typical LLM training range (2–4M tokens).


Example 2: Memory-Limited Batch Size

Problem: An A100 80GB GPU has 80GB. The model has 7B parameters (fp16: 14GB). Adam optimizer states in fp32: 7B × 8 bytes = 56GB. Gradients in fp32: 7B × 4 bytes = 28GB. Activations at batch size 1, sequence length 2048: ~2GB. How much memory is left for batch scaling? If you want effective batch size 4M tokens (sequence length 2048), how many accumulation steps are needed per GPU if the per-GPU micro-batch is 1?

Solution:

Memory breakdown: - Parameters: 14 GB - Optimizer states: 56 GB - Gradients: 28 GB - Activations (B=1): 2 GB - TOTAL: 100 GB

That EXCEEDS 80GB. Something must give. Options: 1. Use AdamW with fp32 master weights only (not fp32 m,v for all params)? No, m,v are always fp32. 2. Use activation checkpointing — recompute activations instead of storing them: saves ~1.5 GB. 3. Use ZeRO stage 1 or 2 (shard optimizer states across GPUs — see 20-04).

With ZeRO-1 (shard optimizer states across 8 GPUs): each GPU stores 56/8 = 7GB. Total: 14 + 7 + 28 + 2 = 51GB. Now we have 29GB free.

Per-GPU batch size 1: B_effective = 8 GPUs × 1 × K. For 4M tokens = ~1953 sequences:

$K = 1953 / 8 ≈ 244 accumulation steps
$

This is achievable. Gradient accumulation makes it possible to train 7B models on 8×A100 80GB.


Example 3: Estimating Gradient Noise Scale

Problem: During training, you measure the variance of micro-batch gradients (size B=16) as tr(Σ̂_16) = 0.04 and the squared norm of the mean gradient as ||g₁₆||² = 0.01. Estimate B_noise.

Solution:

For micro-batch gradients of size B:

Cov(g_B) = Σ / B
tr(Cov(g_B)) = tr(Σ) / B

So tr(Σ) = B · tr(Cov(g_B)) = 16 · 0.04 = 0.64

Also, E[||g_B||²] = ||G||² + tr(Σ)/B = ||G||² + 0.04
|||G||² = E[||g_B||²] − 0.04 ≈ 0.01 − 0.04 = −0.03

**Better approach:** Use gradient statistics across many micro-batches to get stable estimates. The raw empirical estimate of ||G||² can be noisy (sometimes even negative due to measurement noise, since Var dominates). In practice, B_noise is estimated from many micro-batches. For this problem, assume the true ||G||² ≈ 0.001 (often much smaller than the variance term):

B_noise = tr(Σ) / ||G||² = 0.64 / 0.001 = 640 ```

So B_crit ≈ 640 samples. Training beyond batch size 640 would see diminishing returns per sample.



Quiz

Q1: What does the concept of Gradient accumulation primarily refer to in this subject?

A) A historical anecdote about Gradient accumulation B) A computational error related to Gradient accumulation C) The definition and application of Gradient accumulation D) A visual representation of Gradient accumulation

Correct: C)

Q2: What is the primary purpose of The linear scaling rule?

A) It is primarily a historical notation system B) It is used to the linear scaling rule in mathematical analysis C) It replaces all other methods in this domain D) It is used only in advanced research contexts

Correct: B)

Q3: Which statement about The critical batch size is TRUE?

A) The critical batch size is mentioned only as a historical footnote B) The critical batch size is an advanced topic beyond this subject's scope C) The critical batch size is not related to this subject D) The critical batch size is a fundamental concept covered in this subject

Correct: D)

Q4: Based on the worked examples in this subject, what is the correct result?

A) A different result from a common mistake B) An unrelated numerical value C) The inverse of the correct answer D) (1/M) Σ_{all samples} ∇ℓ

Correct: D)

Q5: How are The critical batch size and Gradient accumulation saves GPU memory related?

A) The critical batch size and Gradient accumulation saves GPU memory are completely unrelated topics B) The critical batch size and Gradient accumulation saves GPU memory are closely related concepts C) The critical batch size is a special case of Gradient accumulation saves GPU memory D) The critical batch size is the inverse of Gradient accumulation saves GPU memory

Correct: B)

Q6: What is a common pitfall when working with Effective batch sizes?

A) The main error with Effective batch sizes is using it when it is not needed B) A common mistake is confusing Effective batch sizes with a similar concept C) Effective batch sizes has no common misconceptions D) Effective batch sizes is always computed the same way in all contexts

Correct: B)

Q7: When should you apply Linear scaling rule?

A) Apply Linear scaling rule to solve problems in this subject's domain B) Avoid Linear scaling rule unless explicitly instructed C) Use Linear scaling rule only in pure mathematics contexts D) Linear scaling rule is not practically useful

Correct: A)

Practice Problems

Problem 1

You train with B_micro = 4 per GPU, 16 GPUs, 50 accumulation steps, sequence length 4096. Compute the effective batch size in tokens.

Answer B_effective = 16 · 4 · 50 = 3200 sequences B_effective_tokens = 3200 · 4096 = 13,107,200 tokens ≈ 13.1M tokens. This is on the high side but reasonable for very large models.

Problem 2

Prove that for SGD without momentum, K micro-batch steps with gradient accumulation followed by one optimizer step is identical to one step with batch size K·B_micro, assuming the loss is the mean over samples.

Answer Let micro-batch k have samples x_{k,1}, ..., x_{k,M} where M = B_micro. Each micro-batch's raw loss sum (before mean) is S_k = Σ_i ℓ(θ; x_{k,i}). The mean loss is L_k = S_k/M. The gradient is g_k = (1/M) Σ_i ∇ℓ(θ; x_{k,i}) = ∇L_k. If we ACCUMULATE g_k (sum them without additional scaling), after K steps we have: G = Σ_k g_k = Σ_k (1/M) Σ_i ∇ℓ(θ; x_{k,i}) = (1/M) Σ_{all samples} ∇ℓ But this is M× too small — each micro-batch divided by M, but we need to divide by K·M for the full batch. So accumulated gradient = (1/M) · total_raw_grad. To get the mean gradient over all K·M samples: divide by K. So: mean_gradient = (1/K) Σ_k g_k = (1/(K·M)) Σ_{all} ∇ℓ This is exactly the gradient of the mean loss over the full batch of K·M samples. ✓

Problem 3

You increase batch size from 256 to 2048 (8×). Using the linear scaling rule, you set η_new = η_old × 8. Training becomes unstable. Give two mathematical reasons why.

Answer 1. **Above critical batch size:** If B_crit < 2048, the gradient noise isn't reducing meaningfully anymore. Taking an 8× larger step overcompensates — the noise scale hasn't decreased proportionally, so the step is too aggressive. 2. **Lipschitz constraint violation:** The loss landscape has a local Lipschitz constant L. For gradient descent to decrease the loss, we need η < 2/L. If η_old was already near this bound (e.g., η_old = 1/L), then 8×η_old = 8/L ≫ 2/L — guaranteed divergence regardless of batch size. 3. (Bonus) **Adam's implicit normalization interacts non-linearly:** Adam divides by √v̂_t. The linear scaling rule was derived for SGD, not adaptive methods. With Adam, larger batches don't produce proportionally larger v̂_t, so the effective step size increase is different from what the simple rule predicts.

Problem 4

Given per-sample gradients g_i ~ N(0, I_d) (zero-mean case), show that B_noise = d / 0 = ∞. Interpret what this means — can you train with zero-mean gradients?

Answer If E[g_i] = 0 (G = 0), then ||G||² = 0, and B_noise = tr(Σ)/0 → ∞. This means the signal-to-noise ratio is zero — no amount of averaging produces a meaningful direction, because the expected gradient IS zero. You can't train because there's no "correct" direction. In practice, G ≠ 0 during training (if it were, you'd be at a critical point). But G can be very small relative to noise, leading to large B_noise. This is the regime where large batch sizes help the most.

Problem 5

You have 4 GPUs with 40GB each. Model: 13B params. fp16 params = 26GB. Adam states fp32 = 104GB (impossible — need ZeRO). With ZeRO-2 (shard optimizer states + gradients), each GPU stores: params 26GB, opt states 104/4=26GB, gradients 52/4=13GB. Total 65GB > 40GB — still too much. What ZeRO stage is needed? What micro-batch size can you achieve?

Answer We need ZeRO-3 (shard parameters too). Then each GPU stores: - Parameters: 26/4 = 6.5 GB (sharded) - Optimizer states: 104/4 = 26 GB - Gradients: 52/4 = 13 GB - Activations: negligible compared to above - Total: ~45.5 GB > 40GB — still borderline. With activation checkpointing saving ~2-3GB: ~42.5GB. With CPU offloading of optimizer states: frees 26GB, leaving 19.5GB on GPU — batch size 1 easily fits. With ZeRO-3 + CPU offload, B_micro = 1-2 per GPU. Effective batch size with accumulation: whatever K you choose.

Summary

  1. Gradient accumulation achieves large effective batch sizes by summing gradients over K micro-batches before updating — mathematically equivalent to large-batch SGD for models without batch-dependent operations
  2. The linear scaling rule (η ∝ B) arises from gradient variance arguments: larger batches reduce noise by factor B, enabling proportionally larger learning rates
  3. The critical batch size B_noise = tr(Σ)/G^T G marks the point where further batch scaling gives diminishing returns — beyond this, training steps are the bottleneck, not sample efficiency
  4. Gradient accumulation saves GPU memory by freeing activations after each micro-batch while only accumulating the much smaller gradient tensor
  5. Effective batch sizes for LLM training are typically 2–4M tokens, achieved through combinations of per-GPU micro-batches, gradient accumulation, and multi-GPU data parallelism

Pitfalls


Key Terms

Term Definition
Gradient accumulation Summing gradients over K micro-batches before an optimizer step — mathematically equivalent to large-batch training
Linear scaling rule η_new = η_old · B_new/B_old; justified by gradient variance scaling as 1/B
Critical batch size (B_crit) B_noise = tr(Σ)/G^T G; beyond this, further batch scaling yields diminishing returns
Gradient noise scale The batch size at which gradient variance equals squared signal strength
Micro-batch The batch size that fits in GPU memory for one forward/backward pass
Effective batch size B_effective = N_gpus × B_micro × K_accumulation
Signal-to-noise ratio SNR = B ·
LayerNorm vs BatchNorm LayerNorm operates per-sample (accumulation-safe); BatchNorm breaks gradient accumulation equivalence

Next Steps

Continue to 20-04 — Distributed Training Mathematics to learn how training scales across hundreds or thousands of GPUs, including data parallelism, model parallelism, pipeline parallelism, and the ZeRO optimizer.