
20-01 — Learning Rate Schedules

Phase: 20 (Training & Fine-tuning Mathematics)
Subject: 20-01
Prerequisites: 14-02 (Gradient Descent), 14-03 (SGD Variants), 14-04 (Adaptive Learning Rate Methods), 04-03 (The Derivative: rate of change intuition), 03-05 (Exponential and Logarithmic Functions)
Next subject: 20-02 (Gradient Clipping)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive and differentiate the five major learning rate schedule functions: exponential decay, step decay, cosine annealing, linear warmup, and one-cycle policy
  2. Explain WHY learning rate schedules are necessary — connecting to optimization landscape theory, saddle points, and sharp minima
  3. Compute the optimal warmup length from the gradient variance argument and derive why warmup prevents divergence
  4. Design and implement the standard LLM training schedule (linear warmup + cosine decay) with justification for each component
  5. Analyze the relationship between batch size and learning rate scaling, including the linear scaling rule and the critical batch size

Core Content

1. Why Learning Rate Schedules Matter

In gradient descent, the update rule is:

$θ_{t+1} = θ_t − η_t · ∇L(θ_t)
$

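where η_t is the learning rate at step t. A constant learning rate presents a fundamental tension: a value large enough to make fast progress early is too large to settle into a minimum later, and a value small enough to settle is needlessly slow early on.

The optimization landscape changes as training progresses. Early on, parameters are far from good solutions and the gradient is large. Later, we're fine-tuning near a minimum where the curvature matters.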

Mathematical intuition: Near a minimum θ*, the loss can be approximated by a quadratic:

$L(θ) ≈ L(θ*) + ½(θ − θ*)^T H (θ − θ*)
$

where H = ∇²L(θ*) is the Hessian. Shifting coordinates so that θ* = 0, the gradient is Hθ and gradient descent on this quadratic becomes:

$θ_{t+1} = θ_t − η H θ_t = (I − ηH) θ_t
$

For stability, we need ||I − ηH|| < 1. With H positive definite, this means |1 − ηλ_i| < 1 for every eigenvalue λ_i of H, which requires η < 2/λ_max where λ_max is the largest eigenvalue. As training progresses and we approach θ*, the local curvature determines the maximum safe learning rate, which is often smaller than what was safe early on.
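The stability bound can be checked numerically. The following is a minimal sketch, assuming a diagonal Hessian so that each coordinate evolves independently; the eigenvalues, step count, and function name are illustrative choices, not values from this subject.

```python
def gd_on_quadratic(eigenvalues, eta, steps=100, theta0=1.0):
    """Run gradient descent on L(θ) = ½ Σ λ_i θ_i² and return the final |θ_i| values."""
    theta = [theta0] * len(eigenvalues)
    for _ in range(steps):
        # The gradient along eigen-direction i is λ_i · θ_i, so each
        # coordinate is multiplied by (1 − η λ_i) at every step.
        theta = [(1.0 - eta * lam) * th for lam, th in zip(eigenvalues, theta)]
    return [abs(th) for th in theta]


eigenvalues = [0.5, 2.0, 10.0]        # hypothetical Hessian spectrum, λ_max = 10
threshold = 2.0 / max(eigenvalues)    # stability bound η < 2/λ_max

stable = gd_on_quadratic(eigenvalues, eta=0.9 * threshold)
unstable = gd_on_quadratic(eigenvalues, eta=1.1 * threshold)

print("largest |θ_i| just below the bound:", max(stable))
print("largest |θ_i| just above the bound:", max(unstable))
```

Just below the bound every coordinate shrinks toward zero; just above it, the coordinate along the λ_max direction grows geometrically while the others still shrink.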


2. Step Decay

The simplest non-constant schedule: multiply the learning rate by a factor γ at fixed step intervals.

η_t = η_0 · γ^{⌊t / S⌋}

where:
- η_0 = initial learning rate
- γ = decay factor (typically 0.1 or 0.5)
- S = step interval (e.g., every 30 epochs)

At step t = kS: η = η_0 · γ^k

Worked derivation: If η_0 = 0.1, γ = 0.5, S = 30:

| Steps | k = ⌊t/30⌋ | γ^k | η_t |
|---|---|---|---|
| 0–29 | 0 | 1.0 | 0.1 |
| 30–59 | 1 | 0.5 | 0.05 |
| 60–89 | 2 | 0.25 | 0.025 |

Why it works: Models often hit plateaus where the current learning rate is too large to make further progress: the updates bounce around a minimum instead of descending into it. Multiplying the rate by γ shrinks the steps and allows finer descent, which is why the loss often falls sharply right after each drop.

Limitation: The abrupt changes can cause transient instabilities. And the "when to step" is arbitrary — we're guessing where plateaus occur.
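As a minimal sketch, the step decay formula maps directly onto integer division. The function name and default arguments are illustrative and mirror the worked derivation above.

```python
def step_decay(t, eta0=0.1, gamma=0.5, interval=30):
    """Learning rate at step t: eta0 · gamma^floor(t / interval)."""
    return eta0 * gamma ** (t // interval)


# Print the rate on either side of each boundary from the worked derivation.
for t in (0, 29, 30, 59, 60, 89):
    print(t, step_decay(t))
```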


3. Exponential Decay

Smooth continuous decay:

η_t = η_0 · e^{−λt}

or equivalently:

η_t = η_0 · γ^t     where γ = e^{−λ}

Derivation from differential equation: Assume we want the relative rate of decrease to be constant:

$dη/dt = −λ · η
$

This is a separable ODE: dη/η = −λ dt → ln(η) = −λt + C → η = η_0 · e^{−λt}. ✓

Half-life: The time to halve: η_0/2 = η_0 · e^{−λt_{1/2}} → t_{1/2} = ln(2)/λ.

Decay factor interpretation: After each step, η_{t+1} = η_t · γ where γ = e^{−λ} ≈ 1−λ for small λ. So exponential decay is approximately "multiply by γ each step."

Pros and cons:
- Smooth, with no abrupt jumps (better than step decay)
- The later stages have VERY small learning rates, so training can stall before convergence
- No notion of a training "budget": the rate just decays forever
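The relations between λ, the half-life, and a target learning rate can be sketched in a few lines. The helper names are hypothetical, and the sample values (0.01 decaying to 0.001 over 10,000 steps) are chosen only for illustration.

```python
import math


def exponential_decay(t, eta0, lam):
    """η_t = η_0 · e^(−λt)."""
    return eta0 * math.exp(-lam * t)


def half_life(lam):
    """Steps needed for the learning rate to halve: ln(2) / λ."""
    return math.log(2) / lam


def decay_constant(eta0, eta_target, t):
    """Solve η_target = η_0 · e^(−λt) for λ."""
    return math.log(eta0 / eta_target) / t


lam = decay_constant(eta0=0.01, eta_target=0.001, t=10_000)
print("lambda:", lam)
print("half-life in steps:", half_life(lam))
print("rate after one half-life:", exponential_decay(half_life(lam), 0.01, lam))
```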


4. Cosine Annealing

The learning rate follows a half-cosine curve from η_max down to (near) zero:

$η_t = η_min + ½(η_max − η_min) · (1 + cos(π · t / T))
$

where T is the total number of steps.

Derivation from the cosine function: cos(0) = 1 and cos(π) = −1. So:

- At t = 0: η_0 = η_min + ½(η_max − η_min) · 2 = η_max
- At t = T/2: η = η_min + ½(η_max − η_min) · 1 = (η_max + η_min)/2
- At t = T: η_T = η_min + ½(η_max − η_min) · 0 = η_min

The shape is smooth and symmetrical: steep decay in the middle, gentle at the start and end.

Why cosine specifically? The rate of change is dη/dt = −(π/2T)(η_max − η_min) · sin(πt/T), which is zero at both ends and largest in magnitude at t = T/2. The schedule therefore lingers near η_max early (exploration), lingers near η_min late (refinement), and passes through the intermediate rates fastest, with no abrupt changes that could destabilize training.
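A minimal sketch of the cosine annealing formula, using only the standard library. The function and argument names are illustrative; the sample call uses η_max = 1e-3, η_min = 0, and T = 100,000.

```python
import math


def cosine_annealing(t, total_steps, eta_max, eta_min=0.0):
    """η_t = η_min + ½(η_max − η_min)(1 + cos(π t / T))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / total_steps))


# Endpoints, midpoint, and one intermediate step.
for t in (0, 37_000, 50_000, 100_000):
    print(t, cosine_annealing(t, total_steps=100_000, eta_max=1e-3))
```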


5. Cosine Annealing with Warm Restarts (SGDR)

Extend cosine annealing by periodically resetting η back to η_max:

$η_t = η_min^i + ½(η_max^i − η_min^i) · (1 + cos(π · (t mod T_i) / T_i))
$

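where T_i = T_0 · r^i (each cycle period grows by factor r, typically r = 2). For r > 1 the cycles have different lengths, so read "t mod T_i" as the number of steps elapsed since the most recent restart.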

Why restarts help: Each restart "shakes" the optimizer out of whatever local minimum it settled into, allowing it to explore other basins. The cosine shape assures a gentle landing each time.
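One way to sketch warm restarts is to walk forward through the cycles until the global step falls inside one, then apply the cosine formula within that cycle. This sketch makes two simplifying assumptions: η_max and η_min stay the same in every cycle (the formula above allows them to vary per cycle), and the position within a cycle is counted from the most recent restart. The function name is hypothetical.

```python
import math


def sgdr(t, t0, r, eta_max, eta_min=0.0):
    """Learning rate at global step t for cosine annealing with warm restarts."""
    period = t0
    steps_into_cycle = t
    # Subtract whole cycles until the remaining offset fits in the current one.
    while steps_into_cycle >= period:
        steps_into_cycle -= period
        period *= r  # the next cycle is r times longer
    cosine = math.cos(math.pi * steps_into_cycle / period)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosine)


# With t0 = 10 and r = 2 the cycles cover steps [0, 10), [10, 30), [30, 70).
for t in (0, 9, 10, 20, 29, 30):
    print(t, sgdr(t, t0=10, r=2, eta_max=1.0))
```

The rate returns to η_max at steps 10 and 30, the start of each new cycle.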

⚠️ THIS IS CRITICAL: Cosine annealing (a single cycle, without restarts) is the dominant LR schedule in modern LLM training. Almost all major models (GPT-3, Llama, Chinchilla, etc.) use linear warmup + cosine decay.


6. Linear Warmup

Start with a very small learning rate and linearly increase to the target:

η_t = η_target · t / T_warmup      for t ≤ T_warmup

Why warmup is mathematically necessary:

At initialization, the model weights are random. The gradient is large and noisy. The model's "direction" is essentially random. If we take a large step in a random direction, we can push the network into a region from which recovery is difficult.

More formally: Consider the variance of the parameter updates. At initialization, if layers use standard scaling (e.g., Xavier/He), the gradient norm ||∇L|| can be large. With a large learning rate:

$E[||Δθ||²] = η² · E[||∇L||²]
$

If η is too large while ||∇L|| is large, the expected step size overwhelms the parameter scale and the network "explodes" (activations/norms go to infinity or NaN).

The warmup provides a "grace period" where gradients stabilize and the optimizer (especially Adam, which tracks running estimates of the first and second moments of the gradient) can accumulate reasonable statistics before taking large steps.

Warmup length: Typically 1-5% of total steps. For a 100K-step training run, warmup = 1000–5000 steps. Chinchilla used ~1% warmup; Llama used ~2000 steps.


7. One-Cycle Policy (Leslie Smith)

A triangular-ish schedule in three phases, all within one cycle:

Phase 1 (warmup):    η = η_min + (η_max − η_min) · t / T_peak          [0, T_peak]
Phase 2 (annealing): η = η_max − (η_max − η_min) · (t − T_peak) / (T_total − T_peak)   [T_peak, T_total]
Phase 3 (final):     over the last few steps, η is decayed further, from η_min toward roughly η_min/100

The maximum LR η_max is found via an LR range test: increase LR linearly while training, and pick the point where loss stops decreasing.

Momentum schedule is INVERSE: When LR is high, momentum is low (allows exploration). When LR is low, momentum is high (stable convergence).

Why one-cycle works: High LR in the middle acts as a regularizer — large steps prevent the model from memorizing noise patterns. The rapid annealing at the end "crystallizes" the solution.
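The first two phases and the inverse momentum schedule can be sketched as a single function. This is a simplified sketch, not a reference implementation: it omits the final phase, and the momentum bounds 0.85 and 0.95 are assumed illustrative values, not taken from this subject.

```python
def one_cycle(t, t_peak, t_total, eta_min, eta_max, mom_low=0.85, mom_high=0.95):
    """Return (learning rate, momentum) for phases 1 and 2 of a one-cycle schedule."""
    if t <= t_peak:
        # Phase 1: fraction of the way up to the peak learning rate.
        frac = t / t_peak
    else:
        # Phase 2: fraction falls linearly from 1 back to 0 at t_total.
        frac = 1.0 - (t - t_peak) / (t_total - t_peak)
    eta = eta_min + (eta_max - eta_min) * frac
    # Momentum moves opposite to the learning rate: lowest at the LR peak.
    momentum = mom_high - (mom_high - mom_low) * frac
    return eta, momentum


for t in (0, 30_000, 90_000, 100_000):
    print(t, one_cycle(t, t_peak=30_000, t_total=100_000, eta_min=1e-4, eta_max=1e-2))
```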


8. The Standard LLM Training Schedule

Modern LLMs almost universally use:

Linear warmup (first ~1-2% of steps)
    ↓
Cosine decay (remaining ~98-99% of steps)

$η_t = η_max · t / T_warmup                                         if t < T_warmup
η_t = ½η_max · (1 + cos(π·(t − T_warmup)/(T_total − T_warmup)))    if t ≥ T_warmup
$

(Here we set η_min ≈ 0, so the cosine goes from η_max to ~0.)

Justification:

  1. Warmup: Prevents early training collapse (see Section 6)
  2. Cosine decay: Smooth annealing with gentle early descent, faster middle-phase reduction, and a long tail of very low LR for final refinement
  3. No step drops: Eliminates the need to guess optimal step-drop locations
  4. Consistent with optimization theory: The shape roughly resembles what analyses under noisy-gradient assumptions suggest, namely large steps while far from a minimum and progressively smaller steps as noise starts to dominate
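A minimal sketch of the combined schedule as a pure function of the step index. The function and argument names are illustrative; `eta_min` is exposed as a parameter but defaults to 0 to match the formula above. The sample call uses 2,000 warmup steps in a 100,000-step run with η_max = 3e-4.

```python
import math


def warmup_cosine(t, warmup_steps, total_steps, eta_max, eta_min=0.0):
    """Linear warmup to eta_max, then cosine decay to eta_min."""
    if t < warmup_steps:
        # Warmup: rise linearly from 0 to eta_max.
        return eta_max * t / warmup_steps
    # Decay: progress runs from 0 at the end of warmup to 1 at total_steps.
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))


for t in (0, 1_500, 2_000, 50_000, 100_000):
    print(t, warmup_cosine(t, warmup_steps=2_000, total_steps=100_000, eta_max=3e-4))
```

Writing the schedule as a function of `t` alone keeps it independent of any training framework: the training loop only has to set the optimizer's learning rate to the returned value at each step.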


9. Batch Size and LR Scaling

Linear scaling rule (Goyal et al., 2017): When you multiply batch size by k, multiply learning rate by k:

$η_new = η_old · B_new / B_old
$

Derivation intuition: For SGD with batch size B, the gradient estimate is:

$g_B = (1/B) Σ_{i=1}^B g_i
$

The variance of this estimate: Var(g_B) = σ²/B where σ² is per-sample gradient variance. Larger batch = lower variance = "more accurate" gradient = we can take proportionally larger steps.

After k steps with batch size B, the total progress should equal 1 step with batch size k·B. Assume the gradient changes little over those k steps, so that every mini-batch gradient has the same expectation E[g]. The expected parameter change after k SGD steps with LR η and batch size B:

$E[Δθ_{k steps}] = E[−η Σ_{j=1}^k g_B^{(j)}] = −η·k·E[g]
$

One step with batch size k·B and LR η':

$E[Δθ_{1 step}] = −η'·E[g]
$

For equivalence: η' = η·k. ✓
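The equivalence can be sketched numerically. The code below assumes the idealization used in the derivation (a gradient that stays constant across the k small steps, with independent noise of variance σ²/B per step) and compares the mean and variance of the total update. The gradient mean and σ² values are arbitrary illustrative numbers, and the helper names are hypothetical.

```python
def scale_lr(eta_old, batch_old, batch_new):
    """Linear scaling rule: η_new = η_old · B_new / B_old."""
    return eta_old * batch_new / batch_old


def displacement_stats(eta, batch, n_steps, grad_mean, sigma2):
    """Mean and variance of the summed update over n_steps.

    Assumes a constant expected gradient and independent per-step noise
    with variance sigma2 / batch.
    """
    mean = -eta * n_steps * grad_mean
    variance = n_steps * eta ** 2 * sigma2 / batch
    return mean, variance


eta_new = scale_lr(0.1, batch_old=256, batch_new=2048)
small = displacement_stats(0.1, 256, n_steps=8, grad_mean=1.0, sigma2=4.0)
large = displacement_stats(eta_new, 2048, n_steps=1, grad_mean=1.0, sigma2=4.0)

print("scaled learning rate:", eta_new)
print("8 small steps (mean, variance):", small)
print("1 large step  (mean, variance):", large)
```

Under these assumptions both the mean and the variance of the total update agree. Differences between the two runs in practice come from the gradient changing between the small steps, which the idealization ignores.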

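⚠️ THIS IS CRITICAL: This rule is widely used in large-scale training, although it is a heuristic that only holds up to the critical batch size. When you see "effective batch size = 4M tokens," it's typically reached by combining data parallelism across devices with gradient accumulation over smaller per-step batches.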

Critical batch size (McCandlish et al., 2018): There's an upper limit. The "gradient noise scale" B_noise determines when further scaling stops helping:

$B_crit ≈ B_noise = tr(Σ) / (G^T G)
$

where Σ is the per-sample gradient covariance matrix and G is the true gradient. Beyond B_crit, the return on larger batches diminishes.
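To make the ratio concrete, here is a naive plug-in sketch that estimates tr(Σ) / (GᵀG) from a handful of per-sample gradients. It is an illustration of the formula only: the sample mean stands in for the true gradient G, which biases the estimate when few samples are used, and the toy gradients are made up.

```python
def simple_noise_scale(per_sample_grads):
    """Plug-in estimate of tr(Σ) / (GᵀG) from a list of per-sample gradient vectors."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    # Sample mean, used here as a stand-in for the true gradient G.
    mean = [sum(g[j] for g in per_sample_grads) / n for j in range(dim)]
    # tr(Σ) is the sum of the per-coordinate sample variances.
    trace_cov = sum(
        sum((g[j] - mean[j]) ** 2 for g in per_sample_grads) / (n - 1)
        for j in range(dim)
    )
    grad_sq_norm = sum(m * m for m in mean)
    return trace_cov / grad_sq_norm


# Toy per-sample gradients in two dimensions (hypothetical data).
grads = [[1.0, 0.0], [3.0, 0.0], [2.0, 1.0], [2.0, -1.0]]
noise_scale = simple_noise_scale(grads)
print("estimated noise scale:", noise_scale)
```

A large ratio means the per-sample gradients disagree strongly relative to their mean, so averaging over more samples still helps; a small ratio means a modest batch already points in nearly the right direction.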


Worked Examples

Example 1: Computing Learning Rate at Step t

Problem: A training run uses cosine annealing with η_max = 1e-3, η_min = 0, T = 100,000 steps. What is η at t = 37,000?

Solution:

η_t = η_min + ½(η_max − η_min) · (1 + cos(π · t / T))
    = 0 + ½(1e-3) · (1 + cos(π · 37000/100000))
    = 5e-4 · (1 + cos(0.37π))
    = 5e-4 · (1 + cos(66.6°))
    = 5e-4 · (1 + 0.397)
    = 5e-4 · 1.397
    = 6.985 × 10^{−4}

Check: At t=0: 1e-3 ✓. At t=50,000 (halfway): 5e-4·(1+cos(π/2)) = 5e-4 ✓ (halfway down). At t=100,000: 5e-4·(1+cos(π)) = 0 ✓.


Example 2: Warmup + Cosine Combined

Problem: A 100K-step training run has linear warmup for 2000 steps from 0 to η_max = 3e-4, then cosine decay to 0. What is η at step 1500? At step 50,000?

Solution:

Step 1500 (in warmup):

η = η_max · t / T_warmup = 3e-4 · 1500 / 2000 = 3e-4 · 0.75 = 2.25 × 10^{−4}

Step 50,000 (in cosine decay): The cosine portion runs from t = 2000 to t = 100,000 (T_decay = 98,000 steps). Relative position in cosine: τ = (50000 − 2000) / 98000 = 48000/98000 ≈ 0.4898

η = ½ · η_max · (1 + cos(π · τ))
  = 1.5e-4 · (1 + cos(0.4898π))
  = 1.5e-4 · (1 + cos(88.16°))
  = 1.5e-4 · (1 + 0.0321)
  = 1.5e-4 · 1.0321
  = 1.548 × 10^{−4}

Note that at step 50,000 the LR is only slightly above η_max/2. The half-cosine is symmetric about its midpoint, so it crosses η_max/2 exactly halfway through the decay period (step 51,000 here), and step 50,000 falls just before that point.


Example 3: Exponential Decay With Half-Life

Problem: A schedule starts at η_0 = 0.01 with exponential decay. After 10,000 steps, η should be 0.001. Find the decay constant λ and the half-life t_{1/2}.

Solution:

η_t = η_0 · e^{−λt}
0.001 = 0.01 · e^{−λ·10000}
0.1 = e^{−λ·10000}
ln(0.1) = −λ · 10000
−2.3026 = −λ · 10000
λ = 2.3026 / 10000 = 2.3026 × 10^{−4}

Half-life:

$t_{1/2} = ln(2) / λ = 0.6931 / (2.3026×10^{−4}) ≈ 3010 steps
$

Verification: After 3010 steps: η = 0.01 · e^{−2.3026e−4 · 3010} = 0.01 · e^{−0.693} = 0.01 · 0.5 = 0.005 ✓.



Quiz

Q1: In this subject, what does cosine annealing refer to?

A) Multiplying the learning rate by a fixed factor γ at regular step intervals
B) Increasing the learning rate linearly from near zero to a target value
C) Decreasing the learning rate along a half-cosine curve from η_max to η_min over T steps
D) Scaling the learning rate in proportion to the batch size

Correct: C)

Q2: What is the primary purpose of linear warmup?

A) To shrink the learning rate toward zero at the end of training
B) To keep early updates small while gradients are large and noisy and the optimizer's statistics are still unreliable
C) To reset the learning rate to η_max at the start of each cycle
D) To compensate for a change in batch size

Correct: B)

Q3: Which statement about the linear scaling rule is TRUE?

A) It says to divide the learning rate by k when the batch size is multiplied by k
B) It holds for arbitrarily large batch sizes
C) It says the learning rate should not depend on the batch size
D) It says to multiply the learning rate by k when the batch size is multiplied by k

Correct: D)

Q4: In Example 1 (cosine annealing with η_max = 1e-3, η_min = 0, T = 100,000), what is η at t = 37,000?

A) ≈ 6.985 × 10^{−4}
B) 3.7 × 10^{−4}
C) 5.0 × 10^{−4}
D) 1.0 × 10^{−3}

Correct: A)

Q5: How are the linear scaling rule and the standard LLM schedule related?

A) The linear scaling rule replaces warmup, so the schedule needs only the cosine phase
B) The linear scaling rule says how the peak learning rate should change with batch size; the schedule determines how η_t ramps up to and decays from that peak
C) They are unrelated, because batch size has no bearing on the learning rate
D) The linear scaling rule makes a schedule unnecessary once the batch is large enough

Correct: B)

Q6: What is a common pitfall when working with learning rate schedules?

A) Using cosine decay after the warmup phase
B) Lowering the learning rate toward the end of training
C) Raising the learning rate when the batch size grows but stays below the critical batch size
D) Starting at the full learning rate with no warmup, which can make early training diverge

Correct: D)

Q7: When is step decay a reasonable choice?

A) When the learning rate must rise smoothly at the start of training
B) When training tends to plateau and you are willing to choose the drop points by hand
C) When the learning rate must reset to η_max periodically
D) When the learning rate must track changes in batch size

Correct: B)

Practice Problems

Problem 1

A cosine annealing schedule has η_max = 5e-4, η_min = 1e-5, T = 50,000. What is the learning rate at t = 12,500?

Answer
τ = 12500/50000 = 0.25
η = 1e-5 + ½(5e-4 − 1e-5) · (1 + cos(0.25π))
  = 1e-5 + 2.45e-4 · (1 + cos(45°))
  = 1e-5 + 2.45e-4 · (1 + 0.7071)
  = 1e-5 + 2.45e-4 · 1.7071
  = 1e-5 + 4.182e-4
  = 4.282 × 10^{−4}
At 25% of training we are still at ~86% of max LR, whereas a linear decay over the same range would already be at about 75%. Cosine decays slowly at first and picks up speed toward the middle.

Problem 2

Derive the formula for step decay after t steps and prove that the ratio of consecutive "plateau" LRs is always γ.

Answer
At step t: η_t = η_0 · γ^{⌊t/S⌋}. At t = kS + δ where 0 ≤ δ < S:
η_{kS+δ} = η_0 · γ^k
η_{kS−1} = η_0 · γ^{k−1}
The ratio when crossing a step boundary: η_{kS} / η_{kS−1} = γ^k / γ^{k−1} = γ. ✓
Between boundaries, the ratio is 1 (constant). So the LR is piecewise constant with multiplicative drops of γ at each boundary.

Problem 3

A training run uses linear warmup for 1000 steps from 0 to 0.001, then 49,000 steps of cosine decay to 1e-6. What fraction of total training steps have LR > η_max / 2?

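Answer
Since η_min = 1e-6 is negligible next to η_max = 1e-3, treat η_min ≈ 0.
Warmup (steps 0 to 1000): η_t = η_max · t/1000 exceeds η_max/2 only for t > 500, which gives 500 steps.
Cosine decay (steps 1000 to 50,000), with τ = (t − 1000)/49,000. The LR equals η_max/2 when:
$½η_max·(1 + cos(π·τ)) = η_max/2
1 + cos(π·τ) = 1
cos(π·τ) = 0
π·τ = π/2
τ = 0.5
$
So η > η_max/2 when τ < 0.5, i.e., for the first half of the cosine decay period: 0.5 · 49,000 = 24,500 steps (t from 1000 to 25,500).
Total = 500 + 24,500 = 25,000 steps. Fraction = 25,000 / 50,000 = 0.50 = 50%. Keeping η_min = 1e-6 shifts the crossing point by only about 16 steps, so the answer is unchanged at this precision.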
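Threshold questions like this one are easy to cross-check by brute force. The sketch below evaluates the schedule at every step of the problem's setup (1,000 warmup steps, 49,000 cosine decay steps, η_max = 1e-3, η_min = 1e-6) and counts the steps above η_max/2. The function name is illustrative, and the count treats steps as the integers 0 to 49,999.

```python
import math


def warmup_cosine(t, warmup_steps, total_steps, eta_max, eta_min):
    """Linear warmup from 0 to eta_max, then cosine decay to eta_min."""
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))


total = 50_000
eta_max = 1e-3
steps_above = sum(
    1 for t in range(total)
    if warmup_cosine(t, 1_000, total, eta_max, 1e-6) > eta_max / 2
)
fraction = steps_above / total
print("steps above half of peak:", steps_above)
print("fraction of training:", fraction)
```

The count includes only the second half of warmup, since the LR is still below η_max/2 for the first 500 steps.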

Problem 4

You're scaling training from batch size 256 with LR 0.1 to batch size 2048 using the linear scaling rule. What should the new LR be? Then compute the "path length" difference: with the old setup you'd take 8× more steps — does this change the total distance traveled in parameter space?

Answer
**New LR:** η_new = 0.1 · 2048/256 = 0.1 · 8 = 0.8.
**Path length:** Treating the gradient g as constant over the 8 small steps, the old setup moves 8 · η_old · ||g|| = 8·0.1·||g|| = 0.8·||g||. The new setup moves 1 · 0.8 · ||g|| = 0.8·||g|| in one step. Same expected movement: the linear scaling rule preserves the effective step size. Under the same idealization the noise matches too: 8 independent steps give total variance 8·η_old²·σ²/256, and one large step gives η_new²·σ²/2048 = 64·η_old²·σ²/2048, which is the same number. The real difference is that the 8 small steps re-evaluate the gradient at each intermediate point and can follow curvature, while the single large step commits to one direction. When the gradient changes quickly the two runs diverge, which is why very large batch training often requires slightly different optimization strategies.

Problem 5

A one-cycle policy uses T_total = 100,000, T_peak = 30,000, η_max = 0.01, η_min = 0.0001. At t = 90,000, compute η. Then explain why momentum should be HIGH at this point.

Answer
t = 90,000 is in Phase 2 (annealing, since 30,000 < 90,000 < 100,000).
η = η_max − (η_max − η_min) · (t − T_peak) / (T_total − T_peak)
  = 0.01 − (0.01 − 0.0001) · (90,000 − 30,000) / (100,000 − 30,000)
  = 0.01 − 0.0099 · 60,000 / 70,000
  = 0.01 − 0.0099 · 0.8571
  = 0.01 − 0.008486
  = 1.514 × 10^{−3}
**Momentum should be HIGH:** At t=90K, we're near the end of training and the LR is low. One-cycle runs momentum inversely to the LR (Section 7): momentum is lowest at the LR peak, where large steps already provide exploration, and returns to its highest value as the LR anneals. With small steps, high momentum averages out gradient noise and keeps progress steady, and the small LR limits how far accumulated velocity can overshoot.

Summary

  1. Learning rate schedules address the fundamental tradeoff between exploration (high LR) and refinement (low LR) — no single constant LR optimizes both phases
  2. Cosine annealing is the dominant schedule for LLMs: it starts at η_max, decays smoothly to near zero, and changes slowly near both ends and fastest in the middle
  3. Linear warmup is mathematically necessary to prevent early training collapse — it gives the optimizer time to accumulate stable statistics before taking large steps
  4. The linear scaling rule (η ∝ B) means doubling batch size allows doubling learning rate, up to the critical batch size beyond which returns diminish
  5. The standard LLM schedule = linear warmup (~1% of steps) + cosine decay (~99% of steps) — used by GPT-3, Llama, Chinchilla, and most modern models

Pitfalls


Key Terms

| Term | Definition |
|---|---|
| Learning rate schedule | A function η_t that varies the learning rate over training steps to balance exploration and refinement |
| Cosine annealing | η_t descends along a half-cosine curve from η_max to η_min; the dominant schedule for LLM training |
| Linear warmup | Learning rate increases linearly from near-zero to η_max over the first ~1% of steps, preventing early training collapse |
| Step decay | Multiply LR by factor γ at fixed step intervals; creates piecewise-constant plateaus |
| Exponential decay | η_t = η_0·e^{−λt}; smooth but decays forever with no training budget awareness |
| One-cycle policy | Three-phase schedule (warmup → annealing → annihilation) with inverse momentum; common in vision |
| SGDR | Cosine annealing with warm restarts; periodically resets η to η_max to escape local minima |
| Linear scaling rule | η_new = η_old · B_new/B_old; larger batches allow proportionally larger learning rates |
| Critical batch size | B_crit = tr(Σ)/(G^T G); the batch size beyond which further scaling yields diminishing returns |
| Effective step size | η · |

Next Steps

Continue to 20-02 — Gradient Clipping to learn how to handle the exploding gradient problem that learning rate schedules alone cannot solve.