20-01 — Learning Rate Schedules
Phase: 20 — Training & Fine-tuning Mathematics Subject: 20-01 Prerequisites: 14-02 (Gradient Descent), 14-03 (SGD Variants), 14-04 (Adaptive Learning Rate Methods), 04-03 (The Derivative — rate of change intuition), 03-05 (Exponential and Logarithmic Functions) Next subject: 20-02 — Gradient Clipping
Learning Objectives
By the end of this subject, you will be able to:
- Derive and differentiate the five major learning rate schedule functions: exponential decay, step decay, cosine annealing, linear warmup, and one-cycle policy
- Explain WHY learning rate schedules are necessary — connecting to optimization landscape theory, saddle points, and sharp minima
- Compute the optimal warmup length from the gradient variance argument and derive why warmup prevents divergence
- Design and implement the standard LLM training schedule (linear warmup + cosine decay) with justification for each component
- Analyze the relationship between batch size and learning rate scaling, including the linear scaling rule and the critical batch size
Core Content
1. Why Learning Rate Schedules Matter
In gradient descent, the update rule is:
$θ_{t+1} = θ_t − η_t · ∇L(θ_t)$
where η_t is the learning rate at step t. A constant learning rate presents a fundamental tension:
- Too large: overshoots minima, oscillates, diverges
- Too small: slow convergence, gets stuck in local minima/saddle points
- Just right at the start: eventually becomes too large as we approach the minimum
The optimization landscape changes as training progresses. Early on, parameters are far from good solutions and the gradient is large. Later, we're fine-tuning near a minimum where the curvature matters.
Mathematical intuition: Near a minimum θ*, the loss can be approximated by a quadratic:
$L(θ) ≈ L(θ*) + ½(θ − θ*)^T H (θ − θ*) $
where H = ∇²L(θ*) is the Hessian. Shifting coordinates so that θ* = 0, the gradient is Hθ, and gradient descent with a constant rate η gives:
$θ_{t+1} = θ_t − η H θ_t = (I − ηH) θ_t$
For stability we need ||I − ηH|| < 1. When H is positive definite, this holds exactly when 0 < η < 2/λ_max, where λ_max is the largest eigenvalue of H. So the sharpest direction of curvature near θ* sets the maximum safe learning rate, and that limit is often smaller than what was safe early in training.
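The stability bound can be checked numerically. The sketch below is illustrative only: it assumes a diagonal Hessian with made-up eigenvalues, and the helper name `run_gd` is hypothetical. It runs gradient descent just below and just above 2/λ_max.

```python
def run_gd(eigenvalues, eta, steps=200):
    """Gradient descent on L(theta) = 0.5 * sum(lam_i * theta_i**2), starting at theta_i = 1.

    Returns the final distance from the minimum at the origin.
    """
    theta = [1.0] * len(eigenvalues)
    for _ in range(steps):
        # The gradient of the quadratic is lam_i * theta_i in each eigen-direction.
        theta = [th - eta * lam * th for th, lam in zip(theta, eigenvalues)]
    return sum(th * th for th in theta) ** 0.5


eigs = [0.1, 1.0, 10.0]        # assumed curvatures; lambda_max = 10
threshold = 2.0 / max(eigs)    # stability bound 2 / lambda_max

stable = run_gd(eigs, eta=0.9 * threshold)     # shrinks toward 0
unstable = run_gd(eigs, eta=1.1 * threshold)   # grows without bound
print(f"eta below bound: distance = {stable:.3g}")
print(f"eta above bound: distance = {unstable:.3g}")
```

Only the direction with the largest eigenvalue diverges above the bound; the flatter directions still contract, which is why λ_max alone sets the limit.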
2. Step Decay
The simplest non-constant schedule: multiply the learning rate by a factor γ at fixed step intervals.
η_t = η_0 · γ^{⌊t / S⌋}
where:
- η_0 = initial learning rate
- γ = decay factor (typically 0.1 or 0.5)
- S = step interval (e.g., every 30 epochs)
At step t = kS: η = η_0 · γ^k
Worked derivation: If η_0 = 0.1, γ = 0.5, S = 30:
| Steps | k = ⌊t/30⌋ | γ^k | η_t |
|---|---|---|---|
| 0–29 | 0 | 1.0 | 0.1 |
| 30–59 | 1 | 0.5 | 0.05 |
| 60–89 | 2 | 0.25 | 0.025 |
Why it works: Models often hit plateaus where the current learning rate is too large to make further progress: the iterates bounce around a minimum instead of settling into it. Reducing η (here, halving it) shrinks each step, which allows finer descent, and the loss typically drops sharply right after each decay boundary.
Limitation: The abrupt changes can cause transient instabilities. And the "when to step" is arbitrary — we're guessing where plateaus occur.
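As a minimal sketch, the step decay formula maps directly onto integer division. The function name `step_decay` and its default arguments are illustrative choices that mirror the worked table above.

```python
def step_decay(t, eta0=0.1, gamma=0.5, step_size=30):
    """Learning rate at step t: eta0 * gamma ** floor(t / step_size)."""
    return eta0 * gamma ** (t // step_size)


# Reproduce the plateaus from the worked table.
for t in (0, 29, 30, 59, 60, 89):
    print(t, step_decay(t))
```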
3. Exponential Decay
Smooth continuous decay:
η_t = η_0 · e^{−λt}
or equivalently:
η_t = η_0 · γ^t where γ = e^{−λ}
Derivation from differential equation: Assume we want the relative rate of decrease to be constant:
$dη/dt = −λ · η $
This is a separable ODE: dη/η = −λ dt → ln(η) = −λt + C → η = η_0 · e^{−λt}. ✓
Half-life: The time to halve: η_0/2 = η_0 · e^{−λt_{1/2}} → t_{1/2} = ln(2)/λ.
Decay factor interpretation: After each step, η_{t+1} = η_t · γ where γ = e^{−λ} ≈ 1−λ for small λ. So exponential decay is approximately "multiply by γ each step."
Pros and cons:
- Smooth, with no abrupt jumps (better than step decay)
- The later stages have VERY small learning rates, so training can stall before convergence
- No notion of a training "budget": the rate just decays forever
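The relations between λ, the half-life, and a target learning rate can be sketched in a few lines. The helper names below are hypothetical, and the numbers (η_0 = 0.01 decaying to 0.001 over 10,000 steps) are just one illustrative setting.

```python
import math


def exp_decay(t, eta0, lam):
    """Learning rate at step t under eta0 * exp(-lam * t)."""
    return eta0 * math.exp(-lam * t)


def decay_constant(eta0, eta_target, t_target):
    """Solve eta_target = eta0 * exp(-lam * t_target) for lam."""
    return math.log(eta0 / eta_target) / t_target


def half_life(lam):
    """Number of steps for the learning rate to halve."""
    return math.log(2) / lam


lam = decay_constant(0.01, 0.001, 10_000)
print(f"lambda = {lam:.4e}, half-life = {half_life(lam):.0f} steps")
```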
4. Cosine Annealing
The learning rate follows a half-cosine curve from η_max down to (near) zero:
$η_t = η_min + ½(η_max − η_min) · (1 + cos(π · t / T)) $
where T is the total number of steps.
Derivation from the cosine function: cos(0) = 1, cos(π) = −1. So:
- At t=0: η_0 = η_min + ½(η_max−η_min)·(1+1) = η_max ✓
- At t=T: η_T = η_min + ½(η_max−η_min)·(1−1) = η_min ✓
The shape is smooth and symmetrical — steep decay in the middle, gentle at the start and end.
Why cosine specifically? The rate of change is dη/dt = −(π/(2T))·(η_max − η_min)·sin(πt/T): zero at t = 0 and t = T, and largest in magnitude at t = T/2. The schedule therefore lingers near η_max early on and near η_min at the end, and passes through intermediate rates fastest. Early steps at high LR explore, late steps at low LR refine, and the smooth transition avoids abrupt changes that could destabilize training.
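A direct transcription of the cosine formula makes the endpoint checks easy to reproduce. This is a sketch; the function and argument names are illustrative.

```python
import math


def cosine_annealing(t, eta_max, eta_min, total_steps):
    """Half-cosine decay from eta_max (t = 0) to eta_min (t = total_steps)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / total_steps))


T = 100_000
for t in (0, T // 4, T // 2, 3 * T // 4, T):
    print(t, cosine_annealing(t, eta_max=1e-3, eta_min=0.0, total_steps=T))
```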
5. Cosine Annealing with Warm Restarts (SGDR)
Extend cosine annealing by periodically resetting η back to η_max:
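$η_t = η_min^i + ½(η_max^i − η_min^i) · (1 + cos(π · T_cur / T_i))$
where i indexes the current cycle, T_cur is the number of steps since the last restart, and T_i = T_0 · r^i is the length of cycle i (each cycle grows by a factor r, typically r = 2). When all cycles have the same length (r = 1), T_cur = t mod T_0.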
Why restarts help: Each restart "shakes" the optimizer out of whatever local minimum it settled into, allowing it to explore other basins. The cosine shape assures a gentle landing each time.
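One way to implement warm restarts is to track the number of steps since the last restart and the length of the current cycle. The sketch below makes two simplifying assumptions: η_max and η_min are the same in every cycle, and the growth factor r is an integer. The function name `sgdr` is illustrative.

```python
import math


def sgdr(t, eta_max, eta_min, t0, r=2):
    """Cosine annealing with warm restarts; cycle i lasts t0 * r**i steps."""
    cycle_len = t0
    t_cur = t
    # Walk forward through completed cycles to find the position in the current one.
    while t_cur >= cycle_len:
        t_cur -= cycle_len
        cycle_len *= r
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / cycle_len))


# With t0 = 10 and r = 2, the cycles cover steps [0, 10), [10, 30), [30, 70), ...
for t in (0, 9, 10, 20, 29, 30):
    print(t, round(sgdr(t, eta_max=1.0, eta_min=0.0, t0=10), 4))
```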
⚠️ THIS IS CRITICAL — Cosine annealing is the dominant LR schedule in modern LLM training. Almost all major models (GPT-3, Llama, Chinchilla, etc.) use linear warmup + cosine decay.
6. Linear Warmup
Start with a very small learning rate and linearly increase to the target:
η_t = η_target · t / T_warmup for t ≤ T_warmup
Why warmup is mathematically necessary:
At initialization, the model weights are random. The gradient is large and noisy. The model's "direction" is essentially random. If we take a large step in a random direction, we can push the network into a region from which recovery is difficult.
More formally: Consider the variance of the parameter updates. At initialization, if layers use standard scaling (e.g., Xavier/He), the gradient norm ||∇L|| can be large. With a large learning rate:
$E[||Δθ||²] = η² · E[||∇L||²] $
If η is too large while ||∇L|| is large, the expected step size overwhelms the parameter scale and the network "explodes" (activations/norms go to infinity or NaN).
The warmup provides a "grace period" where gradients stabilize and the optimizer (especially Adam, which tracks running estimates of the first and second moments of the gradient) can accumulate reasonable statistics before taking large steps.
Warmup length: Typically 1-5% of total steps. For a 100K-step training run, warmup = 1000–5000 steps. Chinchilla used ~1% warmup; Llama used ~2000 steps.
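The warmup ramp itself is a one-liner. In the sketch below, holding the rate at η_target after warmup is an assumption made so the function is defined for every step; in practice a decay schedule takes over at that point. The name `linear_warmup` is illustrative.

```python
def linear_warmup(t, eta_target, warmup_steps):
    """Linear ramp from 0 to eta_target over warmup_steps, then hold (assumed)."""
    if t >= warmup_steps:
        return eta_target
    return eta_target * t / warmup_steps


# A 2000-step warmup to 3e-4, i.e. 2% of a 100K-step run.
for t in (0, 500, 1500, 2000, 5000):
    print(t, linear_warmup(t, eta_target=3e-4, warmup_steps=2000))
```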
7. One-Cycle Policy (Leslie Smith)
A triangular-ish schedule in three phases, all within one cycle:
Phase 1 (warmup): η = η_min + (η_max − η_min) · t / T_peak [0, T_peak]
Phase 2 (annealing): η = η_max − (η_max − η_min) · (t − T_peak) / (T_total − T_peak) [T_peak, T_total]
Phase 3 (final): over the last few steps, η is decayed further, from η_min down to roughly η_min/100
The maximum LR η_max is found via an LR range test: increase LR linearly while training, and pick the point where loss stops decreasing.
Momentum schedule is INVERSE: When LR is high, momentum is low (allows exploration). When LR is low, momentum is high (stable convergence).
Why one-cycle works: High LR in the middle acts as a regularizer — large steps prevent the model from memorizing noise patterns. The rapid annealing at the end "crystallizes" the solution.
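The two linear phases and the inverse momentum schedule can be sketched together. The momentum bounds 0.85 and 0.95 below are assumed illustrative values, not taken from the text, and the short final phase is omitted for brevity.

```python
def one_cycle(t, t_peak, t_total, eta_min, eta_max, mom_low=0.85, mom_high=0.95):
    """Return (learning_rate, momentum) for the warmup and annealing phases."""
    if t <= t_peak:
        frac = t / t_peak                              # 0 -> 1 during warmup
    else:
        frac = 1 - (t - t_peak) / (t_total - t_peak)   # 1 -> 0 during annealing
    lr = eta_min + (eta_max - eta_min) * frac
    # Momentum moves opposite to the learning rate: lowest at the LR peak.
    momentum = mom_high - (mom_high - mom_low) * frac
    return lr, momentum


for t in (0, 30_000, 90_000, 100_000):
    lr, mom = one_cycle(t, t_peak=30_000, t_total=100_000, eta_min=1e-4, eta_max=1e-2)
    print(t, f"lr={lr:.4e}", f"momentum={mom:.3f}")
```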
8. The Standard LLM Training Schedule
Modern LLMs almost universally use linear warmup (first ~1-2% of steps) followed by cosine decay (remaining ~98-99% of steps):
η_t = η_max · t / T_warmup, if t < T_warmup
η_t = ½η_max · (1 + cos(π·(t − T_warmup)/(T_total − T_warmup))), if t ≥ T_warmup
(Here we set η_min ≈ 0, so the cosine goes from η_max to ~0.)
Justification:
1. Warmup: Prevents early training collapse (see Section 6)
2. Cosine decay: Smooth annealing with gentle early descent, faster middle-phase reduction, and a long tail of very low LR for final refinement
3. No step drops: Eliminates the need to guess optimal step-drop locations
4. Matched to optimization theory: The shape approximately follows what a theoretically optimal schedule would look like under noisy gradient assumptions
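A minimal sketch of the combined schedule follows. The function name `warmup_cosine` and the optional `eta_min` argument are illustrative additions; with `eta_min=0` the function matches the piecewise definition in this section.

```python
import math


def warmup_cosine(t, eta_max, warmup_steps, total_steps, eta_min=0.0):
    """Linear warmup to eta_max, then half-cosine decay to eta_min."""
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    # Fraction of the decay phase completed, from 0 to 1.
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))


for t in (0, 1500, 2000, 50_000, 100_000):
    print(t, f"{warmup_cosine(t, eta_max=3e-4, warmup_steps=2000, total_steps=100_000):.4e}")
```

In PyTorch, a function like this is usually wired in through `torch.optim.lr_scheduler.LambdaLR`, which expects a multiplier on the optimizer's base rate, so the function would return η_t / η_max there.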
9. Batch Size and LR Scaling
Linear scaling rule (Goyal et al., 2017): When you multiply batch size by k, multiply learning rate by k:
$η_new = η_old · B_new / B_old $
Derivation intuition: For SGD with batch size B, the gradient estimate is:
$g_B = (1/B) Σ_{i=1}^B g_i$
The variance of this estimate: Var(g_B) = σ²/B where σ² is per-sample gradient variance. Larger batch = lower variance = "more accurate" gradient = we can take proportionally larger steps.
The rule asks that k steps with batch size B make the same expected progress as 1 step with batch size k·B. Assuming the gradient changes little over those k steps, so that every mini-batch gradient has the same expectation E[g], the expected parameter change after k SGD steps with LR η and batch size B is:
$E[Δθ_{k steps}] = E[−η Σ_{j=1}^k g_B^{(j)}] = −η·k·E[g]$
One step with batch size k·B and LR η':
$E[Δθ_{1 step}] = −η'·E[g]$
For equivalence: η' = η·k. ✓
⚠️ THIS IS CRITICAL — This rule is used everywhere in large-scale training. When you see "effective batch size = 4M tokens," it's typically achieved via gradient accumulation with a smaller per-step batch.
Critical batch size (McCandlish et al., 2018): There's an upper limit. The "gradient noise scale" B_noise determines when further scaling stops helping:
$B_crit ≈ B_noise = tr(Σ) / G^T G $
where Σ is the gradient covariance and G is the true gradient. Beyond B_crit, the return on larger batches diminishes.
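Both quantities in this section are simple to compute once per-sample gradients are available. The sketch below uses a toy list of 2-D gradient vectors and substitutes the sample mean for the true gradient G. That substitution is an assumption that biases the estimate, so treat the result as illustrative only. The helper names are hypothetical.

```python
def scale_lr(eta_old, batch_old, batch_new):
    """Linear scaling rule: eta_new = eta_old * B_new / B_old."""
    return eta_old * batch_new / batch_old


def simple_noise_scale(per_sample_grads):
    """Estimate tr(Sigma) / (G^T G) from a list of per-sample gradient vectors."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    # Sample mean stands in for the true gradient G (assumption).
    mean = [sum(g[d] for g in per_sample_grads) / n for d in range(dim)]
    # Trace of the covariance = sum of per-coordinate sample variances.
    trace_cov = sum(
        sum((g[d] - mean[d]) ** 2 for g in per_sample_grads) / (n - 1)
        for d in range(dim)
    )
    grad_sq_norm = sum(m * m for m in mean)
    return trace_cov / grad_sq_norm


toy_grads = [[1.0, 0.0], [3.0, 0.0], [2.0, 2.0], [2.0, -2.0]]
print("scaled LR:", scale_lr(0.1, 256, 2048))
print("noise scale estimate:", simple_noise_scale(toy_grads))
```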
Worked Examples
Example 1: Computing Learning Rate at Step t
Problem: A training run uses cosine annealing with η_max = 1e-3, η_min = 0, T = 100,000 steps. What is η at t = 37,000?
Solution:
η_t = η_min + ½(η_max − η_min) · (1 + cos(π · t / T))
= 0 + ½(1e-3) · (1 + cos(π · 37000/100000))
= 5e-4 · (1 + cos(0.37π))
= 5e-4 · (1 + cos(66.6°))
= 5e-4 · (1 + 0.397)
= 5e-4 · 1.397
= 6.985 × 10^{−4}
Check: At t=0: 1e-3 ✓. At t=50,000 (halfway): 5e-4·(1+cos(π/2)) = 5e-4 ✓ (halfway down). At t=100,000: 5e-4·(1+cos(π)) = 0 ✓.
Example 2: Warmup + Cosine Combined
Problem: A 100K-step training run has linear warmup for 2000 steps from 0 to η_max = 3e-4, then cosine decay to 0. What is η at step 1500? At step 50,000?
Solution:
Step 1500 (in warmup):
η = η_max · t / T_warmup = 3e-4 · 1500 / 2000 = 3e-4 · 0.75 = 2.25 × 10^{−4}
Step 50,000 (in cosine decay): The cosine portion runs from t = 2000 to t = 100,000 (T_decay = 98,000 steps). Relative position in cosine: τ = (50000 − 2000) / 98000 = 48000/98000 ≈ 0.4898
η = ½ · η_max · (1 + cos(π · τ))
= 1.5e-4 · (1 + cos(0.4898π))
= 1.5e-4 · (1 + cos(88.16°))
= 1.5e-4 · (1 + 0.0321)
= 1.5e-4 · 1.0321
= 1.548 × 10^{−4}
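Note that τ ≈ 0.49 is almost exactly the midpoint of the decay period, so the LR is only slightly above η_max/2. The half-cosine is symmetric about its midpoint: the LR stays above η_max/2 for the first half of the decay and below it for the second half.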
Example 3: Exponential Decay With Half-Life
Problem: A schedule starts at η_0 = 0.01 with exponential decay. After 10,000 steps, η should be 0.001. Find the decay constant λ and the half-life t_{1/2}.
Solution:
η_t = η_0 · e^{−λt}
0.001 = 0.01 · e^{−λ·10000}
0.1 = e^{−λ·10000}
ln(0.1) = −λ · 10000
−2.3026 = −λ · 10000
λ = 2.3026 / 10000 = 2.3026 × 10^{−4}
Half-life:
$t_{1/2} = ln(2) / λ = 0.6931 / (2.3026×10^{−4}) ≈ 3010 steps$
Verification: After 3010 steps: η = 0.01 · e^{−2.3026e−4 · 3010} = 0.01 · e^{−0.693} = 0.01 · 0.5 = 0.005 ✓.
Quiz
Q1: In cosine annealing from η_max to η_min over T steps, what is the learning rate at t = T/2?
A) η_max B) η_min C) (η_max + η_min)/2 D) η_max/4
Correct: C)
- If you chose A: This is incorrect. η_max is the value at t = 0, where cos(0) = 1.
- If you chose B: This is incorrect. η_min is the value at t = T, where cos(π) = −1.
- If you chose C: At t = T/2, cos(π/2) = 0, so η = η_min + ½(η_max − η_min) = (η_max + η_min)/2. Correct!
- If you chose D: This is incorrect. At the midpoint cos(π/2) = 0, which leaves the average of η_max and η_min, not a quarter of η_max.
Q2: What is the primary purpose of linear warmup?
A) To finish training faster by taking the largest steps first B) To keep early updates small while gradients are large and noisy and the optimizer's statistics are still unreliable C) To replace the decay phase of the schedule D) To periodically reset the learning rate to η_max
Correct: B)
- If you chose A: This is incorrect. Warmup does the opposite: it starts with very small steps and increases them linearly.
- If you chose B: Large steps in an essentially random direction at initialization can cause divergence. Warmup gives the gradients and the optimizer statistics time to stabilize. Correct!
- If you chose C: This is incorrect. Warmup covers only the first ~1-5% of steps and is followed by a decay phase such as cosine annealing.
- If you chose D: This is incorrect. Periodic resets to η_max describe warm restarts (SGDR), not warmup.
Q3: Which statement about the linear scaling rule is TRUE?
A) When batch size is multiplied by k, the learning rate is divided by k B) The learning rate should not depend on batch size C) The rule holds for arbitrarily large batch sizes D) When batch size is multiplied by k, the learning rate is multiplied by k, up to the critical batch size
Correct: D)
- If you chose A: This is incorrect. Larger batches give lower-variance gradients, which allow larger steps, not smaller ones.
- If you chose B: This is incorrect. The rule ties them together: η_new = η_old · B_new / B_old.
- If you chose C: This is incorrect. Beyond the critical batch size B_crit, the return on larger batches diminishes and the rule stops holding.
- If you chose D: η_new = η_old · B_new / B_old, valid while the batch size stays below B_crit. Correct!
Q4: In Example 3, a schedule decays exponentially from η_0 = 0.01 to 0.001 over 10,000 steps. What is its half-life?
A) ≈ 3010 steps B) 5000 steps C) ≈ 4343 steps D) 10,000 steps
Correct: A)
- If you chose A: λ = ln(10)/10,000 ≈ 2.3026 × 10^{−4}, so t_{1/2} = ln(2)/λ ≈ 3010 steps. Correct!
- If you chose B: This is incorrect. 5000 is half of the 10,000-step window, but the half-life depends on λ, not on the window length.
- If you chose C: This is incorrect. 4343 ≈ 1/λ is the time to decay by a factor of e, not by a factor of 2.
- If you chose D: This is incorrect. 10,000 steps is the time to decay by a factor of 10.
Q5: How are the linear scaling rule and the standard LLM schedule related?
A) The linear scaling rule replaces the schedule once the batch size is large enough B) The scaling rule adjusts the peak rate η_max when the batch size changes, and the schedule determines how η_t varies relative to that peak over training C) They are unrelated: batch size has no bearing on the learning rate D) The scaling rule sets the warmup length, and the schedule sets the batch size
Correct: B)
- If you chose A: This is incorrect. The scaling rule changes the magnitude of the learning rate, not its shape over time, so a schedule is still needed.
- If you chose B: The rule rescales η when the batch size changes, and warmup plus cosine decay then shape η_t over the run. Correct!
- If you chose C: This is incorrect. Gradient variance scales as σ²/B, so batch size directly affects how large a step is safe.
- If you chose D: This is incorrect. The scaling rule is about the learning rate, and the schedule does not choose the batch size.
Q6: Which of the following is a common pitfall when using learning rate schedules?
A) Using linear warmup before cosine decay B) Increasing the cycle length in SGDR from one cycle to the next C) Scaling the learning rate with batch size while below the critical batch size D) Skipping warmup on the assumption that Adam handles initialization
Correct: D)
- If you chose A: This is incorrect. Linear warmup followed by cosine decay is the standard LLM recipe.
- If you chose B: This is incorrect. Growing cycle lengths (T_i = T_0 · r^i) are the recommended SGDR setup.
- If you chose C: This is incorrect. Below B_crit, the linear scaling rule is the intended way to adjust the learning rate.
- If you chose D: Adam's moment estimates start at zero and are unreliable in the first steps, so full-size steps at initialization can cause divergence. Correct!
Q7: When is step decay a reasonable choice?
A) At the very start of training, to prevent divergence from large, noisy gradients B) When training plateaus at the current learning rate and a lower rate would allow finer descent C) When the learning rate must change smoothly at every step D) When the learning rate should be periodically reset to η_max
Correct: B)
- If you chose A: This is incorrect. Protecting the first steps of training is the job of warmup.
- If you chose B: Step decay multiplies the learning rate by γ at fixed intervals, which helps once the current rate is too large to make further progress. Correct!
- If you chose C: This is incorrect. Step decay is piecewise constant with abrupt drops. Exponential decay and cosine annealing are the smooth options.
- If you chose D: This is incorrect. Periodic resets to η_max describe SGDR.
Practice Problems
Problem 1
A cosine annealing schedule has η_max = 5e-4, η_min = 1e-5, T = 50,000. What is the learning rate at t = 12,500?
Answer
τ = 12500/50000 = 0.25
η = 1e-5 + ½(5e-4 − 1e-5) · (1 + cos(0.25π))
= 1e-5 + 2.45e-4 · (1 + cos(45°))
= 1e-5 + 2.45e-4 · (1 + 0.7071)
= 1e-5 + 2.45e-4 · 1.7071
= 1e-5 + 4.182e-4
= 4.282 × 10^{−4}
At 25% of training, the LR is still at ~85% of its maximum (a linear decay would already be at ~75%): cosine decays slowly at first and picks up speed toward the middle of training.
Problem 2
Derive the learning rate under step decay at an arbitrary step t, and prove that the ratio of consecutive "plateau" LRs is always γ.
Answer
At step t: η_t = η_0 · γ^{⌊t/S⌋}. Write t = kS + δ with 0 ≤ δ < S. Then:
η_{kS+δ} = η_0 · γ^k
For k ≥ 1, the last step before the boundary is kS − 1 = (k−1)S + (S−1), so:
η_{kS−1} = η_0 · γ^{k−1}
The ratio when crossing a step boundary: η_{kS} / η_{kS−1} = γ^k / γ^{k−1} = γ. ✓
Between boundaries, the ratio is 1 (constant). So the LR is piecewise constant with multiplicative drops of γ at each boundary.
Problem 3
A training run uses linear warmup for 1000 steps from 0 to 0.001, then 49,000 steps of cosine decay to 1e-6. What fraction of total training steps have LR > η_max / 2?
Answer
During cosine decay, with τ = (t − 1000)/49,000 and η_min << η_max, the LR crosses η_max/2 when:
η_min + ½(η_max − η_min)·(1 + cos(π·τ)) = η_max/2
½η_max·(1 + cos(π·τ)) ≈ η_max/2    [η_min << η_max]
1 + cos(π·τ) = 1
cos(π·τ) = 0
π·τ = π/2
τ = 0.5
So η > η_max/2 for τ < 0.5, i.e., for the first half of the decay period: 0.5 · 49,000 = 24,500 steps, from t = 1000 to t = 25,500.
During warmup, η = 0.001 · t/1000 exceeds η_max/2 = 0.0005 only for t > 500, which contributes another 500 steps.
Total = 500 + 24,500 = 25,000 steps. Fraction = 25,000 / 50,000 = 0.50 = 50%.
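A brute-force count over all steps is a useful cross-check on this kind of answer, since it also covers the warmup phase and keeps the small η_min term. The sketch below hard-codes this problem's parameters; the function name `schedule` is illustrative.

```python
import math


def schedule(t, eta_max=1e-3, eta_min=1e-6, warmup=1000, decay=49_000):
    """Linear warmup from 0 to eta_max, then cosine decay to eta_min."""
    if t < warmup:
        return eta_max * t / warmup
    tau = (t - warmup) / decay
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * tau))


total = 50_000
# Count the steps whose learning rate is strictly above eta_max / 2.
above = sum(1 for t in range(total) if schedule(t) > 0.5e-3)
fraction = above / total
print(f"{above} of {total} steps ({fraction:.1%}) have LR > eta_max / 2")
```

The count should land close to one half of the run, with a small offset because η_min is not exactly zero.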
Problem 4
You're scaling training from batch size 256 with LR 0.1 to batch size 2048 using the linear scaling rule. What should the new LR be? Then compute the "path length" difference: with the old setup you'd take 8× more steps — does this change the total distance traveled in parameter space?
Answer
**New LR:** η_new = 0.1 · 2048/256 = 0.1 · 8 = 0.8.
**Path length:** With the old setup after 8 steps: total movement = 8 · η_old · ||g|| = 8·0.1·||g|| = 0.8·||g||. With new setup after 1 step: 1 · 0.8 · ||g|| = 0.8·||g||. Same expected movement: the linear scaling preserves the effective step size. But the variance is different: 8 small noisy steps explore a larger volume (due to random walk), while 1 large step with lower noise goes more directly toward the true gradient. This is why very large batch training often requires slightly different optimization strategies.
Problem 5
A one-cycle policy uses T_total = 100,000, T_peak = 30,000, η_max = 0.01, η_min = 0.0001. At t = 90,000, compute η. Then explain why momentum should be HIGH at this point.
Answer
t = 90,000 is in Phase 2 (annealing, since 30,000 < 90,000 < 100,000).
η = η_max − (η_max − η_min) · (t − T_peak) / (T_total − T_peak)
= 0.01 − (0.01 − 0.0001) · (90,000 − 30,000) / (100,000 − 30,000)
= 0.01 − 0.0099 · 60,000 / 70,000
= 0.01 − 0.0099 · 0.8571
= 0.01 − 0.008486
= 1.514 × 10^{−3}
**Momentum should be HIGH:** The one-cycle momentum schedule is the inverse of the LR schedule (Section 7). At t = 90K the LR has fallen to about 15% of η_max, so momentum is close to its maximum. With small steps, high momentum averages out gradient noise and gives stable convergence toward the minimum. Low momentum belongs to the high-LR phase around T_peak, where it allows exploration.
Summary
- Learning rate schedules address the fundamental tradeoff between exploration (high LR) and refinement (low LR) — no single constant LR optimizes both phases
- Cosine annealing is the dominant schedule for LLMs: it starts at η_max, decays smoothly to near zero, and changes slowly near both ends of the range and fastest in the middle
- Linear warmup is mathematically necessary to prevent early training collapse — it gives the optimizer time to accumulate stable statistics before taking large steps
- The linear scaling rule (η ∝ B) means doubling batch size allows doubling learning rate, up to the critical batch size beyond which returns diminish
- The standard LLM schedule = linear warmup (~1% of steps) + cosine decay (~99% of steps) — used by GPT-3, Llama, Chinchilla, and most modern models
Pitfalls
- Applying the linear scaling rule beyond the critical batch size. The rule η_new = η_old · B_new/B_old only holds while batch size is below B_crit = tr(Σ)/GᵀG. Beyond B_crit, gradient variance is already negligible, and further scaling η proportionally just destabilizes training. If you double batch size from 2M to 4M tokens and training diverges, you've likely exceeded the critical batch size.
- Skipping warmup because "Adam handles initialization." Adam's m_t and v_t start at zero, and in the first steps the second-moment estimate v_t is built from only a handful of noisy gradients. The resulting per-parameter step sizes are unreliable, and at full learning rate they can push the network into NaN or unrecoverable divergence. Warmup gives Adam time to accumulate stable statistics before it takes full-size steps, so treat it as a required part of the recipe, not a nice-to-have.
- Using step decay for LLM training instead of cosine. Step decay causes abrupt learning rate drops that create transient instabilities at each boundary. Cosine annealing provides smooth, continuous decay that avoids these discontinuities. The standard LLM recipe (linear warmup + cosine decay) has been validated across hundreds of models; step decay is a legacy technique not suited to large-scale transformer training.
- Setting η_min to exactly zero in cosine annealing. As η approaches zero, the optimizer steps become negligible, so the last stretch of the training budget produces almost no change in the model. A small positive floor avoids this: GPT-3 and Llama, for example, decay to 10% of η_max instead of to zero.
- Using cosine annealing with warm restarts (SGDR) without increasing cycle lengths. Each successive cycle should be LONGER (typically doubling: T_i = T₀ · 2^i). Using equal-length cycles means the model spends too much time being "shaken" and too little time converging. The increasing cycle length ensures early cycles are exploratory while later cycles provide extended refinement.
Key Terms
| Term | Definition |
|---|---|
| Learning rate schedule | A function η_t that varies the learning rate over training steps to balance exploration and refinement |
| Cosine annealing | η_t descends along a half-cosine curve from η_max to η_min — the dominant schedule for LLM training |
| Linear warmup | Learning rate increases linearly from near-zero to η_max over the first ~1% of steps, preventing early training collapse |
| Step decay | Multiply LR by factor γ at fixed step intervals; creates piecewise-constant plateaus |
| Exponential decay | η_t = η_0·e^{−λt}; smooth but decays forever with no training budget awareness |
| One-cycle policy | Three-phase schedule (warmup → annealing → annihilation) with inverse momentum; common in vision |
| SGDR | Cosine annealing with warm restarts — periodically resets η to η_max to escape local minima |
| Linear scaling rule | η_new = η_old · B_new/B_old; larger batches allow proportionally larger learning rates |
| Critical batch size | B_crit = tr(Σ)/G^T G; the batch size beyond which further scaling yields diminishing returns |
| Effective step size | η · ‖g‖ (learning rate times gradient norm); the size of a single parameter update |
Next Steps
Continue to 20-02 — Gradient Clipping to learn how to handle the exploding gradient problem that learning rate schedules alone cannot solve.