14-04 — Adaptive Learning Rate Methods
Phase: Optimization | Subject: 14-04
Prerequisites: 14-02-gradient-descent.md, 14-03-variants-gradient-descent.md
Next subject: 14-05-second-order-methods.md
Learning Objectives
By the end of this subject, you will be able to:
- Explain why a single global learning rate is suboptimal for problems with varying gradient scales
- Derive the AdaGrad, RMSprop, and Adam update rules
- Explain the bias correction mechanism in Adam and why it matters in early iterations
- Diagnose when Adam is diverging vs converging slowly
- Choose appropriate hyperparameters ($\beta_1$, $\beta_2$, $\epsilon$) for practical use
Core Content
The Problem: One Size Doesn't Fit All
In neural networks, different parameters can have vastly different gradient scales. Early layers often get tiny gradients (the vanishing gradient problem), while biases and output layer weights often get large gradients. A single $\eta$ that works for one parameter may be too large for another (causing divergence) or too small (causing slow convergence).
Adaptive methods maintain a per-parameter learning rate, scaled by historical gradient information.
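As a rough illustration of the problem, the sketch below runs plain gradient descent on a two-parameter quadratic whose curvatures differ by 100×. The function, the starting point, and both learning rates are hypothetical values chosen for the demonstration; they are not taken from the text above.

```python
# Plain gradient descent with ONE global learning rate on
# f(x, y) = 0.5 * (100 * x**2 + y**2).
# The curvatures 100 and 1 are hypothetical, picked so the gradient scales differ.

def run_gd(eta, steps=100):
    x, y = 1.0, 1.0
    for _ in range(steps):
        gx, gy = 100.0 * x, 1.0 * y   # partial derivatives of f
        x -= eta * gx
        y -= eta * gy
    return x, y

x_small, y_small = run_gd(eta=0.015)   # stable for x (needs eta < 2/100)
x_large, y_large = run_gd(eta=0.021)   # slightly too large for x

print(f"eta=0.015: x={x_small:.2e}, y={y_small:.3f}")
print(f"eta=0.021: x={x_large:.2e}, y={y_large:.3f}")
```

With the smaller rate the steep coordinate converges quickly while the shallow one is still far from zero after 100 steps; nudging the rate up makes the steep coordinate diverge. No single value suits both.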
AdaGrad (Duchi et al., 2011)
AdaGrad accumulates the sum of squared gradients and scales each parameter's learning rate inversely:
$$G_{k} = G_{k-1} + \nabla f(\mathbf{x}_k) \odot \nabla f(\mathbf{x}_k)$$
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{\eta}{\sqrt{G_k + \epsilon}} \odot \nabla f(\mathbf{x}_k)$$
- $G_k$ — sum of squared gradients (per-parameter), always growing
- $\epsilon \approx 10^{-8}$ — prevents division by zero
- $\odot$ — element-wise multiplication/division
Key property: Parameters with large historical gradients get small effective learning rates; parameters with small historical gradients get large effective learning rates. This is well suited to sparse features (NLP, recommender systems), where some parameters are updated rarely and would otherwise learn too slowly.
⚠️ Critical limitation: $G_k$ grows monotonically and never decreases → effective learning rate shrinks to zero → AdaGrad stops learning on long training runs. This makes it unsuitable for non-convex deep learning where you need to keep exploring.
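The update above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper name `adagrad_step` and the synthetic gradient stream (one dense parameter, one sparse parameter) are assumptions made for the example.

```python
import math

def adagrad_step(x, g, G, eta=0.1, eps=1e-8):
    """One AdaGrad update on lists: parameters x, gradients g, accumulators G."""
    G = [Gi + gi * gi for Gi, gi in zip(G, g)]
    x = [xi - eta / math.sqrt(Gi + eps) * gi for xi, gi, Gi in zip(x, g, G)]
    return x, G

# Hypothetical gradient stream: parameter 0 gets a gradient every step,
# parameter 1 only every 10th step (a "sparse" feature).
x, G = [0.0, 0.0], [0.0, 0.0]
for k in range(100):
    g = [1.0, 1.0 if k % 10 == 0 else 0.0]
    x, G = adagrad_step(x, g, G)

eff_lr = [0.1 / math.sqrt(Gi + 1e-8) for Gi in G]
print("accumulators:", G)
print("effective learning rates:", eff_lr)
```

The rarely updated parameter keeps a larger effective learning rate, but both accumulators can only grow, which is the limitation described above.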
RMSprop (Tieleman & Hinton, 2012)
RMSprop fixes AdaGrad's decay problem by using an exponentially weighted moving average instead of a cumulative sum:
$$G_k = \rho G_{k-1} + (1-\rho)(\nabla f(\mathbf{x}_k) \odot \nabla f(\mathbf{x}_k))$$
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{\eta}{\sqrt{G_k + \epsilon}} \odot \nabla f(\mathbf{x}_k)$$
- $\rho \approx 0.9$ (sometimes 0.99) — decay rate for squared gradient average
- Older squared gradients are exponentially forgotten
Key property: $G_k$ is an exponentially weighted average of recent squared gradients, so it can go up AND down. This allows the effective learning rate to adapt to changing gradient scales throughout training. Unlike AdaGrad's, it does not decay to zero on its own, so RMSprop does not stop learning.
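To see the accumulator move in both directions, the sketch below feeds a made-up scalar gradient sequence (large gradients followed by small ones) through the RMSprop average. The function name and the sequence are illustrative assumptions.

```python
def rmsprop_accumulate(grads, rho=0.9):
    """Return the history of the RMSprop accumulator G for scalar gradients."""
    G, history = 0.0, []
    for g in grads:
        G = rho * G + (1.0 - rho) * g * g
        history.append(G)
    return history

# Hypothetical sequence: 50 large gradients, then 100 small ones.
grads = [10.0] * 50 + [0.1] * 100
G_hist = rmsprop_accumulate(grads)

print(f"G at end of large phase: {G_hist[49]:.2f}")
print(f"G at end of small phase: {G_hist[-1]:.4f}")
```

An AdaGrad accumulator fed the same sequence would stay above 5000 forever; here the average forgets the large phase and settles near the new squared-gradient level.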
Adam (Kingma & Ba, 2015)
Adam combines momentum (first moment) with RMSprop-style adaptive scaling (second moment):
$$\mathbf{m}_k = \beta_1 \mathbf{m}_{k-1} + (1-\beta_1)\nabla f(\mathbf{x}_k) \quad \text{(first moment: momentum)}$$
$$\mathbf{v}_k = \beta_2 \mathbf{v}_{k-1} + (1-\beta_2)(\nabla f(\mathbf{x}_k) \odot \nabla f(\mathbf{x}_k)) \quad \text{(second moment: RMSprop)}$$
$$\hat{\mathbf{m}}_k = \frac{\mathbf{m}_k}{1-\beta_1^k} \quad \hat{\mathbf{v}}_k = \frac{\mathbf{v}_k}{1-\beta_2^k} \quad \text{(bias correction)}$$
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_k} + \epsilon} \hat{\mathbf{m}}_k$$
Standard hyperparameters: $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
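⚠️ CRITICAL: Bias Correction. Both moments start at zero ($\mathbf{m}_0 = \mathbf{v}_0 = 0$), so at $k=1$, $\mathbf{m}_1 = (1-\beta_1)\nabla f(\mathbf{x}_1)$ and $\mathbf{v}_1 = (1-\beta_2)\nabla f(\mathbf{x}_1) \odot \nabla f(\mathbf{x}_1)$. Both estimates are biased toward zero: with the default hyperparameters, $\mathbf{m}_1$ is 10× too small and $\mathbf{v}_1$ is 1000× too small. Dividing by $1-\beta_1^k$ and $1-\beta_2^k$ removes this bias. Without it the early steps are mis-scaled: because $\mathbf{v}$ is underestimated far more than $\mathbf{m}$, the uncorrected first step is about $(1-\beta_1)/\sqrt{1-\beta_2} \approx 3.2\times$ larger than intended. This is why you should keep bias correction when implementing Adam.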
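The four update equations translate almost line for line into code. The following is a minimal scalar sketch under the standard hyperparameters listed above; the function name `adam_minimize`, the test objective $f(x) = x^2$, and the larger $\eta = 0.01$ used for the long run are assumptions made for illustration.

```python
import math

def adam_minimize(grad, x0, steps, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam with bias correction; grad is a function returning df/dx."""
    x, m, v = x0, 0.0, 0.0
    for k in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g          # first moment
        v = beta2 * v + (1.0 - beta2) * g * g      # second moment
        m_hat = m / (1.0 - beta1 ** k)             # bias correction
        v_hat = v / (1.0 - beta2 ** k)
        x -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return x

# f(x) = x**2, so df/dx = 2x (hypothetical test objective)
x_first = adam_minimize(lambda x: 2.0 * x, x0=1.0, steps=1)
x_final = adam_minimize(lambda x: 2.0 * x, x0=1.0, steps=2000, eta=0.01)

print(f"after 1 step:     x = {x_first:.6f}")
print(f"after 2000 steps: x = {x_final:.6f}")
```

With bias correction, the very first step has size almost exactly $\eta$, independent of the gradient's magnitude.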
Why Adam Dominates Practice
| Method | Momentum | Adaptive LR | Doesn't Die | Bias Correction |
|---|---|---|---|---|
| SGD | ✗ | ✗ | ✓ | N/A |
| SGD+Momentum | ✓ | ✗ | ✓ | N/A |
| AdaGrad | ✗ | ✓ | ✗ (dies) | N/A |
| RMSprop | ✗ | ✓ | ✓ | N/A |
| Adam | ✓ | ✓ | ✓ | ✓ |
Adam combines the best of both worlds: momentum for acceleration on consistent gradients, adaptive scaling for varying gradient magnitudes, and bias correction for stable early iterations. It's the default optimizer for most deep learning tasks.
Understanding the Math: What Adam Actually Computes
The update for parameter $i$ at step $k$:
$$x_i^{(k+1)} = x_i^{(k)} - \eta \cdot \frac{\hat{m}_i^{(k)}}{\sqrt{\hat{v}_i^{(k)}} + \epsilon}$$
The ratio $\hat{m}_i/\sqrt{\hat{v}_i}$ behaves like a signal-to-noise ratio of the gradient, and its magnitude is typically at most about 1:
- If $\nabla f_i$ is consistently large and of the same sign → $\hat{m}_i$ large, $\sqrt{\hat{v}_i}$ large, ratio near $\pm 1$ → step size $\approx \eta$
- If $\nabla f_i$ oscillates in sign → $\hat{m}_i \approx 0$, $\sqrt{\hat{v}_i}$ large → step size much smaller than $\eta$ (nearly stopped)
- If $\nabla f_i$ is consistently small and of the same sign → $\hat{m}_i$ small, $\sqrt{\hat{v}_i}$ equally small, ratio again near $\pm 1$ → step size $\approx \eta$, as long as $\sqrt{\hat{v}_i} \gg \epsilon$
This is why Adam handles different parameters at different scales without manual tuning.
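The three cases can be checked numerically. The sketch below computes the bias-corrected ratio for three synthetic gradient streams; the helper name `adam_ratio` and the gradient values are hypothetical.

```python
import math

def adam_ratio(grads, beta1=0.9, beta2=0.999):
    """Bias-corrected m_hat / sqrt(v_hat) after consuming scalar gradients."""
    m, v = 0.0, 0.0
    for k, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** k)
    v_hat = v / (1.0 - beta2 ** k)
    return m_hat / math.sqrt(v_hat)

consistent_large = adam_ratio([5.0] * 200)
consistent_small = adam_ratio([0.005] * 200)
oscillating = adam_ratio([5.0 if k % 2 == 0 else -5.0 for k in range(200)])

print(f"consistent large: {consistent_large:.4f}")
print(f"consistent small: {consistent_small:.4f}")
print(f"oscillating:      {oscillating:.4f}")
```

Both consistent streams give a ratio of about 1 even though their gradients differ by a factor of 1000, while the oscillating stream gives a ratio of magnitude roughly $(1-\beta_1)/(1+\beta_1) \approx 0.05$.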
Key Terms
- Adam
- Adaptive methods
Worked Examples
Example 1: AdaGrad Step by Step
Apply AdaGrad to minimize $f(x) = 10x^2$ from $x_0 = 5$ with $\eta = 0.5$, $\epsilon = 0$. Run 3 iterations.
Solution: $\nabla f(x) = 20x$
$k=0$: $G_0 = 0$, $\nabla f(5) = 100$ $G_1 = 0 + 100^2 = 10000$ $x_1 = 5 - \frac{0.5}{\sqrt{10000}} \cdot 100 = 5 - \frac{0.5}{100} \cdot 100 = 5 - 0.5 = 4.5$
$k=1$: $\nabla f(4.5) = 90$ $G_2 = 10000 + 90^2 = 18100$ $x_2 = 4.5 - \frac{0.5}{\sqrt{18100}} \cdot 90 = 4.5 - \frac{0.5}{134.5} \cdot 90 = 4.5 - 0.335 = 4.165$
$k=2$: $\nabla f(4.165) = 83.3$ $G_3 = 18100 + 83.3^2 = 25039$ $x_3 = 4.165 - \frac{0.5}{\sqrt{25039}} \cdot 83.3 = 4.165 - 0.263 = 3.902$
After 3 iterations, $x_3 = 3.90$. AdaGrad reduces the effective learning rate from $\eta=0.5$ to effectively $0.5/\sqrt{G}$, which shrinks as $G$ grows.
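The arithmetic can be reproduced with a short loop. This is only a check of the hand computation, using the same $f(x) = 10x^2$, $x_0 = 5$, $\eta = 0.5$, and $\epsilon = 0$; the variable names are arbitrary.

```python
import math

eta, x, G = 0.5, 5.0, 0.0
trajectory = []                      # (x after update, effective learning rate)
for k in range(3):
    g = 20.0 * x                     # gradient of 10 * x**2
    G += g * g                       # AdaGrad accumulator
    eff_lr = eta / math.sqrt(G)      # epsilon = 0, as in the example
    x -= eff_lr * g
    trajectory.append((x, eff_lr))
    print(f"k={k}: g={g:.2f}  G={G:.1f}  eff_lr={eff_lr:.5f}  x={x:.4f}")
```

Printing `eff_lr` alongside `x` makes the shrinking effective learning rate visible at each step.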
Answer: $x_3 \approx 3.902$. The effective learning rate $0.5/\sqrt{G_k}$ shrinks from $0.005$ at the first update to about $0.0032$ at the third as $G$ accumulates.
Example 2: RMSprop vs AdaGrad on a Long Run
Compare AdaGrad and RMSprop on $f(x) = x^2$ for 1000 iterations from $x_0=10$ with $\eta=0.1$, $\rho=0.9$.
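Solution: Every gradient satisfies $|g_t| \le 20$, so after 1000 iterations $G_k \le 1000 \cdot 20^2 = 400000$ and AdaGrad's effective learning rate is at least $0.1/\sqrt{400000} \approx 0.00016$. Because the gradient shrinks as $x$ moves toward 0, the actual value is somewhat larger but still on the order of $10^{-4}$, hundreds of times smaller than $\eta$. Progress slows to a crawl well before $x$ reaches the minimum.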
RMSprop: $G_k$ is a moving average of length $\approx 1/(1-\rho) = 10$ recent squared gradients. Even after 1000 iterations, $G_k$ reflects only recent gradient magnitudes, so the effective learning rate stays healthy. $x_{1000}$ will be much closer to 0.
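A short simulation makes the comparison concrete. The sketch below runs both update rules on $f(x) = x^2$ with the stated settings; the function names and $\epsilon = 10^{-8}$ are assumptions added for the example.

```python
import math

def adagrad_run(x, eta=0.1, steps=1000, eps=1e-8):
    """AdaGrad on f(x) = x**2; returns final x and final effective LR."""
    G = 0.0
    for _ in range(steps):
        g = 2.0 * x
        G += g * g
        x -= eta / math.sqrt(G + eps) * g
    return x, eta / math.sqrt(G + eps)

def rmsprop_run(x, eta=0.1, rho=0.9, steps=1000, eps=1e-8):
    """RMSprop on f(x) = x**2; returns final x."""
    G = 0.0
    for _ in range(steps):
        g = 2.0 * x
        G = rho * G + (1.0 - rho) * g * g
        x -= eta / math.sqrt(G + eps) * g
    return x

x_ada, lr_ada = adagrad_run(10.0)
x_rms = rmsprop_run(10.0)

print(f"AdaGrad: x = {x_ada:.3f}, effective LR = {lr_ada:.6f}")
print(f"RMSprop: x = {x_rms:.3f}")
```

In this run AdaGrad is still well short of the minimum after 1000 steps, while RMSprop ends close to 0. With a constant $\eta$, RMSprop keeps taking steps of roughly size $\eta$ near the minimum, so it hovers around 0 rather than settling exactly on it.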
Answer: AdaGrad: the effective LR decays to the order of $10^{-4}$ after 1000 steps, so progress nearly stops. RMSprop: the effective LR stays near $0.1/|\text{recent gradient}|$, so it keeps moving toward the minimum.
Example 3: Adam Bias Correction
Show that without bias correction, Adam's first step on $f(x)=x^2$ from $x_0=1$ with $\beta_1=0.9$ and $\beta_2=0.999$ is mis-scaled.
Solution: $\nabla f(1) = 2$. $\mathbf{m}_1 = 0.9(0) + 0.1(2) = 0.2$ (not 2!), and $\mathbf{v}_1 = 0.999(0) + 0.001(2^2) = 0.004$ (not 4!).
Without correction: the first moment is 10× too small and the second moment is 1000× too small, so the step is $\propto 0.2/\sqrt{0.004} \approx 3.16$, about 3× larger than intended.
With correction: $\hat{\mathbf{m}}_1 = 0.2/(1-0.9^1) = 0.2/0.1 = 2$ and $\hat{\mathbf{v}}_1 = 0.004/(1-0.999^1) = 4$, so the step is $\propto 2/\sqrt{4} = 1$, i.e. a step of size $\eta$.
After many iterations, $1-\beta_1^k \to 1$ and $1-\beta_2^k \to 1$, so the correction becomes negligible. But early on, it's essential for taking appropriately sized steps.
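The first-step numbers can be verified directly when both moments are taken into account. In the sketch below, $\beta_2 = 0.999$ is assumed to be the standard default, and the variable names are illustrative.

```python
import math

beta1, beta2 = 0.9, 0.999    # beta2 assumed to be the standard default
g = 2.0                      # gradient of x**2 at x = 1

m1 = (1.0 - beta1) * g       # raw first moment after one step
v1 = (1.0 - beta2) * g * g   # raw second moment after one step

ratio_uncorrected = m1 / math.sqrt(v1)

m_hat = m1 / (1.0 - beta1 ** 1)
v_hat = v1 / (1.0 - beta2 ** 1)
ratio_corrected = m_hat / math.sqrt(v_hat)

print(f"m1 = {m1:.3f}, v1 = {v1:.4f}")
print(f"uncorrected step / eta: {ratio_uncorrected:.3f}")
print(f"corrected step / eta:   {ratio_corrected:.3f}")
```

The raw first moment is 10× too small, but the raw second moment is 1000× too small, so the uncorrected ratio comes out near $0.1/\sqrt{0.001} \approx 3.16$ rather than 1.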
Answer: Without bias correction, $\mathbf{m}_1 = 0.2$ (10× too small) and $\mathbf{v}_1 = 0.004$ (1000× too small), giving a first step about 3× too large. With correction, $\hat{\mathbf{m}}_1 = 2$ and $\hat{\mathbf{v}}_1 = 4$ (correct). The correction factor $1/(1-\beta_1^k)$ decays from $1/0.1=10$ at $k=1$ to $\approx 1$ at large $k$.
Quiz
Q1: What do adaptive learning rate methods primarily do?
A) Maintain a per-parameter learning rate scaled by historical gradient information B) Use second-derivative (Hessian) information to rescale each step C) Decay one global learning rate on a fixed schedule D) Replace the gradient with a random search direction
Correct: A)
- If you chose A: Correct! Adaptive methods keep a separate effective learning rate for each parameter and scale it using that parameter's gradient history.
- If you chose B: This is incorrect. The adaptive methods in this subject use only first-order gradients and running statistics of their squares, not second derivatives.
- If you chose C: This is incorrect. A schedule changes one global $\eta$ over time; adaptive methods give each parameter its own effective rate.
- If you chose D: This is incorrect. Adaptive methods still follow the gradient; they only rescale it per parameter.
Q2: Which expression is Adam's bias-corrected first moment?
A) $\hat{\mathbf{m}}_k = \mathbf{m}_k (1-\beta_1^k)$ B) $\hat{\mathbf{m}}_k = \mathbf{m}_k / (1-\beta_1^k)$ C) $\hat{\mathbf{m}}_k = \mathbf{m}_k / (1-\beta_1)$ D) $\hat{\mathbf{m}}_k = \mathbf{m}_k / \beta_1^k$
Correct: B)
- If you chose A: This is incorrect. Multiplying by $1-\beta_1^k$ would shrink an estimate that is already biased toward zero.
- If you chose B: Correct! Dividing by $1-\beta_1^k$ undoes the zero-initialization bias, and the factor tends to 1 as $k$ grows.
- If you chose C: This is incorrect. The factor $1-\beta_1$ is right only at $k=1$; for larger $k$ it would over-correct by up to 10×.
- If you chose D: This is incorrect. $\beta_1^k \to 0$ as $k$ grows, so this expression would blow up instead of approaching $\mathbf{m}_k$.
Q3: Why is a single global learning rate suboptimal for neural networks?
A) It makes each gradient evaluation more expensive B) It can only be used with convex objectives C) Different parameters have very different gradient scales, so one $\eta$ is too large for some and too small for others D) It forces every parameter to start from the same initial value
Correct: C)
- If you chose A: This is incorrect. The learning rate does not change the cost of computing a gradient.
- If you chose B: This is incorrect. A global learning rate can be used on non-convex problems; the issue is how well it fits every parameter at once.
- If you chose C: Correct! A value of $\eta$ that suits a parameter with large gradients can be far too small for one with tiny gradients, and vice versa.
- If you chose D: This is incorrect. Initialization is independent of the learning rate.
Q4: Which statement about AdaGrad (Duchi et al., 2011) is TRUE?
A) It uses an exponentially weighted moving average of squared gradients B) It applies bias correction to its accumulator C) It accumulates the sum of squared gradients, so each parameter's effective learning rate can only shrink D) It adds a momentum term to the gradient
Correct: C)
- If you chose A: This is incorrect. The moving average is RMSprop's modification; AdaGrad uses a cumulative sum.
- If you chose B: This is incorrect. Bias correction is part of Adam, not AdaGrad.
- If you chose C: Correct! $G_k$ grows monotonically, so $\eta/\sqrt{G_k + \epsilon}$ never increases. This helps with sparse features but stalls long runs.
- If you chose D: This is incorrect. AdaGrad has no momentum term; Adam is the method here that combines momentum with adaptive scaling.
Q5: In Worked Example 2 (1000 iterations on $f(x)=x^2$), how do AdaGrad and RMSprop compare?
A) AdaGrad's effective learning rate decays to the order of $10^{-4}$ and progress nearly stalls, while RMSprop keeps a healthy effective learning rate B) Both methods stall because both accumulators grow without bound C) RMSprop stalls while AdaGrad keeps a healthy effective learning rate D) Both methods follow identical trajectories
Correct: A)
- If you chose A: Correct! AdaGrad's cumulative sum keeps growing, while RMSprop's moving average tracks only recent gradient magnitudes.
- If you chose B: This is incorrect. Only AdaGrad's accumulator grows without bound; RMSprop's can decrease.
- If you chose C: This is incorrect. The roles are reversed: AdaGrad is the method whose effective learning rate decays.
- If you chose D: This is incorrect. The two accumulators differ, so the trajectories diverge after the first few steps.
Q6: How are AdaGrad (Duchi et al., 2011) and RMSprop (Tieleman & Hinton, 2012) related?
A) RMSprop adds a momentum term to AdaGrad B) RMSprop adds bias correction to AdaGrad C) They are unrelated: RMSprop does not use squared gradients D) RMSprop replaces AdaGrad's cumulative sum of squared gradients with an exponentially weighted moving average
Correct: D)
- If you chose A: This is incorrect. Neither update rule in this subject includes momentum; Adam adds it.
- If you chose B: This is incorrect. Bias correction is introduced with Adam.
- If you chose C: This is incorrect. Both methods divide the step by the square root of a squared-gradient statistic.
- If you chose D: Correct! Swapping the sum for a moving average with decay rate $\rho$ lets $G_k$ go down as well as up.
Q7: What is a common pitfall when working with Adam (Kingma & Ba, 2015)?
A) Its accumulated squared gradients grow without bound, so it always stops learning B) Omitting bias correction when reimplementing it, which mis-scales the first steps C) It cannot be used with mini-batch gradients D) It has no hyperparameters that can be set badly
Correct: B)
- If you chose A: This is incorrect. That describes AdaGrad; Adam uses a moving average for the second moment.
- If you chose B: Correct! The moments start at zero, so without dividing by $1-\beta_1^k$ and $1-\beta_2^k$ the early updates are mis-scaled.
- If you chose C: This is incorrect. Adam is routinely used with mini-batch gradients.
- If you chose D: This is incorrect. For example, setting $\beta_2$ too close to 1 or using an unsuitable $\epsilon$ can both cause problems.
Q8: When is Adam a reasonable first choice of optimizer?
A) Only for sparse-feature problems trained for a few iterations B) As a default for most deep learning tasks, where gradient scales differ across parameters C) Only for convex problems with a single parameter D) Never; SGD with momentum is always better
Correct: B)
- If you chose A: This is incorrect. Sparse features are where AdaGrad is attractive; Adam is used far more broadly.
- If you chose B: Correct! Adam combines momentum, adaptive scaling, and bias correction, and its standard hyperparameters work well across most problems.
- If you chose C: This is incorrect. Adam's per-parameter scaling matters most when there are many parameters with different gradient scales.
- If you chose D: This is incorrect. SGD with momentum can generalize better on some problems, but that does not make it better everywhere.
Practice Problems
- Derive Adam's effective step size for a parameter whose gradient oscillates between $+g$ and $-g$ every iteration. Show why Adam handles this well.
Answer: After many iterations the alternating terms nearly cancel in the momentum average, leaving $|\hat{m}| \approx \frac{1-\beta_1}{1+\beta_1}|g| \approx 0.05|g|$ for $\beta_1 = 0.9$, while $\sqrt{\hat{v}} \approx |g|$ (the squared gradient is constant). The effective step is therefore about $0.05\eta$, close to zero. Adam damps its motion when the gradient sign keeps flipping, which is the desired behavior: sign flips suggest the iterate is bouncing across a minimum or valley rather than making progress.
- For Adam, explain what happens when $\beta_2$ is set too close to 1 (e.g., 0.9999).
Answer: $\beta_2 = 0.9999$ means the effective window for the squared gradient moving average is $1/(1-0.9999) = 10{,}000$ iterations. Adam then responds very slowly to changes in gradient scale: if gradients suddenly shrink (approaching a minimum), $\hat{v}$ stays large for thousands of iterations, causing tiny steps. This can look like a convergence stall. The default $\beta_2 = 0.999$ gives a window of ~1000 iterations, which is typically a good balance.
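The lag can be estimated with a small experiment. The sketch below starts the second-moment average at the steady state for an old gradient magnitude, switches to a gradient 10× smaller, and counts the steps until $\sqrt{v}$ is within a factor of 2 of the new magnitude. The function name, the magnitudes, and the factor-of-2 threshold are all hypothetical choices, and bias correction is ignored on the assumption that it is negligible late in training.

```python
import math

def steps_to_adapt(beta2, g_old=1.0, g_new=0.1, factor=2.0):
    """Steps until sqrt(v) falls within `factor` of the new gradient magnitude.

    v starts at its steady state g_old**2; bias correction is ignored
    (an assumption of this sketch, reasonable late in training).
    """
    v, steps = g_old ** 2, 0
    while math.sqrt(v) > factor * g_new:
        v = beta2 * v + (1.0 - beta2) * g_new ** 2
        steps += 1
    return steps

fast = steps_to_adapt(0.999)
slow = steps_to_adapt(0.9999)

print(f"beta2 = 0.999:  {fast} steps")
print(f"beta2 = 0.9999: {slow} steps")
```

Under these assumptions the adaptation time scales with the window length $1/(1-\beta_2)$, so the larger $\beta_2$ takes about ten times as many steps.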
- Compute the AdaGrad effective learning rate for a parameter updated at step $k=100$ if every gradient was exactly $\pm 1$. $\eta=0.1$, $\epsilon=10^{-8}$.
Answer: $G_{100} = \sum_{t=1}^{100} 1^2 = 100$. Effective LR = $0.1/\sqrt{100} = 0.1/10 = 0.01$ (10× smaller than initial). After step 10,000: effective LR = $0.1/\sqrt{10000} = 0.001$ (100× smaller). This monotonic decay is why AdaGrad dies.
- RMSprop with $\rho=0.9$ has an effective memory of roughly how many past gradients? What about $\rho=0.99$?
Answer: The half-life in iterations is approximately $\ln(0.5)/\ln(\rho)$. For $\rho=0.9$: $\ln(0.5)/\ln(0.9) \approx 6.6$ iterations. For $\rho=0.99$: $\ln(0.5)/\ln(0.99) \approx 69$ iterations. The "effective window" is roughly $1/(1-\rho)$: 10 for $\rho=0.9$, 100 for $\rho=0.99$.
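Both memory measures are one-line formulas, and evaluating them side by side shows how they relate. The function names below are illustrative.

```python
import math

def half_life(rho):
    """Iterations until a squared gradient's weight in the average halves."""
    return math.log(0.5) / math.log(rho)

def effective_window(rho):
    """Rule-of-thumb memory length 1 / (1 - rho)."""
    return 1.0 / (1.0 - rho)

for rho in (0.9, 0.99):
    window = effective_window(rho)
    weight_at_window = rho ** window   # relative weight of a gradient `window` steps old
    print(f"rho={rho}: half-life={half_life(rho):.2f}, "
          f"window={window:.0f}, weight at window={weight_at_window:.3f}")
```

A gradient that is one effective window old still carries a relative weight of roughly $1/e$, which is why the window is longer than the half-life.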
- Write pseudocode for AdamW, the weight-decay variant of Adam that decouples regularization from the adaptive learning rate.
Answer: The key difference: instead of adding an L2 penalty to the gradient (Adam with L2 regularization), AdamW applies weight decay directly to the parameters:
$$\mathbf{m}_k = \beta_1\mathbf{m}_{k-1} + (1-\beta_1)\nabla f(\mathbf{x}_k)$$
$$\mathbf{v}_k = \beta_2\mathbf{v}_{k-1} + (1-\beta_2)(\nabla f(\mathbf{x}_k) \odot \nabla f(\mathbf{x}_k))$$
Bias-correct $\hat{\mathbf{m}}_k$, $\hat{\mathbf{v}}_k$ as usual.
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\left(\frac{\hat{\mathbf{m}}_k}{\sqrt{\hat{\mathbf{v}}_k}+\epsilon} + \lambda\mathbf{x}_k\right)$$
The $\lambda\mathbf{x}_k$ term is applied directly (not through the adaptive scaling), which prevents parameters with small gradients from being over-regularized. AdamW is widely available in modern frameworks (for example PyTorch, and Optax for JAX).
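The update can be turned into a runnable scalar sketch. The function name `adamw_step` and the value $\lambda = 0.01$ are assumptions made for illustration; this mirrors the formula in the answer rather than any particular framework's implementation.

```python
import math

def adamw_step(x, g, m, v, k, eta=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; k is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** k)
    v_hat = v / (1.0 - beta2 ** k)
    adaptive = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled decay: weight_decay * x is NOT divided by sqrt(v_hat)
    x = x - eta * (adaptive + weight_decay * x)
    return x, m, v

# With a zero gradient only the decay term acts on the parameter.
x_decay_only, _, _ = adamw_step(x=1.0, g=0.0, m=0.0, v=0.0, k=1)
# With a nonzero gradient both terms contribute.
x_both, _, _ = adamw_step(x=1.0, g=2.0, m=0.0, v=0.0, k=1)

print(f"zero gradient:    x = {x_decay_only:.6f}")
print(f"nonzero gradient: x = {x_both:.6f}")
```

Because the decay term bypasses the division by $\sqrt{\hat{v}}$, it shrinks the parameter by the same relative amount whatever the gradient history looks like.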
Summary
Key takeaways:
- Adaptive methods assign per-parameter learning rates, solving the "one $\eta$ fits all" problem
- AdaGrad accumulates squared gradients → effective LR monotonically decreases → dies on long runs
- RMSprop uses exponential moving average of squared gradients → adapts continuously, doesn't die
- Adam = Momentum + RMSprop + bias correction → default optimizer for most deep learning
- Bias correction divides by $1-\beta^k$, fixing the zero-initialization bias in early steps
- Standard Adam hyperparameters ($\eta=0.001$, $\beta_1=0.9$, $\beta_2=0.999$) work well across most problems
Pitfalls
- Using AdaGrad for long training runs. AdaGrad's accumulated $G_k$ grows monotonically, driving the effective learning rate toward zero, so on long runs parameter updates become negligibly small. Prefer RMSprop or Adam for non-convex deep learning where training runs are long.
- Setting $\beta_2$ too close to 1 (e.g., 0.9999). The effective window for the squared gradient moving average becomes ~10,000 iterations. When gradient scales change (approaching convergence), $\hat{v}$ stays large for thousands of steps, causing tiny updates and apparent convergence stalls.
- Forgetting bias correction when reimplementing Adam. Without dividing by $1-\beta_1^k$ and $1-\beta_2^k$, both moment estimates are biased toward zero in the first few steps and the updates are mis-scaled (with the default betas, the first step is roughly 3× too large). Use framework implementations; they handle this correctly.
- Assuming Adam always outperforms SGD. On some well-conditioned problems (especially image classification with CNNs), SGD with momentum can generalize better than Adam. Adam's per-parameter scaling can find sharp minima that don't generalize as well.
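- Using the same $\epsilon$ for all precisions. The default $\epsilon = 10^{-8}$ works for float32 but is too small for float16: it lies below the smallest positive float16 value (about $6 \times 10^{-8}$), so it underflows to zero and no longer protects the denominator. It is usually unnecessary to change it for float64. Adjust $\epsilon$ when changing precision.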
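The half-precision behavior of $\epsilon$ can be checked with the standard library alone. The sketch below rounds a value to IEEE 754 half precision through `struct`'s `e` format; the helper name `to_float16` and the alternative value $10^{-4}$ are illustrative assumptions, not recommendations from the text.

```python
import struct

def to_float16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

eps_default = to_float16(1e-8)   # the usual Adam default
eps_larger = to_float16(1e-4)    # hypothetical larger choice for comparison

print(f"1e-8 stored as float16: {eps_default}")
print(f"1e-4 stored as float16: {eps_larger}")
```

If the optimizer state itself is kept in float16, an $\epsilon$ that rounds to zero offers no protection when $\sqrt{\hat{v}}$ is also zero.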
Next Steps
Next up: 14-05-second-order-methods.md — Newton's method, quasi-Newton (BFGS, L-BFGS), and why second-order information is both powerful and impractical for deep learning.