
14-04 — Adaptive Learning Rate Methods

Phase: Optimization | Subject: 14-04 | Prerequisites: 14-02-gradient-descent.md, 14-03-variants-gradient-descent.md | Next subject: 14-05-second-order-methods.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Explain why a single global learning rate is suboptimal for problems with varying gradient scales
  2. Derive the AdaGrad, RMSprop, and Adam update rules
  3. Explain the bias correction mechanism in Adam and what happens without it
  4. Diagnose when Adam is diverging vs converging slowly
  5. Choose appropriate hyperparameters ($\beta_1$, $\beta_2$, $\epsilon$) for practical use

Core Content

The Problem: One Size Doesn't Fit All

In neural networks, different parameters have vastly different gradient scales. Early layers get tiny gradients (vanishing gradient problem), while biases and output layer weights get large gradients. A single $\eta$ that works for one parameter may be too large for another (causing divergence) or too small (causing slow convergence).

Adaptive methods maintain a per-parameter learning rate, scaled by historical gradient information.
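To make the scale problem concrete, here is a minimal sketch of plain gradient descent with one global $\eta$ on a two-parameter quadratic. The curvatures `100.0` and `0.01` and both values of `eta` are hypothetical numbers chosen for illustration, not taken from any real network.

```python
def gradient_descent(curvatures, x0, eta, steps):
    """Plain gradient descent on f(x) = sum(0.5 * c_i * x_i**2) with one global eta."""
    x = list(x0)
    for _ in range(steps):
        # The gradient of 0.5 * c * x**2 with respect to x is c * x.
        x = [xi - eta * c * xi for xi, c in zip(x, curvatures)]
    return x

# Hypothetical curvatures: parameter 0 is steep, parameter 1 is flat.
curvatures = [100.0, 0.01]
x0 = [1.0, 1.0]

# eta = 0.05 is too large for the steep direction, which diverges.
large_eta = gradient_descent(curvatures, x0, eta=0.05, steps=50)
# eta = 0.005 is stable for the steep direction, but the flat one barely moves.
small_eta = gradient_descent(curvatures, x0, eta=0.005, steps=50)

print("eta = 0.05 :", large_eta)
print("eta = 0.005:", small_eta)
```

Under these assumptions no single `eta` serves both parameters: the value that keeps the steep direction stable leaves the flat direction almost where it started.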

AdaGrad (Duchi et al., 2011)

AdaGrad accumulates the sum of squared gradients and scales each parameter's learning rate inversely:

$$G_k = G_{k-1} + \nabla f(\mathbf{x}_k) \odot \nabla f(\mathbf{x}_k)$$

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{\eta}{\sqrt{G_k + \epsilon}} \odot \nabla f(\mathbf{x}_k)$$

Key property: Parameters with large historical gradients get small effective learning rates; parameters with small historical gradients get large effective learning rates. This suits sparse features (NLP, recommender systems), where some parameters receive a gradient only rarely and keep a large effective learning rate for the updates they do get.

⚠️ Critical limitation: $G_k$ grows monotonically and never decreases → effective learning rate shrinks to zero → AdaGrad stops learning on long training runs. This makes it unsuitable for non-convex deep learning where you need to keep exploring.
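The AdaGrad update can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not a reference one: the function name `adagrad`, the list-based parameters, and the test quadratic with curvatures `100.0` and `0.01` are all hypothetical. It places $\epsilon$ inside the square root, matching the update rule above.

```python
import math

def adagrad(grad_fn, x0, eta=0.1, eps=1e-8, steps=100):
    """AdaGrad on a list of parameters; returns final x and per-step effective rates."""
    x = list(x0)
    G = [0.0] * len(x)                      # running sum of squared gradients
    history = []
    for _ in range(steps):
        g = grad_fn(x)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        eff_lr = [eta / math.sqrt(Gi + eps) for Gi in G]
        x = [xi - lr * gi for xi, lr, gi in zip(x, eff_lr, g)]
        history.append(eff_lr)
    return x, history

# Hypothetical quadratic f(x) = sum(0.5 * c_i * x_i**2) with very different scales.
curvatures = [100.0, 0.01]
grad = lambda x: [c * xi for c, xi in zip(curvatures, x)]

x_final, lr_history = adagrad(grad, [1.0, 1.0], eta=0.1, steps=100)
print("first-step effective rates:", lr_history[0])
print("last-step effective rates: ", lr_history[-1])
```

The parameter with the small gradient starts with a much larger effective rate than the steep one, and both rates only ever decrease, which is the monotonic decay described above.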

RMSprop (Tieleman & Hinton, 2012)

RMSprop fixes AdaGrad's decay problem by using an exponentially weighted moving average instead of a cumulative sum:

$$G_k = \rho G_{k-1} + (1-\rho)\left(\nabla f(\mathbf{x}_k) \odot \nabla f(\mathbf{x}_k)\right)$$

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{\eta}{\sqrt{G_k + \epsilon}} \odot \nabla f(\mathbf{x}_k)$$

Key property: $G_k$ is a moving window of recent squared gradients — it can go up AND down. This allows the effective learning rate to adapt to changing gradient scales throughout training. RMSprop does not stop learning.
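The difference between the two accumulators shows up clearly when the gradient scale drops partway through training. The sketch below feeds the same hypothetical gradient sequence (50 gradients of size 10, then 200 of size 0.1) to both rules and records only the effective learning rate $\eta/\sqrt{G_k+\epsilon}$; the helper names are illustrative.

```python
import math

def rmsprop_effective_lr(grads, eta=0.1, rho=0.9, eps=1e-8):
    """Effective learning rate after each gradient, using a moving average of g**2."""
    G, rates = 0.0, []
    for g in grads:
        G = rho * G + (1.0 - rho) * g * g
        rates.append(eta / math.sqrt(G + eps))
    return rates

def adagrad_effective_lr(grads, eta=0.1, eps=1e-8):
    """Effective learning rate after each gradient, using a cumulative sum of g**2."""
    G, rates = 0.0, []
    for g in grads:
        G += g * g
        rates.append(eta / math.sqrt(G + eps))
    return rates

# Hypothetical sequence: large gradients first, then much smaller ones.
grads = [10.0] * 50 + [0.1] * 200
rms = rmsprop_effective_lr(grads)
ada = adagrad_effective_lr(grads)

print("RMSprop rate at step 50 / 250:", rms[49], rms[-1])
print("AdaGrad rate at step 50 / 250:", ada[49], ada[-1])
```

In this sketch the RMSprop rate rises again once the large gradients leave its window, while the AdaGrad rate stays pinned by the early large gradients.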

Adam (Kingma & Ba, 2015)

Adam combines momentum (first moment) with RMSprop-style adaptive scaling (second moment):

$$\mathbf{m}_k = \beta_1 \mathbf{m}_{k-1} + (1-\beta_1)\nabla f(\mathbf{x}_k) \quad \text{(first moment: momentum)}$$

$$\mathbf{v}_k = \beta_2 \mathbf{v}_{k-1} + (1-\beta_2)\left(\nabla f(\mathbf{x}_k) \odot \nabla f(\mathbf{x}_k)\right) \quad \text{(second moment: RMSprop)}$$

$$\hat{\mathbf{m}}_k = \frac{\mathbf{m}_k}{1-\beta_1^k} \quad \hat{\mathbf{v}}_k = \frac{\mathbf{v}_k}{1-\beta_2^k} \quad \text{(bias correction)}$$

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_k} + \epsilon} \hat{\mathbf{m}}_k$$

Standard hyperparameters: $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$

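⚠️ CRITICAL (Bias Correction): Because $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$, at $k=1$ we get $\mathbf{m}_1 = (1-\beta_1)\nabla f(\mathbf{x}_1)$ and $\mathbf{v}_1 = (1-\beta_2)\,\nabla f(\mathbf{x}_1) \odot \nabla f(\mathbf{x}_1)$. Both estimates are biased toward zero: with the default values, $\mathbf{m}_1$ is 10× too small and $\mathbf{v}_1$ is 1000× too small. Dividing by $1-\beta_1^k$ and $1-\beta_2^k$ removes this bias. Without the correction, early steps are mis-scaled. The step depends on $\mathbf{m}/\sqrt{\mathbf{v}}$, so at $k=1$ the uncorrected update is $(1-\beta_1)/\sqrt{1-\beta_2} \approx 3.2$ times the corrected one with the default values, and for other choices of $\beta_1$, $\beta_2$ (for example $\beta_2 = 0.9$) it is too small instead. This is why Adam implementations apply bias correction.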
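A scalar sketch of the Adam update makes the effect of bias correction on the very first steps easy to inspect. The function name `adam_first_steps` and the `bias_correction` switch are hypothetical, introduced only for this illustration; the test function is $f(x) = x^2$.

```python
import math

def adam_first_steps(grad_fn, x0, eta=0.001, beta1=0.9, beta2=0.999,
                     eps=1e-8, steps=3, bias_correction=True):
    """Scalar Adam; returns the size of each update |x_{k+1} - x_k|."""
    x, m, v = x0, 0.0, 0.0
    sizes = []
    for k in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g          # first moment
        v = beta2 * v + (1.0 - beta2) * g * g      # second moment
        if bias_correction:
            m_hat = m / (1.0 - beta1 ** k)
            v_hat = v / (1.0 - beta2 ** k)
        else:
            m_hat, v_hat = m, v                    # hypothetical variant without correction
        step = eta * m_hat / (math.sqrt(v_hat) + eps)
        x -= step
        sizes.append(abs(step))
    return sizes

grad = lambda x: 2.0 * x                           # gradient of f(x) = x**2
corrected = adam_first_steps(grad, x0=1.0)
uncorrected = adam_first_steps(grad, x0=1.0, bias_correction=False)

print("corrected step sizes:  ", corrected)
print("uncorrected step sizes:", uncorrected)
```

With the default hyperparameters, the corrected first step comes out at about $\eta$, while the uncorrected one is roughly $0.1/\sqrt{0.001} \approx 3.2$ times that, because the second moment is shrunk more than the first.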

Why Adam Dominates Practice

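| Method | Momentum | Adaptive LR | Doesn't Die | Bias Correction |
|---|---|---|---|---|
| SGD | ✗ | ✗ | ✓ | N/A |
| SGD+Momentum | ✓ | ✗ | ✓ | N/A |
| AdaGrad | ✗ | ✓ | ✗ (dies) | N/A |
| RMSprop | ✗ | ✓ | ✓ | N/A |
| Adam | ✓ | ✓ | ✓ | ✓ |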

Adam combines the best of both worlds: momentum for acceleration on consistent gradients, adaptive scaling for varying gradient magnitudes, and bias correction for stable early iterations. It's the default optimizer for most deep learning tasks.

Understanding the Math: What Adam Actually Computes

The update for parameter $i$ at step $k$:

$$x_i^{(k+1)} = x_i^{(k)} - \eta \cdot \frac{\hat{m}_i^{(k)}}{\sqrt{\hat{v}_i^{(k)}} + \epsilon}$$

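The ratio $\hat{m}_i/\sqrt{\hat{v}_i}$ behaves approximately like a signal-to-noise ratio of the gradient. When the gradient for parameter $i$ keeps the same sign and size, $\hat{m}_i \approx g_i$ and $\sqrt{\hat{v}_i} \approx |g_i|$, so the ratio is close to $\pm 1$ and the step is close to $\eta$. When the gradient is noisy or flips sign, $\hat{m}_i$ averages toward zero while $\sqrt{\hat{v}_i}$ does not, so the ratio and the step shrink.

The ratio is also unchanged if every gradient for parameter $i$ is multiplied by a constant, because $\hat{m}_i$ and $\sqrt{\hat{v}_i}$ scale by the same factor (ignoring $\epsilon$). This is why Adam handles different parameters at different scales without manual tuning.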
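The behavior of this ratio can be checked numerically. The sketch below computes $\hat{m}/\sqrt{\hat{v}}$ for three hypothetical gradient sequences: a constant gradient, a gradient that flips sign every step, and the constant gradient scaled up by 1000. The helper `adam_ratio` is illustrative and omits $\epsilon$.

```python
import math

def adam_ratio(grads, beta1=0.9, beta2=0.999):
    """Return m_hat / sqrt(v_hat) after processing a scalar gradient sequence."""
    m = v = 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
    k = len(grads)
    m_hat = m / (1.0 - beta1 ** k)
    v_hat = v / (1.0 - beta2 ** k)
    return m_hat / math.sqrt(v_hat)

consistent = adam_ratio([0.5] * 200)                              # same sign every step
alternating = adam_ratio([0.5 * (-1) ** t for t in range(200)])   # sign flips every step
rescaled = adam_ratio([500.0] * 200)                              # same pattern, 1000x larger

print("consistent: ", consistent)
print("alternating:", alternating)
print("rescaled:   ", rescaled)
```

The consistent and rescaled sequences give the same ratio, while the alternating sequence gives a ratio whose magnitude is about $(1-\beta_1)/(1+\beta_1) \approx 0.053$.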



Key Terms

Worked Examples

Example 1: AdaGrad Step by Step

Apply AdaGrad to minimize $f(x) = 10x^2$ from $x_0 = 5$ with $\eta = 0.5$, $\epsilon = 0$. Run 3 iterations.

Solution: $\nabla f(x) = 20x$

$k=0$: $G_0 = 0$, $\nabla f(5) = 100$ $G_1 = 0 + 100^2 = 10000$ $x_1 = 5 - \frac{0.5}{\sqrt{10000}} \cdot 100 = 5 - \frac{0.5}{100} \cdot 100 = 5 - 0.5 = 4.5$

$k=1$: $\nabla f(4.5) = 90$ $G_2 = 10000 + 90^2 = 18100$ $x_2 = 4.5 - \frac{0.5}{\sqrt{18100}} \cdot 90 = 4.5 - \frac{0.5}{134.5} \cdot 90 = 4.5 - 0.335 = 4.165$

$k=2$: $\nabla f(4.165) = 83.3$ $G_3 = 18100 + 83.3^2 = 25039$ $x_3 = 4.165 - \frac{0.5}{\sqrt{25039}} \cdot 83.3 = 4.165 - 0.263 = 3.902$

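After 3 iterations, $x_3 \approx 3.90$. The factor multiplying the gradient is $\eta/\sqrt{G_k}$, not the nominal $\eta = 0.5$, and it shrinks as $G_k$ grows: $0.5/\sqrt{10000} = 0.0050$, then $0.5/\sqrt{18100} \approx 0.0037$, then $0.5/\sqrt{25039} \approx 0.0032$.

Answer: $x_3 = 3.902$. The effective learning rate $\eta/\sqrt{G_k}$ shrinks from $0.0050$ at the first update to $0.0032$ at the third as $G$ accumulates.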
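The three iterations of this example can be reproduced with a short script. This is a sketch that hard-codes the gradient $20x$ of $f(x) = 10x^2$; the function name `adagrad_trace` is illustrative.

```python
import math

def adagrad_trace(x0, eta, steps, eps=0.0):
    """AdaGrad on f(x) = 10 * x**2, recording (x, G, effective_lr) after each step."""
    x, G = x0, 0.0
    trace = []
    for _ in range(steps):
        g = 20.0 * x                      # gradient of 10 * x**2
        G += g * g                        # accumulate the squared gradient
        eff_lr = eta / math.sqrt(G + eps)
        x -= eff_lr * g
        trace.append((x, G, eff_lr))
    return trace

trace = adagrad_trace(x0=5.0, eta=0.5, steps=3)
for k, (x, G, eff_lr) in enumerate(trace, start=1):
    print(f"step {k}: x = {x:.3f}, G = {G:.0f}, effective lr = {eff_lr:.4f}")
```

Because the script carries full floating-point precision between steps, its intermediate values can differ from the hand calculation in the last digit.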

Example 2: RMSprop vs AdaGrad on a Long Run

Compare AdaGrad and RMSprop on $f(x) = x^2$ for 1000 iterations from $x_0=10$ with $\eta=0.1$, $\rho=0.9$.

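Solution: AdaGrad's effective learning rate after 1000 iterations is $0.1/\sqrt{\sum_{t=1}^{1000} g_t^2}$. As a rough estimate, if the gradients stayed near their initial value $g = 2 \cdot 10 = 20$, the sum would be about $1000 \cdot 400 = 400000$, giving $0.1/\sqrt{400000} \approx 0.00016$. The gradients shrink somewhat as $x$ decreases, so the true value is a little larger, but it remains on the order of $10^{-4}$. Because the gradients here are non-increasing, step $k$ has size at most $\eta/\sqrt{k}$, so the total distance travelled in 1000 steps is at most $0.1\sum_{k=1}^{1000} 1/\sqrt{k} \approx 6.2$. Starting from $x_0 = 10$, the iterate is still far from the minimum at 0.

RMSprop: $G_k$ is a moving average over roughly $1/(1-\rho) = 10$ recent squared gradients. Even after 1000 iterations, $G_k$ reflects only recent gradient magnitudes, so the effective learning rate stays healthy. $x_{1000}$ will be much closer to 0.

Answer: AdaGrad: the effective LR decays to the order of $10^{-4}$ after 1000 steps, so progress nearly stops. RMSprop: the effective LR stays around $0.1/|\text{recent gradient}|$, so it keeps making progress.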
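Rather than relying on estimates, the comparison can be simulated directly. The sketch below runs both optimizers on $f(x) = x^2$ with the settings from this example; the function names and the choice $\epsilon = 10^{-8}$ inside the square root are assumptions made for the illustration.

```python
import math

def run_adagrad(x0, eta, steps, eps=1e-8):
    """AdaGrad on f(x) = x**2; returns final x and final effective learning rate."""
    x, G = x0, 0.0
    for _ in range(steps):
        g = 2.0 * x                       # gradient of x**2
        G += g * g                        # cumulative sum never decreases
        x -= eta / math.sqrt(G + eps) * g
    return x, eta / math.sqrt(G + eps)

def run_rmsprop(x0, eta, steps, rho=0.9, eps=1e-8):
    """RMSprop on f(x) = x**2; returns final x and final effective learning rate."""
    x, G = x0, 0.0
    for _ in range(steps):
        g = 2.0 * x
        G = rho * G + (1.0 - rho) * g * g  # moving average can shrink again
        x -= eta / math.sqrt(G + eps) * g
    return x, eta / math.sqrt(G + eps)

x_ada, lr_ada = run_adagrad(x0=10.0, eta=0.1, steps=1000)
x_rms, lr_rms = run_rmsprop(x0=10.0, eta=0.1, steps=1000)

print(f"AdaGrad: x = {x_ada:.4f}, effective lr = {lr_ada:.6f}")
print(f"RMSprop: x = {x_rms:.4f}, effective lr = {lr_rms:.6f}")
```

One design note: with a fixed $\eta$, each RMSprop step here has size at most $\eta/\sqrt{1-\rho} \approx 0.32$, so the iterate ends up hovering near the minimum rather than settling on it exactly.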

Example 3: Adam Bias Correction

Show that without bias correction, Adam's first-moment estimate on $f(x)=x^2$ from $x_0=1$ with $\beta_1=0.9$ is unexpectedly small at the first step.

Solution: $\nabla f(1) = 2$. $\mathbf{m}_1 = 0.9(0) + 0.1(2) = 0.2$ (not 2!).

Without correction: the momentum term is $0.2$, which is 10× smaller than the gradient it is supposed to estimate.

With correction: $\hat{\mathbf{m}}_1 = 0.2/(1-0.9^1) = 0.2/0.1 = 2$, which recovers the full gradient.

The second moment is biased in the same way: with $\beta_2 = 0.999$, $\mathbf{v}_1 = 0.001 \cdot 2^2 = 0.004$ instead of $4$. Since the update uses $\mathbf{m}/\sqrt{\mathbf{v}}$, skipping both corrections gives a first step of $\eta \cdot 0.2/\sqrt{0.004} \approx 3.2\eta$ rather than the corrected $\eta \cdot 2/\sqrt{4} = \eta$.

After many iterations, $1-\beta_1^k \to 1$, so the correction becomes negligible. Early on, it is essential for correctly scaled steps.

Answer: Without bias correction, $\mathbf{m}_1 = 0.2$ (10× too small). With correction, $\hat{\mathbf{m}}_1 = 2$ (correct). The correction factor $1/(1-\beta_1^k)$ decays from $1/0.1=10$ at $k=1$ toward $1$ at large $k$.
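The first-moment numbers in this example can be checked with a small generator. As a simplifying assumption, the sketch holds the gradient fixed at 2 (its value at $x = 1$) so that the only effect visible is the initialization bias; the name `first_moment` is illustrative.

```python
def first_moment(grads, beta1=0.9):
    """Yield (k, m_k, m_hat_k) for a sequence of scalar gradients."""
    m = 0.0
    for k, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        yield k, m, m / (1.0 - beta1 ** k)   # raw and bias-corrected estimates

# Hold the gradient fixed at 2 to isolate the effect of zero initialization.
rows = list(first_moment([2.0] * 50))

for k, m, m_hat in rows[:3] + rows[-1:]:
    print(f"k = {k:2d}: m = {m:.4f}, m_hat = {m_hat:.4f}")
```

For a constant gradient the corrected estimate equals the gradient at every step, while the raw estimate only approaches it as $\beta_1^k$ decays.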


Quiz

Q1: What do adaptive learning rate methods primarily do?

A) Maintain a per-parameter learning rate scaled by historical gradient information
B) Apply one fixed global learning rate to every parameter
C) Remove the need to choose a base learning rate $\eta$
D) Replace the gradient with a random search direction

Correct: A)

Q2: Which hyperparameter controls the decay rate of Adam's first-moment (momentum) estimate?

A) $\epsilon$
B) $\beta_1$
C) $\beta_2$
D) $\eta$

Correct: B)

Q3: What problem does the section "The Problem: One Size Doesn't Fit All" describe?

A) Gradient descent cannot be applied to neural networks
B) Momentum makes every parameter update too large
C) A single global learning rate $\eta$ can be too large for some parameters and too small for others, because gradient scales differ across parameters
D) The learning rate must be identical for all parameters for training to converge

Correct: C)

Q4: Which statement about AdaGrad (Duchi et al., 2011) is TRUE?

A) It uses an exponentially weighted moving average of squared gradients
B) It applies bias correction to its accumulator
C) Its accumulator $G_k$ grows monotonically, so the effective learning rate keeps shrinking
D) It gives larger effective learning rates to parameters with large historical gradients

Correct: C)

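Q5: In Example 2 (RMSprop vs AdaGrad on a Long Run), what happens to the effective learning rates after 1000 iterations?

A) AdaGrad's has decayed toward zero, while RMSprop's stays usable
B) Both methods end with the same effective learning rate
C) RMSprop's has decayed toward zero, while AdaGrad's stays usable
D) Both effective learning rates grow without bound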

Correct: A)

Q6: How are AdaGrad (Duchi et al., 2011) and RMSprop (Tieleman & Hinton, 2012) related?

A) AdaGrad is RMSprop with momentum added
B) RMSprop adds bias correction to AdaGrad
C) They are unrelated; RMSprop does not use squared gradients
D) RMSprop replaces AdaGrad's cumulative sum of squared gradients with an exponentially weighted moving average

Correct: D)

Q7: What is a common pitfall when working with Adam (Kingma & Ba, 2015)?

A) Using the default $\beta_1 = 0.9$
B) Omitting bias correction when reimplementing it, which mis-scales the first few steps
C) Using it with mini-batch gradients
D) Initializing $\mathbf{m}_0$ and $\mathbf{v}_0$ to zero

Correct: B)

Q8: According to "Why Adam Dominates Practice", when is Adam a reasonable default choice?

A) Only for convex problems
B) For most deep learning tasks, where gradient magnitudes vary across parameters
C) Only when gradients are sparse
D) Never, because SGD with momentum always generalizes better

Correct: B)

Practice Problems

  1. Derive Adam's effective step size for a parameter whose gradient oscillates between $+g$ and $-g$ every iteration. Show why Adam handles this well.

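    Answer: After many iterations the momentum average nearly cancels: in steady state $m$ alternates between $\pm\frac{1-\beta_1}{1+\beta_1}g$, which is about $\pm 0.053\,g$ for $\beta_1 = 0.9$, while $\sqrt{\hat{v}} \approx |g|$ because the squared gradient is always $g^2$. The effective step is therefore about $0.053\,\eta$, close to zero compared with the step of about $\eta$ taken under a constant gradient. Adam damps its movement when gradients oscillate, which is the desired behavior: a gradient whose sign flips every step suggests the iterate is at or near a minimum or saddle and is bouncing across it.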

  2. For Adam, explain what happens when $\beta_2$ is set too close to 1 (e.g., 0.9999).

    Answer: $\beta_2 = 0.9999$ means the effective window for the squared gradient moving average is $1/(1-0.9999) = 10{,}000$ iterations. Adam then responds very slowly to changes in gradient scale: if gradients suddenly shrink (approaching a minimum), $\hat{v}$ stays large for thousands of iterations, causing tiny steps. This can cause apparent convergence stalls. The default $\beta_2 = 0.999$ gives a window of ~1000 iterations, which is typically a good balance.

  3. Compute the AdaGrad effective learning rate for a parameter updated at step $k=100$ if every gradient was exactly $\pm 1$. $\eta=0.1$, $\epsilon=10^{-8}$.

    Answer: $G_{100} = \sum_{t=1}^{100} (\pm 1)^2 = 100$. Effective LR = $0.1/\sqrt{100} = 0.1/10 = 0.01$ (10× smaller than at the first step; $\epsilon = 10^{-8}$ is negligible here). After step 10,000: effective LR = $0.1/\sqrt{10000} = 0.001$ (100× smaller). This monotonic decay is why AdaGrad dies.

  4. RMSprop with $\rho=0.9$ has an effective memory of roughly how many past gradients? What about $\rho=0.99$?

    Answer: The half-life in iterations is approximately $\ln(0.5)/\ln(\rho)$. For $\rho=0.9$: $\ln(0.5)/\ln(0.9) \approx 6.6$ iterations. For $\rho=0.99$: $\ln(0.5)/\ln(0.99) \approx 69$ iterations. The "effective window" is roughly $1/(1-\rho)$: 10 for $\rho=0.9$, 100 for $\rho=0.99$.

  5. Write pseudocode for AdamW, the weight-decay variant of Adam that decouples regularization from the adaptive learning rate.

    Answer: The key difference: instead of adding an L2 penalty to the gradient (Adam with L2 regularization), AdamW applies weight decay directly to the parameters:

    $\mathbf{m}_k = \beta_1\mathbf{m}_{k-1} + (1-\beta_1)\nabla f(\mathbf{x}_k)$

    $\mathbf{v}_k = \beta_2\mathbf{v}_{k-1} + (1-\beta_2)(\nabla f(\mathbf{x}_k))^2$

    Bias-correct $\hat{\mathbf{m}}_k$, $\hat{\mathbf{v}}_k$ as usual.

    $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\left(\frac{\hat{\mathbf{m}}_k}{\sqrt{\hat{\mathbf{v}}_k}+\epsilon} + \lambda\mathbf{x}_k\right)$

    The $\lambda\mathbf{x}_k$ term is applied directly (not through the adaptive scaling), so the decay strength does not depend on a parameter's gradient history. This prevents parameters with small gradients from being over-regularized. AdamW is widely available in modern frameworks (for example `torch.optim.AdamW` in PyTorch and `optax.adamw` for JAX).
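The AdamW update from problem 5 can also be written as runnable code. The following is a minimal sketch on plain Python lists, following the update equation given in that answer; the function name `adamw_step`, its signature, and the default `weight_decay=0.01` are assumptions for illustration and do not mirror any framework's API.

```python
import math

def adamw_step(x, grad, m, v, k, eta=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on lists of floats; k is the 1-based step count."""
    new_x, new_m, new_v = [], [], []
    for xi, gi, mi, vi in zip(x, grad, m, v):
        mi = beta1 * mi + (1.0 - beta1) * gi
        vi = beta2 * vi + (1.0 - beta2) * gi * gi
        m_hat = mi / (1.0 - beta1 ** k)
        v_hat = vi / (1.0 - beta2 ** k)
        adaptive = m_hat / (math.sqrt(v_hat) + eps)
        # Decoupled decay: weight_decay * xi is added outside the adaptive ratio.
        new_x.append(xi - eta * (adaptive + weight_decay * xi))
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v

# Two parameters: one with a large gradient, one with no gradient at all.
x, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
x, m, v = adamw_step(x, grad=[100.0, 0.0], m=m, v=v, k=1)
print(x)
```

In this sketch the parameter with zero gradient still shrinks by the factor $1 - \eta\lambda$, which shows that the decay acts independently of the adaptive scaling.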


Summary

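Key takeaways:

  1. A single global learning rate cannot suit parameters whose gradient scales differ widely; adaptive methods keep a per-parameter rate scaled by historical gradient information.

  2. AdaGrad accumulates squared gradients, which suits sparse features but drives the effective learning rate toward zero on long runs.

  3. RMSprop replaces the cumulative sum with an exponentially weighted moving average, so the effective learning rate can rise as well as fall.

  4. Adam combines momentum with RMSprop-style scaling and adds bias correction for its zero-initialized moment estimates.

  5. The standard Adam hyperparameters are $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.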


Pitfalls

  1. Using AdaGrad for long training runs. AdaGrad's accumulated $G_k$ grows monotonically, driving the effective learning rate toward zero, so over long runs the updates become negligibly small. Prefer RMSprop or Adam for non-convex deep learning, where training runs are long.

  2. Setting $\beta_2$ too close to 1 (e.g., 0.9999). The effective window for the squared gradient moving average becomes ~10,000 iterations. When gradient scales change (approaching convergence), $\hat{v}$ stays large for thousands of steps, causing tiny updates and apparent convergence stalls.

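  3. Forgetting bias correction when reimplementing Adam. Without dividing by $1-\beta_1^k$ and $1-\beta_2^k$, both moment estimates are biased toward zero in the first steps, so the early updates are mis-scaled (about 3.2× too large at $k=1$ with the default $\beta_1$ and $\beta_2$, and too small for some other settings). Use framework implementations; they handle this correctly.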

  4. Assuming Adam always outperforms SGD. On some well-conditioned problems (especially image classification with CNNs), SGD with momentum can generalize better than Adam. Adam's per-parameter scaling can find sharp minima that don't generalize as well.

  5. Using the same $\epsilon$ for all precisions. The default $\epsilon = 10^{-8}$ works for float32 and float64 but is too small for float16: it lies below the smallest positive float16 value (about $6 \times 10^{-8}$), so it rounds to zero and no longer protects the denominator. Adjust $\epsilon$ (typically upward) when moving to lower precision.
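The precision pitfall in item 5 can be checked without any numerical library, since Python's `struct` module supports the IEEE 754 half-precision format through the `e` format character. The sketch below assumes $\epsilon$ is stored in the same reduced precision as the rest of the optimizer state; the helper name `to_float16` and the candidate values are illustrative.

```python
import struct

def to_float16(value):
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

# Candidate epsilon values: the float32 default and two larger alternatives.
for eps in (1e-8, 1e-7, 1e-4):
    print(f"eps = {eps:g} is stored in float16 as {to_float16(eps)!r}")
```

Under that assumption the default $10^{-8}$ is stored as exactly zero in float16, while the two larger values survive, although only approximately.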



Next Steps

Next up: 14-05-second-order-methods.md — Newton's method, quasi-Newton (BFGS, L-BFGS), and why second-order information is both powerful and impractical for deep learning.