14-03 — Variants of Gradient Descent
Phase: Optimization | Subject: 14-03 Prerequisites: 14-02-gradient-descent.md Next subject: 14-04-adaptive-learning-rate-methods.md
Learning Objectives
By the end of this subject, you will be able to:
- Explain how momentum accelerates gradient descent and derive the update rule
- Contrast Nesterov accelerated gradient with classical momentum
- Describe stochastic gradient descent and why it works despite noisy gradients
- Analyze the bias-variance tradeoff of mini-batch size
- Implement SGD with momentum and tune the momentum coefficient
Core Content
The Problem: Zigzag and Slow Convergence
Vanilla gradient descent on ill-conditioned problems zigzags. For $f(x,y) = \frac{1}{2}x^2 + \frac{50}{2}y^2$ ($\kappa=50$), the gradient points mostly in the steep $y$-direction, causing oscillation while making slow progress in the shallow $x$-direction.
The fix: momentum — accumulate past gradients to smooth the trajectory.
Classical Momentum (Polyak, 1964)
$$\mathbf{v}{k+1} = \beta \mathbf{v}_k - \eta \nabla f(\mathbf{x}_k)$$ $$\mathbf{x}{k+1} = \mathbf{x}k + \mathbf{v}{k+1}$$
- $\mathbf{v}_k$ — velocity (exponentially decaying moving average of past gradients)
- $\beta \in [0, 1)$ — momentum coefficient (typically $\beta = 0.9$)
- $\eta$ — learning rate
Interpretation: Think of a ball rolling down a hill. Momentum keeps it moving in consistent directions and dampens oscillations. When $\beta=0$, this reduces to vanilla GD.
Why it helps with zigzag: Oscillating gradient components cancel out in the moving average, while consistent components accumulate. On $f(x,y) = \frac{1}{2}x^2 + \frac{50}{2}y^2$, momentum dampens the $y$-oscillations and accelerates $x$-progress.
The effective step size in steady state (when gradient is nearly constant) is $\eta/(1-\beta)$. With $\beta=0.9$, each step is roughly $10\times$ larger than vanilla GD — that's the acceleration.
Nesterov Accelerated Gradient (NAG, 1983)
Nesterov's insight: "look ahead" before computing the gradient.
$$\mathbf{v}{k+1} = \beta \mathbf{v}_k - \eta \nabla f(\mathbf{x}_k + \beta \mathbf{v}_k)$$ $$\mathbf{x}{k+1} = \mathbf{x}k + \mathbf{v}{k+1}$$
The gradient is evaluated at $\mathbf{x}_k + \beta\mathbf{v}_k$ — the lookahead position where momentum would take you — not at $\mathbf{x}_k$.
⚠️ CRITICAL: NAG is provably optimal for convex, smooth optimization — it achieves $O(1/k^2)$ convergence vs $O(1/k)$ for vanilla GD. For strongly convex functions: $O((1-1/\sqrt{\kappa})^k)$ vs $O((1-1/\kappa)^k)$. The improvement from $\kappa$ to $\sqrt{\kappa}$ is significant when $\kappa$ is large.
Comparison Table
| Method | Update | Convergence (convex) | Convergence (strongly convex) |
|---|---|---|---|
| GD | $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\nabla f(\mathbf{x}_k)$ | $O(1/k)$ | $O((1-1/\kappa)^k)$ |
| Momentum | $\mathbf{v}_{k+1} = \beta\mathbf{v}_k - \eta\nabla f(\mathbf{x}_k)$ | $O(1/k)$ (faster constant) | $O((1-1/\sqrt{\kappa})^k)$ |
| NAG | Gradient at lookahead | $O(1/k^2)$ | $O((1-1/\sqrt{\kappa})^k)$ |
Stochastic Gradient Descent (SGD)
In ML, the loss is an average over data: $f(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^N f_i(\mathbf{x})$. Computing the full gradient costs $O(N)$. SGD approximates it with a random subset (mini-batch):
$$\mathbf{x}{k+1} = \mathbf{x}_k - \eta \cdot \frac{1}{|B|}\sum{i \in B} \nabla f_i(\mathbf{x}_k)$$
where $B$ is a random mini-batch of size $|B|$.
Why SGD works: The mini-batch gradient is an unbiased estimator of the true gradient:
$$\mathbb{E}B\left[\frac{1}{|B|}\sum{i \in B} \nabla f_i(\mathbf{x})\right] = \nabla f(\mathbf{x})$$
The variance is $\propto 1/|B|$, so larger batches give more accurate gradients.
The bias-variance tradeoff of batch size:
| Batch Size | Gradient Variance | Memory | Convergence Speed (wall-clock) |
|---|---|---|---|
| 1 (pure SGD) | Very high | Minimal | Fast per-step, but noisy |
| 32–256 (mini-batch) | Moderate | Moderate | Sweet spot — good GPU utilization + noise helps escape saddle points |
| N (full batch) | Zero | Max | Slow per-step, exact gradient |
⚠️ Common Pitfall: The noise in SGD isn't always bad — it helps escape saddle points and shallow local minima. Pure full-batch GD on a non-convex neural network loss can get stuck; SGD's stochasticity acts as implicit regularization.
SGD Convergence
With decreasing learning rate $\eta_k = \eta_0 / \sqrt{k}$ (or similar schedule):
$$\mathbb{E}[f(\bar{\mathbf{x}}_K) - f(\mathbf{x}^*)] \leq O\left(\frac{1}{\sqrt{K}}\right)$$
where $\bar{\mathbf{x}}_K$ is the average of iterates. The $O(1/\sqrt{K})$ rate is slower than GD's $O(1/K)$ on the training loss, but SGD generalizes better and each iteration is $N/|B|$ times cheaper.
Key Terms
- Sweet spot
Worked Examples
Example 1: Momentum Dampens Oscillation
Compare GD ($\beta=0$) and momentum ($\beta=0.9$) on $f(x,y) = \frac{1}{2}x^2 + 25y^2$ from $(x_0,y_0) = (10, 1)$ with $\eta=0.01$. Track 5 iterations.
Solution (GD, $\beta=0$): $\nabla f = (x, 50y)$
| k | $x_k$ | $y_k$ |
|---|---|---|
| 0 | 10.000 | 1.000 |
| 1 | 9.900 | 0.500 |
| 2 | 9.801 | 0.250 |
| 3 | 9.703 | 0.125 |
| 4 | 9.606 | 0.063 |
| 5 | 9.510 | 0.031 |
$x$ barely moves (~0.5% per step); $y$ oscillates toward zero.
Solution (Momentum, $\beta=0.9$): Start with $\mathbf{v}_0 = (0,0)$.
| k | $\nabla f$ | $\mathbf{v}_k$ | $\mathbf{x}_k$ |
|---|---|---|---|
| 0 | $(10, 50)$ | $(0, 0)$ | $(10, 1)$ |
| 1 | — | $( -0.10, -0.50)$ | $(9.90, 0.50)$ |
| 2 | $(9.90, 25)$ | $( -0.19, -0.70)$ | $(9.71, -0.20)$ |
| 3 | $(9.71, -10)$ | $( -0.27, -0.53)$ | $(9.44, -0.73)$ |
Momentum overshoots $y$ but accumulates forward progress in $x$. After many iterations, $x$ converges much faster than vanilla GD.
Click for answer
GD: $x$ converges at ~0.5% per step (stuck due to small $\eta$ forced by steep $y$). Momentum: $x$ accelerates due to accumulated velocity while $y$ oscillations partially cancel.Example 2: NAG vs Classical Momentum
On a 1D quadratic $f(x) = \frac{1}{2}ax^2$ with $a=1$, compare the effective dynamics of momentum vs NAG.
Solution:
Momentum: $v_{k+1} = \beta v_k - \eta x_k$, $x_{k+1} = x_k + v_{k+1}$
NAG: $v_{k+1} = \beta v_k - \eta(x_k + \beta v_k)$, $x_{k+1} = x_k + v_{k+1}$
In NAG, the gradient term includes $-\eta\beta v_k$, which acts as a "correction" — it anticipates the overshoot and compensates. For quadratics, NAG eliminates oscillations at a wider range of $\eta$ values.
With $\beta=0.9$, $\eta=0.1$, $x_0=10$:
Momentum: oscillatory approach, takes ~40 iterations to get below 0.1. NAG: smoother descent, takes ~25 iterations to get below 0.1.
Click for answer
NAG's lookahead correction $-\eta\beta v_k$ anticipates overshoot, reducing oscillations and converging ~1.6× faster than classical momentum on this quadratic.Example 3: Mini-Batch Variance
For a dataset of 1000 points with per-example gradients $\nabla f_i$ having variance $\sigma^2$, compute the variance of a mini-batch gradient estimator of size $B=32$.
Solution: $\operatorname{Var}\left[\frac{1}{B}\sum_{i \in B} \nabla f_i\right] = \frac{1}{B^2} \sum_{i \in B} \operatorname{Var}[\nabla f_i] = \frac{1}{B^2} \cdot B\sigma^2 = \frac{\sigma^2}{B}$
For $B=32$: variance is $\sigma^2/32$. For $B=1$ (pure SGD): $\sigma^2$ (32× noisier). For $B=1000$ (full batch): $\approx 0$.
Tradeoff: Mini-batch of 32 gives 32× variance reduction at only 32× the per-step cost, which is typically dominated by GPU parallelism — so wall-clock time per step barely increases from $B=1$ to $B=32$ on a GPU.
Click for answer
$\operatorname{Var} = \sigma^2/B$. Going from $B=1$ to $B=32$ reduces gradient noise by 97% at negligible wall-clock cost on GPU hardware.Quiz
Q1: What does the concept of Sweet spot primarily refer to in this subject?
A) A computational error related to Sweet spot B) A historical anecdote about Sweet spot C) A visual representation of Sweet spot D) The definition and application of Sweet spot
Correct: D)
- If you chose A: This is incorrect. Sweet spot is defined as: the definition and application of sweet spot. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Sweet spot is defined as: the definition and application of sweet spot. The other options describe different aspects that are not the primary focus.
- If you chose C: This is incorrect. Sweet spot is defined as: the definition and application of sweet spot. The other options describe different aspects that are not the primary focus.
- If you chose D: Sweet spot is defined as: the definition and application of sweet spot. The other options describe different aspects that are not the primary focus. Correct!
Q2: Which of the following is the key formula discussed in this subject?
A) The inverse operation of the formula in question B) f(x,y) = \frac{1}{2}x^2 + \frac{50}{2}y^2 C) A simplified version of f(x,y) = \frac{1}{2}x^2 + \f... D) An unrelated formula from a different topic
Correct: B)
- If you chose A: This is incorrect. The formula f(x,y) = \frac{1}{2}x^2 + \frac{50}{2}y^2 is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: The formula f(x,y) = \frac{1}{2}x^2 + \frac{50}{2}y^2 is central to this subject. The other options are either simplified versions or unrelated. Correct!
- If you chose C: This is incorrect. The formula f(x,y) = \frac{1}{2}x^2 + \frac{50}{2}y^2 is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: This is incorrect. The formula f(x,y) = \frac{1}{2}x^2 + \frac{50}{2}y^2 is central to this subject. The other options are either simplified versions or unrelated.
Q3: What is the primary purpose of The Problem: Zigzag And Slow Convergence?
A) It replaces all other methods in this domain B) It is used to the problem: zigzag and slow convergence in mathematical analysis C) It is used only in advanced research contexts D) It is primarily a historical notation system
Correct: B)
- If you chose A: This is incorrect. The Problem: Zigzag And Slow Convergence serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: The Problem: Zigzag And Slow Convergence serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose C: This is incorrect. The Problem: Zigzag And Slow Convergence serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. The Problem: Zigzag And Slow Convergence serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Classical Momentum (Polyak, 1964) is TRUE?
A) Classical Momentum (Polyak, 1964) is a fundamental concept covered in this subject B) Classical Momentum (Polyak, 1964) is not related to this subject C) Classical Momentum (Polyak, 1964) is an advanced topic beyond this subject's scope D) Classical Momentum (Polyak, 1964) is mentioned only as a historical footnote
Correct: A)
- If you chose A: Classical Momentum (Polyak, 1964) is a fundamental concept covered in this subject. This subject covers Classical Momentum (Polyak, 1964) as part of its core content. Correct!
- If you chose B: This is incorrect. Classical Momentum (Polyak, 1964) is a fundamental concept covered in this subject. This subject covers Classical Momentum (Polyak, 1964) as part of its core content.
- If you chose C: This is incorrect. Classical Momentum (Polyak, 1964) is a fundamental concept covered in this subject. This subject covers Classical Momentum (Polyak, 1964) as part of its core content.
- If you chose D: This is incorrect. Classical Momentum (Polyak, 1964) is a fundamental concept covered in this subject. This subject covers Classical Momentum (Polyak, 1964) as part of its core content.
Q5: Based on the worked examples in this subject, what is the correct result?
A) A different result from a common mistake B) An unrelated numerical value C) The inverse of the correct answer D) Mini-Batch Variance
Correct: D)
- If you chose A: This is incorrect. The worked examples show that the result is Mini-Batch Variance. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is Mini-Batch Variance. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is Mini-Batch Variance. The other options represent common errors.
- If you chose D: The worked examples show that the result is Mini-Batch Variance. The other options represent common errors. Correct!
Q6: How are Classical Momentum (Polyak, 1964) and Nesterov Accelerated Gradient (Nag, 1983) related?
A) Classical Momentum (Polyak, 1964) and Nesterov Accelerated Gradient (Nag, 1983) are closely related concepts B) Classical Momentum (Polyak, 1964) is the inverse of Nesterov Accelerated Gradient (Nag, 1983) C) Classical Momentum (Polyak, 1964) is a special case of Nesterov Accelerated Gradient (Nag, 1983) D) Classical Momentum (Polyak, 1964) and Nesterov Accelerated Gradient (Nag, 1983) are completely unrelated topics
Correct: A)
- If you chose A: Both Classical Momentum (Polyak, 1964) and Nesterov Accelerated Gradient (Nag, 1983) are covered in this subject as interconnected topics. Correct!
- If you chose B: This is incorrect. Both Classical Momentum (Polyak, 1964) and Nesterov Accelerated Gradient (Nag, 1983) are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Classical Momentum (Polyak, 1964) and Nesterov Accelerated Gradient (Nag, 1983) are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Classical Momentum (Polyak, 1964) and Nesterov Accelerated Gradient (Nag, 1983) are covered in this subject as interconnected topics.
Q7: What is a common pitfall when working with Comparison Table?
A) Comparison Table has no common misconceptions B) The main error with Comparison Table is using it when it is not needed C) A common mistake is confusing Comparison Table with a similar concept D) Comparison Table is always computed the same way in all contexts
Correct: C)
- If you chose A: This is incorrect. Students often confuse Comparison Table with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse Comparison Table with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: Students often confuse Comparison Table with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose D: This is incorrect. Students often confuse Comparison Table with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply Stochastic Gradient Descent (Sgd)?
A) Stochastic Gradient Descent (Sgd) is not practically useful B) Use Stochastic Gradient Descent (Sgd) only in pure mathematics contexts C) Avoid Stochastic Gradient Descent (Sgd) unless explicitly instructed D) Apply Stochastic Gradient Descent (Sgd) to solve problems in this subject's domain
Correct: D)
- If you chose A: This is incorrect. Stochastic Gradient Descent (Sgd) is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Stochastic Gradient Descent (Sgd) is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: This is incorrect. Stochastic Gradient Descent (Sgd) is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: Stochastic Gradient Descent (Sgd) is a practical tool used throughout this subject to solve relevant problems. Correct!
Practice Problems
-
Derive the effective step size of momentum in steady state (when $\nabla f$ is approximately constant over many steps).
Click for answer
If $\nabla f(\mathbf{x}_k) \approx \mathbf{g}$ constant, then: $\mathbf{v}_{k+1} = \beta\mathbf{v}_k - \eta\mathbf{g}$. In steady state, $\mathbf{v}_{k+1} \approx \mathbf{v}_k = \mathbf{v}$. So $\mathbf{v} = \beta\mathbf{v} - \eta\mathbf{g}$, giving $\mathbf{v} = -\frac{\eta}{1-\beta}\mathbf{g}$. So $\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{\eta}{1-\beta}\mathbf{g}$ — effective learning rate is $\eta/(1-\beta)$. With $\beta=0.9$, this is $10\eta$, explaining the acceleration. -
For $f(x) = \frac{1}{2}x^2$, find the optimal $\beta$ for momentum with fixed $\eta=0.1$ to minimize the spectral radius of the iteration matrix.
Click for answer
Write the momentum iteration as a linear system: $\begin{pmatrix} x_{k+1} \\ v_{k+1} \end{pmatrix} = \begin{pmatrix} 1-\eta & \beta \\ -\eta & \beta \end{pmatrix} \begin{pmatrix} x_k \\ v_k \end{pmatrix}$ For $\eta=0.1$: eigenvalues of this matrix depend on $\beta$. Spectral radius $\rho < 1$ for convergence. Optimal $\beta$ minimizes $\rho$. For $\eta=0.1$, the optimal $\beta \approx 0.729$ (the largest root of a quadratic in $\beta$). -
Explain why SGD with fixed $\eta$ does NOT converge to the exact minimum, even for strongly convex functions. What's the asymptotic error?
Click for answer
With fixed $\eta$, the stochastic gradient noise prevents exact convergence. The iterate wanders in a ball around $\mathbf{x}^*$ with radius proportional to $\eta\sigma^2/\mu$ (where $\sigma^2$ is gradient variance, $\mu$ is strong convexity). Decreasing $\eta_k \to 0$ (e.g., $\eta_k = 1/k$) is required for exact convergence. In practice, ML practitioners often use fixed $\eta$ with learning rate decay near the end, or accept the noise floor. -
A mini-batch of size $B=64$ takes 2ms on a GPU. A full batch ($N=10^6$) takes 30s. Compare the number of stochastic gradient steps you can take in 30s.
Click for answer
In 30s, mini-batch ($B=64$): 30/0.002 = 15,000 SGD steps. Full batch: 1 step. Even though each SGD step is noisier, 15,000 noisy steps typically make far more progress than 1 exact step. This is the core economic argument for SGD. -
Show that for a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T Q \mathbf{x}$, NAG with appropriate parameters achieves the optimal convergence rate $O((1-1/\sqrt{\kappa})^k)$.
Click for answer
For a quadratic, NAG reduces to a two-step recurrence in each eigen-direction. With optimal parameters $\beta = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ and $\eta = 4/(\sqrt{L}+\sqrt{\mu})^2$, the error in each eigen-direction decays as $O((1-1/\sqrt{\kappa_i})^k)$ where $\kappa_i = \lambda_i/\mu$. The overall rate is dominated by the worst-conditioned direction, giving $O((1-1/\sqrt{\kappa})^k)$. This matches the theoretical lower bound for first-order methods on smooth, strongly convex functions — NAG is optimal.
Summary
Key takeaways:
- Momentum accumulates past gradients to dampen oscillations and accelerate consistent directions
- Nesterov momentum evaluates the gradient at a "lookahead" position — provably optimal for convex smooth problems ($O(1/k^2)$)
- SGD approximates the full gradient with a mini-batch — unbiased estimator with variance $\propto 1/|B|$
- Mini-batch size trades gradient accuracy for computational efficiency; $B=32$–$256$ is the sweet spot
- SGD noise helps escape saddle points — often generalizes better than full-batch GD
- Decreasing learning rate schedules ($\eta_k \to 0$) are needed for exact SGD convergence
Pitfalls
-
Setting momentum too high ($\beta > 0.99$). High $\beta$ makes the velocity respond too slowly to gradient changes, causing massive overshoot. The effective step size $\eta/(1-\beta)$ explodes — with $\beta=0.999$, it's $1000\eta$, turning even small learning rates into dangerously large steps.
-
Treating NAG as always better than classical momentum. NAG is provably optimal for convex smooth functions but offers no theoretical advantage for non-convex problems. In deep learning, classical momentum (as used in Adam, SGD+Momentum) often works equally well.
-
Using mini-batch size too large. Beyond $B \approx 256$, the variance reduction from larger batches provides diminishing returns while wall-clock time per step increases. Very large batches ($B > 4096$) can also hurt generalization by removing beneficial gradient noise.
-
Forgetting to shuffle data between epochs. Without shuffling, mini-batches contain correlated examples, biasing the gradient estimator and potentially causing cyclic behavior. Always shuffle.
-
Decaying the learning rate too aggressively. $1/k$ decay can stall optimization early. In practice, step decay or cosine schedules that maintain a reasonable learning rate for most of training work better than overly aggressive schedules.
Next Steps
Next up: 14-04-adaptive-learning-rate-methods.md — AdaGrad, RMSprop, and Adam — methods that adapt the learning rate per-parameter, eliminating the need to hand-tune $\eta$.