Math graphic
📐 Concept diagram

14-03 — Variants of Gradient Descent

Phase: Optimization | Subject: 14-03 Prerequisites: 14-02-gradient-descent.md Next subject: 14-04-adaptive-learning-rate-methods.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Explain how momentum accelerates gradient descent and derive the update rule
  2. Contrast Nesterov accelerated gradient with classical momentum
  3. Describe stochastic gradient descent and why it works despite noisy gradients
  4. Analyze the bias-variance tradeoff of mini-batch size
  5. Implement SGD with momentum and tune the momentum coefficient

Core Content

The Problem: Zigzag and Slow Convergence

Vanilla gradient descent on ill-conditioned problems zigzags. For $f(x,y) = \frac{1}{2}x^2 + \frac{50}{2}y^2$ ($\kappa=50$), the gradient points mostly in the steep $y$-direction, causing oscillation while making slow progress in the shallow $x$-direction.

The fix: momentum — accumulate past gradients to smooth the trajectory.

Classical Momentum (Polyak, 1964)

$$\mathbf{v}{k+1} = \beta \mathbf{v}_k - \eta \nabla f(\mathbf{x}_k)$$ $$\mathbf{x}{k+1} = \mathbf{x}k + \mathbf{v}{k+1}$$

Interpretation: Think of a ball rolling down a hill. Momentum keeps it moving in consistent directions and dampens oscillations. When $\beta=0$, this reduces to vanilla GD.

Why it helps with zigzag: Oscillating gradient components cancel out in the moving average, while consistent components accumulate. On $f(x,y) = \frac{1}{2}x^2 + \frac{50}{2}y^2$, momentum dampens the $y$-oscillations and accelerates $x$-progress.

The effective step size in steady state (when gradient is nearly constant) is $\eta/(1-\beta)$. With $\beta=0.9$, each step is roughly $10\times$ larger than vanilla GD — that's the acceleration.

Nesterov Accelerated Gradient (NAG, 1983)

Nesterov's insight: "look ahead" before computing the gradient.

$$\mathbf{v}{k+1} = \beta \mathbf{v}_k - \eta \nabla f(\mathbf{x}_k + \beta \mathbf{v}_k)$$ $$\mathbf{x}{k+1} = \mathbf{x}k + \mathbf{v}{k+1}$$

The gradient is evaluated at $\mathbf{x}_k + \beta\mathbf{v}_k$ — the lookahead position where momentum would take you — not at $\mathbf{x}_k$.

⚠️ CRITICAL: NAG is provably optimal for convex, smooth optimization — it achieves $O(1/k^2)$ convergence vs $O(1/k)$ for vanilla GD. For strongly convex functions: $O((1-1/\sqrt{\kappa})^k)$ vs $O((1-1/\kappa)^k)$. The improvement from $\kappa$ to $\sqrt{\kappa}$ is significant when $\kappa$ is large.

Comparison Table

Method Update Convergence (convex) Convergence (strongly convex)
GD $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\nabla f(\mathbf{x}_k)$ $O(1/k)$ $O((1-1/\kappa)^k)$
Momentum $\mathbf{v}_{k+1} = \beta\mathbf{v}_k - \eta\nabla f(\mathbf{x}_k)$ $O(1/k)$ (faster constant) $O((1-1/\sqrt{\kappa})^k)$
NAG Gradient at lookahead $O(1/k^2)$ $O((1-1/\sqrt{\kappa})^k)$

Stochastic Gradient Descent (SGD)

In ML, the loss is an average over data: $f(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^N f_i(\mathbf{x})$. Computing the full gradient costs $O(N)$. SGD approximates it with a random subset (mini-batch):

$$\mathbf{x}{k+1} = \mathbf{x}_k - \eta \cdot \frac{1}{|B|}\sum{i \in B} \nabla f_i(\mathbf{x}_k)$$

where $B$ is a random mini-batch of size $|B|$.

Why SGD works: The mini-batch gradient is an unbiased estimator of the true gradient:

$$\mathbb{E}B\left[\frac{1}{|B|}\sum{i \in B} \nabla f_i(\mathbf{x})\right] = \nabla f(\mathbf{x})$$

The variance is $\propto 1/|B|$, so larger batches give more accurate gradients.

The bias-variance tradeoff of batch size:

Batch Size Gradient Variance Memory Convergence Speed (wall-clock)
1 (pure SGD) Very high Minimal Fast per-step, but noisy
32–256 (mini-batch) Moderate Moderate Sweet spot — good GPU utilization + noise helps escape saddle points
N (full batch) Zero Max Slow per-step, exact gradient

⚠️ Common Pitfall: The noise in SGD isn't always bad — it helps escape saddle points and shallow local minima. Pure full-batch GD on a non-convex neural network loss can get stuck; SGD's stochasticity acts as implicit regularization.

SGD Convergence

With decreasing learning rate $\eta_k = \eta_0 / \sqrt{k}$ (or similar schedule):

$$\mathbb{E}[f(\bar{\mathbf{x}}_K) - f(\mathbf{x}^*)] \leq O\left(\frac{1}{\sqrt{K}}\right)$$

where $\bar{\mathbf{x}}_K$ is the average of iterates. The $O(1/\sqrt{K})$ rate is slower than GD's $O(1/K)$ on the training loss, but SGD generalizes better and each iteration is $N/|B|$ times cheaper.



Key Terms

Worked Examples

Example 1: Momentum Dampens Oscillation

Compare GD ($\beta=0$) and momentum ($\beta=0.9$) on $f(x,y) = \frac{1}{2}x^2 + 25y^2$ from $(x_0,y_0) = (10, 1)$ with $\eta=0.01$. Track 5 iterations.

Solution (GD, $\beta=0$): $\nabla f = (x, 50y)$

k $x_k$ $y_k$
0 10.000 1.000
1 9.900 0.500
2 9.801 0.250
3 9.703 0.125
4 9.606 0.063
5 9.510 0.031

$x$ barely moves (~0.5% per step); $y$ oscillates toward zero.

Solution (Momentum, $\beta=0.9$): Start with $\mathbf{v}_0 = (0,0)$.

k $\nabla f$ $\mathbf{v}_k$ $\mathbf{x}_k$
0 $(10, 50)$ $(0, 0)$ $(10, 1)$
1 $( -0.10, -0.50)$ $(9.90, 0.50)$
2 $(9.90, 25)$ $( -0.19, -0.70)$ $(9.71, -0.20)$
3 $(9.71, -10)$ $( -0.27, -0.53)$ $(9.44, -0.73)$

Momentum overshoots $y$ but accumulates forward progress in $x$. After many iterations, $x$ converges much faster than vanilla GD.

Click for answer GD: $x$ converges at ~0.5% per step (stuck due to small $\eta$ forced by steep $y$). Momentum: $x$ accelerates due to accumulated velocity while $y$ oscillations partially cancel.

Example 2: NAG vs Classical Momentum

On a 1D quadratic $f(x) = \frac{1}{2}ax^2$ with $a=1$, compare the effective dynamics of momentum vs NAG.

Solution:

Momentum: $v_{k+1} = \beta v_k - \eta x_k$, $x_{k+1} = x_k + v_{k+1}$

NAG: $v_{k+1} = \beta v_k - \eta(x_k + \beta v_k)$, $x_{k+1} = x_k + v_{k+1}$

In NAG, the gradient term includes $-\eta\beta v_k$, which acts as a "correction" — it anticipates the overshoot and compensates. For quadratics, NAG eliminates oscillations at a wider range of $\eta$ values.

With $\beta=0.9$, $\eta=0.1$, $x_0=10$:

Momentum: oscillatory approach, takes ~40 iterations to get below 0.1. NAG: smoother descent, takes ~25 iterations to get below 0.1.

Click for answer NAG's lookahead correction $-\eta\beta v_k$ anticipates overshoot, reducing oscillations and converging ~1.6× faster than classical momentum on this quadratic.

Example 3: Mini-Batch Variance

For a dataset of 1000 points with per-example gradients $\nabla f_i$ having variance $\sigma^2$, compute the variance of a mini-batch gradient estimator of size $B=32$.

Solution: $\operatorname{Var}\left[\frac{1}{B}\sum_{i \in B} \nabla f_i\right] = \frac{1}{B^2} \sum_{i \in B} \operatorname{Var}[\nabla f_i] = \frac{1}{B^2} \cdot B\sigma^2 = \frac{\sigma^2}{B}$

For $B=32$: variance is $\sigma^2/32$. For $B=1$ (pure SGD): $\sigma^2$ (32× noisier). For $B=1000$ (full batch): $\approx 0$.

Tradeoff: Mini-batch of 32 gives 32× variance reduction at only 32× the per-step cost, which is typically dominated by GPU parallelism — so wall-clock time per step barely increases from $B=1$ to $B=32$ on a GPU.

Click for answer $\operatorname{Var} = \sigma^2/B$. Going from $B=1$ to $B=32$ reduces gradient noise by 97% at negligible wall-clock cost on GPU hardware.


Quiz

Q1: What does the concept of Sweet spot primarily refer to in this subject?

A) A computational error related to Sweet spot B) A historical anecdote about Sweet spot C) A visual representation of Sweet spot D) The definition and application of Sweet spot

Correct: D)

Q2: Which of the following is the key formula discussed in this subject?

A) The inverse operation of the formula in question B) f(x,y) = \frac{1}{2}x^2 + \frac{50}{2}y^2 C) A simplified version of f(x,y) = \frac{1}{2}x^2 + \f... D) An unrelated formula from a different topic

Correct: B)

Q3: What is the primary purpose of The Problem: Zigzag And Slow Convergence?

A) It replaces all other methods in this domain B) It is used to the problem: zigzag and slow convergence in mathematical analysis C) It is used only in advanced research contexts D) It is primarily a historical notation system

Correct: B)

Q4: Which statement about Classical Momentum (Polyak, 1964) is TRUE?

A) Classical Momentum (Polyak, 1964) is a fundamental concept covered in this subject B) Classical Momentum (Polyak, 1964) is not related to this subject C) Classical Momentum (Polyak, 1964) is an advanced topic beyond this subject's scope D) Classical Momentum (Polyak, 1964) is mentioned only as a historical footnote

Correct: A)

Q5: Based on the worked examples in this subject, what is the correct result?

A) A different result from a common mistake B) An unrelated numerical value C) The inverse of the correct answer D) Mini-Batch Variance

Correct: D)

Q6: How are Classical Momentum (Polyak, 1964) and Nesterov Accelerated Gradient (Nag, 1983) related?

A) Classical Momentum (Polyak, 1964) and Nesterov Accelerated Gradient (Nag, 1983) are closely related concepts B) Classical Momentum (Polyak, 1964) is the inverse of Nesterov Accelerated Gradient (Nag, 1983) C) Classical Momentum (Polyak, 1964) is a special case of Nesterov Accelerated Gradient (Nag, 1983) D) Classical Momentum (Polyak, 1964) and Nesterov Accelerated Gradient (Nag, 1983) are completely unrelated topics

Correct: A)

Q7: What is a common pitfall when working with Comparison Table?

A) Comparison Table has no common misconceptions B) The main error with Comparison Table is using it when it is not needed C) A common mistake is confusing Comparison Table with a similar concept D) Comparison Table is always computed the same way in all contexts

Correct: C)

Q8: When should you apply Stochastic Gradient Descent (Sgd)?

A) Stochastic Gradient Descent (Sgd) is not practically useful B) Use Stochastic Gradient Descent (Sgd) only in pure mathematics contexts C) Avoid Stochastic Gradient Descent (Sgd) unless explicitly instructed D) Apply Stochastic Gradient Descent (Sgd) to solve problems in this subject's domain

Correct: D)

Practice Problems

  1. Derive the effective step size of momentum in steady state (when $\nabla f$ is approximately constant over many steps).

    Click for answer If $\nabla f(\mathbf{x}_k) \approx \mathbf{g}$ constant, then: $\mathbf{v}_{k+1} = \beta\mathbf{v}_k - \eta\mathbf{g}$. In steady state, $\mathbf{v}_{k+1} \approx \mathbf{v}_k = \mathbf{v}$. So $\mathbf{v} = \beta\mathbf{v} - \eta\mathbf{g}$, giving $\mathbf{v} = -\frac{\eta}{1-\beta}\mathbf{g}$. So $\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{\eta}{1-\beta}\mathbf{g}$ — effective learning rate is $\eta/(1-\beta)$. With $\beta=0.9$, this is $10\eta$, explaining the acceleration.

  2. For $f(x) = \frac{1}{2}x^2$, find the optimal $\beta$ for momentum with fixed $\eta=0.1$ to minimize the spectral radius of the iteration matrix.

    Click for answer Write the momentum iteration as a linear system: $\begin{pmatrix} x_{k+1} \\ v_{k+1} \end{pmatrix} = \begin{pmatrix} 1-\eta & \beta \\ -\eta & \beta \end{pmatrix} \begin{pmatrix} x_k \\ v_k \end{pmatrix}$ For $\eta=0.1$: eigenvalues of this matrix depend on $\beta$. Spectral radius $\rho < 1$ for convergence. Optimal $\beta$ minimizes $\rho$. For $\eta=0.1$, the optimal $\beta \approx 0.729$ (the largest root of a quadratic in $\beta$).

  3. Explain why SGD with fixed $\eta$ does NOT converge to the exact minimum, even for strongly convex functions. What's the asymptotic error?

    Click for answer With fixed $\eta$, the stochastic gradient noise prevents exact convergence. The iterate wanders in a ball around $\mathbf{x}^*$ with radius proportional to $\eta\sigma^2/\mu$ (where $\sigma^2$ is gradient variance, $\mu$ is strong convexity). Decreasing $\eta_k \to 0$ (e.g., $\eta_k = 1/k$) is required for exact convergence. In practice, ML practitioners often use fixed $\eta$ with learning rate decay near the end, or accept the noise floor.

  4. A mini-batch of size $B=64$ takes 2ms on a GPU. A full batch ($N=10^6$) takes 30s. Compare the number of stochastic gradient steps you can take in 30s.

    Click for answer In 30s, mini-batch ($B=64$): 30/0.002 = 15,000 SGD steps. Full batch: 1 step. Even though each SGD step is noisier, 15,000 noisy steps typically make far more progress than 1 exact step. This is the core economic argument for SGD.

  5. Show that for a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T Q \mathbf{x}$, NAG with appropriate parameters achieves the optimal convergence rate $O((1-1/\sqrt{\kappa})^k)$.

    Click for answer For a quadratic, NAG reduces to a two-step recurrence in each eigen-direction. With optimal parameters $\beta = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ and $\eta = 4/(\sqrt{L}+\sqrt{\mu})^2$, the error in each eigen-direction decays as $O((1-1/\sqrt{\kappa_i})^k)$ where $\kappa_i = \lambda_i/\mu$. The overall rate is dominated by the worst-conditioned direction, giving $O((1-1/\sqrt{\kappa})^k)$. This matches the theoretical lower bound for first-order methods on smooth, strongly convex functions — NAG is optimal.


Summary

Key takeaways:


Pitfalls

  1. Setting momentum too high ($\beta > 0.99$). High $\beta$ makes the velocity respond too slowly to gradient changes, causing massive overshoot. The effective step size $\eta/(1-\beta)$ explodes — with $\beta=0.999$, it's $1000\eta$, turning even small learning rates into dangerously large steps.

  2. Treating NAG as always better than classical momentum. NAG is provably optimal for convex smooth functions but offers no theoretical advantage for non-convex problems. In deep learning, classical momentum (as used in Adam, SGD+Momentum) often works equally well.

  3. Using mini-batch size too large. Beyond $B \approx 256$, the variance reduction from larger batches provides diminishing returns while wall-clock time per step increases. Very large batches ($B > 4096$) can also hurt generalization by removing beneficial gradient noise.

  4. Forgetting to shuffle data between epochs. Without shuffling, mini-batches contain correlated examples, biasing the gradient estimator and potentially causing cyclic behavior. Always shuffle.

  5. Decaying the learning rate too aggressively. $1/k$ decay can stall optimization early. In practice, step decay or cosine schedules that maintain a reasonable learning rate for most of training work better than overly aggressive schedules.



Next Steps

Next up: 14-04-adaptive-learning-rate-methods.md — AdaGrad, RMSprop, and Adam — methods that adapt the learning rate per-parameter, eliminating the need to hand-tune $\eta$.