Math graphic
📐 Concept diagram

14-02 — Gradient Descent

Phase: Optimization | Subject: 14-02 Prerequisites: 14-01-optimization-fundamentals.md, 06-03-partial-derivatives.md Next subject: 14-03-variants-gradient-descent.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the gradient descent update from the first-order Taylor expansion
  2. Choose an appropriate learning rate and explain the consequences of getting it wrong
  3. Analyze convergence rates for convex, strongly convex, and smooth functions
  4. Implement gradient descent and recognize when it's converged
  5. Explain why gradient descent works on non-convex problems in practice

Core Content

The Algorithm

Gradient descent is the workhorse of continuous optimization. Given a differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, start at some $\mathbf{x}_0$ and iterate:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla f(\mathbf{x}_k)$$

Derivation: Where Does This Come From?

Consider the first-order Taylor expansion of $f$ around $\mathbf{x}_k$:

$$f(\mathbf{x}) \approx f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T (\mathbf{x} - \mathbf{x}_k)$$

We want to choose a direction $\mathbf{d} = \mathbf{x} - \mathbf{x}_k$ that decreases $f$ as much as possible, subject to $|\mathbf{d}| \leq 1$. The term $\nabla f(\mathbf{x}_k)^T \mathbf{d}$ is the directional derivative — it's maximized when $\mathbf{d}$ aligns with $\nabla f$ and minimized when $\mathbf{d}$ points opposite $\nabla f$.

The steepest descent direction (under the Euclidean norm) is:

$$\mathbf{d}^* = -\frac{\nabla f(\mathbf{x}_k)}{|\nabla f(\mathbf{x}_k)|}$$

So the update $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla f(\mathbf{x}_k)$ moves in the direction of steepest descent with step size proportional to the gradient magnitude.

⚠️ CRITICAL — Common Pitfall: Gradient descent finds a local minimum, not necessarily the global minimum. On non-convex problems (neural networks), where you start matters enormously. This is why initialization is a major research topic in deep learning.

The Learning Rate: Make or Break

The learning rate $\eta$ is the single most important hyperparameter.

$\eta$ too large $\eta$ just right $\eta$ too small
Overshoots, diverges, or oscillates Converges efficiently Converges, but painfully slowly
$f$ increases or oscillates wildly $f$ decreases monotonically $f$ decreases by tiny amounts

Example: For $f(x) = x^2$, the gradient is $2x$, so $\nabla f(x) = 2x$.

For a general quadratic $f(x) = \frac{1}{2}ax^2$:

$$x_{k+1} = x_k - \eta a x_k = (1 - \eta a)x_k$$

Convergence requires $|1 - \eta a| < 1$, i.e. $0 < \eta < 2/a$.

Convergence Theory

For Convex, L-Smooth Functions

$f$ is $L$-smooth if $|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})| \leq L|\mathbf{x} - \mathbf{y}|$ (gradient is $L$-Lipschitz). With $\eta \leq 1/L$:

$$f(\mathbf{x}_k) - f(\mathbf{x}^) \leq \frac{|\mathbf{x}_0 - \mathbf{x}^|^2}{2\eta k}$$

This is a sublinear rate: $O(1/k)$. After $k$ iterations, error halves when you quadruple $k$.

For Strongly Convex Functions

$f$ is $\mu$-strongly convex if $f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) + \frac{\mu}{2}|\mathbf{y}-\mathbf{x}|^2$. With $\eta \leq 2/(\mu+L)$:

$$|\mathbf{x}_k - \mathbf{x}^|^2 \leq \left(\frac{\kappa-1}{\kappa+1}\right)^{2k} |\mathbf{x}_0 - \mathbf{x}^|^2$$

where $\kappa = L/\mu$ is the condition number. This is a linear rate: $O((1-1/\kappa)^k)$. Error shrinks exponentially.

⚠️ CRITICAL — Condition Number: $\kappa = L/\mu$ controls convergence speed. Large $\kappa$ → function is "elongated" → gradient descent zigzags in narrow valleys → slow convergence. Ill-conditioned problems (large $\kappa$) are the main reason vanilla GD is slow in practice, motivating momentum and adaptive methods.

Practical Stopping Criteria

In theory, you stop when $\nabla f(\mathbf{x}_k) = \mathbf{0}$. In practice:

  1. Gradient norm: $|\nabla f(\mathbf{x}_k)| < \epsilon$ (e.g. $\epsilon = 10^{-6}$)
  2. Function value change: $|f(\mathbf{x}_{k+1}) - f(\mathbf{x}_k)| < \epsilon$
  3. Parameter change: $|\mathbf{x}_{k+1} - \mathbf{x}_k| < \epsilon$
  4. Max iterations: Hard cap to prevent infinite loops

For ML training, criterion 1 is most common.



Key Terms

Gradient descent - Convergence Theory - Correct: B) - Derivation: Where Does This Come From? - Example 1: Simple Quadratic - Example 2: Divergence with Large Learning Rate - Example 3: 2D Optimization - Practical Stopping Criteria - The Algorithm - The Learning Rate: Make or Break - condition number**

Worked Examples

Example 1: Simple Quadratic

Use gradient descent to minimize $f(x) = x^2 - 4x + 5$ starting from $x_0 = 0$ with $\eta = 0.1$. Run 5 iterations.

Solution: - $\nabla f(x) = 2x - 4$ - $x_0 = 0$, $\nabla f(0) = -4$ - $x_1 = 0 - 0.1(-4) = 0.4$ - $x_2 = 0.4 - 0.1(2(0.4) - 4) = 0.4 - 0.1(-3.2) = 0.72$ - $x_3 = 0.72 - 0.1(2(0.72) - 4) = 0.72 - 0.1(-2.56) = 0.976$ - $x_4 = 0.976 - 0.1(2(0.976) - 4) = 0.976 - 0.1(-2.048) = 1.1808$ - $x_5 = 1.1808 - 0.1(2(1.1808) - 4) = 1.1808 - 0.1(-1.6384) = 1.34464$

True minimum: $x^* = 2$ (complete the square: $(x-2)^2 + 1$). After 5 iterations, we're at 1.3446 — about 65% of the way there from 0.

Click for answer After 5 iterations: $x_5 = 1.34464$. True minimum at $x^* = 2$, $f(x^*) = 1$.

Example 2: Divergence with Large Learning Rate

Show that $\eta = 1.2$ causes divergence for $f(x) = x^2$.

Solution: Update: $x_{k+1} = x_k - 1.2(2x_k) = (1 - 2.4)x_k = -1.4x_k$

$|x_{k+1}| = 1.4|x_k|$ → magnitude grows by 40% per iteration → diverges.

Convergence requires $|1-2\eta| < 1 \implies 0 < \eta < 1$. At $\eta = 1$, we oscillate ($x_{k+1} = -x_k$). Above 1, we diverge.

Click for answer $\eta = 1.2$ exceeds the stability bound $\eta < 1$ for $f(x)=x^2$. Update rule becomes $x_{k+1} = -1.4x_k$, diverging exponentially.

Example 3: 2D Optimization

Minimize $f(x, y) = x^2 + 10y^2$ from $(x_0, y_0) = (10, 1)$ with $\eta = 0.05$. Track 3 iterations and observe the zigzag.

Solution: $\nabla f = (2x,\; 20y)$

k $x_k$ $y_k$ $f(x_k, y_k)$
0 10.00 1.00 110.00
1 9.00 0.00 81.00
2 8.10 0.00 65.61
3 7.29 0.00 53.14

After just 1 iteration, $y$ is essentially at the optimum! But $x$ converges slowly. The condition number $\kappa = 10$ causes the zigzag: the $y$-direction is optimized quickly, then GD crawls along the flat valley in the $x$-direction.

This illustrates why ill-conditioned problems are hard — you're forced to use a small $\eta$ to avoid overshooting in the steep direction, which slows progress in the flat direction.

Click for answer The high condition number ($\kappa=10$) causes GD to zigzag. After optimizing the steep $y$-direction in one step, it crawls slowly along the $x$-direction. At $\eta=0.05$: $x_3 = 7.29$, $y_3 \approx 0$, $f = 53.14$ (true minimum at $(0,0)$ with $f=0$).


Quiz

Q1: What does the concept of **Algorithm

Gradient descent** primarily refer to in this subject?

A) A historical anecdote about Algorithm

Gradient descent B) A visual representation of Algorithm

Gradient descent C) The definition and application of Algorithm

Gradient descent D) A computational error related to Algorithm

Gradient descent

Correct: C)

Gradient descent** is defined as: the definition and application of algorithm

gradient descent. The other options describe different aspects that are not the primary focus. - If you chose B: This is incorrect. **Algorithm

Gradient descent** is defined as: the definition and application of algorithm

gradient descent. The other options describe different aspects that are not the primary focus. - If you chose C: **Algorithm

Gradient descent** is defined as: the definition and application of algorithm

gradient descent. The other options describe different aspects that are not the primary focus. Correct! - If you chose D: This is incorrect. **Algorithm

Gradient descent** is defined as: the definition and application of algorithm

gradient descent. The other options describe different aspects that are not the primary focus.

Q2: Which of the following is the key formula discussed in this subject?

A) The inverse operation of the formula in question B) f: \mathbb{R}^n \to \mathbb{R} C) An unrelated formula from a different topic D) A simplified version of f: \mathbb{R}^n \to \mathbb...

Correct: B)

Q3: What is the primary purpose of Convergence Theory?

A) It is used only in advanced research contexts B) It is used to convergence theory in mathematical analysis C) It is primarily a historical notation system D) It replaces all other methods in this domain

Correct: B)

Q4: Which statement about Practical Stopping Criteria is TRUE?

A) Practical Stopping Criteria is a fundamental concept covered in this subject B) Practical Stopping Criteria is an advanced topic beyond this subject's scope C) Practical Stopping Criteria is mentioned only as a historical footnote D) Practical Stopping Criteria is not related to this subject

Correct: A)

Q5: Based on the worked examples in this subject, what is the correct result?

A) The inverse of the correct answer B) A different result from a common mistake C) 1$. D) An unrelated numerical value

Correct: C)

Q6: How are Practical Stopping Criteria and The Algorithm related?

A) Practical Stopping Criteria is the inverse of The Algorithm B) Practical Stopping Criteria and The Algorithm are closely related concepts C) Practical Stopping Criteria and The Algorithm are completely unrelated topics D) Practical Stopping Criteria is a special case of The Algorithm

Correct: B)

Q7: What is a common pitfall when working with Derivation: Where Does This Come From??

A) The main error with Derivation: Where Does This Come From? is using it when it is not needed B) Derivation: Where Does This Come From? is always computed the same way in all contexts C) A common mistake is confusing Derivation: Where Does This Come From? with a similar concept D) Derivation: Where Does This Come From? has no common misconceptions

Correct: C)

Q8: When should you apply The Learning Rate: Make Or Break?

A) Use The Learning Rate: Make Or Break only in pure mathematics contexts B) Avoid The Learning Rate: Make Or Break unless explicitly instructed C) Apply The Learning Rate: Make Or Break to solve problems in this subject's domain D) The Learning Rate: Make Or Break is not practically useful

Correct: C)

Practice Problems

  1. For $f(x) = (x-3)^4$, derive the gradient descent update and determine the range of $\eta$ for which convergence is guaranteed near the minimum $x^*=3$.

    Click for answer $\nabla f(x) = 4(x-3)^3$. Update: $x_{k+1} = x_k - 4\eta(x_k - 3)^3$. Near $x^*=3$, let $e_k = x_k - 3$. Then $e_{k+1} = e_k - 4\eta e_k^3 = e_k(1 - 4\eta e_k^2)$. Convergence when $|1 - 4\eta e_k^2| < 1$, i.e. $0 < \eta < 1/(2e_k^2)$. As $e_k \to 0$, any fixed $\eta$ eventually works — but progress is cubic, so it's slow near the optimum.

  2. Minimize $f(x, y) = (x-1)^2 + (y+2)^2$ with one step of gradient descent from $(0,0)$ using $\eta = 1$. What's special about this result?

    Click for answer $\nabla f = (2(x-1), 2(y+2))$. At $(0,0)$: $\nabla f = (-2, 4)$. $x_1 = (0,0) - 1 \cdot (-2, 4) = (2, -4)$. At $(2,-4)$: $\nabla f = (2(2-1), 2(-4+2)) = (2, -4)$. $\nabla f(x_1) \neq (0,0)$, so $(2,-4)$ is NOT the minimum. The true minimum is $(1, -2)$. With $\eta=1$, we overshot from $(0,0)$ past $(1,-2)$ to $(2,-4)$.

  3. Show that for a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T Q \mathbf{x} - \mathbf{b}^T\mathbf{x}$ with $Q \succ 0$, gradient descent with exact line search converges to $\mathbf{x}^* = Q^{-1}\mathbf{b}$.

    Click for answer $\nabla f(\mathbf{x}) = Q\mathbf{x} - \mathbf{b}$. At optimum: $Q\mathbf{x}^* - \mathbf{b} = 0 \implies \mathbf{x}^* = Q^{-1}\mathbf{b}$. With exact line search: $\eta_k = \frac{\|\nabla f(\mathbf{x}_k)\|^2}{\nabla f(\mathbf{x}_k)^T Q \nabla f(\mathbf{x}_k)}$. The error $\mathbf{e}_k = \mathbf{x}_k - \mathbf{x}^*$ satisfies $\mathbf{e}_{k+1} = (I - \eta_k Q)\mathbf{e}_k$, and with exact line search the method is globally convergent for any $Q \succ 0$.

  4. Implement one iteration of gradient descent for $f(x) = -\ln x + x$ ($x > 0$) starting at $x_0 = 0.5$ with $\eta = 0.3$. This is the negative log-likelihood for an exponential distribution.

    Click for answer $\nabla f(x) = -1/x + 1$. At $x_0 = 0.5$: $\nabla f(0.5) = -2 + 1 = -1$. $x_1 = 0.5 - 0.3(-1) = 0.8$. True minimum at $x^* = 1$ (since $-1/x + 1 = 0 \implies x = 1$).

  5. The function $f(x, y) = \frac{1}{2}x^2 + \frac{1}{2}y^2 + \epsilon xy$ has condition number depending on $\epsilon$. Find $\kappa(\epsilon)$ for small $\epsilon$, and explain why $\epsilon=0.99$ makes GD slow.

    Click for answer $\nabla^2 f = \begin{pmatrix} 1 & \epsilon \\ \epsilon & 1 \end{pmatrix}$. Eigenvalues: $\lambda = 1 \pm \epsilon$. $\kappa = \frac{1+|\epsilon|}{1-|\epsilon|}$. For $\epsilon = 0$: $\kappa = 1$ (perfect conditioning — GD converges in 1 step). For $\epsilon = 0.99$: $\kappa = 1.99/0.01 = 199$ — very ill-conditioned, GD is slow. As $\epsilon \to 1$, $\kappa \to \infty$ and the quadratic becomes singular.


Summary

Key takeaways:


Pitfalls



Next Steps

Next up: 14-03-variants-gradient-descent.md — momentum, Nesterov acceleration, and stochastic gradient descent — the techniques that fix the zigzag problem and make GD practical for ML.