14-02 â Gradient Descent
Phase: Optimization | Subject: 14-02 Prerequisites: 14-01-optimization-fundamentals.md, 06-03-partial-derivatives.md Next subject: 14-03-variants-gradient-descent.md
Learning Objectives
By the end of this subject, you will be able to:
- Derive the gradient descent update from the first-order Taylor expansion
- Choose an appropriate learning rate and explain the consequences of getting it wrong
- Analyze convergence rates for convex, strongly convex, and smooth functions
- Implement gradient descent and recognize when it's converged
- Explain why gradient descent works on non-convex problems in practice
Core Content
The Algorithm
Gradient descent is the workhorse of continuous optimization. Given a differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, start at some $\mathbf{x}_0$ and iterate:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla f(\mathbf{x}_k)$$
- $\eta > 0$ â learning rate (step size)
- $\nabla f(\mathbf{x}_k)$ â gradient at current point
- Negative sign â we move downhill
Derivation: Where Does This Come From?
Consider the first-order Taylor expansion of $f$ around $\mathbf{x}_k$:
$$f(\mathbf{x}) \approx f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T (\mathbf{x} - \mathbf{x}_k)$$
We want to choose a direction $\mathbf{d} = \mathbf{x} - \mathbf{x}_k$ that decreases $f$ as much as possible, subject to $|\mathbf{d}| \leq 1$. The term $\nabla f(\mathbf{x}_k)^T \mathbf{d}$ is the directional derivative â it's maximized when $\mathbf{d}$ aligns with $\nabla f$ and minimized when $\mathbf{d}$ points opposite $\nabla f$.
The steepest descent direction (under the Euclidean norm) is:
$$\mathbf{d}^* = -\frac{\nabla f(\mathbf{x}_k)}{|\nabla f(\mathbf{x}_k)|}$$
So the update $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla f(\mathbf{x}_k)$ moves in the direction of steepest descent with step size proportional to the gradient magnitude.
â ď¸ CRITICAL â Common Pitfall: Gradient descent finds a local minimum, not necessarily the global minimum. On non-convex problems (neural networks), where you start matters enormously. This is why initialization is a major research topic in deep learning.
The Learning Rate: Make or Break
The learning rate $\eta$ is the single most important hyperparameter.
| $\eta$ too large | $\eta$ just right | $\eta$ too small |
|---|---|---|
| Overshoots, diverges, or oscillates | Converges efficiently | Converges, but painfully slowly |
| $f$ increases or oscillates wildly | $f$ decreases monotonically | $f$ decreases by tiny amounts |
Example: For $f(x) = x^2$, the gradient is $2x$, so $\nabla f(x) = 2x$.
- Update: $x_{k+1} = x_k - \eta \cdot 2x_k = (1 - 2\eta)x_k$
- If $\eta = 0.25$: $x_{k+1} = 0.5x_k$ â converges geometrically to 0.
- If $\eta = 1.0$: $x_{k+1} = -x_k$ â oscillates forever, never converges.
- If $\eta = 1.2$: $x_{k+1} = -1.4x_k$ â diverges to $\pm\infty$.
- If $\eta = 0.001$: $x_{k+1} = 0.998x_k$ â converges, but ~2300 iterations to get within 1% of 0.
For a general quadratic $f(x) = \frac{1}{2}ax^2$:
$$x_{k+1} = x_k - \eta a x_k = (1 - \eta a)x_k$$
Convergence requires $|1 - \eta a| < 1$, i.e. $0 < \eta < 2/a$.
Convergence Theory
For Convex, L-Smooth Functions
$f$ is $L$-smooth if $|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})| \leq L|\mathbf{x} - \mathbf{y}|$ (gradient is $L$-Lipschitz). With $\eta \leq 1/L$:
$$f(\mathbf{x}_k) - f(\mathbf{x}^) \leq \frac{|\mathbf{x}_0 - \mathbf{x}^|^2}{2\eta k}$$
This is a sublinear rate: $O(1/k)$. After $k$ iterations, error halves when you quadruple $k$.
For Strongly Convex Functions
$f$ is $\mu$-strongly convex if $f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) + \frac{\mu}{2}|\mathbf{y}-\mathbf{x}|^2$. With $\eta \leq 2/(\mu+L)$:
$$|\mathbf{x}_k - \mathbf{x}^|^2 \leq \left(\frac{\kappa-1}{\kappa+1}\right)^{2k} |\mathbf{x}_0 - \mathbf{x}^|^2$$
where $\kappa = L/\mu$ is the condition number. This is a linear rate: $O((1-1/\kappa)^k)$. Error shrinks exponentially.
â ď¸ CRITICAL â Condition Number: $\kappa = L/\mu$ controls convergence speed. Large $\kappa$ â function is "elongated" â gradient descent zigzags in narrow valleys â slow convergence. Ill-conditioned problems (large $\kappa$) are the main reason vanilla GD is slow in practice, motivating momentum and adaptive methods.
Practical Stopping Criteria
In theory, you stop when $\nabla f(\mathbf{x}_k) = \mathbf{0}$. In practice:
- Gradient norm: $|\nabla f(\mathbf{x}_k)| < \epsilon$ (e.g. $\epsilon = 10^{-6}$)
- Function value change: $|f(\mathbf{x}_{k+1}) - f(\mathbf{x}_k)| < \epsilon$
- Parameter change: $|\mathbf{x}_{k+1} - \mathbf{x}_k| < \epsilon$
- Max iterations: Hard cap to prevent infinite loops
For ML training, criterion 1 is most common.
Key Terms
- 14 02 Gradient Descent
- **Algorithm
Gradient descent - Convergence Theory - Correct: B) - Derivation: Where Does This Come From? - Example 1: Simple Quadratic - Example 2: Divergence with Large Learning Rate - Example 3: 2D Optimization - Practical Stopping Criteria - The Algorithm - The Learning Rate: Make or Break - condition number**
Worked Examples
Example 1: Simple Quadratic
Use gradient descent to minimize $f(x) = x^2 - 4x + 5$ starting from $x_0 = 0$ with $\eta = 0.1$. Run 5 iterations.
Solution: - $\nabla f(x) = 2x - 4$ - $x_0 = 0$, $\nabla f(0) = -4$ - $x_1 = 0 - 0.1(-4) = 0.4$ - $x_2 = 0.4 - 0.1(2(0.4) - 4) = 0.4 - 0.1(-3.2) = 0.72$ - $x_3 = 0.72 - 0.1(2(0.72) - 4) = 0.72 - 0.1(-2.56) = 0.976$ - $x_4 = 0.976 - 0.1(2(0.976) - 4) = 0.976 - 0.1(-2.048) = 1.1808$ - $x_5 = 1.1808 - 0.1(2(1.1808) - 4) = 1.1808 - 0.1(-1.6384) = 1.34464$
True minimum: $x^* = 2$ (complete the square: $(x-2)^2 + 1$). After 5 iterations, we're at 1.3446 â about 65% of the way there from 0.
Click for answer
After 5 iterations: $x_5 = 1.34464$. True minimum at $x^* = 2$, $f(x^*) = 1$.Example 2: Divergence with Large Learning Rate
Show that $\eta = 1.2$ causes divergence for $f(x) = x^2$.
Solution: Update: $x_{k+1} = x_k - 1.2(2x_k) = (1 - 2.4)x_k = -1.4x_k$
$|x_{k+1}| = 1.4|x_k|$ â magnitude grows by 40% per iteration â diverges.
Convergence requires $|1-2\eta| < 1 \implies 0 < \eta < 1$. At $\eta = 1$, we oscillate ($x_{k+1} = -x_k$). Above 1, we diverge.
Click for answer
$\eta = 1.2$ exceeds the stability bound $\eta < 1$ for $f(x)=x^2$. Update rule becomes $x_{k+1} = -1.4x_k$, diverging exponentially.Example 3: 2D Optimization
Minimize $f(x, y) = x^2 + 10y^2$ from $(x_0, y_0) = (10, 1)$ with $\eta = 0.05$. Track 3 iterations and observe the zigzag.
Solution: $\nabla f = (2x,\; 20y)$
| k | $x_k$ | $y_k$ | $f(x_k, y_k)$ |
|---|---|---|---|
| 0 | 10.00 | 1.00 | 110.00 |
| 1 | 9.00 | 0.00 | 81.00 |
| 2 | 8.10 | 0.00 | 65.61 |
| 3 | 7.29 | 0.00 | 53.14 |
After just 1 iteration, $y$ is essentially at the optimum! But $x$ converges slowly. The condition number $\kappa = 10$ causes the zigzag: the $y$-direction is optimized quickly, then GD crawls along the flat valley in the $x$-direction.
This illustrates why ill-conditioned problems are hard â you're forced to use a small $\eta$ to avoid overshooting in the steep direction, which slows progress in the flat direction.
Click for answer
The high condition number ($\kappa=10$) causes GD to zigzag. After optimizing the steep $y$-direction in one step, it crawls slowly along the $x$-direction. At $\eta=0.05$: $x_3 = 7.29$, $y_3 \approx 0$, $f = 53.14$ (true minimum at $(0,0)$ with $f=0$).Quiz
Q1: What does the concept of **Algorithm
Gradient descent** primarily refer to in this subject?
A) A historical anecdote about Algorithm
Gradient descent B) A visual representation of Algorithm
Gradient descent C) The definition and application of Algorithm
Gradient descent D) A computational error related to Algorithm
Gradient descent
Correct: C)
- If you chose A: This is incorrect. **Algorithm
Gradient descent** is defined as: the definition and application of algorithm
gradient descent. The other options describe different aspects that are not the primary focus. - If you chose B: This is incorrect. **Algorithm
Gradient descent** is defined as: the definition and application of algorithm
gradient descent. The other options describe different aspects that are not the primary focus. - If you chose C: **Algorithm
Gradient descent** is defined as: the definition and application of algorithm
gradient descent. The other options describe different aspects that are not the primary focus. Correct! - If you chose D: This is incorrect. **Algorithm
Gradient descent** is defined as: the definition and application of algorithm
gradient descent. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) The inverse operation of the formula in question B) f: \mathbb{R}^n \to \mathbb{R} C) An unrelated formula from a different topic D) A simplified version of f: \mathbb{R}^n \to \mathbb...
Correct: B)
- If you chose A: This is incorrect. The formula f: \mathbb{R}^n \to \mathbb{R} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: The formula f: \mathbb{R}^n \to \mathbb{R} is central to this subject. The other options are either simplified versions or unrelated. Correct!
- If you chose C: This is incorrect. The formula f: \mathbb{R}^n \to \mathbb{R} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: This is incorrect. The formula f: \mathbb{R}^n \to \mathbb{R} is central to this subject. The other options are either simplified versions or unrelated.
Q3: What is the primary purpose of Convergence Theory?
A) It is used only in advanced research contexts B) It is used to convergence theory in mathematical analysis C) It is primarily a historical notation system D) It replaces all other methods in this domain
Correct: B)
- If you chose A: This is incorrect. Convergence Theory serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: Convergence Theory serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose C: This is incorrect. Convergence Theory serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. Convergence Theory serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Practical Stopping Criteria is TRUE?
A) Practical Stopping Criteria is a fundamental concept covered in this subject B) Practical Stopping Criteria is an advanced topic beyond this subject's scope C) Practical Stopping Criteria is mentioned only as a historical footnote D) Practical Stopping Criteria is not related to this subject
Correct: A)
- If you chose A: Practical Stopping Criteria is a fundamental concept covered in this subject. This subject covers Practical Stopping Criteria as part of its core content. Correct!
- If you chose B: This is incorrect. Practical Stopping Criteria is a fundamental concept covered in this subject. This subject covers Practical Stopping Criteria as part of its core content.
- If you chose C: This is incorrect. Practical Stopping Criteria is a fundamental concept covered in this subject. This subject covers Practical Stopping Criteria as part of its core content.
- If you chose D: This is incorrect. Practical Stopping Criteria is a fundamental concept covered in this subject. This subject covers Practical Stopping Criteria as part of its core content.
Q5: Based on the worked examples in this subject, what is the correct result?
A) The inverse of the correct answer B) A different result from a common mistake C) 1$. D) An unrelated numerical value
Correct: C)
- If you chose A: This is incorrect. The worked examples show that the result is 1$.. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is 1$.. The other options represent common errors.
- If you chose C: The worked examples show that the result is 1$.. The other options represent common errors. Correct!
- If you chose D: This is incorrect. The worked examples show that the result is 1$.. The other options represent common errors.
Q6: How are Practical Stopping Criteria and The Algorithm related?
A) Practical Stopping Criteria is the inverse of The Algorithm B) Practical Stopping Criteria and The Algorithm are closely related concepts C) Practical Stopping Criteria and The Algorithm are completely unrelated topics D) Practical Stopping Criteria is a special case of The Algorithm
Correct: B)
- If you chose A: This is incorrect. Both Practical Stopping Criteria and The Algorithm are covered in this subject as interconnected topics.
- If you chose B: Both Practical Stopping Criteria and The Algorithm are covered in this subject as interconnected topics. Correct!
- If you chose C: This is incorrect. Both Practical Stopping Criteria and The Algorithm are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Practical Stopping Criteria and The Algorithm are covered in this subject as interconnected topics.
Q7: What is a common pitfall when working with Derivation: Where Does This Come From??
A) The main error with Derivation: Where Does This Come From? is using it when it is not needed B) Derivation: Where Does This Come From? is always computed the same way in all contexts C) A common mistake is confusing Derivation: Where Does This Come From? with a similar concept D) Derivation: Where Does This Come From? has no common misconceptions
Correct: C)
- If you chose A: This is incorrect. Students often confuse Derivation: Where Does This Come From? with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse Derivation: Where Does This Come From? with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: Students often confuse Derivation: Where Does This Come From? with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose D: This is incorrect. Students often confuse Derivation: Where Does This Come From? with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply The Learning Rate: Make Or Break?
A) Use The Learning Rate: Make Or Break only in pure mathematics contexts B) Avoid The Learning Rate: Make Or Break unless explicitly instructed C) Apply The Learning Rate: Make Or Break to solve problems in this subject's domain D) The Learning Rate: Make Or Break is not practically useful
Correct: C)
- If you chose A: This is incorrect. The Learning Rate: Make Or Break is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. The Learning Rate: Make Or Break is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: The Learning Rate: Make Or Break is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose D: This is incorrect. The Learning Rate: Make Or Break is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
-
For $f(x) = (x-3)^4$, derive the gradient descent update and determine the range of $\eta$ for which convergence is guaranteed near the minimum $x^*=3$.
Click for answer
$\nabla f(x) = 4(x-3)^3$. Update: $x_{k+1} = x_k - 4\eta(x_k - 3)^3$. Near $x^*=3$, let $e_k = x_k - 3$. Then $e_{k+1} = e_k - 4\eta e_k^3 = e_k(1 - 4\eta e_k^2)$. Convergence when $|1 - 4\eta e_k^2| < 1$, i.e. $0 < \eta < 1/(2e_k^2)$. As $e_k \to 0$, any fixed $\eta$ eventually works â but progress is cubic, so it's slow near the optimum. -
Minimize $f(x, y) = (x-1)^2 + (y+2)^2$ with one step of gradient descent from $(0,0)$ using $\eta = 1$. What's special about this result?
Click for answer
$\nabla f = (2(x-1), 2(y+2))$. At $(0,0)$: $\nabla f = (-2, 4)$. $x_1 = (0,0) - 1 \cdot (-2, 4) = (2, -4)$. At $(2,-4)$: $\nabla f = (2(2-1), 2(-4+2)) = (2, -4)$. $\nabla f(x_1) \neq (0,0)$, so $(2,-4)$ is NOT the minimum. The true minimum is $(1, -2)$. With $\eta=1$, we overshot from $(0,0)$ past $(1,-2)$ to $(2,-4)$. -
Show that for a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T Q \mathbf{x} - \mathbf{b}^T\mathbf{x}$ with $Q \succ 0$, gradient descent with exact line search converges to $\mathbf{x}^* = Q^{-1}\mathbf{b}$.
Click for answer
$\nabla f(\mathbf{x}) = Q\mathbf{x} - \mathbf{b}$. At optimum: $Q\mathbf{x}^* - \mathbf{b} = 0 \implies \mathbf{x}^* = Q^{-1}\mathbf{b}$. With exact line search: $\eta_k = \frac{\|\nabla f(\mathbf{x}_k)\|^2}{\nabla f(\mathbf{x}_k)^T Q \nabla f(\mathbf{x}_k)}$. The error $\mathbf{e}_k = \mathbf{x}_k - \mathbf{x}^*$ satisfies $\mathbf{e}_{k+1} = (I - \eta_k Q)\mathbf{e}_k$, and with exact line search the method is globally convergent for any $Q \succ 0$. -
Implement one iteration of gradient descent for $f(x) = -\ln x + x$ ($x > 0$) starting at $x_0 = 0.5$ with $\eta = 0.3$. This is the negative log-likelihood for an exponential distribution.
Click for answer
$\nabla f(x) = -1/x + 1$. At $x_0 = 0.5$: $\nabla f(0.5) = -2 + 1 = -1$. $x_1 = 0.5 - 0.3(-1) = 0.8$. True minimum at $x^* = 1$ (since $-1/x + 1 = 0 \implies x = 1$). -
The function $f(x, y) = \frac{1}{2}x^2 + \frac{1}{2}y^2 + \epsilon xy$ has condition number depending on $\epsilon$. Find $\kappa(\epsilon)$ for small $\epsilon$, and explain why $\epsilon=0.99$ makes GD slow.
Click for answer
$\nabla^2 f = \begin{pmatrix} 1 & \epsilon \\ \epsilon & 1 \end{pmatrix}$. Eigenvalues: $\lambda = 1 \pm \epsilon$. $\kappa = \frac{1+|\epsilon|}{1-|\epsilon|}$. For $\epsilon = 0$: $\kappa = 1$ (perfect conditioning â GD converges in 1 step). For $\epsilon = 0.99$: $\kappa = 1.99/0.01 = 199$ â very ill-conditioned, GD is slow. As $\epsilon \to 1$, $\kappa \to \infty$ and the quadratic becomes singular.
Summary
Key takeaways:
- Gradient descent moves in the direction of steepest descent: $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\nabla f(\mathbf{x}_k)$
- The learning rate $\eta$ is critical: too large diverges, too small converges slowly
- For $L$-smooth convex functions: sublinear $O(1/k)$ convergence
- For $\mu$-strongly convex functions: linear $O((1-1/\kappa)^k)$ convergence
- Condition number $\kappa = L/\mu$ is the bottleneck â large $\kappa$ means slow GD
- Gradient descent finds local minima; for non-convex problems, initialization matters
Pitfalls
- Using a learning rate that's too large without checking for divergence: The stability condition for quadratics is 0 < Ρ < 2/L (where L is the Lipschitz constant of the gradient). Exceeding this causes oscillation or divergence. Always monitor the loss â if it increases or oscillates, reduce Ρ. A quick test: try Ρ values spanning several orders of magnitude on a log scale.
- Expecting global minima from gradient descent on non-convex problems: GD guarantees convergence to a stationary point (âf â 0) but that point could be a saddle or a poor local minimum. In deep learning, different random initializations lead to different minima with widely varying quality. Multiple restarts are essential.
- Ignoring the condition number and wondering why convergence is slow: For f(x,y) = x² + 100y² (Îş = 100), GD zigzags in the narrow valley and takes hundreds of iterations. Large Îş is the #1 reason vanilla GD is slow. If you see slow progress, check the Hessian's eigenvalue spread â you may need momentum, adaptive methods, or preconditioning.
- Using gradient norm as the only stopping criterion: ââfâ < Îľ can be satisfied on a plateau or at a saddle point where the function value is far from optimal. Always cross-check with function value change and parameter change. For ML, monitor validation loss â training loss can plateau while overfitting.
- Forgetting that gradient descent is first-order and struggles with saddle points: Near a saddle, the gradient is near zero and GD stalls. Second-order methods (Newton) use curvature to escape saddles, but vanilla GD can waste enormous time near them. In high dimensions, saddles outnumber minima exponentially â this is a major challenge for deep learning optimization.
Next Steps
Next up: 14-03-variants-gradient-descent.md â momentum, Nesterov acceleration, and stochastic gradient descent â the techniques that fix the zigzag problem and make GD practical for ML.