14-05 β Second-Order Methods
Phase: Optimization | Subject: 14-05 Prerequisites: 14-02-gradient-descent.md, 06-03-partial-derivatives.md Next subject: 14-06-convex-sets-functions.md
Learning Objectives
By the end of this subject, you will be able to:
- Derive Newton's method from the second-order Taylor expansion
- Explain why Newton's method is scale-invariant and converges quadratically
- Describe quasi-Newton methods (BFGS) and how they approximate the Hessian
- Explain the L-BFGS memory trick that makes it practical for high dimensions
- Articulate why Newton's method is NOT used for deep learning despite its superior convergence rate
Core Content
Why Second Order?
First-order methods use only the gradient $\nabla f$ β they approximate $f$ as locally linear. Second-order methods use the Hessian $\nabla^2 f$ β they approximate $f$ as locally quadratic, capturing curvature information.
Newton's Method
Start with the second-order Taylor expansion around $\mathbf{x}_k$:
$$f(\mathbf{x}_k + \mathbf{p}) \approx f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T \mathbf{p} + \frac{1}{2}\mathbf{p}^T \nabla^2 f(\mathbf{x}_k) \mathbf{p}$$
Minimize this quadratic approximation with respect to $\mathbf{p}$:
$$\frac{\partial}{\partial\mathbf{p}}\left[f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T\mathbf{p} + \frac{1}{2}\mathbf{p}^T \nabla^2 f(\mathbf{x}_k)\mathbf{p}\right] = \nabla f(\mathbf{x}_k) + \nabla^2 f(\mathbf{x}_k)\mathbf{p} = 0$$
$$\mathbf{p}_k = -[\nabla^2 f(\mathbf{x}_k)]^{-1} \nabla f(\mathbf{x}_k)$$
Newton update:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - [\nabla^2 f(\mathbf{x}_k)]^{-1} \nabla f(\mathbf{x}_k)$$
Compare with gradient descent: $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla f(\mathbf{x}_k)$. Newton replaces the scalar $\eta$ with the inverse Hessian matrix β it scales the step differently in each direction based on curvature.
Scale Invariance
β οΈ CRITICAL β This is Newton's killer feature. Newton's method is invariant to affine transformations of the variables. If you rescale your parameters from $\mathbf{x}$ to $A\mathbf{x} + \mathbf{b}$, Newton's method produces exactly the same sequence of function values.
Gradient descent, by contrast, is highly sensitive to scaling β that's why we need adaptive methods like Adam. Newton builds scale-invariance directly into the math.
Proof sketch: Under $\mathbf{y} = A\mathbf{x}$, the Newton step in $\mathbf{y}$-space is $\mathbf{p}y = -(A^{-T}\nabla^2 f A^{-1})^{-1}(A^{-T}\nabla f) = -A \nabla^2 f^{-1} \nabla f = A\mathbf{p}_x$, so $\mathbf{y}{k+1} = A\mathbf{x}k + A\mathbf{p}_x = A\mathbf{x}{k+1}$ β the iterates correspond exactly.
Convergence: Quadratic!
For $f$ smooth and strongly convex near $\mathbf{x}^*$, with $\mathbf{x}_0$ sufficiently close:
$$|\mathbf{x}_{k+1} - \mathbf{x}^| \leq C|\mathbf{x}_k - \mathbf{x}^|^2$$
The error is squared at each step! If $|\mathbf{x}_k - \mathbf{x}^| = 10^{-3}$, then $|\mathbf{x}_{k+1} - \mathbf{x}^| \approx C \cdot 10^{-6}$. The number of correct digits doubles each iteration.
Compare: GD on strongly convex functions converges linearly (fixed fraction reduction per step); Newton converges quadratically (exponentiation of digits).
The Catch: Why Newton Isn't Used in Deep Learning
| Problem | Explanation |
|---|---|
| Hessian size | $\nabla^2 f \in \mathbb{R}^{n \times n}$ β for $n=10^7$ parameters (a small transformer), the Hessian has $10^{14}$ entries (~800 TB). Impossible to store. |
| Hessian inversion | $O(n^3)$ per step β completely impractical for large $n$. |
| Non-convexity | On non-convex problems, $\nabla^2 f$ may not be positive definite. The Newton direction could point uphill. Need modifications (damped Newton, trust-region). |
| Saddle points | Newton is attracted to saddle points! Since $\nabla f=0$ at saddles, Newton dives right into them rather than escaping like SGD does. |
Quasi-Newton Methods: BFGS
BFGS (BroydenβFletcherβGoldfarbβShanno) approximates the inverse Hessian directly from gradient history, avoiding $O(n^3)$ inversions.
Let $B_k \approx \nabla^2 f(\mathbf{x}_k)$ be the Hessian approximation. The secant condition:
$$B_{k+1}(\mathbf{x}{k+1} - \mathbf{x}_k) = \nabla f(\mathbf{x}{k+1}) - \nabla f(\mathbf{x}_k)$$
Intuitively: the change in parameters times the Hessian should equal the change in gradient (this is exact for quadratics).
The BFGS update to the inverse Hessian $H_k = B_k^{-1}$:
$$H_{k+1} = (I - \rho_k \mathbf{s}_k \mathbf{y}_k^T) H_k (I - \rho_k \mathbf{y}_k \mathbf{s}_k^T) + \rho_k \mathbf{s}_k \mathbf{s}_k^T$$
where $\mathbf{s}k = \mathbf{x}{k+1} - \mathbf{x}k$, $\mathbf{y}_k = \nabla f(\mathbf{x}{k+1}) - \nabla f(\mathbf{x}_k)$, $\rho_k = 1/(\mathbf{y}_k^T \mathbf{s}_k)$.
This is a rank-2 update to $H_k$ β computationally efficient ($O(n^2)$ vs $O(n^3)$ for full inversion) while maintaining positive definiteness.
L-BFGS: The Memory Trick
BFGS stores the full $n \times n$ matrix $H_k$ β still $O(n^2)$ memory, infeasible for large ML models.
L-BFGS (Limited-memory BFGS) stores only the last $m$ pairs of $(\mathbf{s}_k, \mathbf{y}_k)$ (typically $m = 10$β$20$) and reconstructs the matrix-vector product $H_k \nabla f(\mathbf{x}_k)$ on the fly via a two-loop recursion. Memory: $O(mn)$ instead of $O(n^2)$.
This makes L-BFGS practical for optimization problems with $n$ up to ~$10^5$, such as logistic regression or small neural networks, but still not for modern deep learning ($n \sim 10^7$β$10^{11}$).
Key Terms
- Hessian
- Hessian inversion
- Hessian size
- L-BFGS (Limited-memory BFGS)
- Non-convexity
- Saddle points
Worked Examples
Example 1: Newton on a Simple Quadratic
Apply Newton's method to $f(x,y) = x^2 + 4y^2$ from $(x_0,y_0) = (5,5)$.
Solution: $\nabla f = (2x,\; 8y)$, $\nabla^2 f = \begin{pmatrix} 2 & 0 \ 0 & 8 \end{pmatrix}$, $(\nabla^2 f)^{-1} = \begin{pmatrix} 1/2 & 0 \ 0 & 1/8 \end{pmatrix}$
Newton step: $\mathbf{p}_0 = -(\nabla^2 f)^{-1}\nabla f(5,5) = -\begin{pmatrix} 1/2 & 0 \ 0 & 1/8 \end{pmatrix} \begin{pmatrix} 10 \ 40 \end{pmatrix} = -(5, 5)$
$\mathbf{x}_1 = (5,5) - (5,5) = (0,0)$ β the exact minimum in one step!
This is expected: Newton solves quadratics in one iteration because the quadratic approximation IS the function.
Click for answer
Newton converges in one step to $(0,0)$, the exact minimum. For quadratics, the second-order Taylor expansion is exact.Example 2: Newton with Line Search
Apply one damped Newton step with line search to $f(x) = x^4$ from $x_0 = 1$.
Solution: $f'(x) = 4x^3$, $f''(x) = 12x^2$. Newton direction: $p = -f'(1)/f''(1) = -4/12 = -1/3$.
Full Newton step: $x = 1 - 1/3 = 2/3$. $f(2/3) = (2/3)^4 \approx 0.198$.
With backtracking line search ($\alpha = 1, 1/2, 1/4, \ldots$): - $\alpha=1$: $x=2/3$, $f=0.198$ β decrease β - Accept step. $x_1 = 2/3$.
Without line search, Newton can overshoot on non-quadratic functions. The line search ensures monotonic decrease.
Click for answer
Newton direction $p = -1/3$, step accepted with $\alpha=1$ since $f(2/3) < f(1)$. $x_1 = 2/3$.Example 3: BFGS Update
Given $\mathbf{s}_0 = (1, 0)$, $\mathbf{y}_0 = (2, 0)$ and initial $H_0 = I$, compute the BFGS update for $H_1$.
Solution: $\rho_0 = 1/(\mathbf{y}_0^T \mathbf{s}_0) = 1/(2 \cdot 1) = 1/2$
$H_1 = (I - \frac{1}{2}\mathbf{s}_0\mathbf{y}_0^T)I(I - \frac{1}{2}\mathbf{y}_0\mathbf{s}_0^T) + \frac{1}{2}\mathbf{s}_0\mathbf{s}_0^T$
$\mathbf{s}_0\mathbf{y}_0^T = \begin{pmatrix}1\0\end{pmatrix}(2,0) = \begin{pmatrix}2&0\0&0\end{pmatrix}$
$I - \frac{1}{2}\begin{pmatrix}2&0\0&0\end{pmatrix} = \begin{pmatrix}0&0\0&1\end{pmatrix}$
$H_1 = \begin{pmatrix}0&0\0&1\end{pmatrix} \begin{pmatrix}0&0\0&1\end{pmatrix} + \frac{1}{2}\begin{pmatrix}1&0\0&0\end{pmatrix} = \begin{pmatrix}0&0\0&1\end{pmatrix} + \begin{pmatrix}1/2&0\0&0\end{pmatrix} = \begin{pmatrix}1/2&0\0&1\end{pmatrix}$
The true inverse Hessian for $f(x,y) = x^2 + \frac{1}{2}y^2$ (consistent with $\mathbf{y}=(2,0)$ for $\mathbf{s}=(1,0)$) would be $\begin{pmatrix}1&0\0&2\end{pmatrix}$. BFGS is iteratively improving its estimate.
Click for answer
$H_1 = \begin{pmatrix}1/2&0\\0&1\end{pmatrix}$. BFGS constructs a rank-2 update that satisfies the secant condition: $H_1\mathbf{y}_0 = \mathbf{s}_0$, i.e. $H_1(2,0)^T = (1,0)^T$ β.Quiz
Q1: What does the concept of Hessian primarily refer to in this subject?
A) A computational error related to Hessian B) The definition and application of Hessian C) A visual representation of Hessian D) A historical anecdote about Hessian
Correct: B)
- If you chose A: This is incorrect. Hessian is defined as: the definition and application of hessian. The other options describe different aspects that are not the primary focus.
- If you chose B: Hessian is defined as: the definition and application of hessian. The other options describe different aspects that are not the primary focus. Correct!
- If you chose C: This is incorrect. Hessian is defined as: the definition and application of hessian. The other options describe different aspects that are not the primary focus.
- If you chose D: This is incorrect. Hessian is defined as: the definition and application of hessian. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) The inverse operation of the formula in question B) An unrelated formula from a different topic C) A simplified version of \nabla f... D) \nabla f
Correct: D)
- If you chose A: This is incorrect. The formula \nabla f is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: This is incorrect. The formula \nabla f is central to this subject. The other options are either simplified versions or unrelated.
- If you chose C: This is incorrect. The formula \nabla f is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: The formula \nabla f is central to this subject. The other options are either simplified versions or unrelated. Correct!
Q3: What is the primary purpose of Hessian size?
A) It replaces all other methods in this domain B) It is used to hessian size in mathematical analysis C) It is used only in advanced research contexts D) It is primarily a historical notation system
Correct: B)
- If you chose A: This is incorrect. Hessian size serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: Hessian size serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose C: This is incorrect. Hessian size serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. Hessian size serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Hessian inversion is TRUE?
A) Hessian inversion is an advanced topic beyond this subject's scope B) Hessian inversion is a fundamental concept covered in this subject C) Hessian inversion is not related to this subject D) Hessian inversion is mentioned only as a historical footnote
Correct: B)
- If you chose A: This is incorrect. Hessian inversion is a fundamental concept covered in this subject. This subject covers Hessian inversion as part of its core content.
- If you chose B: Hessian inversion is a fundamental concept covered in this subject. This subject covers Hessian inversion as part of its core content. Correct!
- If you chose C: This is incorrect. Hessian inversion is a fundamental concept covered in this subject. This subject covers Hessian inversion as part of its core content.
- If you chose D: This is incorrect. Hessian inversion is a fundamental concept covered in this subject. This subject covers Hessian inversion as part of its core content.
Q5: Based on the worked examples in this subject, what is the correct result?
A) An unrelated numerical value B) A different result from a common mistake C) Newton with Line Search D) The inverse of the correct answer
Correct: C)
- If you chose A: This is incorrect. The worked examples show that the result is Newton with Line Search. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is Newton with Line Search. The other options represent common errors.
- If you chose C: The worked examples show that the result is Newton with Line Search. The other options represent common errors. Correct!
- If you chose D: This is incorrect. The worked examples show that the result is Newton with Line Search. The other options represent common errors.
Q6: How are Hessian inversion and Non-convexity related?
A) Hessian inversion and Non-convexity are completely unrelated topics B) Hessian inversion is a special case of Non-convexity C) Hessian inversion is the inverse of Non-convexity D) Hessian inversion and Non-convexity are closely related concepts
Correct: D)
- If you chose A: This is incorrect. Both Hessian inversion and Non-convexity are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both Hessian inversion and Non-convexity are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Hessian inversion and Non-convexity are covered in this subject as interconnected topics.
- If you chose D: Both Hessian inversion and Non-convexity are covered in this subject as interconnected topics. Correct!
Q7: What is a common pitfall when working with Saddle points?
A) A common mistake is confusing Saddle points with a similar concept B) The main error with Saddle points is using it when it is not needed C) Saddle points is always computed the same way in all contexts D) Saddle points has no common misconceptions
Correct: A)
- If you chose A: Students often confuse Saddle points with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose B: This is incorrect. Students often confuse Saddle points with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse Saddle points with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse Saddle points with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply L-BFGS (Limited-memory BFGS)?
A) L-BFGS (Limited-memory BFGS) is not practically useful B) Use L-BFGS (Limited-memory BFGS) only in pure mathematics contexts C) Apply L-BFGS (Limited-memory BFGS) to solve problems in this subject's domain D) Avoid L-BFGS (Limited-memory BFGS) unless explicitly instructed
Correct: C)
- If you chose A: This is incorrect. L-BFGS (Limited-memory BFGS) is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. L-BFGS (Limited-memory BFGS) is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: L-BFGS (Limited-memory BFGS) is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose D: This is incorrect. L-BFGS (Limited-memory BFGS) is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
-
Show that for a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^TQ\mathbf{x} - \mathbf{b}^T\mathbf{x}$ with $Q \succ 0$, Newton's method converges in exactly one iteration from any starting point.
Click for answer
$\nabla f(\mathbf{x}) = Q\mathbf{x} - \mathbf{b}$, $\nabla^2 f = Q$. Newton step: $\mathbf{x}_1 = \mathbf{x}_0 - Q^{-1}(Q\mathbf{x}_0 - \mathbf{b}) = \mathbf{x}_0 - \mathbf{x}_0 + Q^{-1}\mathbf{b} = Q^{-1}\mathbf{b}$. $\nabla f(\mathbf{x}_1) = Q(Q^{-1}\mathbf{b}) - \mathbf{b} = 0$ β $\mathbf{x}_1 = \mathbf{x}^*$ in one step. -
Explain why Newton's method is scale-invariant but gradient descent is not. Demonstrate with $f(x)=ax^2$ for $a=1$ and $a=100$.
Click for answer
Gradient descent on $f(x)=ax^2$: step is $-\eta \cdot 2ax$. For $a=100$, same $\eta$ produces 100Γ larger step β must reduce $\eta$ or diverge. Newton on $f(x)=ax^2$: step is $-(2ax)/(2a) = -x$. Independent of $a$! One step to $x=0$ regardless of scaling. Newton implicitly normalizes by curvature, making it scale-invariant. -
L-BFGS stores $m$ vector pairs. What is the total memory cost in terms of $n$ and $m$, and why is $m=10$ typically sufficient?
Click for answer
Memory: $2mn$ (store $m$ pairs of $(\mathbf{s}, \mathbf{y})$, each $n$-dimensional). For $m=10$, $n=10^5$: $2 \cdot 10 \cdot 10^5 = 2 \times 10^6$ numbers β 16 MB (at float64). $m=10$ is sufficient because: (1) the most recent curvature information is most relevant; (2) the two-loop recursion implicitly encodes an $n \times n$ matrix from just $2m$ vectors; (3) empirically, $m=3$β$20$ works well on most problems. -
Derive the condition under which the Newton direction is a descent direction (i.e., $\mathbf{p}^T \nabla f < 0$).
Click for answer
$\mathbf{p} = -H^{-1}\nabla f$ (where $H = \nabla^2 f$). $\mathbf{p}^T \nabla f = -(H^{-1}\nabla f)^T \nabla f = -\nabla f^T H^{-1} \nabla f$. For this to be negative: $H^{-1} \succ 0$, i.e. $H \succ 0$ (Hessian positive definite). If $H$ has negative eigenvalues, $\nabla f^T H^{-1}\nabla f$ could be negative, making $\mathbf{p}^T \nabla f > 0$ β the Newton step would increase $f$! This is the saddle-point problem. -
Compare the per-iteration cost of GD ($O(n)$), BFGS ($O(n^2)$), and Newton ($O(n^3)$) for $n=10^6$.
Click for answer
GD: $O(n) = 10^6$ ops β trivial. BFGS: $O(n^2) = 10^{12}$ ops β heavy but possible with optimization. Newton: $O(n^3) = 10^{18}$ ops β completely infeasible. L-BFGS: $O(mn) = 10^7$ ops (with $m=10$) β very practical! This is why L-BFGS is used for medium-scale ML while Newton is strictly theoretical for large problems.
Summary
Key takeaways:
- Newton's method uses $\nabla^2 f$ to scale the step per-direction β quadratic convergence near the solution
- Scale invariance: Newton is unaffected by parameter rescaling β unlike gradient descent
- Quadratic convergence: the number of correct digits doubles each iteration near $\mathbf{x}^*$
- The $O(n^3)$ Hessian inversion and $O(n^2)$ storage make Newton infeasible for deep learning
- BFGS approximates the inverse Hessian with rank-2 updates from gradient history β $O(n^2)$ per step
- L-BFGS stores only the last $m$ gradient/step pairs β $O(mn)$ memory, practical for $n \sim 10^5$
- Newton is attracted to saddle points β a fatal flaw for non-convex optimization
Pitfalls
-
Attempting Newton's method on deep learning models. With $n \sim 10^7$β$10^{11}$ parameters, storing the $n \times n$ Hessian (~400 TB for $10^7$ parameters at float32) is impossible. Even if you could store it, the $O(n^3)$ inversion would take longer than the age of the universe.
-
Using Newton without line search or damping. On non-quadratic functions, the full Newton step can overshoot dramatically. Always use backtracking line search or add a damping term ($H + \lambda I$) to control step size.
-
Forgetting that Newton can go uphill. When $\nabla^2 f$ is not positive definite (non-convex regions), the Newton direction may not be a descent direction β $\mathbf{p}^T \nabla f$ can be positive. Damping with $\lambda I$ or trust-region methods are essential.
-
Setting L-BFGS memory $m$ too small. With $m=3$, only the most recent 3 gradient differences inform the Hessian approximation β insufficient curvature information. While $m=10$β$20$ is typical, very ill-conditioned problems may need $m=50$β$100$.
-
Expecting L-BFGS to work for modern neural networks. L-BFGS needs $O(mn)$ memory β with $n=10^8$ (a small transformer) and $m=10$, that's 8 GB at float64. Stochastic first-order methods are still the only practical choice for deep learning.
Next Steps
Next up: 14-06-convex-sets-functions.md β the theory of convex sets and convex functions, which underpins all guarantees in optimization.