Math graphic
πŸ“ Concept diagram

14-05 β€” Second-Order Methods

Phase: Optimization | Subject: 14-05 Prerequisites: 14-02-gradient-descent.md, 06-03-partial-derivatives.md Next subject: 14-06-convex-sets-functions.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive Newton's method from the second-order Taylor expansion
  2. Explain why Newton's method is scale-invariant and converges quadratically
  3. Describe quasi-Newton methods (BFGS) and how they approximate the Hessian
  4. Explain the L-BFGS memory trick that makes it practical for high dimensions
  5. Articulate why Newton's method is NOT used for deep learning despite its superior convergence rate

Core Content

Why Second Order?

First-order methods use only the gradient $\nabla f$ β€” they approximate $f$ as locally linear. Second-order methods use the Hessian $\nabla^2 f$ β€” they approximate $f$ as locally quadratic, capturing curvature information.

Newton's Method

Start with the second-order Taylor expansion around $\mathbf{x}_k$:

$$f(\mathbf{x}_k + \mathbf{p}) \approx f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T \mathbf{p} + \frac{1}{2}\mathbf{p}^T \nabla^2 f(\mathbf{x}_k) \mathbf{p}$$

Minimize this quadratic approximation with respect to $\mathbf{p}$:

$$\frac{\partial}{\partial\mathbf{p}}\left[f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T\mathbf{p} + \frac{1}{2}\mathbf{p}^T \nabla^2 f(\mathbf{x}_k)\mathbf{p}\right] = \nabla f(\mathbf{x}_k) + \nabla^2 f(\mathbf{x}_k)\mathbf{p} = 0$$

$$\mathbf{p}_k = -[\nabla^2 f(\mathbf{x}_k)]^{-1} \nabla f(\mathbf{x}_k)$$

Newton update:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - [\nabla^2 f(\mathbf{x}_k)]^{-1} \nabla f(\mathbf{x}_k)$$

Compare with gradient descent: $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla f(\mathbf{x}_k)$. Newton replaces the scalar $\eta$ with the inverse Hessian matrix β€” it scales the step differently in each direction based on curvature.

Scale Invariance

⚠️ CRITICAL β€” This is Newton's killer feature. Newton's method is invariant to affine transformations of the variables. If you rescale your parameters from $\mathbf{x}$ to $A\mathbf{x} + \mathbf{b}$, Newton's method produces exactly the same sequence of function values.

Gradient descent, by contrast, is highly sensitive to scaling β€” that's why we need adaptive methods like Adam. Newton builds scale-invariance directly into the math.

Proof sketch: Under $\mathbf{y} = A\mathbf{x}$, the Newton step in $\mathbf{y}$-space is $\mathbf{p}y = -(A^{-T}\nabla^2 f A^{-1})^{-1}(A^{-T}\nabla f) = -A \nabla^2 f^{-1} \nabla f = A\mathbf{p}_x$, so $\mathbf{y}{k+1} = A\mathbf{x}k + A\mathbf{p}_x = A\mathbf{x}{k+1}$ β€” the iterates correspond exactly.

Convergence: Quadratic!

For $f$ smooth and strongly convex near $\mathbf{x}^*$, with $\mathbf{x}_0$ sufficiently close:

$$|\mathbf{x}_{k+1} - \mathbf{x}^| \leq C|\mathbf{x}_k - \mathbf{x}^|^2$$

The error is squared at each step! If $|\mathbf{x}_k - \mathbf{x}^| = 10^{-3}$, then $|\mathbf{x}_{k+1} - \mathbf{x}^| \approx C \cdot 10^{-6}$. The number of correct digits doubles each iteration.

Compare: GD on strongly convex functions converges linearly (fixed fraction reduction per step); Newton converges quadratically (exponentiation of digits).

The Catch: Why Newton Isn't Used in Deep Learning

Problem Explanation
Hessian size $\nabla^2 f \in \mathbb{R}^{n \times n}$ β€” for $n=10^7$ parameters (a small transformer), the Hessian has $10^{14}$ entries (~800 TB). Impossible to store.
Hessian inversion $O(n^3)$ per step β€” completely impractical for large $n$.
Non-convexity On non-convex problems, $\nabla^2 f$ may not be positive definite. The Newton direction could point uphill. Need modifications (damped Newton, trust-region).
Saddle points Newton is attracted to saddle points! Since $\nabla f=0$ at saddles, Newton dives right into them rather than escaping like SGD does.

Quasi-Newton Methods: BFGS

BFGS (Broyden–Fletcher–Goldfarb–Shanno) approximates the inverse Hessian directly from gradient history, avoiding $O(n^3)$ inversions.

Let $B_k \approx \nabla^2 f(\mathbf{x}_k)$ be the Hessian approximation. The secant condition:

$$B_{k+1}(\mathbf{x}{k+1} - \mathbf{x}_k) = \nabla f(\mathbf{x}{k+1}) - \nabla f(\mathbf{x}_k)$$

Intuitively: the change in parameters times the Hessian should equal the change in gradient (this is exact for quadratics).

The BFGS update to the inverse Hessian $H_k = B_k^{-1}$:

$$H_{k+1} = (I - \rho_k \mathbf{s}_k \mathbf{y}_k^T) H_k (I - \rho_k \mathbf{y}_k \mathbf{s}_k^T) + \rho_k \mathbf{s}_k \mathbf{s}_k^T$$

where $\mathbf{s}k = \mathbf{x}{k+1} - \mathbf{x}k$, $\mathbf{y}_k = \nabla f(\mathbf{x}{k+1}) - \nabla f(\mathbf{x}_k)$, $\rho_k = 1/(\mathbf{y}_k^T \mathbf{s}_k)$.

This is a rank-2 update to $H_k$ β€” computationally efficient ($O(n^2)$ vs $O(n^3)$ for full inversion) while maintaining positive definiteness.

L-BFGS: The Memory Trick

BFGS stores the full $n \times n$ matrix $H_k$ β€” still $O(n^2)$ memory, infeasible for large ML models.

L-BFGS (Limited-memory BFGS) stores only the last $m$ pairs of $(\mathbf{s}_k, \mathbf{y}_k)$ (typically $m = 10$–$20$) and reconstructs the matrix-vector product $H_k \nabla f(\mathbf{x}_k)$ on the fly via a two-loop recursion. Memory: $O(mn)$ instead of $O(n^2)$.

This makes L-BFGS practical for optimization problems with $n$ up to ~$10^5$, such as logistic regression or small neural networks, but still not for modern deep learning ($n \sim 10^7$–$10^{11}$).



Key Terms

Worked Examples

Example 1: Newton on a Simple Quadratic

Apply Newton's method to $f(x,y) = x^2 + 4y^2$ from $(x_0,y_0) = (5,5)$.

Solution: $\nabla f = (2x,\; 8y)$, $\nabla^2 f = \begin{pmatrix} 2 & 0 \ 0 & 8 \end{pmatrix}$, $(\nabla^2 f)^{-1} = \begin{pmatrix} 1/2 & 0 \ 0 & 1/8 \end{pmatrix}$

Newton step: $\mathbf{p}_0 = -(\nabla^2 f)^{-1}\nabla f(5,5) = -\begin{pmatrix} 1/2 & 0 \ 0 & 1/8 \end{pmatrix} \begin{pmatrix} 10 \ 40 \end{pmatrix} = -(5, 5)$

$\mathbf{x}_1 = (5,5) - (5,5) = (0,0)$ β€” the exact minimum in one step!

This is expected: Newton solves quadratics in one iteration because the quadratic approximation IS the function.

Click for answer Newton converges in one step to $(0,0)$, the exact minimum. For quadratics, the second-order Taylor expansion is exact.

Apply one damped Newton step with line search to $f(x) = x^4$ from $x_0 = 1$.

Solution: $f'(x) = 4x^3$, $f''(x) = 12x^2$. Newton direction: $p = -f'(1)/f''(1) = -4/12 = -1/3$.

Full Newton step: $x = 1 - 1/3 = 2/3$. $f(2/3) = (2/3)^4 \approx 0.198$.

With backtracking line search ($\alpha = 1, 1/2, 1/4, \ldots$): - $\alpha=1$: $x=2/3$, $f=0.198$ β€” decrease βœ“ - Accept step. $x_1 = 2/3$.

Without line search, Newton can overshoot on non-quadratic functions. The line search ensures monotonic decrease.

Click for answer Newton direction $p = -1/3$, step accepted with $\alpha=1$ since $f(2/3) < f(1)$. $x_1 = 2/3$.

Example 3: BFGS Update

Given $\mathbf{s}_0 = (1, 0)$, $\mathbf{y}_0 = (2, 0)$ and initial $H_0 = I$, compute the BFGS update for $H_1$.

Solution: $\rho_0 = 1/(\mathbf{y}_0^T \mathbf{s}_0) = 1/(2 \cdot 1) = 1/2$

$H_1 = (I - \frac{1}{2}\mathbf{s}_0\mathbf{y}_0^T)I(I - \frac{1}{2}\mathbf{y}_0\mathbf{s}_0^T) + \frac{1}{2}\mathbf{s}_0\mathbf{s}_0^T$

$\mathbf{s}_0\mathbf{y}_0^T = \begin{pmatrix}1\0\end{pmatrix}(2,0) = \begin{pmatrix}2&0\0&0\end{pmatrix}$

$I - \frac{1}{2}\begin{pmatrix}2&0\0&0\end{pmatrix} = \begin{pmatrix}0&0\0&1\end{pmatrix}$

$H_1 = \begin{pmatrix}0&0\0&1\end{pmatrix} \begin{pmatrix}0&0\0&1\end{pmatrix} + \frac{1}{2}\begin{pmatrix}1&0\0&0\end{pmatrix} = \begin{pmatrix}0&0\0&1\end{pmatrix} + \begin{pmatrix}1/2&0\0&0\end{pmatrix} = \begin{pmatrix}1/2&0\0&1\end{pmatrix}$

The true inverse Hessian for $f(x,y) = x^2 + \frac{1}{2}y^2$ (consistent with $\mathbf{y}=(2,0)$ for $\mathbf{s}=(1,0)$) would be $\begin{pmatrix}1&0\0&2\end{pmatrix}$. BFGS is iteratively improving its estimate.

Click for answer $H_1 = \begin{pmatrix}1/2&0\\0&1\end{pmatrix}$. BFGS constructs a rank-2 update that satisfies the secant condition: $H_1\mathbf{y}_0 = \mathbf{s}_0$, i.e. $H_1(2,0)^T = (1,0)^T$ βœ“.


Quiz

Q1: What does the concept of Hessian primarily refer to in this subject?

A) A computational error related to Hessian B) The definition and application of Hessian C) A visual representation of Hessian D) A historical anecdote about Hessian

Correct: B)

Q2: Which of the following is the key formula discussed in this subject?

A) The inverse operation of the formula in question B) An unrelated formula from a different topic C) A simplified version of \nabla f... D) \nabla f

Correct: D)

Q3: What is the primary purpose of Hessian size?

A) It replaces all other methods in this domain B) It is used to hessian size in mathematical analysis C) It is used only in advanced research contexts D) It is primarily a historical notation system

Correct: B)

Q4: Which statement about Hessian inversion is TRUE?

A) Hessian inversion is an advanced topic beyond this subject's scope B) Hessian inversion is a fundamental concept covered in this subject C) Hessian inversion is not related to this subject D) Hessian inversion is mentioned only as a historical footnote

Correct: B)

Q5: Based on the worked examples in this subject, what is the correct result?

A) An unrelated numerical value B) A different result from a common mistake C) Newton with Line Search D) The inverse of the correct answer

Correct: C)

Q6: How are Hessian inversion and Non-convexity related?

A) Hessian inversion and Non-convexity are completely unrelated topics B) Hessian inversion is a special case of Non-convexity C) Hessian inversion is the inverse of Non-convexity D) Hessian inversion and Non-convexity are closely related concepts

Correct: D)

Q7: What is a common pitfall when working with Saddle points?

A) A common mistake is confusing Saddle points with a similar concept B) The main error with Saddle points is using it when it is not needed C) Saddle points is always computed the same way in all contexts D) Saddle points has no common misconceptions

Correct: A)

Q8: When should you apply L-BFGS (Limited-memory BFGS)?

A) L-BFGS (Limited-memory BFGS) is not practically useful B) Use L-BFGS (Limited-memory BFGS) only in pure mathematics contexts C) Apply L-BFGS (Limited-memory BFGS) to solve problems in this subject's domain D) Avoid L-BFGS (Limited-memory BFGS) unless explicitly instructed

Correct: C)

Practice Problems

  1. Show that for a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^TQ\mathbf{x} - \mathbf{b}^T\mathbf{x}$ with $Q \succ 0$, Newton's method converges in exactly one iteration from any starting point.

    Click for answer $\nabla f(\mathbf{x}) = Q\mathbf{x} - \mathbf{b}$, $\nabla^2 f = Q$. Newton step: $\mathbf{x}_1 = \mathbf{x}_0 - Q^{-1}(Q\mathbf{x}_0 - \mathbf{b}) = \mathbf{x}_0 - \mathbf{x}_0 + Q^{-1}\mathbf{b} = Q^{-1}\mathbf{b}$. $\nabla f(\mathbf{x}_1) = Q(Q^{-1}\mathbf{b}) - \mathbf{b} = 0$ β†’ $\mathbf{x}_1 = \mathbf{x}^*$ in one step.

  2. Explain why Newton's method is scale-invariant but gradient descent is not. Demonstrate with $f(x)=ax^2$ for $a=1$ and $a=100$.

    Click for answer Gradient descent on $f(x)=ax^2$: step is $-\eta \cdot 2ax$. For $a=100$, same $\eta$ produces 100Γ— larger step β†’ must reduce $\eta$ or diverge. Newton on $f(x)=ax^2$: step is $-(2ax)/(2a) = -x$. Independent of $a$! One step to $x=0$ regardless of scaling. Newton implicitly normalizes by curvature, making it scale-invariant.

  3. L-BFGS stores $m$ vector pairs. What is the total memory cost in terms of $n$ and $m$, and why is $m=10$ typically sufficient?

    Click for answer Memory: $2mn$ (store $m$ pairs of $(\mathbf{s}, \mathbf{y})$, each $n$-dimensional). For $m=10$, $n=10^5$: $2 \cdot 10 \cdot 10^5 = 2 \times 10^6$ numbers β‰ˆ 16 MB (at float64). $m=10$ is sufficient because: (1) the most recent curvature information is most relevant; (2) the two-loop recursion implicitly encodes an $n \times n$ matrix from just $2m$ vectors; (3) empirically, $m=3$–$20$ works well on most problems.

  4. Derive the condition under which the Newton direction is a descent direction (i.e., $\mathbf{p}^T \nabla f < 0$).

    Click for answer $\mathbf{p} = -H^{-1}\nabla f$ (where $H = \nabla^2 f$). $\mathbf{p}^T \nabla f = -(H^{-1}\nabla f)^T \nabla f = -\nabla f^T H^{-1} \nabla f$. For this to be negative: $H^{-1} \succ 0$, i.e. $H \succ 0$ (Hessian positive definite). If $H$ has negative eigenvalues, $\nabla f^T H^{-1}\nabla f$ could be negative, making $\mathbf{p}^T \nabla f > 0$ β€” the Newton step would increase $f$! This is the saddle-point problem.

  5. Compare the per-iteration cost of GD ($O(n)$), BFGS ($O(n^2)$), and Newton ($O(n^3)$) for $n=10^6$.

    Click for answer GD: $O(n) = 10^6$ ops β€” trivial. BFGS: $O(n^2) = 10^{12}$ ops β€” heavy but possible with optimization. Newton: $O(n^3) = 10^{18}$ ops β€” completely infeasible. L-BFGS: $O(mn) = 10^7$ ops (with $m=10$) β€” very practical! This is why L-BFGS is used for medium-scale ML while Newton is strictly theoretical for large problems.


Summary

Key takeaways:


Pitfalls

  1. Attempting Newton's method on deep learning models. With $n \sim 10^7$–$10^{11}$ parameters, storing the $n \times n$ Hessian (~400 TB for $10^7$ parameters at float32) is impossible. Even if you could store it, the $O(n^3)$ inversion would take longer than the age of the universe.

  2. Using Newton without line search or damping. On non-quadratic functions, the full Newton step can overshoot dramatically. Always use backtracking line search or add a damping term ($H + \lambda I$) to control step size.

  3. Forgetting that Newton can go uphill. When $\nabla^2 f$ is not positive definite (non-convex regions), the Newton direction may not be a descent direction β€” $\mathbf{p}^T \nabla f$ can be positive. Damping with $\lambda I$ or trust-region methods are essential.

  4. Setting L-BFGS memory $m$ too small. With $m=3$, only the most recent 3 gradient differences inform the Hessian approximation β€” insufficient curvature information. While $m=10$–$20$ is typical, very ill-conditioned problems may need $m=50$–$100$.

  5. Expecting L-BFGS to work for modern neural networks. L-BFGS needs $O(mn)$ memory β€” with $n=10^8$ (a small transformer) and $m=10$, that's 8 GB at float64. Stochastic first-order methods are still the only practical choice for deep learning.



Next Steps

Next up: 14-06-convex-sets-functions.md β€” the theory of convex sets and convex functions, which underpins all guarantees in optimization.