Math graphic
📐 Concept diagram

14-10 — Stochastic and Non-Convex Optimization

Phase: Optimization | Subject: 14-10 Prerequisites: 14-03-variants-gradient-descent.md, 14-06-convex-sets-functions.md Next subject: 15-01-floating-point-arithmetic.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Explain why gradient descent finds "good enough" solutions on non-convex problems in practice
  2. Describe the strict saddle property and why SGD escapes saddle points
  3. State convergence guarantees for SGD on non-convex objectives (to stationary points)
  4. Discuss the role of overparameterization in making neural network loss surfaces benign
  5. Apply practical heuristics for training non-convex models (learning rate warmup, scheduling, early stopping)

Core Content

The Non-Convex Reality

Almost every interesting ML model involves non-convex optimization. The loss landscape of a neural network has: - Many local minima (but most are nearly as good as the global minimum, especially for overparameterized models) - Countless saddle points (the real enemy) - Flat regions (plateaus) where gradients vanish

Yet gradient-based methods work remarkably well. Why?

The Strict Saddle Property

A saddle point $\mathbf{x}$ is strict if $\lambda_{\min}(\nabla^2 f(\mathbf{x})) < 0$ — the Hessian has at least one strictly negative eigenvalue, meaning there exists a direction of descent.

Key result (Ge et al., 2015; Jin et al., 2017): For functions where all saddle points are strict, gradient descent with random initialization (or small perturbations) almost surely converges to a local minimum, not a saddle point.

Why SGD escapes saddles naturally: The stochastic gradient noise acts as a built-in perturbation mechanism. At a saddle point where $\nabla f \approx \mathbf{0}$, the noise variance keeps the iterate moving. Along negative-curvature directions, this noise pushes the iterate downhill and away from the saddle.

⚠️ CRITICAL — Saddles, not local minima, are the main obstacle in high dimensions. The number of saddle points grows exponentially with dimension, while the number of "bad" local minima may be small. Most theoretical and empirical evidence suggests neural network loss surfaces are dominated by saddles, not spurious local minima.

Convergence to Stationary Points

For non-convex $L$-smooth $f$, SGD with $\eta = 1/L$ achieves:

$$\min_{k=0,\ldots,K-1} \mathbb{E}[|\nabla f(\mathbf{x}_k)|^2] \leq \frac{2L(f(\mathbf{x}_0) - f^*)}{K} + \frac{\sigma^2}{B}$$

where $\sigma^2$ is gradient variance and $B$ is batch size.

This is an $O(1/K + 1/B)$ rate to a stationary point ($\nabla f \approx 0$), not necessarily a local minimum. But combined with the strict saddle property, we can upgrade this: SGD finds an approximate local minimum in polynomial time.

The Overparameterization Miracle

Modern neural networks are massively overparameterized — they have far more parameters than training examples. This turns out to be a feature, not a bug:

  1. Wider loss basins: Overparameterized networks have larger, flatter basins of attraction around good minima
  2. Linearized regime (NTK): In the infinite-width limit, training dynamics are exactly linear and the loss is convex in parameter space
  3. All local minima are global: Under certain assumptions (sufficient width), every local minimum achieves zero training loss
  4. Implicit regularization: SGD with small step size biases toward solutions with small norm — the "implicit bias" toward simple functions

Practical Heuristics for Non-Convex Training

Technique What It Does When To Use
LR warmup Start with small $\eta$, linearly increase to target Transformers, large models — prevents early divergence
Cosine decay $\eta_k = \eta_0 \cdot \frac{1}{2}(1 + \cos(\pi k/K))$ Default schedule for image models
Early stopping Stop when validation loss stops improving Always — prevents overfitting
Gradient clipping $\mathbf{g} \leftarrow \mathbf{g} \cdot \min(1, c/|\mathbf{g}|)$ RNNs, transformers — prevents gradient explosion
Label smoothing Soften one-hot targets: $y \leftarrow (1-\epsilon)y + \epsilon/K$ Improves generalization, makes loss surface smoother
Batch normalization Normalize activations per mini-batch Speeds training, allows higher LR, smooths landscape


Key Terms

Worked Examples

Example 1: SGD Escapes a Saddle

Consider $f(x, y) = x^2 - y^2$ (saddle at origin). Show that SGD with noise has a non-zero chance of escaping.

Solution: $\nabla f = (2x, -2y)$. At $(0,0)$, $\nabla f = (0,0)$ — the saddle.

Deterministic GD initialized at $(0, 0)$: stays at $(0, 0)$ forever.

SGD with noisy gradient $\hat{\nabla}f = \nabla f + \boldsymbol{\xi}$ where $\boldsymbol{\xi} \sim \mathcal{N}(0, \sigma^2 I)$: At $(0,0)$, the update is $\mathbf{x}_{k+1} = -\eta \boldsymbol{\xi}_k$. The noise pushes the iterate away from the saddle. Once slightly displaced, the deterministic gradient component takes over: in the $y$-direction, $\partial f/\partial y = -2y$, so if $\xi_y \neq 0$, the iterate is pushed away and $-2y$ amplifies the displacement (unstable direction). In the $x$-direction, $2x$ pushes back toward 0 (stable direction).

The net effect: SGD is repelled from the saddle along the negative-curvature $y$-direction.

Click for answer SGD noise displaces the iterate from the saddle. The negative Hessian eigenvalue along $y$ amplifies the displacement, and the iterate escapes. Deterministic GD would remain stuck forever.

Example 2: Gradient Clipping

Compute the clipped gradient for $\mathbf{g} = (3, 4)$ with max norm $c = 5$.

Solution: $|\mathbf{g}| = \sqrt{3^2 + 4^2} = 5$. Since $|\mathbf{g}| = 5 \leq c = 5$, no clipping needed — return $\mathbf{g}$.

For $\mathbf{g} = (6, 8)$: $|\mathbf{g}| = 10$, so $\mathbf{g}_{\text{clipped}} = (6, 8) \cdot \frac{5}{10} = (3, 4)$.

This prevents a single large gradient from destabilizing training while preserving the direction.

Click for answer $\mathbf{g}=(3,4)$: $\|\mathbf{g}\|=5 \leq 5$, no clipping. $\mathbf{g}=(6,8)$: $\|\mathbf{g}\|=10$, clipped to $(3,4)$.

Example 3: Cosine Learning Rate Decay

Compute the learning rate at step $k = 750$ of $K = 1000$ total steps, with $\eta_0 = 0.1$.

Solution: $\eta_k = \frac{0.1}{2}(1 + \cos(\pi \cdot 750/1000)) = 0.05(1 + \cos(0.75\pi)) = 0.05(1 - 0.7071) = 0.05(0.2929) = 0.0146$.

At $k=500$ (halfway): $\eta = 0.05(1 + \cos(\pi/2)) = 0.05(1+0) = 0.05$. At $k=0$: $\eta = 0.1$. At $k=1000$: $\eta = 0.05(1 + \cos(\pi)) = 0.05(1-1) = 0$.

The schedule starts at $\eta_0$, decays smoothly to 0.

Click for answer $\eta_{750} \approx 0.0146$. The rate follows a cosine curve from $\eta_0=0.1$ at $k=0$ to $0$ at $k=1000$.


Quiz

Q1: What does the concept of LR warmup primarily refer to in this subject?

A) The definition and application of LR warmup B) A historical anecdote about LR warmup C) A visual representation of LR warmup D) A computational error related to LR warmup

Correct: A)

Q2: Which of the following is the key formula discussed in this subject?

A) The inverse operation of the formula in question B) An unrelated formula from a different topic C) A simplified version of \mathbf{x}... D) \mathbf{x}

Correct: D)

Q3: What is the primary purpose of Cosine decay?

A) It is used to cosine decay in mathematical analysis B) It is primarily a historical notation system C) It replaces all other methods in this domain D) It is used only in advanced research contexts

Correct: A)

Q4: Which statement about Early stopping is TRUE?

A) Early stopping is not related to this subject B) Early stopping is a fundamental concept covered in this subject C) Early stopping is an advanced topic beyond this subject's scope D) Early stopping is mentioned only as a historical footnote

Correct: B)

Q5: Based on the worked examples in this subject, what is the correct result?

A) Gradient Clipping B) The inverse of the correct answer C) An unrelated numerical value D) A different result from a common mistake

Correct: A)

Q6: How are Early stopping and Gradient clipping related?

A) Early stopping is the inverse of Gradient clipping B) Early stopping is a special case of Gradient clipping C) Early stopping and Gradient clipping are completely unrelated topics D) Early stopping and Gradient clipping are closely related concepts

Correct: D)

Q7: What is a common pitfall when working with Label smoothing?

A) Label smoothing has no common misconceptions B) Label smoothing is always computed the same way in all contexts C) The main error with Label smoothing is using it when it is not needed D) A common mistake is confusing Label smoothing with a similar concept

Correct: D)

Q8: When should you apply Batch normalization?

A) Use Batch normalization only in pure mathematics contexts B) Apply Batch normalization to solve problems in this subject's domain C) Avoid Batch normalization unless explicitly instructed D) Batch normalization is not practically useful

Correct: B)

Practice Problems

  1. Explain why a random initialization helps gradient descent avoid saddle points, even without stochastic noise.

    Click for answer The set of initializations that converge to a strict saddle has measure zero (it's a lower-dimensional manifold — the stable manifold). A random initialization almost surely lands off this manifold, so deterministic GD also escapes saddles if properly initialized. The probability of landing exactly on the stable manifold of a saddle is zero for any continuous distribution. This is the theoretical basis for the "random initialization + GD works" result.

  2. For a neural network with $N$ training examples and $P \gg N$ parameters, explain the "interpolation" regime where training loss reaches zero.

    Click for answer When $P > N$, the system is underdetermined — there are more parameters than constraints. For many architectures, it's possible to exactly fit (interpolate) the training data, achieving zero training loss. The interesting question becomes: which of the infinitely many zero-loss solutions does SGD find? Empirically, SGD finds solutions that generalize well (the "implicit bias" or "implicit regularization" phenomenon).

  3. Compute the effective step in a negative-curvature direction under SGD for $f(x) = -\frac{1}{2}ax^2$ near $x=0$ with noise variance $\sigma^2$.

    Click for answer $\nabla f = -ax$. SGD update: $x_{k+1} = x_k - \eta(-ax_k + \xi_k) = (1 + \eta a)x_k - \eta\xi_k$. Since $1 + \eta a > 1$, the deterministic component amplifies $|x|$ — this is the escape from the saddle. The noise $\xi_k$ provides the initial displacement from exactly zero. $\mathbb{E}[x_{k+1}^2] = (1+\eta a)^2\mathbb{E}[x_k^2] + \eta^2\sigma^2$. The iterate diverges away from the saddle exponentially fast (in mean square).

  4. A loss surface has "good" local minima at loss 0.1 and "bad" local minima at loss 5.0. What property of overparameterized networks makes the bad minima rare or unreachable?

    Click for answer In overparameterized networks, the loss landscape is "benign": (1) high-loss critical points tend to be saddles, not minima — there's always some direction of escape; (2) the basin of attraction for bad minima (if they exist) shrinks with overparameterization; (3) SGD's noise and optimization trajectory naturally bias toward wide, flat minima (which tend to have lower loss and better generalization). The "loss landscape is mostly saddles plus a connected manifold of good minima" is the dominant picture supported by empirical and theoretical work.

  5. Design a learning rate warmup schedule: $\eta$ starts at 0.0001, linearly increases to 0.001 over 1000 steps, then cosine-decays to 0 over 9000 more steps. What is $\eta$ at step 500? At step 5000?

    Click for answer Warmup (steps 0–1000): $\eta_k = 0.0001 + (0.001 - 0.0001) \cdot k/1000$ At $k=500$: $\eta_{500} = 0.0001 + 0.0009 \cdot 0.5 = 0.00055$. Cosine decay (steps 1000–10000): $\eta_k = 0.001 \cdot \frac{1}{2}(1 + \cos(\pi \cdot \frac{k-1000}{9000}))$ At $k=5000$: $t = (5000-1000)/9000 = 4/9 \approx 0.4444$. $\eta_{5000} = 0.001 \cdot 0.5(1 + \cos(0.4444\pi)) = 0.0005(1 + 0.1736) = 0.000587$.


Summary

Key takeaways:


Pitfalls

  1. Assuming all local minima in neural networks are bad. In overparameterized networks ($P \gg N$), most local minima achieve near-optimal loss. The real enemy is saddle points, which are exponentially more numerous — and SGD's noise helps escape them.

  2. Using full-batch gradient descent on non-convex problems thinking it's safer. Without stochastic noise, deterministic GD stalls exactly at saddle points (gradient is zero). SGD escapes because noise displaces the iterate, allowing negative-curvature directions to amplify the escape.

  3. Setting the learning rate too low to escape plateaus. A learning rate that's too small means the iterate barely moves from one flat region to the next. Slightly higher learning rates help SGD traverse the landscape, but too high causes divergence — the sweet spot is problem-dependent.

  4. Forgetting gradient clipping on recurrent architectures. RNNs, LSTMs, and transformers frequently produce gradient norms exceeding $10^3$ due to the chain rule through time. Without clipping, a single large gradient can irreversibly damage the parameters. Clip at $c=1$–$5$ as a starting point.

  5. Treating early stopping as only a regularization technique. Early stopping prevents overfitting, but it also prevents SGD from converging to sharp minima. Combined with the implicit bias of SGD, stopping before the training loss hits zero often yields better generalization than training to completion.



Next Steps

Next up: 15-01-floating-point-arithmetic.md — IEEE 754, numerical precision, and why $0.1 + 0.2 \neq 0.3$ in floating point.