14-10 — Stochastic and Non-Convex Optimization
Phase: Optimization | Subject: 14-10 Prerequisites: 14-03-variants-gradient-descent.md, 14-06-convex-sets-functions.md Next subject: 15-01-floating-point-arithmetic.md
Learning Objectives
By the end of this subject, you will be able to:
- Explain why gradient descent finds "good enough" solutions on non-convex problems in practice
- Describe the strict saddle property and why SGD escapes saddle points
- State convergence guarantees for SGD on non-convex objectives (to stationary points)
- Discuss the role of overparameterization in making neural network loss surfaces benign
- Apply practical heuristics for training non-convex models (learning rate warmup, scheduling, early stopping)
Core Content
The Non-Convex Reality
Almost every interesting ML model involves non-convex optimization. The loss landscape of a neural network has: - Many local minima (but most are nearly as good as the global minimum, especially for overparameterized models) - Countless saddle points (the real enemy) - Flat regions (plateaus) where gradients vanish
Yet gradient-based methods work remarkably well. Why?
The Strict Saddle Property
A saddle point $\mathbf{x}$ is strict if $\lambda_{\min}(\nabla^2 f(\mathbf{x})) < 0$ — the Hessian has at least one strictly negative eigenvalue, meaning there exists a direction of descent.
Key result (Ge et al., 2015; Jin et al., 2017): For functions where all saddle points are strict, gradient descent with random initialization (or small perturbations) almost surely converges to a local minimum, not a saddle point.
Why SGD escapes saddles naturally: The stochastic gradient noise acts as a built-in perturbation mechanism. At a saddle point where $\nabla f \approx \mathbf{0}$, the noise variance keeps the iterate moving. Along negative-curvature directions, this noise pushes the iterate downhill and away from the saddle.
⚠️ CRITICAL — Saddles, not local minima, are the main obstacle in high dimensions. The number of saddle points grows exponentially with dimension, while the number of "bad" local minima may be small. Most theoretical and empirical evidence suggests neural network loss surfaces are dominated by saddles, not spurious local minima.
Convergence to Stationary Points
For non-convex $L$-smooth $f$, SGD with $\eta = 1/L$ achieves:
$$\min_{k=0,\ldots,K-1} \mathbb{E}[|\nabla f(\mathbf{x}_k)|^2] \leq \frac{2L(f(\mathbf{x}_0) - f^*)}{K} + \frac{\sigma^2}{B}$$
where $\sigma^2$ is gradient variance and $B$ is batch size.
This is an $O(1/K + 1/B)$ rate to a stationary point ($\nabla f \approx 0$), not necessarily a local minimum. But combined with the strict saddle property, we can upgrade this: SGD finds an approximate local minimum in polynomial time.
The Overparameterization Miracle
Modern neural networks are massively overparameterized — they have far more parameters than training examples. This turns out to be a feature, not a bug:
- Wider loss basins: Overparameterized networks have larger, flatter basins of attraction around good minima
- Linearized regime (NTK): In the infinite-width limit, training dynamics are exactly linear and the loss is convex in parameter space
- All local minima are global: Under certain assumptions (sufficient width), every local minimum achieves zero training loss
- Implicit regularization: SGD with small step size biases toward solutions with small norm — the "implicit bias" toward simple functions
Practical Heuristics for Non-Convex Training
| Technique | What It Does | When To Use |
|---|---|---|
| LR warmup | Start with small $\eta$, linearly increase to target | Transformers, large models — prevents early divergence |
| Cosine decay | $\eta_k = \eta_0 \cdot \frac{1}{2}(1 + \cos(\pi k/K))$ | Default schedule for image models |
| Early stopping | Stop when validation loss stops improving | Always — prevents overfitting |
| Gradient clipping | $\mathbf{g} \leftarrow \mathbf{g} \cdot \min(1, c/|\mathbf{g}|)$ | RNNs, transformers — prevents gradient explosion |
| Label smoothing | Soften one-hot targets: $y \leftarrow (1-\epsilon)y + \epsilon/K$ | Improves generalization, makes loss surface smoother |
| Batch normalization | Normalize activations per mini-batch | Speeds training, allows higher LR, smooths landscape |
Key Terms
- Batch normalization
- Cosine decay
- Early stopping
- Gradient clipping
- LR warmup
- Label smoothing
Worked Examples
Example 1: SGD Escapes a Saddle
Consider $f(x, y) = x^2 - y^2$ (saddle at origin). Show that SGD with noise has a non-zero chance of escaping.
Solution: $\nabla f = (2x, -2y)$. At $(0,0)$, $\nabla f = (0,0)$ — the saddle.
Deterministic GD initialized at $(0, 0)$: stays at $(0, 0)$ forever.
SGD with noisy gradient $\hat{\nabla}f = \nabla f + \boldsymbol{\xi}$ where $\boldsymbol{\xi} \sim \mathcal{N}(0, \sigma^2 I)$: At $(0,0)$, the update is $\mathbf{x}_{k+1} = -\eta \boldsymbol{\xi}_k$. The noise pushes the iterate away from the saddle. Once slightly displaced, the deterministic gradient component takes over: in the $y$-direction, $\partial f/\partial y = -2y$, so if $\xi_y \neq 0$, the iterate is pushed away and $-2y$ amplifies the displacement (unstable direction). In the $x$-direction, $2x$ pushes back toward 0 (stable direction).
The net effect: SGD is repelled from the saddle along the negative-curvature $y$-direction.
Click for answer
SGD noise displaces the iterate from the saddle. The negative Hessian eigenvalue along $y$ amplifies the displacement, and the iterate escapes. Deterministic GD would remain stuck forever.Example 2: Gradient Clipping
Compute the clipped gradient for $\mathbf{g} = (3, 4)$ with max norm $c = 5$.
Solution: $|\mathbf{g}| = \sqrt{3^2 + 4^2} = 5$. Since $|\mathbf{g}| = 5 \leq c = 5$, no clipping needed — return $\mathbf{g}$.
For $\mathbf{g} = (6, 8)$: $|\mathbf{g}| = 10$, so $\mathbf{g}_{\text{clipped}} = (6, 8) \cdot \frac{5}{10} = (3, 4)$.
This prevents a single large gradient from destabilizing training while preserving the direction.
Click for answer
$\mathbf{g}=(3,4)$: $\|\mathbf{g}\|=5 \leq 5$, no clipping. $\mathbf{g}=(6,8)$: $\|\mathbf{g}\|=10$, clipped to $(3,4)$.Example 3: Cosine Learning Rate Decay
Compute the learning rate at step $k = 750$ of $K = 1000$ total steps, with $\eta_0 = 0.1$.
Solution: $\eta_k = \frac{0.1}{2}(1 + \cos(\pi \cdot 750/1000)) = 0.05(1 + \cos(0.75\pi)) = 0.05(1 - 0.7071) = 0.05(0.2929) = 0.0146$.
At $k=500$ (halfway): $\eta = 0.05(1 + \cos(\pi/2)) = 0.05(1+0) = 0.05$. At $k=0$: $\eta = 0.1$. At $k=1000$: $\eta = 0.05(1 + \cos(\pi)) = 0.05(1-1) = 0$.
The schedule starts at $\eta_0$, decays smoothly to 0.
Click for answer
$\eta_{750} \approx 0.0146$. The rate follows a cosine curve from $\eta_0=0.1$ at $k=0$ to $0$ at $k=1000$.Quiz
Q1: What does the concept of LR warmup primarily refer to in this subject?
A) The definition and application of LR warmup B) A historical anecdote about LR warmup C) A visual representation of LR warmup D) A computational error related to LR warmup
Correct: A)
- If you chose A: LR warmup is defined as: the definition and application of lr warmup. The other options describe different aspects that are not the primary focus. Correct!
- If you chose B: This is incorrect. LR warmup is defined as: the definition and application of lr warmup. The other options describe different aspects that are not the primary focus.
- If you chose C: This is incorrect. LR warmup is defined as: the definition and application of lr warmup. The other options describe different aspects that are not the primary focus.
- If you chose D: This is incorrect. LR warmup is defined as: the definition and application of lr warmup. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) The inverse operation of the formula in question B) An unrelated formula from a different topic C) A simplified version of \mathbf{x}... D) \mathbf{x}
Correct: D)
- If you chose A: This is incorrect. The formula \mathbf{x} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: This is incorrect. The formula \mathbf{x} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose C: This is incorrect. The formula \mathbf{x} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: The formula \mathbf{x} is central to this subject. The other options are either simplified versions or unrelated. Correct!
Q3: What is the primary purpose of Cosine decay?
A) It is used to cosine decay in mathematical analysis B) It is primarily a historical notation system C) It replaces all other methods in this domain D) It is used only in advanced research contexts
Correct: A)
- If you chose A: Cosine decay serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose B: This is incorrect. Cosine decay serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. Cosine decay serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: This is incorrect. Cosine decay serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Early stopping is TRUE?
A) Early stopping is not related to this subject B) Early stopping is a fundamental concept covered in this subject C) Early stopping is an advanced topic beyond this subject's scope D) Early stopping is mentioned only as a historical footnote
Correct: B)
- If you chose A: This is incorrect. Early stopping is a fundamental concept covered in this subject. This subject covers Early stopping as part of its core content.
- If you chose B: Early stopping is a fundamental concept covered in this subject. This subject covers Early stopping as part of its core content. Correct!
- If you chose C: This is incorrect. Early stopping is a fundamental concept covered in this subject. This subject covers Early stopping as part of its core content.
- If you chose D: This is incorrect. Early stopping is a fundamental concept covered in this subject. This subject covers Early stopping as part of its core content.
Q5: Based on the worked examples in this subject, what is the correct result?
A) Gradient Clipping B) The inverse of the correct answer C) An unrelated numerical value D) A different result from a common mistake
Correct: A)
- If you chose A: The worked examples show that the result is Gradient Clipping. The other options represent common errors. Correct!
- If you chose B: This is incorrect. The worked examples show that the result is Gradient Clipping. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is Gradient Clipping. The other options represent common errors.
- If you chose D: This is incorrect. The worked examples show that the result is Gradient Clipping. The other options represent common errors.
Q6: How are Early stopping and Gradient clipping related?
A) Early stopping is the inverse of Gradient clipping B) Early stopping is a special case of Gradient clipping C) Early stopping and Gradient clipping are completely unrelated topics D) Early stopping and Gradient clipping are closely related concepts
Correct: D)
- If you chose A: This is incorrect. Both Early stopping and Gradient clipping are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both Early stopping and Gradient clipping are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Early stopping and Gradient clipping are covered in this subject as interconnected topics.
- If you chose D: Both Early stopping and Gradient clipping are covered in this subject as interconnected topics. Correct!
Q7: What is a common pitfall when working with Label smoothing?
A) Label smoothing has no common misconceptions B) Label smoothing is always computed the same way in all contexts C) The main error with Label smoothing is using it when it is not needed D) A common mistake is confusing Label smoothing with a similar concept
Correct: D)
- If you chose A: This is incorrect. Students often confuse Label smoothing with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse Label smoothing with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse Label smoothing with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: Students often confuse Label smoothing with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
Q8: When should you apply Batch normalization?
A) Use Batch normalization only in pure mathematics contexts B) Apply Batch normalization to solve problems in this subject's domain C) Avoid Batch normalization unless explicitly instructed D) Batch normalization is not practically useful
Correct: B)
- If you chose A: This is incorrect. Batch normalization is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: Batch normalization is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose C: This is incorrect. Batch normalization is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: This is incorrect. Batch normalization is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
-
Explain why a random initialization helps gradient descent avoid saddle points, even without stochastic noise.
Click for answer
The set of initializations that converge to a strict saddle has measure zero (it's a lower-dimensional manifold — the stable manifold). A random initialization almost surely lands off this manifold, so deterministic GD also escapes saddles if properly initialized. The probability of landing exactly on the stable manifold of a saddle is zero for any continuous distribution. This is the theoretical basis for the "random initialization + GD works" result. -
For a neural network with $N$ training examples and $P \gg N$ parameters, explain the "interpolation" regime where training loss reaches zero.
Click for answer
When $P > N$, the system is underdetermined — there are more parameters than constraints. For many architectures, it's possible to exactly fit (interpolate) the training data, achieving zero training loss. The interesting question becomes: which of the infinitely many zero-loss solutions does SGD find? Empirically, SGD finds solutions that generalize well (the "implicit bias" or "implicit regularization" phenomenon). -
Compute the effective step in a negative-curvature direction under SGD for $f(x) = -\frac{1}{2}ax^2$ near $x=0$ with noise variance $\sigma^2$.
Click for answer
$\nabla f = -ax$. SGD update: $x_{k+1} = x_k - \eta(-ax_k + \xi_k) = (1 + \eta a)x_k - \eta\xi_k$. Since $1 + \eta a > 1$, the deterministic component amplifies $|x|$ — this is the escape from the saddle. The noise $\xi_k$ provides the initial displacement from exactly zero. $\mathbb{E}[x_{k+1}^2] = (1+\eta a)^2\mathbb{E}[x_k^2] + \eta^2\sigma^2$. The iterate diverges away from the saddle exponentially fast (in mean square). -
A loss surface has "good" local minima at loss 0.1 and "bad" local minima at loss 5.0. What property of overparameterized networks makes the bad minima rare or unreachable?
Click for answer
In overparameterized networks, the loss landscape is "benign": (1) high-loss critical points tend to be saddles, not minima — there's always some direction of escape; (2) the basin of attraction for bad minima (if they exist) shrinks with overparameterization; (3) SGD's noise and optimization trajectory naturally bias toward wide, flat minima (which tend to have lower loss and better generalization). The "loss landscape is mostly saddles plus a connected manifold of good minima" is the dominant picture supported by empirical and theoretical work. -
Design a learning rate warmup schedule: $\eta$ starts at 0.0001, linearly increases to 0.001 over 1000 steps, then cosine-decays to 0 over 9000 more steps. What is $\eta$ at step 500? At step 5000?
Click for answer
Warmup (steps 0–1000): $\eta_k = 0.0001 + (0.001 - 0.0001) \cdot k/1000$ At $k=500$: $\eta_{500} = 0.0001 + 0.0009 \cdot 0.5 = 0.00055$. Cosine decay (steps 1000–10000): $\eta_k = 0.001 \cdot \frac{1}{2}(1 + \cos(\pi \cdot \frac{k-1000}{9000}))$ At $k=5000$: $t = (5000-1000)/9000 = 4/9 \approx 0.4444$. $\eta_{5000} = 0.001 \cdot 0.5(1 + \cos(0.4444\pi)) = 0.0005(1 + 0.1736) = 0.000587$.
Summary
Key takeaways:
- Non-convex optimization is the norm in deep learning — convexity is the exception
- The strict saddle property means SGD almost surely escapes saddles (unlike deterministic GD at exact critical points)
- SGD converges to approximate stationary points at rate $O(1/K + 1/B)$ even for non-convex objectives
- Overparameterization makes loss landscapes benign: most local minima are good, most high-loss critical points are saddles
- Practical heuristics (warmup, cosine decay, gradient clipping) stabilize non-convex training
- Random initialization + SGD implicitly biases toward solutions that generalize well
Pitfalls
-
Assuming all local minima in neural networks are bad. In overparameterized networks ($P \gg N$), most local minima achieve near-optimal loss. The real enemy is saddle points, which are exponentially more numerous — and SGD's noise helps escape them.
-
Using full-batch gradient descent on non-convex problems thinking it's safer. Without stochastic noise, deterministic GD stalls exactly at saddle points (gradient is zero). SGD escapes because noise displaces the iterate, allowing negative-curvature directions to amplify the escape.
-
Setting the learning rate too low to escape plateaus. A learning rate that's too small means the iterate barely moves from one flat region to the next. Slightly higher learning rates help SGD traverse the landscape, but too high causes divergence — the sweet spot is problem-dependent.
-
Forgetting gradient clipping on recurrent architectures. RNNs, LSTMs, and transformers frequently produce gradient norms exceeding $10^3$ due to the chain rule through time. Without clipping, a single large gradient can irreversibly damage the parameters. Clip at $c=1$–$5$ as a starting point.
-
Treating early stopping as only a regularization technique. Early stopping prevents overfitting, but it also prevents SGD from converging to sharp minima. Combined with the implicit bias of SGD, stopping before the training loss hits zero often yields better generalization than training to completion.
Next Steps
Next up: 15-01-floating-point-arithmetic.md — IEEE 754, numerical precision, and why $0.1 + 0.2 \neq 0.3$ in floating point.