14-01 — Optimization Fundamentals
Phase: Optimization | Subject: 14-01 Prerequisites: 04-08-optimization (single-variable), 06-07-optimization-multivariable Next subject: 14-02-gradient-descent.md
Learning Objectives
By the end of this subject, you will be able to:
- Formulate an optimization problem in standard form (objective, variables, constraints)
- Distinguish local minima, global minima, and saddle points
- Classify problems as convex vs non-convex and explain why convexity matters
- Read an optimization landscape and identify its qualitative features
- State the first-order and second-order necessary conditions for optimality
Core Content
What Is Optimization?
Optimization is the mathematical study of finding the best choice from a set of alternatives. In machine learning, nearly everything is optimization: training a model means minimizing a loss function; hyperparameter tuning means maximizing validation accuracy.
The standard form of a continuous optimization problem:
$$\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}) \quad \text{subject to} \quad \begin{cases} g_i(\mathbf{x}) \leq 0, & i = 1, \ldots, m \ h_j(\mathbf{x}) = 0, & j = 1, \ldots, p \end{cases}$$
- $\mathbf{x} \in \mathbb{R}^n$ — decision variables (the parameters you can control)
- $f: \mathbb{R}^n \to \mathbb{R}$ — objective function (what you minimize; "loss" or "cost" in ML)
- $g_i(\mathbf{x}) \leq 0$ — inequality constraints
- $h_j(\mathbf{x}) = 0$ — equality constraints
⚠️ CRITICAL: In ML, you almost always minimize (not maximize). If a paper says "maximize likelihood," they actually minimize the negative log-likelihood. Train your brain to think in minimization — it's the universal convention.
The Optimization Landscape
Think of $f(\mathbf{x})$ as a surface in $\mathbb{R}^{n+1}$ (height = function value). For $n=2$, it looks like a 3D terrain. Key features:
| Feature | Definition | First-Order Check |
|---|---|---|
| Global minimum | $\mathbf{x}^$ such that $f(\mathbf{x}^) \leq f(\mathbf{x})$ for all feasible $\mathbf{x}$ | $\nabla f(\mathbf{x}^*) = \mathbf{0}$ |
| Local minimum | $\mathbf{x}^$ such that $f(\mathbf{x}^) \leq f(\mathbf{x})$ for all $\mathbf{x}$ in some neighborhood | $\nabla f(\mathbf{x}^*) = \mathbf{0}$ |
| Saddle point | $\nabla f = \mathbf{0}$ but neither min nor max (some directions go up, some go down) | $\nabla f(\mathbf{x}^*) = \mathbf{0}$ |
| Plateau | Region where gradient is (near) zero — optimization gets stuck | $\nabla f \approx \mathbf{0}$ |
First-Order Necessary Condition
If $\mathbf{x}^*$ is a local minimum and $f$ is differentiable, then:
$$\nabla f(\mathbf{x}^*) = \mathbf{0}$$
This is necessary but NOT sufficient — saddle points also satisfy it.
Why it's necessary: If $\nabla f(\mathbf{x}^) \neq \mathbf{0}$, then the directional derivative $-\nabla f(\mathbf{x}^)$ points downhill, so you can take a small step to decrease $f$, contradicting that $\mathbf{x}^*$ is a local minimum. There must be no direction of decrease.
Second-Order Necessary Condition
If $\mathbf{x}^*$ is a local minimum and $f$ is twice differentiable:
$$\nabla f(\mathbf{x}^) = \mathbf{0} \quad \text{AND} \quad \nabla^2 f(\mathbf{x}^) \succeq 0 \quad \text{(Hessian is positive semidefinite)}$$
Second-order sufficient condition (guarantees a strict local minimum):
$$\nabla f(\mathbf{x}^) = \mathbf{0} \quad \text{AND} \quad \nabla^2 f(\mathbf{x}^) \succ 0 \quad \text{(Hessian is positive definite)}$$
⚠️ CRITICAL: The Hessian being PSD is NECESSARY. If $\nabla^2 f$ has a negative eigenvalue at a critical point, that point is a saddle — gradient descent will escape along the negative-curvature direction.
Convex vs Non-Convex
A set $C$ is convex if for any $\mathbf{x}, \mathbf{y} \in C$, the line segment $\lambda\mathbf{x} + (1-\lambda)\mathbf{y}$ is also in $C$ for all $\lambda \in [0,1]$.
A function $f$ is convex if its epigraph (the set of points above its graph) is convex, equivalently:
$$f(\lambda\mathbf{x} + (1-\lambda)\mathbf{y}) \leq \lambda f(\mathbf{x}) + (1-\lambda)f(\mathbf{y}) \quad \forall \lambda \in [0,1]$$
Why convexity is a big deal:
| Property | Convex | Non-Convex |
|---|---|---|
| Local minima | Every local min is global | Many spurious local minima |
| Gradient descent | Guaranteed to reach global min (with appropriate step size) | No guarantees; gets stuck |
| Duality gap | Zero (under mild conditions) | Non-zero |
| ML examples | Linear regression, SVM, logistic regression (convex loss) | Neural networks, matrix factorization, clustering |
⚠️ CRITICAL — Common Pitfall: Logistic regression's loss (cross-entropy) IS convex, but the overall neural network loss is NOT convex because composing convex functions with nonlinear activations can produce non-convexity. Don't confuse "the loss function is convex" with "the model training problem is convex."
Feasible Region
The feasible set $\mathcal{F} = {\mathbf{x} \mid g_i(\mathbf{x}) \leq 0,\; h_j(\mathbf{x}) = 0}$ is the set of points that satisfy all constraints. If $\mathcal{F}$ is compact (closed + bounded) and $f$ is continuous, the Weierstrass theorem guarantees a global minimum exists. This is why regularization (which bounds the parameter space) can make ill-posed problems well-posed.
Key Terms
- Global minimum
- Local minimum
- Plateau
- Saddle point
- Second-order sufficient condition
- Weierstrass theorem
Worked Examples
Example 1: Is It a Minimum?
Consider $f(x, y) = x^2 + y^4$. Find and classify all critical points.
Solution: 1. Gradient: $\nabla f = (2x,\; 4y^3)$ 2. Set to zero: $2x = 0 \implies x = 0$, $4y^3 = 0 \implies y = 0$ 3. Unique critical point: $(0, 0)$ 4. Hessian: $\nabla^2 f = \begin{pmatrix} 2 & 0 \ 0 & 12y^2 \end{pmatrix}$ 5. At $(0, 0)$: $\nabla^2 f = \begin{pmatrix} 2 & 0 \ 0 & 0 \end{pmatrix}$ — eigenvalues are 2 and 0. PSD but not PD. 6. Since $f(0,0)=0$ and $f(x,y) = x^2 + y^4 \geq 0$ everywhere, $(0,0)$ is the global minimum.
Lesson: The Hessian gives you sufficient conditions. When it's PSD but not PD, you need to reason about the function directly (here, both terms are non-negative).
Click for answer
Global minimum at $(0,0)$ with value 0. Hessian is PSD but not PD — the sufficient condition fails, but global optimality holds by direct inspection.Example 2: Saddle Point Detection
Show that $f(x, y) = x^2 - y^2$ has a saddle point at the origin.
Solution: 1. $\nabla f = (2x,\; -2y)$; setting to zero gives $(0,0)$ as the only critical point. 2. Hessian: $\nabla^2 f = \begin{pmatrix} 2 & 0 \ 0 & -2 \end{pmatrix}$ 3. Eigenvalues: $\lambda_1 = 2$ (positive curvature along $x$), $\lambda_2 = -2$ (negative curvature along $y$). 4. Since one eigenvalue is negative, the Hessian is indefinite → $(0,0)$ is a saddle point.
Geometric intuition: Along the $x$-axis, $f(x,0) = x^2$ (bowl shape — minimum). Along the $y$-axis, $f(0,y) = -y^2$ (upside-down bowl — maximum). The surface looks like a horse saddle.
Click for answer
Saddle point. Hessian has eigenvalues $(2, -2)$ — indefinite. Critical point is a min in one direction and a max in another.Example 3: Convexity Check
Is $f(\mathbf{x}) = |\mathbf{x}|2^2 = \sum{i=1}^n x_i^2$ convex?
Solution: For any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and $\lambda \in [0,1]$:
$f(\lambda\mathbf{x} + (1-\lambda)\mathbf{y}) = |\lambda\mathbf{x} + (1-\lambda)\mathbf{y}|_2^2$
Expand using $|\mathbf{a}+\mathbf{b}|^2 = |\mathbf{a}|^2 + |\mathbf{b}|^2 + 2\mathbf{a}^T\mathbf{b}$:
$= |\lambda\mathbf{x}|^2 + |(1-\lambda)\mathbf{y}|^2 + 2\lambda(1-\lambda)\mathbf{x}^T\mathbf{y}$
We need to check: $f(\lambda\mathbf{x} + (1-\lambda)\mathbf{y}) \leq \lambda|\mathbf{x}|^2 + (1-\lambda)|\mathbf{y}|^2$
The difference $\lambda|\mathbf{x}|^2 + (1-\lambda)|\mathbf{y}|^2 - f(\lambda\mathbf{x} + (1-\lambda)\mathbf{y})$
$= \lambda|\mathbf{x}|^2 + (1-\lambda)|\mathbf{y}|^2 - \lambda^2|\mathbf{x}|^2 - (1-\lambda)^2|\mathbf{y}|^2 - 2\lambda(1-\lambda)\mathbf{x}^T\mathbf{y}$
$= \lambda(1-\lambda)(|\mathbf{x}|^2 + |\mathbf{y}|^2 - 2\mathbf{x}^T\mathbf{y}) = \lambda(1-\lambda)|\mathbf{x} - \mathbf{y}|^2 \geq 0$
Since $\lambda(1-\lambda) \geq 0$ for $\lambda \in [0,1]$, the difference is non-negative → the inequality holds → $f$ is convex. (It's actually strongly convex, with $\nabla^2 f = 2I \succ 0$.)
Click for answer
Yes, $f(\mathbf{x}) = \|\mathbf{x}\|_2^2$ is convex (in fact, strongly convex). The Hessian $\nabla^2 f = 2I$ is positive definite everywhere.Quiz
Q1: What does the concept of Global minimum primarily refer to in this subject?
A) The definition and application of Global minimum B) A historical anecdote about Global minimum C) A computational error related to Global minimum D) A visual representation of Global minimum
Correct: A)
- If you chose A: Global minimum is defined as: the definition and application of global minimum. The other options describe different aspects that are not the primary focus. Correct!
- If you chose B: This is incorrect. Global minimum is defined as: the definition and application of global minimum. The other options describe different aspects that are not the primary focus.
- If you chose C: This is incorrect. Global minimum is defined as: the definition and application of global minimum. The other options describe different aspects that are not the primary focus.
- If you chose D: This is incorrect. Global minimum is defined as: the definition and application of global minimum. The other options describe different aspects that are not the primary focus.
Q2: Which of the following is the key formula discussed in this subject?
A) The inverse operation of the formula in question B) \min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}) \quad \text{subject to} \quad \begin{cases} g_i(\mathbf{x}) \leq 0, & i = 1, \ldots, m \\ h_j(\mathbf{x}) = 0, & j = 1, \ldots, p \end{cases} C) An unrelated formula from a different topic D) A simplified version of \min_{\mathbf{x} \in \math...
Correct: B)
- If you chose A: This is incorrect. The formula \min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}) \quad \text{subject to} \quad \begin{cases} g_i(\mathbf{x}) \leq 0, & i = 1, \ldots, m \\ h_j(\mathbf{x}) = 0, & j = 1, \ldots, p \end{cases} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: The formula \min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}) \quad \text{subject to} \quad \begin{cases} g_i(\mathbf{x}) \leq 0, & i = 1, \ldots, m \\ h_j(\mathbf{x}) = 0, & j = 1, \ldots, p \end{cases} is central to this subject. The other options are either simplified versions or unrelated. Correct!
- If you chose C: This is incorrect. The formula \min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}) \quad \text{subject to} \quad \begin{cases} g_i(\mathbf{x}) \leq 0, & i = 1, \ldots, m \\ h_j(\mathbf{x}) = 0, & j = 1, \ldots, p \end{cases} is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: This is incorrect. The formula \min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x}) \quad \text{subject to} \quad \begin{cases} g_i(\mathbf{x}) \leq 0, & i = 1, \ldots, m \\ h_j(\mathbf{x}) = 0, & j = 1, \ldots, p \end{cases} is central to this subject. The other options are either simplified versions or unrelated.
Q3: What is the primary purpose of Local minimum?
A) It is used only in advanced research contexts B) It is primarily a historical notation system C) It replaces all other methods in this domain D) It is used to local minimum in mathematical analysis
Correct: D)
- If you chose A: This is incorrect. Local minimum serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. Local minimum serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. Local minimum serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: Local minimum serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
Q4: Which statement about Saddle point is TRUE?
A) Saddle point is an advanced topic beyond this subject's scope B) Saddle point is mentioned only as a historical footnote C) Saddle point is a fundamental concept covered in this subject D) Saddle point is not related to this subject
Correct: C)
- If you chose A: This is incorrect. Saddle point is a fundamental concept covered in this subject. This subject covers Saddle point as part of its core content.
- If you chose B: This is incorrect. Saddle point is a fundamental concept covered in this subject. This subject covers Saddle point as part of its core content.
- If you chose C: Saddle point is a fundamental concept covered in this subject. This subject covers Saddle point as part of its core content. Correct!
- If you chose D: This is incorrect. Saddle point is a fundamental concept covered in this subject. This subject covers Saddle point as part of its core content.
Q5: Based on the worked examples in this subject, what is the correct result?
A) A different result from a common mistake B) An unrelated numerical value C) The inverse of the correct answer D) Saddle Point Detection
Correct: D)
- If you chose A: This is incorrect. The worked examples show that the result is Saddle Point Detection. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is Saddle Point Detection. The other options represent common errors.
- If you chose C: This is incorrect. The worked examples show that the result is Saddle Point Detection. The other options represent common errors.
- If you chose D: The worked examples show that the result is Saddle Point Detection. The other options represent common errors. Correct!
Q6: How are Saddle point and Plateau related?
A) Saddle point and Plateau are closely related concepts B) Saddle point is the inverse of Plateau C) Saddle point and Plateau are completely unrelated topics D) Saddle point is a special case of Plateau
Correct: A)
- If you chose A: Both Saddle point and Plateau are covered in this subject as interconnected topics. Correct!
- If you chose B: This is incorrect. Both Saddle point and Plateau are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Saddle point and Plateau are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Saddle point and Plateau are covered in this subject as interconnected topics.
Q7: What is a common pitfall when working with Second-order sufficient condition?
A) Second-order sufficient condition is always computed the same way in all contexts B) Second-order sufficient condition has no common misconceptions C) A common mistake is confusing Second-order sufficient condition with a similar concept D) The main error with Second-order sufficient condition is using it when it is not needed
Correct: C)
- If you chose A: This is incorrect. Students often confuse Second-order sufficient condition with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse Second-order sufficient condition with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: Students often confuse Second-order sufficient condition with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose D: This is incorrect. Students often confuse Second-order sufficient condition with similar-sounding or related concepts. Pay attention to the precise definitions.
Q8: When should you apply Weierstrass theorem?
A) Use Weierstrass theorem only in pure mathematics contexts B) Apply Weierstrass theorem to solve problems in this subject's domain C) Weierstrass theorem is not practically useful D) Avoid Weierstrass theorem unless explicitly instructed
Correct: B)
- If you chose A: This is incorrect. Weierstrass theorem is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: Weierstrass theorem is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose C: This is incorrect. Weierstrass theorem is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: This is incorrect. Weierstrass theorem is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
-
Find all critical points of $f(x, y) = x^3 - 3x + y^2$ and classify each (min, max, saddle).
Click for answer
$\nabla f = (3x^2 - 3,\; 2y)$. Critical points: $(1, 0)$ and $(-1, 0)$. Hessian: $\begin{pmatrix} 6x & 0 \\ 0 & 2 \end{pmatrix}$. At $(1,0)$: $\nabla^2 f = \begin{pmatrix} 6 & 0 \\ 0 & 2 \end{pmatrix}$ — PD → local minimum. At $(-1,0)$: $\nabla^2 f = \begin{pmatrix} -6 & 0 \\ 0 & 2 \end{pmatrix}$ — indefinite → saddle point. -
Determine whether $f(x) = e^x$ is convex on $\mathbb{R}$.
Click for answer
$f''(x) = e^x > 0$ for all $x$, so $f$ is strictly convex on $\mathbb{R}$. -
Show that $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ is convex if and only if $A$ is positive semidefinite.
Click for answer
$\nabla^2 f = A + A^T$. For a quadratic form, if $A$ is symmetric, $\nabla^2 f = 2A$. The Hessian being PSD ($A \succeq 0$) is the second-order condition for convexity. If $A$ is not symmetric, replace it with $(A+A^T)/2$ since $\mathbf{x}^T A \mathbf{x} = \mathbf{x}^T(\frac{A+A^T}{2})\mathbf{x}$. -
The function $f(x, y) = \sin(x)\sin(y)$ has infinitely many critical points. Find the pattern and classify one maximum, one minimum, and one saddle.
Click for answer
$\nabla f = (\cos x \sin y,\; \sin x \cos y)$. Setting to zero: $\cos x \sin y = 0$ and $\sin x \cos y = 0$. Solutions: $(m\pi, n\pi)$ for $m,n \in \mathbb{Z}$, and $(m\pi + \pi/2, n\pi + \pi/2)$. At $(0,0)$: $\nabla^2 f = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, eigenvalues $\pm 1$ → saddle. At $(\pi/2, \pi/2)$: $\nabla^2 f = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}$, eigenvalues $-1,-1$ → local max, $f=1$. At $(\pi/2, -\pi/2)$: $\nabla^2 f = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, eigenvalues $1,1$ → local min, $f=-1$. -
For $f(\mathbf{x}) = \frac{1}{2}|\mathbf{x}|^2 + \log\sum_{i=1}^n e^{x_i}$, compute the gradient and verify it's convex (this is the log-sum-exp function, the "smooth max").
Click for answer
Let $s = \sum_i e^{x_i}$. Then $\frac{\partial f}{\partial x_j} = x_j + \frac{e^{x_j}}{s}$. The Hessian is $I + \operatorname{diag}(p) - pp^T$ where $p_i = e^{x_i}/s$ (a probability vector). This is $I$ plus the covariance matrix of a categorical distribution, which is PSD → convex.
Summary
Key takeaways:
- An optimization problem has an objective $f$, variables $\mathbf{x}$, and optional constraints $g_i$, $h_j$
- $\nabla f(\mathbf{x}^) = \mathbf{0}$ is necessary* for a local minimum but not sufficient (saddles also satisfy it)
- The Hessian being positive definite is sufficient for a strict local minimum
- In convex optimization, every local minimum is global — this is why convexity is the holy grail
- Non-convex problems (neural networks) have no convergence guarantees but work well in practice
- A compact feasible set + continuous $f$ guarantees existence of a global minimum (Weierstrass)
Pitfalls
- Assuming ∇f = 0 guarantees a minimum: The gradient being zero is necessary but not sufficient. Saddle points and local maxima also satisfy ∇f = 0. Always check the Hessian (or use higher-order tests) before concluding you've found a minimum.
- Checking second-order conditions only along coordinate axes: The Hessian must be positive semidefinite in all directions, not just along the coordinate axes. A 2×2 Hessian with positive diagonal entries can still have a negative eigenvalue if the off-diagonal coupling is strong enough (e.g., [[1, 2], [2, 1]] has eigenvalues −1 and 3).
- Thinking convexity is required for gradient descent to work: Many non-convex problems in practice (neural networks, matrix factorization) are solved successfully with gradient-based methods. Convexity guarantees global convergence but is not a prerequisite — it just means there are no spurious local minima to get stuck in.
- Confusing "the loss function is convex" with "the optimization problem is convex": Logistic regression's cross-entropy loss is convex in the parameters, but composing that loss with a nonlinear feature map or neural network layers makes the overall objective non-convex. Always check convexity of the full composition, not just the final loss.
- Forgetting to verify the feasible set is compact before invoking Weierstrass: The theorem requires both closedness and boundedness. A continuous function on ℝ (not bounded) or (0,1] (not closed) may have no minimum. Regularization often makes the feasible set compact, guaranteeing existence.
Next Steps
Next up: 14-02-gradient-descent.md — where you'll learn the workhorse algorithm that uses gradients to actually find those minima.