14-06 — Convex Sets and Functions
Phase: Optimization | Subject: 14-06 Prerequisites: 14-01-optimization-fundamentals.md, 06-01-functions-of-several-variables.md Next subject: 14-07-convex-optimization-problems.md
Learning Objectives
By the end of this subject, you will be able to:
- Define convex sets and prove a set is convex from the definition
- Define convex functions and test convexity using first-order and second-order conditions
- Recognize operations that preserve convexity (intersection, affine composition, pointwise maximum)
- Explain why convexity is the central property in optimization theory
- Identify common convex and non-convex functions in ML
Core Content
Convex Sets
A set $C \subseteq \mathbb{R}^n$ is convex if for any $\mathbf{x}, \mathbf{y} \in C$ and any $\lambda \in [0, 1]$:
$$\lambda\mathbf{x} + (1-\lambda)\mathbf{y} \in C$$
In plain language: the line segment between any two points in $C$ is entirely contained in $C$.
⚠️ CRITICAL — This is the fundamental building block of optimization theory. Every guarantee in convex optimization — global optimality, zero duality gap, convergence proofs — traces back to this definition.
Examples of convex sets: - $\mathbb{R}^n$ (the whole space) - ${\mathbf{x} : |\mathbf{x}| \leq 1}$ (the unit ball) - ${\mathbf{x} : A\mathbf{x} = \mathbf{b}}$ (affine subspace — a line, plane, hyperplane) - ${\mathbf{x} : A\mathbf{x} \leq \mathbf{b}}$ (polyhedron — intersection of halfspaces) - $\mathbb{S}^n_+ = {X \in \mathbb{R}^{n \times n} : X = X^T, X \succeq 0}$ (positive semidefinite cone)
Non-convex sets: - ${0, 1}^n$ (binary hypercube — discrete, not continuous) - ${\mathbf{x} : |\mathbf{x}| = 1}$ (sphere surface — the chord goes inside, not on the surface) - ${\mathbf{x} : x_1x_2 \geq 1}$ (non-convex region defined by a non-convex inequality)
Proving a Set Is Convex
Take any $\mathbf{x}, \mathbf{y} \in C$ and $\lambda \in [0,1]$. Show $\lambda\mathbf{x} + (1-\lambda)\mathbf{y} \in C$ using the definition of $C$.
Example: Prove $C = {\mathbf{x} : |\mathbf{x}| \leq 1}$ is convex. Take $\mathbf{x}, \mathbf{y}$ with $|\mathbf{x}| \leq 1$, $|\mathbf{y}| \leq 1$. For $\lambda \in [0,1]$: $|\lambda\mathbf{x} + (1-\lambda)\mathbf{y}| \leq \lambda|\mathbf{x}| + (1-\lambda)|\mathbf{y}| \leq \lambda(1) + (1-\lambda)(1) = 1$ ✓
Convex Functions
A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if $\operatorname{dom} f$ is convex and for all $\mathbf{x}, \mathbf{y} \in \operatorname{dom} f$, $\lambda \in [0,1]$:
$$f(\lambda\mathbf{x} + (1-\lambda)\mathbf{y}) \leq \lambda f(\mathbf{x}) + (1-\lambda)f(\mathbf{y})$$
Geometric interpretation: the function lies below the chord connecting any two points.
$f$ is strictly convex if the inequality is strict ($<$) for $\mathbf{x} \neq \mathbf{y}$, $\lambda \in (0,1)$.
$f$ is concave if $-f$ is convex (function lies above the chord).
First-Order Condition (Gradient Characterization)
For differentiable $f$:
$$f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y} - \mathbf{x}) \quad \forall \mathbf{x}, \mathbf{y}$$
The first-order Taylor approximation is a global underestimator of $f$. The tangent plane lies below the function everywhere.
⚠️ CRITICAL — This is how you prove optimality. If $\nabla f(\mathbf{x}^) = \mathbf{0}$, then $f(\mathbf{y}) \geq f(\mathbf{x}^)$ for all $\mathbf{y}$ — $\mathbf{x}^*$ is a global minimum. For convex functions, vanishing gradient $\implies$ global optimum.
Second-Order Condition (Hessian Characterization)
For twice-differentiable $f$:
$$f \text{ is convex } \iff \nabla^2 f(\mathbf{x}) \succeq 0 \quad \forall \mathbf{x} \in \operatorname{dom} f$$
The Hessian is positive semidefinite everywhere. This is usually the easiest way to test convexity — check eigenvalues.
$f$ is strictly convex if $\nabla^2 f(\mathbf{x}) \succ 0$ (PD), but this is sufficient, not necessary ($f(x)=x^4$ is strictly convex but $f''(0)=0$).
Operations That Preserve Convexity
| Operation | Rule |
|---|---|
| Nonnegative weighted sum | If $f_i$ are convex and $w_i \geq 0$, then $\sum_i w_i f_i$ is convex |
| Affine composition | If $f$ is convex, then $g(\mathbf{x}) = f(A\mathbf{x} + \mathbf{b})$ is convex |
| Pointwise maximum | If $f_1, \ldots, f_m$ are convex, then $f(\mathbf{x}) = \max_i f_i(\mathbf{x})$ is convex |
| Composition with convex nondecreasing | If $g$ is convex nondecreasing and $h$ is convex, $g \circ h$ is convex |
| Partial infimum | If $f(\mathbf{x}, \mathbf{y})$ is jointly convex, then $g(\mathbf{x}) = \inf_{\mathbf{y}} f(\mathbf{x}, \mathbf{y})$ is convex |
⚠️ Common Pitfall: The sum of convex functions is convex, but the product of convex functions is generally NOT convex. $f(x)=x^2$ and $g(x)=x^2$ are convex — their product $h(x)=x^4$ happens to also be convex, but that's coincidence, not a product rule. For a genuine counterexample: $f(x)=x^2$ and $g(x)=(x-1)^2$ are both convex, but their product $h(x)=x^2(x-1)^2$ is non-convex (its second derivative changes sign). Similarly, $f(x)=e^x$ and $g(x)=e^x$ are convex and $e^{2x}$ is convex, but again this is luck — not a general property.
ML Functions and Their Convexity Status
| Function | Convex? | Notes |
|---|---|---|
| $|X\mathbf{w} - \mathbf{y}|_2^2$ (MSE loss) | ✓ | Quadratic with PSD Hessian $2X^TX$ |
| $-\log p(y | \mathbf{x};\mathbf{w})$ (NLL/cross-entropy) | ✓ (in $\mathbf{w}$ for linear models) |
| $| \mathbf{w}|_1$ (L1 regularization) | ✓ | Convex, nonsmooth at 0 |
| $| \mathbf{w}|_2^2$ (L2 regularization) | ✓ | Strongly convex |
| Neural network loss (in all params) | ✗ | Non-convex composition of convex functions |
| $\min_i f_i(\mathbf{x})$ (min of convex) | ✗ | Min breaks convexity; max preserves it |
Key Terms
- Affine composition
- Composition with convex nondecreasing
- Nonnegative weighted sum
- Partial infimum
- Pointwise maximum
Worked Examples
Example 1: Proving a Function Is Convex
Prove $f(x) = -\ln x$ is convex on $(0, \infty)$.
Solution (second-order): $f'(x) = -1/x$, $f''(x) = 1/x^2 > 0$ for all $x > 0$. Since $f''(x) \geq 0$, $f$ is convex (in fact, strictly convex). ✓
Solution (definition): For $x, y > 0$, $\lambda \in [0,1]$: Need to show $-\ln(\lambda x + (1-\lambda)y) \leq -\lambda\ln x - (1-\lambda)\ln y$. Multiply by $-1$ and exponentiate: $\lambda x + (1-\lambda)y \geq x^\lambda y^{1-\lambda}$ — this is the weighted AM-GM inequality, which holds for all $x,y>0$. ✓
Click for answer
$f''(x) = 1/x^2 \geq 0$ for all $x > 0$, proving convexity. Alternatively, follows from the weighted AM-GM inequality.Example 2: Pointwise Maximum
Show that $f(\mathbf{x}) = \max{x_1^2 + x_2^2,\; (x_1-1)^2 + x_2^2}$ is convex.
Solution: Both $f_1(\mathbf{x}) = x_1^2 + x_2^2$ and $f_2(\mathbf{x}) = (x_1-1)^2 + x_2^2$ are convex quadratics (Hessian of each is $2I \succeq 0$). The pointwise maximum of convex functions is convex. ✓
Geometrically, $f$ is the maximum of two paraboloids centered at $(0,0)$ and $(1,0)$ — the function looks like two bowls glued together.
Click for answer
Both components are convex (Hessian $= 2I$ for each). Pointwise maximum preserves convexity.Example 3: Intersection of Convex Sets
Prove the feasible set of $x^2 + y^2 \leq 1$, $x \geq 0$ is convex.
Solution: $C_1 = {(x,y) : x^2+y^2 \leq 1}$ is convex (unit disk — proved earlier). $C_2 = {(x,y) : x \geq 0}$ is convex (halfspace). $C = C_1 \cap C_2$ — intersection of convex sets is convex. This is the right half of the unit disk. ✓
Click for answer
The intersection of convex sets is convex. $C_1$ (unit disk) and $C_2$ (halfspace $x \geq 0$) are both convex, so their intersection (right half-disk) is convex.Quiz
Q1: What does the concept of Nonnegative weighted sum primarily refer to in this subject?
A) A computational error related to Nonnegative weighted sum B) A historical anecdote about Nonnegative weighted sum C) A visual representation of Nonnegative weighted sum D) The definition and application of Nonnegative weighted sum
Correct: D)
- If you chose A: This is incorrect. Nonnegative weighted sum is defined as: the definition and application of nonnegative weighted sum. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Nonnegative weighted sum is defined as: the definition and application of nonnegative weighted sum. The other options describe different aspects that are not the primary focus.
- If you chose C: This is incorrect. Nonnegative weighted sum is defined as: the definition and application of nonnegative weighted sum. The other options describe different aspects that are not the primary focus.
- If you chose D: Nonnegative weighted sum is defined as: the definition and application of nonnegative weighted sum. The other options describe different aspects that are not the primary focus. Correct!
Q2: Which of the following is the key formula discussed in this subject?
A) A simplified version of C \subseteq \mathbb{R}^n... B) An unrelated formula from a different topic C) The inverse operation of the formula in question D) C \subseteq \mathbb{R}^n
Correct: D)
- If you chose A: This is incorrect. The formula C \subseteq \mathbb{R}^n is central to this subject. The other options are either simplified versions or unrelated.
- If you chose B: This is incorrect. The formula C \subseteq \mathbb{R}^n is central to this subject. The other options are either simplified versions or unrelated.
- If you chose C: This is incorrect. The formula C \subseteq \mathbb{R}^n is central to this subject. The other options are either simplified versions or unrelated.
- If you chose D: The formula C \subseteq \mathbb{R}^n is central to this subject. The other options are either simplified versions or unrelated. Correct!
Q3: What is the primary purpose of Affine composition?
A) It replaces all other methods in this domain B) It is used only in advanced research contexts C) It is used to affine composition in mathematical analysis D) It is primarily a historical notation system
Correct: C)
- If you chose A: This is incorrect. Affine composition serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. Affine composition serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: Affine composition serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
- If you chose D: This is incorrect. Affine composition serves the purpose described in the correct answer. The other options misrepresent its role.
Q4: Which statement about Pointwise maximum is TRUE?
A) Pointwise maximum is mentioned only as a historical footnote B) Pointwise maximum is not related to this subject C) Pointwise maximum is an advanced topic beyond this subject's scope D) Pointwise maximum is a fundamental concept covered in this subject
Correct: D)
- If you chose A: This is incorrect. Pointwise maximum is a fundamental concept covered in this subject. This subject covers Pointwise maximum as part of its core content.
- If you chose B: This is incorrect. Pointwise maximum is a fundamental concept covered in this subject. This subject covers Pointwise maximum as part of its core content.
- If you chose C: This is incorrect. Pointwise maximum is a fundamental concept covered in this subject. This subject covers Pointwise maximum as part of its core content.
- If you chose D: Pointwise maximum is a fundamental concept covered in this subject. This subject covers Pointwise maximum as part of its core content. Correct!
Q5: Based on the worked examples in this subject, what is the correct result?
A) The inverse of the correct answer B) An unrelated numerical value C) Pointwise Maximum D) A different result from a common mistake
Correct: C)
- If you chose A: This is incorrect. The worked examples show that the result is Pointwise Maximum. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is Pointwise Maximum. The other options represent common errors.
- If you chose C: The worked examples show that the result is Pointwise Maximum. The other options represent common errors. Correct!
- If you chose D: This is incorrect. The worked examples show that the result is Pointwise Maximum. The other options represent common errors.
Q6: How are Pointwise maximum and Composition with convex nondecreasing related?
A) Pointwise maximum is a special case of Composition with convex nondecreasing B) Pointwise maximum is the inverse of Composition with convex nondecreasing C) Pointwise maximum and Composition with convex nondecreasing are closely related concepts D) Pointwise maximum and Composition with convex nondecreasing are completely unrelated topics
Correct: C)
- If you chose A: This is incorrect. Both Pointwise maximum and Composition with convex nondecreasing are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both Pointwise maximum and Composition with convex nondecreasing are covered in this subject as interconnected topics.
- If you chose C: Both Pointwise maximum and Composition with convex nondecreasing are covered in this subject as interconnected topics. Correct!
- If you chose D: This is incorrect. Both Pointwise maximum and Composition with convex nondecreasing are covered in this subject as interconnected topics.
Q7: What is a common pitfall when working with Partial infimum?
A) Partial infimum has no common misconceptions B) Partial infimum is always computed the same way in all contexts C) The main error with Partial infimum is using it when it is not needed D) A common mistake is confusing Partial infimum with a similar concept
Correct: D)
- If you chose A: This is incorrect. Students often confuse Partial infimum with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse Partial infimum with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse Partial infimum with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: Students often confuse Partial infimum with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
Q8: When should you apply Convex Sets?
A) Use Convex Sets only in pure mathematics contexts B) Avoid Convex Sets unless explicitly instructed C) Convex Sets is not practically useful D) Apply Convex Sets to solve problems in this subject's domain
Correct: D)
- If you chose A: This is incorrect. Convex Sets is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Convex Sets is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: This is incorrect. Convex Sets is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: Convex Sets is a practical tool used throughout this subject to solve relevant problems. Correct!
Practice Problems
-
Prove that the set of probability vectors $\Delta_n = {\mathbf{p} \in \mathbb{R}^n : p_i \geq 0,\; \sum_i p_i = 1}$ is convex.
Click for answer
Take $\mathbf{p}, \mathbf{q} \in \Delta_n$, $\lambda \in [0,1]$. Let $\mathbf{r} = \lambda\mathbf{p} + (1-\lambda)\mathbf{q}$. $r_i = \lambda p_i + (1-\lambda)q_i \geq 0$ (nonnegative combination of nonnegative numbers). $\sum_i r_i = \lambda\sum_i p_i + (1-\lambda)\sum_i q_i = \lambda(1) + (1-\lambda)(1) = 1$. So $\mathbf{r} \in \Delta_n$ ✓ — this is the probability simplex. -
Is $f(x) = |x|$ convex? Prove it using the definition.
Click for answer
For $x, y \in \mathbb{R}$, $\lambda \in [0,1]$: $f(\lambda x + (1-\lambda)y) = |\lambda x + (1-\lambda)y|$ $\leq \lambda|x| + (1-\lambda)|y|$ (triangle inequality) $= \lambda f(x) + (1-\lambda)f(y)$ ✓ $|x|$ is convex (and subdifferentiable at 0 with subgradient in $[-1, 1]$). -
Determine if $f(x, y) = x^2/y$ is convex on ${(x, y) : y > 0}$.
Click for answer
Hessian: $\nabla^2 f = \begin{pmatrix} 2/y & -2x/y^2 \\ -2x/y^2 & 2x^2/y^3 \end{pmatrix} = \frac{2}{y^3}\begin{pmatrix} y^2 & -xy \\ -xy & x^2 \end{pmatrix}$. The matrix $\begin{pmatrix} y^2 & -xy \\ -xy & x^2 \end{pmatrix}$ has eigenvalues $0$ and $x^2+y^2 \geq 0$, so the Hessian is PSD for $y>0$. The function IS convex on this domain. (This is the "quadratic-over-linear" function, a standard convex function in Boyd & Vandenberghe.) -
Show that the set of PSD matrices $\mathbb{S}^n_+$ is a convex cone.
Click for answer
Take $A, B \succeq 0$, $\lambda \in [0,1]$. For any $\mathbf{v}$: $\mathbf{v}^T(\lambda A + (1-\lambda)B)\mathbf{v} = \lambda\mathbf{v}^TA\mathbf{v} + (1-\lambda)\mathbf{v}^TB\mathbf{v} \geq 0$ since both terms are nonnegative. So $\lambda A + (1-\lambda)B \succeq 0$, hence $\mathbb{S}^n_+$ is convex. Also, for $\alpha \geq 0$, $\alpha A \succeq 0$ (cone property). -
$f(x) = e^x$ and $g(x) = x^2$ are convex. Is $h(x) = f(g(x)) = e^{x^2}$ convex on $\mathbb{R}$?
Click for answer
$h'(x) = 2x e^{x^2}$, $h''(x) = 2e^{x^2} + 4x^2 e^{x^2} = 2e^{x^2}(1 + 2x^2) \geq 0$ for all $x$. Yes, $h$ is convex. This follows from the composition rule: $f$ is convex and nondecreasing on $[0,\infty)$, and $g(x) = x^2$ maps into $[0,\infty)$ and is convex — so $f \circ g$ is convex.
Summary
Key takeaways:
- A set is convex if it contains the line segment between any two points in it
- A function is convex if it lies below the chord: $f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$
- First-order condition: tangent plane is a global underestimator
- Second-order condition: Hessian is PSD everywhere — the easiest convexity test
- Convexity is preserved under: nonnegative weighted sums, affine composition, pointwise max
- For convex functions: $\nabla f(\mathbf{x}^) = \mathbf{0} \iff \mathbf{x}^$ is a global minimum
- ML loss functions for linear/logistic regression are convex; neural network losses are not
Pitfalls
-
Assuming the product of convex functions is convex. $f(x)=x^2$ and $g(x)=(x-1)^2$ are both convex, but their product is non-convex (second derivative changes sign). There is no product rule for convexity — only nonnegative weighted sums, affine composition, and pointwise maximum are safe.
-
Using $\min$ instead of $\max$ when combining convex functions. The pointwise maximum of convex functions is convex; the pointwise minimum is generally not. For example, $\min{x^2, (x-1)^2}$ has a dip in the middle and is non-convex.
-
Checking convexity only at a single point. A function with $f''(x) > 0$ at one point is not necessarily convex. $f''(x)$ must be $\geq 0$ everywhere in the domain — one region of negative curvature destroys convexity.
-
Confusing strict convexity ($f'' > 0$) with convexity ($f'' \geq 0$). $f(x)=x^4$ is strictly convex but $f''(0)=0$. Strict convexity does not require $f''>0$ everywhere — it's a sufficient but not necessary condition.
-
Forgetting that convexity is domain-dependent. $f(x)=1/x$ is convex on $(0,\infty)$ but not on $\mathbb{R}\setminus{0}$ (it's concave on $(-\infty,0)$). Always specify the domain when claiming convexity.
Next Steps
Next up: 14-07-convex-optimization-problems.md — standard forms (LP, QP, SOCP, SDP), optimality conditions, and why convex problems are "easy."