Math graphic
📐 Concept diagram

14-09 — Lagrangian Duality

Phase: Optimization | Subject: 14-09 Prerequisites: 14-08-constrained-optimization-theory.md Next subject: 14-10-stochastic-nonconvex-optimization.md


Learning Objectives

By the end of this subject, you will be able to:

  1. Construct the Lagrangian dual function and the dual problem
  2. Explain weak duality and prove why the dual provides a lower bound
  3. State conditions for strong duality (zero duality gap)
  4. Interpret the dual problem as finding the best lower bound
  5. Recognize duality in ML contexts (SVM dual, GANs, variational inference)

Core Content

The Dual Function

For the primal problem $\min f(\mathbf{x})$ s.t. $g_i(\mathbf{x}) \leq 0$, $h_j(\mathbf{x}) = 0$, define the Lagrangian dual function:

$$\theta(\boldsymbol{\lambda}, \boldsymbol{\nu}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = \inf_{\mathbf{x}} \left[f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \nu_j h_j(\mathbf{x})\right]$$

Key property: $\theta(\boldsymbol{\lambda}, \boldsymbol{\nu})$ is concave (even when the primal is non-convex!) because it's the pointwise infimum of affine functions in $(\boldsymbol{\lambda}, \boldsymbol{\nu})$.

Weak Duality

For any $\boldsymbol{\lambda} \geq \mathbf{0}$ and any $\boldsymbol{\nu}$, and any feasible $\mathbf{x}$:

$$\theta(\boldsymbol{\lambda}, \boldsymbol{\nu}) \leq f(\mathbf{x})$$

Proof: For feasible $\mathbf{x}$: $g_i(\mathbf{x}) \leq 0$, $h_j(\mathbf{x}) = 0$, $\lambda_i \geq 0$, so: $\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) = f(\mathbf{x}) + \underbrace{\sum\lambda_i g_i(\mathbf{x})}{\leq 0} + \underbrace{\sum\nu_j h_j(\mathbf{x})}{=0} \leq f(\mathbf{x})$

Since $\theta$ is the infimum over all $\mathbf{x}$: $\theta(\boldsymbol{\lambda}, \boldsymbol{\nu}) \leq \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) \leq f(\mathbf{x})$. ∎

The Dual Problem

Weak duality says $\theta$ is always a lower bound. The dual problem seeks the best (largest) lower bound:

$$\max_{\boldsymbol{\lambda} \geq \mathbf{0},\; \boldsymbol{\nu}} \theta(\boldsymbol{\lambda}, \boldsymbol{\nu})$$

Let $p^$ = primal optimal value, $d^$ = dual optimal value. Weak duality: $d^ \leq p^$.

The difference $p^ - d^ \geq 0$ is the duality gap.

⚠️ CRITICAL — The dual is ALWAYS a concave maximization problem, regardless of the primal's convexity. The dual variables $\boldsymbol{\lambda}$ are constrained to be nonnegative, making this a tractable problem even when the primal is non-convex.

Strong Duality

Strong duality holds when $d^ = p^$ (zero duality gap). For convex problems, strong duality holds under Slater's condition: there exists a strictly feasible point ($g_i(\mathbf{x}) < 0$ for all $i$).

When strong duality holds, you can solve either the primal or the dual — they give the same optimal value. The dual is often easier (fewer variables, simpler constraints).

Duality in Machine Learning

SVMs: The SVM primal is a QP in $\mathbf{w}$ ($n$ variables). The dual is a QP in $\boldsymbol{\alpha}$ ($N$ variables — one per data point). The dual reveals that only support vectors ($\alpha_i > 0$) matter, enables the kernel trick, and is what SMO actually solves.

GANs: The minimax game $\min_G \max_D V(D,G)$ is a primal-dual formulation. The discriminator plays the role of the dual variable — tightening the lower bound on the Jensen-Shannon divergence.

Variational Inference: The ELBO (Evidence Lower BOund) $\mathcal{L}(q) = \mathbb{E}q[\ln p(\mathbf{x},\mathbf{z})] - \mathbb{E}_q[\ln q(\mathbf{z})]$ is a dual function. Maximizing the ELBO is solving the dual of $\min{q} \text{KL}(q | p(\mathbf{z}|\mathbf{x}))$.

The Dual of a Linear Program

The LP $\min \mathbf{c}^T\mathbf{x}$ s.t. $A\mathbf{x} = \mathbf{b}$, $\mathbf{x} \geq \mathbf{0}$ has dual:

$$\max \mathbf{b}^T\mathbf{y} \quad \text{s.t.} \quad A^T\mathbf{y} \leq \mathbf{c}$$

Strong duality holds for LPs (always, no CQ needed). The dual LP has one variable per primal constraint and one constraint per primal variable — a perfect symmetry.



Key Terms

Worked Examples

Example 1: Dual of a Simple QP

Find the dual of $\min \frac{1}{2}x^2$ s.t. $x \geq a$.

Solution: $f(x) = \frac{1}{2}x^2$, $g(x) = a - x \leq 0$. $\mathcal{L}(x, \lambda) = \frac{1}{2}x^2 + \lambda(a - x)$, $\lambda \geq 0$.

For fixed $\lambda$, minimize over $x$: $\frac{\partial\mathcal{L}}{\partial x} = x - \lambda = 0 \implies x = \lambda$.

$\theta(\lambda) = \frac{1}{2}\lambda^2 + \lambda(a - \lambda) = \lambda a - \frac{1}{2}\lambda^2$

Dual: $\max_{\lambda \geq 0} \lambda a - \frac{1}{2}\lambda^2$. The optimum: $a - \lambda = 0 \implies \lambda = a$ (if $a \geq 0$).

If $a = 1$: $\lambda^ = 1$, $d^ = 1 - 0.5 = 0.5$. Primal: $x^ = 1$, $p^ = 0.5$. Strong duality holds ✓.

If $a = -1$: $\lambda^ = 0$ (constrained by $\lambda \geq 0$), $d^ = 0$. Primal: unconstrained minimum at $x=0$, $p^* = 0$. Strong duality holds ✓.

Click for answer Dual: $\max_{\lambda \geq 0} [\lambda a - \frac{1}{2}\lambda^2]$. Strong duality holds (Slater satisfied for any finite $a$).

Example 2: LP Duality

Primal: $\min 2x_1 + 3x_2$ s.t. $x_1 + x_2 \geq 5$, $x_1 \geq 0$, $x_2 \geq 0$.

Solution: Convert to standard form: $\min 2x_1 + 3x_2$ s.t. $-x_1 - x_2 \leq -5$, $-x_1 \leq 0$, $-x_2 \leq 0$. $\mathcal{L} = 2x_1 + 3x_2 + \lambda_1(-x_1-x_2+5) + \lambda_2(-x_1) + \lambda_3(-x_2)$. $= (2 - \lambda_1 - \lambda_2)x_1 + (3 - \lambda_1 - \lambda_3)x_2 + 5\lambda_1$

For the infimum over $x_1, x_2 \geq 0$: if $2-\lambda_1-\lambda_2 < 0$, infimum is $-\infty$ (drive $x_1 \to \infty$). So we need $2-\lambda_1-\lambda_2 \geq 0$ and $3-\lambda_1-\lambda_3 \geq 0$.

$\theta(\boldsymbol{\lambda}) = 5\lambda_1$ (when constraints hold), $-\infty$ otherwise.

Dual: $\max 5\lambda_1$ s.t. $\lambda_1 + \lambda_2 \leq 2$, $\lambda_1 + \lambda_3 \leq 3$, $\lambda_i \geq 0$.

Optimal: $\lambda_1 = 2$ (binding first constraint), $\lambda_2 = 0$, $\lambda_3 = 1$. $d^* = 10$.

Primal optimum: $x_1 = 5$, $x_2 = 0$, $p^* = 10$. Strong duality ✓.

Click for answer Dual: $\max_{\boldsymbol{\lambda} \geq 0} 5\lambda_1$ s.t. $\lambda_1+\lambda_2 \leq 2$, $\lambda_1+\lambda_3 \leq 3$. $d^* = p^* = 10$.

Example 3: Duality Gap with Non-Convex Problem

$\min x^3 - 3x$ s.t. $x \in [-2, 2]$.

Solution: This is non-convex ($x^3$ is not convex). The Lagrangian with constraints $-x-2 \leq 0$, $x-2 \leq 0$: $\mathcal{L} = x^3 - 3x + \lambda_1(-x-2) + \lambda_2(x-2)$.

Dual function: $\theta(\lambda_1, \lambda_2) = \inf_{x} [x^3 - 3x + (\lambda_2 - \lambda_1)x - 2(\lambda_1 + \lambda_2)]$.

For each fixed $(\lambda_1, \lambda_2)$, as $x \to -\infty$, $x^3 \to -\infty$ (dominates linear terms) → $\theta = -\infty$ for ALL $(\lambda_1, \lambda_2)$! The dual is $-\infty$ and provides a trivial bound.

Even though $p^*$ is finite (the function has a minimum on $[-2,2]$), the dual is useless. This is the cost of non-convexity — duality can collapse entirely.

Click for answer The dual function is identically $-\infty$ because the cubic term dominates as $x \to -\infty$. Duality provides no useful bound for this non-convex problem.


Quiz

Q1: What does the concept of Lagrangian dual function primarily refer to in this subject?

A) A historical anecdote about Lagrangian dual function B) A visual representation of Lagrangian dual function C) The definition and application of Lagrangian dual function D) A computational error related to Lagrangian dual function

Correct: C)

Q2: Which of the following is the key formula discussed in this subject?

A) A simplified version of \min f(\mathbf{x})... B) \min f(\mathbf{x}) C) The inverse operation of the formula in question D) An unrelated formula from a different topic

Correct: B)

Q3: What is the primary purpose of Strong duality?

A) It is used to strong duality in mathematical analysis B) It is primarily a historical notation system C) It replaces all other methods in this domain D) It is used only in advanced research contexts

Correct: A)

Q4: Which statement about The Dual Function is TRUE?

A) The Dual Function is an advanced topic beyond this subject's scope B) The Dual Function is not related to this subject C) The Dual Function is a fundamental concept covered in this subject D) The Dual Function is mentioned only as a historical footnote

Correct: C)

Q5: Based on the worked examples in this subject, what is the correct result?

A) $1 - y_i(\mathbf{w}^T\mathbf{x}_i + b) \leq 0$. B) The inverse of the correct answer C) A different result from a common mistake D) An unrelated numerical value

Correct: A)

Q6: How are The Dual Function and Weak Duality related?

A) The Dual Function and Weak Duality are closely related concepts B) The Dual Function is a special case of Weak Duality C) The Dual Function and Weak Duality are completely unrelated topics D) The Dual Function is the inverse of Weak Duality

Correct: A)

Q7: What is a common pitfall when working with The Dual Problem?

A) A common mistake is confusing The Dual Problem with a similar concept B) The main error with The Dual Problem is using it when it is not needed C) The Dual Problem is always computed the same way in all contexts D) The Dual Problem has no common misconceptions

Correct: A)

Q8: When should you apply Duality In Machine Learning?

A) Apply Duality In Machine Learning to solve problems in this subject's domain B) Duality In Machine Learning is not practically useful C) Avoid Duality In Machine Learning unless explicitly instructed D) Use Duality In Machine Learning only in pure mathematics contexts

Correct: A)

Practice Problems

  1. Derive the dual of $\min \frac{1}{2}|\mathbf{w}|^2$ s.t. $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ (SVM primal).

    Click for answer Convert to $\leq$ form: $1 - y_i(\mathbf{w}^T\mathbf{x}_i + b) \leq 0$. $\mathcal{L} = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_i \alpha_i(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$, $\alpha_i \geq 0$. Stationarity: $\partial\mathcal{L}/\partial\mathbf{w} = \mathbf{w} - \sum\alpha_i y_i\mathbf{x}_i = 0 \implies \mathbf{w} = \sum\alpha_i y_i\mathbf{x}_i$. $\partial\mathcal{L}/\partial b = -\sum\alpha_i y_i = 0$. Substitute back: $\theta(\boldsymbol{\alpha}) = \sum\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$. Dual: $\max_{\boldsymbol{\alpha} \geq 0} \sum\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$ s.t. $\sum\alpha_i y_i = 0$. This is the SVM dual — a QP in $N$ variables with box constraints.

  2. Show that the dual function $\theta(\boldsymbol{\lambda}, \boldsymbol{\nu})$ is always concave, even for non-convex primals.

    Click for answer $\theta(\boldsymbol{\lambda}, \boldsymbol{\nu}) = \inf_{\mathbf{x}} [f(\mathbf{x}) + \sum\lambda_i g_i(\mathbf{x}) + \sum\nu_j h_j(\mathbf{x})]$. For each fixed $\mathbf{x}$, the function inside the infimum is affine in $(\boldsymbol{\lambda}, \boldsymbol{\nu})$. The pointwise infimum of affine functions is concave (epigraph is intersection of halfspaces). This holds regardless of $f$, $g_i$, $h_j$ — the dual function is always concave.

  3. Primal: $\min e^x$ s.t. $x \geq 0$. Compute the dual and verify strong duality.

    Click for answer $g(x) = -x \leq 0$. $\mathcal{L} = e^x - \lambda x$, $\lambda \geq 0$. $\partial\mathcal{L}/\partial x = e^x - \lambda = 0 \implies x = \ln\lambda$ (if $\lambda > 0$). If $\lambda = 0$: the infimum in the dual function is over ALL $x$, unconstrained. So $\theta(0) = \inf_x e^x = 0$ (limit, not attained). For $\lambda > 0$: $\theta(\lambda) = e^{\ln\lambda} - \lambda\ln\lambda = \lambda - \lambda\ln\lambda = \lambda(1 - \ln\lambda)$. Maximize over $\lambda \geq 0$: $d\theta/d\lambda = 1 - \ln\lambda - 1 = -\ln\lambda = 0 \implies \lambda = 1$. $d^* = 1(1-0) = 1$. Primal: $x^* = 0$, $p^* = 1$. Strong duality ✓.

  4. Explain why the dual of a non-convex problem might still be useful.

    Click for answer Even with a duality gap ($d^* < p^*$), the dual provides: (1) a lower bound on the optimal value (certificate of quality — if you find a primal feasible point with value close to $d^*$, you know you're near optimal); (2) a convex relaxation — the dual is always a convex problem, solvable with standard tools; (3) for some non-convex problems (e.g., certain QCQPs, trust-region subproblems), strong duality holds despite non-convexity. In ML, the dual perspective often reveals structure the primal hides.

  5. For the LP dual pair, show that if the primal is unbounded below, the dual must be infeasible.

    Click for answer By weak duality: for any dual feasible $(\boldsymbol{\lambda}, \boldsymbol{\nu})$, $\theta(\boldsymbol{\lambda}, \boldsymbol{\nu}) \leq p^*$. If $p^* = -\infty$ (primal unbounded), then $\theta(\boldsymbol{\lambda}, \boldsymbol{\nu}) \leq -\infty$ for all dual feasible points — impossible unless there are NO dual feasible points. So the dual must be infeasible. Similarly, if the dual is unbounded above, the primal must be infeasible. This is the duality relationship for LPs.


Summary

Key takeaways:


Pitfalls

  1. Assuming strong duality always holds. Strong duality ($d^ = p^$) is guaranteed for convex problems under Slater's condition, but non-convex problems can have arbitrarily large duality gaps. The dual of $\min x^3 - 3x$ on $[-2,2]$ is identically $-\infty$, providing zero useful information.

  2. Forgetting the dual is always concave. The dual function $\theta(\boldsymbol{\lambda}, \boldsymbol{\nu}) = \inf_{\mathbf{x}} \mathcal{L}$ is a pointwise infimum of affine functions and therefore concave — even when the primal is non-convex. This means the dual problem is always a tractable concave maximization.

  3. Building the dual without checking whether $\theta$ is finite. If the Lagrangian is unbounded below (common with non-convex primals), $\theta(\boldsymbol{\lambda}, \boldsymbol{\nu}) = -\infty$ and the dual is useless. The dual only provides a nontrivial bound when the Lagrangian infimum is finite.

  4. Confusing the direction of weak duality. The dual provides a lower bound: $d^ \leq p^$. A common error is writing $d^ \geq p^$ or thinking the dual is an upper bound. The dual maximizes the lower bound; the primal minimizes the objective.

  5. Thinking a zero duality gap means the problem is convex. Some non-convex problems (e.g., trust-region subproblems, certain QCQPs) exhibit strong duality despite non-convexity. A zero gap is a property of the specific problem instance, not a guarantee of convexity.



Next Steps

Next up: 14-10-stochastic-nonconvex-optimization.md — stochastic approximation, non-convex convergence theory, and why deep learning works despite non-convexity.