21-05 — Exponential Family
Phase: 21 — Probability & Statistics for ML (Advanced) Subject: 21-05 Prerequisites: 21-01 (Bayesian Inference), 10-05 (Continuous Distributions), 12-01 (Maximum Likelihood Estimation), 12-03 (Sufficient Statistics), 14-06 (Convex Functions) Next subject: 21-06 — Causality and Causal Inference
Learning Objectives
By the end of this subject, you will be able to:
- Write any exponential family distribution in canonical form p(x|η) = h(x)·exp(η^T T(x) − A(η)) and identify its components
- Prove that the log-partition function A(η) generates the cumulants of the sufficient statistics: its gradient is the mean E[T(x)] and its Hessian is the covariance Cov(T(x))
- Derive the MLE for an exponential family and prove it matches the moment-matching condition: E[T(x)] = (1/n)Σ T(x_i)
- Show that all conjugate priors for exponential family likelihoods are themselves exponential family distributions
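- Explain why exponential families are central to ML: under regularity conditions (notably a support that does not depend on the parameter), they are the only families whose sufficient statistics stay finite-dimensional as the sample size grows (Pitman-Koopman-Darmois theorem)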
Core Content
1. The Canonical Form
The exponential family unifies most distributions used in ML into a single mathematical framework:
$p(x | η) = h(x) · exp(η^T T(x) − A(η)) $
where:
- η = natural (canonical) parameters
- T(x) = sufficient statistics
- h(x) = base measure (depends only on x)
- A(η) = log-partition function (normalizer): A(η) = log ∫ h(x) exp(η^T T(x)) dx
⚠️ THIS IS CRITICAL — The exponential family is the mathematical foundation of generalized linear models, maximum entropy modeling, and conjugate Bayesian inference. Understanding its structure unifies your understanding of distributions across all of ML.
2. Examples in Canonical Form
Bernoulli (coin flip):
$p(x|π) = π^x (1−π)^{1−x}
= exp(x log(π/(1−π)) + log(1−π))
$
So: η = log(π/(1−π)) (log-odds), T(x) = x, h(x) = 1, A(η) = −log(1−π) = log(1+e^η).
Gaussian (known variance σ²):
$p(x|μ) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²))
= (1/√(2πσ²)) exp(−x²/(2σ²)) · exp(μx/σ² − μ²/(2σ²))
$
So: η = μ/σ², T(x) = x, h(x) = exp(−x²/(2σ²))/√(2πσ²), A(η) = μ²/(2σ²) = σ²η²/2.
Poisson:
$p(x|λ) = λ^x e^{−λ} / x!
= (1/x!) · exp(x log λ − λ)
$
η = log λ, T(x) = x, h(x) = 1/x!, A(η) = λ = e^η.
Categorical (K classes): Uses K−1 natural parameters (one class as reference). With one-hot encoding T(x) = [I(x=1), ..., I(x=K−1)]:
η_k = log(π_k/π_K) for k=1,...,K−1. A(η) = −log π_K where π_K = 1/(1+Σ e^{η_j}).
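As a quick sanity check on the decompositions above, the following sketch evaluates h(x)·exp(η·T(x) − A(η)) for the Bernoulli, Poisson, and known-variance Gaussian cases and compares it with the textbook density. The helper name `canonical_pdf` and the parameter values are illustrative assumptions, not part of the original material.

```python
import math

def canonical_pdf(x, eta, T, h, A):
    """Evaluate h(x) * exp(eta * T(x) - A(eta)) for a scalar natural parameter."""
    return h(x) * math.exp(eta * T(x) - A(eta))

identity = lambda x: x

# Bernoulli with pi = 0.3: eta = log-odds, h(x) = 1, A(eta) = log(1 + e^eta)
pi = 0.3
eta_bern = math.log(pi / (1 - pi))
bern = [canonical_pdf(x, eta_bern, identity, lambda x: 1.0,
                      lambda e: math.log1p(math.exp(e))) for x in (0, 1)]

# Poisson with lam = 4.2: eta = log lam, h(x) = 1/x!, A(eta) = e^eta
lam = 4.2
eta_pois = math.log(lam)
pois = [canonical_pdf(x, eta_pois, identity,
                      lambda x: 1.0 / math.factorial(x), math.exp)
        for x in range(6)]
pois_direct = [lam**x * math.exp(-lam) / math.factorial(x) for x in range(6)]

# Gaussian with known variance: eta = mu / sigma^2, A(eta) = sigma^2 eta^2 / 2
mu, sigma2, x0 = 1.5, 2.0, 0.7
eta_gauss = mu / sigma2

def h_gauss(x):
    return math.exp(-x * x / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

gauss = canonical_pdf(x0, eta_gauss, identity, h_gauss,
                      lambda e: sigma2 * e * e / 2)
gauss_direct = (math.exp(-(x0 - mu) ** 2 / (2 * sigma2))
                / math.sqrt(2 * math.pi * sigma2))
```

In each case the canonical form and the direct formula should agree to floating-point precision, which is a cheap way to catch a wrong sign or a misplaced factor in h(x) or A(η).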
3. The Log-Partition Function and Moments
A(η) is a remarkable function: its derivatives give the cumulants of T(x), starting with the mean and the covariance:
$∇_η A(η) = E[T(x)] [first derivative = mean]
∇²_η A(η) = Cov(T(x)) [second derivative = covariance]
$
Proof (for first derivative):
$∇_η A(η) = ∇_η log ∫ h(x)exp(η^T T(x)) dx
= [∫ h(x) T(x) exp(η^T T(x)) dx] / [∫ h(x) exp(η^T T(x)) dx]
= ∫ T(x) · h(x)exp(η^T T(x)−A(η)) dx
= ∫ T(x) p(x|η) dx
= E[T(x)] ✓
$
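Higher cumulants: The k-th cumulant of T(x) is the k-th derivative of A(η). This is why A is called the cumulant-generating function. More precisely, E[exp(t^T T(x))] = exp(A(η+t) − A(η)), so the cumulant-generating function of T(x) is K(t) = A(η+t) − A(η), and its derivatives at t = 0 are the derivatives of A at η.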
Example: For Bernoulli, A(η) = log(1+e^η). Then A'(η) = e^η/(1+e^η) = π = E[x] ✓. A''(η) = π(1−π) = Var(x) ✓.
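The Bernoulli check can also be done numerically. This minimal sketch approximates A'(η) and A''(η) by central finite differences and compares them with π and π(1−π); the step size and the value π = 0.3 are arbitrary illustrative choices.

```python
import math

def A(eta):
    """Bernoulli log-partition function A(eta) = log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x, step=1e-4):
    """Central-difference approximation of f'(x)."""
    return (f(x + step) - f(x - step)) / (2 * step)

def second_derivative(f, x, step=1e-4):
    """Central-difference approximation of f''(x)."""
    return (f(x + step) - 2 * f(x) + f(x - step)) / step**2

pi = 0.3
eta = math.log(pi / (1 - pi))       # natural parameter for pi = 0.3

mean_fd = first_derivative(A, eta)  # should be close to pi
var_fd = second_derivative(A, eta)  # should be close to pi * (1 - pi)
```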
4. Convexity of A(η)
A(η) is STRICTLY CONVEX (its Hessian is Cov(T(x)), which is positive definite for minimal representations). This means:
- The mapping η ↔ E[T(x)] is one-to-one (∇A is invertible)
- The log-likelihood log p(x|η) = log h(x) + η^T T(x) − A(η) is CONCAVE in η (since −A is concave)
- When the MLE exists, it is unique and is given by moment matching
The convexity of A(η) makes exponential family MLE globally well-behaved — no local maxima, no pathological loss surfaces.
5. Maximum Likelihood Estimation
For i.i.d. data x₁, ..., x_n:
$log p(x₁,...,x_n | η) = Σ log h(x_i) + η^T Σ T(x_i) − n·A(η) $
Gradient condition:
$∇_η log p = Σ T(x_i) − n·E_η[T(x)] = 0 $
So the MLE satisfies:
$E_η̂[T(x)] = (1/n) Σ_{i=1}^n T(x_i)
$
Moment matching: The MLE sets the expected sufficient statistics equal to their empirical averages. For the Bernoulli: E[x] = π̂ = sample mean. For the Gaussian: E[x] = μ̂ = sample mean.
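This equation has at most one solution because A is strictly convex ⇒ ∇A is invertible ⇒ η̂ = (∇A)^{−1}((1/n)Σ T(x_i)). A solution exists whenever the empirical average lies in the interior of the set of achievable means; it fails at boundary cases such as a Bernoulli sample consisting only of ones, where the likelihood keeps increasing as η → ∞.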
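When (∇A)^{−1} has no closed form, the moment-matching equation can be solved iteratively. The sketch below uses Newton's method on the Bernoulli case, where the closed form η̂ = log(x̄/(1−x̄)) is available for comparison. The function name, the data, and the stopping rule are illustrative assumptions; plain Newton steps may need damping when x̄ is close to 0 or 1.

```python
import math

def grad_A(eta):
    """E[T(x)] for the Bernoulli: the sigmoid of eta."""
    return 1.0 / (1.0 + math.exp(-eta))

def hess_A(eta):
    """Var(T(x)) for the Bernoulli: pi * (1 - pi)."""
    p = grad_A(eta)
    return p * (1.0 - p)

def mle_natural_parameter(mean_stat, eta0=0.0, tol=1e-12, max_iter=50):
    """Solve grad_A(eta) = mean_stat by Newton's method."""
    eta = eta0
    for _ in range(max_iter):
        # Gradient of the average log-likelihood: empirical mean minus model mean.
        residual = mean_stat - grad_A(eta)
        if abs(residual) < tol:
            break
        eta += residual / hess_A(eta)
    return eta

data = [1, 0, 0, 1, 1, 0, 1, 1]          # hypothetical coin flips
x_bar = sum(data) / len(data)            # empirical mean of T(x) = x
eta_hat = mle_natural_parameter(x_bar)
closed_form = math.log(x_bar / (1 - x_bar))
```

Because the average log-likelihood is concave in η, the Newton iterate and the closed form should agree to within the tolerance.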
6. Conjugate Priors
If the likelihood is exponential family, the conjugate prior is ALSO exponential family:
$p(η | χ, ν) ∝ exp(χ^T η − ν·A(η)) $
Prior parameters:
- χ = prior pseudo-observations of sufficient statistics
- ν = prior pseudo-sample-size
Posterior update (conjugacy in action):
$p(η | x₁,...,x_n) ∝ p(x|η) · p(η|χ,ν)
∝ exp(η^T Σ T(x_i) − n·A(η)) · exp(χ^T η − ν·A(η))
∝ exp((χ + Σ T(x_i))^T η − (ν+n)·A(η))
$
Posterior has the same form with:
$χ_post = χ + Σ T(x_i)
ν_post = ν + n
$
Interpretation: The prior acts like having seen ν prior observations with sufficient statistics χ/ν. The posterior just adds the real data's sufficient statistics.
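The update rule is simple enough to express in a few lines. This sketch assumes scalar sufficient statistics and a hypothetical helper named `conjugate_update`; it also checks that processing the data one observation at a time gives the same hyperparameters as a single batch update.

```python
def conjugate_update(chi, nu, sufficient_stats):
    """Return (chi_post, nu_post) after observing a list of T(x_i) values."""
    chi_post = chi + sum(sufficient_stats)
    nu_post = nu + len(sufficient_stats)
    return chi_post, nu_post

# Bernoulli likelihood, T(x) = x. Hypothetical prior: 2 pseudo-flips, 1 pseudo-head.
prior = (1.0, 2.0)
flips = [1, 0, 1, 1, 0, 1]

# Batch update: add all sufficient statistics at once.
posterior = conjugate_update(prior[0], prior[1], flips)

# Sequential update: yesterday's posterior is today's prior.
sequential = prior
for x in flips:
    sequential = conjugate_update(sequential[0], sequential[1], [x])
```

Both routes end at χ_post = 5 and ν_post = 8 here, which reflects that the posterior depends on the data only through Σ T(x_i) and n.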
7. Maximum Entropy Principle
Among all distributions with a given set of expected sufficient statistics E[T(x)] = μ, the exponential family distribution maximizes entropy:
$max_p −∫ p(x) log p(x) dx subject to E_p[T(x)] = μ, ∫ p(x)dx = 1 $
Solution: p(x) ∝ exp(η^T T(x)) — exactly the exponential family. The Lagrange multipliers η are chosen to satisfy the moment constraints.
Why this matters: If all you know about a distribution is its mean and variance, the maximum-entropy distribution is Gaussian. If all you know is the mean of a non-negative variable, it's Exponential. The exponential family is the LEAST COMMITTAL distribution consistent with the specified constraints — it assumes nothing beyond the given moments.
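To make the Gaussian claim concrete, the sketch below compares the differential entropy of three distributions that share the same variance, using their standard closed-form entropies (these formulas are not derived in this subject and are quoted as known results). The choice of Laplace and uniform as competitors is an illustrative assumption.

```python
import math

sigma2 = 1.0  # common variance for all three distributions

# Gaussian: h = 0.5 * log(2 * pi * e * sigma^2)
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma2)

# Laplace with scale b has variance 2 b^2 and entropy 1 + log(2 b)
b = math.sqrt(sigma2 / 2)
h_laplace = 1 + math.log(2 * b)

# Uniform of width w has variance w^2 / 12 and entropy log(w)
w = math.sqrt(12 * sigma2)
h_uniform = math.log(w)

entropies = {"gaussian": h_gauss, "laplace": h_laplace, "uniform": h_uniform}
```

With the variance fixed, the Gaussian should come out on top, consistent with it being the maximum-entropy distribution under mean and variance constraints.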
8. Generalized Linear Models (GLMs)
GLMs model a response y given predictors x through three components:
- Random component: y | x ~ ExponentialFamily(η)
- Systematic component: η = θ^T x (linear predictor)
- Link function: g(E[y|x]) = η (connects mean to linear predictor)
Canonical link: g = (∇A)^{−1}, which maps the mean E[y|x] back to the natural parameter η (∇A itself maps η to the mean). With canonical links:
$∇_θ log p(y|x,θ) = (y − E[y]) · x $
The gradient is prediction error × features — the same simple form across ALL exponential family GLMs. This is why logistic regression (Bernoulli + logit link), Poisson regression (Poisson + log link), and linear regression (Gaussian + identity link) share the same IRLS (Iteratively Reweighted Least Squares) optimization algorithm.
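The "prediction error × features" form can be checked numerically. This sketch does so for Poisson regression with the canonical log link, comparing the analytic gradient (y − e^{θ^T x})·x with a finite-difference gradient of the log-likelihood. The function names and the values of θ, x, and y are illustrative assumptions.

```python
import math

def linear_predictor(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def poisson_log_lik(theta, x, y):
    """log p(y | x, theta) with eta = theta^T x and A(eta) = e^eta."""
    eta = linear_predictor(theta, x)
    return y * eta - math.exp(eta) - math.lgamma(y + 1)

def poisson_grad(theta, x, y):
    """Analytic gradient: (y - E[y|x]) * x, with E[y|x] = e^eta."""
    mean = math.exp(linear_predictor(theta, x))
    return [(y - mean) * xi for xi in x]

def numeric_grad(f, theta, step=1e-6):
    """Central-difference gradient of f at theta."""
    grads = []
    for j in range(len(theta)):
        up = theta[:j] + [theta[j] + step] + theta[j + 1:]
        down = theta[:j] + [theta[j] - step] + theta[j + 1:]
        grads.append((f(up) - f(down)) / (2 * step))
    return grads

theta = [0.2, -0.1, 0.4]   # hypothetical coefficients
x = [1.0, 2.0, -0.5]       # hypothetical feature vector
y = 3                      # hypothetical observed count

analytic = poisson_grad(theta, x, y)
numeric = numeric_grad(lambda t: poisson_log_lik(t, x, y), theta)
```

Swapping in the Bernoulli log-partition log(1+e^η) and its derivative would give the logistic regression version with no other changes to the gradient formula.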
Worked Examples
Example 1: Writing Gamma in Canonical Form
Problem: The Gamma(α, β) density is p(x|α,β) = (β^α/Γ(α)) x^{α−1} e^{−βx} for x > 0. Write this in exponential family canonical form with natural parameters.
Solution:
$p(x|α,β) = exp(α log β − log Γ(α) + (α−1)log x − βx)
= exp([α−1, −β]^T [log x, x] − (log Γ(α) − α log β))
$
Natural parameters: η = [α−1, −β]^T
Sufficient statistics: T(x) = [log x, x]^T
Base measure: h(x) = 1 (for x > 0)
Log-partition: A(η) = log Γ(η₁+1) − (η₁+1)log(−η₂)
This shows the Gamma is a two-parameter exponential family with sufficient statistics (log x, x).
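A short numerical check of this decomposition, assuming the shape/rate parameterization used in the problem statement: the canonical form with η = [α−1, −β] should reproduce the usual Gamma density. The function names and test points are illustrative.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Gamma(alpha, beta) density in the shape/rate form from the problem."""
    return beta**alpha / math.gamma(alpha) * x**(alpha - 1) * math.exp(-beta * x)

def gamma_canonical(x, eta1, eta2):
    """h(x) * exp(eta^T T(x) - A(eta)) with T(x) = [log x, x] and h(x) = 1."""
    log_partition = math.lgamma(eta1 + 1) - (eta1 + 1) * math.log(-eta2)
    return math.exp(eta1 * math.log(x) + eta2 * x - log_partition)

alpha, beta = 2.5, 1.5                 # hypothetical parameter values
eta1, eta2 = alpha - 1, -beta          # natural parameters

points = [0.3, 1.0, 4.0]
direct = [gamma_pdf(x, alpha, beta) for x in points]
canonical = [gamma_canonical(x, eta1, eta2) for x in points]
```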
Example 2: MLE via Moment Matching
Problem: For n i.i.d. Poisson(λ) observations, use the exponential family MLE condition to find λ̂.
Solution:
Poisson in exponential form: η = log λ, T(x) = x, A(η) = e^η.
Moment matching condition:
$E_η̂[T(x)] = (1/n) Σ T(x_i)
⇒ λ̂ = (1/n) Σ x_i = x̄
$
So λ̂ = x̄. More formally: η̂ = log λ̂ satisfies A'(η̂) = e^{η̂} = λ̂ = x̄. The moment matching equation directly gives the MLE.
For 5 observations [2, 5, 3, 7, 4]: λ̂ = (2+5+3+7+4)/5 = 21/5 = 4.2.
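The same computation in code, with a check that the log-likelihood at λ̂ is higher than at nearby values. The two comparison points 4.0 and 4.4 are arbitrary illustrative choices.

```python
import math

data = [2, 5, 3, 7, 4]

# Moment matching: lambda_hat equals the sample mean of T(x) = x.
lam_hat = sum(data) / len(data)
eta_hat = math.log(lam_hat)      # corresponding natural parameter

def log_lik(lam):
    """Poisson log-likelihood of the data at rate lam."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in data)

at_mle = log_lik(lam_hat)
below = log_lik(4.0)
above = log_lik(4.4)
```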
Example 3: Conjugate Prior for Poisson
Problem: Derive the conjugate prior for the Poisson likelihood in terms of the prior pseudo-count χ and pseudo-sample-size ν. Express the posterior after observing n=10 with Σx_i = 42, given prior χ=5, ν=2.
Solution:
Poisson: p(x|λ) = (1/x!) exp(x log λ − λ). η = log λ, T(x) = x, A(η) = e^η = λ.
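Conjugate prior for η: p(η|χ,ν) ∝ exp(χ·η − ν·e^η)
In terms of λ = e^η, the change of variables contributes a Jacobian factor |dη/dλ| = 1/λ: p(λ|χ,ν) ∝ λ^χ e^{−νλ} · (1/λ) = λ^{χ−1} e^{−νλ}
This is Gamma(χ, ν) (shape χ, rate ν). Prior mean: χ/ν, which matches the interpretation of χ/ν as the prior average of the sufficient statistic.
Given prior χ=5, ν=2: prior ~ Gamma(5, 2), mean = 2.5.
Posterior after seeing Σx_i = 42, n = 10:
$χ_post = 5 + 42 = 47
ν_post = 2 + 10 = 12
$
Posterior ~ Gamma(47, 12), mean = 47/12 ≈ 3.92. The data (mean 4.2) pulled the estimate up from the prior mean of 2.5.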
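As a complement, this sketch applies the hyperparameter update and then locates the mode of the posterior over the natural parameter η by bisection on the slope of the log-kernel χ_post·η − ν_post·e^η. Note that this is the mode in η-space, which is a different summary from the posterior mean of λ; the bracketing interval is an illustrative assumption.

```python
import math

chi, nu = 5, 2            # prior hyperparameters from the example
n, sum_x = 10, 42         # sample size and total count from the example

chi_post, nu_post = chi + sum_x, nu + n

def log_posterior_slope(eta):
    """Derivative in eta of chi_post * eta - nu_post * exp(eta)."""
    return chi_post - nu_post * math.exp(eta)

# The log-kernel is concave in eta, so its slope crosses zero exactly once.
lo, hi = -5.0, 5.0        # slope is positive at lo and negative at hi
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if log_posterior_slope(mid) > 0:
        lo = mid
    else:
        hi = mid

eta_mode = 0.5 * (lo + hi)
lam_at_mode = math.exp(eta_mode)   # equals chi_post / nu_post
```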
Quiz
Q1: What does the concept of the exponential family primarily refer to in this subject?
A) A visual representation of the exponential family
B) A historical anecdote about the exponential family
C) The definition and application of the exponential family
D) A computational error related to the exponential family
Correct: C)
- If you chose A: This is incorrect. The exponential family is defined as: the definition and application of the exponential family. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. The exponential family is defined as: the definition and application of the exponential family. The other options describe different aspects that are not the primary focus.
- If you chose C: The exponential family is defined as: the definition and application of the exponential family. The other options describe different aspects that are not the primary focus. Correct!
- If you chose D: This is incorrect. The exponential family is defined as: the definition and application of the exponential family. The other options describe different aspects that are not the primary focus.
Q2: What is the primary purpose of conjugate priors?
A) They are used only in advanced research contexts
B) They keep the posterior in the same family as the prior, so Bayesian updating reduces to updating the hyperparameters (χ, ν)
C) They are primarily a historical notation system
D) They replace all other methods in this domain
Correct: B)
- If you chose A: This is incorrect. A conjugate prior p(η|χ,ν) ∝ exp(χ^T η − ν·A(η)) yields a posterior of the same form with χ_post = χ + Σ T(x_i) and ν_post = ν + n. The other options misrepresent its role.
- If you chose B: Correct! A conjugate prior p(η|χ,ν) ∝ exp(χ^T η − ν·A(η)) yields a posterior of the same form with χ_post = χ + Σ T(x_i) and ν_post = ν + n.
- If you chose C: This is incorrect. A conjugate prior p(η|χ,ν) ∝ exp(χ^T η − ν·A(η)) yields a posterior of the same form with χ_post = χ + Σ T(x_i) and ν_post = ν + n. The other options misrepresent its role.
- If you chose D: This is incorrect. A conjugate prior p(η|χ,ν) ∝ exp(χ^T η − ν·A(η)) yields a posterior of the same form with χ_post = χ + Σ T(x_i) and ν_post = ν + n. The other options misrepresent its role.
Q3: Which statement about the exponential family is TRUE?
A) Exponential family is a fundamental concept covered in this subject
B) Exponential family is not related to this subject
C) Exponential family is an advanced topic beyond this subject's scope
D) Exponential family is mentioned only as a historical footnote
Correct: A)
- If you chose A: Exponential family is a fundamental concept covered in this subject. This subject covers Exponential family as part of its core content. Correct!
- If you chose B: This is incorrect. Exponential family is a fundamental concept covered in this subject. This subject covers Exponential family as part of its core content.
- If you chose C: This is incorrect. Exponential family is a fundamental concept covered in this subject. This subject covers Exponential family as part of its core content.
- If you chose D: This is incorrect. Exponential family is a fundamental concept covered in this subject. This subject covers Exponential family as part of its core content.
Q4: In Practice Problem 1, which natural parameter multiplies the sufficient statistic log(1−x) in the canonical form of Beta(α, β)?
A) α+β
B) β−1
C) 1/(β−1)
D) β
Correct: B)
- If you chose A: This is incorrect. The factor (1−x)^{β−1} = exp((β−1)log(1−x)) shows that the natural parameter is η₂ = β−1.
- If you chose B: Correct! The factor (1−x)^{β−1} = exp((β−1)log(1−x)) shows that the natural parameter is η₂ = β−1.
- If you chose C: This is incorrect. The factor (1−x)^{β−1} = exp((β−1)log(1−x)) shows that the natural parameter is η₂ = β−1, not its reciprocal.
- If you chose D: This is incorrect. The factor (1−x)^{β−1} = exp((β−1)log(1−x)) shows that the natural parameter is η₂ = β−1; forgetting the −1 is a common mistake.
Q5: How are the exponential family and the natural parameter related?
A) Exponential family and Natural parameter are closely related concepts
B) Exponential family is a special case of Natural parameter
C) Exponential family and Natural parameter are completely unrelated topics
D) Exponential family is the inverse of Natural parameter
Correct: A)
- If you chose A: Both Exponential family and Natural parameter are covered in this subject as interconnected topics. Correct!
- If you chose B: This is incorrect. Both Exponential family and Natural parameter are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Exponential family and Natural parameter are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Exponential family and Natural parameter are covered in this subject as interconnected topics.
Q6: What is a common pitfall when working with sufficient statistics?
A) A common mistake is confusing Sufficient statistic with a similar concept
B) The main error with Sufficient statistic is using it when it is not needed
C) Sufficient statistic is always computed the same way in all contexts
D) Sufficient statistic has no common misconceptions
Correct: A)
- If you chose A: Students often confuse Sufficient statistic with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose B: This is incorrect. Students often confuse Sufficient statistic with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse Sufficient statistic with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse Sufficient statistic with similar-sounding or related concepts. Pay attention to the precise definitions.
Q7: When should you apply the log-partition function?
A) Avoid Log-partition function unless explicitly instructed
B) Use Log-partition function only in pure mathematics contexts
C) Apply Log-partition function to solve problems in this subject's domain
D) Log-partition function is not practically useful
Correct: C)
- If you chose A: This is incorrect. Log-partition function is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: This is incorrect. Log-partition function is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: Log-partition function is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose D: This is incorrect. Log-partition function is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
Problem 1
Write the Beta(α, β) distribution in exponential family canonical form. What are the sufficient statistics?
Answer
$p(x|α,β) = [Γ(α+β)/(Γ(α)Γ(β))] x^{α−1} (1−x)^{β−1}
= exp((α−1)log x + (β−1)log(1−x) − [log Γ(α)+log Γ(β)−log Γ(α+β)])
$
Natural parameters: η₁ = α−1, η₂ = β−1
Sufficient statistics: T(x) = [log x, log(1−x)]^T
Base measure: h(x) = 1
A(η) = log Γ(η₁+1) + log Γ(η₂+1) − log Γ(η₁+η₂+2)
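The sufficient statistics for the Beta are (log x, log(1−x)): when the Beta is used as a likelihood for data in (0, 1), the sample enters only through Σ log x_i and Σ log(1−x_i). This is distinct from using the Beta as a conjugate prior for a Bernoulli likelihood, where the update adds success and failure counts.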
Problem 2
Prove that the Hessian of A(η) equals Cov(T(x)). Use this to prove A is convex.
Answer
First derivative (shown in Section 3):
$∂A/∂η_j = E[T_j(x)] = ∫ T_j(x) h(x) exp(η^T T(x) − A(η)) dx
$
Second derivative: differentiate under the integral sign, using ∂/∂η_k exp(η^T T(x) − A(η)) = (T_k(x) − ∂A/∂η_k) exp(η^T T(x) − A(η)) and ∂A/∂η_k = E[T_k]:
$∂²A/∂η_j∂η_k = ∫ T_j(x)(T_k(x) − E[T_k]) p(x|η) dx
= E[T_j T_k] − E[T_j]E[T_k]
= Cov(T_j, T_k)
$
Convexity: for any vector v, v^T ∇²A(η) v = v^T Cov(T(x)) v = Var(v^T T(x)) ≥ 0, so ∇²A is positive semidefinite and A is convex. For a minimal representation, no nonzero v makes v^T T(x) almost surely constant, so the variance is strictly positive, ∇²A is positive definite, and A is strictly convex. ✓
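The identity ∇²A = Cov(T(x)) can be checked numerically. This sketch uses the Categorical family with K = 3 (two natural parameters, A(η) = log(1 + e^{η₁} + e^{η₂})), where the covariance of the one-hot statistics is diag(p) − p p^T. The finite-difference helper and the value of η are illustrative assumptions.

```python
import math

def A(eta):
    """Log-partition of a K-class Categorical with K-1 natural parameters."""
    return math.log(1.0 + sum(math.exp(e) for e in eta))

def probs(eta):
    """Class probabilities pi_1..pi_{K-1}; the reference class is omitted."""
    z = 1.0 + sum(math.exp(e) for e in eta)
    return [math.exp(e) / z for e in eta]

def covariance(eta):
    """Cov(T(x)) for one-hot T: diag(p) - p p^T."""
    p = probs(eta)
    n = len(p)
    return [[(p[j] if j == k else 0.0) - p[j] * p[k] for k in range(n)]
            for j in range(n)]

def shifted(eta, j, dj, k, dk):
    out = list(eta)
    out[j] += dj
    out[k] += dk
    return out

def numeric_hessian(f, eta, step=1e-4):
    """Central-difference Hessian of f at eta."""
    n = len(eta)
    H = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            H[j][k] = (f(shifted(eta, j, step, k, step))
                       - f(shifted(eta, j, step, k, -step))
                       - f(shifted(eta, j, -step, k, step))
                       + f(shifted(eta, j, -step, k, -step))) / (4 * step**2)
    return H

eta = [0.5, -1.0]                       # hypothetical natural parameters
cov = covariance(eta)
hess = numeric_hessian(A, eta)
max_gap = max(abs(hess[j][k] - cov[j][k]) for j in range(2) for k in range(2))
determinant = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
```

A positive diagonal entry together with a positive determinant confirms that this 2×2 covariance is positive definite, as expected for a minimal representation.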
Problem 3
Show that multiplying an exponential family likelihood by its conjugate prior gives a posterior in the same family as the prior, whereas the product of two exponential family densities in x generally leaves both original families.
Answer
Exponential family likelihood: p(x|η) = h(x) exp(η^T T(x) − A(η))
Conjugate prior: p(η|χ,ν) = g(χ,ν) exp(χ^T η − ν A(η))
Product (posterior kernel):
$p(η|x,χ,ν) ∝ exp(η^T T(x) − A(η)) · exp(χ^T η − ν A(η))
= exp((χ+T(x))^T η − (ν+1)A(η))
$
This has the SAME functional form in η as the prior, with updated hyperparameters χ + T(x) and ν + 1. ✓
Now take the product of two exponential family densities in x (with parameters η₁, η₂):
$p₁(x|η₁) p₂(x|η₂) = h₁(x)h₂(x) exp(η₁^T T₁(x) + η₂^T T₂(x) − A₁(η₁) − A₂(η₂))
$
As a function of x this is an unnormalized density with base measure h₁(x)h₂(x) and stacked sufficient statistic [T₁(x), T₂(x)]. Where it can be normalized it is again of exponential family form, but in general it belongs to neither original family. For example, a Gaussian density (statistics x, x²) times a Gamma density (statistics log x, x) has statistics (log x, x, x²) on x > 0, which is neither Gaussian nor Gamma. The product stays in the same family only when T₁ and T₂ span the same space and the base measures are compatible, as with two Gaussians.
Problem 4
Using the maximum entropy principle, derive the Gaussian distribution as the maximum-entropy distribution with fixed mean μ and variance σ².
Answer
Constraints:
1. ∫ p(x) dx = 1
2. ∫ x p(x) dx = μ
3. ∫ (x−μ)² p(x) dx = σ², equivalently ∫ x² p(x) dx = μ² + σ²
Lagrangian:
$L = −∫ p(x)log p(x)dx + λ₀(∫ p(x)dx − 1) + λ₁(∫ x p(x)dx − μ) + λ₂(∫ x² p(x)dx − (μ²+σ²))
$
Functional derivative w.r.t. p(x):
$δL/δp = −log p(x) − 1 + λ₀ + λ₁x + λ₂x² = 0
$
So p(x) = exp(λ₀ − 1 + λ₁x + λ₂x²) ∝ exp(λ₁x + λ₂x²). Normalizability requires λ₂ < 0, and completing the square shows the constraints are met by λ₂ = −1/(2σ²) and λ₁ = μ/σ², giving p(x) ∝ exp(−(x−μ)²/(2σ²)), the Gaussian N(μ, σ²). The Gaussian is the distribution that assumes NOTHING beyond mean and variance: it maximizes entropy subject to these constraints. Any other distribution satisfying the same constraints has lower entropy, implying knowledge we don't have.
Problem 5
Explain why logistic regression is a GLM and identify its three components (random, systematic, link).
Answer
1. **Random component:** y | x ~ Bernoulli(π(x)). Bernoulli is exponential family with η = log(π/(1−π)), T(y) = y, A(η) = log(1+e^η).
2. **Systematic component:** η = β^T x (linear in parameters). The log-odds are a linear function of the predictors.
3. **Link function:** g(π) = log(π/(1−π)) = η. The logit link maps the mean π to the natural parameter η. This is the CANONICAL link because g = (∇A)^{−1}: ∇A(η) = e^η/(1+e^η) = π, so (∇A)^{−1}(π) = log(π/(1−π)) = g(π).
With the canonical link, the gradient simplifies to:
$∇_β log p(y|x,β) = (y − π(x)) · x
$
The update is proportional to (actual − predicted) × features, the same intuitive form as linear regression. This is the hallmark of GLMs with canonical links.
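The three components translate directly into a small fitting loop. The sketch below fits logistic regression by plain gradient ascent using the (y − π(x))·x gradient. The toy dataset, step size, and iteration count are illustrative assumptions, and gradient ascent is used here for brevity rather than IRLS.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_lik(beta, X, y):
    """Sum over the data of y * eta - log(1 + e^eta), with eta = beta^T x."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        total += yi * eta - math.log1p(math.exp(eta))
    return total

def gradient(beta, X, y):
    """Sum over the data of (y - pi(x)) * x."""
    g = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        pi = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        for j, v in enumerate(xi):
            g[j] += (yi - pi) * v
    return g

# Hypothetical data: first column is the intercept; classes overlap,
# so the maximum likelihood estimate is finite.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]

beta = [0.0, 0.0]
step_size = 0.05
history = [log_lik(beta, X, y)]
for _ in range(5000):
    g = gradient(beta, X, y)
    beta = [b + step_size * gj for b, gj in zip(beta, g)]
    history.append(log_lik(beta, X, y))

final_gradient = gradient(beta, X, y)
```

Because the log-likelihood is concave in β, the recorded values in `history` should never decrease for a small enough step size, and the final gradient should be close to zero.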
Summary
- The exponential family p(x|η) = h(x)exp(η^T T(x) − A(η)) unifies Bernoulli, Gaussian, Poisson, Gamma, Beta, Dirichlet, Categorical, and more under one framework
- The log-partition function A(η) is the cumulant-generating function — its gradient gives E[T(x)] and its Hessian gives Cov(T(x))
- MLE reduces to moment matching: E_η̂[T(x)] = (1/n)Σ T(x_i) — the expected sufficient statistics match their empirical averages
- Conjugate priors are also exponential family: Bayesian updating is simple addition of the data's sufficient statistics and sample size to the prior hyperparameters (χ, ν)
- GLMs combine exponential family likelihoods with linear predictors — logistic, Poisson, and linear regression share the same IRLS solver
Pitfalls
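- Assuming every distribution is in the exponential family. Student's t, Cauchy, uniform with unknown endpoints, mixture distributions, and many others are NOT exponential family. They do not share the nice properties (concave log-likelihood in η, moment-matching MLE, conjugate priors in the standard form), and most of them lack a finite-dimensional sufficient statistic; the uniform does have one (the sample minimum and maximum) but is excluded because its support depends on the parameter. Always verify membership before applying exponential family properties: under regularity conditions such as a parameter-independent support, the Pitman-Koopman-Darmois theorem says that only exponential families admit sufficient statistics whose dimension stays bounded as the sample grows.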
- Forgetting that A(η) must be finite. The log-partition function A(η) = log ∫ h(x)exp(η^T T(x))dx must be finite for the density to be normalizable. The set of η where A(η) < ∞ is called the natural parameter space. For the Gamma distribution, η₂ (related to −β) must be negative. Trying to evaluate or optimize at η values outside the natural parameter space leads to undefined densities and numerical errors.
- Confusing the mean parameterization with the natural parameterization. The mean parameter μ = E[T(x)] and the natural parameter η are related by μ = ∇A(η), but they parameterize the distribution differently. The log-likelihood is concave in η (natural parameters) but NOT necessarily concave in μ. Because η ↔ μ is a smooth one-to-one map, the maximizer is the same in both parameterizations and no extra local maxima appear, but in μ-space you lose the concavity guarantees that gradient and Newton methods rely on. Optimize in η-space when possible.
- Assuming conjugacy means any prior works with any likelihood. Conjugacy is a specific algebraic relationship: the prior must have the form p(η) ∝ exp(χ^T η − ν A(η)) to be conjugate to the exponential family likelihood p(x|η). A Gamma prior is conjugate to a Poisson likelihood but NOT to a Beta likelihood, whose log-partition function is different. Check that the prior's sufficient statistics in η-space are (η, −A(η)), where A is the likelihood's log-partition function.
- Applying the maximum entropy principle without verifying constraints. The maximum-entropy distribution given E[T(x)] = μ is exponential family — but ONLY if those constraints are correct. If you constrain the wrong moments (e.g., mean and variance when the data is heavy-tailed), the resulting Gaussian will be a poor model. The max-ent principle gives the LEAST committal distribution given the constraints — it doesn't guarantee the constraints are the right ones. Validate your choice of constraints against the data.
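The natural parameter space pitfall is easy to guard against in code. This minimal sketch, using a hypothetical helper for the Gamma family, returns +∞ for A(η) outside the region η₁ > −1, η₂ < 0 (that is, α > 0 and β > 0) so that callers can detect an invalid parameter instead of hitting a math domain error.

```python
import math

def gamma_log_partition(eta1, eta2):
    """A(eta) for the Gamma family; finite only when eta1 > -1 and eta2 < 0."""
    if eta1 <= -1 or eta2 >= 0:
        return math.inf   # outside the natural parameter space
    return math.lgamma(eta1 + 1) - (eta1 + 1) * math.log(-eta2)

inside = gamma_log_partition(1.5, -2.0)    # alpha = 2.5, beta = 2.0
outside = gamma_log_partition(1.5, 0.3)    # eta2 > 0: not normalizable
expected_inside = math.lgamma(2.5) - 2.5 * math.log(2.0)
```

An optimizer working in η-space can use such a check to reject or shrink steps that leave the natural parameter space.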
Key Terms
| Term | Definition |
|---|---|
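| Exponential family | p(x\|η) = h(x)exp(η^T T(x) − A(η)), the canonical form shared by Bernoulli, Gaussian, Poisson, Gamma, Beta, and others |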
| Natural parameter | η — the canonical parameterization; MLE and Bayesian updating are simplest in this form |
| Sufficient statistic | T(x) — captures ALL information about η in the data; (1/n)Σ T(x_i) is a sufficient statistic for η |
| Log-partition function | A(η) = log ∫ h(x)exp(η^T T(x))dx — cumulant-generating function; its derivatives give moments of T(x) |
| Convexity | A(η) is strictly convex, so MLE is unique and log-likelihood is concave |
| Moment matching | MLE sets E_η̂[T(x)] = (1/n)Σ T(x_i) — a single equation that characterizes the solution |
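| Conjugate prior | p(η\|χ,ν) ∝ exp(χ^T η − ν·A(η)); the posterior has the same form with χ_post = χ + Σ T(x_i) and ν_post = ν + n |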
| Maximum entropy | Among all distributions matching specified moments, the exponential family maximizes entropy |
| GLM | Generalized Linear Model: y\|x ~ ExponentialFamily(η) with linear predictor η = θ^T x and link function g(E[y\|x]) = η |
Next Steps
Continue to 21-06 — Causality and Causal Inference to learn the mathematical framework for distinguishing correlation from causation — do-calculus, structural causal models, and counterfactual reasoning.