
21-05 — Exponential Family

Phase: 21, Probability & Statistics for ML (Advanced)
Subject: 21-05
Prerequisites: 21-01 (Bayesian Inference), 10-05 (Continuous Distributions), 12-01 (Maximum Likelihood Estimation), 12-03 (Sufficient Statistics), 14-06 (Convex Functions)
Next subject: 21-06, Causality and Causal Inference


Learning Objectives

By the end of this subject, you will be able to:

  1. Write any exponential family distribution in canonical form p(x|η) = h(x)·exp(η^T T(x) − A(η)) and identify its components
  2. Prove that the log-partition function A(η) generates the cumulants of the sufficient statistics: its gradient is E[T(x)] and its Hessian is Cov(T(x))
  3. Derive the MLE for an exponential family and prove it matches the moment-matching condition: E[T(x)] = (1/n)Σ T(x_i)
  4. Show that all conjugate priors for exponential family likelihoods are themselves exponential family distributions
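  5. Explain why exponential families are central to ML: under regularity conditions (i.i.d. sampling, support that does not depend on the parameter), they are the ONLY families whose sufficient statistics stay finite-dimensional as the sample size grows (Pitman-Koopman-Darmois theorem)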

Core Content

1. The Canonical Form

The exponential family unifies most distributions used in ML into a single mathematical framework:

$p(x | η) = h(x) · exp(η^T T(x) − A(η))
$

where:
  - η = natural (canonical) parameters
  - T(x) = sufficient statistics
  - h(x) = base measure (depends only on x)
  - A(η) = log-partition function (normalizer): A(η) = log ∫ h(x) exp(η^T T(x)) dx

⚠️ THIS IS CRITICAL — The exponential family is the mathematical foundation of generalized linear models, maximum entropy modeling, and conjugate Bayesian inference. Understanding its structure unifies your understanding of distributions across all of ML.


2. Examples in Canonical Form

Bernoulli (coin flip):

$p(x|π) = π^x (1−π)^{1−x}
       = exp(x log(π/(1−π)) + log(1−π))
$

So: η = log(π/(1−π)) (log-odds), T(x) = x, h(x) = 1, A(η) = −log(1−π) = log(1+e^η).

Gaussian (known variance σ²):

$p(x|μ) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²))
       = (1/√(2πσ²)) exp(−x²/(2σ²)) · exp(μx/σ² − μ²/(2σ²))
$

So: η = μ/σ², T(x) = x, h(x) = exp(−x²/(2σ²))/√(2πσ²), A(η) = μ²/(2σ²) = σ²η²/2.

Poisson:

$p(x|λ) = λ^x e^{−λ} / x!
       = (1/x!) · exp(x log λ − λ)
$

η = log λ, T(x) = x, h(x) = 1/x!, A(η) = λ = e^η.
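The Bernoulli and Poisson decompositions above can be checked numerically. The following is a minimal sketch, using only the Python standard library; the function names are illustrative and not part of the original material. It evaluates h(x)·exp(η·T(x) − A(η)) and compares it with the textbook pmf.

```python
import math

def exp_family_pmf(x, eta, T, h, A):
    """Evaluate h(x) * exp(eta * T(x) - A(eta)) for a scalar natural parameter."""
    return h(x) * math.exp(eta * T(x) - A(eta))

def bernoulli_canonical(x, pi):
    # eta = log-odds, T(x) = x, h(x) = 1, A(eta) = log(1 + e^eta)
    eta = math.log(pi / (1.0 - pi))
    return exp_family_pmf(x, eta, lambda v: v, lambda v: 1.0,
                          lambda e: math.log1p(math.exp(e)))

def poisson_canonical(x, lam):
    # eta = log(lambda), T(x) = x, h(x) = 1/x!, A(eta) = e^eta
    eta = math.log(lam)
    return exp_family_pmf(x, eta, lambda v: v,
                          lambda v: 1.0 / math.factorial(v), math.exp)

def bernoulli_standard(x, pi):
    return pi ** x * (1.0 - pi) ** (1 - x)

def poisson_standard(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

# Both parameterizations should agree up to floating-point error.
bernoulli_gap = abs(bernoulli_canonical(1, 0.3) - bernoulli_standard(1, 0.3))
poisson_gap = abs(poisson_canonical(3, 2.5) - poisson_standard(3, 2.5))
```

Passing T, h, and A as functions keeps the evaluator generic, so the same three ingredients are all that change from one family to the next.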

Categorical (K classes): Uses K−1 natural parameters (one class as reference). With one-hot encoding T(x) = [I(x=1), ..., I(x=K−1)]:

η_k = log(π_k/π_K) for k=1,...,K−1. A(η) = −log π_K where π_K = 1/(1+Σ e^{η_j}).
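As a rough illustration of the reference-class parameterization, the sketch below maps K−1 natural parameters to K probabilities and back. The helper names are hypothetical, and the example probabilities are made up for the demonstration.

```python
import math

def categorical_from_natural(eta):
    """Map K-1 natural parameters (class K as reference) to K probabilities."""
    # A(eta) = log(1 + sum_j exp(eta_j)) = -log(pi_K)
    log_partition = math.log1p(sum(math.exp(e) for e in eta))
    probs = [math.exp(e - log_partition) for e in eta]
    probs.append(math.exp(-log_partition))  # reference class K
    return probs, log_partition

def natural_from_categorical(probs):
    """Inverse map: eta_k = log(pi_k / pi_K) for k = 1..K-1."""
    return [math.log(p / probs[-1]) for p in probs[:-1]]

pi = [0.2, 0.5, 0.3]                      # illustrative class probabilities
eta = natural_from_categorical(pi)
recovered, A = categorical_from_natural(eta)
```

The round trip recovers the original probabilities, and A equals −log π_K, which here is −log 0.3.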


3. The Log-Partition Function and Moments

A(η) is a remarkable function: its derivatives give the cumulants of T(x), starting with the mean and the covariance:

$∇_η A(η) = E[T(x)]                    [first derivative = mean]
∇²_η A(η) = Cov(T(x))                [second derivative = covariance]
$

Proof (for first derivative):

$∇_η A(η) = ∇_η log ∫ h(x)exp(η^T T(x)) dx
         = [∫ h(x) T(x) exp(η^T T(x)) dx] / [∫ h(x) exp(η^T T(x)) dx]
         = ∫ T(x) · h(x)exp(η^T T(x)−A(η)) dx
         = ∫ T(x) p(x|η) dx
         = E[T(x)]  ✓
$

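Higher cumulants: The k-th cumulant of T(x) is the k-th derivative of A(η). More precisely, the cumulant-generating function of T(x) under p(x|η) is K(t) = log E[exp(t^T T(x))] = A(η + t) − A(η), so differentiating K at t = 0 is the same as differentiating A at η. This is why A is called the cumulant-generating function. Note that cumulants coincide with central moments only up to third order.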

Example: For Bernoulli, A(η) = log(1+e^η). Then A'(η) = e^η/(1+e^η) = π = E[x] ✓. A''(η) = π(1−π) = Var(x) ✓.
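The Bernoulli identities can also be checked without any calculus by differentiating A numerically. This is a small sketch using central finite differences; the step sizes are arbitrary choices, not values from the original text.

```python
import math

def A_bernoulli(eta):
    """Log-partition function of the Bernoulli: log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x, step=1e-5):
    # Central difference approximation of f'(x)
    return (f(x + step) - f(x - step)) / (2.0 * step)

def second_derivative(f, x, step=1e-4):
    # Central difference approximation of f''(x)
    return (f(x + step) - 2.0 * f(x) + f(x - step)) / (step * step)

pi = 0.3
eta = math.log(pi / (1.0 - pi))                  # natural parameter (log-odds)
mean_from_A = first_derivative(A_bernoulli, eta)   # should approximate pi
var_from_A = second_derivative(A_bernoulli, eta)   # should approximate pi * (1 - pi)
```

With π = 0.3 the two approximations should land close to 0.3 and 0.21 respectively.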


4. Convexity of A(η)

A(η) is STRICTLY CONVEX (its Hessian is Cov(T(x)), which is positive definite for minimal representations). This means:

  1. The mapping η ↔ E[T(x)] is one-to-one (∇A is invertible)
  2. The log-likelihood log p(x|η) = log h(x) + η^T T(x) − A(η) is CONCAVE in η (since −A is concave)
  3. The MLE is unique whenever it exists, and it is given by moment matching

The convexity of A(η) makes exponential family MLE globally well-behaved — no local maxima, no pathological loss surfaces.


5. Maximum Likelihood Estimation

For i.i.d. data x₁, ..., x_n:

$log p(x₁,...,x_n | η) = Σ log h(x_i) + η^T Σ T(x_i) − n·A(η)
$

Gradient condition:

$∇_η log p = Σ T(x_i) − n·E_η[T(x)] = 0
$

So the MLE satisfies:

$E_η̂[T(x)] = (1/n) Σ_{i=1}^n T(x_i)
$

Moment matching: The MLE sets the expected sufficient statistics equal to their empirical averages. For the Bernoulli: E[x] = π̂ = sample mean. For the Gaussian: E[x] = μ̂ = sample mean.

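This equation has at most one solution because A is strictly convex ⇒ ∇A is invertible on its range ⇒ η̂ = (∇A)^{−1}((1/n)Σ T(x_i)). A finite solution exists when the empirical average lies in the range of ∇A; it fails, for example, for Bernoulli data that are all 0 or all 1, where the likelihood keeps increasing as η → ±∞.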
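When (∇A)^{−1} has no convenient closed form, the moment-matching equation can be solved iteratively. The sketch below uses plain Newton iterations on the Bernoulli case, where the answer is known, so the result can be checked. It is an illustration under simple assumptions (scalar η, no step-size safeguards), and the data are made up.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_mle_eta(xs, iterations=50, tol=1e-12):
    """Solve A'(eta) = mean(x) by Newton's method, with A(eta) = log(1 + e^eta)."""
    target = sum(xs) / len(xs)
    if target <= 0.0 or target >= 1.0:
        # The empirical mean is on the boundary, so no finite eta matches it.
        raise ValueError("no finite MLE for eta when all observations are equal")
    eta = 0.0
    for _ in range(iterations):
        mean = sigmoid(eta)          # A'(eta) = E[x]
        var = mean * (1.0 - mean)    # A''(eta) = Var(x) > 0
        step = (mean - target) / var
        eta -= step
        if abs(step) < tol:
            break
    return eta

data = [1, 0, 0, 1, 1, 0, 1, 1]      # illustrative observations
eta_hat = bernoulli_mle_eta(data)
pi_hat = sigmoid(eta_hat)            # maps back to the mean parameter
```

Because A'' is a variance, the Newton step divides by a strictly positive number, which is the practical face of the strict convexity argument.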


6. Conjugate Priors

If the likelihood is exponential family, the conjugate prior is ALSO exponential family:

$p(η | χ, ν) ∝ exp(χ^T η − ν·A(η))
$

Prior parameters:
  - χ = prior pseudo-observations of sufficient statistics
  - ν = prior pseudo-sample-size

Posterior update (conjugacy in action):

$p(η | x₁,...,x_n) ∝ p(x₁,...,x_n | η) · p(η | χ, ν)
                  ∝ exp(η^T Σ T(x_i) − n·A(η)) · exp(χ^T η − ν·A(η))
                  ∝ exp((χ + Σ T(x_i))^T η − (ν+n)·A(η))
$

Posterior has the same form with:

$χ_post = χ + Σ T(x_i)
ν_post = ν + n
$

Interpretation: The prior acts like having seen ν prior observations with sufficient statistics χ/ν. The posterior just adds the real data's sufficient statistics.
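The update rule is simple enough to express in a few lines. The following sketch assumes a scalar sufficient statistic; the function name and the data are illustrative. It also checks that processing observations one at a time gives the same hyperparameters as a single batch update.

```python
def conjugate_update(chi, nu, sufficient_stats):
    """Return (chi_post, nu_post) given a list of scalar values T(x_i)."""
    chi_post = chi + sum(sufficient_stats)
    nu_post = nu + len(sufficient_stats)
    return chi_post, nu_post

# Bernoulli likelihood with T(x) = x: chi tracks pseudo-successes, nu pseudo-trials.
flips = [1, 0, 1, 1, 0, 1]                       # illustrative data
batch = conjugate_update(1.0, 2.0, flips)

# Sequential updating: feed the posterior back in as the next prior.
chi, nu = 1.0, 2.0
for x in flips:
    chi, nu = conjugate_update(chi, nu, [x])
sequential = (chi, nu)
```

Both routes end at (5.0, 8.0), which reflects the fact that the posterior depends on the data only through Σ T(x_i) and n.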


7. Maximum Entropy Principle

Among all distributions with a given set of expected sufficient statistics E[T(x)] = μ, the exponential family distribution maximizes entropy:

$max_p −∫ p(x) log p(x) dx   subject to   E_p[T(x)] = μ, ∫ p(x)dx = 1
$

Solution: p(x) ∝ exp(η^T T(x)) — exactly the exponential family. The Lagrange multipliers η are chosen to satisfy the moment constraints.

Why this matters: If all you know about a distribution is its mean and variance, the maximum-entropy distribution is Gaussian. If all you know is the mean of a non-negative variable, it's Exponential. The exponential family is the LEAST COMMITTAL distribution consistent with the specified constraints — it assumes nothing beyond the given moments.
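The Gaussian claim can be sanity-checked with closed-form differential entropies. The sketch below compares three distributions that share the same variance; the choice of Laplace and uniform as competitors is an assumption made for illustration, using their standard entropy formulas.

```python
import math

def gaussian_entropy(sigma):
    # Differential entropy of N(mu, sigma^2)
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def laplace_entropy(sigma):
    b = sigma / math.sqrt(2.0)        # Laplace(b) has variance 2 * b^2
    return 1.0 + math.log(2.0 * b)

def uniform_entropy(sigma):
    width = sigma * math.sqrt(12.0)   # Uniform of this width has variance width^2 / 12
    return math.log(width)

sigma = 1.5
entropies = {
    "gaussian": gaussian_entropy(sigma),
    "laplace": laplace_entropy(sigma),
    "uniform": uniform_entropy(sigma),
}
most_entropic = max(entropies, key=entropies.get)
```

All three entropies shift by log σ together, so the ordering (Gaussian first) does not depend on the value of σ chosen here.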


8. Generalized Linear Models (GLMs)

GLMs model a response y given predictors x through three components:

  1. Random component: y | x ~ ExponentialFamily(η)
  2. Systematic component: η = θ^T x (linear predictor)
  3. Link function: g(E[y|x]) = η (connects mean to linear predictor)

Canonical link: g = (∇A)^{−1}, which maps the mean E[y|x] back to the natural parameter η (∇A itself maps η to the mean). With canonical links and T(y) = y:

$∇_θ log p(y|x,θ) = (y − E[y]) · x
$

The gradient is prediction error × features — the same simple form across ALL exponential family GLMs. This is why logistic regression (Bernoulli + logit link), Poisson regression (Poisson + log link), and linear regression (Gaussian + identity link) share the same IRLS (Iteratively Reweighted Least Squares) optimization algorithm.
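To make the shared gradient form concrete, here is a minimal sketch in which one gradient routine serves three families and only the mean function A'(η) changes. The dictionaries, function names, data, and the unit-variance Gaussian are assumptions for illustration; this is not an IRLS implementation.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Mean functions A'(eta), i.e. the inverse canonical links.
MEAN_FUNCTIONS = {
    "gaussian": lambda eta: eta,                           # unit variance assumed
    "bernoulli": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    "poisson": math.exp,
}
# Matching log-partition functions A(eta).
LOG_PARTITIONS = {
    "gaussian": lambda eta: 0.5 * eta * eta,
    "bernoulli": lambda eta: math.log1p(math.exp(eta)),
    "poisson": math.exp,
}

def glm_gradient(theta, xs, ys, family):
    """Sum over the data of (y - E[y|x]) * x, the canonical-link gradient."""
    mean_fn = MEAN_FUNCTIONS[family]
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        residual = y - mean_fn(dot(theta, x))
        for j, xj in enumerate(x):
            grad[j] += residual * xj
    return grad

def glm_log_likelihood(theta, xs, ys, family):
    """Log-likelihood without the log h(y) term, which does not depend on theta."""
    A = LOG_PARTITIONS[family]
    return sum(y * dot(theta, x) - A(dot(theta, x)) for x, y in zip(xs, ys))

xs = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]   # illustrative features with intercept
ys = [1.0, 0.0, 1.0]                          # illustrative responses
theta = [0.2, -0.3]
gradients = {f: glm_gradient(theta, xs, ys, f) for f in MEAN_FUNCTIONS}
```

Comparing `glm_gradient` against finite differences of `glm_log_likelihood` is a quick way to confirm that the residual-times-features form holds for each family.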


Worked Examples

Example 1: Writing Gamma in Canonical Form

Problem: The Gamma(α, β) density is p(x|α,β) = (β^α/Γ(α)) x^{α−1} e^{−βx} for x > 0. Write this in exponential family canonical form with natural parameters.

Solution:

$p(x|α,β) = exp(α log β − log Γ(α) + (α−1)log x − βx)
         = exp([α−1, −β]^T [log x, x] − (log Γ(α) − α log β))
$

Natural parameters: η = [α−1, −β]^T
Sufficient statistics: T(x) = [log x, x]^T
Base measure: h(x) = 1 (for x > 0)
Log-partition: A(η) = log Γ(η₁+1) − (η₁+1)log(−η₂)

This shows the Gamma is a two-parameter exponential family with sufficient statistics (log x, x).
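A short numerical check of this decomposition, assuming the rate parameterization Gamma(α, β) used in the problem statement. The function names are illustrative. The sketch also confirms that ∂A/∂η₂ reproduces E[x] = α/β via a finite difference.

```python
import math

def gamma_standard(x, alpha, beta):
    """Gamma density with shape alpha and rate beta."""
    log_p = (alpha * math.log(beta) - math.lgamma(alpha)
             + (alpha - 1.0) * math.log(x) - beta * x)
    return math.exp(log_p)

def gamma_log_partition(eta1, eta2):
    # A(eta) = log Gamma(eta1 + 1) - (eta1 + 1) * log(-eta2); needs eta1 > -1, eta2 < 0
    return math.lgamma(eta1 + 1.0) - (eta1 + 1.0) * math.log(-eta2)

def gamma_canonical(x, alpha, beta):
    eta1, eta2 = alpha - 1.0, -beta           # natural parameters
    exponent = eta1 * math.log(x) + eta2 * x  # eta^T T(x) with T(x) = [log x, x]
    return math.exp(exponent - gamma_log_partition(eta1, eta2))

alpha, beta = 3.0, 2.0
density_gap = abs(gamma_canonical(1.7, alpha, beta) - gamma_standard(1.7, alpha, beta))

# dA/d(eta2) should equal E[x] = alpha / beta.
step = 1e-6
mean_x = (gamma_log_partition(alpha - 1.0, -beta + step)
          - gamma_log_partition(alpha - 1.0, -beta - step)) / (2.0 * step)
```

The derivative with respect to η₁ would similarly give E[log x], which involves the digamma function and is omitted here to stay within the standard library.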


Example 2: MLE via Moment Matching

Problem: For n i.i.d. Poisson(λ) observations, use the exponential family MLE condition to find λ̂.

Solution:

Poisson in exponential form: η = log λ, T(x) = x, A(η) = e^η.

Moment matching condition:

$E_η̂[T(x)] = (1/n) Σ T(x_i)
λ̂ = (1/n) Σ x_i = x̄
$

So λ̂ = x̄. More formally: η̂ = log λ̂ satisfies A'(η̂) = e^{η̂} = λ̂ = x̄. The moment matching equation directly gives the MLE.

For 5 observations [2, 5, 3, 7, 4]: λ̂ = (2+5+3+7+4)/5 = 21/5 = 4.2.
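As a quick check on this example, the sketch below computes λ̂ by moment matching and confirms that the Poisson log-likelihood is lower at a few nearby values. The offsets used for the comparison are arbitrary.

```python
import math

def poisson_log_likelihood(lam, xs):
    # sum_i [ x_i * log(lam) - lam - log(x_i!) ]
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

observations = [2, 5, 3, 7, 4]
lam_hat = sum(observations) / len(observations)   # moment matching: sample mean

best = poisson_log_likelihood(lam_hat, observations)
candidates = [lam_hat - 0.5, lam_hat - 0.1, lam_hat + 0.1, lam_hat + 0.5]
is_maximum = all(poisson_log_likelihood(c, observations) < best for c in candidates)
```

Because the log-likelihood is concave in η = log λ, beating nearby values on both sides is consistent with λ̂ being the global maximizer.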


Example 3: Conjugate Prior for Poisson

Problem: Derive the conjugate prior for the Poisson likelihood in terms of the prior pseudo-count χ and pseudo-sample-size ν. Express the posterior after observing n=10 with Σx_i = 42, given prior χ=5, ν=2.

Solution:

Poisson: p(x|λ) = (1/x!) exp(x log λ − λ). η = log λ, T(x) = x, A(η) = e^η = λ.

Conjugate prior for η: p(η|χ,ν) ∝ exp(χ·η − ν·e^η). Changing variables to λ = e^η introduces the Jacobian dη/dλ = 1/λ, so in terms of λ: p(λ|χ,ν) ∝ λ^{χ−1} e^{−νλ}

This is Gamma(χ, ν) (shape χ, rate ν). Prior mean: χ/ν, which matches the pseudo-observation interpretation of χ and ν.

Given prior χ=5, ν=2: prior ~ Gamma(5, 2), mean = 2.5.

Posterior after seeing Σx_i = 42, n = 10:

$χ_post = 5 + 42 = 47
ν_post = 2 + 10 = 12
$

Posterior ~ Gamma(47, 12), mean = 47/12 ≈ 3.92. The data (mean 4.2) pulled the estimate up from the prior mean of 2.5.



Quiz

Q1: What does the term "exponential family" primarily refer to in this subject?

A) A visual representation of the exponential family
B) A historical anecdote about the exponential family
C) The definition and application of the exponential family
D) A computational error related to the exponential family

Correct: C)

Q2: What is the primary purpose of conjugate priors?

A) They are used only in advanced research contexts
B) They keep the posterior in the same family as the prior, so Bayesian updating reduces to adding sufficient statistics and counts
C) They are primarily a historical notation system
D) They replace all other methods in this domain

Correct: B)

Q3: Which statement about the exponential family is TRUE?

A) The exponential family is a fundamental concept covered in this subject
B) The exponential family is not related to this subject
C) The exponential family is an advanced topic beyond this subject's scope
D) The exponential family is mentioned only as a historical footnote

Correct: A)

Q4: In Practice Problem 1, what is the natural parameter η₂ of the Beta(α, β) distribution in canonical form?

A) β
B) β−1
C) 1/(β−1)
D) α−1

Correct: B)

Q5: How are the exponential family and the natural parameter related?

A) The exponential family and the natural parameter are closely related concepts
B) The exponential family is a special case of the natural parameter
C) The exponential family and the natural parameter are completely unrelated topics
D) The exponential family is the inverse of the natural parameter

Correct: A)

Q6: What is a common pitfall when working with sufficient statistics?

A) A common mistake is confusing a sufficient statistic with a similar concept
B) The main error with a sufficient statistic is using it when it is not needed
C) A sufficient statistic is always computed the same way in all contexts
D) Sufficient statistics have no common misconceptions

Correct: A)

Q7: When should you apply the log-partition function?

A) Avoid the log-partition function unless explicitly instructed
B) Use the log-partition function only in pure mathematics contexts
C) Apply the log-partition function to solve problems in this subject's domain
D) The log-partition function is not practically useful

Correct: C)

Practice Problems

Problem 1

Write the Beta(α, β) distribution in exponential family canonical form. What are the sufficient statistics?

Answer
$p(x|α,β) = [Γ(α+β)/(Γ(α)Γ(β))] x^{α−1} (1−x)^{β−1}
         = exp((α−1)log x + (β−1)log(1−x) − [log Γ(α)+log Γ(β)−log Γ(α+β)])
$
Natural parameters: η₁ = α−1, η₂ = β−1
Sufficient statistics: T(x) = [log x, log(1−x)]^T
Base measure: h(x) = 1 (for 0 < x < 1)
Log-partition: A(η) = log Γ(η₁+1) + log Γ(η₂+1) − log Γ(η₁+η₂+2)
The sufficient statistics for the Beta are (log x, log(1−x)). This is why fitting a Beta to data in (0, 1) by maximum likelihood depends on the data only through the averages of log x_i and log(1−x_i), not through the raw sample mean.

Problem 2

Prove that the Hessian of A(η) equals Cov(T(x)). Use this to prove A is convex.

Answer First derivative (from Section 3):
$∂A/∂η_j = E[T_j(x)] = ∫ T_j(x) h(x) exp(η^T T(x) − A(η)) dx
$
Second derivative: differentiate under the integral sign, using ∂/∂η_k exp(η^T T(x) − A(η)) = (T_k(x) − ∂A/∂η_k) exp(η^T T(x) − A(η)) and ∂A/∂η_k = E[T_k]:
$∂²A/∂η_j∂η_k = ∂/∂η_k ∫ T_j(x) h(x) exp(η^T T(x) − A(η)) dx
             = ∫ T_j(x)(T_k(x) − E[T_k]) p(x|η) dx
             = E[T_j T_k] − E[T_j]E[T_k]
             = Cov(T_j, T_k)
$
A covariance matrix is always positive semidefinite, so ∇²A ⪰ 0 and A is convex. For minimal representations (no non-trivial linear combination of the T_j is constant), Cov(T) is positive definite, so A is strictly convex. ✓

Problem 3

Show that multiplying an exponential family likelihood by its conjugate prior gives a posterior in the SAME family as the prior, whereas the product of two exponential family densities in x is not generally in the same family as either factor.

Answer Exponential family likelihood: p(x|η) = h(x) exp(η^T T(x) − A(η))
Conjugate prior: p(η|χ,ν) = g(χ,ν) exp(χ^T η − ν A(η))
Product (posterior kernel, for a single observation x):
$p(η|x,χ,ν) ∝ exp(η^T T(x) − A(η)) · exp(χ^T η − ν A(η))
           = exp((χ+T(x))^T η − (ν+1)A(η))
$
This has the SAME functional form in η as the prior, with updated hyperparameters (χ+T(x), ν+1). ✓
Now take the product of two exponential family densities in x (different families, parameters η₁, η₂):
$p₁(x|η₁) p₂(x|η₂) ∝ h₁(x)h₂(x) exp(η₁^T T₁(x) + η₂^T T₂(x))
$
After renormalizing (when the product is integrable), this is still of exponential form, but with base measure h₁h₂ and the stacked sufficient statistic [T₁(x), T₂(x)]. It therefore generally lies outside both original families. Only when T₁ and T₂ span the same space and the base measures are compatible (for example, two Gaussians, whose product is again proportional to a Gaussian) does the product stay in the same family.

Problem 4

Using the maximum entropy principle, derive the Gaussian distribution as the maximum-entropy distribution with fixed mean μ and variance σ².

Answer Constraints:
  1. ∫ p(x) dx = 1
  2. ∫ x p(x) dx = μ
  3. ∫ (x−μ)² p(x) dx = σ²
Lagrangian:
$L = −∫ p(x)log p(x)dx + λ₀(∫ p−1) + λ₁(∫ xp−μ) + λ₂(∫ x²p−(μ²+σ²))
$
Functional derivative w.r.t. p(x):
$δL/δp = −log p(x) − 1 + λ₀ + λ₁x + λ₂x² = 0
$
So: p(x) ∝ exp(λ₁x + λ₂x²). Solving the constraints gives λ₂ = −1/(2σ²) and λ₁ = μ/σ², so p(x) ∝ exp(−(x−μ)²/(2σ²)), and normalizing yields the N(μ, σ²) density. The Gaussian is the distribution that assumes NOTHING beyond mean and variance: it maximizes entropy subject to these constraints. Any other distribution with the same mean and variance has lower entropy, which would imply knowledge we don't have.

Problem 5

Explain why logistic regression is a GLM and identify its three components (random, systematic, link).

Answer
  1. Random component: y | x ~ Bernoulli(π(x)). Bernoulli is exponential family with η = log(π/(1−π)), T(y) = y, A(η) = log(1+e^η).
  2. Systematic component: η = β^T x (linear in parameters). The log-odds are a linear function of predictors.
  3. Link function: g(π) = log(π/(1−π)) = η. The logit link maps the mean π to the natural parameter η. This is the CANONICAL link because g = (∇A)^{−1}: ∇A(η) = e^η/(1+e^η) = π, so (∇A)^{−1}(π) = log(π/(1−π)) = g(π).
With the canonical link, the gradient simplifies to:
$∇_β log p(y|x,β) = (y − π(x)) · x
$
The update is proportional to (actual − predicted) × features — the same intuitive form as linear regression. This is the hallmark of GLMs with canonical links.

Summary

  1. The exponential family p(x|η) = h(x)exp(η^T T(x) − A(η)) unifies Bernoulli, Gaussian, Poisson, Gamma, Beta, Dirichlet, Categorical, and more under one framework
  2. The log-partition function A(η) is the cumulant-generating function — its gradient gives E[T(x)] and its Hessian gives Cov(T(x))
  3. MLE reduces to moment matching: E_η̂[T(x)] = (1/n)Σ T(x_i) — the expected sufficient statistics match their empirical averages
  4. Conjugate priors are also exponential family: Bayesian updating is simple addition of sufficient statistics and counts to the prior hyperparameters (χ, ν)
  5. GLMs combine exponential family likelihoods with linear predictors — logistic, Poisson, and linear regression share the same IRLS solver

Pitfalls


Key Terms

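Term: Definition
Exponential family: p(x|η) = h(x)·exp(η^T T(x) − A(η)), the canonical form shared by Bernoulli, Gaussian, Poisson, Gamma, Beta, and others
Natural parameter: η, the canonical parameterization; MLE and Bayesian updating are simplest in this form
Sufficient statistic: T(x), which captures ALL information about η in the data; (1/n)Σ T(x_i) is a sufficient statistic for η
Log-partition function: A(η) = log ∫ h(x)exp(η^T T(x))dx, the cumulant-generating function; its derivatives give the cumulants of T(x) (mean, covariance, and higher)
Convexity: A(η) is convex (strictly convex for minimal representations), so the log-likelihood is concave and the MLE is unique when it exists
Moment matching: MLE sets E_η̂[T(x)] = (1/n)Σ T(x_i), a single equation that characterizes the solution
Conjugate prior: p(η|χ,ν) ∝ exp(χ^T η − ν·A(η)); the posterior has the same form with χ + Σ T(x_i) and ν + n
Maximum entropy: Among all distributions matching specified moments, the exponential family maximizes entropy
GLM: Generalized Linear Model; y | x ~ ExponentialFamily(η) with linear predictor η = θ^T x and link function g(E[y|x]) = η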

Next Steps

Continue to 21-06 — Causality and Causal Inference to learn the mathematical framework for distinguishing correlation from causation — do-calculus, structural causal models, and counterfactual reasoning.