21-01 — Bayesian Inference

Phase: 21 — Probability & Statistics for ML (Advanced)
Subject: 21-01
Prerequisites: 10-01 (Probability Axioms), 10-02 (Conditional Probability and Bayes' Rule), 10-03 (Random Variables), 10-05 (Continuous Distributions), 12-01 (Maximum Likelihood Estimation)
Next subject: 21-02 — Variational Inference


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive Bayes' theorem for continuous parameters and explain how it formalizes learning as updating beliefs with evidence
  2. Compute posterior distributions for conjugate prior-likelihood pairs (Beta-Binomial, Gaussian-Gaussian, Dirichlet-Multinomial)
  3. Derive the maximum a posteriori (MAP) estimate and prove it converges to MLE as the prior becomes uninformative
  4. Analyze how prior strength affects posterior concentration — derive the effective sample size interpretation of conjugate priors
  5. Apply Bayesian inference to ML problems: coin bias estimation, Gaussian mean with known variance, and linear regression with a Gaussian prior

Core Content

1. Bayes' Theorem — The Learning Equation

Frequentist statistics asks: "Given a hypothesis, what's the probability of observing this data?" Bayesian statistics inverts this: "Given this data, what's the probability of each hypothesis?"

For model parameters θ and observed data D:

$P(θ | D) = P(D | θ) · P(θ) / P(D)
$

where:

- P(θ | D) = posterior: our updated belief about θ after seeing data
- P(D | θ) = likelihood: how probable the data is under parameter θ
- P(θ) = prior: our belief about θ before seeing data
- P(D) = marginal likelihood (evidence) = ∫ P(D | θ) P(θ) dθ

⚠️ THIS IS CRITICAL: Bayes' theorem is the mathematical foundation of learning from data in the Bayesian framework. Each time new evidence arrives, the current posterior serves as the prior for the next update. The prior-to-posterior transition IS learning, formalized.

Proportional form (often sufficient):

$P(θ | D) ∝ P(D | θ) · P(θ)
$

The evidence P(D) is a normalization constant — it doesn't depend on θ, so it doesn't affect the shape of the posterior.
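The role of P(D) as a pure normalizer can be seen numerically. The following is a minimal sketch, assuming a coin with 7 heads in 10 flips and a flat prior (both numbers are illustrative, and `grid_posterior` is a hypothetical helper name): it evaluates likelihood × prior on a grid and divides by the sum.

```python
# Grid approximation of a posterior: evaluate likelihood x prior, then normalize.
# Assumed setup (illustrative): 7 heads in 10 flips, flat prior on p.

def grid_posterior(k, n, prior, grid_size=1001):
    """Return (grid, posterior weights) for a coin's heads probability p."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Unnormalized posterior: p^k (1-p)^(n-k) times the prior density.
    unnorm = [(p ** k) * ((1 - p) ** (n - k)) * prior(p) for p in grid]
    evidence = sum(unnorm)  # stands in for P(D), up to the grid spacing
    return grid, [u / evidence for u in unnorm]

grid, post = grid_posterior(k=7, n=10, prior=lambda p: 1.0)

post_mean = sum(p * w for p, w in zip(grid, post))
post_mode = grid[max(range(len(grid)), key=post.__getitem__)]
print(round(post_mean, 3), round(post_mode, 3))
```

Rescaling the prior by any constant leaves `post` unchanged, which is why the proportional form is usually enough to find the posterior's shape.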


2. The Beta-Binomial Model (Coin Flipping)

The simplest and most instructive conjugate model: estimating a probability p from coin flips.

Setup:

- Parameter: p ∈ [0, 1], the probability of heads
- Prior: p ~ Beta(α, β)
- Data: n flips, k heads
- Likelihood: Binomial, P(k | p, n) = C(n,k) · p^k · (1−p)^{n−k}

Beta prior density:

$P(p | α, β) = p^{α−1} · (1−p)^{β−1} / B(α, β)
$

where B(α, β) = Γ(α)Γ(β)/Γ(α+β) is the Beta function.

Posterior derivation:

P(p | k, n) ∝ P(k | p, n) · P(p | α, β)
           ∝ p^k · (1−p)^{n−k} · p^{α−1} · (1−p)^{β−1}
           ∝ p^{α+k−1} · (1−p)^{β+n−k−1}

This is a Beta(α+k, β+n−k) distribution. The conjugate prior property: Beta prior × Binomial likelihood = Beta posterior.

Interpretation as pseudo-counts: Relative to the uniform prior Beta(1, 1), α−1 acts as "pseudo-heads" and β−1 as "pseudo-tails" seen before any data, so Beta(α, β) corresponds to α+β−2 prior observations on top of a flat prior. This convention matches the posterior mode. In the posterior mean, the prior enters with weight α+β instead. So:

- Beta(1, 1) = Uniform(0,1): "zero prior observations" (flat, often called uninformative)
- Beta(5, 5): prior equivalent to having seen 4 heads and 4 tails
- Beta(100, 1): strong prior belief that p is near 1

Posterior mean:

$E[p | k, n] = (α + k) / (α + β + n)
$

This is a weighted average of the prior mean α/(α+β) and the sample proportion k/n, with weights proportional to prior strength (α+β) and sample size (n).
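The conjugate update and the weighted-average form of the posterior mean can be checked in a few lines. This is a small sketch with illustrative numbers (a Beta(2, 2) prior and 7 heads in 10 flips); the function names are made up for the example.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Posterior Beta parameters after observing k heads in n flips."""
    return alpha + k, beta + (n - k)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

alpha0, beta0, k, n = 2.0, 2.0, 7, 10   # illustrative prior and data
alpha_n, beta_n = beta_binomial_update(alpha0, beta0, k, n)

# Weighted-average form: prior mean and sample proportion,
# weighted by (alpha0 + beta0) and n respectively.
w_prior = (alpha0 + beta0) / (alpha0 + beta0 + n)
weighted = w_prior * beta_mean(alpha0, beta0) + (1 - w_prior) * (k / n)

print(alpha_n, beta_n, beta_mean(alpha_n, beta_n), weighted)
```

Both routes give the same posterior mean, so the prior behaves like α+β extra observations located at the prior mean.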


3. The Gaussian-Gaussian Model

Setup:

- Parameter: μ (unknown mean, known variance σ²)
- Prior: μ ~ N(μ₀, τ²)
- Data: x₁, ..., x_n i.i.d. ~ N(μ, σ²)
- Likelihood: P(x₁,...,x_n | μ) ∝ exp(−Σ(x_i − μ)² / (2σ²))

Posterior derivation (using the exponential form):

The log-posterior (up to a constant):

$log P(μ | data) = log P(data | μ) + log P(μ) + const
                = −Σ(x_i − μ)²/(2σ²) − (μ − μ₀)²/(2τ²) + const
$

Expanding and collecting μ² and μ terms:

$log P(μ | data) = −½[(n/σ² + 1/τ²)μ² − 2(nx̄/σ² + μ₀/τ²)μ] + const
$

This is the log of a Gaussian. The posterior is:

$μ | data ~ N(μ_n, τ²_n)
$

where:

$μ_n = (n·x̄/σ² + μ₀/τ²) / (n/σ² + 1/τ²)    [precision-weighted average]
τ²_n = 1 / (n/σ² + 1/τ²)
$

Precision parameterization (λ = 1/σ²):

$λ_n = λ_prior + n·λ_likelihood
μ_n = (λ_prior·μ₀ + n·λ_likelihood·x̄) / λ_n
$

Precisions ADD. The posterior precision is the sum of the prior precision and the data precision, and the posterior mean is the precision-weighted average of the prior mean and the sample mean.

As n → ∞: μ_n → x̄ (MLE), τ²_n → 0. The prior is overwhelmed by data.
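Because precisions add, processing the data one point at a time should give the same posterior as a single batch update. The sketch below checks this under assumed values (prior N(0, 4), σ² = 1, and five made-up observations); `gaussian_update` is a hypothetical helper, not a library function.

```python
def gaussian_update(mu0, tau2, xs, sigma2):
    """Posterior (mean, variance) of a Gaussian mean with known variance sigma2."""
    n = len(xs)
    if n == 0:
        return mu0, tau2
    xbar = sum(xs) / n
    prec_prior = 1.0 / tau2
    prec_data = n / sigma2
    prec_post = prec_prior + prec_data                  # precisions add
    mu_post = (prec_prior * mu0 + prec_data * xbar) / prec_post
    return mu_post, 1.0 / prec_post

data = [1.2, 0.7, 1.9, 1.4, 0.8]                        # made-up observations
batch = gaussian_update(0.0, 4.0, data, sigma2=1.0)

# Sequential updating: each posterior becomes the prior for the next point.
mu, var = 0.0, 4.0
for x in data:
    mu, var = gaussian_update(mu, var, [x], sigma2=1.0)

print(batch, (mu, var))
```

The two results agree up to floating-point rounding, which is the practical payoff of the precision parameterization for streaming data.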


4. Maximum a Posteriori (MAP) Estimation

The MAP estimate is the mode of the posterior — the single "best" parameter value under the Bayesian framework:

$θ_MAP = argmax_θ P(θ | D) = argmax_θ [log P(D | θ) + log P(θ)]
$

Compare to MLE: θ_MLE = argmax_θ log P(D | θ). MAP adds the log-prior as a regularization term.

For the Beta-Binomial model:

$p_MAP = (α + k − 1) / (α + β + n − 2)    [mode of Beta(α+k, β+n−k)]
$

For α=β=1 (uniform prior): p_MAP = k/n = p_MLE. The uniform prior adds no regularization.

For the Gaussian-Gaussian model:

$μ_MAP = μ_n = (n·x̄/σ² + μ₀/τ²) / (n/σ² + 1/τ²)
$

MAP equals posterior mean for Gaussians (Gaussian is symmetric, mode = mean).

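Connection to L2 regularization: For a Gaussian prior μ ~ N(0, τ²), the log-prior is −μ²/(2τ²) + const. Maximizing log P(D | μ) − μ²/(2τ²) is therefore MLE with an L2 penalty whose coefficient is 1/(2τ²): a tighter prior (smaller τ²) means stronger regularization. With a Gaussian likelihood of noise variance σ², multiplying the objective by 2σ² gives the familiar ridge penalty λ = σ²/τ². So Bayesian MAP with a Gaussian prior = ridge regression.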
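The MAP and ridge equivalence can be checked numerically for the scalar Gaussian-mean case. This sketch assumes a zero-mean prior, made-up data, and plain gradient descent with a hand-picked step size; it compares the minimizer of the penalized squared error against the closed-form posterior mean.

```python
def penalized_loss_grad(mu, xs, lam):
    """Gradient of sum_i (x_i - mu)^2 + lam * mu^2 with respect to mu."""
    return -2.0 * sum(x - mu for x in xs) + 2.0 * lam * mu

xs = [2.1, 1.7, 2.6, 1.9]        # made-up observations
sigma2, tau2 = 1.0, 0.5          # assumed noise variance and prior variance
lam = sigma2 / tau2              # ridge penalty implied by the prior

# Minimize the ridge objective by gradient descent (step size chosen by hand).
mu_ridge = 0.0
for _ in range(500):
    mu_ridge -= 0.05 * penalized_loss_grad(mu_ridge, xs, lam)

# Closed-form MAP / posterior mean with prior mean 0.
n = len(xs)
xbar = sum(xs) / n
mu_map = (n * xbar / sigma2) / (n / sigma2 + 1.0 / tau2)

print(mu_ridge, mu_map)
```

Both estimates are shrunk toward 0 relative to the sample mean, and shrinking τ² increases `lam` and the amount of shrinkage.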


5. Conjugate Priors — The Full Family

A prior is conjugate to a likelihood if the posterior belongs to the same family. This makes Bayesian updating computationally tractable — just update parameters, no integration needed.

| Likelihood | Conjugate Prior | Posterior Update |
|---|---|---|
| Binomial(n, p) | Beta(α, β) | Beta(α+k, β+n−k) |
| Poisson(λ) | Gamma(a, b) | Gamma(a+Σx_i, b+n) |
| Gaussian (μ, known σ²) | Gaussian(μ₀, τ²) | Gaussian (as above) |
| Gaussian (known μ, σ²) | Inv-Gamma(a, b) | Inv-Gamma(a+n/2, b+½Σ(x_i−μ)²) |
| Multinomial(p₁,...,p_k) | Dirichlet(α₁,...,α_k) | Dirichlet(α₁+n₁,...,α_k+n_k) |
| Exponential(λ) | Gamma(a, b) | Gamma(a+n, b+Σx_i) |

The Dirichlet-Multinomial model is the multivariate extension of Beta-Binomial, used extensively in topic models (LDA), Bayesian mixture models, and any problem with categorical data.
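As a small illustration of the Dirichlet-Multinomial update, the sketch below assumes a three-category problem with a flat Dirichlet(1, 1, 1) prior and made-up counts. The function names are hypothetical.

```python
def dirichlet_update(alphas, counts):
    """Posterior Dirichlet parameters: add observed counts to prior parameters."""
    return [a + c for a, c in zip(alphas, counts)]

def dirichlet_mean(alphas):
    """Mean of a Dirichlet distribution: each alpha divided by their sum."""
    total = sum(alphas)
    return [a / total for a in alphas]

prior = [1.0, 1.0, 1.0]      # flat prior over three categories (assumed)
counts = [5, 2, 3]           # made-up category counts

posterior = dirichlet_update(prior, counts)
print(posterior, dirichlet_mean(posterior))
```

With two categories this reduces to the Beta-Binomial update, with (α₁, α₂) playing the role of (α, β).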


6. Bayesian Model Comparison

Given models M₁ and M₂, which is better supported by the data?

Bayes factor:

$BF₁₂ = P(D | M₁) / P(D | M₂)
$

where P(D | M_k) = ∫ P(D | θ_k, M_k) · P(θ_k | M_k) dθ_k is the marginal likelihood.

Interpretation (Jeffreys' scale):

- BF₁₂ > 100: Decisive evidence for M₁
- BF₁₂ ∈ [10, 100]: Strong evidence
- BF₁₂ ∈ [3, 10]: Substantial evidence
- BF₁₂ ∈ [1, 3]: Barely worth mentioning
- BF₁₂ < 1: Evidence favors M₂

The marginal likelihood automatically penalizes model complexity — complex models spread their prior probability mass thinly over many parameter values, making any specific dataset less probable. This is the Bayesian Ockham's razor.
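A coin example makes the Ockham effect concrete. The sketch below compares two assumed models for k heads in n = 10 flips: M₁ fixes p = 0.5 (no free parameter), while M₂ gives p a Beta(1, 1) prior. The marginal likelihood under M₂ uses the standard Beta-Binomial form C(n,k)·B(α+k, β+n−k)/B(α, β). The function names are illustrative.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_fair(k, n):
    """P(k | M1): the coin is exactly fair, so there is nothing to integrate."""
    return comb(n, k) * 0.5 ** n

def marginal_beta(k, n, alpha=1.0, beta=1.0):
    """P(k | M2): Binomial likelihood integrated over a Beta(alpha, beta) prior."""
    return comb(n, k) * exp(log_beta(alpha + k, beta + n - k) - log_beta(alpha, beta))

for k in (5, 7, 10):
    bf_12 = marginal_fair(k, 10) / marginal_beta(k, 10)
    print(k, round(bf_12, 3))
```

With 5 heads the Bayes factor is above 1 (the simpler model wins even though M₂ can also fit the data), while 10 heads out of 10 pushes it well below 1. The flexible model only wins when the data demand the flexibility.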


7. Bayesian Linear Regression

Model: y = Xβ + ε, where ε ~ N(0, σ²I)

Prior: β ~ N(0, τ²I) — a zero-mean isotropic Gaussian (ridge prior)

Likelihood: y | β, X ~ N(Xβ, σ²I)

Posterior:

$β | y, X ~ N(μ_n, Σ_n)
Σ_n = (X^T X/σ² + I/τ²)^{−1}
μ_n = Σ_n · X^T y / σ²
$

Compare to ridge regression: β̂_ridge = (X^T X + λI)^{−1} X^T y. Setting λ = σ²/τ², the ridge estimator equals the posterior mean. Bayesian linear regression unifies:

- MLE: λ = 0 (τ² → ∞, flat prior)
- Ridge: λ = σ²/τ²
- Fully Bayesian: retain the full posterior distribution (not just the mode), enabling uncertainty quantification
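The posterior formulas can be exercised without a linear-algebra library when there are only two coefficients. The following sketch assumes an intercept plus one feature, made-up data, and hand-picked σ² and τ². It computes Σ_n and μ_n with an explicit 2×2 inverse and then verifies that μ_n satisfies the ridge normal equations with λ = σ²/τ².

```python
def posterior_2d(X, y, sigma2, tau2):
    """Posterior mean and covariance for y = X beta + noise with two coefficients."""
    # A = X^T X / sigma2 + I / tau2 (symmetric 2x2 posterior precision matrix).
    a11 = sum(r[0] * r[0] for r in X) / sigma2 + 1.0 / tau2
    a12 = sum(r[0] * r[1] for r in X) / sigma2
    a22 = sum(r[1] * r[1] for r in X) / sigma2 + 1.0 / tau2
    det = a11 * a22 - a12 * a12
    cov = [[a22 / det, -a12 / det], [-a12 / det, a11 / det]]   # Sigma_n = A^{-1}
    # b = X^T y / sigma2, then mu_n = Sigma_n b.
    b1 = sum(r[0] * t for r, t in zip(X, y)) / sigma2
    b2 = sum(r[1] * t for r, t in zip(X, y)) / sigma2
    mean = [cov[0][0] * b1 + cov[0][1] * b2, cov[1][0] * b1 + cov[1][1] * b2]
    return mean, cov

# Made-up data: intercept column plus one feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]
sigma2, tau2 = 0.25, 10.0
mean, cov = posterior_2d(X, y, sigma2, tau2)

# Ridge check: (X^T X + lam I) mean - X^T y should vanish for lam = sigma2 / tau2.
lam = sigma2 / tau2
xtx = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(2)]
residual = [
    (xtx[i][0] + (lam if i == 0 else 0.0)) * mean[0]
    + (xtx[i][1] + (lam if i == 1 else 0.0)) * mean[1]
    - xty[i]
    for i in range(2)
]
print(mean, cov, residual)
```

The diagonal of `cov` gives the posterior variance of each coefficient, which is the uncertainty information a ridge point estimate alone does not provide.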


Worked Examples

Example 1: Beta-Binomial Updating

Problem: You have a prior Beta(2, 2) — a weak belief that a coin is fair. You flip the coin 10 times and get 7 heads. Compute the posterior distribution, its mean, and the MAP estimate.

Solution:

$Posterior = Beta(α+k, β+n−k) = Beta(2+7, 2+3) = Beta(9, 5)

Posterior mean: E[p] = 9/(9+5) = 9/14 ≈ 0.643

MAP: mode = (9−1)/(9+5−2) = 8/12 = 2/3 ≈ 0.667

Prior mean was 2/4 = 0.5. Data says 0.7. Posterior compromises at ~0.64.
$

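The data pulls the estimate from 0.5 toward 0.7, but the prior Beta(2,2) provides a small anchor: it counts as one pseudo-head and one pseudo-tail beyond a flat prior, and it enters the posterior mean with weight α+β = 4 against n = 10.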


Example 2: MAP vs MLE Convergence

Problem: Show that for the Beta-Binomial model with prior Beta(α, β), as n → ∞, the MAP estimate converges to the MLE k/n.

Solution:

$p_MAP = (α + k − 1) / (α + β + n − 2)

Divide numerator and denominator by n:
p_MAP = (α/n + k/n − 1/n) / (α/n + β/n + 1 − 2/n)

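As n → ∞ (with k/n staying in [0, 1]):
α/n → 0, β/n → 0, 1/n → 0, 2/n → 0

So p_MAP − k/n → 0: the MAP estimate and p_MLE = k/n share the same limit. ✓

Exactly: p_MAP − k/n = [(α − 1) − (k/n)(α + β − 2)] / (α + β + n − 2), which is O(1/n).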
$

The prior's influence decays as O(1/n). For large datasets, Bayesian and frequentist estimates agree — data overwhelms the prior.
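The O(1/n) decay can be observed directly. This sketch assumes a Beta(5, 5) prior and holds the sample proportion at 0.7 while n grows; both choices are illustrative, and `map_estimate` is a hypothetical helper.

```python
def map_estimate(alpha, beta, k, n):
    """Mode of Beta(alpha + k, beta + n - k); requires both parameters > 1."""
    return (alpha + k - 1) / (alpha + beta + n - 2)

alpha, beta = 5.0, 5.0                 # illustrative prior centered on 0.5
gaps = []
for n in (10, 100, 1_000, 10_000, 100_000):
    k = 7 * n // 10                    # keep the sample proportion fixed at 0.7
    gap = abs(map_estimate(alpha, beta, k, n) - k / n)
    gaps.append((n, gap))
    print(n, gap, n * gap)             # n * gap levels off, so gap is O(1/n)
```

The gap shrinks by roughly a factor of ten each time n grows tenfold, while the product n × gap approaches a constant.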


Example 3: Gaussian Posterior with Informative Prior

Problem: A measurement device has known standard deviation σ = 2. Your prior on the true value μ is N(100, 5²). You take n = 4 measurements: [103, 99, 102, 104]. Compute the posterior distribution and a 95% credible interval.

Solution:

$x̄ = (103+99+102+104)/4 = 102
σ² = 4, so λ_data = n/σ² = 4/4 = 1
λ_prior = 1/τ² = 1/25 = 0.04

μ_n = (0.04·100 + 1·102) / (0.04 + 1) = (4 + 102)/1.04 = 106/1.04 ≈ 101.92
τ²_n = 1/(0.04 + 1) = 1/1.04 ≈ 0.9615
τ_n ≈ 0.981
$

Posterior: μ | data ~ N(101.92, 0.981²).

95% credible interval: μ_n ± 1.96·τ_n = 101.92 ± 1.96·0.981 = [100.00, 103.84].

The prior (centered at 100) pulls the estimate down slightly from the sample mean of 102. With n=4, the prior still has some influence.
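The numbers in this example can be reproduced with Python's standard library. The sketch below uses `statistics.NormalDist` for the interval endpoints; the variable names are chosen for this illustration only.

```python
from statistics import NormalDist, fmean

measurements = [103, 99, 102, 104]
sigma, mu0, tau = 2.0, 100.0, 5.0      # known noise SD, prior mean, prior SD

n = len(measurements)
xbar = fmean(measurements)

# Posterior precision is prior precision plus data precision.
prec_post = n / sigma**2 + 1 / tau**2
mu_n = (n * xbar / sigma**2 + mu0 / tau**2) / prec_post
tau_n = prec_post ** -0.5

# Central 95% credible interval from the posterior quantiles.
posterior = NormalDist(mu_n, tau_n)
lower, upper = posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)
print(round(mu_n, 2), round(tau_n, 3), round(lower, 2), round(upper, 2))
```

Using the quantile function instead of the hard-coded 1.96 makes it easy to switch to other credibility levels, such as 90% or 99%.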



Quiz

Q1: In this subject, what is the Dirichlet-Multinomial model?

A) A numerical method for approximating the marginal likelihood
B) A frequentist test for comparing categorical datasets
C) A Gaussian model for a mean with known variance
D) The multivariate extension of Beta-Binomial: a Dirichlet prior on category probabilities with a Multinomial likelihood, giving a Dirichlet posterior

Correct: D)

Q2: What is the primary purpose of conjugate priors?

A) They keep the posterior in the same family as the prior, so updating reduces to changing parameters
B) They guarantee that the posterior does not depend on the prior
C) They remove the need to specify a likelihood
D) They make the MAP estimate equal to the MLE for every sample size

Correct: A)

Q3: Which statement about MAP estimation is TRUE?

A) MAP estimation ignores the prior and maximizes the likelihood alone
B) MAP estimation maximizes log-likelihood plus log-prior, so the prior acts as a regularization term
C) MAP estimation always equals the posterior mean
D) MAP estimation requires computing the evidence P(D)

Correct: B)

Q4: In Practice Problem 1 (prior Beta(1, 9), 3 defects in 50 inspected parts), what is the posterior distribution?

A) Beta(3, 47)
B) Beta(4, 59)
C) Beta(4, 56)
D) Beta(3, 56)

Correct: C)

Q5: How are MAP estimation and prior strength related?

A) The MAP estimate does not depend on prior strength
B) A stronger prior always moves the MAP estimate closer to the MLE
C) A stronger prior pulls the MAP estimate further toward the prior mode; the pull fades as n grows
D) Prior strength only affects the posterior variance, never the MAP estimate

Correct: C)

Q6: What is a common pitfall when working with Bayesian model comparison?

A) Using Bayes factors to compare models with different numbers of parameters
B) Treating a Bayes factor below 1 as evidence for M₂
C) Integrating the likelihood over the prior instead of maximizing it
D) Confusing the marginal likelihood P(D | M) with the maximized likelihood, which drops the built-in complexity penalty

Correct: D)

Q7: When should you work with the full posterior rather than a single point estimate?

A) When you need to quantify uncertainty about θ, for example with a credible interval
B) Only when the prior is uniform
C) Only when no conjugate prior exists
D) Never; the MAP estimate carries the same information

Correct: A)

Practice Problems

Problem 1

A factory produces parts with a defect rate p. Your prior is Beta(1, 9) — you expect ~10% defective. After inspecting 50 parts, you find 3 defects. Compute the posterior distribution and the posterior probability that p < 0.15.

Answer
Posterior: Beta(1+3, 9+47) = Beta(4, 56)

Posterior mean: 4/(4+56) = 4/60 = 1/15 ≈ 0.067

P(p < 0.15): This requires the Beta CDF. For Beta(4, 56):
Mode ≈ (4−1)/(4+56−2) = 3/58 ≈ 0.052

The distribution is concentrated well below 0.15. Using the normal approximation:
E[p] = 4/60 ≈ 0.0667
Var(p) = (4·56)/(60²·61) = 224/(3600·61) ≈ 0.00102
SD ≈ 0.032

z = (0.15 − 0.0667)/0.032 ≈ 2.60
P(p < 0.15) ≈ Φ(2.60) ≈ 0.995
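The normal approximation gives ~99.5% posterior probability that p < 0.15. Because Beta(4, 56) is right-skewed, this overstates the result slightly; the exact Beta CDF gives about 0.983. Either way, the data (3 defects in 50) plus the prior strongly suggest a low defect rate.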
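For integer parameters the Beta CDF can be computed exactly without special functions, using the identity I_x(a, b) = P(Bin(a+b−1, x) ≥ a). The sketch below applies it to Beta(4, 56) as a cross-check on the normal approximation; `beta_cdf_int` is a hypothetical helper and only handles positive integer a and b.

```python
from math import comb

def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x, for positive integers a and b.

    Uses I_x(a, b) = P(Bin(a + b - 1, x) >= a).
    """
    n = a + b - 1
    # Probability of fewer than a successes in n Bernoulli(x) trials.
    lower_tail = sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a))
    return 1.0 - lower_tail

prob = beta_cdf_int(0.15, 4, 56)
print(round(prob, 4))
```

The exact value lands a little below the normal-approximation figure because the right-skewed posterior has more mass in its upper tail than a Gaussian with the same mean and variance.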

Problem 2

Derive the posterior for a Poisson likelihood with a Gamma prior. Show that the posterior mean is a weighted average of the prior mean and the sample mean.

Answer
Prior: λ ~ Gamma(a, b), with density P(λ) ∝ λ^{a−1}·e^{−bλ}
Likelihood for data x₁,...,x_n: P(data | λ) ∝ λ^{Σx_i}·e^{−nλ}
Posterior:
P(λ | data) ∝ λ^{a−1}·e^{−bλ} · λ^{Σx_i}·e^{−nλ}
           ∝ λ^{a+Σx_i−1}·e^{−(b+n)λ}
This is Gamma(a+Σx_i, b+n).
Posterior mean: E[λ | data] = (a+Σx_i)/(b+n)
Prior mean: a/b
Sample mean: x̄ = Σx_i/n
Rewrite posterior mean:
$E[λ | data] = (a + n·x̄)/(b + n)
            = [b/(b+n)]·(a/b) + [n/(b+n)]·x̄
$
This is a weighted average of the prior mean a/b and sample mean x̄, with weights proportional to b (prior "strength") and n (sample size). The Gamma(a, b) prior carries the weight of b prior observations.

Problem 3

You have two models for a dataset: M₁ (simple, 1 parameter) and M₂ (complex, 10 parameters). Both fit the data equally well (same maximum likelihood). Explain why the Bayes factor favors M₁, and compute the approximate penalty if each parameter has a uniform prior of width W.

Answer
The marginal likelihood for model M with parameters θ is:
$P(D | M) = ∫ P(D | θ) P(θ | M) dθ
$
If both models achieve the same peak likelihood L, the integral depends on two things:

1. The prior density at the optimal parameters
2. The volume of parameter space where the likelihood is high

Their product is the "Occam factor". Suppose the likelihood is appreciable over a width δ per parameter (δ < W), so that P(D | M) ≈ L · (δ/W)^d for a model with d parameters. For uniform priors of width W per parameter:

- M₁: prior density = 1/W, so P(D | M₁) ≈ L · (δ/W)
- M₂: prior density = 1/W^{10}, so P(D | M₂) ≈ L · (δ/W)^{10}

The Bayes factor is:
$BF₁₂ ≈ L·(δ/W) / [L·(δ/W)^{10}] = (W/δ)^9
$
If W/δ = 10 (each prior is ten times wider than the region the data supports), then BF₁₂ ≈ 10^9, which is overwhelming evidence for the simpler model. Taking δ = 1 in the units of W recovers the shorthand BF₁₂ ≈ W^9. The complex model spreads its prior probability so thinly that any specific dataset is vastly less probable under it. This is the Bayesian Ockham's razor: complex models are automatically penalized unless they fit the data SUBSTANTIALLY better.

Problem 4

In Bayesian linear regression with prior β ~ N(0, τ²I), derive the posterior covariance Σ_n = (X^T X/σ² + I/τ²)^{−1}. Interpret what happens to the posterior uncertainty as n → ∞.

Answer The log-posterior:
$log P(β | y, X) = −(y−Xβ)^T(y−Xβ)/(2σ²) − β^Tβ/(2τ²) + const
$
Expand the quadratic:
$log P(β | y, X) = −½[β^T(X^T X/σ² + I/τ²)β − 2β^T X^T y/σ²] + const
$
This is the log of a multivariate Gaussian with:
$Σ_n = (X^T X/σ² + I/τ²)^{−1}
μ_n = Σ_n · X^T y / σ²
$
As n → ∞, X^T X grows as O(n) for a well-conditioned design (its smallest eigenvalue grows with n), so the data term X^T X/σ² dominates I/τ² and Σ_n → 0. The posterior variance collapses and we become certain about β. Specifically:
Σ_n ≈ σ²(X^T X)^{−1} for large n
This is exactly the frequentist sampling covariance of the OLS estimator. The prior becomes irrelevant as data accumulates.
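The large-n behavior is easy to see in the one-coefficient case (regression through the origin), where Σ_n and the OLS variance are scalars. The sketch below assumes σ² = τ² = 1 and a deterministic, made-up design that cycles through the feature values 1, 2, 3.

```python
def posterior_var_1d(xs, sigma2, tau2):
    """Posterior variance of a single coefficient: 1 / (sum x^2 / sigma2 + 1 / tau2)."""
    return 1.0 / (sum(x * x for x in xs) / sigma2 + 1.0 / tau2)

def ols_var_1d(xs, sigma2):
    """Sampling variance of the OLS slope through the origin: sigma2 / sum x^2."""
    return sigma2 / sum(x * x for x in xs)

sigma2, tau2 = 1.0, 1.0                          # assumed noise and prior variances
ratios = []
for n in (3, 30, 300, 3000):
    xs = [1.0 + (i % 3) for i in range(n)]       # made-up design: 1, 2, 3, 1, 2, 3, ...
    post, ols = posterior_var_1d(xs, sigma2, tau2), ols_var_1d(xs, sigma2)
    ratios.append(post / ols)
    print(n, post, ols, post / ols)
```

The posterior variance is always slightly smaller than the OLS variance because the prior contributes extra precision, and the ratio climbs toward 1 as the data term takes over.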

Problem 5

Explain why "uninformative priors" are a myth — every prior encodes assumptions. Use the Beta(1,1), Beta(0.5, 0.5), and Beta(2,2) as examples applied to coin flipping.

Answer
All three are sometimes called "uninformative" but encode different assumptions:

- Beta(1, 1) = Uniform(0,1): "All values of p are equally likely." But this is NOT invariant under reparameterization. If we parameterize by odds p/(1−p), the prior is no longer uniform. So "uniform in p" is a specific assumption.
- Beta(0.5, 0.5) = Jeffreys prior: The prior proportional to √(I(p)), where I(p) is the Fisher information. Invariant under reparameterization. But its density is U-shaped and unbounded at 0 and 1, so it favors extreme values of p more than the uniform prior does.
- Beta(2, 2): Concentrates mass near 0.5, so it assumes p is likely near fair.

With n=10, k=7:

- Beta(1,1) → posterior mean = 8/12 ≈ 0.667
- Beta(0.5,0.5) → posterior mean = 7.5/11 ≈ 0.682
- Beta(2,2) → posterior mean = 9/14 ≈ 0.643

Different "uninformative" priors produce different posteriors on small datasets. There is no truly uninformative prior: every prior is a modeling choice. The best practice is to test sensitivity to prior choice and report it.
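A prior-sensitivity check of this kind takes only a few lines. The sketch below recomputes the three posterior means for n = 10, k = 7 and then repeats the comparison for a larger, hypothetical sample (n = 1000, k = 700) to show how the disagreement shrinks.

```python
# Candidate "uninformative" priors, as (alpha, beta) pairs.
priors = {
    "Beta(1,1)": (1.0, 1.0),
    "Beta(0.5,0.5)": (0.5, 0.5),
    "Beta(2,2)": (2.0, 2.0),
}

def posterior_means(k, n):
    """Posterior mean (alpha + k) / (alpha + beta + n) under each prior."""
    return {name: (a + k) / (a + b + n) for name, (a, b) in priors.items()}

def spread(means):
    """Largest disagreement between posterior means across priors."""
    return max(means.values()) - min(means.values())

small = posterior_means(k=7, n=10)
large = posterior_means(k=700, n=1000)     # hypothetical larger sample
print(small, spread(small))
print(large, spread(large))
```

Reporting the spread alongside the estimate is one simple way to document how much a conclusion depends on the choice of prior.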

Summary

  1. Bayes' theorem P(θ|D) ∝ P(D|θ)·P(θ) formalizes learning as updating prior beliefs with observed evidence: the prior-to-posterior shift IS learning
  2. Conjugate priors (Beta-Binomial, Gaussian-Gaussian, Gamma-Poisson, Dirichlet-Multinomial) make Bayesian updating analytic — just update parameters, no integration
  3. MAP estimation adds a log-prior regularization term to MLE — Gaussian MAP = ridge regression, Laplace MAP = Lasso
  4. Prior strength acts as pseudo-observations: Beta(α,β) enters the posterior mean with weight α+β (α+β−2 pseudo-counts beyond a flat prior); as n → ∞, the prior's influence decays as O(1/n)
  5. Bayesian model comparison through marginal likelihoods automatically implements Ockham's razor — complex models are penalized unless they fit substantially better


Key Terms

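| Term | Definition |
|---|---|
| Posterior | P(θ \| D): updated belief about θ after seeing data |
| Prior | P(θ): belief about parameters before observing data |
| Likelihood | P(D \| θ): how probable the data is under parameter θ |
| Marginal likelihood | P(D) = ∫ P(D \| θ) P(θ) dθ: the evidence, which normalizes the posterior |
| Conjugate prior | A prior distribution where the posterior belongs to the same family; enables analytic Bayesian updating |
| Beta distribution | Conjugate prior for Binomial likelihood; Beta(α,β), support [0,1] |
| MAP estimate | θ_MAP = argmax_θ P(θ \| D): the mode of the posterior |
| Precision | λ = 1/σ²; additive in Gaussian-Gaussian conjugate updating |
| Bayes factor | BF₁₂ = P(D \| M₁) / P(D \| M₂): ratio of the marginal likelihoods of two models |
| Credible interval | Interval containing specified posterior probability mass; the Bayesian analog of confidence intervals |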

Next Steps

Continue to 21-02 — Variational Inference to learn how to approximate intractable posteriors using optimization rather than sampling — the foundation of variational autoencoders and modern Bayesian deep learning.