21-01 — Bayesian Inference
Phase: 21 — Probability & Statistics for ML (Advanced)
Subject: 21-01
Prerequisites: 10-01 (Probability Axioms), 10-02 (Conditional Probability and Bayes' Rule), 10-03 (Random Variables), 10-05 (Continuous Distributions), 12-01 (Maximum Likelihood Estimation)
Next subject: 21-02 — Variational Inference
Learning Objectives
By the end of this subject, you will be able to:
- Derive Bayes' theorem for continuous parameters and explain how it formalizes learning as updating beliefs with evidence
- Compute posterior distributions for conjugate prior-likelihood pairs (Beta-Binomial, Gaussian-Gaussian, Dirichlet-Multinomial)
- Derive the maximum a posteriori (MAP) estimate and prove it converges to MLE as the prior becomes uninformative
- Analyze how prior strength affects posterior concentration — derive the effective sample size interpretation of conjugate priors
- Apply Bayesian inference to ML problems: coin bias estimation, Gaussian mean with known variance, and linear regression with a Gaussian prior
Core Content
1. Bayes' Theorem — The Learning Equation
Frequentist statistics asks: "Given a hypothesis, what's the probability of observing this data?" Bayesian statistics inverts this: "Given this data, what's the probability of each hypothesis?"
For model parameters θ and observed data D:
$P(θ | D) = P(D | θ) · P(θ) / P(D) $
where:
- P(θ | D) = posterior: our updated belief about θ after seeing data
- P(D | θ) = likelihood: how probable the data is under parameter θ
- P(θ) = prior: our belief about θ before seeing data
- P(D) = marginal likelihood (evidence) = ∫ P(D | θ) P(θ) dθ
⚠️ THIS IS CRITICAL: Bayes' theorem is the mathematical foundation of Bayesian learning from data. In the Bayesian view, every update of beliefs about model parameters in light of new evidence is an application of Bayes' rule. The prior-to-posterior transition IS learning, formalized.
Proportional form (often sufficient):
$P(θ | D) ∝ P(D | θ) · P(θ) $
The evidence P(D) is a normalization constant — it doesn't depend on θ, so it doesn't affect the shape of the posterior.
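To make the role of the normalization constant concrete, here is a minimal grid-approximation sketch. The data (7 heads in 10 flips), the flat prior, and the helper names `grid_posterior` and `binomial_kernel` are illustrative assumptions, not part of the original text.

```python
# Grid approximation of a posterior: multiply likelihood by prior, then normalize.
# Assumed example data: 10 coin flips with 7 heads, flat prior on theta.

def grid_posterior(thetas, prior, likelihood):
    """Return posterior weights on a grid (likelihood * prior, scaled to sum to 1)."""
    unnorm = [likelihood(t) * p for t, p in zip(thetas, prior)]
    evidence = sum(unnorm)  # discrete stand-in for P(D)
    return [u / evidence for u in unnorm]

n, k = 10, 7
thetas = [i / 100 for i in range(1, 100)]        # grid over (0, 1)
flat_prior = [1.0 / len(thetas)] * len(thetas)

def binomial_kernel(theta):
    # The constant C(n, k) is omitted on purpose: it cancels in the normalization.
    return theta ** k * (1 - theta) ** (n - k)

posterior = grid_posterior(thetas, flat_prior, binomial_kernel)
theta_mode = thetas[posterior.index(max(posterior))]
post_mean = sum(t * w for t, w in zip(thetas, posterior))
print(theta_mode, round(post_mean, 3))
```

Multiplying the likelihood by any constant leaves `posterior` unchanged, which is why the proportional form is usually enough to find the shape of the posterior.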
2. The Beta-Binomial Model (Coin Flipping)
The simplest and most instructive conjugate model: estimating a probability p from coin flips.
Setup:
- Parameter: p ∈ [0, 1], the probability of heads
- Prior: p ~ Beta(α, β)
- Data: n flips, k heads
- Likelihood: Binomial, P(k | p, n) = C(n,k) · p^k · (1−p)^{n−k}
Beta prior density:
$P(p | α, β) = p^{α−1} · (1−p)^{β−1} / B(α, β)
$
where B(α, β) = Γ(α)Γ(β)/Γ(α+β) is the Beta function.
Posterior derivation:
P(p | k, n) ∝ P(k | p, n) · P(p | α, β)
∝ p^k · (1−p)^{n−k} · p^{α−1} · (1−p)^{β−1}
∝ p^{α+k−1} · (1−p)^{β+n−k−1}
This is a Beta(α+k, β+n−k) distribution. The conjugate prior property: Beta prior × Binomial likelihood = Beta posterior.
Interpretation as pseudo-counts: Relative to the uniform prior Beta(1, 1), α−1 acts as "pseudo-heads" and β−1 as "pseudo-tails" seen before any data, so Beta(α, β) corresponds to α+β−2 prior observations on top of that baseline. (In the posterior mean the prior enters with weight α+β, which is the other common way to measure prior strength.) So:
- Beta(1, 1) = Uniform(0,1): "zero prior observations" (often called uninformative)
- Beta(5, 5): prior equivalent to having seen 4 heads and 4 tails
- Beta(100, 1): strong prior belief that p is near 1
Posterior mean:
$E[p | k, n] = (α + k) / (α + β + n) $
This is a weighted average of the prior mean α/(α+β) and the sample proportion k/n, with weights proportional to prior strength (α+β) and sample size (n).
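The weighted-average reading of the posterior mean can be checked numerically. The sketch below assumes a hypothetical Beta(3, 3) prior and 12 heads in 20 flips; the function names are illustrative.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Posterior Beta parameters after observing k heads in n flips."""
    return alpha + k, beta + (n - k)

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

alpha, beta, k, n = 3.0, 3.0, 12, 20   # assumed prior and data
a_post, b_post = beta_binomial_update(alpha, beta, k, n)

# Weighted-average form: prior mean weighted by (alpha + beta), sample proportion by n.
w_prior = (alpha + beta) / (alpha + beta + n)
blend = w_prior * beta_mean(alpha, beta) + (1 - w_prior) * (k / n)

print(a_post, b_post)
print(beta_mean(a_post, b_post), blend)   # the two values agree
```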
3. The Gaussian-Gaussian Model
Setup:
- Parameter: μ (unknown mean, known variance σ²)
- Prior: μ ~ N(μ₀, τ²)
- Data: x₁, ..., x_n i.i.d. ~ N(μ, σ²)
- Likelihood: P(x₁,...,x_n | μ) ∝ exp(−Σ(x_i − μ)² / (2σ²))
Posterior derivation (using the exponential form):
The log-posterior (up to a constant):
$log P(μ | data) = log P(data | μ) + log P(μ) + const
= −Σ(x_i − μ)²/(2σ²) − (μ − μ₀)²/(2τ²) + const
$
Expanding and collecting μ² and μ terms:
$log P(μ | data) = −½[(n/σ² + 1/τ²)μ² − 2(nx̄/σ² + μ₀/τ²)μ] + const $
This is the log of a Gaussian. The posterior is:
$μ | data ~ N(μ_n, τ²_n) $
where:
$μ_n = (n·x̄/σ² + μ₀/τ²) / (n/σ² + 1/τ²)   [precision-weighted average]
τ²_n = 1 / (n/σ² + 1/τ²)
$
Precision parameterization (λ_prior = 1/τ², λ_likelihood = 1/σ²):
$λ_n = λ_prior + n·λ_likelihood
μ_n = (λ_prior·μ₀ + n·λ_likelihood·x̄) / λ_n
$
Precisions ADD. The posterior precision λ_n = 1/τ²_n is the sum of the prior precision and the data precision n/σ².
As n → ∞: μ_n → x̄ (MLE), τ²_n → 0. The prior is overwhelmed by data.
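One consequence of additive precisions is that updating on all data at once gives the same posterior as updating one observation at a time. A minimal sketch, with made-up measurements and an assumed helper name `gaussian_posterior`:

```python
def gaussian_posterior(mu0, tau2, sigma2, xs):
    """Posterior mean and variance of mu for a N(mu0, tau2) prior and N(mu, sigma2) data."""
    n = len(xs)
    xbar = sum(xs) / n
    precision = 1.0 / tau2 + n / sigma2            # precisions add
    mean = (mu0 / tau2 + n * xbar / sigma2) / precision
    return mean, 1.0 / precision

xs = [4.8, 5.4, 5.1, 4.6, 5.6]                     # assumed measurements
batch = gaussian_posterior(0.0, 10.0, 1.0, xs)

# Sequential updating: the posterior after each point becomes the next prior.
mean, var = 0.0, 10.0
for x in xs:
    mean, var = gaussian_posterior(mean, var, 1.0, [x])

print(batch, (mean, var))   # same result up to floating-point error
```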
4. Maximum a Posteriori (MAP) Estimation
The MAP estimate is the mode of the posterior — the single "best" parameter value under the Bayesian framework:
$θ_MAP = argmax_θ P(θ | D) = argmax_θ [log P(D | θ) + log P(θ)] $
Compare to MLE: θ_MLE = argmax_θ log P(D | θ). MAP adds the log-prior as a regularization term.
For the Beta-Binomial model:
$p_MAP = (α + k − 1) / (α + β + n − 2) [mode of Beta(α+k, β+n−k)] $
For α=β=1 (uniform prior): p_MAP = k/n = p_MLE. The uniform prior adds no regularization.
For the Gaussian-Gaussian model:
$μ_MAP = μ_n = (n·x̄/σ² + μ₀/τ²) / (n/σ² + 1/τ²) $
MAP equals posterior mean for Gaussians (Gaussian is symmetric, mode = mean).
Connection to L2 regularization: For a Gaussian prior μ ~ N(0, τ²), the log-prior is −μ²/(2τ²) + const. Maximizing log P(D | μ) − μ²/(2τ²) is equivalent to MLE with an L2 penalty. On the log-likelihood scale the penalty coefficient is 1/(2τ²); for a Gaussian likelihood with noise variance σ², rescaling to the usual squared-error form gives λ = σ²/τ². So Bayesian MAP with a Gaussian prior = ridge regression.
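As a small numeric illustration of this equivalence, the sketch below minimizes the penalized negative log-posterior for a Gaussian mean by plain gradient descent and compares it with the closed-form shrinkage estimate. The data, step size, and variances are assumptions chosen for illustration.

```python
def neg_log_posterior_grad(mu, xs, sigma2, tau2):
    """Gradient in mu of sum((x - mu)^2)/(2*sigma2) + mu^2/(2*tau2)."""
    return -sum(x - mu for x in xs) / sigma2 + mu / tau2

xs = [2.1, 1.7, 2.6, 1.9]        # assumed data
sigma2, tau2 = 1.0, 0.5          # assumed noise and prior variances

# Closed form: ridge-style shrinkage toward 0 with lambda = sigma2 / tau2.
lam = sigma2 / tau2
mu_closed = sum(xs) / (len(xs) + lam)

# Gradient descent on the same penalized objective (step size chosen for this toy case).
mu = 0.0
for _ in range(500):
    mu -= 0.05 * neg_log_posterior_grad(mu, xs, sigma2, tau2)

mu_mle = sum(xs) / len(xs)
print(mu_mle, mu_closed, mu)     # MAP is pulled from the MLE toward the prior mean 0
```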
5. Conjugate Priors — The Full Family
A prior is conjugate to a likelihood if the posterior belongs to the same family. This makes Bayesian updating computationally tractable — just update parameters, no integration needed.
| Likelihood | Conjugate Prior | Posterior Update |
|---|---|---|
| Binomial(n, p) | Beta(α, β) | Beta(α+k, β+n−k) |
| Poisson(λ) | Gamma(a, b) | Gamma(a+Σx_i, b+n) |
| Gaussian (unknown μ, known σ²) | Gaussian(μ₀, τ²) | Gaussian(μ_n, τ²_n) as in Section 3 |
| Gaussian (known μ, unknown σ²) | Inv-Gamma(a, b) on σ² | Inv-Gamma(a+n/2, b+½Σ(x_i−μ)²) |
| Multinomial(p₁,...,p_K) | Dirichlet(α₁,...,α_K) | Dirichlet(α₁+n₁,...,α_K+n_K) |
| Exponential(λ) | Gamma(a, b) | Gamma(a+n, b+Σx_i) |
Here Gamma(a, b) uses the shape-rate parameterization, k is the number of successes in the Binomial row, and n_j is the observed count in category j in the Multinomial row.
The Dirichlet-Multinomial model is the multivariate extension of Beta-Binomial, used extensively in topic models (LDA), Bayesian mixture models, and any problem with categorical data.
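The Dirichlet-Multinomial update is the same count-adding rule as Beta-Binomial, applied per category. A minimal sketch with an assumed symmetric prior and made-up counts for three categories:

```python
def dirichlet_update(alphas, counts):
    """Posterior Dirichlet parameters: add each category's observed count."""
    return [a + c for a, c in zip(alphas, counts)]

def dirichlet_mean(alphas):
    total = sum(alphas)
    return [a / total for a in alphas]

prior = [1.0, 1.0, 1.0]     # assumed symmetric prior over 3 categories
counts = [6, 3, 1]          # assumed observed counts

post = dirichlet_update(prior, counts)
print(post, dirichlet_mean(post))
```

With two categories this reduces to the Beta-Binomial update from Section 2.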
6. Bayesian Model Comparison
Given models M₁ and M₂, which is better supported by the data?
Bayes factor:
$BF₁₂ = P(D | M₁) / P(D | M₂) $
where P(D | M_k) = ∫ P(D | θ_k, M_k) · P(θ_k | M_k) dθ_k is the marginal likelihood.
Interpretation (Jeffreys' scale):
- BF₁₂ > 100: Decisive evidence for M₁
- BF₁₂ ∈ [10, 100]: Strong evidence
- BF₁₂ ∈ [3, 10]: Substantial evidence
- BF₁₂ ∈ [1, 3]: Barely worth mentioning
- BF₁₂ < 1: Evidence favors M₂
The marginal likelihood automatically penalizes model complexity — complex models spread their prior probability mass thinly over many parameter values, making any specific dataset less probable. This is the Bayesian Ockham's razor.
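For the coin model, both marginal likelihoods have closed forms, so a Bayes factor can be computed directly. The sketch below compares an assumed M₁ (fair coin, p fixed at 0.5) with an assumed M₂ (p ~ Beta(1, 1)) on hypothetical data of 7 heads in 10 flips. It uses the standard Beta-Binomial marginal likelihood C(n,k)·B(α+k, β+n−k)/B(α, β).

```python
from math import comb, exp, lgamma

def log_beta_fn(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_fixed(k, n, p):
    """P(D | M) when the model fixes p, so there is nothing to integrate over."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def marginal_beta(k, n, alpha, beta):
    """P(D | M) with p ~ Beta(alpha, beta), integrated analytically."""
    return comb(n, k) * exp(log_beta_fn(alpha + k, beta + n - k) - log_beta_fn(alpha, beta))

k, n = 7, 10                         # assumed data
m1 = marginal_fixed(k, n, 0.5)       # M1: fair coin
m2 = marginal_beta(k, n, 1.0, 1.0)   # M2: unknown bias with a uniform prior
bf12 = m1 / m2
print(m1, m2, bf12)
```

Although M₂ can fit p = 0.7 exactly, its evidence is averaged over every value of p, so in this toy case the Bayes factor stays in the "barely worth mentioning" range rather than favoring the flexible model.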
7. Bayesian Linear Regression
Model: y = Xβ + ε, where ε ~ N(0, σ²I)
Prior: β ~ N(0, τ²I) — a zero-mean isotropic Gaussian (ridge prior)
Likelihood: y | β, X ~ N(Xβ, σ²I)
Posterior:
$β | y, X ~ N(μ_n, Σ_n)
Σ_n = (X^T X/σ² + I/τ²)^{−1}
μ_n = Σ_n · X^T y / σ²
$
Compare to ridge regression: β̂_ridge = (X^T X + λI)^{−1} X^T y. Setting λ = σ²/τ², the ridge estimator equals the posterior mean. Bayesian linear regression unifies:
- MLE: λ = 0 (τ² → ∞, flat prior)
- Ridge: λ = σ²/τ²
- Fully Bayesian: retain the full posterior distribution (not just the mode), enabling uncertainty quantification
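The posterior-mean and ridge formulas can be compared in the simplest case of a single feature with no intercept, where X^T X and X^T y are scalars. This is a sketch under that simplifying assumption, with made-up data and illustrative function names:

```python
def posterior_1d(xs, ys, sigma2, tau2):
    """Posterior mean and variance of the slope in y = beta * x + noise."""
    sxx = sum(x * x for x in xs)                 # scalar X^T X
    sxy = sum(x * y for x, y in zip(xs, ys))     # scalar X^T y
    var_n = 1.0 / (sxx / sigma2 + 1.0 / tau2)
    mean_n = var_n * sxy / sigma2
    return mean_n, var_n

def ridge_1d(xs, ys, lam):
    """Ridge estimate of the slope: (X^T X + lam)^(-1) X^T y in one dimension."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

xs = [0.5, 1.0, 1.5, 2.0, 2.5]      # assumed inputs
ys = [1.1, 1.9, 3.2, 3.9, 5.1]      # assumed responses
sigma2, tau2 = 0.25, 4.0            # assumed noise and prior variances

mean_n, var_n = posterior_1d(xs, ys, sigma2, tau2)
beta_ridge = ridge_1d(xs, ys, sigma2 / tau2)
beta_ols = ridge_1d(xs, ys, 0.0)
print(mean_n, beta_ridge, beta_ols, var_n)
```

The posterior mean matches the ridge estimate with λ = σ²/τ², while `var_n` is the extra information a point estimate does not carry.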
Worked Examples
Example 1: Beta-Binomial Updating
Problem: You have a prior Beta(2, 2) — a weak belief that a coin is fair. You flip the coin 10 times and get 7 heads. Compute the posterior distribution, its mean, and the MAP estimate.
Solution:
$Posterior = Beta(α+k, β+n−k) = Beta(2+7, 2+3) = Beta(9, 5)
Posterior mean: E[p] = 9/(9+5) = 9/14 ≈ 0.643
MAP: mode = (9−1)/(9+5−2) = 8/12 = 2/3 ≈ 0.667
Prior mean was 2/4 = 0.5. Data says 0.7. Posterior compromises at ~0.64.
$
The data pulls the estimate from 0.5 toward 0.7, but the prior Beta(2,2), equivalent to one pseudo-head and one pseudo-tail beyond the uniform baseline, provides a small anchor.
Example 2: MAP vs MLE Convergence
Problem: Show that for the Beta-Binomial model with prior Beta(α, β), as n → ∞, the MAP estimate converges to the MLE k/n.
Solution:
$p_MAP = (α + k − 1) / (α + β + n − 2)
Divide numerator and denominator by n:
p_MAP = (α/n + k/n − 1/n) / (α/n + β/n + 1 − 2/n)
As n → ∞: α/n → 0, β/n → 0, 1/n → 0, 2/n → 0
So p_MAP − k/n → 0: the MAP estimate and p_MLE = k/n converge to the same value. ✓
$
The prior's influence decays as O(1/n). For large datasets, Bayesian and frequentist estimates agree — data overwhelms the prior.
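The O(1/n) decay can be seen numerically. The sketch below assumes a fairly strong Beta(10, 2) prior and holds the sample proportion at 0.3 while n grows; for these particular values the gap works out algebraically to 6/(n+10).

```python
def beta_map(alpha, beta, k, n):
    """Mode of Beta(alpha + k, beta + n - k); assumes both posterior parameters exceed 1."""
    return (alpha + k - 1) / (alpha + beta + n - 2)

alpha, beta = 10.0, 2.0               # assumed prior leaning toward heads
gaps = []
for n in (10, 100, 1000, 10000):
    k = 3 * n // 10                   # keep the sample proportion at 0.3
    gap = abs(beta_map(alpha, beta, k, n) - k / n)
    gaps.append((n, gap))
    print(n, round(gap, 6))           # shrinks roughly tenfold per tenfold increase in n
```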
Example 3: Gaussian Posterior with Informative Prior
Problem: A measurement device has known standard deviation σ = 2. Your prior on the true value μ is N(100, 5²). You take n = 4 measurements: [103, 99, 102, 104]. Compute the posterior distribution and a 95% credible interval.
Solution:
$x̄ = (103+99+102+104)/4 = 102
σ² = 4, so λ_data = n/σ² = 4/4 = 1
λ_prior = 1/τ² = 1/25 = 0.04
μ_n = (0.04·100 + 1·102) / (0.04 + 1) = (4 + 102)/1.04 = 106/1.04 ≈ 101.92
τ²_n = 1/(0.04 + 1) = 1/1.04 ≈ 0.9615
τ_n ≈ 0.981
$
Posterior: μ | data ~ N(101.92, 0.981²).
95% credible interval: μ_n ± 1.96·τ_n = 101.92 ± 1.96·0.981 = [100.00, 103.84].
The prior (centered at 100) pulls the estimate down slightly from the sample mean of 102. With n=4, the prior still has some influence.
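The arithmetic in this example can be reproduced with the standard library. This is a sketch that uses `statistics.NormalDist` for the interval endpoints; the variable names are illustrative.

```python
from statistics import NormalDist

xs = [103, 99, 102, 104]
sigma2, mu0, tau2 = 2.0 ** 2, 100.0, 5.0 ** 2

n = len(xs)
xbar = sum(xs) / n
post_precision = 1 / tau2 + n / sigma2                 # 0.04 + 1
mu_n = (mu0 / tau2 + n * xbar / sigma2) / post_precision
tau_n = (1 / post_precision) ** 0.5

# Central 95% credible interval from the Gaussian posterior's quantiles.
posterior = NormalDist(mu_n, tau_n)
lo, hi = posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)
print(mu_n, tau_n, lo, hi)
```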
Quiz
Q1: Which description of the Dirichlet-Multinomial model matches this subject?
A) A model for a single Gaussian mean with known variance B) A Gamma prior updated by Poisson counts C) A method for comparing two models through their marginal likelihoods D) The multivariate extension of Beta-Binomial: a Dirichlet prior on category probabilities, updated by adding observed counts
Correct: D)
- If you chose A: This is incorrect. That describes the Gaussian-Gaussian model. The Dirichlet-Multinomial model handles categorical data.
- If you chose B: This is incorrect. That describes the Gamma-Poisson pair. The Dirichlet-Multinomial model pairs a Dirichlet prior with a Multinomial likelihood.
- If you chose C: This is incorrect. That describes the Bayes factor. The Dirichlet-Multinomial model is a conjugate prior-likelihood pair.
- If you chose D: Dirichlet(α₁,...,α_k) updated with counts n₁,...,n_k gives Dirichlet(α₁+n₁,...,α_k+n_k), the multivariate extension of Beta-Binomial. Correct!
Q2: What is the primary purpose of conjugate priors?
A) To keep the posterior in the same family as the prior, so updating only requires changing parameters B) To guarantee that the prior is uninformative C) To remove the need for a likelihood D) To make the MAP estimate equal the MLE for any sample size
Correct: A)
- If you chose A: A conjugate prior yields a posterior in the same family, so Bayesian updating is analytic: update the parameters, no integration needed. Correct!
- If you chose B: This is incorrect. Conjugacy says nothing about how informative a prior is; Beta(100, 1) is conjugate and strongly informative.
- If you chose C: This is incorrect. The likelihood is still required; conjugacy is a property of a prior-likelihood pair.
- If you chose D: This is incorrect. MAP and MLE coincide only in special cases such as a uniform prior, or in the limit of large n.
Q3: Which statement about MAP estimation is TRUE?
A) MAP estimation is not related to this subject B) MAP estimation is a fundamental concept covered in this subject C) MAP estimation is an advanced topic beyond this subject's scope D) MAP estimation is mentioned only as a historical footnote
Correct: B)
- If you chose A: This is incorrect. MAP estimation is a fundamental concept covered in this subject. This subject covers MAP estimation as part of its core content.
- If you chose B: MAP estimation is a fundamental concept covered in this subject. This subject covers MAP estimation as part of its core content. Correct!
- If you chose C: This is incorrect. MAP estimation is a fundamental concept covered in this subject. This subject covers MAP estimation as part of its core content.
- If you chose D: This is incorrect. MAP estimation is a fundamental concept covered in this subject. This subject covers MAP estimation as part of its core content.
Q4: In Practice Problem 1, the prior is Beta(1, 9) and 3 defects are found among 50 inspected parts. What is the posterior?
A) Beta(56, 4) B) Beta(3, 47) C) Beta(4, 56) D) Beta(4, 59)
Correct: C)
- If you chose A: This is incorrect. The parameters are swapped: defects update α, non-defects update β. The result is Beta(4, 56).
- If you chose B: This is incorrect. This drops the prior counts. The update adds the data to the prior: Beta(1+3, 9+47) = Beta(4, 56).
- If you chose C: Beta(α+k, β+n−k) = Beta(1+3, 9+47) = Beta(4, 56). Correct!
- If you chose D: This is incorrect. β is updated with the number of non-defects n−k = 47, not with n = 50. The result is Beta(4, 56).
Q5: How are MAP estimation and Prior strength related?
A) MAP estimation is a special case of Prior strength B) MAP estimation is the inverse of Prior strength C) MAP estimation and Prior strength are closely related concepts D) MAP estimation and Prior strength are completely unrelated topics
Correct: C)
- If you chose A: This is incorrect. Both MAP estimation and Prior strength are covered in this subject as interconnected topics.
- If you chose B: This is incorrect. Both MAP estimation and Prior strength are covered in this subject as interconnected topics.
- If you chose C: Both MAP estimation and Prior strength are covered in this subject as interconnected topics. Correct!
- If you chose D: This is incorrect. Both MAP estimation and Prior strength are covered in this subject as interconnected topics.
Q6: What is a common pitfall when working with Bayesian model comparison?
A) Bayesian model comparison has no common misconceptions B) Bayesian model comparison is always computed the same way in all contexts C) The main error with Bayesian model comparison is using it when it is not needed D) A common mistake is confusing Bayesian model comparison with a similar concept
Correct: D)
- If you chose A: This is incorrect. Students often confuse Bayesian model comparison with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: This is incorrect. Students often confuse Bayesian model comparison with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose C: This is incorrect. Students often confuse Bayesian model comparison with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: Students often confuse Bayesian model comparison with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
Q7: When should you apply Posterior?
A) Apply Posterior to solve problems in this subject's domain B) Avoid Posterior unless explicitly instructed C) Use Posterior only in pure mathematics contexts D) Posterior is not practically useful
Correct: A)
- If you chose A: Posterior is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose B: This is incorrect. Posterior is a practical tool used throughout this subject to solve relevant problems.
- If you chose C: This is incorrect. Posterior is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: This is incorrect. Posterior is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
Problem 1
A factory produces parts with a defect rate p. Your prior is Beta(1, 9) — you expect ~10% defective. After inspecting 50 parts, you find 3 defects. Compute the posterior distribution and the posterior probability that p < 0.15.
Answer
Posterior: Beta(1+3, 9+47) = Beta(4, 56)
Posterior mean: 4/(4+56) = 4/60 = 1/15 ≈ 0.067
P(p < 0.15): This requires the Beta CDF. For Beta(4, 56):
Mode ≈ (4−1)/(4+56−2) = 3/58 ≈ 0.052
The distribution is concentrated well below 0.15. Using the normal approximation:
E[p] = 4/60 ≈ 0.0667
Var(p) = (4·56)/(60²·61) = 224/(3600·61) ≈ 0.00102
SD ≈ 0.032
z = (0.15 − 0.0667)/0.032 ≈ 2.60
P(p < 0.15) ≈ Φ(2.60) ≈ 0.995
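Under the normal approximation there is about 99.5% posterior probability that p < 0.15. Because Beta(4, 56) is right-skewed, this approximation slightly overstates the answer; the exact Beta CDF gives roughly 0.98. Either way, the data (3 defects in 50) plus the prior strongly suggest a low defect rate.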
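For integer parameters the Beta CDF can be evaluated exactly without a statistics library, using the identity I_x(a, b) = P(Binomial(a+b−1, x) ≥ a). The sketch below applies it to this problem; the function name `beta_cdf_int` is an assumption for illustration.

```python
from math import comb

def beta_cdf_int(x, a, b):
    """P(p <= x) for p ~ Beta(a, b) with positive integer a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    m = a + b - 1
    return sum(comb(m, j) * x ** j * (1 - x) ** (m - j) for j in range(a, m + 1))

exact = beta_cdf_int(0.15, 4, 56)
print(exact)   # compare with the normal-approximation value of about 0.995
```

The exact value comes out a little below the normal approximation, which is expected for a right-skewed posterior with a threshold in the upper tail.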
Problem 2
Derive the posterior for a Poisson likelihood with a Gamma prior. Show that the posterior mean is a weighted average of the prior mean and the sample mean.
Answer
Prior: λ ~ Gamma(a, b), with density P(λ) ∝ λ^{a−1}·e^{−bλ}
Likelihood for data x₁,...,x_n: P(data | λ) ∝ λ^{Σx_i}·e^{−nλ}
Posterior:
P(λ | data) ∝ λ^{a−1}·e^{−bλ} · λ^{Σx_i}·e^{−nλ}
∝ λ^{a+Σx_i−1}·e^{−(b+n)λ}
This is Gamma(a+Σx_i, b+n).
Posterior mean: E[λ | data] = (a+Σx_i)/(b+n)
Prior mean: a/b
Sample mean: x̄ = Σx_i/n
Rewrite posterior mean:
$E[λ | data] = (a + n·x̄)/(b + n)
= [b/(b+n)]·(a/b) + [n/(b+n)]·x̄
$
This is a weighted average of the prior mean a/b and sample mean x̄, with weights proportional to b (prior "strength") and n (sample size). The Gamma(a, b) prior carries the weight of b prior observations.
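A quick numeric check of the weighted-average identity, using an assumed Gamma(3, 2) prior (shape-rate) and made-up Poisson counts:

```python
a, b = 3.0, 2.0                  # assumed shape-rate prior, prior mean a/b = 1.5
xs = [4, 2, 5, 3, 6, 4]          # assumed Poisson counts
n = len(xs)
xbar = sum(xs) / n

# Conjugate update: Gamma(a + sum(x), b + n).
a_post, b_post = a + sum(xs), b + n
post_mean = a_post / b_post

# Weighted-average form with weight b/(b+n) on the prior mean.
w = b / (b + n)
blend = w * (a / b) + (1 - w) * xbar
print(post_mean, blend)          # the two values agree
```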
Problem 3
You have two models for a dataset: M₁ (simple, 1 parameter) and M₂ (complex, 10 parameters). Both fit the data equally well (same maximum likelihood). Explain why the Bayes factor favors M₁, and compute the approximate penalty if each parameter has a uniform prior of width W.
Answer
The marginal likelihood for model M with parameters θ is:
$P(D | M) = ∫ P(D | θ) P(θ | M) dθ $
If both models achieve similar likelihood at their optimal parameters, the integral depends on:
1. The prior density at the optimal parameters
2. The volume of parameter space where likelihood is high ("Occam factor")
For uniform priors of width W per parameter:
- M₁: prior density = 1/W
- M₂: prior density = 1/W^{10}
The Bayes factor (assuming equal peak likelihood L and, as a rough simplification, a likelihood peak of about unit width in each parameter):
$BF₁₂ ≈ L·(1/W) / [L·(1/W^{10})] = W^9
$
If W = 10 (each parameter could reasonably be anywhere in a range of 10), then BF₁₂ ≈ 10^9 — overwhelming evidence for the simpler model. The complex model spreads its prior probability so thinly that any specific dataset is vastly less probable under it.
This is the Bayesian Ockham's razor: complex models are automatically penalized unless they fit the data SUBSTANTIALLY better.
Problem 4
In Bayesian linear regression with prior β ~ N(0, τ²I), derive the posterior covariance Σ_n = (X^T X/σ² + I/τ²)^{−1}. Interpret what happens to the posterior uncertainty as n → ∞.
Answer
The log-posterior:
$log P(β | y, X) = −(y−Xβ)^T(y−Xβ)/(2σ²) − β^Tβ/(2τ²) + const $
Expand the quadratic and collect terms in β:
$= −½[β^T(X^T X/σ² + I/τ²)β − 2β^T X^T y/σ²] + const $
This is the log of a multivariate Gaussian with:
$Σ_n = (X^T X/σ² + I/τ²)^{−1}
μ_n = Σ_n · X^T y / σ²
$
As n → ∞, X^T X grows as O(n) (assuming the design does not degenerate), so the eigenvalues of X^T X/σ² grow without bound and Σ_n → 0. The posterior variance collapses: we become certain about β. Specifically:
Σ_n ≈ σ²(X^T X)^{−1} for large n
This is the sampling covariance of the frequentist OLS estimator. The prior becomes irrelevant as data accumulates.
Problem 5
Explain why "uninformative priors" are a myth — every prior encodes assumptions. Use the Beta(1,1), Beta(0.5, 0.5), and Beta(2,2) as examples applied to coin flipping.
Answer
All three are sometimes called "uninformative" but encode different assumptions:
**Beta(1, 1) = Uniform(0,1):** "All values of p are equally likely." But this is NOT invariant under reparameterization. If we parameterize by odds p/(1−p), the prior is no longer uniform. So "uniform in p" is a specific assumption.
**Beta(0.5, 0.5) = Jeffreys prior:** The prior proportional to √(I(p)) where I(p) is Fisher information. Invariant under reparameterization. But its density peaks near 0 and 1, so it favors extreme values of p.
**Beta(2, 2):** Concentrates mass near 0.5, so it assumes p is likely near fair.
With n=10, k=7:
- Beta(1,1) → posterior mean = 8/12 ≈ 0.667
- Beta(0.5,0.5) → posterior mean = 7.5/11 ≈ 0.682
- Beta(2,2) → posterior mean = 9/14 ≈ 0.643
Different "uninformative" priors produce different posteriors on small datasets. There is no truly uninformative prior: every prior is a modeling choice. The best practice is to test sensitivity to prior choice and report it.
Summary
- Bayes' theorem P(θ|D) ∝ P(D|θ)·P(θ) formalizes learning as updating prior beliefs with observed evidence: the prior-to-posterior shift IS learning
- Conjugate priors (Beta-Binomial, Gaussian-Gaussian, Gamma-Poisson, Dirichlet-Multinomial) make Bayesian updating analytic — just update parameters, no integration
- MAP estimation adds a log-prior regularization term to MLE — Gaussian MAP = ridge regression, Laplace MAP = Lasso
- Prior strength acts as pseudo-observations: Beta(α,β) enters the posterior mean with weight α+β (α+β−2 pseudo-observations relative to the uniform prior); as n → ∞, the prior's influence decays as O(1/n)
- Bayesian model comparison through marginal likelihoods automatically implements Ockham's razor — complex models are penalized unless they fit substantially better
Pitfalls
- Treating uniform priors as truly uninformative. Beta(1,1) assigns equal density to all values of p, but this is not invariant under reparameterization: a uniform prior on p becomes non-uniform on log-odds. Different "uninformative" priors (Beta(1,1), the Jeffreys prior Beta(0.5,0.5), Beta(2,2)) produce different posteriors on small datasets. Always test sensitivity to prior choice and report how conclusions change under alternative priors.
- Confusing credible intervals with confidence intervals. A 95% credible interval means "there is a 95% probability that the parameter lies in this interval, given the data and prior." A 95% confidence interval means "if we repeated the experiment many times, 95% of the constructed intervals would contain the true parameter." They answer different questions and can differ substantially, especially with informative priors or small samples. Don't interpret one as the other.
- Using conjugate priors for mathematical convenience without checking model fit. Conjugate priors make Bayesian updating analytically tractable, but the Beta-Binomial or Gaussian-Gaussian model may be a poor fit for your actual data. A conjugate model that's wrong is still wrong — the posterior will be precisely wrong rather than approximately right. Use posterior predictive checks to verify the model describes the data reasonably well.
- Forgetting that the prior's influence decays as O(1/n), not instantaneously. Even with n=100 observations, a Beta(10,1) prior (strong belief that p is near 1) still measurably influences the posterior. The rule of thumb "large samples overwhelm the prior" is asymptotic — in practice, strong priors require surprisingly large datasets to be fully overcome. Compute the effective sample size of your prior and compare it to your actual sample size.
- Computing the posterior mean and calling it done. The full posterior distribution contains information about uncertainty that a point estimate discards. Two posteriors with the same mean can have very different variances — and that variance matters for decision-making under uncertainty. Report credible intervals, posterior standard deviations, or visualize the full posterior, not just E[θ|D].
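The prior-strength pitfall above can be made concrete by computing how much of the posterior mean still comes from the prior at a fixed sample size. The sketch assumes hypothetical data of 60 heads in 100 flips and three example priors; the function names are illustrative.

```python
def posterior_mean(alpha, beta, k, n):
    return (alpha + k) / (alpha + beta + n)

def prior_weight(alpha, beta, n):
    """Share of the posterior mean contributed by the prior mean."""
    return (alpha + beta) / (alpha + beta + n)

k, n = 60, 100     # assumed data
priors = {"Beta(1,1)": (1, 1), "Beta(10,1)": (10, 1), "Beta(50,5)": (50, 5)}
for name, (a, b) in priors.items():
    print(name, round(posterior_mean(a, b, k, n), 3), round(prior_weight(a, b, n), 3))
```

Comparing `prior_weight` with the sample's share n/(α+β+n) is one simple way to judge whether a dataset is large enough to dominate a given prior.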
Key Terms
| Term | Definition |
|---|---|
| Posterior | P(θ \| D): updated belief about θ after seeing data |
| Prior | P(θ): belief about parameters before observing data |
| Likelihood | P(D \| θ): how probable the data is under parameter θ |
| Marginal likelihood | P(D) = ∫ P(D \| θ) P(θ) dθ: the evidence, which normalizes the posterior |
| Conjugate prior | A prior distribution where the posterior belongs to the same family; enables analytic Bayesian updating |
| Beta distribution | Conjugate prior for Binomial likelihood; Beta(α,β), support [0,1] |
| MAP estimate | θ_MAP = argmax_θ P(θ \| D): the mode of the posterior |
| Precision | λ = 1/σ²; additive in Gaussian-Gaussian conjugate updating |
| Bayes factor | BF₁₂ = P(D \| M₁) / P(D \| M₂): ratio of marginal likelihoods of two models |
| Credible interval | Interval containing specified posterior probability mass — the Bayesian analog of confidence intervals |
Next Steps
Continue to 21-02 — Variational Inference to learn how to approximate intractable posteriors using optimization rather than sampling — the foundation of variational autoencoders and modern Bayesian deep learning.