21-03 β Markov Chain Monte Carlo (MCMC)
Phase: 21 β Probability & Statistics for ML (Advanced) Subject: 21-03 Prerequisites: 21-01 (Bayesian Inference), 10-06 (Expectation and Variance), 11-01 (Markov Chains β transition matrices, stationary distributions), 11-02 (Markov Processes β continuous-time), 03-06 (Sequences and Series β convergence) Next subject: 21-04 β EM Algorithm
Learning Objectives
By the end of this subject, you will be able to:
- Derive the Metropolis-Hastings algorithm from the detailed balance condition and prove it converges to the target distribution
- Implement and analyze Gibbs sampling β derive the conditional distributions and prove it is a special case of Metropolis-Hastings with acceptance probability 1
- Explain why MCMC is asymptotically exact (unlike variational inference) and derive the Monte Carlo error as O(1/βN_eff) where N_eff accounts for autocorrelation
- Diagnose MCMC convergence using trace plots, autocorrelation, effective sample size, and the Gelman-Rubin RΜ statistic
- Apply Hamiltonian Monte Carlo (HMC) intuition β explain how gradient information and Hamiltonian dynamics suppress random-walk behavior
Core Content
1. The Problem: Sampling from Intractable Distributions
Bayesian inference requires computing expectations under the posterior:
$E_{p(z|x)}[f(z)] = β« f(z) p(z|x) dz
$
But p(z|x) = p(x|z)p(z)/p(x) is only known up to a constant β the evidence p(x) is intractable. MCMC solves this by constructing a Markov chain whose stationary distribution IS the target distribution p(z|x). Then we simulate the chain and use the samples to approximate expectations.
β οΈ THIS IS CRITICAL β MCMC is asymptotically EXACT. Unlike variational inference (which minimizes KL to a simpler family), MCMC converges to the true posterior as the number of samples β β. The tradeoff is computational cost.
2. Markov Chains and Stationary Distributions
A Markov chain is a sequence zβ, zβ, ... where z_{t+1} depends only on z_t, through a transition kernel T(z' | z):
$P(z_{t+1} = z' | z_t = z) = T(z' | z)
$
A distribution Ο is stationary if applying the transition leaves it unchanged:
$Ο(z') = β« T(z' | z) Ο(z) dz $
Ergodic theorem: For an irreducible, aperiodic Markov chain with stationary distribution Ο, the time average converges to the expectation under Ο:
(1/N) Ξ£_{t=1}^N f(z_t) β E_Ο[f(z)] as N β β
MCMC constructs T such that Ο = p(z|x) (our target posterior).
3. Metropolis-Hastings (MH) Algorithm
The most general MCMC algorithm. Given current state z, propose z' ~ Q(z' | z), then accept with probability:
$Ξ±(z β z') = min(1, [p(z') Β· Q(z | z')] / [p(z) Β· Q(z' | z)]) $
where p(z) is the unnormalized target density (p(z) β p(z|x)).
Algorithm:
$Initialize z_0
For t = 1 to N:
Propose z' ~ Q(z' | z_{tβ1})
Compute Ξ± = min(1, p(z')Q(z_{tβ1}|z') / (p(z_{tβ1})Q(z'|z_{tβ1})))
With probability Ξ±: z_t = z' (accept)
Otherwise: z_t = z_{tβ1} (reject)
$
Proof of correctness (detailed balance):
The MH kernel satisfies detailed balance with respect to p:
$p(z) Β· T(z' | z) = p(z) Β· Q(z' | z) Β· Ξ±(z β z')
= p(z) Β· Q(z' | z) Β· min(1, p(z')Q(z|z')/(p(z)Q(z'|z)))
= min(p(z)Q(z'|z), p(z')Q(z|z'))
Similarly: p(z') Β· T(z | z') = min(p(z')Q(z|z'), p(z)Q(z'|z))
= p(z) Β· T(z' | z) β
$
Detailed balance implies p is stationary. The acceptance ratio is the heart of MH β it corrects for the asymmetry in the proposal distribution.
Special cases: - Random-walk MH: Q(z' | z) = N(z' | z, Ξ£) β symmetric proposal. Acceptance simplifies to Ξ± = min(1, p(z')/p(z)). - Independence sampler: Q(z' | z) = Q(z') β proposal independent of current state.
4. Gibbs Sampling
When the conditional distributions p(z_j | z_{βj}) are tractable, Gibbs sampling provides an acceptance probability of 1:
$For t = 1 to N:
zβ^{(t)} ~ p(zβ | zβ^{(tβ1)}, zβ^{(tβ1)}, ..., z_d^{(tβ1)})
zβ^{(t)} ~ p(zβ | zβ^{(t)}, zβ^{(tβ1)}, ..., z_d^{(tβ1)})
...
z_d^{(t)} ~ p(z_d | zβ^{(t)}, zβ^{(t)}, ..., z_{dβ1}^{(t)})
$
Each update samples from a FULL conditional β no accept/reject step needed.
Why Ξ± = 1: Gibbs is MH with proposal Q(z_j' | z) = p(z_j' | z_{βj}). Then:
$Ξ± = min(1, p(z')p(z_j|z_{βj}) / (p(z)p(z_j'|z_{βj})))
= min(1, p(z')p(z_j|z_{βj}) / [p(z_{βj})p(z_j|z_{βj}) Β· p(z_j'|z_{βj})])
$
Since z' and z share the same z_{βj}, p(z') = p(z_{βj})p(z_j'|z_{βj}), and the ratio simplifies to 1.
5. Hamiltonian Monte Carlo (HMC)
Random-walk MH suffers from slow exploration in high dimensions β the acceptance rate drops as dimension grows. HMC uses gradient information to propose distant states with high acceptance probability.
Key idea: Augment the parameter space with "momentum" variables r and simulate Hamiltonian dynamics:
$H(z, r) = βlog p(z) + Β½ r^T M^{β1} r
= U(z) + K(r)
$
where U(z) = βlog p(z) is the "potential energy" and K(r) = Β½r^T M^{β1}r is the "kinetic energy."
Hamilton's equations:
$dz/dt = βH/βr = M^{β1}r
dr/dt = ββH/βz = β_z log p(z)
$
HMC algorithm:
Sample momentum r ~ N(0, M)
Simulate Hamiltonian dynamics for L steps with step size Ξ΅
(using leapfrog integrator for volume-preservation and reversibility)
Proposed state: (z', r') after L steps
Accept with probability: min(1, exp(H(z,r) β H(z',r')))
Why it works better: The gradient β log p(z) guides proposals TOWARD high-probability regions. The Hamiltonian dynamics preserve the joint distribution p(z)Β·N(r|0,M), so proposals remain in high-probability regions. Acceptance rates stay near 1 even in high dimensions because energy is approximately conserved by the symplectic integrator.
HMC suppresses the random-walk behavior that plagues MH in high dimensions β the number of steps to traverse the distribution scales as O(d^{1/4}) for HMC vs O(d) for random-walk MH.
6. MCMC Diagnostics
Trace plots: Plot z_t vs t. Look for stationarity (no drift) and good mixing (chain explores the full range). A "hairy caterpillar" look is ideal.
Autocorrelation: Ο_k = Corr(z_t, z_{t+k}). Effective sample size:
$N_eff = N / (1 + 2 Ξ£_{k=1}^β Ο_k)
$
Chains with high autocorrelation provide fewer independent samples.
Gelman-Rubin RΜ: Run M chains from overdispersed starting points. Compare within-chain variance W to between-chain variance B:
$RΜ = β((Nβ1)/N Β· W + B/N) / W) $
RΜ > 1.1 suggests the chains haven't converged. RΜ β 1 indicates convergence.
Acceptance rate for MH: Too high (>90%) β proposals too conservative, slow mixing. Too low (<10%) β proposals too aggressive, chain gets stuck. Optimal: 20-50% for random-walk, 65-80% for HMC.
7. MCMC vs Variational Inference
| Aspect | MCMC | Variational Inference |
|---|---|---|
| Asymptotics | Exact β converges to true posterior as N β β | Biased β limited by variational family even at convergence |
| Speed | Slow β may need thousands of samples | Fast β optimization converges in hundreds of iterations |
| Diagnostics | Well-established (RΜ, ESS, trace plots) | Harder β ELBO gap unknown without true posterior |
| Scalability | Challenging for large datasets | Excellent β stochastic VI with minibatches |
| Gradients | HMC needs gradients of log p | VI needs gradients of ELBO |
| Use case | Gold-standard inference, small-medium data | Large-scale deep learning, when speed matters more than exactness |
Worked Examples
Example 1: Metropolis-Hastings for Beta Posterior
Problem: Prior Beta(2,2), likelihood Binomial(10,7). Target: Beta(9,5). Implement MH with uniform proposal Q(z'|z) ~ Uniform(zβ0.2, z+0.2). For current state z=0.6, proposed z'=0.7, compute acceptance probability.
Solution:
p(z) β Beta(9,5): p(z) β z^8 Β· (1βz)^4
At z=0.6: p(z) β 0.6^8 Β· 0.4^4 = 0.0168 Β· 0.0256 = 4.30Γ10^{β4} At z'=0.7: p(z') β 0.7^8 Β· 0.3^4 = 0.0576 Β· 0.0081 = 4.67Γ10^{β4}
Proposal is symmetric: Q(z|z') = Q(z'|z) = 1/0.4 within the window.
Ξ± = min(1, 4.67Γ10^{β4} / 4.30Γ10^{β4}) = min(1, 1.086) = 1.0
Always accept β the proposed state has higher density.
If z'=0.45 instead: p(0.45) β 0.45^8 Β· 0.55^4 = 0.00168 Β· 0.0915 = 1.54Γ10^{β4}. Ξ± = 1.54/4.30 = 0.358. Accept ~36% of the time.
Example 2: Gibbs Sampling for Bivariate Gaussian
Problem: (zβ, zβ) ~ N([0,0], [[1,Ο],[Ο,1]]). Derive the Gibbs conditionals and sample 3 steps from starting point (2, β2) with Ο=0.8.
Solution:
Conditionals for bivariate Gaussian:
$zβ | zβ ~ N(ΟΒ·zβ, 1βΟΒ²) zβ | zβ ~ N(ΟΒ·zβ, 1βΟΒ²) $
Starting at (2, β2):
Step 1: zβ ~ N(0.8Β·(β2), 0.36) = N(β1.6, 0.36). Sample: β1.52. zβ ~ N(0.8Β·(β1.52), 0.36) = N(β1.22, 0.36). Sample: β1.08.
Step 2: zβ ~ N(0.8Β·(β1.08), 0.36) = N(β0.86, 0.36). Sample: β0.73. zβ ~ N(0.8Β·(β0.73), 0.36) = N(β0.58, 0.36). Sample: β0.41.
Step 3: zβ ~ N(0.8Β·(β0.41), 0.36) = N(β0.33, 0.36). Sample: 0.12. zβ ~ N(0.8Β·0.12, 0.36) = N(0.10, 0.36). Sample: 0.45.
The chain moves from (2, β2) toward the origin (the true mean), exploring along the correlation direction (Ο=0.8 means zβ and zβ are positively correlated in the target).
Example 3: Effective Sample Size
Problem: An MCMC chain of N=10,000 samples has autocorrelations Οβ=0.9, Οβ=0.81, Οβ =0.5, Οββ=0.2, Οββ=0.05, and negligible beyond. Approximate N_eff.
Solution:
$N_eff = N / (1 + 2 Ξ£_{k=1}^β Ο_k)
β 10000 / (1 + 2(0.9+0.81+0.73+0.66+0.5+0.42+0.35+0.28+0.23+0.2+0.15+0.11+0.08+0.06+0.05))
β 10000 / (1 + 2Β·5.53)
β 10000 / 12.06
β 829
$
Only ~8.3% of the samples are "effective." The high autocorrelation means you need ~12Γ more MCMC iterations than you'd naively think. This is why thinning (keeping every k-th sample) is common β it trades storage for correlation, though it doesn't actually improve the Monte Carlo estimate (discarding samples always loses information).
Quiz
Q1: What does the concept of Markov chain primarily refer to in this subject?
A) A historical anecdote about Markov chain B) A visual representation of Markov chain C) The definition and application of Markov chain D) A computational error related to Markov chain
Correct: C)
- If you chose A: This is incorrect. Markov chain is defined as: the definition and application of markov chain. The other options describe different aspects that are not the primary focus.
- If you chose B: This is incorrect. Markov chain is defined as: the definition and application of markov chain. The other options describe different aspects that are not the primary focus.
- If you chose C: Markov chain is defined as: the definition and application of markov chain. The other options describe different aspects that are not the primary focus. Correct!
- If you chose D: This is incorrect. Markov chain is defined as: the definition and application of markov chain. The other options describe different aspects that are not the primary focus.
Q2: What is the primary purpose of Asymptotics?
A) It replaces all other methods in this domain B) It is primarily a historical notation system C) It is used only in advanced research contexts D) It is used to asymptotics in mathematical analysis
Correct: D)
- If you chose A: This is incorrect. Asymptotics serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose B: This is incorrect. Asymptotics serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose C: This is incorrect. Asymptotics serves the purpose described in the correct answer. The other options misrepresent its role.
- If you chose D: Asymptotics serves the purpose described in the correct answer. The other options misrepresent its role. Correct!
Q3: Which statement about Speed is TRUE?
A) Speed is a fundamental concept covered in this subject B) Speed is an advanced topic beyond this subject's scope C) Speed is not related to this subject D) Speed is mentioned only as a historical footnote
Correct: A)
- If you chose A: Speed is a fundamental concept covered in this subject. This subject covers Speed as part of its core content. Correct!
- If you chose B: This is incorrect. Speed is a fundamental concept covered in this subject. This subject covers Speed as part of its core content.
- If you chose C: This is incorrect. Speed is a fundamental concept covered in this subject. This subject covers Speed as part of its core content.
- If you chose D: This is incorrect. Speed is a fundamental concept covered in this subject. This subject covers Speed as part of its core content.
Q4: Based on the worked examples in this subject, what is the correct result?
A) A different result from a common mistake B) An unrelated numerical value C) The log-density difference: D) The inverse of the correct answer
Correct: C)
- If you chose A: This is incorrect. The worked examples show that the result is The log-density difference:. The other options represent common errors.
- If you chose B: This is incorrect. The worked examples show that the result is The log-density difference:. The other options represent common errors.
- If you chose C: The worked examples show that the result is The log-density difference:. The other options represent common errors. Correct!
- If you chose D: This is incorrect. The worked examples show that the result is The log-density difference:. The other options represent common errors.
Q5: How are Speed and Diagnostics related?
A) Speed and Diagnostics are closely related concepts B) Speed and Diagnostics are completely unrelated topics C) Speed is the inverse of Diagnostics D) Speed is a special case of Diagnostics
Correct: A)
- If you chose A: Both Speed and Diagnostics are covered in this subject as interconnected topics. Correct!
- If you chose B: This is incorrect. Both Speed and Diagnostics are covered in this subject as interconnected topics.
- If you chose C: This is incorrect. Both Speed and Diagnostics are covered in this subject as interconnected topics.
- If you chose D: This is incorrect. Both Speed and Diagnostics are covered in this subject as interconnected topics.
Q6: What is a common pitfall when working with Scalability?
A) The main error with Scalability is using it when it is not needed B) A common mistake is confusing Scalability with a similar concept C) Scalability is always computed the same way in all contexts D) Scalability has no common misconceptions
Correct: B)
- If you chose A: This is incorrect. Students often confuse Scalability with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose B: Students often confuse Scalability with similar-sounding or related concepts. Pay attention to the precise definitions. Correct!
- If you chose C: This is incorrect. Students often confuse Scalability with similar-sounding or related concepts. Pay attention to the precise definitions.
- If you chose D: This is incorrect. Students often confuse Scalability with similar-sounding or related concepts. Pay attention to the precise definitions.
Q7: When should you apply Gradients?
A) Use Gradients only in pure mathematics contexts B) Apply Gradients to solve problems in this subject's domain C) Avoid Gradients unless explicitly instructed D) Gradients is not practically useful
Correct: B)
- If you chose A: This is incorrect. Gradients is a practical tool used throughout this subject to solve relevant problems.
- If you chose B: Gradients is a practical tool used throughout this subject to solve relevant problems. Correct!
- If you chose C: This is incorrect. Gradients is a practical tool used throughout this subject to solve relevant problems.
- If you chose D: This is incorrect. Gradients is a practical tool used throughout this subject to solve relevant problems.
Practice Problems
Problem 1
Prove that if a Markov chain satisfies detailed balance p(z)T(z'|z) = p(z')T(z|z'), then p is a stationary distribution.
Answer
Integrate the detailed balance equation over z:$β« p(z) T(z' | z) dz = β« p(z') T(z | z') dz
= p(z') β« T(z | z') dz
= p(z') Β· 1
= p(z')
$
The left side is the distribution after one transition from p:
(Tp)(z') = β« p(z) T(z' | z) dz = p(z')
So applying T leaves p unchanged β p is stationary. β
Detailed balance is a sufficient (not necessary) condition for stationarity. It's stronger than needed but easier to verify and sufficient to ensure the chain converges to p.
Problem 2
Explain why random-walk Metropolis scales poorly with dimension. For a d-dimensional Gaussian target N(0,I) with proposal N(z, ΟΒ²I), what's the optimal ΟΒ² and what's the acceptance rate?
Answer
For a d-dimensional spherical Gaussian target with proposal N(z, ΟΒ²I): The log-density difference:$log p(z') β log p(z) = βΒ½(||z'||Β² β ||z||Β²) $For the proposal, ||z' β z||Β² ~ ΟΒ²Β·ΟΒ²_d. To maintain reasonable acceptance, ΟΒ² must scale as O(1/d). If ΟΒ² is too large, ||z'||Β² β ||z||Β² becomes very negative and proposals are rejected. So step size shrinks as 1/d. **Optimal ΟΒ² β 2.38Β²/d** and acceptance rate β 0.234 for d β β (Roberts et al., 1997). The O(1/d) step size means the chain needs O(d) steps to traverse the distribution in any direction β "random walk" indeed. HMC overcomes this by using gradient information to make coherent proposals.
Problem 3
Derive why Gibbs sampling has acceptance probability 1 in the MH framework.
Answer
Gibbs proposes z_j' ~ p(z_j' | z_{βj}) and leaves z_{βj} unchanged. So: Proposal: Q(z' | z) = p(z_j' | z_{βj}) MH acceptance ratio:$Ξ± = min(1, p(z')Β·Q(z | z') / (p(z)Β·Q(z' | z))) $Now: - p(z) = p(z_j, z_{βj}) = p(z_j | z_{βj}) Β· p(z_{βj}) - p(z') = p(z_j', z_{βj}) = p(z_j' | z_{βj}) Β· p(z_{βj}) [same z_{βj}!] - Q(z' | z) = p(z_j' | z_{βj}) [proposed new value given the rest] - Q(z | z') = p(z_j | z_{βj}) [reverse proposal from z' back to z] Substituting:
$Ξ± = min(1, [p(z_j'|z_{βj})Β·p(z_{βj}) Β· p(z_j|z_{βj})] / [p(z_j|z_{βj})Β·p(z_{βj}) Β· p(z_j'|z_{βj})])
= min(1, 1)
= 1
$
Everything cancels. Gibbs always accepts because the proposal is already distributed according to the conditional of the target. β
Problem 4
In HMC, if the leapfrog integrator were perfectly energy-preserving, what would the acceptance probability be? Why do we need the accept/reject step in practice?
Answer
If H(z', r') = H(z, r) (perfect energy conservation), then Ξ± = min(1, e^0) = 1. Every proposal would be accepted. **Why we still need accept/reject:** The leapfrog integrator introduces O(Ρ³) error per step and O(Ρ²) error overall. Over L steps, energy is not perfectly conserved β typically H(z',r') β H(z,r). The accept/reject step CORRECTS for this discretization error, ensuring the chain's stationary distribution remains exactly the target. Without the accept/reject step, the chain would sample from a perturbed distribution that differs from the target by O(Ρ²). This is why HMC is "exact" despite using approximate integration β the MH correction guarantees correctness.Problem 5
You run 4 MCMC chains for a parameter ΞΌ. The within-chain variance is W=2.3 and between-chain variance is B=8.7, with N=1000 samples per chain. Compute RΜ. Has the chain converged?
Answer
$VarΜ^+ = (Nβ1)/N Β· W + B/N = (999/1000)Β·2.3 + 8.7/1000 = 2.2977 + 0.0087 = 2.3064 RΜ = β(VarΜ^+ / W) = β(2.3064 / 2.3) = β1.0028 β 1.001 $RΜ β 1.001 < 1.1, suggesting good convergence. The between-chain variance B is large (8.7), but divided by N=1000 it contributes negligibly to the pooled variance estimate. The chains have explored the same region β the large B just means the chain means are somewhat spread out, but with N=1000 samples the estimate of each chain's mean is precise enough that we can still conclude convergence. **Warning:** RΜ β 1 doesn't guarantee convergence β it only tests whether chains have reached the SAME distribution. If all chains are stuck in the same local mode, RΜ β 1 but convergence hasn't been achieved. Always complement RΜ with trace plots and ESS.
Summary
- MCMC constructs a Markov chain whose stationary distribution is the target posterior β simulating the chain produces asymptotically exact samples
- Metropolis-Hastings proposes states from Q and accepts/rejects to satisfy detailed balance β the acceptance ratio corrects for proposal asymmetry
- Gibbs sampling is MH with acceptance probability 1 β sample from full conditionals when available
- Hamiltonian Monte Carlo uses gradient information and Hamiltonian dynamics to propose distant states with high acceptance β suppressing O(d) random-walk behavior to O(d^{1/4})
- Convergence diagnostics (trace plots, autocorrelation, ESS, RΜ) are essential β MCMC guarantees are asymptotic, not per-sample
Pitfalls
- Treating correlated MCMC samples as independent. MCMC produces autocorrelated samples β consecutive draws are not independent. Computing standard errors or confidence intervals as if N_corr = N_total can dramatically overstate precision. Always compute the effective sample size (ESS) and use it for uncertainty quantification. An ESS of 200 from 10,000 samples means you effectively have only 200 independent data points.
- Running a single chain and trusting it. A single chain can get stuck in a local mode and appear well-mixed (good trace plot, low autocorrelation) while entirely missing other important regions of the posterior. Always run at least 3-4 chains from overdispersed starting points and use RΜ to check that they converge to the same distribution. A single chain is never sufficient for reliable inference.
- Misinterpreting RΜ β 1 as proof of convergence. RΜ β 1 means the chains have converged to the SAME distribution β but if all chains are stuck in the same local mode, RΜ β 1 while the chains haven't explored the full posterior. RΜ is a necessary condition for convergence, not a sufficient one. Always complement RΜ with trace plots, effective sample size, and visual inspection of the marginal posterior distributions.
- Using random-walk Metropolis in high dimensions. Random-walk MH requires step sizes that scale as O(1/d), needing O(d) steps to traverse the distribution. In a 100-dimensional problem, this means the chain barely moves. Use HMC or its variants (NUTS, which adapts the path length automatically) for problems with more than 10-20 parameters. Random-walk MH is only practical for very low-dimensional problems or as a pedagogical tool.
- Discarding burn-in samples but not accounting for autocorrelation in the remaining samples. Burn-in removal handles initialization bias, but the post-burn-in samples are still correlated. The common practice of "keep every k-th sample" (thinning) reduces storage but does NOT improve the Monte Carlo estimate β it discards information. Use ALL post-burn-in samples for estimation, but compute standard errors using the effective sample size, not the nominal sample size.
Key Terms
| Term | Definition |
|---|---|
| Markov chain | A sequence where each state depends only on the previous state through transition kernel T(z' |
| Stationary distribution | Ο(z') = β« T(z' |
| Detailed balance | p(z)T(z' |
| Metropolis-Hastings | Proposal + accept/reject with Ξ± = min(1, p(z')Q(z |
| Gibbs sampling | Sample each variable from its full conditional p(z_j |
| HMC | Hamiltonian Monte Carlo β augments with momentum, simulates physics, uses gradients |
| Leapfrog integrator | Symplectic integrator for HMC β volume-preserving and reversible, O(Ρ³) per-step error |
| Effective sample size | N_eff = N/(1+2Ξ£Ο_k) β corrects for autocorrelation; fewer effective than nominal samples |
| RΜ (Gelman-Rubin) | β(VarΜ^+/W) β ratio of pooled to within-chain variance; near 1 suggests convergence |
Next Steps
Continue to 21-04 β EM Algorithm to learn how to perform maximum likelihood estimation with latent variables β a deterministic alternative that iteratively refines parameter estimates.