Math graphic
πŸ“ Concept diagram

21-03 β€” Markov Chain Monte Carlo (MCMC)

Phase: 21 β€” Probability & Statistics for ML (Advanced) Subject: 21-03 Prerequisites: 21-01 (Bayesian Inference), 10-06 (Expectation and Variance), 11-01 (Markov Chains β€” transition matrices, stationary distributions), 11-02 (Markov Processes β€” continuous-time), 03-06 (Sequences and Series β€” convergence) Next subject: 21-04 β€” EM Algorithm


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the Metropolis-Hastings algorithm from the detailed balance condition and prove it converges to the target distribution
  2. Implement and analyze Gibbs sampling β€” derive the conditional distributions and prove it is a special case of Metropolis-Hastings with acceptance probability 1
  3. Explain why MCMC is asymptotically exact (unlike variational inference) and derive the Monte Carlo error as O(1/√N_eff) where N_eff accounts for autocorrelation
  4. Diagnose MCMC convergence using trace plots, autocorrelation, effective sample size, and the Gelman-Rubin RΜ‚ statistic
  5. Apply Hamiltonian Monte Carlo (HMC) intuition β€” explain how gradient information and Hamiltonian dynamics suppress random-walk behavior

Core Content

1. The Problem: Sampling from Intractable Distributions

Bayesian inference requires computing expectations under the posterior:

$E_{p(z|x)}[f(z)] = ∫ f(z) p(z|x) dz
$

But p(z|x) = p(x|z)p(z)/p(x) is only known up to a constant β€” the evidence p(x) is intractable. MCMC solves this by constructing a Markov chain whose stationary distribution IS the target distribution p(z|x). Then we simulate the chain and use the samples to approximate expectations.

⚠️ THIS IS CRITICAL β€” MCMC is asymptotically EXACT. Unlike variational inference (which minimizes KL to a simpler family), MCMC converges to the true posterior as the number of samples β†’ ∞. The tradeoff is computational cost.


2. Markov Chains and Stationary Distributions

A Markov chain is a sequence z₁, zβ‚‚, ... where z_{t+1} depends only on z_t, through a transition kernel T(z' | z):

$P(z_{t+1} = z' | z_t = z) = T(z' | z)
$

A distribution Ο€ is stationary if applying the transition leaves it unchanged:

$Ο€(z') = ∫ T(z' | z) Ο€(z) dz
$

Ergodic theorem: For an irreducible, aperiodic Markov chain with stationary distribution Ο€, the time average converges to the expectation under Ο€:

(1/N) Ξ£_{t=1}^N f(z_t) β†’ E_Ο€[f(z)]   as N β†’ ∞

MCMC constructs T such that Ο€ = p(z|x) (our target posterior).


3. Metropolis-Hastings (MH) Algorithm

The most general MCMC algorithm. Given current state z, propose z' ~ Q(z' | z), then accept with probability:

$Ξ±(z β†’ z') = min(1, [p(z') Β· Q(z | z')] / [p(z) Β· Q(z' | z)])
$

where p(z) is the unnormalized target density (p(z) ∝ p(z|x)).

Algorithm:

$Initialize z_0
For t = 1 to N:
    Propose z' ~ Q(z' | z_{tβˆ’1})
    Compute Ξ± = min(1, p(z')Q(z_{tβˆ’1}|z') / (p(z_{tβˆ’1})Q(z'|z_{tβˆ’1})))
    With probability Ξ±: z_t = z'  (accept)
    Otherwise: z_t = z_{tβˆ’1}      (reject)
$

Proof of correctness (detailed balance):

The MH kernel satisfies detailed balance with respect to p:

$p(z) Β· T(z' | z) = p(z) Β· Q(z' | z) Β· Ξ±(z β†’ z')
                 = p(z) Β· Q(z' | z) Β· min(1, p(z')Q(z|z')/(p(z)Q(z'|z)))
                 = min(p(z)Q(z'|z), p(z')Q(z|z'))

Similarly: p(z') Β· T(z | z') = min(p(z')Q(z|z'), p(z)Q(z'|z))
                             = p(z) Β· T(z' | z)  βœ“
$

Detailed balance implies p is stationary. The acceptance ratio is the heart of MH β€” it corrects for the asymmetry in the proposal distribution.

Special cases: - Random-walk MH: Q(z' | z) = N(z' | z, Ξ£) β€” symmetric proposal. Acceptance simplifies to Ξ± = min(1, p(z')/p(z)). - Independence sampler: Q(z' | z) = Q(z') β€” proposal independent of current state.


4. Gibbs Sampling

When the conditional distributions p(z_j | z_{βˆ’j}) are tractable, Gibbs sampling provides an acceptance probability of 1:

$For t = 1 to N:
    z₁^{(t)} ~ p(z₁ | zβ‚‚^{(tβˆ’1)}, z₃^{(tβˆ’1)}, ..., z_d^{(tβˆ’1)})
    zβ‚‚^{(t)} ~ p(zβ‚‚ | z₁^{(t)}, z₃^{(tβˆ’1)}, ..., z_d^{(tβˆ’1)})
    ...
    z_d^{(t)} ~ p(z_d | z₁^{(t)}, zβ‚‚^{(t)}, ..., z_{dβˆ’1}^{(t)})
$

Each update samples from a FULL conditional β€” no accept/reject step needed.

Why Ξ± = 1: Gibbs is MH with proposal Q(z_j' | z) = p(z_j' | z_{βˆ’j}). Then:

$Ξ± = min(1, p(z')p(z_j|z_{βˆ’j}) / (p(z)p(z_j'|z_{βˆ’j})))
  = min(1, p(z')p(z_j|z_{βˆ’j}) / [p(z_{βˆ’j})p(z_j|z_{βˆ’j}) Β· p(z_j'|z_{βˆ’j})])
$

Since z' and z share the same z_{βˆ’j}, p(z') = p(z_{βˆ’j})p(z_j'|z_{βˆ’j}), and the ratio simplifies to 1.


5. Hamiltonian Monte Carlo (HMC)

Random-walk MH suffers from slow exploration in high dimensions β€” the acceptance rate drops as dimension grows. HMC uses gradient information to propose distant states with high acceptance probability.

Key idea: Augment the parameter space with "momentum" variables r and simulate Hamiltonian dynamics:

$H(z, r) = βˆ’log p(z) + Β½ r^T M^{βˆ’1} r
        = U(z) + K(r)
$

where U(z) = βˆ’log p(z) is the "potential energy" and K(r) = Β½r^T M^{βˆ’1}r is the "kinetic energy."

Hamilton's equations:

$dz/dt = βˆ‚H/βˆ‚r = M^{βˆ’1}r
dr/dt = βˆ’βˆ‚H/βˆ‚z = βˆ‡_z log p(z)
$

HMC algorithm:

Sample momentum r ~ N(0, M)
Simulate Hamiltonian dynamics for L steps with step size Ξ΅
    (using leapfrog integrator for volume-preservation and reversibility)
Proposed state: (z', r') after L steps
Accept with probability: min(1, exp(H(z,r) βˆ’ H(z',r')))

Why it works better: The gradient βˆ‡ log p(z) guides proposals TOWARD high-probability regions. The Hamiltonian dynamics preserve the joint distribution p(z)Β·N(r|0,M), so proposals remain in high-probability regions. Acceptance rates stay near 1 even in high dimensions because energy is approximately conserved by the symplectic integrator.

HMC suppresses the random-walk behavior that plagues MH in high dimensions β€” the number of steps to traverse the distribution scales as O(d^{1/4}) for HMC vs O(d) for random-walk MH.


6. MCMC Diagnostics

Trace plots: Plot z_t vs t. Look for stationarity (no drift) and good mixing (chain explores the full range). A "hairy caterpillar" look is ideal.

Autocorrelation: ρ_k = Corr(z_t, z_{t+k}). Effective sample size:

$N_eff = N / (1 + 2 Σ_{k=1}^∞ ρ_k)
$

Chains with high autocorrelation provide fewer independent samples.

Gelman-Rubin RΜ‚: Run M chains from overdispersed starting points. Compare within-chain variance W to between-chain variance B:

$RΜ‚ = √((Nβˆ’1)/N Β· W + B/N) / W)
$

RΜ‚ > 1.1 suggests the chains haven't converged. RΜ‚ β‰ˆ 1 indicates convergence.

Acceptance rate for MH: Too high (>90%) β€” proposals too conservative, slow mixing. Too low (<10%) β€” proposals too aggressive, chain gets stuck. Optimal: 20-50% for random-walk, 65-80% for HMC.


7. MCMC vs Variational Inference

Aspect MCMC Variational Inference
Asymptotics Exact β€” converges to true posterior as N β†’ ∞ Biased β€” limited by variational family even at convergence
Speed Slow β€” may need thousands of samples Fast β€” optimization converges in hundreds of iterations
Diagnostics Well-established (RΜ‚, ESS, trace plots) Harder β€” ELBO gap unknown without true posterior
Scalability Challenging for large datasets Excellent β€” stochastic VI with minibatches
Gradients HMC needs gradients of log p VI needs gradients of ELBO
Use case Gold-standard inference, small-medium data Large-scale deep learning, when speed matters more than exactness

Worked Examples

Example 1: Metropolis-Hastings for Beta Posterior

Problem: Prior Beta(2,2), likelihood Binomial(10,7). Target: Beta(9,5). Implement MH with uniform proposal Q(z'|z) ~ Uniform(zβˆ’0.2, z+0.2). For current state z=0.6, proposed z'=0.7, compute acceptance probability.

Solution:

p(z) ∝ Beta(9,5): p(z) ∝ z^8 Β· (1βˆ’z)^4

At z=0.6: p(z) ∝ 0.6^8 Β· 0.4^4 = 0.0168 Β· 0.0256 = 4.30Γ—10^{βˆ’4} At z'=0.7: p(z') ∝ 0.7^8 Β· 0.3^4 = 0.0576 Β· 0.0081 = 4.67Γ—10^{βˆ’4}

Proposal is symmetric: Q(z|z') = Q(z'|z) = 1/0.4 within the window.

Ξ± = min(1, 4.67Γ—10^{βˆ’4} / 4.30Γ—10^{βˆ’4}) = min(1, 1.086) = 1.0

Always accept β€” the proposed state has higher density.

If z'=0.45 instead: p(0.45) ∝ 0.45^8 Β· 0.55^4 = 0.00168 Β· 0.0915 = 1.54Γ—10^{βˆ’4}. Ξ± = 1.54/4.30 = 0.358. Accept ~36% of the time.


Example 2: Gibbs Sampling for Bivariate Gaussian

Problem: (z₁, zβ‚‚) ~ N([0,0], [[1,ρ],[ρ,1]]). Derive the Gibbs conditionals and sample 3 steps from starting point (2, βˆ’2) with ρ=0.8.

Solution:

Conditionals for bivariate Gaussian:

$z₁ | zβ‚‚ ~ N(ρ·zβ‚‚, 1βˆ’ΟΒ²)
zβ‚‚ | z₁ ~ N(ρ·z₁, 1βˆ’ΟΒ²)
$

Starting at (2, βˆ’2):

Step 1: z₁ ~ N(0.8Β·(βˆ’2), 0.36) = N(βˆ’1.6, 0.36). Sample: βˆ’1.52. zβ‚‚ ~ N(0.8Β·(βˆ’1.52), 0.36) = N(βˆ’1.22, 0.36). Sample: βˆ’1.08.

Step 2: z₁ ~ N(0.8Β·(βˆ’1.08), 0.36) = N(βˆ’0.86, 0.36). Sample: βˆ’0.73. zβ‚‚ ~ N(0.8Β·(βˆ’0.73), 0.36) = N(βˆ’0.58, 0.36). Sample: βˆ’0.41.

Step 3: z₁ ~ N(0.8Β·(βˆ’0.41), 0.36) = N(βˆ’0.33, 0.36). Sample: 0.12. zβ‚‚ ~ N(0.8Β·0.12, 0.36) = N(0.10, 0.36). Sample: 0.45.

The chain moves from (2, βˆ’2) toward the origin (the true mean), exploring along the correlation direction (ρ=0.8 means z₁ and zβ‚‚ are positively correlated in the target).


Example 3: Effective Sample Size

Problem: An MCMC chain of N=10,000 samples has autocorrelations ρ₁=0.9, ρ₂=0.81, ρ₅=0.5, ρ₁₀=0.2, ρ₂₀=0.05, and negligible beyond. Approximate N_eff.

Solution:

$N_eff = N / (1 + 2 Σ_{k=1}^∞ ρ_k)
      β‰ˆ 10000 / (1 + 2(0.9+0.81+0.73+0.66+0.5+0.42+0.35+0.28+0.23+0.2+0.15+0.11+0.08+0.06+0.05))
      β‰ˆ 10000 / (1 + 2Β·5.53)
      β‰ˆ 10000 / 12.06
      β‰ˆ 829
$

Only ~8.3% of the samples are "effective." The high autocorrelation means you need ~12Γ— more MCMC iterations than you'd naively think. This is why thinning (keeping every k-th sample) is common β€” it trades storage for correlation, though it doesn't actually improve the Monte Carlo estimate (discarding samples always loses information).



Quiz

Q1: What does the concept of Markov chain primarily refer to in this subject?

A) A historical anecdote about Markov chain B) A visual representation of Markov chain C) The definition and application of Markov chain D) A computational error related to Markov chain

Correct: C)

Q2: What is the primary purpose of Asymptotics?

A) It replaces all other methods in this domain B) It is primarily a historical notation system C) It is used only in advanced research contexts D) It is used to asymptotics in mathematical analysis

Correct: D)

Q3: Which statement about Speed is TRUE?

A) Speed is a fundamental concept covered in this subject B) Speed is an advanced topic beyond this subject's scope C) Speed is not related to this subject D) Speed is mentioned only as a historical footnote

Correct: A)

Q4: Based on the worked examples in this subject, what is the correct result?

A) A different result from a common mistake B) An unrelated numerical value C) The log-density difference: D) The inverse of the correct answer

Correct: C)

Q5: How are Speed and Diagnostics related?

A) Speed and Diagnostics are closely related concepts B) Speed and Diagnostics are completely unrelated topics C) Speed is the inverse of Diagnostics D) Speed is a special case of Diagnostics

Correct: A)

Q6: What is a common pitfall when working with Scalability?

A) The main error with Scalability is using it when it is not needed B) A common mistake is confusing Scalability with a similar concept C) Scalability is always computed the same way in all contexts D) Scalability has no common misconceptions

Correct: B)

Q7: When should you apply Gradients?

A) Use Gradients only in pure mathematics contexts B) Apply Gradients to solve problems in this subject's domain C) Avoid Gradients unless explicitly instructed D) Gradients is not practically useful

Correct: B)

Practice Problems

Problem 1

Prove that if a Markov chain satisfies detailed balance p(z)T(z'|z) = p(z')T(z|z'), then p is a stationary distribution.

Answer Integrate the detailed balance equation over z:
$∫ p(z) T(z' | z) dz = ∫ p(z') T(z | z') dz
                    = p(z') ∫ T(z | z') dz
                    = p(z') Β· 1
                    = p(z')
$
The left side is the distribution after one transition from p: (Tp)(z') = ∫ p(z) T(z' | z) dz = p(z') So applying T leaves p unchanged β€” p is stationary. βœ“ Detailed balance is a sufficient (not necessary) condition for stationarity. It's stronger than needed but easier to verify and sufficient to ensure the chain converges to p.

Problem 2

Explain why random-walk Metropolis scales poorly with dimension. For a d-dimensional Gaussian target N(0,I) with proposal N(z, σ²I), what's the optimal σ² and what's the acceptance rate?

Answer For a d-dimensional spherical Gaussian target with proposal N(z, σ²I): The log-density difference:
$log p(z') βˆ’ log p(z) = βˆ’Β½(||z'||Β² βˆ’ ||z||Β²)
$
For the proposal, ||z' βˆ’ z||Β² ~ σ²·χ²_d. To maintain reasonable acceptance, σ² must scale as O(1/d). If σ² is too large, ||z'||Β² βˆ’ ||z||Β² becomes very negative and proposals are rejected. So step size shrinks as 1/d. **Optimal σ² β‰ˆ 2.38Β²/d** and acceptance rate β‰ˆ 0.234 for d β†’ ∞ (Roberts et al., 1997). The O(1/d) step size means the chain needs O(d) steps to traverse the distribution in any direction β€” "random walk" indeed. HMC overcomes this by using gradient information to make coherent proposals.

Problem 3

Derive why Gibbs sampling has acceptance probability 1 in the MH framework.

Answer Gibbs proposes z_j' ~ p(z_j' | z_{βˆ’j}) and leaves z_{βˆ’j} unchanged. So: Proposal: Q(z' | z) = p(z_j' | z_{βˆ’j}) MH acceptance ratio:
$Ξ± = min(1, p(z')Β·Q(z | z') / (p(z)Β·Q(z' | z)))
$
Now: - p(z) = p(z_j, z_{βˆ’j}) = p(z_j | z_{βˆ’j}) Β· p(z_{βˆ’j}) - p(z') = p(z_j', z_{βˆ’j}) = p(z_j' | z_{βˆ’j}) Β· p(z_{βˆ’j}) [same z_{βˆ’j}!] - Q(z' | z) = p(z_j' | z_{βˆ’j}) [proposed new value given the rest] - Q(z | z') = p(z_j | z_{βˆ’j}) [reverse proposal from z' back to z] Substituting:
$Ξ± = min(1, [p(z_j'|z_{βˆ’j})Β·p(z_{βˆ’j}) Β· p(z_j|z_{βˆ’j})] / [p(z_j|z_{βˆ’j})Β·p(z_{βˆ’j}) Β· p(z_j'|z_{βˆ’j})])
  = min(1, 1)
  = 1
$
Everything cancels. Gibbs always accepts because the proposal is already distributed according to the conditional of the target. βœ“

Problem 4

In HMC, if the leapfrog integrator were perfectly energy-preserving, what would the acceptance probability be? Why do we need the accept/reject step in practice?

Answer If H(z', r') = H(z, r) (perfect energy conservation), then Ξ± = min(1, e^0) = 1. Every proposal would be accepted. **Why we still need accept/reject:** The leapfrog integrator introduces O(Ρ³) error per step and O(Ρ²) error overall. Over L steps, energy is not perfectly conserved β€” typically H(z',r') β‰  H(z,r). The accept/reject step CORRECTS for this discretization error, ensuring the chain's stationary distribution remains exactly the target. Without the accept/reject step, the chain would sample from a perturbed distribution that differs from the target by O(Ρ²). This is why HMC is "exact" despite using approximate integration β€” the MH correction guarantees correctness.

Problem 5

You run 4 MCMC chains for a parameter ΞΌ. The within-chain variance is W=2.3 and between-chain variance is B=8.7, with N=1000 samples per chain. Compute RΜ‚. Has the chain converged?

Answer
$VarΜ‚^+ = (Nβˆ’1)/N Β· W + B/N = (999/1000)Β·2.3 + 8.7/1000 = 2.2977 + 0.0087 = 2.3064

RΜ‚ = √(VarΜ‚^+ / W) = √(2.3064 / 2.3) = √1.0028 β‰ˆ 1.001
$
RΜ‚ β‰ˆ 1.001 < 1.1, suggesting good convergence. The between-chain variance B is large (8.7), but divided by N=1000 it contributes negligibly to the pooled variance estimate. The chains have explored the same region β€” the large B just means the chain means are somewhat spread out, but with N=1000 samples the estimate of each chain's mean is precise enough that we can still conclude convergence. **Warning:** RΜ‚ β‰ˆ 1 doesn't guarantee convergence β€” it only tests whether chains have reached the SAME distribution. If all chains are stuck in the same local mode, RΜ‚ β‰ˆ 1 but convergence hasn't been achieved. Always complement RΜ‚ with trace plots and ESS.

Summary

  1. MCMC constructs a Markov chain whose stationary distribution is the target posterior β€” simulating the chain produces asymptotically exact samples
  2. Metropolis-Hastings proposes states from Q and accepts/rejects to satisfy detailed balance β€” the acceptance ratio corrects for proposal asymmetry
  3. Gibbs sampling is MH with acceptance probability 1 β€” sample from full conditionals when available
  4. Hamiltonian Monte Carlo uses gradient information and Hamiltonian dynamics to propose distant states with high acceptance β€” suppressing O(d) random-walk behavior to O(d^{1/4})
  5. Convergence diagnostics (trace plots, autocorrelation, ESS, RΜ‚) are essential β€” MCMC guarantees are asymptotic, not per-sample

Pitfalls


Key Terms

Term Definition
Markov chain A sequence where each state depends only on the previous state through transition kernel T(z'
Stationary distribution Ο€(z') = ∫ T(z'
Detailed balance p(z)T(z'
Metropolis-Hastings Proposal + accept/reject with Ξ± = min(1, p(z')Q(z
Gibbs sampling Sample each variable from its full conditional p(z_j
HMC Hamiltonian Monte Carlo β€” augments with momentum, simulates physics, uses gradients
Leapfrog integrator Symplectic integrator for HMC β€” volume-preserving and reversible, O(Ρ³) per-step error
Effective sample size N_eff = N/(1+2Σρ_k) β€” corrects for autocorrelation; fewer effective than nominal samples
RΜ‚ (Gelman-Rubin) √(VarΜ‚^+/W) β€” ratio of pooled to within-chain variance; near 1 suggests convergence

Next Steps

Continue to 21-04 β€” EM Algorithm to learn how to perform maximum likelihood estimation with latent variables β€” a deterministic alternative that iteratively refines parameter estimates.