21-03 — Markov Chain Monte Carlo (MCMC)

Phase: 21 — Probability & Statistics for ML (Advanced)
Subject: 21-03
Prerequisites: 21-01 (Bayesian Inference), 10-06 (Expectation and Variance), 11-01 (Markov Chains: transition matrices, stationary distributions), 11-02 (Markov Processes: continuous-time), 03-06 (Sequences and Series: convergence)
Next subject: 21-04 — EM Algorithm

Learning Objectives

By the end of this subject, you will be able to:

Derive the Metropolis-Hastings algorithm from the detailed balance condition and prove it converges to the target distribution
Implement and analyze Gibbs sampling — derive the conditional distributions and prove it is a special case of Metropolis-Hastings with acceptance probability 1
Explain why MCMC is asymptotically exact (unlike variational inference) and derive the Monte Carlo error as O(1/√N_eff) where N_eff accounts for autocorrelation
Diagnose MCMC convergence using trace plots, autocorrelation, effective sample size, and the Gelman-Rubin R̂ statistic
Apply Hamiltonian Monte Carlo (HMC) intuition — explain how gradient information and Hamiltonian dynamics suppress random-walk behavior

Core Content

1. The Problem: Sampling from Intractable Distributions

Bayesian inference requires computing expectations under the posterior:

$E_{p(z|x)}[f(z)] = ∫ f(z) p(z|x) dz
$

But p(z|x) = p(x|z)p(z)/p(x) is only known up to a constant — the evidence p(x) is intractable. MCMC solves this by constructing a Markov chain whose stationary distribution IS the target distribution p(z|x). Then we simulate the chain and use the samples to approximate expectations.

⚠️ THIS IS CRITICAL — MCMC is asymptotically EXACT. Unlike variational inference (which minimizes KL to a simpler family), MCMC converges to the true posterior as the number of samples → ∞. The tradeoff is computational cost.

2. Markov Chains and Stationary Distributions

A Markov chain is a sequence z₁, z₂, ... where z_{t+1} depends only on z_t, through a transition kernel T(z' | z):

$P(z_{t+1} = z' | z_t = z) = T(z' | z)
$

A distribution π is stationary if applying the transition leaves it unchanged:

$π(z') = ∫ T(z' | z) π(z) dz
$

Ergodic theorem: For an irreducible, aperiodic Markov chain with stationary distribution π, the time average converges to the expectation under π:

(1/N) Σ_{t=1}^N f(z_t) → E_π[f(z)]   as N → ∞

MCMC constructs T such that π = p(z|x) (our target posterior).
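
To make the stationarity condition and the ergodic theorem concrete, here is a minimal sketch on a two-state chain. The transition matrix `T` and the function names are assumptions chosen for illustration; they are not part of the subject material. Repeatedly applying `T` to any starting distribution settles on π, and the time average of f(z) = z along one simulated path approaches E_π[f].

```python
import random

# Hypothetical two-state chain; T[i][j] = P(next = j | current = i).
T = [[0.9, 0.1],
     [0.3, 0.7]]

def step_distribution(pi, T):
    """Apply one transition: pi'(j) = sum_i pi(i) * T[i][j]."""
    n = len(T)
    return [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]

def stationary_by_iteration(T, iters=200):
    """Apply T repeatedly to a uniform start until it stops changing."""
    pi = [1.0 / len(T)] * len(T)
    for _ in range(iters):
        pi = step_distribution(pi, T)
    return pi

def time_average(T, f, n_steps, seed=0):
    """Simulate the chain and return (1/N) * sum_t f(z_t)."""
    rng = random.Random(seed)
    z, total = 0, 0.0
    for _ in range(n_steps):
        z = 0 if rng.random() < T[z][0] else 1
        total += f(z)
    return total / n_steps

pi = stationary_by_iteration(T)
print(pi)                                          # close to [0.75, 0.25]
print(time_average(T, lambda z: float(z), 100_000))  # close to pi[1]
```

Solving π = πT by hand for this matrix gives π = (0.75, 0.25), which is what both computations approach.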

3. Metropolis-Hastings (MH) Algorithm

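Metropolis-Hastings is the general template for MCMC: random-walk Metropolis, the independence sampler, and Gibbs sampling are all special cases of it. Given current state z, propose z' ~ Q(z' | z), then accept with probability: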

$α(z → z') = min(1, [p(z') · Q(z | z')] / [p(z) · Q(z' | z)])
$

where p(z) is the unnormalized target density (p(z) ∝ p(z|x)).

Algorithm:

$Initialize z_0
For t = 1 to N:
    Propose z' ~ Q(z' | z_{t−1})
    Compute α = min(1, p(z')Q(z_{t−1}|z') / (p(z_{t−1})Q(z'|z_{t−1})))
    With probability α: z_t = z'  (accept)
    Otherwise: z_t = z_{t−1}      (reject)
$

Proof of correctness (detailed balance):

The MH kernel satisfies detailed balance with respect to p:

$p(z) · T(z' | z) = p(z) · Q(z' | z) · α(z → z')
                 = p(z) · Q(z' | z) · min(1, p(z')Q(z|z')/(p(z)Q(z'|z)))
                 = min(p(z)Q(z'|z), p(z')Q(z|z'))

Similarly: p(z') · T(z | z') = min(p(z')Q(z|z'), p(z)Q(z'|z))
                             = p(z) · T(z' | z)  ✓
$

Detailed balance implies p is stationary. The acceptance ratio is the heart of MH — it corrects for the asymmetry in the proposal distribution.

Special cases:
Random-walk MH: Q(z' | z) = N(z' | z, Σ), a symmetric proposal. The acceptance probability simplifies to α = min(1, p(z')/p(z)).
Independence sampler: Q(z' | z) = Q(z'), a proposal that is independent of the current state.
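
The random-walk special case can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not a reference one: the target is the unnormalized Beta(9,5) density used later in Example 1, the proposal is a symmetric uniform window, and the names `log_target` and `random_walk_mh` are hypothetical. Working in log space avoids underflow, and the Q terms cancel because the proposal is symmetric.

```python
import math
import random

def log_target(z):
    """Unnormalized log density of Beta(9, 5): 8 log z + 4 log(1 - z)."""
    if z <= 0.0 or z >= 1.0:
        return -math.inf          # zero density outside (0, 1)
    return 8.0 * math.log(z) + 4.0 * math.log(1.0 - z)

def random_walk_mh(log_p, z0, n_samples, width=0.2, seed=0):
    """Random-walk MH with proposal Uniform(z - width, z + width).

    z0 must lie inside the support so that log_p(z0) is finite.
    Returns the list of states and the observed acceptance rate.
    """
    rng = random.Random(seed)
    z, lp = z0, log_p(z0)
    samples, accepted = [], 0
    for _ in range(n_samples):
        z_prop = z + rng.uniform(-width, width)      # symmetric proposal
        lp_prop = log_p(z_prop)
        # alpha = min(1, p(z') / p(z)); the Q terms cancel by symmetry
        alpha = math.exp(min(0.0, lp_prop - lp))
        if rng.random() < alpha:
            z, lp = z_prop, lp_prop                  # accept
            accepted += 1
        samples.append(z)                            # on reject, z repeats
    return samples, accepted / n_samples

samples, rate = random_walk_mh(log_target, z0=0.5, n_samples=20_000)
kept = samples[1000:]                                # drop burn-in
print(sum(kept) / len(kept), rate)                   # mean should be near 9/14
```

Note that a rejected proposal still appends the current state to the chain; dropping rejected steps would bias the samples.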

4. Gibbs Sampling

When the full conditional distributions p(z_j | z_{−j}) are tractable, Gibbs sampling updates one coordinate at a time by sampling from them directly, and every such update is accepted with probability 1:

$For t = 1 to N:
    z₁^{(t)} ~ p(z₁ | z₂^{(t−1)}, z₃^{(t−1)}, ..., z_d^{(t−1)})
    z₂^{(t)} ~ p(z₂ | z₁^{(t)}, z₃^{(t−1)}, ..., z_d^{(t−1)})
    ...
    z_d^{(t)} ~ p(z_d | z₁^{(t)}, z₂^{(t)}, ..., z_{d−1}^{(t)})
$

Each update samples from a FULL conditional — no accept/reject step needed.

Why α = 1: Gibbs is MH with proposal Q(z_j' | z) = p(z_j' | z_{−j}). Then:

$α = min(1, p(z')p(z_j|z_{−j}) / (p(z)p(z_j'|z_{−j})))
  = min(1, p(z')p(z_j|z_{−j}) / [p(z_{−j})p(z_j|z_{−j}) · p(z_j'|z_{−j})])
$

Since z' and z share the same z_{−j}, p(z') = p(z_{−j})p(z_j'|z_{−j}), and the ratio simplifies to 1.

5. Hamiltonian Monte Carlo (HMC)

Random-walk MH suffers from slow exploration in high dimensions — the acceptance rate drops as dimension grows. HMC uses gradient information to propose distant states with high acceptance probability.

Key idea: Augment the parameter space with "momentum" variables r and simulate Hamiltonian dynamics:

$H(z, r) = −log p(z) + ½ r^T M^{−1} r
        = U(z) + K(r)
$

where U(z) = −log p(z) is the "potential energy" and K(r) = ½r^T M^{−1}r is the "kinetic energy."

Hamilton's equations:

$dz/dt = ∂H/∂r = M^{−1}r
dr/dt = −∂H/∂z = ∇_z log p(z)
$

HMC algorithm:

Sample momentum r ~ N(0, M)
Simulate Hamiltonian dynamics for L steps with step size ε
    (using leapfrog integrator for volume-preservation and reversibility)
Proposed state: (z', r') after L steps
Accept with probability: min(1, exp(H(z,r) − H(z',r')))

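Why it works better: The gradient ∇ log p(z) steers each trajectory toward and along high-probability regions instead of letting it diffuse at random. Exact Hamiltonian dynamics conserve H and therefore leave the joint distribution p(z)·N(r|0,M) invariant, so the end of a trajectory is as probable under the joint as its start. The leapfrog integrator is symplectic and conserves energy only approximately, so acceptance rates stay high (rather than exactly 1) even in high dimensions, provided the step size ε is small enough.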

HMC suppresses the random-walk behavior that plagues MH in high dimensions — the number of steps to traverse the distribution scales as O(d^{1/4}) for HMC vs O(d) for random-walk MH.
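
The HMC steps listed above can be sketched as follows. This is a minimal illustration under simplifying assumptions: the mass matrix is M = I, the target is a standard Gaussian so that U(z) = ½||z||² and ∇U(z) = z, and the names `leapfrog` and `hmc` as well as the values ε = 0.2 and L = 10 are hypothetical choices, not tuned settings.

```python
import math
import random

def leapfrog(z, r, grad_U, eps, L):
    """L leapfrog steps for H(z, r) = U(z) + 0.5 * r.r (M = I assumed)."""
    z, r = list(z), list(r)
    g = grad_U(z)
    r = [ri - 0.5 * eps * gi for ri, gi in zip(r, g)]    # half step in momentum
    for step in range(L):
        z = [zi + eps * ri for zi, ri in zip(z, r)]      # full step in position
        g = grad_U(z)
        scale = eps if step < L - 1 else 0.5 * eps       # last update is a half step
        r = [ri - scale * gi for ri, gi in zip(r, g)]
    return z, r

def hmc(U, grad_U, z0, n_samples, eps=0.2, L=10, seed=0):
    rng = random.Random(seed)
    z = list(z0)
    samples, accepted = [], 0
    for _ in range(n_samples):
        r = [rng.gauss(0.0, 1.0) for _ in z]             # r ~ N(0, I)
        H_old = U(z) + 0.5 * sum(ri * ri for ri in r)
        z_new, r_new = leapfrog(z, r, grad_U, eps, L)
        H_new = U(z_new) + 0.5 * sum(ri * ri for ri in r_new)
        # Accept with probability min(1, exp(H_old - H_new))
        if rng.random() < math.exp(min(0.0, H_old - H_new)):
            z = z_new
            accepted += 1
        samples.append(list(z))
    return samples, accepted / n_samples

# Standard Gaussian target: U(z) = 0.5 * ||z||^2, grad U(z) = z
U = lambda z: 0.5 * sum(zi * zi for zi in z)
grad_U = lambda z: list(z)

samples, rate = hmc(U, grad_U, z0=[3.0, -3.0], n_samples=2000)
print(rate)   # high, because leapfrog nearly conserves H at this step size
```

Running the integrator forward, flipping the sign of the momentum, and running it again returns to the starting point up to floating-point error, which is the reversibility property the accept/reject step relies on.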

6. MCMC Diagnostics

Trace plots: Plot z_t vs t. Look for stationarity (no drift) and good mixing (chain explores the full range). A "hairy caterpillar" look is ideal.

Autocorrelation: ρ_k = Corr(z_t, z_{t+k}). Effective sample size:

$N_eff = N / (1 + 2 Σ_{k=1}^∞ ρ_k)
$

Chains with high autocorrelation provide fewer independent samples.
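
As a rough illustration of the N_eff formula, the sketch below estimates autocorrelations from a chain and sums them. Two things here are assumptions made for the example: the sum is truncated at the first non-positive ρ_k (a simple heuristic; production libraries use more careful rules), and the test chain is a hypothetical AR(1) process with ρ_k = φ^k, for which N_eff = N(1−φ)/(1+φ) is known in closed form.

```python
import random

def autocorr(chain, k):
    """Sample autocorrelation of the chain at lag k."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[t] - mean) * (chain[t + k] - mean) for t in range(n - k)) / n
    return cov / var

def effective_sample_size(chain, max_lag=200):
    """N_eff = N / (1 + 2 * sum_k rho_k), truncated at the first rho_k <= 0."""
    total = 0.0
    for k in range(1, max_lag + 1):
        rho = autocorr(chain, k)
        if rho <= 0.0:
            break                  # remaining lags are treated as noise
        total += rho
    return len(chain) / (1.0 + 2.0 * total)

def ar1_chain(phi, n, seed=0):
    """Hypothetical test chain: z_t = phi * z_{t-1} + noise."""
    rng = random.Random(seed)
    z, out = 0.0, []
    for _ in range(n):
        z = phi * z + rng.gauss(0.0, 1.0)
        out.append(z)
    return out

chain = ar1_chain(phi=0.9, n=20_000)
print(autocorr(chain, 1))              # near 0.9
print(effective_sample_size(chain))    # theory: 20000 * 0.1 / 1.9, about 1053
```

The estimate is noisy because the tail autocorrelations are themselves estimated from the chain, so expect it to land near, not exactly on, the theoretical value.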

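Gelman-Rubin R̂: Run M chains of length N from overdispersed starting points. Let W be the average within-chain variance and B the between-chain variance, defined as N times the sample variance of the M chain means. R̂ compares the pooled variance estimate to W: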

$R̂ = √( [ (N−1)/N · W + B/N ] / W )
$

R̂ > 1.1 suggests the chains haven't converged. R̂ ≈ 1 is consistent with convergence but does not prove it: chains that are all stuck in the same mode also give R̂ ≈ 1.
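
A minimal sketch of the R̂ computation follows, assuming M chains of equal length N, with B taken as N times the sample variance of the chain means and W as the mean of the within-chain sample variances. The function name `gelman_rubin` and the synthetic chains are illustrative assumptions; the "chains" are independent Gaussian draws rather than real MCMC output.

```python
import math
import random

def gelman_rubin(chains):
    """R-hat for M chains of equal length N (a list of lists of floats)."""
    M, N = len(chains), len(chains[0])
    means = [sum(c) / N for c in chains]
    grand = sum(means) / M
    # B = N * sample variance of the chain means
    B = N * sum((m - grand) ** 2 for m in means) / (M - 1)
    # W = average of the within-chain sample variances
    W = sum(
        sum((x - m) ** 2 for x in c) / (N - 1) for c, m in zip(chains, means)
    ) / M
    var_plus = (N - 1) / N * W + B / N      # pooled variance estimate
    return math.sqrt(var_plus / W)

rng = random.Random(0)
# Four synthetic chains that agree on the same distribution
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four synthetic chains split between two separated modes
stuck = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 5.0, 5.0)]

print(gelman_rubin(mixed))   # close to 1
print(gelman_rubin(stuck))   # well above 1.1
```

The second case shows what R̂ is designed to catch: each chain looks fine on its own, but the chain means disagree by much more than the within-chain spread.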

Acceptance rate for MH: Too high (>90%) — proposals too conservative, slow mixing. Too low (<10%) — proposals too aggressive, chain gets stuck. Optimal: 20-50% for random-walk, 65-80% for HMC.

7. MCMC vs Variational Inference

Aspect	MCMC	Variational Inference
Asymptotics	Exact — converges to true posterior as N → ∞	Biased — limited by variational family even at convergence
Speed	Slow — may need thousands of samples	Fast — optimization converges in hundreds of iterations
Diagnostics	Well-established (R̂, ESS, trace plots)	Harder — ELBO gap unknown without true posterior
Scalability	Challenging for large datasets	Excellent — stochastic VI with minibatches
Gradients	HMC needs gradients of log p	VI needs gradients of ELBO
Use case	Gold-standard inference, small-medium data	Large-scale deep learning, when speed matters more than exactness

Worked Examples

Example 1: Metropolis-Hastings for Beta Posterior

Problem: Prior Beta(2,2); data: 7 successes in 10 Binomial trials. The posterior (target) is Beta(2+7, 2+3) = Beta(9,5). Implement MH with the uniform proposal Q(z'|z) = Uniform(z−0.2, z+0.2). For current state z=0.6 and proposed z'=0.7, compute the acceptance probability.

Solution:

p(z) ∝ Beta(9,5): p(z) ∝ z^8 · (1−z)^4

At z=0.6: p(z) ∝ 0.6^8 · 0.4^4 = 0.0168 · 0.0256 = 4.30×10^{−4}
At z'=0.7: p(z') ∝ 0.7^8 · 0.3^4 = 0.0576 · 0.0081 = 4.67×10^{−4}

Proposal is symmetric: Q(z|z') = Q(z'|z) = 1/0.4 within the window.

α = min(1, 4.67×10^{−4} / 4.30×10^{−4}) = min(1, 1.086) = 1.0

Always accept — the proposed state has higher density.

If z'=0.45 instead: p(0.45) ∝ 0.45^8 · 0.55^4 = 0.00168 · 0.0915 = 1.54×10^{−4}. α = 1.54/4.30 = 0.358. Accept ~36% of the time.
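
The arithmetic in this example can be checked with a short script. The helper names are hypothetical; the density and the two proposals are the ones used above.

```python
def unnorm_beta_9_5(z):
    """Unnormalized Beta(9, 5) density: z^8 * (1 - z)^4 on (0, 1)."""
    return z**8 * (1.0 - z) ** 4 if 0.0 < z < 1.0 else 0.0

def acceptance(p, z, z_prop):
    """MH acceptance probability for a symmetric proposal."""
    return min(1.0, p(z_prop) / p(z))

print(acceptance(unnorm_beta_9_5, 0.6, 0.7))    # 1.0
print(acceptance(unnorm_beta_9_5, 0.6, 0.45))   # about 0.358
```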

Example 2: Gibbs Sampling for Bivariate Gaussian

Problem: (z₁, z₂) ~ N([0,0], [[1,ρ],[ρ,1]]). Derive the Gibbs conditionals and sample 3 steps from starting point (2, −2) with ρ=0.8.

Solution:

Conditionals for bivariate Gaussian:

$z₁ | z₂ ~ N(ρ·z₂, 1−ρ²)
z₂ | z₁ ~ N(ρ·z₁, 1−ρ²)
$

Starting at (2, −2):

Step 1: z₁ ~ N(0.8·(−2), 0.36) = N(−1.6, 0.36). Sample: −1.52. z₂ ~ N(0.8·(−1.52), 0.36) = N(−1.22, 0.36). Sample: −1.08.

Step 2: z₁ ~ N(0.8·(−1.08), 0.36) = N(−0.86, 0.36). Sample: −0.73. z₂ ~ N(0.8·(−0.73), 0.36) = N(−0.58, 0.36). Sample: −0.41.

Step 3: z₁ ~ N(0.8·(−0.41), 0.36) = N(−0.33, 0.36). Sample: 0.12. z₂ ~ N(0.8·0.12, 0.36) = N(0.10, 0.36). Sample: 0.45.

The chain moves from (2, −2) toward the origin (the true mean), exploring along the correlation direction (ρ=0.8 means z₁ and z₂ are positively correlated in the target).
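
The same sampler can be run for many more steps in code. This is a sketch under the example's assumptions (zero means, unit variances, correlation ρ); the function name and the seed are arbitrary, so the individual draws will differ from the hand-picked values above.

```python
import math
import random

def gibbs_bivariate_gaussian(rho, z_start, n_samples, seed=0):
    """Gibbs sampling for N([0, 0], [[1, rho], [rho, 1]])."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)       # conditional standard deviation
    z1, z2 = z_start
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(rho * z2, sd)      # z1 | z2 ~ N(rho * z2, 1 - rho^2)
        z2 = rng.gauss(rho * z1, sd)      # z2 | z1 ~ N(rho * z1, 1 - rho^2), uses new z1
        samples.append((z1, z2))
    return samples

samples = gibbs_bivariate_gaussian(rho=0.8, z_start=(2.0, -2.0), n_samples=20_000)
kept = samples[500:]                       # drop burn-in
n = len(kept)
m1 = sum(a for a, _ in kept) / n
m2 = sum(b for _, b in kept) / n
cov = sum((a - m1) * (b - m2) for a, b in kept) / n
v1 = sum((a - m1) ** 2 for a, _ in kept) / n
v2 = sum((b - m2) ** 2 for _, b in kept) / n
print(m1, m2, cov / math.sqrt(v1 * v2))    # means near 0, correlation near 0.8
```

Note that `random.gauss` takes a standard deviation, so the conditional variance 1 − ρ² = 0.36 enters as its square root, 0.6.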

Example 3: Effective Sample Size

Problem: An MCMC chain of N=10,000 samples has autocorrelations ρ₁=0.9, ρ₂=0.81, ρ₅=0.5, ρ₁₀=0.2, ρ₂₀=0.05, and negligible beyond. Approximate N_eff.

Solution:

$N_eff = N / (1 + 2 Σ_{k=1}^∞ ρ_k)
      ≈ 10000 / (1 + 2(0.9+0.81+0.73+0.66+0.5+0.42+0.35+0.28+0.23+0.2+0.15+0.11+0.08+0.06+0.05))
      ≈ 10000 / (1 + 2·5.53)
      ≈ 10000 / 12.06
      ≈ 829
$

Only ~8.3% of the samples are "effective": the high autocorrelation means you need ~12× more MCMC iterations than an independent sampler would. The sum fills in the lags not given in the problem by rough interpolation, so treat 829 as an approximation. Thinning (keeping every k-th sample) is a common response to high autocorrelation. It reduces storage and the correlation between retained samples, but it does not improve the Monte Carlo estimate, because discarding samples loses information.

Quiz

Q1: What does "Markov chain" refer to in this subject?

A) A sequence of independent draws from the target distribution B) A deterministic optimization trajectory toward the posterior mode C) A sequence of states in which z_{t+1} depends only on z_t through a transition kernel T(z' | z) D) A numerical integrator for Hamiltonian dynamics

Correct: C)

If you chose A: This is incorrect. Consecutive states of a Markov chain are dependent, which is why MCMC samples are autocorrelated rather than independent.
If you chose B: This is incorrect. A Markov chain is stochastic: each new state is drawn from the transition kernel T(z' | z).
If you chose C: Correct! Each state depends only on the previous state, through the transition kernel T(z' | z).
If you chose D: This is incorrect. That describes the leapfrog integrator used inside HMC, not the chain itself.

Q2: How does the asymptotic behavior of MCMC compare with that of variational inference?

A) MCMC yields exact posterior samples after a fixed, finite number of iterations B) MCMC remains biased as N → ∞, while VI becomes exact C) Both methods converge to the true posterior given enough computation D) MCMC converges to the true posterior as N → ∞, while VI remains limited by its variational family

Correct: D)

If you chose A: This is incorrect. MCMC guarantees are asymptotic; no fixed, finite number of iterations guarantees exactness.
If you chose B: This is incorrect. It is the other way around: MCMC is asymptotically exact and VI is biased by its variational family.
If you chose C: This is incorrect. VI stays limited by its variational family even at convergence.
If you chose D: Correct! MCMC is asymptotically exact, while VI minimizes KL within a simpler family and stays biased even at convergence.

Q3: Which statement about the speed of MCMC and variational inference is TRUE?

A) MCMC is typically slower, since it may need thousands of samples, while VI optimization can converge in hundreds of iterations B) MCMC always needs fewer iterations than VI C) Speed plays no role in choosing between MCMC and VI D) VI is slower because it must compute the evidence p(x) exactly

Correct: A)

If you chose A: Correct! The comparison table lists MCMC as slow (thousands of samples) and VI as fast (hundreds of optimization iterations).
If you chose B: This is incorrect. MCMC is the slower of the two; computational cost is the price of asymptotic exactness.
If you chose C: This is incorrect. VI is preferred for large-scale problems precisely when speed matters more than exactness.
If you chose D: This is incorrect. VI avoids computing p(x) by optimizing the ELBO instead.

Q4: In Example 1, what is the MH acceptance probability for the move from z=0.6 to z'=0.7?

A) 0.358 B) 0.234 C) 1.0 D) 0.921

Correct: C)

If you chose A: This is incorrect. 0.358 is the acceptance probability for the other proposal in the example, z'=0.45.
If you chose B: This is incorrect. 0.234 is the asymptotically optimal acceptance rate for random-walk Metropolis, not the result of this calculation.
If you chose C: Correct! The density ratio is 4.67/4.30 = 1.086 > 1, so α = min(1, 1.086) = 1.0.
If you chose D: This is incorrect. 0.921 is the inverted ratio 4.30/4.67; the ratio is p(z')/p(z), not p(z)/p(z').

Q5: How do speed and diagnostics trade off between MCMC and variational inference?

A) MCMC is slower but has well-established diagnostics (R̂, ESS, trace plots), while VI is faster but harder to diagnose B) VI is both slower and easier to diagnose than MCMC C) The faster method always has the better diagnostics D) Neither method needs diagnostics once it has been run long enough

Correct: A)

If you chose A: Correct! MCMC pays in computation but comes with R̂, ESS, and trace plots; for VI the ELBO gap is unknown without the true posterior.
If you chose B: This is incorrect. VI is the faster method, and its diagnostics are the harder ones.
If you chose C: This is incorrect. Here the faster method (VI) is the one that is harder to diagnose.
If you chose D: This is incorrect. Convergence diagnostics are essential because MCMC guarantees are asymptotic.

Q6: Which of the following is a scalability pitfall described in this subject?

A) Using HMC or NUTS for a problem with 50 parameters B) Using random-walk Metropolis in a high-dimensional problem C) Running 4 chains from overdispersed starting points D) Using stochastic VI with minibatches on a large dataset

Correct: B)

If you chose A: This is incorrect. HMC and its variants are the recommended choice for problems with more than 10-20 parameters.
If you chose B: Correct! Random-walk MH needs O(d) steps to traverse the distribution, so in high dimensions the chain barely moves.
If you chose C: This is incorrect. Running several chains from overdispersed starting points is recommended practice, not a pitfall.
If you chose D: This is incorrect. Stochastic VI with minibatches is listed as scaling well to large datasets.

Q7: Where do gradients enter the MCMC methods covered in this subject?

A) Random-walk MH needs the gradient of log p to compute its acceptance probability B) HMC uses the gradient of log p(z) to simulate Hamiltonian dynamics and propose distant states C) Gibbs sampling needs gradients of the full conditionals D) Gradients are never used in MCMC

Correct: B)

If you chose A: This is incorrect. Random-walk MH only evaluates the density ratio p(z')/p(z); no gradient is needed.
If you chose B: Correct! In HMC, dr/dt = ∇_z log p(z), so the gradient drives the trajectory toward high-probability regions.
If you chose C: This is incorrect. Gibbs sampling draws directly from the full conditionals and needs no gradients.
If you chose D: This is incorrect. HMC relies on gradients of log p.

Practice Problems

Problem 1

Prove that if a Markov chain satisfies detailed balance p(z)T(z'|z) = p(z')T(z|z'), then p is a stationary distribution.

Answer

Integrate the detailed balance equation over z:

$∫ p(z) T(z' | z) dz = ∫ p(z') T(z | z') dz
                    = p(z') ∫ T(z | z') dz
                    = p(z') · 1
                    = p(z')
$

The left side is the distribution after one transition from p: (Tp)(z') = ∫ p(z) T(z' | z) dz = p(z'). So applying T leaves p unchanged, and p is stationary. ✓
Detailed balance is a sufficient (not necessary) condition for stationarity. It is stronger than needed but easy to verify. Stationarity alone does not guarantee convergence: the chain must also be irreducible and aperiodic for it to converge to p.
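
On a finite state space the integral becomes a sum, and the argument can be checked numerically. The sketch below builds a Metropolis-Hastings transition matrix for a hypothetical three-state unnormalized target `p` and an asymmetric proposal matrix `Q` (both invented for illustration), then verifies detailed balance and stationarity.

```python
def mh_kernel(p, Q):
    """MH transition matrix T[i][j] = P(next = j | current = i)
    for an unnormalized target p and proposal matrix Q."""
    n = len(p)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and Q[i][j] > 0.0:
                alpha = min(1.0, (p[j] * Q[j][i]) / (p[i] * Q[i][j]))
                T[i][j] = Q[i][j] * alpha
        T[i][i] = 1.0 - sum(T[i])          # rejected proposals stay at i
    return T

p = [1.0, 3.0, 6.0]                         # hypothetical unnormalized target
Q = [[0.0, 0.5, 0.5],
     [0.8, 0.0, 0.2],
     [0.3, 0.7, 0.0]]                       # asymmetric proposal
T = mh_kernel(p, Q)

n = len(p)
# Detailed balance: p[i] * T[i][j] == p[j] * T[j][i] for every pair
balance_gap = max(abs(p[i] * T[i][j] - p[j] * T[j][i])
                  for i in range(n) for j in range(n))
# Stationarity: sum_i p[i] * T[i][j] == p[j], i.e. detailed balance summed over i
stationary_gap = max(abs(sum(p[i] * T[i][j] for i in range(n)) - p[j])
                     for j in range(n))
print(balance_gap, stationary_gap)          # both at floating-point noise level
```

The check works with the unnormalized p because both conditions are linear in p, so the normalizing constant cancels.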

Problem 2

Explain why random-walk Metropolis scales poorly with dimension. For a d-dimensional Gaussian target N(0,I) with proposal N(z, σ²I), what's the optimal σ² and what's the acceptance rate?

Answer

For a d-dimensional spherical Gaussian target with proposal N(z, σ²I): The log-density difference:

$log p(z') − log p(z) = −½(||z'||² − ||z||²)
$

For the proposal, ||z' − z||² ~ σ²·χ²_d, which has mean σ²d. If σ² is too large, ||z'||² − ||z||² becomes large and positive, the log-density difference becomes very negative, and proposals are rejected. To maintain a reasonable acceptance rate, σ² must scale as O(1/d), so the typical step in any single coordinate shrinks as O(1/√d). The optimal choice is σ² ≈ 2.38²/d, with acceptance rate ≈ 0.234 as d → ∞ (Roberts et al., 1997). Because a random walk with steps of size O(1/√d) needs O(d) steps to move an O(1) distance, the chain needs O(d) steps to traverse the distribution in any direction: "random walk" indeed. HMC overcomes this by using gradient information to make coherent proposals.
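
The scaling claim can be explored empirically. The sketch below is an illustrative experiment, not a proof: it runs random-walk Metropolis on N(0, I), starting from a draw of the target, and compares the scaled proposal σ = 2.38/√d with a fixed σ = 2.38. The function name, step count, and seed are arbitrary assumptions.

```python
import math
import random

def rwm_acceptance_rate(d, sigma, n_steps=3000, seed=0):
    """Observed acceptance rate of random-walk Metropolis on N(0, I_d)."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]    # start from the target itself
    sq = sum(x * x for x in z)
    accepted = 0
    for _ in range(n_steps):
        z_prop = [x + sigma * rng.gauss(0.0, 1.0) for x in z]
        sq_prop = sum(x * x for x in z_prop)
        # log p(z') - log p(z) = -0.5 * (||z'||^2 - ||z||^2)
        if rng.random() < math.exp(min(0.0, -0.5 * (sq_prop - sq))):
            z, sq = z_prop, sq_prop
            accepted += 1
    return accepted / n_steps

for d in (5, 50):
    scaled = rwm_acceptance_rate(d, 2.38 / math.sqrt(d))
    fixed = rwm_acceptance_rate(d, 2.38)
    print(d, scaled, fixed)
```

With the scaled proposal the rate should sit in the neighborhood of the asymptotic 0.234 (somewhat higher at small d), while the fixed proposal is accepted less and less often as d grows.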

Problem 3

Derive why Gibbs sampling has acceptance probability 1 in the MH framework.

Answer

Gibbs proposes z_j' ~ p(z_j' | z_{−j}) and leaves z_{−j} unchanged. So: Proposal: Q(z' | z) = p(z_j' | z_{−j}) MH acceptance ratio:

$α = min(1, p(z')·Q(z | z') / (p(z)·Q(z' | z)))
$

Now:
p(z) = p(z_j, z_{−j}) = p(z_j | z_{−j}) · p(z_{−j})
p(z') = p(z_j', z_{−j}) = p(z_j' | z_{−j}) · p(z_{−j}) [same z_{−j}!]
Q(z' | z) = p(z_j' | z_{−j}) [proposed new value given the rest]
Q(z | z') = p(z_j | z_{−j}) [reverse proposal from z' back to z]
Substituting:

$α = min(1, [p(z_j'|z_{−j})·p(z_{−j}) · p(z_j|z_{−j})] / [p(z_j|z_{−j})·p(z_{−j}) · p(z_j'|z_{−j})])
  = min(1, 1)
  = 1
$

Everything cancels. Gibbs always accepts because the proposal is already distributed according to the conditional of the target. ✓

Problem 4

In HMC, if the leapfrog integrator were perfectly energy-preserving, what would the acceptance probability be? Why do we need the accept/reject step in practice?

Answer

If H(z', r') = H(z, r) (perfect energy conservation), then α = min(1, e^0) = 1. Every proposal would be accepted.
Why we still need accept/reject: The leapfrog integrator introduces O(ε³) error per step and O(ε²) error overall. Over L steps, energy is not perfectly conserved, so typically H(z',r') ≠ H(z,r). The accept/reject step corrects for this discretization error, ensuring the chain's stationary distribution remains exactly the target. Without it, the chain would sample from a perturbed distribution that differs from the target by O(ε²). This is why HMC is "exact" despite using approximate integration: the MH correction guarantees correctness.

Problem 5

You run 4 MCMC chains for a parameter μ. The within-chain variance is W=2.3 and between-chain variance is B=8.7, with N=1000 samples per chain. Compute R̂. Has the chain converged?

Answer

$Var̂^+ = (N−1)/N · W + B/N = (999/1000)·2.3 + 8.7/1000 = 2.2977 + 0.0087 = 2.3064

R̂ = √(Var̂^+ / W) = √(2.3064 / 2.3) = √1.0028 ≈ 1.001
$

R̂ ≈ 1.001 < 1.1, suggesting good convergence. B = 8.7 looks large next to W = 2.3, but B is N times the variance of the chain means, so that variance is only B/N = 0.0087. The chain means agree closely relative to the within-chain spread, and B/N contributes negligibly to the pooled variance estimate.
Warning: R̂ ≈ 1 doesn't guarantee convergence. It only tests whether the chains have reached the SAME distribution. If all chains are stuck in the same local mode, R̂ ≈ 1 even though convergence hasn't been achieved. Always complement R̂ with trace plots and ESS.

Summary

MCMC constructs a Markov chain whose stationary distribution is the target posterior — simulating the chain produces asymptotically exact samples
Metropolis-Hastings proposes states from Q and accepts/rejects to satisfy detailed balance — the acceptance ratio corrects for proposal asymmetry
Gibbs sampling is MH with acceptance probability 1 — sample from full conditionals when available
Hamiltonian Monte Carlo uses gradient information and Hamiltonian dynamics to propose distant states with high acceptance — suppressing O(d) random-walk behavior to O(d^{1/4})
Convergence diagnostics (trace plots, autocorrelation, ESS, R̂) are essential — MCMC guarantees are asymptotic, not per-sample

Pitfalls

Treating correlated MCMC samples as independent. MCMC produces autocorrelated samples: consecutive draws are not independent. Computing standard errors or confidence intervals as if N_eff = N can dramatically overstate precision. Always compute the effective sample size (ESS) and use it for uncertainty quantification. An ESS of 200 from 10,000 samples means you effectively have only 200 independent data points.
Running a single chain and trusting it. A single chain can get stuck in a local mode and appear well-mixed (good trace plot, low autocorrelation) while entirely missing other important regions of the posterior. Always run at least 3-4 chains from overdispersed starting points and use R̂ to check that they converge to the same distribution. A single chain is never sufficient for reliable inference.
Misinterpreting R̂ ≈ 1 as proof of convergence. R̂ ≈ 1 means the chains have converged to the SAME distribution — but if all chains are stuck in the same local mode, R̂ ≈ 1 while the chains haven't explored the full posterior. R̂ is a necessary condition for convergence, not a sufficient one. Always complement R̂ with trace plots, effective sample size, and visual inspection of the marginal posterior distributions.
Using random-walk Metropolis in high dimensions. Random-walk MH requires a proposal variance that scales as O(1/d), so it needs O(d) steps to traverse the distribution. In a 100-dimensional problem, this means the chain barely moves. Use HMC or its variants (NUTS, which adapts the path length automatically) for problems with more than 10-20 parameters. Random-walk MH is only practical for very low-dimensional problems or as a pedagogical tool.
Discarding burn-in samples but not accounting for autocorrelation in the remaining samples. Burn-in removal handles initialization bias, but the post-burn-in samples are still correlated. The common practice of "keep every k-th sample" (thinning) reduces storage but does NOT improve the Monte Carlo estimate — it discards information. Use ALL post-burn-in samples for estimation, but compute standard errors using the effective sample size, not the nominal sample size.

Key Terms

Term	Definition
Markov chain	A sequence where each state depends only on the previous state through transition kernel T(z' | z)
Stationary distribution	π(z') = ∫ T(z' | z) π(z) dz
Detailed balance	p(z)T(z' | z) = p(z')T(z | z')
Metropolis-Hastings	Proposal + accept/reject with α = min(1, p(z')Q(z | z') / (p(z)Q(z' | z)))
Gibbs sampling	Sample each variable from its full conditional p(z_j | z_{−j})
HMC	Hamiltonian Monte Carlo — augments with momentum, simulates physics, uses gradients
Leapfrog integrator	Symplectic integrator for HMC — volume-preserving and reversible, O(ε³) per-step error
Effective sample size	N_eff = N/(1+2Σρ_k) — corrects for autocorrelation; fewer effective than nominal samples
R̂ (Gelman-Rubin)	√(Var̂^+/W) — ratio of pooled to within-chain variance; near 1 suggests convergence

Next Steps

Continue to 21-04 — EM Algorithm to learn how to perform maximum likelihood estimation with latent variables — a deterministic alternative that iteratively refines parameter estimates.
