21-04 — EM Algorithm
Phase: 21, Probability & Statistics for ML (Advanced)
Subject: 21-04
Prerequisites: 21-01 (Bayesian Inference), 12-01 (Maximum Likelihood Estimation), 10-02 (Conditional Probability), 13-04 (KL Divergence), 14-06 (Convex Functions, Jensen's inequality)
Next subject: 21-05, Exponential Family
Learning Objectives
By the end of this subject, you will be able to:
- Derive the EM algorithm from Jensen's inequality — prove that the E-step computes the expected complete-data log-likelihood and the M-step maximizes it
- Prove that EM monotonically increases the marginal likelihood at each iteration
- Apply EM to Gaussian Mixture Models (GMMs) — derive the E-step responsibilities and M-step parameter updates
- Analyze EM convergence: why it can be slow (linear convergence) and when it gets stuck in local maxima
- Connect EM to k-means clustering as a limiting case (hard EM with σ² → 0)
Core Content
1. The Problem: MLE with Latent Variables
Many models have latent (unobserved) variables z that make direct MLE intractable:
$θ_MLE = argmax_θ log p_θ(x) = argmax_θ log ∫ p_θ(x, z) dz $
The log of an integral doesn't simplify. But if we OBSERVED z, the complete-data log-likelihood log p_θ(x, z) would be easy to maximize. EM alternates between filling in z (E-step) and maximizing with filled-in z (M-step).
⚠️ THIS IS CRITICAL — EM is the foundational algorithm for training models with latent structure: GMMs, HMMs, topic models (LDA), factor analysis, and many more. Understanding EM is understanding how to learn from incomplete data.
2. Derivation from Jensen's Inequality
For any distribution q(z) over the latent variables:
$log p_θ(x) = log ∫ p_θ(x, z) dz
= log ∫ q(z) · p_θ(x, z)/q(z) dz
≥ ∫ q(z) · log(p_θ(x, z)/q(z)) dz [Jensen]
= E_q[log p_θ(x, z)] − E_q[log q(z)]
= L(q, θ)
$
L(q, θ) is a lower bound on log p_θ(x), and the gap between the two is exactly KL(q(z) || p_θ(z | x)). EM alternately maximizes this bound:
E-step: Fix θ, find the q that makes the bound tight (KL = 0):
$q^{(t)}(z) = p_{θ^{(t)}}(z | x) [the posterior over latents]
$
M-step: Fix q, maximize the bound with respect to θ:
$θ^{(t+1)} = argmax_θ E_{q^{(t)}}[log p_θ(x, z)]
$
The E-step computes the expected complete-data log-likelihood under the current posterior. The M-step maximizes it.
Why it works:
$log p(x; θ^{(t+1)}) ≥ L(q^{(t)}, θ^{(t+1)}) [Jensen bound]
≥ L(q^{(t)}, θ^{(t)}) [M-step maximizes]
= log p(x; θ^{(t)}) [E-step makes bound tight]
$
So log-likelihood is monotonically non-decreasing. EM never makes the likelihood WORSE.
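The bound and its tightness at the posterior can be checked numerically. The sketch below uses a hypothetical toy model that is not from the text (a binary latent z with made-up values for p(z) and p(x | z)) and evaluates L(q) for a few choices of q; the gap log p(x) − L(q) should be non-negative and vanish when q is the posterior.

```python
import math

# Hypothetical toy model (values chosen for illustration only):
# a binary latent z and a single observed x.
prior = [0.3, 0.7]   # p(z)
lik = [0.9, 0.2]     # p(x | z) evaluated at the observed x

joint = [p * l for p, l in zip(prior, lik)]    # p(x, z)
log_px = math.log(sum(joint))                  # log p(x)
posterior = [j / sum(joint) for j in joint]    # p(z | x)

def elbo(q):
    """L(q) = sum_z q(z) * log(p(x, z) / q(z)), skipping zero-mass terms."""
    return sum(qz * math.log(j / qz) for qz, j in zip(q, joint) if qz > 0)

for q in ([0.5, 0.5], [0.9, 0.1], posterior):
    gap = log_px - elbo(q)   # equals KL(q || p(z | x)), so it is >= 0
    print([round(v, 4) for v in q], round(elbo(q), 6), round(gap, 6))
```

Only the last choice of q (the exact posterior) closes the gap, which is what the E-step exploits.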
3. Gaussian Mixture Models (GMMs)
A GMM with K components:
$p(x) = Σ_{k=1}^K π_k · N(x | μ_k, Σ_k)
$
Latent variable z ∈ {1,...,K} indicates which component generated x:
$p(z=k) = π_k
p(x | z=k) = N(x | μ_k, Σ_k)
$
Complete-data log-likelihood:
$log p(x, z) = Σ_{i=1}^n Σ_{k=1}^K I(z_i=k) [log π_k + log N(x_i | μ_k, Σ_k)]
$
E-step — responsibilities:
$γ_{ik} = P(z_i = k | x_i; θ^{(t)})
= π_k · N(x_i | μ_k, Σ_k) / Σ_j π_j · N(x_i | μ_j, Σ_j)
$
M-step — parameter updates:
$N_k = Σ_i γ_{ik} [effective number of points in component k]
π_k = N_k / n [component weight]
μ_k = (1/N_k) Σ_i γ_{ik} x_i [weighted mean]
Σ_k = (1/N_k) Σ_i γ_{ik} (x_i − μ_k)(x_i − μ_k)^T [weighted covariance]
$
Each M-step is a weighted version of the standard MLE — with soft assignments γ_{ik} instead of hard cluster labels.
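The E-step and M-step above can be sketched for the 1D case in plain Python. This is a minimal illustration rather than a production implementation: the function names, the scalar variances (instead of full covariance matrices), and the `var_floor` safeguard against a variance shrinking to zero are all assumptions added here.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, pis, mus, vars_):
    return sum(math.log(sum(p * normal_pdf(x, m, v)
                            for p, m, v in zip(pis, mus, vars_)))
               for x in xs)

def em_step(xs, pis, mus, vars_, var_floor=1e-6):
    """One EM iteration for a 1D GMM; var_floor is an added safeguard."""
    K, n = len(pis), len(xs)
    # E-step: gamma[i][k] = pi_k N(x_i | mu_k, var_k) / sum_j pi_j N(x_i | mu_j, var_j)
    gamma = []
    for x in xs:
        w = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_)]
        total = sum(w)
        gamma.append([wk / total for wk in w])
    # M-step: responsibility-weighted versions of the usual MLE formulas
    Nk = [sum(g[k] for g in gamma) for k in range(K)]
    new_pis = [Nk[k] / n for k in range(K)]
    new_mus = [sum(g[k] * x for g, x in zip(gamma, xs)) / Nk[k] for k in range(K)]
    new_vars = [max(sum(g[k] * (x - new_mus[k]) ** 2
                        for g, x in zip(gamma, xs)) / Nk[k], var_floor)
                for k in range(K)]
    return new_pis, new_mus, new_vars

# Usage on a small illustrative data set
xs = [-2.0, -1.5, 1.0, 1.5, 2.0]
pis, mus, vars_ = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
lls = [log_likelihood(xs, pis, mus, vars_)]
for _ in range(20):
    pis, mus, vars_ = em_step(xs, pis, mus, vars_)
    lls.append(log_likelihood(xs, pis, mus, vars_))
print(pis, mus, vars_)
```

Tracking `lls` across iterations is a cheap sanity check: as long as the variance floor is not hit, the sequence should never decrease.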
4. Connection to k-Means
k-means is the limiting case of EM for GMMs as:
- All components share the SAME spherical covariance: Σ_k = σ²I
- σ² → 0
Then the Gaussian normalizing constants cancel and the responsibility γ_{ik} becomes:
$γ_{ik} ∝ π_k · exp(−||x_i − μ_k||² / (2σ²))
$
As σ² → 0, the component with the smallest distance dominates, regardless of the (positive) weights π_k:
γ_{ik} → 1 if k = argmin_j ||x_i − μ_j||², else 0
The E-step becomes HARD assignment, and the M-step becomes:
$μ_k = mean of all points assigned to cluster k $
This is exactly Lloyd's algorithm for k-means. EM generalizes k-means by allowing soft assignments, unequal covariances, and component weights.
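How the soft assignments harden can be seen by evaluating the responsibilities for shrinking σ². The sketch below assumes equal weights and a shared variance in 1D; the centroids and the test point are hypothetical values chosen for illustration.

```python
import math

def responsibilities(x, mus, sigma2):
    """Equal-weight, shared-variance responsibilities for a 1D point x.

    Subtracting the largest exponent before calling exp() avoids a 0/0
    when sigma2 is tiny and every raw exponential underflows.
    """
    logits = [-(x - m) ** 2 / (2 * sigma2) for m in mus]
    top = max(logits)
    w = [math.exp(l - top) for l in logits]
    total = sum(w)
    return [wk / total for wk in w]

mus = [-1.0, 1.0]   # hypothetical centroids
x = 0.2             # slightly closer to the second centroid
for sigma2 in (10.0, 1.0, 0.1, 0.001):
    print(sigma2, [round(g, 4) for g in responsibilities(x, mus, sigma2)])
```

For large σ² the point is shared almost evenly; as σ² shrinks, the nearer centroid takes essentially all of the responsibility.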
5. EM for Other Models
Hidden Markov Models (Baum-Welch): The E-step computes forward-backward probabilities; the M-step updates transition and emission probabilities. EM for HMMs is called the Baum-Welch algorithm.
Probabilistic PCA: Latent z ~ N(0, I), observed x = Wz + μ + ε with ε ~ N(0, σ²I). EM alternates between computing E[z|x] and updating W, μ, σ².
Topic Models (LDA): LDA can be trained with variational EM — the E-step uses variational inference (21-02) to approximate p(z|x), and the M-step updates the topic-word distributions.
6. Convergence Properties
Monotonicity: EM never decreases the marginal likelihood. Each iteration either increases it or leaves it unchanged (at a stationary point).
Linear convergence: Near a local optimum θ*, ||θ^{(t+1)} − θ*|| ≈ ρ·||θ^{(t)} − θ*||, where ρ = 1 − λ_min(I_c^{−1} I_o) is the largest fraction of missing information (I_o and I_c are the observed and complete-data Fisher information matrices). When most information is missing, ρ ≈ 1 and EM converges SLOWLY.
Local maxima: EM converges to a local maximum (or saddle point) of the likelihood — not guaranteed to find the global optimum. Multiple random restarts are essential.
Acceleration: Various methods (Aitken's acceleration, Louis' method, parameter expansion/PX-EM) can speed up convergence when EM is slow.
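The effect of missing information on speed can be illustrated by counting iterations on two small data sets. The sketch assumes a deliberately restricted model (two components, weights fixed at 0.5, variance fixed at 1, only the means updated); the data sets, the tolerance, and the function name are illustrative choices, not from the text.

```python
import math

def em_means(xs, mu1, mu2, tol=1e-10, max_iter=10_000):
    """EM for a two-component 1D GMM with fixed weights 0.5 and variance 1.

    Only the two means are updated. Returns (mu1, mu2, iterations used).
    """
    for it in range(1, max_iter + 1):
        # E-step: gamma_i1 = 1 / (1 + N(x | mu2, 1) / N(x | mu1, 1))
        g1 = [1 / (1 + math.exp(((x - mu1) ** 2 - (x - mu2) ** 2) / 2))
              for x in xs]
        n1 = sum(g1)
        # M-step: responsibility-weighted means
        new1 = sum(g * x for g, x in zip(g1, xs)) / n1
        new2 = sum((1 - g) * x for g, x in zip(g1, xs)) / (len(xs) - n1)
        shift = max(abs(new1 - mu1), abs(new2 - mu2))
        mu1, mu2 = new1, new2
        if shift < tol:
            return mu1, mu2, it
    return mu1, mu2, max_iter

overlapping = [-1.5, -0.5, 0.5, 1.5]   # clusters blur into each other
separated = [-5.0, -4.0, 4.0, 5.0]     # clusters far apart
print(em_means(overlapping, -1.0, 1.0))
print(em_means(separated, -1.0, 1.0))
```

On the overlapping data the posterior over z stays diffuse, so more information is missing and EM needs many more iterations to reach the same tolerance.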
Worked Examples
Example 1: EM for a Simple Two-Component GMM
Problem: Data: x = [−2.0, −1.5, 1.0, 1.5, 2.0]. Fit a 2-component GMM with shared variance σ²=1 (only means and weights unknown). Initialize: π=[0.5, 0.5], μ=[−1, 1].
E-step: Compute responsibilities.
$For x₁=−2.0:
N(−2.0 | −1, 1) = exp(−(−2+1)²/2)/√(2π) = exp(−0.5)/2.507 = 0.242
N(−2.0 | 1, 1) = exp(−(−2−1)²/2)/√(2π) = exp(−4.5)/2.507 = 0.00443
γ₁₁ = 0.5·0.242/(0.5·0.242+0.5·0.00443) = 0.121/0.1232 = 0.982
γ₁₂ = 1−0.982 = 0.018
For x₃=1.0:
N(1.0 | −1, 1) = exp(−4/2)/2.507 = 0.0540
N(1.0 | 1, 1) = exp(0)/2.507 = 0.3989
γ₃₁ = 0.5·0.0540/(0.5·0.0540+0.5·0.3989) = 0.0270/0.2265 = 0.119
γ₃₂ = 0.881
$
The remaining points follow the same pattern. To four decimals, γ_{i1} = 0.9820, 0.9526, 0.1192, 0.0474, 0.0180 for x = −2.0, −1.5, 1.0, 1.5, 2.0, with γ_{i2} = 1 − γ_{i1}. The responsibilities show clear separation: negative points belong mostly to component 1, positive points to component 2.
M-step:
$N₁ = 0.9820+0.9526+0.1192+0.0474+0.0180 = 2.1192
N₂ = 5−2.1192 = 2.8808
μ₁ = (0.9820·(−2)+0.9526·(−1.5)+0.1192·1+0.0474·1.5+0.0180·2)/2.1192
= (−1.9640−1.4289+0.1192+0.0711+0.0360)/2.1192 = −3.1666/2.1192 = −1.494
μ₂ = (0.0180·(−2)+0.0474·(−1.5)+0.8808·1+0.9526·1.5+0.9820·2)/2.8808
= (−0.0360−0.0711+0.8808+1.4289+1.9640)/2.8808 = 4.1666/2.8808 = 1.446
π₁ = 2.1192/5 = 0.424, π₂ = 0.576
$
After one iteration: μ moved from [−1, 1] to [−1.49, 1.45], and the weights shifted to [0.42, 0.58], reflecting the 2:3 split of the data. Each iteration pulls the means toward their respective clusters.
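The arithmetic of this iteration is easy to get wrong by hand, so it can help to recompute it. The following sketch redoes the single E-step and M-step from the stated initialization with the variance fixed at 1; the variable names are illustrative.

```python
import math

xs = [-2.0, -1.5, 1.0, 1.5, 2.0]
pis, mus = [0.5, 0.5], [-1.0, 1.0]   # initialization from the problem; variance fixed at 1

def pdf(x, mu):
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

# E-step: responsibilities gamma[i][k]
gamma = []
for x in xs:
    w = [p * pdf(x, m) for p, m in zip(pis, mus)]
    gamma.append([wk / sum(w) for wk in w])

# M-step: effective counts, means, and weights (the variance stays at 1)
Nk = [sum(g[k] for g in gamma) for k in range(2)]
new_mus = [sum(g[k] * x for g, x in zip(gamma, xs)) / Nk[k] for k in range(2)]
new_pis = [n / len(xs) for n in Nk]

for x, g in zip(xs, gamma):
    print(x, [round(v, 4) for v in g])
print("N_k:", [round(v, 4) for v in Nk])
print("mu: ", [round(v, 4) for v in new_mus])
print("pi: ", [round(v, 4) for v in new_pis])
```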
Example 2: Proving Monotonicity
Problem: Prove that EM never decreases the log-likelihood: log p(x; θ^{(t+1)}) ≥ log p(x; θ^{(t)}).
Solution:
The key inequality:
$log p(x; θ^{(t+1)}) = log ∫ p(x,z; θ^{(t+1)}) dz
≥ ∫ p(z|x; θ^{(t)}) log(p(x,z; θ^{(t+1)})/p(z|x; θ^{(t)})) dz [Jensen]
= E_{z|x;θ^{(t)}}[log p(x,z; θ^{(t+1)})] − E[log p(z|x; θ^{(t)})]
≥ E_{z|x;θ^{(t)}}[log p(x,z; θ^{(t)})] − E[log p(z|x; θ^{(t)})] [M-step]
= log p(x; θ^{(t)}) [E-step tightness]
$
The first inequality is Jensen. The second inequality is because the M-step MAXIMIZES the expected complete-data log-likelihood, so θ^{(t+1)} achieves at least as high a value as θ^{(t)}. The last equality uses the fact that with q = p(z|x; θ^{(t)}), the bound is tight at θ^{(t)}.
Example 3: EM vs Direct Optimization
Problem: For a 1D GMM with K=2, σ²=1 known, and data x=[−3, −2, 0, 2, 3], compare the EM trajectory vs gradient ascent from the same starting point.
Solution:
EM update for μ₁ (holding μ₂ fixed):
$γ_i1 = π₁N(x_i|μ₁)/(π₁N(x_i|μ₁) + π₂N(x_i|μ₂))
μ₁^{new} = Σ γ_i1 x_i / Σ γ_i1
$
Gradient of log-likelihood w.r.t. μ₁:
$∂/∂μ₁ log p(x) = Σ [γ_i1 · (x_i − μ₁)] $
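Because Σ γ_i1 (x_i − μ₁) = N₁·(μ₁^{new} − μ₁) with N₁ = Σ γ_i1, the EM update is itself a gradient step with step size 1/N₁: it jumps straight to the optimal μ₁ for the current responsibilities. Gradient ascent with a smaller fixed learning rate η covers only a fraction η·N₁ of that distance per step. For this model the M-step is closed form, so the cost per iteration is comparable; the M-step only becomes expensive in models where it has to be solved numerically.
Starting from μ=[−1, 1] with π=[0.5, 0.5] (which the symmetry of the data preserves): one EM iteration gives μ ≈ [−1.97, 1.97], and after 3 iterations μ ≈ [−2.00, 2.00]. Three gradient ascent steps with an illustrative learning rate η = 0.1 (compare 1/N₁ = 0.4) reach only μ ≈ [−1.57, 1.57]. EM converges faster here because it jumps to the conditional optimum each iteration.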
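The comparison can be reproduced with a few lines of Python. The sketch below assumes π = [0.5, 0.5] and σ² = 1 are held fixed and updates both means; the learning rate `eta = 0.1` is a hypothetical choice for illustration, since the problem does not fix one.

```python
import math

xs = [-3.0, -2.0, 0.0, 2.0, 3.0]

def resp1(x, mu1, mu2):
    """gamma_i1 with pi = [0.5, 0.5] and variance 1."""
    return 1 / (1 + math.exp(((x - mu1) ** 2 - (x - mu2) ** 2) / 2))

def em_update(mu1, mu2):
    g1 = [resp1(x, mu1, mu2) for x in xs]
    g2 = [1 - g for g in g1]
    return (sum(g * x for g, x in zip(g1, xs)) / sum(g1),
            sum(g * x for g, x in zip(g2, xs)) / sum(g2))

def grad(mu1, mu2):
    """Gradient of the log-likelihood with respect to (mu1, mu2)."""
    g1 = [resp1(x, mu1, mu2) for x in xs]
    return (sum(g * (x - mu1) for g, x in zip(g1, xs)),
            sum((1 - g) * (x - mu2) for g, x in zip(g1, xs)))

eta = 0.1  # hypothetical fixed learning rate

def ga_step(mu1, mu2):
    d1, d2 = grad(mu1, mu2)
    return mu1 + eta * d1, mu2 + eta * d2

def run(step, iters=3, start=(-1.0, 1.0)):
    mu = start
    for _ in range(iters):
        mu = step(*mu)
    return mu

em_mu = run(em_update)
ga_mu = run(ga_step)
print("EM after 3 iterations:      ", [round(m, 3) for m in em_mu])
print("Gradient ascent, 3 steps:   ", [round(m, 3) for m in ga_mu])
```

Replacing `eta` by 1/N₁ (recomputed at every step) would make the gradient step for μ₁ coincide with the EM update, which is one way to see why EM needs no learning rate.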
Quiz
Q1: For GMMs, what does the M-step of EM compute?
A) Hard cluster labels for every data point B) Closed-form updates for the weights, means, and covariances, each a responsibility-weighted version of the standard MLE C) The posterior probability of each component for each point D) A gradient step on the marginal log-likelihood with a tuned learning rate
Correct: B)
- If you chose A: This is incorrect. Hard labels belong to hard EM and k-means; standard EM keeps the soft responsibilities γ_{ik}.
- If you chose B: With N_k = Σ_i γ_{ik}, the updates are π_k = N_k/n, a weighted mean for μ_k, and a weighted covariance for Σ_k. Correct!
- If you chose C: This is incorrect. Computing the posteriors γ_{ik} is the E-step, not the M-step.
- If you chose D: This is incorrect. The GMM M-step solves for the maximizer in closed form; no learning rate is involved.
Q2: How does k-means relate to EM for GMMs?
A) k-means is the hard-assignment limit (σ² → 0) of EM for GMMs with equal spherical covariances Σ_k = σ²I B) k-means is EM for GMMs with full, unequal covariance matrices C) EM is the special case of k-means with K = 2 D) The two algorithms are unrelated
Correct: A)
- If you chose A: As σ² → 0 the responsibilities become 0 or 1 and the M-step becomes the centroid update of Lloyd's algorithm. Correct!
- If you chose B: This is incorrect. Unequal covariances are something EM adds on top of k-means; the k-means limit requires a shared spherical covariance.
- If you chose C: This is incorrect. The relationship runs the other way: EM generalizes k-means, for any K.
- If you chose D: This is incorrect. k-means arises directly from the GMM E-step and M-step in the limit σ² → 0.
Q3: Which statement about EM convergence is TRUE?
A) Near the optimum EM converges linearly, and it is slow when most of the information is missing B) EM converges quadratically, like Newton's method C) EM can decrease the marginal likelihood on some iterations D) EM always reaches the global maximum of the likelihood
Correct: A)
- If you chose A: The error shrinks by a factor ρ per iteration, and ρ approaches 1 as the fraction of missing information grows. Correct!
- If you chose B: This is incorrect. EM's convergence is linear, which is why acceleration methods such as Aitken's acceleration exist.
- If you chose C: This is incorrect. Monotonicity guarantees log p(x; θ^{(t+1)}) ≥ log p(x; θ^{(t)}) at every iteration.
- If you chose D: This is incorrect. EM converges to a local maximum (or saddle point), so multiple restarts are needed.
Q4: In Example 1, what happens after one EM iteration from π=[0.5, 0.5], μ=[−1, 1]?
A) Both means move toward 0 B) The means stay at [−1, 1] and only the weights change C) Both means move away from 0 toward their clusters, and π₂ becomes larger than π₁ D) The two means swap signs
Correct: C)
- If you chose A: This is incorrect. Each mean is pulled toward the points it is responsible for, which lie further from 0 than the initial means.
- If you chose B: This is incorrect. The M-step updates the means as responsibility-weighted averages, so they move as well.
- If you chose C: The M-step gives μ₁ < −1 and μ₂ > 1, and N₂ > N₁ because three of the five points are positive. Correct!
- If you chose D: This is incorrect. Component 1 keeps high responsibility for the negative points and component 2 for the positive points.
Q5: How is the Baum-Welch algorithm related to EM?
A) Baum-Welch is a gradient-based alternative to EM B) Baum-Welch is EM applied to Hidden Markov Models: forward-backward in the E-step, re-estimation of transition and emission probabilities in the M-step C) The two are unrelated D) Baum-Welch is the hard-assignment limit of EM for GMMs
Correct: B)
- If you chose A: This is incorrect. Baum-Welch is not an alternative to EM; it is EM specialized to HMMs.
- If you chose B: EM applies to any model with latent variables, and for HMMs it goes by the name Baum-Welch. Correct!
- If you chose C: This is incorrect. Baum-Welch is one of the standard instances of EM, alongside GMMs, PPCA, and topic models.
- If you chose D: This is incorrect. The hard-assignment limit of EM for GMMs is k-means.
Q6: What is a common pitfall when working with the EM algorithm?
A) Running EM from several random initializations B) Initializing with k-means++ C) Assuming that EM finds the global optimum D) Monitoring the relative change in log-likelihood to decide when to stop
Correct: C)
- If you chose A: This is incorrect. Multiple restarts are the recommended remedy for local maxima, not a pitfall.
- If you chose B: This is incorrect. k-means++ spaces the initial centroids apart and helps avoid component collapse.
- If you chose C: EM only converges to a local maximum (or saddle point), and the result depends on initialization. Correct!
- If you chose D: This is incorrect. A tight tolerance on the relative change in log-likelihood is the recommended stopping rule.
Q7: Why does EM work with the complete-data log-likelihood log p_θ(x, z)?
A) It is typically much easier to maximize than the marginal log p_θ(x), which contains the log of an integral over z B) It is an upper bound on log p_θ(x) for every z C) It does not depend on θ D) It is always equal to the marginal log-likelihood
Correct: A)
- If you chose A: If z were observed, log p_θ(x, z) would be easy to maximize; EM exploits this by taking its expectation under the current posterior. Correct!
- If you chose B: This is incorrect. The quantity that bounds log p_θ(x) is the ELBO L(q, θ), and it is a lower bound.
- If you chose C: This is incorrect. The M-step maximizes the expected complete-data log-likelihood over θ, so it must depend on θ.
- If you chose D: This is incorrect. The marginal integrates z out of p_θ(x, z); the two differ whenever z is uncertain.
Practice Problems
Problem 1
Derive the E-step responsibility γ_{ik} for a GMM. Show that it equals the posterior probability P(z_i=k | x_i).
Answer
By Bayes' rule:
$γ_{ik} = P(z_i=k | x_i) = p(z_i=k, x_i) / p(x_i)
= π_k · N(x_i | μ_k, Σ_k) / Σ_j π_j · N(x_i | μ_j, Σ_j)
$
This is the posterior probability that component k generated x_i. The denominator normalizes across components. The E-step simply computes these posteriors under the current parameter estimates — no optimization required.
Problem 2
Show that k-means is a limiting case of EM for GMMs as all Σ_k = σ²I and σ² → 0.
Answer
With Σ_k = σ²I, the Gaussian normalizing constants cancel:
$γ_{ik} = π_k exp(−||x_i−μ_k||²/(2σ²)) / Σ_j π_j exp(−||x_i−μ_j||²/(2σ²))
$
Divide numerator and denominator by the largest exponential term (corresponding to the closest cluster k*):
$γ_{ik} = π_k exp(−(d_k²−d_{k*}^2)/(2σ²)) / Σ_j π_j exp(−(d_j²−d_{k*}^2)/(2σ²))
$
where d_k² = ||x_i−μ_k||². For k ≠ k*, d_k²−d_{k*}^2 > 0 (assuming no ties), so as σ²→0 the exponent −(d_k²−d_{k*}^2)/(2σ²) → −∞ and the term vanishes. Only k* survives: γ_{i,k*} → 1, all others → 0.
The M-step becomes:
$μ_k = (1/N_k) Σ_{i: assigned to k} x_i
$
This is exactly the k-means centroid update, and π_k = N_k/n becomes the fraction of points in cluster k. Because any positive weights π_k drop out of the assignments as σ²→0, the procedure reduces to standard k-means.
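The limiting procedure (hard assignment followed by a centroid update) is Lloyd's algorithm, which can be sketched in a few lines for 1D data. The data and starting means below are illustrative, and keeping the old mean when a cluster ends up empty is an assumption added here, since the derivation does not cover that case.

```python
def lloyd_step(xs, mus):
    """One hard-EM iteration: assign each point to its nearest mean, then average."""
    clusters = [[] for _ in mus]
    for x in xs:
        # Hard E-step: index of the closest centroid
        k = min(range(len(mus)), key=lambda j: (x - mus[j]) ** 2)
        clusters[k].append(x)
    # Hard M-step: plain mean of the assigned points
    # (an empty cluster keeps its old mean; this is an assumption)
    return [sum(c) / len(c) if c else m for c, m in zip(clusters, mus)]

xs = [-2.0, -1.5, 1.0, 1.5, 2.0]
mus = [-1.0, 1.0]
for _ in range(5):
    mus = lloyd_step(xs, mus)
print(mus)
```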
Problem 3
EM converges to a local maximum. Give a concrete 1D GMM example where EM from a poor initialization converges to a bad local maximum, and explain why.
Answer
Data: x = [−10, −9, −8, 8, 9, 10]. Fit a K=2 GMM.
**Good initialization:** μ₁≈−9, μ₂≈9. EM converges to the correct solution: one component centered on the negative cluster, one on the positive.
**Bad initialization:** μ₁=μ₂=0 with equal weights and variances. Both components start in the MIDDLE where there's no data, and every responsibility is exactly 0.5. After 1 iteration both means move to the global mean 0, and they stay identical forever: EM is stuck at a stationary point where the two components act as a single Gaussian. A nearly identical start (e.g., μ₁=0, μ₂=0.1 with a large initial variance) is not strictly stuck, but the responsibilities stay close to 0.5 and the components separate only slowly.
**Both means on one side:** μ₁=−9, μ₂=−8. The responsibilities for each point sum to 1, so the right cluster (8, 9, 10) cannot be ignored: the nearer component (μ₂) takes almost all of its responsibility and is pulled toward it. With only two well-separated clusters EM typically recovers from this start. Genuinely bad local maxima are more common for K ≥ 3, for example when two components share one cluster while a single component spans two others.
EM is sensitive to initialization. Multiple random restarts and k-means++ initialization are standard practice.
Problem 4
The EM algorithm can be viewed as coordinate ascent on the ELBO L(q, θ). Express this clearly and explain why this guarantees monotonicity.
Answer
ELBO and the two steps:
$L(q, θ) = E_q[log p_θ(x,z)] − E_q[log q(z)]
E-step: q^{(t)} = argmax_q L(q, θ^{(t)}) ⇒ q^{(t)} = p(z|x; θ^{(t)})
M-step: θ^{(t+1)} = argmax_θ L(q^{(t)}, θ)
$
This is block coordinate ascent on two blocks (q and θ). Since each step maximizes over its block holding the other fixed, the objective never decreases:
$L(q^{(t)}, θ^{(t)}) ≤ L(q^{(t)}, θ^{(t+1)}) ≤ L(q^{(t+1)}, θ^{(t+1)})
$
Since the E-step makes the bound tight, L(q^{(t)}, θ^{(t)}) = log p(x; θ^{(t)}) and L(q^{(t+1)}, θ^{(t+1)}) = log p(x; θ^{(t+1)}), so we get:
$log p(x; θ^{(t)}) ≤ log p(x; θ^{(t+1)})
$
Coordinate ascent on the ELBO guarantees monotonic likelihood improvement. ✓
Problem 5
Why does EM converge slowly when there's a lot of "missing information"? Derive the rate in terms of the observed vs complete-data Fisher information.
Answer
Near the optimum θ*, the EM update approximates:
$θ^{(t+1)} − θ* ≈ (I − I_{oc}) (θ^{(t)} − θ*)
$
where I_{oc} = I_c^{−1} I_o and I_o, I_c are the observed and complete-data Fisher information matrices.
The convergence rate is governed by the largest eigenvalue of I − I_{oc}, which equals:
$ρ = 1 − λ_min(I_{oc})
$
The "missing information principle": I_o = I_c − I_missing where I_missing is the information lost by not observing z. When I_missing is large relative to I_c, I_{oc} has small eigenvalues → ρ ≈ 1 → slow convergence.
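In the extreme case where x carries almost no information about z, I_o is small relative to I_c, ρ ≈ 1, and EM barely moves. This happens in GMMs with heavily overlapping components: the posterior over z stays diffuse, so each E-step fills in the latents with little certainty and the parameters take many iterations to settle. With well-separated components the posterior is nearly deterministic, little information is missing (I_o ≈ I_c), and EM converges in a few iterations.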
Summary
- EM alternates between computing posterior responsibilities (E-step) and maximizing the expected complete-data log-likelihood (M-step) — guaranteeing monotonic likelihood improvement
- For GMMs, EM provides closed-form updates for means, covariances, and weights — each M-step is a weighted version of standard MLE
- EM generalizes k-means — k-means is the hard-assignment limit (σ²→0) of EM for GMMs with equal spherical covariances
- EM converges linearly near the optimum — slow when most information is missing; multiple restarts needed to avoid local maxima
- EM applies broadly to any model with latent variables: HMMs (Baum-Welch), PPCA, factor analysis, topic models, and more
Pitfalls
- Assuming EM finds the global optimum. EM is only guaranteed to converge to a stationary point of the likelihood (usually a LOCAL maximum), not the global maximum. Given the data, the algorithm's trajectory is entirely determined by initialization. For GMMs with poorly separated clusters, different initializations can lead to completely different solutions. Always run EM from multiple random starting points and select the run with the highest final likelihood.
- Using poor initialization that causes component collapse. In GMMs, if two component means start too close together, they can converge to the same cluster (component collapse), wasting model capacity. If a component's responsibility sum N_k approaches zero, the component's covariance becomes undefined and the algorithm fails. Use k-means++ initialization (which spaces initial centroids apart) or run k-means first and use the resulting centroids as EM's starting point.
- Stopping EM too early when convergence is slow. Near the optimum, EM converges linearly with rate ρ ≈ 1 when the fraction of missing information is high. The likelihood may still be improving meaningfully even when parameter changes appear tiny. Use a tight tolerance on the RELATIVE change in log-likelihood (e.g., 1e-6), not on parameter changes. In GMMs with heavily overlapping components, EM can require hundreds or thousands of iterations to fully converge.
- Confusing hard EM (classification) with soft EM. Hard EM assigns each point to exactly one component (γ_{ik} ∈ {0,1}), while standard EM uses soft assignments (γ_{ik} ∈ [0,1]). Hard EM is equivalent to k-means for GMMs with equal spherical covariances and is simpler, but it discards uncertainty information and is more prone to poor local optima. Use soft EM unless you have a specific reason for hard assignments: keeping the uncertainty typically leads to better solutions.
- Using EM when direct gradient-based optimization would be faster. EM's M-step requires solving for the maximum of the expected complete-data log-likelihood — which may itself be an optimization problem. For some models (e.g., mixture of non-Gaussian distributions), the M-step has no closed form and requires numerical optimization, negating EM's simplicity advantage. In these cases, direct gradient ascent on the marginal log-likelihood with automatic differentiation (using modern frameworks like JAX or PyTorch) can be simpler and faster.
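Two of the remedies above (multiple random restarts, and stopping on the relative change in log-likelihood) can be combined in a short sketch. The model is kept deliberately simple: a 1D GMM with a known shared variance where only weights and means are updated. The function names, the seed, the number of restarts, and the data set are assumptions made for illustration.

```python
import math
import random

def pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def loglik(xs, pis, mus, var):
    return sum(math.log(sum(p * pdf(x, m, var) for p, m in zip(pis, mus)))
               for x in xs)

def fit(xs, mus, var=1.0, rel_tol=1e-6, max_iter=1000):
    """EM with a known shared variance; returns (log-likelihood, weights, means).

    Stops when the relative change in log-likelihood is at most rel_tol.
    """
    K = len(mus)
    pis = [1.0 / K] * K
    ll = loglik(xs, pis, mus, var)
    for _ in range(max_iter):
        # E-step
        dens = [[p * pdf(x, m, var) for p, m in zip(pis, mus)] for x in xs]
        gamma = [[d / sum(row) for d in row] for row in dens]
        # M-step
        Nk = [sum(g[k] for g in gamma) for k in range(K)]
        pis = [n / len(xs) for n in Nk]
        mus = [sum(g[k] * x for g, x in zip(gamma, xs)) / Nk[k] for k in range(K)]
        new_ll = loglik(xs, pis, mus, var)
        converged = abs(new_ll - ll) <= rel_tol * abs(ll)
        ll = new_ll
        if converged:
            break
    return ll, pis, mus

xs = [-10.0, -9.0, -8.0, 8.0, 9.0, 10.0]
rng = random.Random(0)   # fixed seed so the restarts are reproducible
runs = [fit(xs, [rng.uniform(min(xs), max(xs)) for _ in range(2)])
        for _ in range(10)]
best_ll, best_pis, best_mus = max(runs, key=lambda r: r[0])
print(round(best_ll, 4), [round(m, 3) for m in sorted(best_mus)])
```

Selecting by final log-likelihood rather than by the last parameter change is what makes the restarts comparable with each other.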
Key Terms
| Term | Definition |
|---|---|
| EM algorithm | Iterative MLE for latent variable models: E-step computes expected complete-data log-likelihood, M-step maximizes it |
| Complete-data likelihood | p_θ(x, z) — the likelihood if we observed the latents; typically simpler than the marginal p_θ(x) |
| Responsibility | γ_{ik} = P(z_i=k \| x_i; θ), the posterior probability that component k generated x_i |
| ELBO | L(q,θ) = E_q[log p_θ(x,z)] − E_q[log q(z)] — lower bound on log p(x); EM performs coordinate ascent on this bound |
| Monotonicity | EM never decreases the marginal likelihood: log p(x; θ^{(t+1)}) ≥ log p(x; θ^{(t)}) |
| GMM | Gaussian Mixture Model: p(x) = Σ_k π_k N(x \| μ_k, Σ_k) |
| Hard EM | Assign each point to its most likely component (argmax of responsibility); k-means is a special case |
| Baum-Welch | EM applied to Hidden Markov Models — forward-backward for E-step, re-estimation for M-step |
Next Steps
Continue to 21-05 — Exponential Family to learn the unifying mathematical framework behind Gaussian, Bernoulli, Beta, Gamma, Dirichlet, Poisson, and many other distributions — and why they make EM, VI, and MLE analytically tractable.