Phase 11: Probability Theory II
Subject 11-06: Limit Theorems
Prerequisites: 10-06 (Expectation), 10-08 (Normal Distribution), 10-10 (Joint Distributions), basic limits
Learning Objectives
- State the Weak Law of Large Numbers (WLLN) and Strong Law of Large Numbers (SLLN) and explain the difference
- State, prove, and apply the Central Limit Theorem (CLT) for i.i.d. random variables
- Use the CLT to approximate probabilities for sums and sample means with continuity correction
- Define convergence in probability, almost sure convergence, and convergence in distribution
- State the Berry-Esseen theorem for bounding CLT approximation error
Core Content
1. The Law of Large Numbers
Let X₁, X₂, ..., Xₙ be i.i.d. with mean μ = E[Xᵢ] and finite variance σ².
Weak Law of Large Numbers (WLLN): For any ε > 0:
P(|X̄ₙ − μ| > ε) → 0 as n → ∞
The sample mean converges in probability to μ. For any fixed margin ε, you can make the probability of deviation arbitrarily small by taking n large enough.
ⓘ Convergence in probability: Xₙ → X in probability if for every ε > 0, P(|Xₙ − X| > ε) → 0.
Proof of WLLN (Chebyshev's inequality): E[X̄ₙ] = μ, Var(X̄ₙ) = σ²/n. By Chebyshev:
$P(|X̄ₙ − μ| > ε) ≤ Var(X̄ₙ)/ε² = σ²/(nε²) → 0 $
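The Chebyshev bound in this proof can be compared with simulated deviation frequencies. The sketch below is illustrative only: it assumes Uniform(0, 1) summands (so μ = 0.5 and σ² = 1/12), and the function names are hypothetical helpers, not part of any library.

```python
import random


def chebyshev_bound(sigma2: float, n: int, eps: float) -> float:
    """Chebyshev upper bound sigma^2 / (n * eps^2) on P(|mean - mu| > eps), capped at 1."""
    return min(1.0, sigma2 / (n * eps ** 2))


def empirical_deviation_prob(n: int, eps: float, trials: int = 1000, seed: int = 0) -> float:
    """Fraction of simulated means of n Uniform(0, 1) draws that land farther than eps from 0.5."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            misses += 1
    return misses / trials


sigma2 = 1 / 12  # variance of Uniform(0, 1); an assumption of this example
for n in (10, 100, 1000):
    print(n, empirical_deviation_prob(n, eps=0.05), chebyshev_bound(sigma2, n, eps=0.05))
```

The simulated frequencies should sit well below the bound, which is expected: Chebyshev holds for every distribution with the given variance, so it is rarely tight for any particular one.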
Strong Law of Large Numbers (SLLN):
$P(lim_{n→∞} X̄ₙ = μ) = 1 $
The sample mean converges almost surely to μ. Not just "probably close" — with probability 1, the sequence of sample means actually approaches μ in the limit. This is stronger: almost sure convergence ⇒ convergence in probability (but not conversely).
ⓘ Almost sure convergence: Xₙ → X a.s. if P(lim Xₙ = X) = 1.
⚠️ CRITICAL distinction: WLLN says "for large n, X̄ₙ is probably close to μ." SLLN says "eventually, X̄ₙ stays close to μ forever." For practical purposes, both justify using the sample mean to estimate μ for large n. The SLLN is the theoretical foundation of Monte Carlo integration.
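To make the Monte Carlo connection concrete, here is a minimal sketch that estimates ∫₀¹ x² dx = 1/3 by a running sample mean of f(U) with U ~ Uniform(0, 1). The integrand and the helper name `running_means` are assumptions chosen for illustration.

```python
import random
from typing import Callable, List


def running_means(f: Callable[[float], float], n: int, seed: int = 0) -> List[float]:
    """Running sample means of f(U_1), ..., f(U_n) for U_i ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    means = []
    for i in range(1, n + 1):
        total += f(rng.random())
        means.append(total / i)  # sample mean after i draws
    return means


means = running_means(lambda x: x * x, 100_000)
for n in (10, 1_000, 100_000):
    print(n, means[n - 1])  # one sample path; it should settle near 1/3
```

A single simulated path illustrates the SLLN statement (this one sequence of means converges), whereas the WLLN is a statement about probabilities across many repetitions at a fixed n.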
2. The Central Limit Theorem (CLT)
Theorem: Let X₁, X₂, ..., Xₙ be i.i.d. with mean μ and finite variance σ². Then:
$√n (X̄ₙ − μ) / σ → N(0, 1) in distribution $
Equivalently:
$(X̄ₙ − μ) / (σ/√n) → N(0, 1) $
Or, for the sum Sₙ = Σ Xᵢ:
$(Sₙ − nμ) / (σ√n) → N(0, 1) $
In words: For large n, the sample mean is approximately normally distributed with mean μ and variance σ²/n — REGARDLESS of the original distribution (as long as it has finite variance). This is the single most important result in probability and statistics.
ⓘ Convergence in distribution: Xₙ → X in distribution if F_{Xₙ}(x) → F_X(x) at all continuity points of F_X.
What the CLT does NOT say:
- It does NOT say X̄ₙ is exactly normal for any finite n; it is an asymptotic approximation.
- It does NOT say individual Xᵢ are normal; they can be anything (discrete, skewed, bounded).
- It does NOT work if σ² is infinite (e.g., the Cauchy distribution: no CLT!).
Why the normal distribution is universal: The normal is a "fixed point" under the operation of summing independent random variables. The CLT explains why the normal appears everywhere in nature — it's the universal limit of sums of many small independent effects.
Practical rule of thumb: CLT approximation is usually adequate for n ≥ 30, though highly skewed distributions may require larger n.
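The theorem can be checked empirically. The sketch below assumes Exponential(1) summands (a skewed distribution with μ = σ = 1, chosen only for illustration), simulates the standardized sample mean, and measures the largest gap between its empirical CDF and Φ. Function names are hypothetical.

```python
import random
from statistics import NormalDist
from typing import List


def standardized_means(n: int, trials: int = 5000, seed: int = 0) -> List[float]:
    """Simulate sqrt(n) * (mean - mu) / sigma for n i.i.d. Exponential(1) draws (mu = sigma = 1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(trials):
        mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append((n ** 0.5) * (mean - 1.0))
    return out


def max_cdf_gap(samples: List[float]) -> float:
    """Largest gap between the empirical CDF of the samples and the standard normal CDF."""
    phi = NormalDist().cdf
    xs = sorted(samples)
    m = len(xs)
    # Check the empirical CDF just before and just after each jump.
    return max(max(abs((i + 1) / m - phi(x)), abs(i / m - phi(x))) for i, x in enumerate(xs))


for n in (1, 5, 50):
    print(n, round(max_cdf_gap(standardized_means(n)), 3))
```

The gap should shrink as n grows, although part of what remains at large n is simulation noise from using a finite number of trials.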
3. CLT Applications
Probability approximation:
$P(X̄ₙ ≤ a) ≈ Φ((a − μ) / (σ/√n)) $
Binomial proportion: For X ~ Binomial(n, p) = Σ_{i=1}^{n} Bernoulli(p):
$(X − np) / √(np(1−p)) ≈ N(0, 1) $
or equivalently, the sample proportion p̂ = X/n:
$p̂ ≈ N(p, p(1−p)/n) $
Continuity correction (for integer-valued sums): When approximating a discrete distribution with a continuous one, adjust by ±0.5:
- P(X ≤ k) ≈ Φ((k + 0.5 − np) / √(np(1−p)))
- P(X ≥ k) ≈ 1 − Φ((k − 0.5 − np) / √(np(1−p)))
Without continuity correction, the approximation can be noticeably off for moderate n.
Poisson approximation to binomial (when p is small): When n is large and p is small (np < 10), Poisson(np) is typically a more accurate approximation to Binomial(n, p) than the normal.
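A quick numerical comparison at a single point illustrates the Poisson remark. The values n = 100, p = 0.02, k = 1 are assumptions chosen for this example, and one point does not prove the general rule.

```python
from math import comb, exp, factorial, sqrt
from statistics import NormalDist


def binom_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))


def poisson_cdf(k: int, lam: float) -> float:
    """P(Y <= k) for Y ~ Poisson(lam)."""
    return sum(exp(-lam) * lam ** j / factorial(j) for j in range(k + 1))


def normal_cdf_cc(k: int, n: int, p: float) -> float:
    """Normal approximation to P(X <= k) with continuity correction."""
    return NormalDist(n * p, sqrt(n * p * (1 - p))).cdf(k + 0.5)


n, p, k = 100, 0.02, 1  # illustrative values, np = 2
exact = binom_cdf(k, n, p)
print("exact  ", exact)
print("poisson", poisson_cdf(k, n * p), "error", abs(poisson_cdf(k, n * p) - exact))
print("normal ", normal_cdf_cc(k, n, p), "error", abs(normal_cdf_cc(k, n, p) - exact))
```

At this point the Poisson error is more than an order of magnitude smaller than the normal error, even with the continuity correction applied.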
4. Berry-Esseen Theorem
The CLT says convergence happens — Berry-Esseen tells us how fast:
Let Fₙ be the CDF of √n (X̄ₙ − μ)/σ and Φ be the standard normal CDF. Then:
$sup_x |Fₙ(x) − Φ(x)| ≤ C · E[|X − μ|³] / (σ³ √n) $
where C is a universal constant (proven to be ≤ 0.4748).
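In words: The maximum error in the CLT approximation shrinks at rate 1/√n, with the constant depending on the standardized third absolute moment E[|X − μ|³]/σ³. This quantity is related to skewness but is not the same thing: it is strictly positive even for symmetric distributions. Distributions with a larger third absolute moment need larger n for a good normal approximation.
Note: The bound is a worst case. For symmetric distributions (skewness = 0), the actual CLT error is often far smaller than the bound suggests, and the approximation can be remarkably good even for very small n (e.g., n=5 for Uniform).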
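As a rough numerical check, the sketch below evaluates the Berry-Esseen bound for Bernoulli(p) summands, using the constant 0.4748 quoted above, and compares it with the actual worst-case CDF gap of the standardized Binomial(n, p) sum. The function names are hypothetical, and the Bernoulli case is chosen only because its third absolute moment has a closed form.

```python
from math import comb, sqrt
from statistics import NormalDist

C_BE = 0.4748  # universal constant as quoted in the text


def berry_esseen_bound_bernoulli(n: int, p: float) -> float:
    """C * E|X - p|^3 / (sigma^3 * sqrt(n)) for X ~ Bernoulli(p)."""
    q = 1 - p
    sigma = sqrt(p * q)
    rho = p * q * (p * p + q * q)  # E|X - p|^3 = p*q^3 + q*p^3
    return C_BE * rho / (sigma ** 3 * sqrt(n))


def sup_error_binomial(n: int, p: float) -> float:
    """sup_x |F_n(x) - Phi(x)| for the standardized Binomial(n, p) sum."""
    phi = NormalDist().cdf
    mu, sd = n * p, sqrt(n * p * (1 - p))
    cdf_before, worst = 0.0, 0.0
    for k in range(n + 1):
        cdf_after = cdf_before + comb(n, k) * p ** k * (1 - p) ** (n - k)
        target = phi((k - mu) / sd)
        # A step function's gap to Phi peaks just before or just after a jump.
        worst = max(worst, abs(cdf_before - target), abs(cdf_after - target))
        cdf_before = cdf_after
    return worst


for n in (10, 100, 1000):
    print(n, sup_error_binomial(n, 0.5), berry_esseen_bound_bernoulli(n, 0.5))
```

For Binomial(100, 0.5) the bound is 0.04748 and the actual gap is about 0.040, so the bound is respected without much slack. For this lattice distribution most of the gap comes from the jump in the binomial CDF at the mean, which is what the continuity correction addresses.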
5. Relationship Between Convergence Modes
Almost Sure ⇒ In Probability ⇒ In Distribution
In Probability ⇐ In Distribution (constant limits only)
- Almost sure convergence ⇒ convergence in probability (always)
- Convergence in probability ⇒ convergence in distribution (always)
- Convergence in probability ⇒ almost sure convergence along some subsequence (the limit need not be a constant)
- Convergence in distribution to a CONSTANT ⇒ convergence in probability
The CLT gives convergence in distribution (to a non-constant limit, the normal). The WLLN gives convergence in probability. The SLLN gives almost sure convergence. Both laws of large numbers have the constant limit μ; only the CLT, which rescales the deviation X̄ₙ − μ by √n/σ, has a non-degenerate limit.
Key Terms
- Continuity correction
- Poisson approximation to binomial
Worked Examples
Example 1: WLLN Application — Polling
A pollster wants to estimate the proportion p of voters supporting a candidate to within ±0.03 with 95% confidence. How large a sample is needed?
Solution:
Let Xᵢ = 1 if voter i supports candidate, 0 otherwise. E[Xᵢ] = p, Var(Xᵢ) = p(1−p).
By Chebyshev: P(|p̂ − p| > 0.03) ≤ Var(p̂)/(0.03)² = p(1−p) / (n · 0.0009).
Worst case: p(1−p) ≤ 1/4. So P(error > 0.03) ≤ (1/4)/(n·0.0009) = 1/(0.0036n).
We want this ≤ 0.05: 1/(0.0036n) ≤ 0.05 → n ≥ 1/(0.0036·0.05) = 1/0.00018 ≈ 5556.
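Using the CLT instead (an approximation rather than a guaranteed bound): n ≈ (z_{0.025})² / (4·0.03²) = (1.96)² / (4·0.0009) ≈ 1067.1, so take n = 1068. The CLT requirement is far smaller because Chebyshev must hold for every distribution with the given variance and is therefore conservative.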
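Both sample-size calculations can be reproduced in a few lines. This is a sketch under the example's worst-case assumption p(1−p) ≤ 1/4; the function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_chebyshev(margin: float, alpha: float) -> int:
    """Smallest n with 1 / (4 * n * margin^2) <= alpha (worst case p = 1/2)."""
    return ceil(1 / (4 * alpha * margin ** 2))


def n_clt(margin: float, alpha: float) -> int:
    """Smallest n with z_{alpha/2} * sqrt(1 / (4n)) <= margin (worst case p = 1/2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z / (2 * margin)) ** 2)


print("Chebyshev:", n_chebyshev(0.03, 0.05))
print("CLT:      ", n_clt(0.03, 0.05))
```

Rounding up matters here: a sample size must be an integer that satisfies the inequality, so 5555.6 becomes 5556 and 1067.1 becomes 1068.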
Example 2: CLT for a Sum
The weight of a chocolate bar is a random variable with mean 100g and standard deviation 8g. You buy 50 bars (assumed independent). What is the approximate probability the total weight exceeds 5100g?
Solution:
S₅₀ = Σ Xᵢ. E[S] = 50·100 = 5000. Var(S) = 50·64 = 3200. σ_S = √3200 ≈ 56.57.
Z = (5100 − 5000)/56.57 = 100/56.57 ≈ 1.768.
P(S > 5100) = 1 − Φ(1.768) ≈ 1 − 0.9615 = 0.0385. About 3.85% chance.
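The same calculation can be done without a Φ table by using the standard library's `statistics.NormalDist`. The helper name below is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def prob_sum_exceeds(threshold: float, n: int, mu: float, sigma: float) -> float:
    """CLT approximation to P(S_n > threshold) for a sum of n i.i.d. terms."""
    total = NormalDist(mu=n * mu, sigma=sigma * sqrt(n))  # S_n is approx. N(n*mu, n*sigma^2)
    return 1 - total.cdf(threshold)


p = prob_sum_exceeds(5100, n=50, mu=100, sigma=8)
print(round(p, 4))
```

This agrees with the table-based answer of about 0.0385; small differences in the last digit come from rounding Z before the table lookup.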
Example 3: CLT with Continuity Correction
A fair coin is flipped 100 times. Find the approximate probability of getting between 45 and 55 heads inclusive.
Solution:
X ~ Binomial(100, 0.5). μ = 50, σ = √25 = 5.
Without continuity correction: P(45 ≤ X ≤ 55) ≈ Φ((55−50)/5) − Φ((45−50)/5) = Φ(1) − Φ(−1) = 0.8413 − 0.1587 = 0.6826.
With continuity correction: P(45 ≤ X ≤ 55) ≈ Φ((55.5−50)/5) − Φ((44.5−50)/5) = Φ(1.1) − Φ(−1.1) = 0.8643 − 0.1357 = 0.7286.
Exact (binomial sum): P(45 ≤ X ≤ 55) = Σ_{k=45}^{55} C(100,k)(0.5)¹⁰⁰ ≈ 0.7287.
The continuity correction gives near-perfect agreement, while the uncorrected approximation is off by ~4.6 percentage points.
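The three numbers in this example can be recomputed directly. The sketch below uses only the standard library; the function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def exact_binomial_range(n: int, p: float, lo: int, hi: int) -> float:
    """Exact P(lo <= X <= hi) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(lo, hi + 1))


def normal_range(n: int, p: float, lo: int, hi: int, correction: bool = True) -> float:
    """Normal approximation to P(lo <= X <= hi), optionally with the +/- 0.5 correction."""
    shift = 0.5 if correction else 0.0
    approx = NormalDist(n * p, sqrt(n * p * (1 - p)))
    return approx.cdf(hi + shift) - approx.cdf(lo - shift)


print("uncorrected:", round(normal_range(100, 0.5, 45, 55, correction=False), 4))
print("corrected:  ", round(normal_range(100, 0.5, 45, 55), 4))
print("exact:      ", round(exact_binomial_range(100, 0.5, 45, 55), 4))
```

The uncorrected value comes out as 0.6827 rather than 0.6826 because the table lookup in the example rounds Φ(1) and Φ(−1) separately.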
Quiz
Q1: The Weak Law of Large Numbers (WLLN) states that for i.i.d. random variables with finite mean μ:
A) X̄_n → μ almost surely B) X̄_n converges in probability to μ C) X̄_n is exactly μ for all n D) √n(X̄_n − μ) converges to N(0, σ²)
Correct: B)
- If you chose B: Correct! WLLN: P(|X̄_n − μ| > ε) → 0 as n → ∞. The sample mean converges in probability.
- If you chose A: Almost sure convergence is the STRONG Law of Large Numbers (SLLN).
- If you chose C: The sample mean is random and equals μ only in expectation for finite n.
- If you chose D: This is the Central Limit Theorem, not the Law of Large Numbers.
Q2: The Central Limit Theorem states that for i.i.d. RVs with mean μ and variance σ²:
A) X̄_n ~ N(μ, σ²/n) for any n B) √n(X̄_n − μ)/σ converges in distribution to N(0, 1) C) X̄_n converges to a constant D) The distribution of X̄_n is always normal
Correct: B)
- If you chose B: Correct! √n(X̄_n − μ)/σ → N(0, 1) in distribution, REGARDLESS of the original distribution (subject to finite variance).
- If you chose A: This is only exact when the Xᵢ are themselves normal; CLT is asymptotic.
- If you chose C: This is the LLN, not the CLT.
- If you chose D: For finite n, the distribution of X̄_n is generally not normal (it is exactly normal only when the Xᵢ themselves are normal); the standardized mean only approaches normality as n → ∞.
Q3: The CLT requires which condition on the underlying distribution?
A) The distribution must be symmetric B) Finite mean and variance C) The distribution must be continuous D) The distribution must be bounded
Correct: B)
- If you chose B: Correct! The classical CLT requires finite mean μ and variance σ² > 0. It works for discrete, continuous, symmetric, or skewed distributions.
- If you chose A: Symmetry is not required — CLT works for highly skewed distributions.
- If you chose C: CLT applies equally to discrete distributions (e.g., Bernoulli, Poisson).
- If you chose D: Unbounded distributions like Normal or Exponential satisfy CLT.
Q5: A 95% confidence interval for μ using the CLT is approximately:
A) X̄_n ± 1.96 · σ/√n B) X̄_n ± 1.96 · σ C) X̄_n ± σ/√n D) X̄_n ± 1.96 · σ²
Correct: A)
- If you chose A: Correct! The standard error is σ/√n, and 1.96 is the 97.5th percentile of N(0,1) for 95% coverage.
- If you chose B: Forgets to divide by √n — would give overly wide intervals.
- If you chose C: Uses only 1 standard error, giving ~68% confidence, not 95%.
- If you chose D: Uses σ² instead of σ — wrong units.
Practice Problems
- Using Chebyshev's inequality, find the smallest n such that P(|X̄ₙ − μ| > 0.1σ) ≤ 0.05.
- A casino's slot machine pays out with mean $0.95 and SD $5 per play. If 10,000 plays occur in a day, approximate P(total payout > $9,800).
- Show that if Xₙ → c in probability (c constant) and g is continuous at c, then g(Xₙ) → g(c) in probability.
- Verify the CLT for sums of uniform random variables: compute the exact distribution of the sum of 12 i.i.d. Uniform(0, 1) and compare its mean and variance to N(6, 1).
- State and prove the WLLN using Chebyshev's inequality. What additional condition does the SLLN require?
- An insurance company has 10,000 independent policies, each with a claim probability of 0.02. Claim amounts have mean $5000 and SD $2000. Approximate the probability total claims exceed $12M.
- Show that SLLN ⇒ WLLN, but WLLN ⇏ SLLN. (Construct a counterexample where convergence in probability holds but not almost surely.)
Answers
1. Chebyshev: P(|X̄ − μ| > 0.1σ) ≤ Var(X̄)/(0.01σ²) = (σ²/n)/(0.01σ²) = 1/(0.01n). Set ≤ 0.05 → n ≥ 2000.
2. S ≈ N(10000·0.95, 10000·25) = N(9500, 250000). σ_S = 500. Z = (9800−9500)/500 = 0.6. P = 1−Φ(0.6) ≈ 0.274.
3. Fix ε > 0. By continuity of g at c there is a δ > 0 such that |x − c| < δ implies |g(x) − g(c)| ≤ ε. Hence P(|g(Xₙ)−g(c)| > ε) ≤ P(|Xₙ−c| ≥ δ). Since Xₙ → c in probability, the RHS → 0.
4. Each Uᵢ has mean 0.5, variance 1/12. Sum of 12: mean = 12·0.5 = 6, variance = 12·(1/12) = 1. The exact distribution of the sum of 12 uniforms is a piecewise polynomial (Irwin-Hall), but the CLT says it is ≈ N(6,1), which is remarkably accurate even at n=12.
5. WLLN: E[X̄] = μ, Var(X̄) = σ²/n. P(|X̄−μ|>ε) ≤ σ²/(nε²) → 0. The SLLN needs no additional condition: it requires only a finite mean (the variance can be infinite!) and is proven via Kolmogorov's inequality and the Borel-Cantelli lemma.
6. Total claims S = Σ_{i=1}^{10000} XᵢYᵢ where Yᵢ~Bern(0.02) and Xᵢ|Yᵢ=1 has the given moments. E[S] = 10000·0.02·5000 = 1,000,000. Var(S) = 10000·[0.02·2000² + 0.02·0.98·5000²] = 10000·[80,000 + 490,000] = 10000·570,000 = 5.7×10⁹. σ_S = √(5.7×10⁹) ≈ 75,498. Z = (12,000,000 − 1,000,000)/75,498 ≈ 145.7 (off the charts: effectively zero probability).
7. SLLN ⇒ WLLN because a.s. convergence implies convergence in probability. Counterexample: Let Xₙ be independent with P(Xₙ=1)=1/n, P(Xₙ=0)=1−1/n. Then Xₙ → 0 in probability (for any 0 < ε < 1, P(|Xₙ|>ε)=1/n→0), but Σ P(Xₙ=1) = Σ 1/n = ∞, so by the second Borel-Cantelli lemma, P(Xₙ=1 i.o.) = 1, and a.s. convergence fails.
Summary
- WLLN: X̄ₙ → μ in probability; P(|X̄ₙ−μ|>ε) → 0. Proof via Chebyshev requires finite variance. The Chebyshev bound σ²/(nε²) on the deviation probability decays as O(1/n).
- SLLN: X̄ₙ → μ almost surely; P(lim X̄ₙ = μ) = 1. Stronger — implies WLLN. Requires only finite mean (Kolmogorov).
- CLT: √n (X̄ₙ−μ)/σ → N(0,1) in distribution. Regardless of original distribution (if finite variance). The universal limit theorem for sums.
- Berry-Esseen bounds max error by C·E[|X−μ|³]/(σ³√n). The convergence rate is 1/√n; the standardized third absolute moment determines the constant.
- Continuity correction (±0.5) improves CLT approximation for discrete sums. Poisson approximation is better than normal when np is small.
Pitfalls
- Confusing WLLN with SLLN: WLLN says X̄ₙ → μ in probability (P(|X̄ₙ−μ|>ε) → 0); SLLN says X̄ₙ → μ almost surely (P(lim X̄ₙ = μ) = 1). SLLN is strictly stronger. Don't claim the WLLN gives almost sure convergence.
- Treating the CLT as an exact finite-sample result: The CLT says √n(X̄ₙ−μ)/σ → N(0,1) in distribution as n → ∞. For any finite n, the distribution is only approximately normal. Assuming exact normality for n=5 from a skewed distribution will give poor results.
- Forgetting the continuity correction for discrete sums: When using the normal to approximate P(a ≤ X ≤ b) for a discrete random variable like the binomial, adjust bounds by ±0.5. Without it, the approximation can be off by several percentage points even at n=100.
- Applying the CLT to distributions without finite variance: The classical CLT requires σ² < ∞. The Cauchy distribution (no finite mean or variance) does not satisfy the CLT: the mean of n i.i.d. Cauchy observations has the same distribution as a single observation, regardless of n.
- Assuming convergence in distribution implies convergence in probability: Xₙ → X in distribution is the weakest mode. It does not imply Xₙ → X in probability unless the limit is a constant. The CLT converges to a non-degenerate normal, so it is only convergence in distribution.
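The Cauchy pitfall above is easy to see in simulation. The sketch below draws standard Cauchy variates by the inverse-CDF method, tan(π(U − 1/2)), and measures the interquartile range (IQR) of the sample mean for different n. The function name is hypothetical, and the IQR is used because the Cauchy has no variance to report.

```python
import math
import random
from statistics import quantiles


def cauchy_mean_iqr(n: int, trials: int = 2000, seed: int = 0) -> float:
    """Interquartile range of the mean of n standard Cauchy draws, estimated by simulation."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Inverse-CDF sampling: tan(pi * (U - 1/2)) is standard Cauchy for U ~ Uniform(0, 1).
        draws = (math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n))
        means.append(sum(draws) / n)
    q1, _, q3 = quantiles(means, n=4)
    return q3 - q1


for n in (1, 20, 200):
    print(n, round(cauchy_mean_iqr(n), 2))
```

A standard Cauchy has quartiles at ±1, so its IQR is 2. The simulated IQR of the mean should stay near 2 for every n instead of shrinking like 1/√n, which is what averaging buys when the variance is finite.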
Quiz
- The Weak Law of Large Numbers states that X̄ₙ: a) Equals μ for all n b) Converges to μ in probability c) Is normally distributed d) Has variance exactly σ²/n Answer: b. For any ε > 0, P(|X̄ₙ − μ| > ε) → 0 as n → ∞.
- The CLT says that for i.i.d. random variables with finite variance: a) X̄ₙ is exactly normal b) √n (X̄ₙ − μ)/σ converges in distribution to N(0, 1) c) Individual Xᵢ become normal as n increases d) σ/√n converges to 0 Answer: b. The standardized sample mean converges in distribution to standard normal.
- Almost sure convergence is ______ than convergence in probability: a) Weaker b) Stronger c) Equivalent to d) Unrelated to Answer: b. a.s. ⇒ in probability, but not conversely.
- A continuity correction in CLT approximations adds/subtracts: a) σ b) 0.5 c) 1 d) √n Answer: b. Adjusting by ±0.5 accounts for the gap between a discrete integer and a continuous interval.
- What does the Berry-Esseen theorem bound? a) The rate of convergence in the WLLN b) The maximum error in the CLT normal approximation c) The variance of the sample mean d) The probability of a Type I error Answer: b. |Fₙ(x) − Φ(x)| ≤ C·E[|X−μ|³]/(σ³√n); it quantifies how fast the CLT approximation improves.
- For X ~ Binomial(100, 0.5), the normal approximation works well because: a) p = 0.5 gives perfect symmetry b) n=100 is large and np(1−p)=25 > 10 c) The binomial is already normal d) Continuity correction is unnecessary Answer: b. Rule of thumb: np(1−p) ≥ 10 for good normal approximation. Here 25 >> 10.
- The SLLN requires: a) Finite variance b) Finite mean c) Normal distribution d) Independence (which can be relaxed) Answer: b. Kolmogorov's SLLN requires only E[|X|] < ∞. Finite variance is not needed.
- If Xₙ → X in distribution and g is a bounded continuous function, then: a) g(Xₙ) → g(X) in distribution b) g(Xₙ) → g(X) in probability c) Xₙ and X have the same mean d) Xₙ → X almost surely Answer: a. Continuous mapping theorem: convergence in distribution is preserved under continuous functions.
Next Steps
Continue to 11-07 Markov Chains (Discrete) to learn about transition matrices, stationary distributions, and the ergodic theorem for discrete-time Markov chains.