
Phase 11: Probability Theory II

Subject 11-06: Limit Theorems

Prerequisites: 10-06 (Expectation), 10-08 (Normal Distribution), 10-10 (Joint Distributions), basic limits


Learning Objectives

  1. State the Weak Law of Large Numbers (WLLN) and Strong Law of Large Numbers (SLLN) and explain the difference
  2. State, prove, and apply the Central Limit Theorem (CLT) for i.i.d. random variables
  3. Use the CLT to approximate probabilities for sums and sample means with continuity correction
  4. Define convergence in probability, almost sure convergence, and convergence in distribution
  5. State the Berry-Esseen theorem for bounding CLT approximation error

Core Content

1. The Law of Large Numbers

Let X₁, X₂, ..., Xₙ be i.i.d. with mean μ = E[Xᵢ] and finite variance σ².

Weak Law of Large Numbers (WLLN): For any ε > 0:

P(|X̄ₙ − μ| > ε) → 0    as n → ∞

The sample mean converges in probability to μ. For any fixed margin ε, you can make the probability of deviation arbitrarily small by taking n large enough.

ⓘ Convergence in probability: Xₙ → X in probability if for every ε > 0, P(|Xₙ − X| > ε) → 0.

Proof of WLLN (Chebyshev's inequality): E[X̄ₙ] = μ, Var(X̄ₙ) = σ²/n. By Chebyshev:

P(|X̄ₙ − μ| > ε) ≤ Var(X̄ₙ)/ε² = σ²/(nε²) → 0    as n → ∞
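The Chebyshev argument can be checked numerically. The following is a minimal sketch, assuming Uniform(0, 1) data (so μ = 0.5 and σ² = 1/12); the function names `deviation_frequency` and `chebyshev_bound` are illustrative and not part of the lesson. It estimates P(|X̄ₙ − μ| > ε) by simulation and prints it next to the bound σ²/(nε²).

```python
import random

def deviation_frequency(n, eps, trials=1000, seed=0):
    """Estimate P(|mean of n Uniform(0,1) draws - 0.5| > eps) by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            hits += 1
    return hits / trials

def chebyshev_bound(n, eps, variance=1 / 12):
    """Chebyshev upper bound sigma^2 / (n * eps^2), capped at 1."""
    return min(1.0, variance / (n * eps ** 2))

# Both columns shrink as n grows; the simulated frequency sits below the bound.
for n in (10, 100, 1000):
    print(n, deviation_frequency(n, 0.05), chebyshev_bound(n, 0.05))
```

The simulated frequencies are usually far below the bound, which reflects how conservative Chebyshev's inequality is.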

Strong Law of Large Numbers (SLLN):

P(lim_{n→∞} X̄ₙ = μ) = 1

The sample mean converges almost surely to μ. Not just "probably close" — with probability 1, the sequence of sample means actually approaches μ in the limit. This is stronger: almost sure convergence ⇒ convergence in probability (but not conversely).

ⓘ Almost sure convergence: Xₙ → X a.s. if P(lim Xₙ = X) = 1.

⚠️ CRITICAL distinction: WLLN says "for large n, X̄ₙ is probably close to μ." SLLN says "eventually, X̄ₙ stays close to μ forever." For practical purposes, both justify using the sample mean to estimate μ for large n. The SLLN is the theoretical foundation of Monte Carlo integration.
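As an illustration of the Monte Carlo connection, here is a small sketch that estimates ∫₀¹ x² dx = 1/3 as the sample mean of g(U) = U² for U ~ Uniform(0, 1). The helper name `monte_carlo_mean` and the choice of integrand are assumptions made for this example only.

```python
import random

def monte_carlo_mean(g, n, seed=0):
    """Average g(U) over n independent Uniform(0,1) draws U."""
    rng = random.Random(seed)
    return sum(g(rng.random()) for _ in range(n)) / n

# E[U^2] = 1/3, so the sample mean should settle near 1/3 as n grows.
estimate = monte_carlo_mean(lambda u: u * u, 100_000)
print(estimate)
```

The law of large numbers is what guarantees that this average approaches the integral; it says nothing about how fast, which is where the CLT comes in.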

2. The Central Limit Theorem (CLT)

Theorem: Let X₁, X₂, ..., Xₙ be i.i.d. with mean μ and finite variance σ². Then:

√n (X̄ₙ − μ) / σ  →  N(0, 1)    in distribution

Equivalently:

(X̄ₙ − μ) / (σ/√n)  →  N(0, 1)    in distribution

Or, for the sum Sₙ = Σ Xᵢ:

(Sₙ − nμ) / (σ√n)  →  N(0, 1)    in distribution

In words: For large n, the sample mean is approximately normally distributed with mean μ and variance σ²/n — REGARDLESS of the original distribution (as long as it has finite variance). This is the single most important result in probability and statistics.

ⓘ Convergence in distribution: Xₙ → X in distribution if F_{Xₙ}(x) → F_X(x) at all continuity points of F_X.

What the CLT does NOT say:

  - It does NOT say X̄ₙ is exactly normal for any finite n; it is an asymptotic approximation.
  - It does NOT say individual Xᵢ are normal; they can be anything (discrete, skewed, bounded).
  - It does NOT work if σ² is infinite (e.g., the Cauchy distribution, for which the theorem fails).

Why the normal distribution is universal: The normal family is a "fixed point" of summation: a sum of independent normal random variables is again normal. The CLT explains why the normal appears everywhere in nature: it is the universal limit of standardized sums of many small independent effects with finite variance.

Practical rule of thumb: CLT approximation is usually adequate for n ≥ 30, though highly skewed distributions may require larger n.
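The theorem can be watched in action with a short simulation. This is a sketch under stated assumptions: the data are Exponential(1) (a skewed distribution with μ = σ = 1), and the helper names `standardized_means` and `max_cdf_gap` are made up for the example. It compares the empirical CDF of √n (X̄ₙ − μ)/σ with Φ at a few points.

```python
import math
import random
from statistics import NormalDist

def standardized_means(n, trials=5000, seed=0):
    """Simulate sqrt(n) * (mean - mu) / sigma for Exponential(1) data (mu = sigma = 1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(trials):
        mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append(math.sqrt(n) * (mean - 1.0) / 1.0)
    return out

def max_cdf_gap(values, grid=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Largest gap between the empirical CDF and Phi on a small grid of points."""
    phi = NormalDist().cdf
    return max(abs(sum(v <= x for v in values) / len(values) - phi(x)) for x in grid)

gap_n30 = max_cdf_gap(standardized_means(30))
print(gap_n30)
```

Running the same comparison with n = 1 gives a much larger gap, since a single exponential draw is far from normal.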

3. CLT Applications

Probability approximation:

P(X̄ₙ ≤ a) ≈ Φ((a − μ) / (σ/√n))

Binomial proportion: For X ~ Binomial(n, p), which is a sum of n i.i.d. Bernoulli(p) variables:

(X − np) / √(np(1−p)) ≈ N(0, 1)

or equivalently, the sample proportion p̂ = X/n:

p̂ ≈ N(p, p(1−p)/n)

Continuity correction (for integer-valued sums): When approximating a discrete distribution with a continuous one, adjust by ±0.5:

  - P(X ≤ k) ≈ Φ((k + 0.5 − np) / √(np(1−p)))
  - P(X ≥ k) ≈ 1 − Φ((k − 0.5 − np) / √(np(1−p)))

Without continuity correction, the approximation can be noticeably off for moderate n.

Poisson approximation to binomial (when p is small): When n is large and p is small (np < 10), Poisson(np) is better than normal.
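A quick numerical comparison makes the small-p case concrete. The sketch below assumes n = 1000, p = 0.003 (so np = 3) and k = 2; these values and the helper names are chosen for illustration only. It computes P(X ≤ k) exactly and with the Poisson and continuity-corrected normal approximations.

```python
import math
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def poisson_cdf(k, lam):
    """P(Y <= k) for Y ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))

def normal_cdf_cc(k, n, p):
    """Normal approximation to P(X <= k) with continuity correction."""
    return NormalDist(n * p, math.sqrt(n * p * (1 - p))).cdf(k + 0.5)

n, p, k = 1000, 0.003, 2  # np = 3, well below 10
exact = binom_cdf(k, n, p)
poisson = poisson_cdf(k, n * p)
normal = normal_cdf_cc(k, n, p)
print(exact, poisson, normal)
```

For these inputs the Poisson value lands much closer to the exact binomial probability than the normal value does.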

4. Berry-Esseen Theorem

The CLT says convergence happens — Berry-Esseen tells us how fast:

Let Fₙ be the CDF of √n (X̄ₙ − μ)/σ and Φ be the standard normal CDF. Then:

sup_x |Fₙ(x) − Φ(x)| ≤ C · E[|X − μ|³] / (σ³ √n)

where C is a universal constant (proven to be ≤ 0.4748).

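In words: The maximum error in the CLT approximation shrinks at rate 1/√n, with a constant that depends on the standardized third absolute moment E[|X − μ|³]/σ³. This quantity is related to skewness but is not the same thing: it is strictly positive even for symmetric distributions. Heavier-tailed or more skewed distributions make it larger and so need larger n for a good normal approximation.

Implication: The bound is a worst case and does not reward symmetry directly. In practice, for symmetric continuous distributions (skewness = 0) the actual error is often far below it, and the CLT approximation can be remarkably good even for very small n (e.g., n=5 for Uniform).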
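To get a feel for the size of the bound, the sketch below evaluates it for Bernoulli(p) summands and compares it with the exact sup-distance for the standardized Binomial(n, p). It uses the constant 0.4748 quoted above; the function names `berry_esseen_bound` and `actual_sup_error` are hypothetical.

```python
import math
from statistics import NormalDist

C = 0.4748  # constant quoted in the text

def berry_esseen_bound(p, n):
    """Berry-Esseen bound for the standardized mean of n Bernoulli(p) variables."""
    sigma = math.sqrt(p * (1 - p))
    rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)  # E|X - p|^3 for Bernoulli(p)
    return C * rho / (sigma ** 3 * math.sqrt(n))

def actual_sup_error(p, n):
    """Exact sup_x |F_n(x) - Phi(x)| for the standardized Binomial(n, p)."""
    phi = NormalDist().cdf
    mu, sd = n * p, math.sqrt(n * p * (1 - p))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        z = (k - mu) / sd
        worst = max(worst, abs(cdf - phi(z)))  # left limit, just below the jump at k
        cdf += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        worst = max(worst, abs(cdf - phi(z)))  # value at the jump
    return worst

print(berry_esseen_bound(0.5, 100), actual_sup_error(0.5, 100))
```

Because Fₙ is a step function and Φ is increasing, the supremum is attained at a jump point, which is why checking both sides of each jump is enough.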

5. Relationship Between Convergence Modes

Almost Sure  ⇒  In Probability  ⇒  In Distribution
In Distribution  ⇒  In Probability    (only when the limit is a constant)

The CLT gives convergence in distribution (to a non-constant limit, the normal). The WLLN gives convergence in probability. The SLLN gives almost sure convergence. Both laws of large numbers have the constant μ as their limit; the CLT instead describes the random fluctuations of X̄ₙ around μ at scale σ/√n.



Key Terms

Worked Examples

Example 1: WLLN Application — Polling

A pollster wants to estimate the proportion p of voters supporting a candidate to within ±0.03 with 95% confidence. How large a sample is needed?

Solution:

Let Xᵢ = 1 if voter i supports candidate, 0 otherwise. E[Xᵢ] = p, Var(Xᵢ) = p(1−p).

By Chebyshev: P(|p̂ − p| > 0.03) ≤ Var(p̂)/(0.03)² = p(1−p) / (n · 0.0009).

Worst case: p(1−p) ≤ 1/4. So P(error > 0.03) ≤ (1/4)/(n·0.0009) = 1/(0.0036n).

We want this ≤ 0.05: 1/(0.0036n) ≤ 0.05 → n ≥ 1/(0.0036·0.05) = 1/0.00018 ≈ 5556.

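Using the CLT instead: n ≥ (z_{0.025})² / (4·0.03²) = (1.96)² / (4·0.0009) ≈ 1067.1, so n = 1068 after rounding up. The CLT requirement is much smaller because Chebyshev's inequality holds for every distribution and is therefore conservative. Note that the CLT figure is an approximation rather than a guaranteed bound.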
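Both sample-size calculations can be reproduced in a few lines. This is a sketch only; the function names are illustrative, and each result is rounded up to the next whole respondent.

```python
import math
from statistics import NormalDist

def chebyshev_sample_size(margin, alpha):
    """Smallest n with 1 / (4 n margin^2) <= alpha (worst case p = 1/2)."""
    return math.ceil(1 / (4 * margin ** 2 * alpha))

def clt_sample_size(margin, alpha):
    """Smallest n with z_{alpha/2}^2 / (4 margin^2) <= n (worst case p = 1/2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(z ** 2 / (4 * margin ** 2))

print(chebyshev_sample_size(0.03, 0.05))  # 5556
print(clt_sample_size(0.03, 0.05))        # 1068
```

Halving the margin roughly quadruples both answers, since n scales with 1/margin².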


Example 2: CLT for a Sum

The weight of a chocolate bar is a random variable with mean 100g and standard deviation 8g. You buy 50 bars (assumed independent). What is the approximate probability the total weight exceeds 5100g?

Solution:

S₅₀ = Σ Xᵢ. E[S] = 50·100 = 5000. Var(S) = 50·64 = 3200. σ_S = √3200 ≈ 56.57.

Z = (5100 − 5000)/56.57 = 100/56.57 ≈ 1.768.

P(S > 5100) = 1 − Φ(1.768) ≈ 1 − 0.9615 = 0.0385. About 3.85% chance.
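The same calculation as a small reusable sketch, using the standard library's normal CDF in place of a table lookup. The helper name `clt_sum_tail` is an assumption for this example.

```python
import math
from statistics import NormalDist

def clt_sum_tail(n, mu, sigma, threshold):
    """Approximate P(S_n > threshold) for a sum of n i.i.d. terms via the CLT."""
    z = (threshold - n * mu) / (sigma * math.sqrt(n))
    return 1 - NormalDist().cdf(z)

# Chocolate bars: n = 50, mean 100 g, SD 8 g, threshold 5100 g.
p_bars = clt_sum_tail(50, 100, 8, 5100)
print(p_bars)
```

The result agrees with the table-based value above to about three decimal places; small differences come from rounding Z and Φ.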


Example 3: CLT with Continuity Correction

A fair coin is flipped 100 times. Find the approximate probability of getting between 45 and 55 heads inclusive.

Solution:

X ~ Binomial(100, 0.5). μ = 50, σ = √25 = 5.

Without continuity correction: P(45 ≤ X ≤ 55) ≈ Φ((55−50)/5) − Φ((45−50)/5) = Φ(1) − Φ(−1) = 0.8413 − 0.1587 = 0.6826.

With continuity correction: P(45 ≤ X ≤ 55) ≈ Φ((55.5−50)/5) − Φ((44.5−50)/5) = Φ(1.1) − Φ(−1.1) = 0.8643 − 0.1357 = 0.7286.

Exact (binomial sum): P(45 ≤ X ≤ 55) = Σ_{k=45}^{55} C(100,k)(0.5)¹⁰⁰ ≈ 0.7287.

The continuity correction gives near-perfect agreement, while the uncorrected approximation is off by ~4.6 percentage points.
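The three numbers in this example can be recomputed directly. The sketch below sums the exact binomial probabilities with `math.comb` and evaluates both normal approximations with `statistics.NormalDist`; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n, p = 100, 0.5
mu, sd = n * p, math.sqrt(n * p * (1 - p))
phi = NormalDist(mu, sd).cdf  # CDF of N(50, 5^2)

# Exact binomial probability of 45..55 heads inclusive.
exact = sum(math.comb(n, k) for k in range(45, 56)) / 2 ** n
uncorrected = phi(55) - phi(45)
corrected = phi(55.5) - phi(44.5)
print(exact, uncorrected, corrected)
```

Widening each end of the interval by 0.5 assigns every integer k the full strip [k − 0.5, k + 0.5] under the normal curve, which is why the corrected value tracks the exact sum so closely.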

Quiz

Q1: The Weak Law of Large Numbers (WLLN) states that for i.i.d. random variables with finite mean μ:

A) X̄_n → μ almost surely B) X̄_n converges in probability to μ C) X̄_n is exactly μ for all n D) √n(X̄_n − μ) converges to N(0, σ²)

Correct: B)


Q2: The Central Limit Theorem states that for i.i.d. RVs with mean μ and variance σ²:

A) X̄_n ~ N(μ, σ²/n) for any n B) √n(X̄_n − μ)/σ converges in distribution to N(0, 1) C) X̄_n converges to a constant D) The distribution of X̄_n is always normal

Correct: B)


Q3: The CLT requires which condition on the underlying distribution?

A) The distribution must be symmetric B) Finite mean and variance C) The distribution must be continuous D) The distribution must be bounded

Correct: B)


Q5: A 95% confidence interval for μ using the CLT is approximately:

A) X̄_n ± 1.96 · σ/√n B) X̄_n ± 1.96 · σ C) X̄_n ± σ/√n D) X̄_n ± 1.96 · σ²

Correct: A)


Practice Problems

  1. Using Chebyshev's inequality, find the smallest n such that P(|X̄ₙ − μ| > 0.1σ) ≤ 0.05.
  2. A casino's slot machine pays out with mean $0.95 and SD $5 per play. If 10,000 plays occur in a day, approximate P(total payout > $9,800).
  3. Show that if Xₙ → c in probability (c constant) and g is continuous at c, then g(Xₙ) → g(c) in probability.
  4. Verify the CLT for sums of uniform random variables: compute the exact distribution of the sum of 12 i.i.d. Uniform(0, 1) and compare its mean and variance to N(6, 1).
  5. State and prove the WLLN using Chebyshev's inequality. What additional condition does the SLLN require?
  6. An insurance company has 10,000 independent policies, each with a claim probability of 0.02. Claim amounts have mean $5000 and SD $2000. Approximate the probability total claims exceed $12M.
  7. Show that SLLN ⇒ WLLN, but WLLN ⇏ SLLN. (Construct a counterexample where convergence in probability holds but not almost surely.)
Answers

  1. Chebyshev: P(|X̄ − μ| > 0.1σ) ≤ Var(X̄)/(0.01σ²) = (σ²/n)/(0.01σ²) = 1/(0.01n). Setting 1/(0.01n) ≤ 0.05 gives n ≥ 2000.
  2. By the CLT, S ≈ N(10000·0.95, 10000·25) = N(9500, 250000), so σ_S = 500. Z = (9800−9500)/500 = 0.6 and P(S > 9800) ≈ 1−Φ(0.6) ≈ 0.274.
  3. Fix ε > 0. By continuity of g at c there is a δ > 0 such that |x−c| ≤ δ implies |g(x)−g(c)| ≤ ε. Hence P(|g(Xₙ)−g(c)| > ε) ≤ P(|Xₙ−c| > δ), and the right-hand side → 0 because Xₙ → c in probability.
  4. Each Uᵢ has mean 0.5 and variance 1/12. Sum of 12: mean = 12·0.5 = 6, variance = 12·(1/12) = 1. The exact distribution of the sum of 12 uniforms is a piecewise polynomial (Irwin-Hall), but the CLT says it is ≈ N(6, 1), which is remarkably accurate even at n=12.
  5. WLLN: E[X̄] = μ, Var(X̄) = σ²/n, so by Chebyshev P(|X̄−μ| > ε) ≤ σ²/(nε²) → 0. The SLLN needs no additional condition; in fact it needs less. It requires only a finite mean (the variance can be infinite), and is proven via Kolmogorov's inequality and the Borel-Cantelli lemma.
  6. Total claims S = Σ_{i=1}^{10000} XᵢYᵢ where Yᵢ ~ Bern(0.02) and Xᵢ given Yᵢ=1 has the stated moments. E[S] = 10000·0.02·5000 = 1,000,000. Var(S) = 10000·[0.02·2000² + 0.02·0.98·5000²] = 10000·[80,000 + 490,000] = 10000·570,000 = 5.7×10⁹. σ_S = √(5.7×10⁹) ≈ 75,498. Z = (12M−1M)/75,498 ≈ 145.7, which is off the charts, so the probability is effectively zero.
  7. SLLN ⇒ WLLN because a.s. convergence implies convergence in probability. Counterexample: Let Xₙ be independent with P(Xₙ=1) = 1/n, P(Xₙ=0) = 1−1/n. Then Xₙ → 0 in probability (for any 0 < ε < 1, P(|Xₙ| > ε) = 1/n → 0), but Σ P(Xₙ=1) = Σ 1/n = ∞, so by the second Borel-Cantelli lemma P(Xₙ=1 i.o.) = 1 and a.s. convergence fails.

Summary


Pitfalls


Quiz

  1. The Weak Law of Large Numbers states that X̄ₙ: a) Equals μ for all n b) Converges to μ in probability c) Is normally distributed d) Has variance exactly σ²/n Answer: b. For any ε > 0, P(|X̄ₙ − μ| > ε) → 0 as n → ∞.

  2. The CLT says that for i.i.d. random variables with finite variance: a) X̄ₙ is exactly normal b) √n (X̄ₙ − μ)/σ converges in distribution to N(0, 1) c) Individual Xᵢ become normal as n increases d) σ/√n converges to 0 Answer: b. The standardized sample mean converges in distribution to standard normal.

  3. Almost sure convergence is ______ than convergence in probability: a) Weaker b) Stronger c) Equivalent to d) Unrelated to Answer: b. a.s. ⇒ in probability, but not conversely.

  4. A continuity correction in CLT approximations adds/subtracts: a) σ b) 0.5 c) 1 d) √n Answer: b. Adjusting by ±0.5 accounts for the gap between a discrete integer and a continuous interval.

  5. What does the Berry-Esseen theorem bound? a) The rate of convergence in the WLLN b) The maximum error in the CLT normal approximation c) The variance of the sample mean d) The probability of a Type I error Answer: b. sup_x |Fₙ(x) − Φ(x)| ≤ C·E[|X−μ|³]/(σ³√n), which quantifies how fast the CLT approximation improves.

  6. For X ~ Binomial(100, 0.5), the normal approximation works well because: a) p = 0.5 gives perfect symmetry b) n=100 is large and np(1−p)=25 > 10 c) The binomial is already normal d) Continuity correction is unnecessary Answer: b. Rule of thumb: np(1−p) ≥ 10 for good normal approximation. Here 25 >> 10.

  7. The SLLN requires: a) Finite variance b) Finite mean c) Normal distribution d) Independence (which can be relaxed) Answer: b. Kolmogorov's SLLN requires only E[|X|] < ∞. Finite variance is not needed.

  8. If Xₙ → X in distribution and g is a bounded continuous function, then: a) g(Xₙ) → g(X) in distribution b) g(Xₙ) → g(X) in probability c) Xₙ and X have the same mean d) Xₙ → X almost surely Answer: a. Continuous mapping theorem: convergence in distribution is preserved under continuous functions.


Next Steps

Continue to 11-07 Markov Chains (Discrete) to learn about transition matrices, stationary distributions, and the ergodic theorem for discrete-time Markov chains.