Phase 11: Probability Theory II
Subject 11-01: Expectation for Continuous Random Variables
Prerequisites: 10-07 (Continuous Random Variables), 10-06 (Expectation of Discrete RVs), basic calculus
Learning Objectives
- Compute expected values for continuous random variables using LOTUS (Law of the Unconscious Statistician)
- Define and derive moment generating functions (MGFs) for continuous distributions
- Define characteristic functions and explain their advantages over MGFs
- Compute moments (mean, variance, skewness, kurtosis) from MGFs and characteristic functions
- Apply Jensen's inequality to relate E[g(X)] and g(E[X])
Core Content
1. LOTUS for Continuous Random Variables
For a continuous random variable X with PDF f_X(x), the Law of the Unconscious Statistician states:
$E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx$
You do NOT need to find the distribution of Y = g(X) — just integrate g(x) weighted by the PDF of X.
⚠️ CRITICAL: E[g(X)] ≠ g(E[X]) in general. For convex functions g, Jensen's inequality gives E[g(X)] ≥ g(E[X]).
Example (variance via LOTUS): Var(X) = E[(X − μ)²] = ∫ (x − μ)² f_X(x) dx. No need to find the distribution of (X − μ)².
Edge case: The integral must converge absolutely for E[g(X)] to exist.
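As a rough numerical illustration of LOTUS, the sketch below integrates g(x) f_X(x) directly for X ~ Exponential(λ) with g(x) = x². The rate `lam = 2.0`, the truncation point `upper`, and the helper `simpson` are assumptions chosen for this example, not part of the lesson.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

lam = 2.0      # assumed rate, for illustration only
upper = 40.0   # truncation point; the exponential tail beyond it is negligible

def pdf(x):
    """PDF of Exponential(lam) on x >= 0."""
    return lam * math.exp(-lam * x)

mean = simpson(lambda x: x * pdf(x), 0.0, upper)               # E[X]
second_moment = simpson(lambda x: x**2 * pdf(x), 0.0, upper)   # E[X^2] via LOTUS

print("E[X]     ~", mean)            # close to 1/lam
print("E[X^2]   ~", second_moment)   # close to 2/lam**2
print("(E[X])^2 ~", mean**2)         # differs from E[X^2]
```

The distribution of X² is never computed; only the PDF of X is used, and the last two printed values differ, as the warning above says they should.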
2. Moment Generating Functions (MGF)
The moment generating function of X is:
$M_X(t) = E[e^{tX}] = ∫_{−∞}^{∞} e^{tx} f_X(x) dx$
defined for all t where the integral converges (the MGF may not exist for all t, or may not exist at all — e.g., the Cauchy distribution has no MGF).
Key property — generating moments: The k-th moment is obtained by differentiating M_X(t) k times and evaluating at t = 0:
$E[Xᵏ] = M_X^{(k)}(0)$
Proof sketch: Expand e^{tX} = 1 + tX + t²X²/2! + t³X³/3! + ... and take expectation term by term:
$M_X(t) = E[1 + tX + t²X²/2! + ...] = 1 + t E[X] + t²E[X²]/2! + ...$
Differentiating k times and setting t = 0 isolates the k-th moment.
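The derivative property can be checked numerically with finite differences. This is a minimal sketch, assuming X ~ Exponential(λ) with the illustrative value `lam = 2.0` and a hand-picked step size `h`; it approximates M'(0) and M''(0) rather than computing them symbolically.

```python
lam = 2.0   # assumed rate, for illustration only
h = 1e-3    # finite-difference step (an arbitrary small value)

def mgf(t):
    """MGF of Exponential(lam), valid for t < lam."""
    return lam / (lam - t)

# Central differences around t = 0
first = (mgf(h) - mgf(-h)) / (2 * h)                 # approximates M'(0)  = E[X]
second = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2    # approximates M''(0) = E[X^2]
variance = second - first**2

print("E[X]   ~", first)      # close to 1/lam
print("E[X^2] ~", second)     # close to 2/lam**2
print("Var(X) ~", variance)   # close to 1/lam**2
```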
Common MGFs (continuous):
| Distribution | MGF | Domain |
|---|---|---|
| Uniform(0, 1) | (eᵗ − 1)/t for t ≠ 0, with M(0) = 1 | all t |
| Exponential(λ) | λ/(λ − t) | t < λ |
| Gamma(α, β), rate β | (1 − t/β)^{−α} | t < β |
| Normal(μ, σ²) | exp(μt + σ²t²/2) | all t |
| Chi-squared(k) | (1 − 2t)^{−k/2} | t < 1/2 |
Why MGFs matter:
1. Uniqueness: If M_X(t) = M_Y(t) in a neighborhood of 0, then X and Y have the same distribution.
2. Sums of independent RVs: M_{X+Y}(t) = M_X(t) · M_Y(t) when X, Y are independent.
3. Proving limit theorems: If the MGFs of X_n converge to M_X(t) for all t in a neighborhood of 0, then X_n converges to X in distribution. (The analogous result for characteristic functions is Lévy's continuity theorem.)
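The product rule for independent sums can be sanity-checked numerically. The sketch below assumes two independent Exponential(λ) variables, whose sum is Gamma(2, rate λ) with density λ² x e^{−λx}, and compares the squared exponential MGF with a direct integral of e^{tx} against that density. The values `lam = 2.0` and `t = 0.5` and the helper `simpson` are illustrative choices.

```python
import math

def simpson(f, a, b, n=6000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

lam = 2.0   # assumed common rate
t = 0.5     # any t < lam keeps the integral finite

def mgf_exponential(s):
    """MGF of Exponential(lam), valid for s < lam."""
    return lam / (lam - s)

def gamma2_pdf(x):
    """Density of the sum of two independent Exponential(lam) variables."""
    return lam**2 * x * math.exp(-lam * x)

product_of_mgfs = mgf_exponential(t) ** 2
mgf_of_sum = simpson(lambda x: math.exp(t * x) * gamma2_pdf(x), 0.0, 60.0)

print(product_of_mgfs, mgf_of_sum)   # the two values should agree closely
```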
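Pitfall: The MGF may fail to exist on any open interval around 0. Example: for the log-normal distribution, E[e^{tX}] diverges for every t > 0 (it is finite only for t ≤ 0), so the log-normal has no MGF in a neighborhood of 0 even though all of its moments are finite.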
3. Characteristic Functions
The characteristic function of X is:
$φ_X(t) = E[e^{itX}] = ∫_{−∞}^{∞} e^{itx} f_X(x) dx$
where i = √(−1). This is essentially the Fourier transform of the PDF.
Key advantages over MGFs:
1. Always exists: |e^{itx}| = 1, so |φ_X(t)| ≤ 1 for all t. The characteristic function exists for EVERY distribution, including the Cauchy.
2. Inversion formula: When φ_X is absolutely integrable, the PDF can be recovered from φ_X(t) via the inverse Fourier transform: $f_X(x) = (1/(2π)) ∫_{−∞}^{∞} e^{−itx} φ_X(t) dt$
3. Uniqueness: φ_X uniquely determines the distribution (a stronger statement than for MGFs, since φ_X always exists).
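The inversion formula can be tried out numerically. This sketch assumes the standard normal characteristic function e^{−t²/2}, truncates the integral to [−12, 12] (an assumption that is harmless here because φ decays very quickly), and compares the recovered density with the known N(0, 1) PDF at an arbitrary point `x = 0.7`.

```python
import cmath
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b]; works for complex-valued f as well."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

def phi(t):
    """Characteristic function of N(0, 1)."""
    return math.exp(-t * t / 2)

def pdf_from_cf(x, limit=12.0):
    """Approximate f(x) = (1 / 2pi) * integral of exp(-i t x) phi(t) dt over [-limit, limit]."""
    integral = simpson(lambda t: cmath.exp(-1j * t * x) * phi(t), -limit, limit)
    return (integral / (2 * math.pi)).real   # imaginary part is numerical noise

x = 0.7   # arbitrary evaluation point
recovered = pdf_from_cf(x)
exact = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

print(recovered, exact)   # the two values should agree closely
```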
Moment generation from characteristic functions:
$E[Xᵏ] = i^{−k} φ_X^{(k)}(0)$
provided the k-th moment exists.
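A small numerical check of this formula, assuming X ~ Exponential(λ) with characteristic function λ/(λ − it) and the illustrative rate `lam = 2.0`. The derivatives at 0 are approximated by central differences with a hand-picked step `h`, then divided by the appropriate power of i.

```python
lam = 2.0   # assumed rate, for illustration only
h = 1e-3    # finite-difference step (an arbitrary small value)

def cf(t):
    """Characteristic function of Exponential(lam)."""
    return lam / (lam - 1j * t)

d1 = (cf(h) - cf(-h)) / (2 * h)                 # approximates phi'(0)
d2 = (cf(h) - 2 * cf(0.0) + cf(-h)) / h**2      # approximates phi''(0)

m1 = (d1 / 1j).real         # E[X]   = i^{-1} phi'(0)
m2 = (d2 / 1j**2).real      # E[X^2] = i^{-2} phi''(0)

print("E[X]   ~", m1)   # close to 1/lam
print("E[X^2] ~", m2)   # close to 2/lam**2
```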
Example — Standard Cauchy: PDF: f(x) = 1/(π(1+x²)). Characteristic function: φ(t) = e^{−|t|}. Notice φ is NOT differentiable at t = 0, reflecting that E[X] does not exist. The characteristic function cleanly encodes moment existence!
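The kink at t = 0 is easy to see numerically. This sketch evaluates one-sided difference quotients of φ(t) = e^{−|t|} at 0 for a few step sizes chosen for illustration: the right-hand slope tends to −1 and the left-hand slope to +1, so no derivative exists there.

```python
import math

def cf_cauchy(t):
    """Characteristic function of the standard Cauchy distribution."""
    return math.exp(-abs(t))

slopes = []
for h in (1e-2, 1e-4, 1e-6):
    right = (cf_cauchy(h) - cf_cauchy(0.0)) / h    # slope approaching 0 from the right
    left = (cf_cauchy(0.0) - cf_cauchy(-h)) / h    # slope approaching 0 from the left
    slopes.append((h, left, right))
    print(f"h={h:g}  left slope={left:.6f}  right slope={right:.6f}")
```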
Common characteristic functions:
| Distribution | φ(t) |
|---|---|
| Normal(0, 1) | e^{−t²/2} |
| Exponential(λ) | λ/(λ − it) |
| Cauchy(0, 1) | e^{−\|t\|} |
| Uniform(−a, a) | sin(at)/(at) |
4. Moments: Mean, Variance, Skewness, Kurtosis
For a continuous RV X with PDF f:
Raw moments: μ'_k = E[Xᵏ] = ∫ xᵏ f(x) dx
Central moments: μ_k = E[(X − μ)ᵏ] where μ = E[X].
Key moments and standardized moments:
- Mean: μ = E[X] (first raw moment)
- Variance: σ² = E[(X − μ)²] = μ₂ (second central moment)
- Skewness: γ₁ = E[(X − μ)³]/σ³ = μ₃/σ³ (measures asymmetry)
  - γ₁ > 0: right-skewed (long right tail, e.g. exponential)
  - γ₁ = 0: symmetric (e.g. normal)
  - γ₁ < 0: left-skewed
- Excess Kurtosis: γ₂ = E[(X − μ)⁴]/σ⁴ − 3 (measures tail weight relative to normal)
  - γ₂ > 0: heavier tails than normal (leptokurtic, e.g. t-distribution)
  - γ₂ = 0: normal tail weight (mesokurtic, e.g. normal)
  - γ₂ < 0: lighter tails than normal (platykurtic, e.g. uniform)
Common Pitfall: Kurtosis measures tail weight, NOT "peakedness." A distribution can have higher kurtosis than the normal while having a lower density at its center: the t-distribution with 5 degrees of freedom has excess kurtosis 6, yet its density at 0 is lower than that of N(0, 1).
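Computing skewness/kurtosis from MGF: Use the cumulant generating function K(t) = ln M_X(t), whose r-th derivative at 0 is the cumulant κᵣ = K^{(r)}(0). The first four cumulants are κ₁ = μ, κ₂ = σ², κ₃ = μ₃, and κ₄ = μ₄ − 3σ⁴, so γ₁ = κ₃/κ₂^{3/2} and γ₂ = κ₄/κ₂². For the normal distribution, κ₁ = μ, κ₂ = σ², κ₃ = κ₄ = ... = 0: all cumulants beyond order 2 are zero, which characterizes the normal distribution.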
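To make the definitions in this section concrete, the sketch below computes skewness and excess kurtosis for X ~ Exponential(1) by numerically integrating the central moments, then converts them through the standard identities κ₃ = μ₃ and κ₄ = μ₄ − 3μ₂². The distribution, the truncation point `upper`, and the helper names are assumptions made for the example.

```python
import math

def simpson(f, a, b, n=8000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

upper = 80.0   # truncation point; the tail of Exponential(1) beyond it is negligible

def pdf(x):
    """PDF of Exponential(1) on x >= 0."""
    return math.exp(-x)

def expect(g):
    """E[g(X)] via LOTUS, integrating g(x) * pdf(x) over [0, upper]."""
    return simpson(lambda x: g(x) * pdf(x), 0.0, upper)

mu = expect(lambda x: x)
mu2 = expect(lambda x: (x - mu) ** 2)   # variance
mu3 = expect(lambda x: (x - mu) ** 3)   # third central moment
mu4 = expect(lambda x: (x - mu) ** 4)   # fourth central moment

kappa3 = mu3                  # third cumulant equals the third central moment
kappa4 = mu4 - 3 * mu2**2     # fourth cumulant

skewness = kappa3 / mu2**1.5
excess_kurtosis = kappa4 / mu2**2

print("skewness        ~", skewness)          # close to 2 for the exponential
print("excess kurtosis ~", excess_kurtosis)   # close to 6 for the exponential
```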
5. Jensen's Inequality
Theorem: If g is a convex function, then:
$E[g(X)] ≥ g(E[X])$
If g is strictly convex, equality holds iff X is constant (almost surely).
Convexity check: g''(x) ≥ 0 for all x ⇒ g is convex. Examples: g(x) = x², x⁴, eˣ, −ln(x), 1/x (for x > 0).
Examples:
- E[X²] ≥ (E[X])² (since g(x) = x² is convex); this is equivalent to Var(X) ≥ 0
- E[e^X] ≥ e^{E[X]}
- E[1/X] ≥ 1/E[X] for X > 0 (since g(x) = 1/x is convex for x > 0)
Concave functions (reverse inequality): g''(x) ≤ 0 ⇒ E[g(X)] ≤ g(E[X]). Examples: g(x) = ln(x), √x.
Application — Information theory: By Jensen, E[−ln(f(X))] ≥ −ln(E[f(X)]) — the foundation for entropy bounds.
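The direction of the inequality can be checked numerically for the functions listed in this section. This sketch assumes X ~ Uniform(1, 3), chosen so that 1/x, ln(x) and √x are all defined on the support, and reports the gap E[g(X)] − g(E[X]) for each g. The helper names are illustrative.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

a, b = 1.0, 3.0   # assumed support of the uniform distribution

def expect(g):
    """E[g(X)] for X ~ Uniform(a, b), via LOTUS."""
    return simpson(lambda x: g(x) / (b - a), a, b)

mean = expect(lambda x: x)

convex = {
    "x^2": lambda x: x**2,
    "exp(x)": math.exp,
    "1/x": lambda x: 1.0 / x,
    "-ln(x)": lambda x: -math.log(x),
}
concave = {"ln(x)": math.log, "sqrt(x)": math.sqrt}

# Gap = E[g(X)] - g(E[X]); expected sign is >= 0 for convex g, <= 0 for concave g
convex_gaps = {name: expect(g) - g(mean) for name, g in convex.items()}
concave_gaps = {name: expect(g) - g(mean) for name, g in concave.items()}

for name, gap in {**convex_gaps, **concave_gaps}.items():
    print(f"{name:8s} gap = {gap:+.6f}")
```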
Key Terms
- MGF of a sum of independent random variables
- characteristic function
- moment generating function
Worked Examples
Example 1: Computing MGF and Moments
Let X ~ Exponential(λ). (a) Derive the MGF. (b) Use it to compute E[X], E[X²], and Var(X).
Solution:
(a) M_X(t) = E[e^{tX}] = ∫₀^{∞} e^{tx} λ e^{−λx} dx = λ ∫₀^{∞} e^{−(λ−t)x} dx
For t < λ: = λ [−e^{−(λ−t)x}/(λ−t)]₀^{∞} = λ · (1/(λ−t)) = λ/(λ−t).
(b) M'_X(t) = λ/(λ−t)², so E[X] = M'_X(0) = λ/λ² = 1/λ. M''_X(t) = 2λ/(λ−t)³, so E[X²] = M''_X(0) = 2λ/λ³ = 2/λ². Therefore Var(X) = E[X²] − (E[X])² = 2/λ² − 1/λ² = 1/λ². ✓
Example 2: Characteristic Function of Normal
Find the characteristic function of Z ~ N(0, 1).
Solution:
φ_Z(t) = E[e^{itZ}] = ∫_{−∞}^{∞} e^{itz} · (1/√(2π)) e^{−z²/2} dz
Complete the square in the exponent: itz − z²/2 = −(z² − 2itz)/2 = −((z − it)² + t²)/2 = −(z − it)²/2 − t²/2.
φ_Z(t) = e^{−t²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(z−it)²/2} dz
The remaining integral equals 1: the integrand is the N(0, 1) density shifted by the imaginary amount it, and a contour-integration argument shows that this complex shift does not change the value of the integral. Thus φ_Z(t) = e^{−t²/2}.
For X ~ N(μ, σ²): X = μ + σZ, so φ_X(t) = e^{iμt} φ_Z(σt) = exp(iμt − σ²t²/2).
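The closed form for N(μ, σ²) can be compared against a direct numerical evaluation of E[e^{itX}]. This is a sketch under assumed values `mu = 1.0`, `sigma = 2.0`, `t = 0.5`, with the integral truncated to 12 standard deviations on either side of the mean.

```python
import cmath
import math

def simpson(f, a, b, n=4000):
    """Composite Simpson's rule on [a, b]; works for complex-valued f as well."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

mu, sigma = 1.0, 2.0   # assumed parameters, for illustration only
t = 0.5                # arbitrary argument of the characteristic function

def normal_pdf(x):
    """PDF of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-z * z / 2) / (sigma * math.sqrt(2 * math.pi))

# Direct evaluation of E[exp(i t X)] by integrating against the PDF
numeric = simpson(lambda x: cmath.exp(1j * t * x) * normal_pdf(x),
                  mu - 12 * sigma, mu + 12 * sigma)

# Closed form derived above: exp(i mu t - sigma^2 t^2 / 2)
closed_form = cmath.exp(1j * mu * t - sigma**2 * t**2 / 2)

print(numeric, closed_form)   # the two complex values should agree closely
```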
Example 3: Jensen's Inequality in Action
Let X ~ Uniform(0, 2). Compute E[X²] and (E[X])², and verify E[X²] ≥ (E[X])².
Solution:
E[X] = (0+2)/2 = 1. So (E[X])² = 1.
E[X²] = ∫₀² x² · (1/2) dx = (1/2)[x³/3]₀² = (1/2)(8/3) = 4/3 ≈ 1.333.
Clearly 4/3 > 1, so E[X²] > (E[X])², consistent with Jensen (x² is strictly convex). The gap E[X²] − (E[X])² = 4/3 − 1 = 1/3 is Var(X). ✓
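The same numbers can be approximated by simulation. The sketch below draws samples from Uniform(0, 2) with a fixed seed and an arbitrarily chosen sample size, then estimates E[X], E[X²], and the gap between E[X²] and (E[X])². The estimates carry Monte Carlo error, so they should only be near the exact values 1, 4/3 and 1/3.

```python
import random

random.seed(0)   # fixed seed so the run is reproducible
n = 200_000      # sample size (an arbitrary choice for the illustration)

samples = [random.uniform(0.0, 2.0) for _ in range(n)]

mean_est = sum(samples) / n                     # estimates E[X]
second_est = sum(x * x for x in samples) / n    # estimates E[X^2]
gap = second_est - mean_est**2                  # estimates Var(X)

print("E[X]   ~", mean_est)
print("E[X^2] ~", second_est)
print("gap    ~", gap)
```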
Quiz
Q1: LOTUS for a continuous random variable states that E[g(X)] equals:
A) g(E[X]) B) ∫ g(x) f_X(x) dx C) Σ g(x) p_X(x) D) g(∫ x f_X(x) dx)
Correct: B)
- If you chose B: Correct! LOTUS lets you compute E[g(X)] by integrating g(x) times the PDF, without finding the distribution of g(X) first.
- If you chose A: This only holds when g is linear. In general, E[g(X)] ≠ g(E[X]) — Jensen's inequality.
- If you chose C: This is the discrete LOTUS. Continuous uses integration.
- If you chose D: This equals g(E[X]), which is generally incorrect for nonlinear g.
Q2: The moment generating function M_X(t) = E[e^{tX}] has the property that:
A) M_X(1) = E[X] B) M'_X(0) = E[X] C) M_X(0) = E[X] D) M''_X(0) = E[X]
Correct: B)
- If you chose B: Correct! The n-th derivative at 0 gives the n-th moment: M^(n)_X(0) = E[X^n]. So M'_X(0) = E[X] and M''_X(0) = E[X²].
- If you chose A: M_X(1) = E[e^X], not E[X].
- If you chose C: M_X(0) = E[1] = 1 for any random variable.
- If you chose D: M''_X(0) = E[X²] (the second moment), not E[X].
Q3: If the MGF of X exists in a neighborhood of 0, then:
A) X must be normally distributed B) The MGF uniquely determines the distribution of X C) All moments of X are zero D) X has finite support
Correct: B)
- If you chose B: Correct! When the MGF exists, it uniquely characterizes the distribution. Two RVs with the same MGF have the same distribution.
- If you chose A: Many distributions have MGFs (normal, gamma, Poisson, etc.), not just the normal.
- If you chose C: If all moments were zero, the MGF would be identically 1.
- If you chose D: The normal distribution has an MGF and infinite support.
Q5: Characteristic functions differ from MGFs in that:
A) They always exist for any random variable B) They only work for discrete distributions C) They don't generate moments D) They are always real-valued
Correct: A)
- If you chose A: Correct! The characteristic function φ(t) = E[e^{itX}] always exists because |e^{itX}| = 1. The MGF E[e^{tX}] may not exist for some distributions (e.g., Cauchy).
- If you chose B: Characteristic functions work for all distributions.
- If you chose C: Characteristic functions DO generate moments via derivatives at 0.
- If you chose D: Characteristic functions are generally complex-valued.
Practice Problems
- Let X ~ Uniform(0, θ). Derive the MGF and use it to compute E[X] and Var(X).
- Compute the characteristic function of X ~ Exponential(λ). Use it to verify E[X] = 1/λ.
- For X ~ Gamma(α, β), the MGF is M(t) = (1 − t/β)^{−α}. Find E[X] and Var(X) by differentiating the MGF.
- Let X have PDF f(x) = (1/2)e^{−|x|} for −∞ < x < ∞ (Laplace distribution). Find E[X], E[|X|], and Var(X).
- Prove that for any random variable X with finite variance, E[X²] ≥ (E[X])². When does equality hold?
- Show that the characteristic function of the Cauchy(0, 1) distribution is e^{−|t|}. Explain why this implies no finite moments.
- If X and Y are independent, prove that φ_{X+Y}(t) = φ_X(t) φ_Y(t).
Answers
1. M_X(t) = (e^{θt} − 1)/(θt). Using series expansion: M(t) = 1 + (θt)/2 + (θt)²/6 + ... so E[X] = M'(0) = θ/2, E[X²] = M''(0) = θ²/3, Var(X) = θ²/3 − (θ/2)² = θ²/12.
2. φ_X(t) = ∫₀^{∞} e^{itx} λ e^{−λx} dx = λ ∫₀^{∞} e^{−(λ−it)x} dx = λ/(λ−it). φ'_X(t) = iλ/(λ−it)². E[X] = i^{−1} φ'_X(0) = −i · iλ/λ² = 1/λ. ✓
3. M'(t) = (α/β)(1 − t/β)^{−α−1}, M'(0) = α/β. M''(t) = (α(α+1)/β²)(1 − t/β)^{−α−2}, M''(0) = α(α+1)/β². Var(X) = α(α+1)/β² − (α/β)² = α/β².
4. By symmetry, E[X] = 0. E[|X|] = 2∫₀^{∞} x·(1/2)e^{−x} dx = ∫₀^{∞} x e^{−x} dx = 1. E[X²] = 2∫₀^{∞} x²·(1/2)e^{−x} dx = ∫₀^{∞} x² e^{−x} dx = Γ(3) = 2. Var(X) = 2 − 0² = 2.
5. Var(X) = E[(X−μ)²] = E[X²] − 2μE[X] + μ² = E[X²] − μ² ≥ 0, so E[X²] ≥ μ² = (E[X])². Equality holds iff Var(X) = 0, i.e., P(X = c) = 1 for some constant c.
6. φ(t) = ∫ e^{itx}/(π(1+x²)) dx = e^{−|t|} (requires contour integration or recognizing it as the Fourier transform of the Cauchy). φ is not differentiable at t = 0, so no moments exist: the k-th derivative at 0 doesn't exist for any k ≥ 1.
7. φ_{X+Y}(t) = E[e^{it(X+Y)}] = E[e^{itX} e^{itY}]. By independence, E[e^{itX} e^{itY}] = E[e^{itX}] E[e^{itY}] = φ_X(t) φ_Y(t).
Summary
- LOTUS: E[g(X)] = ∫ g(x) f_X(x) dx — compute expectations of functions without finding the distribution of g(X); Jensen's inequality gives direction: E[g(X)] ≥ g(E[X]) for convex g
- The MGF M_X(t) = E[e^{tX}] generates moments via derivatives at 0, uniquely identifies distributions, and factorizes for independent sums — but it may not exist (e.g., Cauchy)
- The characteristic function φ_X(t) = E[e^{itX}] ALWAYS exists, uniquely identifies the distribution, and its differentiability at 0 indicates which moments exist
- Skewness (γ₁) measures asymmetry; excess kurtosis (γ₂) measures tail weight relative to normal — NOT peakedness
- Moments can be extracted from MGFs (M^{(k)}(0)) or characteristic functions (i^{−k} φ^{(k)}(0))
Pitfalls
- Assuming E[g(X)] = g(E[X]). LOTUS says E[g(X)] = ∫ g(x) f_X(x) dx — you integrate g(x) against the original PDF. Plugging E[X] into g is correct ONLY when g is linear. For g(x) = x², E[X²] ≠ (E[X])²; for g(x) = 1/x, E[1/X] ≠ 1/E[X]. Jensen's inequality gives the direction for convex/concave functions.
- Assuming the MGF always exists. The MGF M_X(t) = E[e^{tX}] requires the integral to converge. For the Cauchy distribution, E[e^{tX}] diverges for all t ≠ 0. Even for distributions where E[X] exists, the MGF may not (e.g., log-normal). When in doubt, use the characteristic function, which ALWAYS exists.
- Confusing the characteristic function with the MGF. φ_X(t) = E[e^{itX}] (contains i = √(−1)); M_X(t) = E[e^{tX}] (no i). The characteristic function is always bounded (|φ_X(t)| ≤ 1) and always exists; the MGF can be unbounded and may not exist. Their derivatives give moments differently: E[Xᵏ] = M^{(k)}(0) = i^{−k} φ^{(k)}(0).
- Thinking kurtosis measures "peakedness." Excess kurtosis γ₂ measures TAIL WEIGHT relative to the normal distribution. A t-distribution with 5 degrees of freedom has excess kurtosis 6 (heavy tails) but a lower density at 0 than N(0, 1). The interpretation as "peakedness" is a persistent misconception.
- Forgetting the direction of Jensen's inequality. For convex g: E[g(X)] ≥ g(E[X]). For concave g: E[g(X)] ≤ g(E[X]). Check g''(x) to verify: g''(x) ≥ 0 ⇒ convex; g''(x) ≤ 0 ⇒ concave. Common examples: x², eˣ, 1/x (for x>0) are convex; ln(x), √x are concave.
Quiz
- The Law of the Unconscious Statistician (LOTUS) for continuous RVs says:
  a) E[g(X)] = g(E[X]) b) E[g(X)] = ∫ g(x) f_X(x) dx c) E[g(X)] = ∫ x f_{g(X)}(x) dx d) E[g(X)] = g(∫ x f_X(x) dx)
  Answer: b. LOTUS lets you use the original PDF of X, not the distribution of g(X).
- Which distribution has NO moment generating function (for any t ≠ 0)?
  a) Normal b) Exponential c) Cauchy d) Uniform
  Answer: c. The Cauchy has no finite mean, and its MGF does not exist for any t ≠ 0. Its characteristic function does exist: φ(t) = e^{−|t|}.
- The MGF of a sum of independent random variables is:
  a) The sum of individual MGFs b) The product of individual MGFs c) The average of individual MGFs d) Undefined
  Answer: b. M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX}]E[e^{tY}] = M_X(t) M_Y(t) by independence.
- Jensen's inequality for a convex function g states:
  a) E[g(X)] ≤ g(E[X]) b) E[g(X)] = g(E[X]) c) E[g(X)] ≥ g(E[X]) d) E[g(X)] = E[X] · g(1)
  Answer: c. For convex g, the function of the expectation is ≤ the expectation of the function.
- Skewness measures:
  a) The spread of the distribution b) The asymmetry of the distribution c) The peakedness of the distribution d) The range of the distribution
  Answer: b. Skewness γ₁ = E[(X−μ)³]/σ³. Positive = right-skewed, negative = left-skewed.
- The characteristic function φ_X(t) is defined as:
  a) E[e^{tX}] b) E[e^{itX}] c) E[e^{−tX}] d) E[cos(tX)]
  Answer: b. φ_X(t) = E[e^{itX}] where i = √(−1). This always exists because |e^{itX}| = 1.
- If M_X(t) = exp(2t + 8t²), then Var(X) = ?
  a) 2 b) 8 c) 16 d) 4
  Answer: c. This is the MGF of N(μ, σ²) with μ = 2 and σ²/2 = 8 → σ² = 16.
- The excess kurtosis of the normal distribution is:
  a) 3 b) 0 c) −3 d) 1
  Answer: b. Excess kurtosis = (μ₄/σ⁴) − 3. For normal, μ₄/σ⁴ = 3, so excess = 0 by definition.
Next Steps
Continue to 11-02 Covariance and Correlation for a deeper treatment of multivariate relationships, the covariance matrix, and the geometric interpretation of correlation.