Phase 11: Probability Theory II
Subject 11-03: Conditional Expectation
Prerequisites: 10-02 (Conditional Probability), 10-10 (Joint Distributions), 11-01 (Expectation for Continuous RVs)
Learning Objectives
- Define E[Y | X = x] for discrete and continuous random variables from conditional distributions
- State and apply the Law of Total Expectation (Adam's law): E[Y] = E[E[Y | X]]
- State and apply the Law of Total Variance (Eve's law): Var(Y) = E[Var(Y | X)] + Var(E[Y | X])
- Compute conditional expectations for bivariate normal distributions
- Use conditional expectation as a random variable E[Y | X] and compute its expectation and variance
Core Content
1. Conditional Expectation Definition
For fixed x with p_X(x) > 0 (discrete case) or f_X(x) > 0 (continuous case), the conditional expectation of Y given X = x is:
Discrete:
$E[Y | X = x] = Σ_y y · p_{Y|X}(y|x) $
Continuous:
$E[Y | X = x] = ∫_{−∞}^{∞} y · f_{Y|X}(y|x) dy $
where f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) is the conditional PDF.
⚠️ CRITICAL: E[Y | X = x] is a NUMBER — a function of x. For each specific value x, it gives a specific expected value. But E[Y | X] (without "= x") is a RANDOM VARIABLE — it's the function evaluated at the random X. This distinction is crucial for the laws that follow.
E[Y | X] as a random variable: Let g(x) = E[Y | X = x]. Then E[Y | X] = g(X) is a random variable — its randomness comes from X. Before observing X, we don't know what E[Y | X] will be. After observing X = x, it's the number g(x).
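The distinction between the number g(x) and the random variable g(X) can be made concrete with a tiny discrete example. The sketch below assumes a hypothetical joint PMF (the table values are illustrative, not from the text) and uses exact fractions; `marginal_x` and `cond_exp` are helper names introduced here.

```python
from fractions import Fraction

# Hypothetical joint PMF p(x, y); the values are illustrative only.
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 2): Fraction(3, 8),
}

def marginal_x(joint):
    """p_X(x) = sum over y of p(x, y)."""
    p_x = {}
    for (x, _y), p in joint.items():
        p_x[x] = p_x.get(x, Fraction(0)) + p
    return p_x

def cond_exp(joint, x):
    """g(x) = E[Y | X = x] = sum_y y * p(x, y) / p_X(x)."""
    p_x = marginal_x(joint)[x]
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / p_x

p_x = marginal_x(joint)
g = {x: cond_exp(joint, x) for x in p_x}

# Each g(x) is a number; E[Y | X] = g(X) takes the value g(x) with probability p_X(x).
for x in sorted(g):
    print(f"g({x}) = {g[x]}  with probability p_X({x}) = {p_x[x]}")
```

Here g(0) = 1/2 and g(1) = 3/2, so in this toy model E[Y | X] is a random variable taking each of those two values with probability 1/2.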
2. Law of Total Expectation (Adam's Law)
$E[Y] = E[E[Y | X]] $
Discrete proof:
$E[E[Y | X]] = Σ_x E[Y | X = x] p_X(x)
= Σ_x (Σ_y y p_{Y|X}(y|x)) p_X(x)
= Σ_y y Σ_x p_{Y|X}(y|x) p_X(x)
= Σ_y y p_Y(y) = E[Y] $
Continuous: Same structure with integrals replacing sums.
Why it's powerful: To find E[Y], you can first condition on X (making the problem easier), find the conditional expectation as a function of X, then average over X. This "divide and conquer" strategy often simplifies complex problems.
Example: Find E[Y] where Y | X ~ Binomial(X, 0.5) and X ~ Poisson(λ).
- E[Y | X] = 0.5X (mean of a binomial)
- E[Y] = E[E[Y | X]] = E[0.5X] = 0.5E[X] = 0.5λ
No need to find the marginal distribution of Y!
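As a numerical sanity check of the Poisson-Binomial example, the sketch below compares Adam's law against a brute-force sum over the joint distribution. It assumes an illustrative value λ = 3 and truncates the Poisson sum at 60 terms; both choices are assumptions made here, not part of the example.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def binom_pmf(y, n, p):
    """P(Y = y) for Y ~ Binomial(n, p)."""
    return math.comb(n, y) * p**y * (1 - p)**(n - y)

lam = 3.0    # illustrative value of lambda (assumption)
N_MAX = 60   # truncation point; Poisson(3) mass beyond 60 is negligible

# Adam's law: average the conditional means 0.5 * x over the distribution of X.
e_y_adam = sum(poisson_pmf(x, lam) * 0.5 * x for x in range(N_MAX + 1))

# Brute force: sum y * p(x, y) over the joint distribution,
# without using the conditional mean shortcut.
e_y_direct = sum(
    y * poisson_pmf(x, lam) * binom_pmf(y, x, 0.5)
    for x in range(N_MAX + 1)
    for y in range(x + 1)
)

print(e_y_adam, e_y_direct, 0.5 * lam)
```

Both sums should agree with 0.5λ up to floating-point and truncation error, while the Adam's law version needs only a single sum.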
Adam's law generalizes: E[g(Y)] = E[E[g(Y) | X]] for any function g.
3. Law of Total Variance (Eve's Law)
$Var(Y) = E[Var(Y | X)] + Var(E[Y | X]) $
Components:
- E[Var(Y | X)]: average of the conditional variances, the "within-group" variability
- Var(E[Y | X]): variance of the conditional means, the "between-group" variability
Proof: Start from Var(Y) = E[Y²] − (E[Y])². By Adam's law:
- E[Y²] = E[E[Y² | X]]
- (E[Y])² = (E[E[Y | X]])²
Substitute, then add and subtract E[(E[Y | X])²]:
Var(Y) = E[E[Y² | X]] − (E[E[Y | X]])²
= E[E[Y² | X] − (E[Y | X])²] + E[(E[Y | X])²] − (E[E[Y | X]])²
= E[Var(Y | X)] + Var(E[Y | X]) ✓
Intuition: Total variance = average variance within groups + variance of group means. Think ANOVA — total variation decomposes into within-treatment and between-treatment variation.
Example: Y | X ~ N(X, X²) and X ~ N(0, 1).
- E[Y | X] = X, Var(Y | X) = X²
- E[Y] = E[X] = 0
- Var(Y) = E[Var(Y | X)] + Var(E[Y | X]) = E[X²] + Var(X) = 1 + 1 = 2
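The Eve's law example can also be checked by simulation. This is a rough Monte Carlo sketch using only the standard library; the seed and sample size are arbitrary choices made here, and the estimates only approximate the exact values 0 and 2.

```python
import random

random.seed(0)   # arbitrary seed for reproducibility
N = 200_000      # arbitrary sample size

ys = []
for _ in range(N):
    x = random.gauss(0.0, 1.0)    # X ~ N(0, 1)
    y = random.gauss(x, abs(x))   # Y | X = x ~ N(x, x^2), so the sd is |x|
    ys.append(y)

mean_y = sum(ys) / N
var_y = sum((y - mean_y) ** 2 for y in ys) / N

print(f"sample mean of Y     = {mean_y:.3f}   (Adam's law gives 0)")
print(f"sample variance of Y = {var_y:.3f}   (Eve's law gives 2)")
```

Note that the second argument of `random.gauss` is a standard deviation, so the conditional variance X² enters as |x|.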
4. Conditional Expectation for Bivariate Normal
For (X, Y) jointly bivariate normal with parameters μ_X, μ_Y, σ_X², σ_Y², ρ:
$E[Y | X = x] = μ_Y + ρ(σ_Y/σ_X)(x − μ_X) $
Key properties:
1. Linearity: E[Y | X] is a LINEAR function of X, a special property of the bivariate normal
2. Regression interpretation: The slope is ρ(σ_Y/σ_X) and the intercept is μ_Y − ρ(σ_Y/σ_X)μ_X
3. Homoscedasticity: Var(Y | X) = σ_Y²(1 − ρ²) is CONSTANT; it does not depend on x
Implication for prediction: Given X = x, the best prediction of Y (minimizing mean squared error) is E[Y | X = x] = μ_Y + ρ(σ_Y/σ_X)(x − μ_X). For bivariate normal, this is also the best LINEAR predictor, but more generally the conditional expectation may be nonlinear.
Regression to the mean: If |ρ| < 1 and x ≠ μ_X, then E[Y | X = x] is closer to μ_Y (in standardized units) than x is to μ_X. For example, with 0 < ρ < 1, a student scoring 3σ above the mean on test 1 is predicted to score less than 3σ above the mean on test 2. This is NOT a law of nature; it is a consequence of imperfect correlation.
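Standardizing the conditional mean formula makes the regression-to-the-mean statement explicit. The short derivation below uses only the symbols already defined, plus z_x as shorthand (introduced here) for the standardized value of x.

```latex
z_x = \frac{x - \mu_X}{\sigma_X},
\qquad
\frac{E[Y \mid X = x] - \mu_Y}{\sigma_Y}
  = \rho \cdot \frac{x - \mu_X}{\sigma_X}
  = \rho\, z_x,
\qquad
|\rho\, z_x| < |z_x| \quad \text{whenever } |\rho| < 1 \text{ and } z_x \neq 0 .
```

For instance, with an illustrative ρ = 0.7 and z_x = 3, the predicted standardized score is 0.7 · 3 = 2.1.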
5. Conditional Expectation as a Projection
In L² space (random variables with finite second moments), the conditional expectation E[Y | X] has a geometric interpretation:
E[Y | X] is the orthogonal projection of Y onto the space of all functions of X.
This means:
1. E[Y | X] is a function of X (obviously)
2. For any function h(X), the error Y − E[Y | X] is uncorrelated with h(X): E[(Y − E[Y | X]) h(X)] = 0
This is the property that makes E[Y | X] the "best predictor" of Y given X in the mean-squared-error sense:
$E[Y | X] = argmin_{g(X)} E[(Y − g(X))²] $
Among all functions of X, the conditional expectation minimizes the expected squared prediction error.
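One way to see why the orthogonality property implies the minimization claim is to expand the squared error around E[Y | X]. The derivation below is a sketch that assumes all quantities have finite second moments; it applies the orthogonality property with h(X) = E[Y | X] − g(X).

```latex
\begin{aligned}
E\big[(Y - g(X))^2\big]
 &= E\big[(Y - E[Y \mid X])^2\big]
  + 2\,E\big[(Y - E[Y \mid X])\,(E[Y \mid X] - g(X))\big]
  + E\big[(E[Y \mid X] - g(X))^2\big] \\
 &= E\big[(Y - E[Y \mid X])^2\big] + E\big[(E[Y \mid X] - g(X))^2\big]
 \;\ge\; E\big[(Y - E[Y \mid X])^2\big].
\end{aligned}
```

The cross term vanishes by orthogonality, and equality holds when g(X) = E[Y | X] (with probability 1).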
⚠️ Important distinction:
- Best LINEAR predictor: Ê[Y | X] = μ_Y + (Cov(X,Y)/Var(X))(X − μ_X), which minimizes MSE among linear functions
- Best predictor: E[Y | X], which minimizes MSE among ALL functions
- For bivariate normal, these coincide
- For other distributions, E[Y | X] may be nonlinear and strictly better than any linear predictor
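A small exact computation can illustrate the gap between the two predictors. The sketch below assumes a hypothetical toy model (X uniform on {−1, 0, 1} and Y = X² with no noise), chosen here only because everything can be computed with exact fractions.

```python
from fractions import Fraction

# Hypothetical toy model: X uniform on {-1, 0, 1}, Y = X**2 exactly (no noise).
xs = [-1, 0, 1]
p = Fraction(1, 3)

def E(f):
    """Expectation of f(X) under the uniform distribution on xs."""
    return sum(p * f(x) for x in xs)

mu_x = E(lambda x: x)
mu_y = E(lambda x: x**2)
cov_xy = E(lambda x: x * x**2) - mu_x * mu_y
var_x = E(lambda x: x**2) - mu_x**2

slope = cov_xy / var_x

def best_linear(x):
    """Best linear predictor: mu_Y + (Cov(X, Y) / Var(X)) * (x - mu_X)."""
    return mu_y + slope * (x - mu_x)

def cond_mean(x):
    """E[Y | X = x] = x^2 in this model."""
    return Fraction(x**2)

mse_linear = E(lambda x: (x**2 - best_linear(x)) ** 2)
mse_cond = E(lambda x: (x**2 - cond_mean(x)) ** 2)

print("slope of best linear predictor:", slope)
print("MSE of best linear predictor:  ", mse_linear)
print("MSE of E[Y | X]:               ", mse_cond)
```

In this model Cov(X, Y) = 0, so the best linear predictor is the constant 2/3 with MSE 2/9, while E[Y | X] = X² has MSE 0.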
Worked Examples
Example 1: Adam's Law in a Hierarchical Model
X ~ Uniform(0, 1). Given X = x, Y ~ Binomial(3, x). Find E[Y] and Var(Y).
Solution:
E[Y | X] = 3X (mean of Binomial(3, X)). By Adam's law: E[Y] = E[3X] = 3E[X] = 3(0.5) = 1.5.
Var(Y | X) = 3X(1−X) (variance of Binomial(3, X)). By Eve's law: Var(Y) = E[3X(1−X)] + Var(3X).
E[3X(1−X)] = 3(E[X] − E[X²]) = 3(1/2 − 1/3) = 3(1/6) = 0.5. (E[X²] from Uniform(0,1) = 1/3). Var(3X) = 9 Var(X) = 9(1/12) = 0.75.
Var(Y) = 0.5 + 0.75 = 1.25.
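Check: Marginally, Y is Beta-Binomial(n = 3, α = 1, β = 1), since Uniform(0, 1) is Beta(1, 1). For this distribution E[Y] = nα/(α+β) = 3·(1/2) = 1.5 ✓ and Var(Y) = nαβ(α+β+n) / ((α+β)²(α+β+1)) = 3·1·1·5 / (4·3) = 1.25 ✓, matching the answer from Eve's law. (Equivalently, Y is uniform on {0, 1, 2, 3}.)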
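The values in Example 1 can also be verified exactly by integrating X out of the model. The sketch below uses the standard Beta integral ∫₀¹ x^k (1−x)^(n−k) dx = k!(n−k)!/(n+1)! to build the marginal PMF of Y; `pmf_y` is a helper name introduced here.

```python
from fractions import Fraction
from math import comb, factorial

n = 3

def pmf_y(k):
    """P(Y = k) = C(n, k) * integral_0^1 x^k (1 - x)^(n - k) dx.

    The integral equals the Beta function B(k + 1, n - k + 1)
    = k! (n - k)! / (n + 1)!.
    """
    beta = Fraction(factorial(k) * factorial(n - k), factorial(n + 1))
    return comb(n, k) * beta

pmf = {k: pmf_y(k) for k in range(n + 1)}
mean_y = sum(k * p for k, p in pmf.items())
var_y = sum((k - mean_y) ** 2 * p for k, p in pmf.items())

print(pmf)      # every value is 1/4
print(mean_y)   # 3/2
print(var_y)    # 5/4
```

The marginal PMF puts mass 1/4 on each of 0, 1, 2, 3, which reproduces E[Y] = 1.5 and Var(Y) = 1.25 without Adam's or Eve's law.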
Example 2: Conditional Expectation as a Random Variable
Suppose X ~ Exponential(1), and Y | X = x ~ Uniform(0, x). Find:
(a) E[Y | X] as a random variable
(b) E[Y] using Adam's law
(c) E[(E[Y | X])²] and Var(E[Y | X])
Solution:
(a) E[Y | X = x] = x/2 (mean of Uniform(0, x)). So E[Y | X] = X/2 (a random variable).
(b) E[Y] = E[E[Y | X]] = E[X/2] = (1/2)E[X] = 1/2. (Since E[X] = 1 for Exp(1)).
(c) E[Y | X] = X/2, so E[(E[Y | X])²] = E[(X/2)²] = E[X²]/4. For Exp(1), E[X²] = 2, so E[(E[Y | X])²] = 2/4 = 1/2. Then Var(E[Y | X]) = E[(E[Y|X])²] − (E[E[Y|X]])² = 1/2 − (1/2)² = 1/4.
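Example 2 treats E[Y | X] = X/2 as a random variable, and a simulation makes that tangible: each draw of X produces one realization of the conditional mean. The following is a rough Monte Carlo sketch with an arbitrary seed and sample size, so the printed estimates only approximate 1/2 and 1/4.

```python
import random

random.seed(1)   # arbitrary seed for reproducibility
N = 200_000      # arbitrary sample size

cond_means = []  # realizations of E[Y | X] = X / 2
ys = []
for _ in range(N):
    x = random.expovariate(1.0)   # X ~ Exponential(rate 1)
    y = random.uniform(0.0, x)    # Y | X = x ~ Uniform(0, x)
    cond_means.append(x / 2)
    ys.append(y)

def mean(values):
    return sum(values) / len(values)

def var(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

e_y = mean(ys)
e_cond = mean(cond_means)
var_cond = var(cond_means)

print(f"mean of Y          = {e_y:.3f}   (exact: 1/2)")
print(f"mean of E[Y | X]   = {e_cond:.3f}   (exact: 1/2)")
print(f"Var(E[Y | X])      = {var_cond:.3f}   (exact: 1/4)")
```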
Example 3: Bivariate Normal Prediction
In a population, height (H) and weight (W) are jointly normal: H ~ N(170, 100), W ~ N(70, 225), ρ = 0.7.
(a) Given someone is 185 cm tall, what's the expected weight?
(b) What proportion of people 185 cm tall weigh over 85 kg?
Solution:
(a) E[W | H = 185] = 70 + 0.7·(15/10)(185−170) = 70 + 0.7·1.5·15 = 70 + 15.75 = 85.75 kg.
(b) W | H = 185 ~ N(85.75, σ²_{W|H}) where σ²_{W|H} = 225(1−0.49) = 225·0.51 = 114.75. σ_{W|H} = √114.75 ≈ 10.71.
P(W > 85 | H = 185) = P(Z > (85−85.75)/10.71) = P(Z > −0.07) = Φ(0.07) ≈ 0.528. About 53%.
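The arithmetic in Example 3 can be reproduced with the standard library's `statistics.NormalDist`. In the sketch below, `conditional_normal` is a helper name introduced here; it simply encodes the conditional mean and variance formulas from Section 4.

```python
from math import sqrt
from statistics import NormalDist

def conditional_normal(mu_x, mu_y, sd_x, sd_y, rho, x):
    """Return (mean, sd) of Y | X = x for a bivariate normal (X, Y)."""
    mean = mu_y + rho * (sd_y / sd_x) * (x - mu_x)
    sd = sd_y * sqrt(1 - rho**2)
    return mean, sd

# Example 3: X = height (mean 170, sd 10), Y = weight (mean 70, sd 15), rho = 0.7.
mean_w, sd_w = conditional_normal(170, 70, 10, 15, 0.7, 185)

# P(W > 85 | H = 185) from the conditional normal distribution.
p_over_85 = 1 - NormalDist(mean_w, sd_w).cdf(85)

print(f"E[W | H = 185]      = {mean_w:.2f} kg")
print(f"sd(W | H = 185)     = {sd_w:.2f} kg")
print(f"P(W > 85 | H = 185) = {p_over_85:.3f}")
```

Note that the example gives variances (100 and 225), while the helper takes standard deviations (10 and 15).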
Quiz
Q1: The Law of Total Expectation (Adam's Law) states:
A) E[X] = E[E[X | Y]] B) E[X] = E[X | Y] C) E[X | Y] = E[Y | X] D) E[XY] = E[X]E[Y]
Correct: A)
- If you chose A: Correct! E[X] = E[E[X|Y]] — the expected value of the conditional expectation equals the unconditional expectation. This is a powerful tool for computing expectations by conditioning.
- If you chose B: E[X|Y] is a random variable (function of Y), while E[X] is a constant. They're not equal in general.
- If you chose C: This is generally false — it would require symmetry not present in most joint distributions.
- If you chose D: This holds exactly when X and Y are uncorrelated. It is a different statement from Adam's Law, which needs no such assumption.
Q2: The Law of Total Variance (Eve's Law) decomposes Var(X) as:
A) Var(E[X|Y]) + Var(X|Y) B) E[Var(X|Y)] + Var(E[X|Y]) C) E[X|Y] + Var(X|Y) D) Var(X|Y) × E[Var(X|Y)]
Correct: B)
- If you chose B: Correct! Var(X) = E[Var(X|Y)] + Var(E[X|Y]). "Within-group variance" plus "between-group variance."
- If you chose A: Var(E[X|Y]) is one of the two terms, but Var(X|Y) is a random variable, so the other term must be its expectation E[Var(X|Y)].
- If you chose C: This mixes expectations and variances incorrectly.
- If you chose D: This is not a valid decomposition of variance.
Q3: E[X | Y] as a function of Y is:
A) Always constant B) A random variable that is a function of Y C) Equal to E[X] D) Independent of Y
Correct: B)
- If you chose B: Correct! E[X|Y] is a random variable — its value depends on the observed Y. When Y = y, it equals the conditional mean of X given that value.
- If you chose A: It is constant only in special cases, for example when X and Y are independent.
- If you chose C: The unconditional expectation E[X] equals the expectation OF E[X|Y], not E[X|Y] itself.
- If you chose D: By definition, E[X|Y] is a function of Y, so it depends on Y.
Q5: For a bivariate normal distribution, E[X | Y = y] takes what form?
A) A constant independent of y B) A linear function of y: μ_X + ρ(σ_X/σ_Y)(y − μ_Y) C) A quadratic function of y D) E[X] regardless of y
Correct: B)
- If you chose B: Correct! For bivariate normal, the conditional expectation is linear in y. The slope is ρ·σ_X/σ_Y, reflecting the regression-to-the-mean effect.
- If you chose A: Only if ρ = 0 (uncorrelated, hence independent for normal).
- If you chose C: The conditional expectation is linear for bivariate normal, not quadratic.
- If you chose D: This ignores the information Y provides — the conditional mean updates with Y.
Practice Problems
- X ~ Poisson(λ). Given X = x, Y ~ Binomial(x, p). Find E[Y] and Var(Y) using Adam's and Eve's laws.
- f_{X,Y}(x,y) = 2 for 0 < x < y < 1. Find E[Y | X = 0.3] and E[Y | X] as a function of X.
- Prove Eve's law: Var(Y) = E[Var(Y | X)] + Var(E[Y | X]).
- X ~ Geometric(p) with P(X=k) = (1−p)^{k−1}p. Given X = k, Y is the sum of k i.i.d. Bernoulli(0.5) trials. Find E[Y].
- For bivariate normal (X, Y) with μ_X=μ_Y=0, σ_X=σ_Y=1, ρ=0.8, what is E[Y² | X = 1]? (Hint: E[Y²|X] = Var(Y|X) + (E[Y|X])².)
- Show that E[(Y − E[Y | X]) h(X)] = 0 for any function h.
- A fair coin is flipped N times, where N ~ Poisson(λ). Let Y be the number of heads. Find E[Y] and Var(Y).
Answers
1. E[Y | X] = pX. E[Y] = E[pX] = pλ. Var(Y | X) = Xp(1−p). Var(Y) = E[Xp(1−p)] + Var(pX) = p(1−p)λ + p²λ = pλ(1−p+p) = pλ.
2. For 0 < x < y < 1: f_X(x) = ∫_x¹ 2 dy = 2(1−x). f_{Y|X}(y|x) = 2/(2(1−x)) = 1/(1−x) for x < y < 1 (uniform on [x, 1]). E[Y | X = x] = (x+1)/2 (midpoint). At x=0.3: E[Y|X=0.3] = (0.3+1)/2 = 0.65.
3. See proof in Core Content section 3.
4. E[Y | X = k] = k(0.5) = k/2. E[Y] = E[X/2] = (1/2)E[X] = (1/2)(1/p) = 1/(2p).
5. E[Y|X=1] = 0 + 0.8(1)(1−0) = 0.8. Var(Y|X) = 1·(1−0.64) = 0.36. E[Y²|X=1] = 0.36 + 0.64 = 1.0.
6. E[(Y − E[Y|X])h(X)] = E[E[(Y − E[Y|X])h(X) | X]]. Since h(X) and E[Y|X] are functions of X, they come out of the inner expectation: = E[h(X)(E[Y|X] − E[Y|X])] = E[h(X)·0] = 0.
7. Y | N ~ Binomial(N, 0.5). E[Y|N] = N/2, E[Y] = E[N/2] = λ/2. Var(Y|N) = N(0.5)(0.5) = N/4. Var(E[Y|N]) = Var(N/2) = λ/4. So Var(Y) = E[N/4] + λ/4 = λ/4 + λ/4 = λ/2.
Summary
- E[Y | X = x] is computed from the conditional distribution; E[Y | X] is a random variable (a function of X)
- Adam's law (total expectation): E[Y] = E[E[Y | X]] — average the conditional expectations to get the unconditional expectation; this is a divide-and-conquer strategy
- Eve's law (total variance): Var(Y) = E[Var(Y | X)] + Var(E[Y | X]) — total variance = within-group + between-group variance
- For bivariate normal, E[Y | X] is linear in X with slope ρ(σ_Y/σ_X) and constant conditional variance σ_Y²(1−ρ²)
- E[Y | X] is the orthogonal projection of Y onto functions of X in L² — it is the best predictor in the mean-squared-error sense (beating any linear predictor when the relationship is nonlinear)
Pitfalls
- Confusing E[Y | X = x] (a number) with E[Y | X] (a random variable). E[Y | X = 3] returns a specific value. E[Y | X] returns a function of X — its value is unknown until X is observed. This distinction matters when computing variances: Var(E[Y | X]) is a number, while Var(Y | X) is a random variable.
- Misapplying Adam's law by forgetting the outer expectation is over X. E[Y] = E[E[Y | X]] means compute g(x) = E[Y | X = x], then average g(X) using the distribution of X: E[Y] = Σ g(x) p_X(x). Stopping after finding E[Y | X = x] and reporting it as E[Y] is incorrect — that's the conditional expectation at a specific x, not the overall mean.
- Assuming E[Y | X] is always a linear function of X. Linearity of E[Y | X] is guaranteed for the bivariate normal, not in general. Some non-normal models happen to give a linear or constant form: if f(x,y) = 2 for 0 < x < y < 1, then E[Y | X = x] = (x+1)/2, and if f(x,y) = 6xy² for 0 < x,y < 1, then X and Y are independent and E[Y | X] = 3/4 is constant. But if Y = X² + ε with E[ε | X] = 0, then E[Y | X] = X², which is not linear. The conditional expectation can take any functional form.
- Thinking E[g(Y) | X] = g(E[Y | X]). This holds only when g is linear. In general, E[Y² | X] ≠ (E[Y | X])² — the gap is Var(Y | X). Similarly, E[e^Y | X] ≠ e^{E[Y | X]}.
- Forgetting that the best predictor in the MSE sense is E[Y | X], not the linear regression line. The linear predictor β₀ + β₁X minimizes MSE only among LINEAR functions. If the true relationship is nonlinear (e.g., Y = X² + ε with E[ε | X] = 0), the conditional expectation E[Y | X] = X² achieves lower MSE than any straight line. The linear predictor is an approximation, not the truth.
Quiz
- E[Y | X] is: a) Always a constant b) A random variable that is a function of X c) Equal to E[Y] d) The same as E[Y | X = 0] Answer: b. E[Y | X] = g(X) where g(x) = E[Y | X = x]; it is a random variable whose randomness comes from X.
- Adam's law states: a) E[XY] = E[X]E[Y] b) E[Y] = E[E[Y | X]] c) E[Y] = E[Y | E[X]] d) E[Y | X] = E[Y] Answer: b. E[Y] = E[E[Y | X]]: average the conditional expectations.
- In Eve's law, E[Var(Y | X)] represents: a) Total variance of Y b) Between-group variance c) Average within-group variance d) Covariance of X and Y Answer: c. It's the expected value of the conditional variance, that is, the average variability within groups defined by X.
- For bivariate normal (X, Y), E[Y | X] is: a) Always constant b) A linear function of X c) A quadratic function of X d) Equal to E[X] Answer: b. E[Y | X = x] = μ_Y + ρ(σ_Y/σ_X)(x − μ_X) is linear in x.
- If E[Y | X] = E[Y] (constant), then: a) Y is independent of X b) Cov(X, Y) = 0 c) X and Y are equal d) X has zero mean Answer: b. If E[Y | X] is constant, then by Adam's law it equals E[Y]. Then Var(E[Y|X]) = 0, and Cov(X,Y) = E[XY] − E[X]E[Y] = E[XE[Y|X]] − E[X]E[Y] = E[XE[Y]] − E[X]E[Y] = 0. (But X and Y could still be dependent via higher moments.)
- For bivariate normal with ρ = 0.9, Var(Y | X) compared to Var(Y) is: a) Larger b) Equal c) Smaller d) Zero Answer: c. Var(Y | X) = σ_Y²(1−ρ²). With ρ=0.9, (1−ρ²) = 0.19, so the conditional variance is 19% of the unconditional variance. It is much smaller because X explains most of Y's variation.
- The best predictor of Y given X (minimizing MSE) is: a) The best linear predictor b) E[Y | X] c) E[X | Y] d) The sample mean of Y Answer: b. E[Y | X] minimizes E[(Y − g(X))²] over all functions g. The linear predictor is best only among linear functions.
- If X ~ Uniform(0,1) and Y | X = x ~ Exponential(1/x) (rate 1/x), then E[Y] = ? a) 1/2 b) ∞ c) Does not exist d) 2 Answer: a. Exponential(λ) has mean 1/λ. Here λ = 1/x, so E[Y | X = x] = x and E[Y | X] = X. By Adam's law, E[Y] = E[X] = 1/2.
Next Steps
Continue to 11-04 Transformations of Random Variables to learn the Jacobian method, the CDF method, and how to handle non-monotonic transformations.