
Phase 11: Probability Theory II

Subject 11-03: Conditional Expectation

Prerequisites: 10-02 (Conditional Probability), 10-10 (Joint Distributions), 11-01 (Expectation for Continuous RVs)


Learning Objectives

  1. Define E[Y | X = x] for discrete and continuous random variables from conditional distributions
  2. State and apply the Law of Total Expectation (Adam's law): E[Y] = E[E[Y | X]]
  3. State and apply the Law of Total Variance (Eve's law): Var(Y) = E[Var(Y | X)] + Var(E[Y | X])
  4. Compute conditional expectations for bivariate normal distributions
  5. Use conditional expectation as a random variable E[Y | X] and compute its expectation and variance

Core Content

1. Conditional Expectation Definition

For fixed x with p_X(x) > 0 (discrete case) or f_X(x) > 0 (continuous case), the conditional expectation of Y given X = x is:

Discrete:

$E[Y | X = x] = Σ_y y · p_{Y|X}(y|x)
$

Continuous:

$E[Y | X = x] = ∫_{−∞}^{∞} y · f_{Y|X}(y|x) dy
$

where f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) is the conditional PDF.

⚠️ CRITICAL: E[Y | X = x] is a NUMBER — a function of x. For each specific value x, it gives a specific expected value. But E[Y | X] (without "= x") is a RANDOM VARIABLE — it's the function evaluated at the random X. This distinction is crucial for the laws that follow.

E[Y | X] as a random variable: Let g(x) = E[Y | X = x]. Then E[Y | X] = g(X) is a random variable — its randomness comes from X. Before observing X, we don't know what E[Y | X] will be. After observing X = x, it's the number g(x).
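The distinction between the number g(x) = E[Y | X = x] and the random variable g(X) can be made concrete with a small table. The sketch below uses a hypothetical joint pmf (the values are illustrative assumptions, not from the lesson) and exact fractions.

```python
from fractions import Fraction as F

# Hypothetical joint pmf p(x, y) with x in {0, 1} and y in {0, 1, 2}.
# These probabilities are illustrative assumptions only.
joint = {
    (0, 0): F(1, 10), (0, 1): F(2, 10), (0, 2): F(1, 10),
    (1, 0): F(1, 10), (1, 1): F(1, 10), (1, 2): F(4, 10),
}

def marginal_x(joint):
    """p_X(x) = sum over y of p(x, y)."""
    p_x = {}
    for (x, _y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p
    return p_x

def cond_exp(joint, x):
    """g(x) = E[Y | X = x] = sum_y y * p(x, y) / p_X(x)."""
    p_x = marginal_x(joint)[x]
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / p_x

# g maps each possible value x to the NUMBER E[Y | X = x].
g = {x: cond_exp(joint, x) for x in marginal_x(joint)}

# E[Y | X] = g(X) is a RANDOM VARIABLE: it takes value g[x] with probability p_X(x).
dist_of_cond_exp = {g[x]: p for x, p in marginal_x(joint).items()}

print("g(x):", g)
print("distribution of E[Y | X]:", dist_of_cond_exp)
```

Here g(0) and g(1) are fixed numbers, while E[Y | X] is the two-valued random variable that equals g(0) or g(1) depending on which X is observed.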

2. Law of Total Expectation (Adam's Law)

$E[Y] = E[E[Y | X]]
$

Discrete proof:

$E[E[Y | X]] = Σ_x E[Y | X = x] p_X(x)
            = Σ_x (Σ_y y p_{Y|X}(y|x)) p_X(x)
            = Σ_y y Σ_x p_{Y|X}(y|x) p_X(x)
            = Σ_y y p_Y(y) = E[Y]
$

Continuous: Same structure with integrals replacing sums.

Why it's powerful: To find E[Y], you can first condition on X (making the problem easier), find the conditional expectation as a function of X, then average over X. This "divide and conquer" strategy often simplifies complex problems.

Example: Find E[Y] where Y | X ~ Binomial(X, 0.5) and X ~ Poisson(λ).

- E[Y | X] = 0.5X (mean of the binomial)
- E[Y] = E[E[Y | X]] = E[0.5X] = 0.5E[X] = 0.5λ

No need to find the marginal distribution of Y!

Adam's law generalizes: E[g(Y)] = E[E[g(Y) | X]] for any function g.
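As a numerical sanity check of the Poisson-Binomial example, the sketch below computes E[Y] two ways: by averaging the conditional means (Adam's law) and by building the marginal pmf of Y first. The rate `lam = 3.0` and the truncation point `N_MAX` are assumptions chosen for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

lam = 3.0    # illustrative rate (assumption)
N_MAX = 60   # truncation point; the Poisson(3) tail beyond 60 is negligible

# Route 1 (Adam's law): average E[Y | X = x] = 0.5 * x over the distribution of X.
adam = sum(poisson_pmf(x, lam) * 0.5 * x for x in range(N_MAX + 1))

# Route 2 (direct): compute the marginal pmf of Y, then take its mean.
def marginal_y(y):
    return sum(poisson_pmf(x, lam) * binom_pmf(y, x, 0.5) for x in range(y, N_MAX + 1))

direct = sum(y * marginal_y(y) for y in range(N_MAX + 1))

print(adam, direct, 0.5 * lam)
```

Both routes should agree with 0.5λ up to floating-point error; the first needs a single sum, while the second needs a double sum over the joint distribution.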

3. Law of Total Variance (Eve's Law)

$Var(Y) = E[Var(Y | X)] + Var(E[Y | X])
$

Components:

- E[Var(Y | X)]: average of the conditional variances, the "within-group" variability
- Var(E[Y | X]): variance of the conditional means, the "between-group" variability

Proof: Var(Y) = E[Y²] − (E[Y])². By Adam's law:

- E[Y²] = E[E[Y² | X]]
- (E[Y])² = (E[E[Y | X]])²

Substituting, then adding and subtracting E[(E[Y | X])²]:

$Var(Y) = E[E[Y² | X]] − (E[E[Y | X]])²
       = E[E[Y² | X] − (E[Y | X])²] + E[(E[Y | X])²] − (E[E[Y | X]])²
       = E[Var(Y | X)] + Var(E[Y | X]) ✓
$

Intuition: Total variance = average variance within groups + variance of group means. Think ANOVA — total variation decomposes into within-treatment and between-treatment variation.

Example: Y | X ~ N(X, X²) and X ~ N(0, 1).

- E[Y | X] = X and Var(Y | X) = X²
- E[Y] = E[X] = 0
- Var(Y) = E[X²] + Var(X) = 1 + 1 = 2
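The decomposition in this example can be checked by simulation. The following is a rough Monte Carlo sketch, assuming a fixed seed and a sample size of 100,000 (both arbitrary choices); the estimates will only be close to the exact values, not equal to them.

```python
import random
from statistics import fmean

def variance(values):
    """Population variance of a list of numbers."""
    m = fmean(values)
    return fmean((v - m) ** 2 for v in values)

random.seed(0)   # arbitrary seed for reproducibility
n = 100_000      # arbitrary sample size

xs = [random.gauss(0.0, 1.0) for _ in range(n)]   # X ~ N(0, 1)
# Given X = x, draw Y ~ N(x, x^2); the standard deviation is |x|.
ys = [random.gauss(x, abs(x)) for x in xs]

within = fmean(x * x for x in xs)   # estimates E[Var(Y | X)] = E[X^2]
between = variance(xs)              # estimates Var(E[Y | X]) = Var(X)
total = variance(ys)                # estimates Var(Y)

print(f"within  ~ {within:.3f}")
print(f"between ~ {between:.3f}")
print(f"total   ~ {total:.3f}")
```

The within and between estimates should each be near 1, and their sum should be near the directly estimated total variance of about 2.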

4. Conditional Expectation for Bivariate Normal

For (X, Y) jointly bivariate normal with parameters μ_X, μ_Y, σ_X², σ_Y², ρ:

$E[Y | X = x] = μ_Y + ρ(σ_Y/σ_X)(x − μ_X)
$

Key properties:

  1. Linearity: E[Y | X] is a LINEAR function of X, a special property of the bivariate normal
  2. Regression interpretation: the slope is ρ(σ_Y/σ_X) and the intercept is μ_Y − ρ(σ_Y/σ_X)μ_X
  3. Homoscedasticity: Var(Y | X) = σ_Y²(1 − ρ²) is CONSTANT; it does not depend on x

Implication for prediction: Given X = x, the best prediction of Y (minimizing mean squared error) is E[Y | X = x] = μ_Y + ρ(σ_Y/σ_X)(x − μ_X). For bivariate normal, this is also the best LINEAR predictor, but more generally the conditional expectation may be nonlinear.

Regression to the mean: If |ρ| < 1 and x ≠ μ_X, then E[Y | X = x] is closer to μ_Y (in standardized units) than x is to μ_X, because the predicted z-score of Y is ρ times the z-score of x. A student scoring 3σ above the mean on test 1 is predicted to score 3ρσ above the mean on test 2, which is less than 3σ whenever 0 < ρ < 1. This is NOT a law of nature; it is a consequence of imperfect correlation.
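The conditional mean and variance formulas translate directly into a small helper. This is a minimal sketch; the function name `conditional_normal` and the standardized example values (means 0, standard deviations 1, ρ = 0.6) are assumptions for illustration.

```python
def conditional_normal(x, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Return (E[Y | X = x], Var(Y | X = x)) for a bivariate normal pair."""
    slope = rho * sigma_y / sigma_x
    mean = mu_y + slope * (x - mu_x)
    var = sigma_y**2 * (1.0 - rho**2)   # does not depend on x
    return mean, var

# Standardized illustration (assumed values): both scores have mean 0 and SD 1.
for z in (1.0, 2.0, 3.0):
    mean, var = conditional_normal(z, 0.0, 0.0, 1.0, 1.0, 0.6)
    print(f"test 1 at {z:+.0f} SD: predicted test 2 at {mean:+.1f} SD, conditional variance {var:.2f}")
```

Each predicted score is pulled toward the mean by the factor ρ, and the conditional variance is the same for every x, which illustrates both regression to the mean and homoscedasticity.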

5. Conditional Expectation as a Projection

In L² space (random variables with finite second moments), the conditional expectation E[Y | X] has a geometric interpretation:

E[Y | X] is the orthogonal projection of Y onto the space of all functions of X.

This means:

  1. E[Y | X] is a function of X (by construction)
  2. For any function h(X) with finite second moment, the error Y − E[Y | X] is orthogonal to h(X): E[(Y − E[Y | X]) h(X)] = 0. Because the error has mean zero, it is also uncorrelated with h(X).

This is the property that makes E[Y | X] the "best predictor" of Y given X in the mean-squared-error sense:

$E[Y | X] = argmin_{g(X)} E[(Y − g(X))²]
$

Among all functions of X, the conditional expectation minimizes the expected squared prediction error.

⚠️ Important distinction:

- Best LINEAR predictor: Ê[Y | X] = μ_Y + (Cov(X,Y)/Var(X))(X − μ_X), which minimizes MSE among linear functions
- Best predictor: E[Y | X], which minimizes MSE among ALL functions
- For bivariate normal, these coincide
- For other distributions, E[Y | X] may be nonlinear and strictly better than any linear predictor
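A small exact computation can show the gap between the best linear predictor and E[Y | X] when the relationship is nonlinear. The model below is an assumption made up for illustration: X is uniform on {−1, 0, 1}, ε is uniform on {−1, +1} and independent of X, and Y = X² + ε.

```python
from fractions import Fraction as F
from itertools import product

# Illustrative model (an assumption, not from the lesson):
# X uniform on {-1, 0, 1}, eps uniform on {-1, +1}, independent, Y = X^2 + eps.
outcomes = [(F(x), F(x * x + e), F(1, 6)) for x, e in product((-1, 0, 1), (-1, 1))]

def expect(f):
    """E[f(X, Y)] under the joint pmf above."""
    return sum(p * f(x, y) for x, y, p in outcomes)

mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))
var_x = expect(lambda x, y: (x - mu_x) ** 2)

def best_linear(x):
    """Best linear predictor mu_Y + Cov(X, Y)/Var(X) * (x - mu_X)."""
    return mu_y + cov_xy / var_x * (x - mu_x)

def cond_mean(x):
    """E[Y | X = x] computed from the joint pmf."""
    p_x = sum(p for xx, _, p in outcomes if xx == x)
    return sum(p * y for xx, y, p in outcomes if xx == x) / p_x

mse_linear = expect(lambda x, y: (y - best_linear(x)) ** 2)
mse_cond = expect(lambda x, y: (y - cond_mean(x)) ** 2)

# Orthogonality: the error Y - E[Y | X] has zero inner product with h(X) = X^3 + 1.
ortho = expect(lambda x, y: (y - cond_mean(x)) * (x**3 + 1))

print("MSE of best linear predictor:", mse_linear)
print("MSE of E[Y | X]:", mse_cond)
print("E[(Y - E[Y|X]) h(X)]:", ortho)
```

In this model Cov(X, Y) = 0, so the best linear predictor is the constant E[Y], while E[Y | X] = X² captures the curvature and achieves a strictly smaller mean squared error.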



Key Terms

Worked Examples

Example 1: Adam's Law in a Hierarchical Model

X ~ Uniform(0, 1). Given X = x, Y ~ Binomial(3, x). Find E[Y] and Var(Y).

Solution:

E[Y | X] = 3X (mean of Binomial(3, X)). By Adam's law: E[Y] = E[3X] = 3E[X] = 3(0.5) = 1.5.

Var(Y | X) = 3X(1−X) (variance of Binomial(3, X)). By Eve's law: Var(Y) = E[3X(1−X)] + Var(3X).

E[3X(1−X)] = 3(E[X] − E[X²]) = 3(1/2 − 1/3) = 3(1/6) = 0.5. (E[X²] from Uniform(0,1) = 1/3). Var(3X) = 9 Var(X) = 9(1/12) = 0.75.

Var(Y) = 0.5 + 0.75 = 1.25.

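Check: X ~ Uniform(0, 1) is Beta(1, 1), so Y is Beta-Binomial(3, 1, 1). E[Y] = n·E[X] = 3·(1/2) = 1.5 ✓. The Beta-Binomial variance is n·p̄(1 − p̄)(1 + (n − 1)ρ) with p̄ = α/(α + β) = 1/2 and ρ = 1/(α + β + 1) = 1/3, so Var(Y) = 3·(1/2)·(1/2)·(1 + 2/3) = 1.25 ✓, matching the answer from Eve's law.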
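The marginal distribution of Y in Example 1 can also be computed exactly, which gives an independent check on the mean and variance. The sketch below relies on the standard Beta integral ∫₀¹ xᵏ(1 − x)ⁿ⁻ᵏ dx = k!(n − k)!/(n + 1)!; the helper name `marginal` is just illustrative.

```python
from fractions import Fraction as F
from math import comb, factorial

n = 3  # number of trials in Example 1

def marginal(k):
    """P(Y = k) = C(n, k) * integral_0^1 x^k (1 - x)^(n - k) dx
    = C(n, k) * k! (n - k)! / (n + 1)!  (Beta integral)."""
    return F(comb(n, k) * factorial(k) * factorial(n - k), factorial(n + 1))

pmf = {k: marginal(k) for k in range(n + 1)}
mean = sum(k * p for k, p in pmf.items())
var = sum(k * k * p for k, p in pmf.items()) - mean**2

print("pmf:", pmf)
print("E[Y] =", mean, " Var(Y) =", var)
```

Y turns out to be uniform on {0, 1, 2, 3}, so the mean of 3/2 and variance of 5/4 agree with the values obtained from Adam's and Eve's laws.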


Example 2: Conditional Expectation as a Random Variable

Suppose X ~ Exponential(1), and Y | X = x ~ Uniform(0, x). Find: (a) E[Y | X] as a random variable (b) E[Y] using Adam's law (c) E[E[Y | X]²] and Var(E[Y | X])

Solution:

(a) E[Y | X = x] = x/2 (mean of Uniform(0, x)). So E[Y | X] = X/2 (a random variable).

(b) E[Y] = E[E[Y | X]] = E[X/2] = (1/2)E[X] = 1/2. (Since E[X] = 1 for Exp(1)).

(c) E[Y | X] = X/2. E[(E[Y | X])²] = E[(X/2)²] = E[X²]/4. For Exp(1), E[X²] = 2. So = 2/4 = 1/2. Var(E[Y | X]) = E[(E[Y|X])²] − (E[E[Y|X]])² = 1/2 − (1/2)² = 1/4.
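A simulation offers a quick plausibility check of Example 2. This is a rough sketch with an arbitrary seed and sample size; the printed estimates should land near 1/2 for E[Y] and near 1/4 for Var(E[Y | X]), but they will not match exactly.

```python
import random
from statistics import fmean

random.seed(1)   # arbitrary seed for reproducibility
n = 200_000      # arbitrary sample size

xs = [random.expovariate(1.0) for _ in range(n)]   # X ~ Exponential(1)
ys = [random.uniform(0.0, x) for x in xs]          # Y | X = x ~ Uniform(0, x)
cond_means = [x / 2 for x in xs]                   # realized values of E[Y | X] = X/2

mean_y = fmean(ys)                                 # estimates E[Y]
mean_cm = fmean(cond_means)                        # estimates E[E[Y | X]]
var_cm = fmean((c - mean_cm) ** 2 for c in cond_means)   # estimates Var(E[Y | X])

print(f"E[Y]          ~ {mean_y:.3f}")
print(f"E[E[Y | X]]   ~ {mean_cm:.3f}")
print(f"Var(E[Y | X]) ~ {var_cm:.3f}")
```

The list `cond_means` makes the "random variable" view tangible: each simulated X produces its own value of E[Y | X], and those values have a distribution with its own mean and variance.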


Example 3: Bivariate Normal Prediction

In a population, height (H) and weight (W) are jointly normal: H ~ N(170, 100), W ~ N(70, 225), ρ = 0.7.

(a) Given someone is 185 cm tall, what's the expected weight? (b) What proportion of people 185 cm tall weigh over 85 kg?

Solution:

(a) E[W | H = 185] = 70 + 0.7·(15/10)(185−170) = 70 + 0.7·1.5·15 = 70 + 15.75 = 85.75 kg.

(b) W | H = 185 ~ N(85.75, σ²_{W|H}) where σ²_{W|H} = 225(1−0.49) = 225·0.51 = 114.75. σ_{W|H} = √114.75 ≈ 10.71.

P(W > 85 | H = 185) = P(Z > (85−85.75)/10.71) = P(Z > −0.07) = Φ(0.07) ≈ 0.528. About 53%.
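The arithmetic in Example 3 can be reproduced with Python's standard library, using `statistics.NormalDist` for the normal CDF. The variable names below are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_h, sigma_h = 170.0, 10.0   # H ~ N(170, 100), so SD = 10
mu_w, sigma_w = 70.0, 15.0    # W ~ N(70, 225), so SD = 15
rho = 0.7
h = 185.0

# Conditional distribution of W given H = h is normal with these parameters.
cond_mean = mu_w + rho * (sigma_w / sigma_h) * (h - mu_h)
cond_sd = sqrt(sigma_w**2 * (1 - rho**2))

# P(W > 85 | H = 185) = 1 - CDF of the conditional normal at 85.
p_over_85 = 1.0 - NormalDist(cond_mean, cond_sd).cdf(85.0)

print(f"E[W | H = 185]  = {cond_mean:.2f} kg")
print(f"SD(W | H = 185) = {cond_sd:.2f} kg")
print(f"P(W > 85 | H = 185) = {p_over_85:.3f}")
```

Working with the unrounded z-score gives a probability that agrees with the table-based answer of about 0.528 to three decimal places.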

Quiz

Q1: The Law of Total Expectation (Adam's Law) states:

A) E[X] = E[E[X | Y]] B) E[X] = E[X | Y] C) E[X | Y] = E[Y | X] D) E[XY] = E[X]E[Y]

Correct: A)


Q2: The Law of Total Variance (Eve's Law) decomposes Var(X) as:

A) Var(E[X|Y]) + Var(X|Y) B) E[Var(X|Y)] + Var(E[X|Y]) C) E[X|Y] + Var(X|Y) D) Var(X|Y) × E[Var(X|Y)]

Correct: B)


Q3: E[X | Y] as a function of Y is:

A) Always constant B) A random variable that is a function of Y C) Equal to E[X] D) Independent of Y

Correct: B)


Q5: For a bivariate normal distribution, E[X | Y = y] takes what form?

A) A constant independent of y B) A linear function of y: μ_X + ρ(σ_X/σ_Y)(y − μ_Y) C) A quadratic function of y D) E[X] regardless of y

Correct: B)


Practice Problems

  1. X ~ Poisson(λ). Given X = x, Y ~ Binomial(x, p). Find E[Y] and Var(Y) using Adam's and Eve's laws.
  2. f_{X,Y}(x,y) = 2 for 0 < x < y < 1. Find E[Y | X = 0.3] and E[Y | X] as a function of X.
  3. Prove Eve's law: Var(Y) = E[Var(Y | X)] + Var(E[Y | X]).
  4. X ~ Geometric(p) with P(X=k) = (1−p)^{k−1}p. Given X = k, Y is the sum of k i.i.d. Bernoulli(0.5) trials. Find E[Y].
  5. For bivariate normal (X, Y) with μ_X=μ_Y=0, σ_X=σ_Y=1, ρ=0.8, what is E[Y² | X = 1]? (Hint: E[Y²|X] = Var(Y|X) + (E[Y|X])².)
  6. Show that E[(Y − E[Y | X]) h(X)] = 0 for any function h.
  7. A fair coin is flipped N times, where N ~ Poisson(λ). Let Y be the number of heads. Find E[Y] and Var(Y).

Answers

  1. E[Y | X] = pX. E[Y] = E[pX] = pλ. Var(Y | X) = Xp(1−p). Var(Y) = E[Xp(1−p)] + Var(pX) = p(1−p)λ + p²λ = pλ(1−p+p) = pλ.
  2. For 0 < x < y < 1: f_X(x) = ∫_x¹ 2 dy = 2(1−x). f_{Y|X}(y|x) = 2/(2(1−x)) = 1/(1−x) for x < y < 1 (uniform on [x, 1]). E[Y | X = x] = (x+1)/2 (midpoint), so E[Y | X] = (X+1)/2. At x = 0.3: E[Y | X = 0.3] = (0.3+1)/2 = 0.65.
  3. See proof in Core Content section 3.
  4. E[Y | X = k] = k(0.5) = k/2. E[Y] = E[X/2] = (1/2)E[X] = (1/2)(1/p) = 1/(2p).
  5. E[Y | X = 1] = 0 + 0.8(1)(1−0) = 0.8. Var(Y | X) = 1·(1−0.64) = 0.36. E[Y² | X = 1] = 0.36 + 0.64 = 1.0.
  6. E[(Y − E[Y|X])h(X)] = E[E[(Y − E[Y|X])h(X) | X]]. Since h(X) and E[Y|X] are functions of X, they come out of the inner expectation: = E[h(X)(E[Y|X] − E[Y|X])] = E[h(X)·0] = 0.
  7. Y | N ~ Binomial(N, 0.5). E[Y|N] = N/2, E[Y] = E[N/2] = λ/2. Var(Y|N) = N(0.5)(0.5) = N/4. Var(E[Y|N]) = Var(N/2) = λ/4. So Var(Y) = E[N/4] + λ/4 = λ/4 + λ/4 = λ/2.
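The answer to Problem 2 can be cross-checked numerically straight from the joint density, without deriving the conditional PDF by hand. The sketch below uses a simple midpoint rule; the function names and the step count are assumptions for illustration.

```python
def joint_density(x, y):
    """f(x, y) = 2 on the triangle 0 < x < y < 1, and 0 elsewhere."""
    return 2.0 if 0.0 < x < y < 1.0 else 0.0

def cond_mean_y(x, steps=100_000):
    """Approximate E[Y | X = x] = (integral of y f(x, y) dy) / (integral of f(x, y) dy)
    with a midpoint rule in y over (0, 1)."""
    dy = 1.0 / steps
    num = den = 0.0
    for i in range(steps):
        y = (i + 0.5) * dy
        f = joint_density(x, y)
        num += y * f * dy   # accumulates the numerator integral
        den += f * dy       # accumulates f_X(x)
    return num / den

for x in (0.3, 0.5, 0.9):
    print(f"E[Y | X = {x}] ~ {cond_mean_y(x):.4f}  (formula: {(x + 1) / 2:.4f})")
```

The numerical values should track (x + 1)/2 closely, which is consistent with Y | X = x being uniform on [x, 1].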

Summary


Pitfalls


Quiz

  1. E[Y | X] is: a) Always a constant b) A random variable that is a function of X c) Equal to E[Y] d) The same as E[Y | X = 0] Answer: b. E[Y | X] = g(X) where g(x) = E[Y | X = x] — it's a random variable whose randomness comes from X.

  2. Adam's law states: a) E[XY] = E[X]E[Y] b) E[Y] = E[E[Y | X]] c) E[Y] = E[Y | E[X]] d) E[Y | X] = E[Y] Answer: b. E[Y] = E[E[Y | X]] — average the conditional expectations.

  3. In Eve's law, E[Var(Y | X)] represents: a) Total variance of Y b) Between-group variance c) Average within-group variance d) Covariance of X and Y Answer: c. It's the expected value of the conditional variance — the average variability within groups defined by X.

  4. For bivariate normal (X, Y), E[Y | X] is: a) Always constant b) A linear function of X c) A quadratic function of X d) Equal to E[X] Answer: b. E[Y | X = x] = μ_Y + ρ(σ_Y/σ_X)(x − μ_X) is linear in x.

  5. If E[Y | X] = E[Y] (constant), then: a) Y is independent of X b) Cov(X, Y) = 0 c) X and Y are equal d) X has zero mean Answer: b. If E[Y | X] is constant, then by Adam's law it equals E[Y]. Then Var(E[Y|X]) = 0, and Cov(X,Y) = E[XY] − E[X]E[Y] = E[XE[Y|X]] − E[X]E[Y] = E[XE[Y]] − E[X]E[Y] = 0. (But X and Y could still be dependent via higher moments.)

  6. For bivariate normal with ρ = 0.9, Var(Y | X) compared to Var(Y) is: a) Larger b) Equal c) Smaller d) Zero Answer: c. Var(Y | X) = σ_Y²(1−ρ²). With ρ=0.9, (1−ρ²) = 0.19, so conditional variance is 19% of unconditional — much smaller because X explains most of Y's variation.

  7. The best predictor of Y given X (minimizing MSE) is: a) The best linear predictor b) E[Y | X] c) E[X | Y] d) The sample mean of Y Answer: b. E[Y | X] minimizes E[(Y − g(X))²] over all functions g. The linear predictor is best only among linear functions.

  8. If X ~ Uniform(0,1) and Y | X = x ~ Exponential(1/x) (rate 1/x), then E[Y] = ? a) 1/2 b) ∞ c) Does not exist d) 2 Answer: a. Exponential(λ) has mean 1/λ. Here λ = 1/x, so E[Y | X = x] = x and E[Y | X] = X. By Adam's law, E[Y] = E[X] = 1/2.


Next Steps

Continue to 11-04 Transformations of Random Variables to learn the Jacobian method, the CDF method, and how to handle non-monotonic transformations.