Phase 11: Probability Theory II

Subject 11-03: Conditional Expectation

Prerequisites: 10-02 (Conditional Probability), 10-10 (Joint Distributions), 11-01 (Expectation for Continuous RVs)

Learning Objectives

Define E[Y | X = x] for discrete and continuous random variables from conditional distributions
State and apply the Law of Total Expectation (Adam's law): E[Y] = E[E[Y | X]]
State and apply the Law of Total Variance (Eve's law): Var(Y) = E[Var(Y | X)] + Var(E[Y | X])
Compute conditional expectations for bivariate normal distributions
Use conditional expectation as a random variable E[Y | X] and compute its expectation and variance

Core Content

1. Conditional Expectation Definition

For a fixed x with p_X(x) > 0 (discrete case) or f_X(x) > 0 (continuous case), the conditional expectation of Y given X = x is:

Discrete:

$E[Y | X = x] = Σ_y y · p_{Y|X}(y|x)
$

Continuous:

$E[Y | X = x] = ∫_{−∞}^{∞} y · f_{Y|X}(y|x) dy
$

where f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) is the conditional PDF.

⚠️ CRITICAL: E[Y | X = x] is a NUMBER — a function of x. For each specific value x, it gives a specific expected value. But E[Y | X] (without "= x") is a RANDOM VARIABLE — it's the function evaluated at the random X. This distinction is crucial for the laws that follow.

E[Y | X] as a random variable: Let g(x) = E[Y | X = x]. Then E[Y | X] = g(X) is a random variable — its randomness comes from X. Before observing X, we don't know what E[Y | X] will be. After observing X = x, it's the number g(x).
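As a small illustration of the discrete definition, the sketch below computes g(x) = E[Y | X = x] from a joint PMF. The table `joint` and the helper names are hypothetical, chosen only to make the number-versus-random-variable distinction concrete.

```python
from fractions import Fraction

# Hypothetical joint PMF p_{X,Y}(x, y); the values are illustrative only.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 2): Fraction(1, 2),
}

def marginal_x(joint, x):
    """p_X(x) = sum over y of p_{X,Y}(x, y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def cond_expectation(joint, x):
    """g(x) = E[Y | X = x] = sum_y y * p_{X,Y}(x, y) / p_X(x)."""
    px = marginal_x(joint, x)
    if px == 0:
        raise ValueError("E[Y | X = x] is undefined when p_X(x) = 0")
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / px

# For each fixed x, g(x) is a number. The random variable E[Y | X] is g(X):
# it takes the value g[0] with probability p_X(0) and g[1] with probability p_X(1).
g = {x: cond_expectation(joint, x) for x in (0, 1)}
print(g)
```

Exact fractions are used here so the conditional means can be compared with a hand calculation without rounding noise.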

2. Law of Total Expectation (Adam's Law)

$E[Y] = E[E[Y | X]]
$

Discrete proof:

$E[E[Y | X]] = Σ_x E[Y | X = x] p_X(x)
            = Σ_x (Σ_y y p_{Y|X}(y|x)) p_X(x)
            = Σ_y y Σ_x p_{Y|X}(y|x) p_X(x)
            = Σ_y y p_Y(y) = E[Y]
$

Continuous: Same structure with integrals replacing sums.

Why it's powerful: To find E[Y], you can first condition on X (making the problem easier), find the conditional expectation as a function of X, then average over X. This "divide and conquer" strategy often simplifies complex problems.

Example: Find E[Y] where Y | X ~ Binomial(X, 0.5) and X ~ Poisson(λ).

- E[Y | X] = 0.5X (mean of a binomial)
- E[Y] = E[E[Y | X]] = E[0.5X] = 0.5E[X] = 0.5λ

No need to find the marginal distribution of Y!

Adam's law generalizes: E[g(Y)] = E[E[g(Y) | X]] for any function g.
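The Binomial-Poisson example above can be checked numerically. The sketch below compares Adam's law with the longer route through the marginal PMF of Y. The rate `lam = 3.0` and the truncation point `N_MAX` are assumptions made for illustration.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def binom_pmf(y, n, p):
    return math.comb(n, y) * p**y * (1 - p)**(n - y)

lam = 3.0      # assumed rate, chosen only for illustration
N_MAX = 60     # truncation point; the Poisson(3) tail beyond this is negligible

# Route 1 (Adam's law): average g(x) = E[Y | X = x] = 0.5 * x over p_X.
e_y_adam = sum(0.5 * x * poisson_pmf(x, lam) for x in range(N_MAX + 1))

# Route 2 (direct): build the marginal p_Y(y) first, then take its mean.
def marginal_y(y):
    return sum(binom_pmf(y, x, 0.5) * poisson_pmf(x, lam)
               for x in range(y, N_MAX + 1))

e_y_direct = sum(y * marginal_y(y) for y in range(N_MAX + 1))

print(e_y_adam, e_y_direct, 0.5 * lam)
```

Both routes should agree with 0.5λ up to floating-point error, but the first needs only a single sum.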

3. Law of Total Variance (Eve's Law)

$Var(Y) = E[Var(Y | X)] + Var(E[Y | X])
$

Components:

- E[Var(Y | X)]: average of the conditional variances, the "within-group" variability
- Var(E[Y | X]): variance of the conditional means, the "between-group" variability

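Proof: Start from Var(Y) = E[Y²] − (E[Y])². By Adam's law, E[Y²] = E[E[Y² | X]] and (E[Y])² = (E[E[Y | X]])². Since E[Y² | X] = Var(Y | X) + (E[Y | X])², the first term becomes E[Y²] = E[Var(Y | X)] + E[(E[Y | X])²]. Therefore Var(Y) = E[Var(Y | X)] + E[(E[Y | X])²] − (E[E[Y | X]])² = E[Var(Y | X)] + Var(E[Y | X]).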

Intuition: Total variance = average variance within groups + variance of group means. Think ANOVA — total variation decomposes into within-treatment and between-treatment variation.

Example: Y | X ~ N(X, X²) and X ~ N(0, 1).

- E[Y | X] = X, Var(Y | X) = X²
- E[Y] = E[X] = 0
- Var(Y) = E[X²] + Var(X) = 1 + 1 = 2
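A Monte Carlo sketch of this example is given below. It estimates the within-group term, the between-group term, and the total variance separately. The sample size, the seed, and the helper `variance` are arbitrary choices, and the estimates are only approximate.

```python
import random

random.seed(0)          # fixed seed so the run is repeatable
n = 200_000

def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

xs = [random.gauss(0.0, 1.0) for _ in range(n)]
# Given X = x, draw Y ~ N(x, x^2); the standard deviation is |x|.
ys = [random.gauss(x, abs(x)) for x in xs]

within = sum(x * x for x in xs) / n    # estimates E[Var(Y | X)] = E[X^2]
between = variance(xs)                 # estimates Var(E[Y | X]) = Var(X)
total = variance(ys)                   # estimates Var(Y)

print(within, between, total)
```

With a sample this large, `within + between` and `total` should both land near 2, up to sampling error.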

4. Conditional Expectation for Bivariate Normal

For (X, Y) jointly bivariate normal with parameters μ_X, μ_Y, σ_X², σ_Y², ρ:

$E[Y | X = x] = μ_Y + ρ(σ_Y/σ_X)(x − μ_X)
$

Key properties:

1. Linearity: E[Y | X] is a LINEAR function of X, a special property of the bivariate normal
2. Regression interpretation: The slope is ρ(σ_Y/σ_X) and the intercept is μ_Y − ρ(σ_Y/σ_X)μ_X
3. Homoscedasticity: Var(Y | X) = σ_Y²(1 − ρ²) is CONSTANT; it does not depend on x

Implication for prediction: Given X = x, the best prediction of Y (minimizing mean squared error) is E[Y | X = x] = μ_Y + ρ(σ_Y/σ_X)(x − μ_X). For bivariate normal, this is also the best LINEAR predictor, but more generally the conditional expectation may be nonlinear.

Regression to the mean: If |ρ| < 1, then for any x ≠ μ_X, E[Y | X = x] is closer to μ_Y (in standardized units) than x is to μ_X, because the predicted z-score of Y is ρ times the z-score of x. A student scoring 3σ above the mean on test 1 is predicted to score less than 3σ above the mean on test 2. This is NOT a law of nature; it is a consequence of imperfect correlation.
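The conditional mean and variance formulas can be wrapped in a small helper to see regression to the mean in numbers. The function name `conditional_normal` and the test-score parameters (mean 500, SD 100, ρ = 0.6) are hypothetical, not taken from the text.

```python
import math

def conditional_normal(mu_x, mu_y, sigma_x, sigma_y, rho, x):
    """Return (E[Y | X = x], Var(Y | X = x)) for a bivariate normal pair."""
    slope = rho * sigma_y / sigma_x
    mean = mu_y + slope * (x - mu_x)
    var = sigma_y**2 * (1.0 - rho**2)
    return mean, var

# Hypothetical test scores: both tests have mean 500, SD 100, correlation 0.6.
mu, sigma, rho = 500.0, 100.0, 0.6
x = mu + 3 * sigma                       # a score 3 SDs above the mean on test 1
pred, var = conditional_normal(mu, mu, sigma, sigma, rho, x)
z_pred = (pred - mu) / sigma             # predicted test-2 score in SD units
print(pred, z_pred, math.sqrt(var))
```

Under these assumed parameters the predicted test-2 score sits 3ρ = 1.8 standard deviations above the mean, which is less extreme than the observed 3.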

5. Conditional Expectation as a Projection

In L² space (random variables with finite second moments), the conditional expectation E[Y | X] has a geometric interpretation:

E[Y | X] is the orthogonal projection of Y onto the space of all functions of X.

This means:

1. E[Y | X] is a function of X (by definition)
2. For any function h(X), the error Y − E[Y | X] is uncorrelated with h(X): E[(Y − E[Y | X]) h(X)] = 0

This is the property that makes E[Y | X] the "best predictor" of Y given X in the mean-squared-error sense:

$E[Y | X] = argmin_{g(X)} E[(Y − g(X))²]
$

Among all functions of X, the conditional expectation minimizes the expected squared prediction error.

⚠️ Important distinction:

- Best LINEAR predictor: Ê[Y | X] = μ_Y + (Cov(X,Y)/Var(X))(X − μ_X), which minimizes MSE among linear functions
- Best predictor: E[Y | X], which minimizes MSE among ALL functions
- For bivariate normal, these coincide
- For other distributions, E[Y | X] may be nonlinear and strictly better than any linear predictor
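To make the gap between the two predictors concrete, here is an exact calculation on a small discrete model. The model (X uniform on {−1, 0, 1}, Y = X² + ε with ε = ±1 independent of X) is an assumption chosen so that E[Y | X] = X² is clearly nonlinear. The sketch also checks the orthogonality property for a few test functions h.

```python
from fractions import Fraction
from itertools import product

# Hypothetical model: X uniform on {-1, 0, 1}, eps = +/-1 with equal probability,
# independent of X, and Y = X^2 + eps. Then E[Y | X] = X^2.
outcomes = [(Fraction(x), Fraction(x * x + e), Fraction(1, 6))
            for x, e in product((-1, 0, 1), (-1, 1))]

def expect(f):
    """E[f(X, Y)] under the joint distribution above."""
    return sum(p * f(x, y) for x, y, p in outcomes)

mu_x, mu_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))

def best(x):            # conditional expectation E[Y | X = x]
    return x * x

def linear(x):          # best linear predictor mu_Y + (Cov/Var) * (x - mu_X)
    return mu_y + cov_xy / var_x * (x - mu_x)

mse_best = expect(lambda x, y: (y - best(x)) ** 2)
mse_linear = expect(lambda x, y: (y - linear(x)) ** 2)

# Orthogonality: E[(Y - E[Y | X]) h(X)] should vanish for every h.
ortho = [expect(lambda x, y, h=h: (y - best(x)) * h(x))
         for h in (lambda x: x, lambda x: x ** 3 + 1, lambda x: abs(x))]

print(mse_best, mse_linear, ortho)
```

In this model Cov(X, Y) = 0, so the best linear predictor collapses to the constant μ_Y, while the conditional expectation still captures the X² dependence and attains the lower MSE.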


Worked Examples

Example 1: Adam's Law in a Hierarchical Model

X ~ Uniform(0, 1). Given X = x, Y ~ Binomial(3, x). Find E[Y] and Var(Y).

Solution:

E[Y | X] = 3X (mean of Binomial(3, X)). By Adam's law: E[Y] = E[3X] = 3E[X] = 3(0.5) = 1.5.

Var(Y | X) = 3X(1−X) (variance of Binomial(3, X)). By Eve's law: Var(Y) = E[3X(1−X)] + Var(3X).

E[3X(1−X)] = 3(E[X] − E[X²]) = 3(1/2 − 1/3) = 3(1/6) = 0.5. (E[X²] from Uniform(0,1) = 1/3). Var(3X) = 9 Var(X) = 9(1/12) = 0.75.

Var(Y) = 0.5 + 0.75 = 1.25.

Check: Because Uniform(0, 1) is Beta(1, 1), Y is Beta-Binomial(3, 1, 1), which is the discrete uniform distribution on {0, 1, 2, 3}. Directly, E[Y] = (0 + 1 + 2 + 3)/4 = 1.5 ✓ and Var(Y) = E[Y²] − (E[Y])² = 14/4 − 2.25 = 1.25 ✓, matching the result from Eve's law.
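One way to double-check Example 1 without Eve's law is to integrate X out exactly and work with the marginal PMF of Y. The sketch below does this using the Beta integral ∫₀¹ x^k (1 − x)^(n − k) dx = k!(n − k)!/(n + 1)!. The helper name `pmf_y` is illustrative.

```python
from fractions import Fraction
from math import comb, factorial

n = 3

def pmf_y(k):
    """P(Y = k) = C(n, k) * integral_0^1 x^k (1 - x)^(n - k) dx.

    The integral is the Beta function B(k + 1, n - k + 1) = k! (n - k)! / (n + 1)!.
    """
    return Fraction(comb(n, k) * factorial(k) * factorial(n - k), factorial(n + 1))

pmf = {k: pmf_y(k) for k in range(n + 1)}
mean = sum(k * p for k, p in pmf.items())
var = sum((k - mean) ** 2 * p for k, p in pmf.items())
print(pmf, mean, var)
```

The resulting mean and variance can be compared directly with the values 1.5 and 1.25 obtained from Adam's and Eve's laws.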

Example 2: Conditional Expectation as a Random Variable

Suppose X ~ Exponential(1), and Y | X = x ~ Uniform(0, x). Find:

(a) E[Y | X] as a random variable
(b) E[Y] using Adam's law
(c) E[(E[Y | X])²] and Var(E[Y | X])

Solution:

(a) E[Y | X = x] = x/2 (mean of Uniform(0, x)). So E[Y | X] = X/2 (a random variable).

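(b) E[Y] = E[E[Y | X]] = E[X/2] = (1/2)E[X] = 1/2, since E[X] = 1 for Exponential(1).

(c) E[(E[Y | X])²] = E[X²/4] = (1/4)E[X²] = (1/4)(2) = 1/2, since E[X²] = Var(X) + (E[X])² = 1 + 1 = 2 for Exponential(1). Then Var(E[Y | X]) = 1/2 − (1/2)² = 1/4, which agrees with Var(X/2) = Var(X)/4.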
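Example 2 can also be explored by simulation. The sketch below draws the hierarchy X, then Y given X, and treats X/2 as realizations of the random variable E[Y | X]. The seed and sample size are arbitrary, and the printed values are estimates rather than exact results.

```python
import random

random.seed(1)
n = 200_000

xs = [random.expovariate(1.0) for _ in range(n)]   # X ~ Exponential(rate 1)
ys = [random.uniform(0.0, x) for x in xs]          # Y | X = x ~ Uniform(0, x)
g = [x / 2 for x in xs]                            # realizations of E[Y | X] = X/2

def mean(values):
    return sum(values) / len(values)

mean_y = mean(ys)                                  # estimates E[Y]
mean_g = mean(g)                                   # estimates E[E[Y | X]]
var_g = mean([(v - mean_g) ** 2 for v in g])       # estimates Var(E[Y | X])
print(mean_y, mean_g, var_g)
```

The two sample means should be close to each other (Adam's law) and near 1/2, and `var_g` should be near Var(X/2) = 1/4.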

Example 3: Bivariate Normal Prediction

In a population, height (H) and weight (W) are jointly normal: H ~ N(170, 100), W ~ N(70, 225), ρ = 0.7.

(a) Given someone is 185 cm tall, what's the expected weight? (b) What proportion of people 185 cm tall weigh over 85 kg?

Solution:

(a) E[W | H = 185] = 70 + 0.7·(15/10)(185−170) = 70 + 0.7·1.5·15 = 70 + 15.75 = 85.75 kg.

(b) W | H = 185 ~ N(85.75, σ²_{W|H}) where σ²_{W|H} = 225(1−0.49) = 225·0.51 = 114.75. σ_{W|H} = √114.75 ≈ 10.71.

P(W > 85 | H = 185) = P(Z > (85−85.75)/10.71) = P(Z > −0.07) = Φ(0.07) ≈ 0.528. About 53%.
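The arithmetic in Example 3 can be reproduced with Python's standard `statistics.NormalDist`. This is only a numerical check of the steps above; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_h, mu_w = 170.0, 70.0
sd_h, sd_w = sqrt(100.0), sqrt(225.0)   # N(mean, variance) notation: SDs are 10 and 15
rho = 0.7
h = 185.0

cond_mean = mu_w + rho * (sd_w / sd_h) * (h - mu_h)   # E[W | H = 185]
cond_sd = sd_w * sqrt(1.0 - rho**2)                   # sqrt of Var(W | H)

# P(W > 85 | H = 185) from the conditional normal distribution.
prob_over_85 = 1.0 - NormalDist(cond_mean, cond_sd).cdf(85.0)
print(cond_mean, cond_sd, prob_over_85)
```

Working with the unrounded z-score gives a probability that agrees with the table-based answer of about 53%.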

Quiz

Q1: The Law of Total Expectation (Adam's Law) states:

A) E[X] = E[E[X | Y]]
B) E[X] = E[X | Y]
C) E[X | Y] = E[Y | X]
D) E[XY] = E[X]E[Y]

Correct: A)

If you chose A: Correct! E[X] = E[E[X|Y]] — the expected value of the conditional expectation equals the unconditional expectation. This is a powerful tool for computing expectations by conditioning.
If you chose B: E[X|Y] is a random variable (function of Y), while E[X] is a constant. They're not equal in general.
If you chose C: This is generally false — it would require symmetry not present in most joint distributions.
If you chose D: This holds only when X and Y are uncorrelated; Adam's Law is much more general.

Q2: The Law of Total Variance (Eve's Law) decomposes Var(X) as:

Correct: B)

If you chose B: Correct! Var(X) = E[Var(X|Y)] + Var(E[X|Y]). "Within-group variance" plus "between-group variance."
If you chose A: The first term should use expectation, and Var(X|Y) is a random variable.
If you chose C: This mixes expectations and variances incorrectly.
If you chose D: This is not a valid decomposition of variance.

Q3: E[X | Y] as a function of Y is:

A) Always constant
B) A random variable that is a function of Y
C) Equal to E[X]
D) Independent of Y

Correct: B)

If you chose B: Correct! E[X|Y] is a random variable — its value depends on the observed Y. When Y = y, it equals the conditional mean of X given that value.
If you chose A: It is constant only in special cases, for example when X and Y are independent.
If you chose C: The unconditional expectation E[X] equals the expectation OF E[X|Y], not E[X|Y] itself.
If you chose D: By definition, E[X|Y] is a function of Y, so it depends on Y.

Q5: For a bivariate normal distribution, E[X | Y = y] takes what form?

A) A constant independent of y
B) A linear function of y: μ_X + ρ(σ_X/σ_Y)(y − μ_Y)
C) A quadratic function of y
D) E[X] regardless of y

Correct: B)

If you chose B: Correct! For bivariate normal, the conditional expectation is linear in y. The slope is ρ·σ_X/σ_Y, reflecting the regression-to-the-mean effect.
If you chose A: Only if ρ = 0 (uncorrelated, hence independent for normal).
If you chose C: The conditional expectation is linear for bivariate normal, not quadratic.
If you chose D: This ignores the information Y provides — the conditional mean updates with Y.

Practice Problems

X ~ Poisson(λ). Given X = x, Y ~ Binomial(x, p). Find E[Y] and Var(Y) using Adam's and Eve's laws.
f_{X,Y}(x,y) = 2 for 0 < x < y < 1. Find E[Y | X = 0.3] and E[Y | X] as a function of X.
Prove Eve's law: Var(Y) = E[Var(Y | X)] + Var(E[Y | X]).
X ~ Geometric(p) with P(X=k) = (1−p)^{k−1}p. Given X = k, Y is the sum of k i.i.d. Bernoulli(0.5) trials. Find E[Y].
For bivariate normal (X, Y) with μ_X=μ_Y=0, σ_X=σ_Y=1, ρ=0.8, what is E[Y² | X = 1]? (Hint: E[Y²|X] = Var(Y|X) + (E[Y|X])².)
Show that E[(Y − E[Y | X]) h(X)] = 0 for any function h.
A fair coin is flipped N times, where N ~ Poisson(λ). Let Y be the number of heads. Find E[Y] and Var(Y).

Answers

1. E[Y | X] = pX. E[Y] = E[pX] = pλ. Var(Y | X) = Xp(1−p). Var(Y) = E[Xp(1−p)] + Var(pX) = p(1−p)λ + p²λ = pλ(1−p+p) = pλ.
2. For 0 < x < y < 1: f_X(x) = ∫_x¹ 2 dy = 2(1−x). f_{Y|X}(y|x) = 2/(2(1−x)) = 1/(1−x) for x < y < 1 (uniform on [x, 1]). E[Y | X = x] = (x+1)/2 (midpoint), so E[Y | X] = (X+1)/2. At x = 0.3: E[Y | X = 0.3] = (0.3+1)/2 = 0.65.
3. See proof in Core Content section 3.
4. E[Y | X = k] = k(0.5) = k/2. E[Y] = E[X/2] = (1/2)E[X] = (1/2)(1/p) = 1/(2p).
5. E[Y | X = 1] = 0 + 0.8(1)(1−0) = 0.8. Var(Y | X) = 1·(1−0.64) = 0.36. E[Y² | X = 1] = 0.36 + 0.64 = 1.0.
6. E[(Y − E[Y | X])h(X)] = E[E[(Y − E[Y | X])h(X) | X]]. Since h(X) and E[Y | X] are functions of X, they come out of the inner expectation: = E[h(X)(E[Y | X] − E[Y | X])] = E[h(X)·0] = 0.
7. Y | N ~ Binomial(N, 0.5). E[Y | N] = N/2, E[Y] = E[N/2] = λ/2. Var(Y | N) = N(0.5)(0.5) = N/4. Var(E[Y | N]) = Var(N/2) = λ/4. So Var(Y) = E[N/4] + λ/4 = λ/4 + λ/4 = λ/2.
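The answer to Problem 2 can be checked numerically straight from the joint density, without deriving the conditional PDF by hand. The sketch below approximates both integrals in E[Y | X = x] = ∫ y f(x, y) dy / ∫ f(x, y) dy with a midpoint sum. The function names and the grid size are illustrative choices.

```python
def f(x, y):
    """Joint density from Problem 2: 2 on the triangle 0 < x < y < 1, else 0."""
    return 2.0 if 0.0 < x < y < 1.0 else 0.0

def cond_mean_y(x, steps=20_000):
    """Approximate E[Y | X = x] by midpoint sums over y in (0, 1)."""
    width = 1.0 / steps
    ys = [(i + 0.5) * width for i in range(steps)]
    f_x = sum(f(x, y) for y in ys) * width           # approximates f_X(x)
    numer = sum(y * f(x, y) for y in ys) * width     # approximates integral of y f(x, y) dy
    return numer / f_x

print(cond_mean_y(0.3), (0.3 + 1) / 2)
```

The same two-integral pattern works for any joint density passed in as `f`, although the midpoint sum is only approximate when x does not fall on a grid boundary.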

Summary

E[Y | X = x] is computed from the conditional distribution; E[Y | X] is a random variable (a function of X)
Adam's law (total expectation): E[Y] = E[E[Y | X]] — average the conditional expectations to get the unconditional expectation; this is a divide-and-conquer strategy
Eve's law (total variance): Var(Y) = E[Var(Y | X)] + Var(E[Y | X]) — total variance = within-group + between-group variance
For bivariate normal, E[Y | X] is linear in X with slope ρ(σ_Y/σ_X) and constant conditional variance σ_Y²(1−ρ²)
E[Y | X] is the orthogonal projection of Y onto functions of X in L² — it is the best predictor in the mean-squared-error sense (beating any linear predictor when the relationship is nonlinear)

Pitfalls

Confusing E[Y | X = x] (a number) with E[Y | X] (a random variable). E[Y | X = 3] returns a specific value. E[Y | X] returns a function of X — its value is unknown until X is observed. This distinction matters when computing variances: Var(E[Y | X]) is a number, while Var(Y | X) is a random variable.
Misapplying Adam's law by forgetting the outer expectation is over X. E[Y] = E[E[Y | X]] means compute g(x) = E[Y | X = x], then average g(X) using the distribution of X: E[Y] = Σ g(x) p_X(x). Stopping after finding E[Y | X = x] and reporting it as E[Y] is incorrect — that's the conditional expectation at a specific x, not the overall mean.
Assuming E[Y | X] is always a linear function of X. Linearity of E[Y | X] is guaranteed for the bivariate normal distribution, but not in general. Some non-normal models happen to be linear: if f(x,y) = 2 for 0 < x < y < 1, then E[Y | X = x] = (x+1)/2. Others are constant: if f(x,y) = 6xy² for 0 < x,y < 1, then X and Y are independent and E[Y | X] = E[Y] = 3/4. And some are nonlinear: if Y = X² + ε with E[ε | X] = 0, then E[Y | X] = X². The conditional expectation can take any functional form.
Thinking E[g(Y) | X] = g(E[Y | X]). This holds only when g is linear. In general, E[Y² | X] ≠ (E[Y | X])² — the gap is Var(Y | X). Similarly, E[e^Y | X] ≠ e^{E[Y | X]}.
Forgetting that the best predictor in the MSE sense is E[Y | X], not the linear regression line. The linear predictor β₀ + β₁X minimizes MSE only among LINEAR functions. If the true relationship is nonlinear (e.g., Y = X² + ε), the conditional expectation E[Y | X] = X² achieves lower MSE than any straight line. The linear predictor is an approximation, not the truth.

Quiz

E[Y | X] is: a) Always a constant b) A random variable that is a function of X c) Equal to E[Y] d) The same as E[Y | X = 0] Answer: b. E[Y | X] = g(X) where g(x) = E[Y | X = x] — it's a random variable whose randomness comes from X.
Adam's law states: a) E[XY] = E[X]E[Y] b) E[Y] = E[E[Y | X]] c) E[Y] = E[Y | E[X]] d) E[Y | X] = E[Y] Answer: b. E[Y] = E[E[Y | X]] — average the conditional expectations.
In Eve's law, E[Var(Y | X)] represents: a) Total variance of Y b) Between-group variance c) Average within-group variance d) Covariance of X and Y Answer: c. It's the expected value of the conditional variance — the average variability within groups defined by X.
For bivariate normal (X, Y), E[Y | X] is: a) Always constant b) A linear function of X c) A quadratic function of X d) Equal to E[X] Answer: b. E[Y | X = x] = μ_Y + ρ(σ_Y/σ_X)(x − μ_X) is linear in x.
If E[Y | X] = E[Y] (constant), then: a) Y is independent of X b) Cov(X, Y) = 0 c) X and Y are equal d) X has zero mean Answer: b. If E[Y | X] is constant, then by Adam's law it equals E[Y]. Then Var(E[Y|X]) = 0, and Cov(X,Y) = E[XY] − E[X]E[Y] = E[XE[Y|X]] − E[X]E[Y] = E[XE[Y]] − E[X]E[Y] = 0. (But X and Y could still be dependent via higher moments.)
For bivariate normal with ρ = 0.9, Var(Y | X) compared to Var(Y) is: a) Larger b) Equal c) Smaller d) Zero Answer: c. Var(Y | X) = σ_Y²(1−ρ²). With ρ=0.9, (1−ρ²) = 0.19, so conditional variance is 19% of unconditional — much smaller because X explains most of Y's variation.
The best predictor of Y given X (minimizing MSE) is: a) The best linear predictor b) E[Y | X] c) E[X | Y] d) The sample mean of Y Answer: b. E[Y | X] minimizes E[(Y − g(X))²] over all functions g. The linear predictor is best only among linear functions.
If X ~ Uniform(0,1) and Y | X = x ~ Exponential(1/x), then E[Y] = ? a) 1/2 b) ∞ c) Does not exist d) 2 Answer: a. Exponential(λ) with rate λ has mean 1/λ. Here λ = 1/x, so E[Y | X = x] = x and E[Y | X] = X. By Adam's law, E[Y] = E[X] = 1/2.

Next Steps

Continue to 11-04 Transformations of Random Variables to learn the Jacobian method, the CDF method, and how to handle non-monotonic transformations.
