Phase 10: Probability Theory
Subject 10-06: Expectation of Discrete Random Variables
Prerequisites: 10-04 (Discrete Random Variables), basic summation
Learning Objectives
- Define expected value E[X] for discrete random variables and compute it from the PMF
- Apply linearity of expectation E[aX + bY] = aE[X] + bE[Y] without independence requirements
- Use the Law of the Unconscious Statistician (LOTUS) to compute E[g(X)] directly from the PMF
- Define variance Var(X) = E[(X − μ)²], standard deviation, and their properties under linear transformations
- Define covariance Cov(X, Y) and correlation, and explain how they measure linear dependence
Core Content
1. Definition of Expected Value
For a discrete random variable X with PMF p_X(x), the expected value (mean) is:
$E[X] = Σ_x x · p_X(x) $
where the sum is over all values x in the support of X. For the sum to be well-defined, we require absolute convergence Σ |x| p_X(x) < ∞.
Interpretation: E[X] is the probability-weighted average of all possible values — the "long-run average" if the experiment were repeated infinitely many times.
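The definition translates almost directly into code. The following is a minimal sketch, assuming the PMF is stored as a dictionary from values to probabilities; the helper name `expected_value` and the example PMF are illustrative choices, not part of the lesson.

```python
from fractions import Fraction

def expected_value(pmf):
    """Return E[X] = sum of x * p(x) for a finite PMF given as {value: probability}."""
    total = sum(pmf.values())
    if total != 1:
        raise ValueError(f"probabilities sum to {total}, expected 1")
    return sum(x * p for x, p in pmf.items())

# Hypothetical PMF: X takes the values 0, 1, 2 with probabilities 1/4, 1/2, 1/4.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
mean = expected_value(pmf)
print(mean)  # 1
```

Exact fractions are used here so the check that the probabilities sum to 1 is not disturbed by floating-point rounding.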
Expected values of common discrete distributions:
| Distribution | E[X] |
|---|---|
| Bernoulli(p) | p |
| Binomial(n, p) | np |
| Geometric(p) | 1/p |
| NegBin(r, p) | r/p |
| Poisson(λ) | λ |
| Hypergeometric(N, K, n) | n·(K/N) |
Derivation for Binomial: Let X ~ Binomial(n, p). Using the definition:
E[X] = Σ_{k=0}^{n} k · C(n,k) pᵏ (1−p)^{n−k}
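Using the identity k·C(n,k) = n·C(n−1, k−1) (the k = 0 term contributes nothing, so the sum can start at k = 1), then factoring out np and substituting j = k − 1:
$E[X] = Σ_{k=1}^{n} n·C(n−1, k−1) pᵏ (1−p)^{n−k}$
$= np Σ_{k=1}^{n} C(n−1, k−1) p^{k−1} (1−p)^{(n−1)−(k−1)}$
$= np Σ_{j=0}^{n−1} C(n−1, j) pʲ (1−p)^{(n−1)−j} = np · 1 = np$
The final sum equals 1 because it adds up the entire Binomial(n−1, p) PMF.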
Derivation for Geometric:
$E[X] = Σ_{k=1}^{∞} k (1−p)^{k−1} p = p · (1/p²) = 1/p$
using the identity Σ_{k=1}^{∞} k q^{k−1} = 1/(1−q)² for |q| < 1 with q = 1−p.
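Both closed forms can be checked numerically by summing the PMF directly. This is a rough sketch with illustrative parameters (n = 12, p = 0.3); the function names are made up for this example, and the geometric series is truncated rather than summed to infinity.

```python
from math import comb, isclose

def binomial_expectation(n, p):
    """E[X] for Binomial(n, p), summed term by term from the PMF."""
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

def geometric_expectation_partial(p, terms):
    """Partial sum of k * (1-p)^(k-1) * p for k = 1..terms (a truncated series)."""
    return sum(k * (1 - p)**(k - 1) * p for k in range(1, terms + 1))

n, p = 12, 0.3  # illustrative parameters
binomial_result = binomial_expectation(n, p)
geometric_result = geometric_expectation_partial(p, 500)

# Compare against the closed forms np and 1/p.
print(isclose(binomial_result, n * p))    # True
print(isclose(geometric_result, 1 / p))   # True
```

With p = 0.3 the terms beyond k = 500 are far below floating-point precision, so the truncated sum matches 1/p to within rounding error. A smaller p would need more terms.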
2. Linearity of Expectation
Theorem: For any random variables X, Y and constants a, b:
$E[aX + bY] = aE[X] + bE[Y] $
Crucially, this holds even when X and Y are DEPENDENT. This is one of the most powerful tools in probability.
Proof sketch:
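$E[aX + bY] = Σ_x Σ_y (a x + b y) P(X=x, Y=y)$
$= a Σ_x x Σ_y P(X=x, Y=y) + b Σ_y y Σ_x P(X=x, Y=y)$
$= a Σ_x x P(X=x) + b Σ_y y P(Y=y)$
$= a E[X] + b E[Y]$
The third line uses the marginals Σ_y P(X=x, Y=y) = P(X=x) and Σ_x P(X=x, Y=y) = P(Y=y). No step assumes that the joint PMF factors, which is why independence is not needed.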
Corollary: E[Σ Xᵢ] = Σ E[Xᵢ] for any finite collection. This extends to countably infinite collections provided Σ E[|Xᵢ|] < ∞.
Example — Indicator variables: Let I_A be the indicator of event A: I_A = 1 if A occurs, 0 otherwise. Then E[I_A] = 1·P(A) + 0·P(Aᶜ) = P(A). This simple fact combined with linearity is extremely powerful.
Application to Binomial mean (alternative derivation): X ~ Binomial(n, p) is the sum of n independent Bernoulli(p) random variables. By linearity: E[X] = Σ_{i=1}^{n} E[I_i] = Σ p = np. Much simpler than the direct sum!
Application to Hypergeometric mean: A sample of n items drawn without replacement from N items with K successes. Even though draws are dependent, by indicator variables and symmetry: each of the n draws has probability K/N of being a success. So E[X] = n·K/N.
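The hypergeometric argument can be verified by brute-force enumeration. The sketch below assumes a small hypothetical urn (N = 5, K = 2, n = 3) and lists every equally likely ordered draw, so no independence assumption enters anywhere.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical urn: N = 5 items, K = 2 successes (marked 1), n = 3 draws without replacement.
urn = [1, 1, 0, 0, 0]
n = 3

# Every ordered selection of n distinct item positions is equally likely.
draws = list(permutations(range(len(urn)), n))

# E[I_i]: probability that the i-th draw is a success.
indicator_means = [
    Fraction(sum(urn[d[i]] for d in draws), len(draws)) for i in range(n)
]

# E[X]: average number of successes over all ordered draws.
mean_successes = Fraction(sum(sum(urn[j] for j in d) for d in draws), len(draws))

print(indicator_means)  # each entry equals 2/5 = K/N
print(mean_successes)   # 6/5 = n*K/N
```

Each draw has success probability K/N even though the draws are dependent, and the indicator means add up to the mean of the total.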
3. Law of the Unconscious Statistician (LOTUS)
For any function g(·):
$E[g(X)] = Σ_x g(x) · p_X(x) $
"Unconscious" because you don't need to find the distribution of Y = g(X) — just plug X's PMF into g.
Example: E[X²] = Σ_x x² p_X(x).
Warning: E[g(X)] ≠ g(E[X]) in general (Jensen's inequality). For convex g, E[g(X)] ≥ g(E[X]).
Edge case: If X takes infinitely many values, the sum must converge absolutely for E[g(X)] to exist.
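To see what LOTUS saves, the sketch below computes E[g(X)] two ways for a hypothetical X uniform on {−2, …, 2} with g(x) = x²: once with LOTUS, and once by first building the PMF of Y = g(X). The PMF and variable names are illustrative assumptions.

```python
from collections import defaultdict
from fractions import Fraction

def g(x):
    return x * x  # not one-to-one: x and -x map to the same value

# Hypothetical PMF: X uniform on {-2, -1, 0, 1, 2}.
pmf_x = {x: Fraction(1, 5) for x in range(-2, 3)}

# LOTUS: weight g(x) by the PMF of X.
lotus = sum(g(x) * p for x, p in pmf_x.items())

# The long way: build the PMF of Y = g(X), then apply the definition of E[Y].
pmf_y = defaultdict(Fraction)
for x, p in pmf_x.items():
    pmf_y[g(x)] += p
direct = sum(y * p for y, p in pmf_y.items())

# Plugging the mean into g gives a different number.
mean_x = sum(x * p for x, p in pmf_x.items())

print(lotus, direct, g(mean_x))  # 2 2 0
```

The two expectations agree, while g(E[X]) = 0 differs from E[g(X)] = 2, consistent with Jensen's inequality for the convex function x².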
4. Variance and Standard Deviation
Definition: Let μ = E[X]. The variance of X is:
$Var(X) = E[(X − μ)²] = Σ_x (x − μ)² p_X(x) $
Alternative computational formula:
$Var(X) = E[X²] − (E[X])² $
Proof: Var(X) = E[(X−μ)²] = E[X² − 2μX + μ²] = E[X²] − 2μE[X] + μ² = E[X²] − 2μ² + μ² = E[X²] − μ².
Standard deviation: σ_X = √Var(X) — same units as X.
Variances of common distributions:
| Distribution | Var(X) |
|---|---|
| Bernoulli(p) | p(1−p) |
| Binomial(n, p) | np(1−p) |
| Geometric(p) | (1−p)/p² |
| Poisson(λ) | λ |
| NegBin(r, p) | r(1−p)/p² |
Properties of variance:
- Var(X) ≥ 0 (variance is non-negative)
- Var(aX + b) = a² Var(X): adding a constant doesn't change variance
- Var(X) = 0 if and only if P(X = c) = 1 for some constant c (degenerate distribution)
- For any constants a, b: Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y)
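The scaling property can be spot-checked on a small example. This is a sketch only: the helpers `mean`, `variance`, and `affine` are hypothetical names, and the PMF and constants are arbitrary.

```python
from fractions import Fraction

def mean(pmf):
    """E[X] for a finite PMF {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2."""
    return sum(x * x * p for x, p in pmf.items()) - mean(pmf) ** 2

def affine(pmf, a, b):
    """PMF of aX + b (assumes a != 0, so distinct values stay distinct)."""
    return {a * x + b: p for x, p in pmf.items()}

pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
a, b = -3, 7  # arbitrary constants

var_x = variance(pmf)
var_transformed = variance(affine(pmf, a, b))
print(var_x, var_transformed)  # 1/2 9/2
```

The shift b = 7 has no effect, and the factor a = −3 multiplies the variance by a² = 9.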
5. Covariance and Correlation
Covariance:
$Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y] $
Properties:
- Cov(X, X) = Var(X)
- Cov(X, Y) = Cov(Y, X) (symmetric)
- Cov(aX + b, cY + d) = ac·Cov(X, Y)
- Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z) (bilinear)
Independence ⇒ zero covariance (but not conversely!): If X and Y are independent, then E[XY] = E[X]E[Y], so Cov(X, Y) = 0. Zero covariance does NOT imply independence — it only means no linear relationship.
Correlation coefficient:
$ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y) $
Whenever σ_X and σ_Y are both positive, ρ satisfies −1 ≤ ρ ≤ 1 (by Cauchy-Schwarz). ρ = ±1 iff Y = aX + b almost surely for some a ≠ 0 (perfect linear relationship), with the sign of ρ matching the sign of a.
Variance of sum:
$Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y) $
For independent X, Y: Var(X + Y) = Var(X) + Var(Y).
General sum:
$Var(Σ Xᵢ) = Σ Var(Xᵢ) + 2 Σ_{i<j} Cov(Xᵢ, Xⱼ)$
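The variance-of-a-sum formula can be checked on a dependent pair, where the covariance term matters. The joint PMF below is a made-up example, and `expect` is a hypothetical helper that applies LOTUS in two variables.

```python
from fractions import Fraction

# Hypothetical joint PMF of a dependent pair, stored as {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 2),
    (1, 0): Fraction(1, 4),
    (1, 2): Fraction(1, 4),
}

def expect(f):
    """E[f(X, Y)], summing f over the joint PMF."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - ex) ** 2)
var_y = expect(lambda x, y: (y - ey) ** 2)
cov = expect(lambda x, y: (x - ex) * (y - ey))

# Variance of X + Y computed directly from its definition.
var_sum = expect(lambda x, y: (x + y - ex - ey) ** 2)

print(var_x, var_y, cov, var_sum)  # 1/4 3/4 1/4 3/2
```

Here Var(X) + Var(Y) = 1, which understates Var(X + Y) = 3/2 by exactly 2·Cov(X, Y) = 1/2.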
Key Terms
- expected value
- linearity of expectation
- indicator variable
- Law of the Unconscious Statistician (LOTUS)
- variance
- standard deviation
- covariance
- correlation coefficient
Worked Examples
Example 1: Expected Value of Geometric Distribution
Let X ~ Geometric(p). Compute E[X] directly from the definition.
Solution:
$E[X] = Σ_{k=1}^{∞} k (1−p)^{k−1} p = p Σ_{k=1}^{∞} k q^{k−1}$, where q = 1−p.
Recall: Σ_{k=1}^{∞} k q^{k−1} = d/dq (Σ_{k=0}^{∞} qᵏ) = d/dq (1/(1−q)) = 1/(1−q)² = 1/p².
Therefore E[X] = p · (1/p²) = 1/p.
For p=0.5 (fair coin, waiting for first heads), E[X] = 2 flips on average. ✓
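The "long-run average" reading of this result can be illustrated with a simulation. This is a sketch assuming Python's standard `random` module with a fixed seed; the function name and the number of trials are arbitrary choices, and the printed average will only be close to, not exactly, 1/p.

```python
import random

def flips_until_heads(rng, p=0.5):
    """Count Bernoulli(p) trials up to and including the first success."""
    count = 1
    while rng.random() >= p:  # rng.random() < p counts as a success
        count += 1
    return count

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 100_000
average = sum(flips_until_heads(rng) for _ in range(trials)) / trials
print(average)  # close to 1/p = 2
```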
Example 2: Variance via LOTUS
Let X be the outcome of a fair die roll. Find Var(X).
Solution:
PMF: p(k) = 1/6 for k = 1, ..., 6.
E[X] = (1+2+3+4+5+6)/6 = 21/6 = 3.5.
E[X²] = (1+4+9+16+25+36)/6 = 91/6 ≈ 15.1667.
Var(X) = E[X²] − (E[X])² = 91/6 − (7/2)² = 91/6 − 49/4 = (182 − 147)/12 = 35/12 ≈ 2.917.
σ = √(35/12) ≈ 1.708.
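The arithmetic above can be reproduced with exact fractions. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from fractions import Fraction
from math import sqrt

# PMF of a fair die: each face 1..6 has probability 1/6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

mean = sum(k * p for k, p in pmf.items())
second_moment = sum(k**2 * p for k, p in pmf.items())  # E[X^2] via LOTUS
var = second_moment - mean**2
sd = sqrt(var)

print(mean, second_moment, var, round(sd, 3))  # 7/2 91/6 35/12 1.708
```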
Example 3: Covariance and Correlation
Joint PMF of (X, Y):
| X\Y | 0 | 1 |
|---|---|---|
| 0 | 0.2 | 0.1 |
| 1 | 0.3 | 0.4 |
Find Cov(X, Y) and ρ(X, Y).
Solution:
Marginals: P(X=0) = 0.3, P(X=1) = 0.7; P(Y=0) = 0.5, P(Y=1) = 0.5.
E[X] = 0·0.3 + 1·0.7 = 0.7
E[Y] = 0·0.5 + 1·0.5 = 0.5
E[XY] = 0·0·0.2 + 0·1·0.1 + 1·0·0.3 + 1·1·0.4 = 0.4
Cov(X, Y) = E[XY] − E[X]E[Y] = 0.4 − (0.7)(0.5) = 0.4 − 0.35 = 0.05
E[X²] = 0²·0.3 + 1²·0.7 = 0.7, so Var(X) = E[X²] − (E[X])² = 0.7 − 0.49 = 0.21
E[Y²] = 0²·0.5 + 1²·0.5 = 0.5, so Var(Y) = E[Y²] − (E[Y])² = 0.5 − 0.25 = 0.25
ρ = 0.05 / √(0.21 · 0.25) = 0.05 / √0.0525 = 0.05 / 0.229 ≈ 0.218
Weak positive linear relationship: ρ ≈ 0.22 is much closer to 0 than to 1.
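The same computation can be run straight from the joint table. The sketch below stores the table as a dictionary keyed by (x, y); it uses floating-point numbers, so results are rounded before printing.

```python
from math import sqrt

# Joint PMF from the table in this example, keyed by (x, y).
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.4}

ex = sum(x * p for (x, y), p in joint.items())
ey = sum(y * p for (x, y), p in joint.items())
exy = sum(x * y * p for (x, y), p in joint.items())
var_x = sum(x * x * p for (x, y), p in joint.items()) - ex**2
var_y = sum(y * y * p for (x, y), p in joint.items()) - ey**2

cov = exy - ex * ey
rho = cov / sqrt(var_x * var_y)
print(round(cov, 4), round(rho, 3))  # 0.05 0.218
```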
Quiz
Q1: Linearity of expectation E[aX + bY] = aE[X] + bE[Y] holds:
A) Only when X and Y are independent
B) Only when X and Y are identically distributed
C) For any random variables X and Y, regardless of dependence
D) Only for discrete random variables
Correct: C)
- If you chose C: Correct! Linearity of expectation is one of the most powerful tools in probability precisely because it requires NO assumptions about dependence or distribution.
- If you chose A: This is a common misconception. Independence is NOT required for linearity of expectation.
- If you chose B: The distributions can be completely different — the identity still holds.
- If you chose D: Linearity holds for both discrete and continuous random variables.
Q2: E[X] for a Bernoulli(p) random variable equals:
A) p(1−p)
B) p
C) 1/p
D) √(p(1−p))
Correct: B)
- If you chose B: Correct! X takes value 1 with probability p and 0 with probability 1−p. E[X] = 1·p + 0·(1−p) = p.
- If you chose A: This is the variance Var(X) = p(1−p), not the mean.
- If you chose C: This is E[X] for a Geometric(p) distribution, not Bernoulli.
- If you chose D: This is the standard deviation √Var(X), not the mean.
Q3: The Law of the Unconscious Statistician (LOTUS) states that E[g(X)] equals:
A) g(E[X])
B) Σ g(x) p_X(x) for discrete X
C) g(Σ x p_X(x))
D) E[X] · E[g(X)]
Correct: B)
- If you chose B: Correct! LOTUS says E[g(X)] = Σ g(x) p_X(x) — you apply g to each value and weight by its probability. No need to find the distribution of g(X) first.
- If you chose A: This would be true ONLY if g is linear (g(x) = ax + b). In general, E[g(X)] ≠ g(E[X]), as Jensen's inequality shows.
- If you chose C: Σ x p_X(x) is just E[X], so this is option A written out: g applied to the mean, not the mean of g(X).
- If you chose D: This is circular and incorrect.
Q4: If X ~ Binomial(n, p), then E[X] equals:
A) np
B) n/p
C) p/n
D) np²
Correct: A)
- If you chose A: Correct! E[X] = np. This can be derived using linearity: X = I₁ + I₂ + ... + Iₙ where each Iᵢ ~ Bernoulli(p), so E[X] = n·p.
- If you chose B: This would be the mean of a NegativeBinomial(r,p) with r = n: E[X] = r/p.
- If you chose C: p/n shrinks as n grows, whereas the expected number of successes must grow with the number of trials.
- If you chose D: Each trial contributes p to the mean, not p², so the total is np.
Q5: Var(aX + b) for constants a and b equals:
A) a·Var(X) + b
B) a²·Var(X)
C) a·Var(X)
D) Var(X) + b²
Correct: B)
- If you chose B: Correct! Var(aX + b) = a²·Var(X). Adding a constant b shifts the distribution but doesn't change spread. The factor a gets squared.
- If you chose A: Constants added don't affect variance, and the factor isn't squared.
- If you chose C: The scaling factor must be squared — variance has squared units relative to X.
- If you chose D: Adding b doesn't affect variance at all; only the multiplicative constant matters.
Q6: For which discrete distribution does E[X] = 1/p?
A) Binomial(n, p)
B) Poisson(p)
C) Geometric(p)
D) Bernoulli(p)
Correct: C)
- If you chose C: Correct! For Geometric(p), the expected number of trials until the first success is 1/p.
- If you chose A: E[Binomial(n,p)] = np, not 1/p.
- If you chose B: E[Poisson(λ)] = λ, so a Poisson(p) variable has mean p, not 1/p.
- If you chose D: E[Bernoulli(p)] = p, not 1/p.
Q7: Covariance Cov(X, Y) measures:
A) The probability that X and Y are independent
B) The strength of linear dependence between X and Y
C) Whether X is always larger than Y
D) The ratio of E[X] to E[Y]
Correct: B)
- If you chose B: Correct! Cov(X,Y) = E[(X−μₓ)(Y−μᵧ)] measures how X and Y vary together linearly. Positive covariance means they tend to move in the same direction.
- If you chose A: Covariance is not a probability. Independence implies zero covariance, but the converse is false: zero covariance does not imply independence (except when X and Y are jointly normal).
- If you chose C: Covariance doesn't say anything about the relative magnitudes of X and Y.
- If you chose D: This is a ratio of means, unrelated to covariance.
Practice Problems
1. Derive E[X] for X ~ Bernoulli(p) directly from the definition. Then derive Var(X).
2. Let X ~ NegBin(2, 0.3). Find E[X] directly from the formula r/p.
3. A random variable X has PMF: p(1) = 0.2, p(2) = 0.3, p(3) = 0.5. Find E[X], E[X²], Var(X), and E[1/X].
4. Using indicator variables, find the expected number of heads when 10 fair coins are flipped. Then find the variance.
5. Prove that Var(aX + b) = a² Var(X) for constants a, b.
6. If Var(X) = 3, Var(Y) = 5, and Cov(X, Y) = 2, find Var(2X − 3Y).
7. Show that if P(X = c) = 1, then E[X] = c and Var(X) = 0. Is the converse true?
Answers
1. E[X] = 0·(1−p) + 1·p = p. E[X²] = 0²·(1−p) + 1²·p = p. Var(X) = E[X²] − (E[X])² = p − p² = p(1−p).
2. E[X] = r/p = 2/0.3 ≈ 6.667. On average, it takes about 6.67 trials to get 2 successes.
3. E[X] = 1(0.2) + 2(0.3) + 3(0.5) = 0.2 + 0.6 + 1.5 = 2.3. E[X²] = 1(0.2) + 4(0.3) + 9(0.5) = 0.2 + 1.2 + 4.5 = 5.9. Var(X) = 5.9 − (2.3)² = 5.9 − 5.29 = 0.61. E[1/X] = 1(0.2) + (1/2)(0.3) + (1/3)(0.5) = 0.2 + 0.15 + 0.1667 = 0.5167.
4. Let Iᵢ = indicator of heads on flip i. E[Iᵢ] = 0.5. By linearity, E[X] = 10·0.5 = 5. Since flips are independent, Var(X) = Σ Var(Iᵢ) = 10·(0.5·0.5) = 10·0.25 = 2.5.
5. Var(aX+b) = E[(aX+b − E[aX+b])²] = E[(aX+b − aE[X]−b)²] = E[a²(X−E[X])²] = a² E[(X−E[X])²] = a² Var(X).
6. Var(2X−3Y) = 4Var(X) + 9Var(Y) − 12Cov(X,Y) = 4·3 + 9·5 − 12·2 = 12 + 45 − 24 = 33.
7. If P(X=c)=1, then E[X] = c·1 = c. E[X²] = c², so Var(X) = c² − c² = 0. Yes, the converse is true: if Var(X) = 0, then E[(X−μ)²] = 0. Since (X−μ)² ≥ 0 almost surely, the only way its expectation is zero is if P(X=μ) = 1.
Summary
- E[X] = Σ x p(x) is the probability-weighted average; linearity of expectation E[aX+bY] = aE[X]+bE[Y] holds WITHOUT independence — it's one of probability's most powerful tools
- LOTUS lets you compute E[g(X)] = Σ g(x) p(x) without finding the distribution of g(X)
- Variance Var(X) = E[(X−μ)²] = E[X²] − (E[X])² measures spread; Var(aX+b) = a²Var(X)
- Covariance Cov(X,Y) = E[XY]−E[X]E[Y] measures linear dependence; independence ⇒ Cov=0, but Cov=0 ⇏ independence
- Correlation ρ = Cov/(σ_X σ_Y) ∈ [−1, 1] standardizes covariance; ρ = ±1 means perfect linear relationship
Pitfalls
- Thinking linearity of expectation requires independence. E[X + Y] = E[X] + E[Y] holds ALWAYS, regardless of dependence. This is arguably probability's most useful tool precisely because it's unconditional. If you find yourself checking independence before summing expectations, you're adding unnecessary work.
- Assuming E[g(X)] = g(E[X]). LOTUS says E[g(X)] = Σ g(x) p(x), which is NOT the same as plugging E[X] into g. For example, E[X²] ≠ (E[X])² — the gap is exactly Var(X). Only linear functions satisfy E[g(X)] = g(E[X]).
- Forgetting to square the constant in variance scaling. Var(aX + b) = a²Var(X), not a·Var(X). The constant b shifts the distribution but doesn't affect spread. A common error is writing Var(2X) = 2Var(X) instead of 4Var(X).
- Concluding independence from zero covariance. Cov(X, Y) = 0 means no LINEAR relationship, but X and Y can be perfectly functionally dependent yet uncorrelated (e.g., X ~ N(0,1), Y = X²). Only for jointly normal variables does zero correlation force independence.
- Computing Var(X + Y) without the covariance term. Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y). Only when Cov = 0 (e.g., independent) does it reduce to Var(X) + Var(Y). Forgetting the cross-term leads to incorrect variances for correlated data.
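The zero-covariance pitfall also has a purely discrete version. The sketch below assumes a hypothetical X uniform on {−1, 0, 1} with Y = X², so Y is completely determined by X, yet the covariance is exactly zero.

```python
from fractions import Fraction

# Hypothetical example: X uniform on {-1, 0, 1} and Y = X^2.
pmf_x = {x: Fraction(1, 3) for x in (-1, 0, 1)}
joint = {(x, x * x): p for x, p in pmf_x.items()}

ex = sum(x * p for (x, y), p in joint.items())
ey = sum(y * p for (x, y), p in joint.items())
exy = sum(x * y * p for (x, y), p in joint.items())
cov = exy - ex * ey

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_x0 = pmf_x[0]
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_joint = joint[(0, 0)]

print(cov, p_joint, p_x0 * p_y0)  # 0 1/3 1/9
```

The covariance vanishes because the contributions from x = −1 and x = 1 cancel, while the joint probability 1/3 differs from the product of marginals 1/9, so X and Y are not independent.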
Quiz
1. Linearity of expectation E[X + Y] = E[X] + E[Y] holds:
   a) Only if X and Y are independent
   b) Only if X and Y have the same distribution
   c) Always (provided expectations exist)
   d) Only for continuous random variables
   Answer: c. Linearity is unconditional: it follows from the definition of expectation and the distributive property of sums.
2. If X ~ Binomial(100, 0.2), E[X] is:
   a) 20
   b) 80
   c) 5
   d) 50
   Answer: a. E[X] = np = 100·0.2 = 20.
3. The Law of the Unconscious Statistician (LOTUS) states:
   a) E[g(X)] = g(E[X])
   b) E[g(X)] = Σ g(x) p_X(x)
   c) E[g(X)] = ∫ g(x) dx
   d) E[g(X)] = E[X] · E[g(X)]
   Answer: b. You compute the expectation of g(X) by summing g(x) weighted by the PMF of X.
4. Var(X) = 0 implies:
   a) X has a symmetric distribution
   b) X is constant with probability 1
   c) E[X] = 0
   d) X is discrete
   Answer: b. Zero variance means no variability; the random variable is degenerate (constant almost surely).
5. If Cov(X, Y) = 0, which must be true?
   a) X and Y are independent
   b) E[XY] = E[X]E[Y]
   c) Var(X + Y) > Var(X) + Var(Y)
   d) ρ = 1
   Answer: b. Cov(X,Y) = E[XY] − E[X]E[Y] = 0 ⇔ E[XY] = E[X]E[Y]. Independence is sufficient but not necessary for zero covariance.
6. The variance of a sum Var(X + Y) equals:
   a) Var(X) + Var(Y)
   b) Var(X) + Var(Y) + Cov(X, Y)
   c) Var(X) + Var(Y) + 2Cov(X, Y)
   d) Var(X)Var(Y) + 2Cov(X, Y)
   Answer: c. Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y). When X and Y are independent, the covariance term drops.
7. For X ~ Geometric(0.25), E[X] = ?
   a) 2
   b) 4
   c) 0.25
   d) 8
   Answer: b. E[X] = 1/p = 1/0.25 = 4.
8. Which is always true about the correlation coefficient ρ?
   a) ρ > 0
   b) −1 ≤ ρ ≤ 1
   c) If ρ = 0, X and Y are independent
   d) ρ = Cov(X,Y) · Var(X) · Var(Y)
   Answer: b. By Cauchy-Schwarz, |Cov(X,Y)| ≤ σ_X σ_Y, so −1 ≤ ρ ≤ 1.
Next Steps
Continue to 10-07 Continuous Random Variables to learn about PDFs, CDFs, the uniform distribution, and the exponential distribution in the continuous setting.