Phase 10: Probability Theory

Subject 10-06: Expectation of Discrete Random Variables

Prerequisites: 10-04 (Discrete Random Variables), basic summation

Learning Objectives

Define expected value E[X] for discrete random variables and compute it from the PMF
Apply linearity of expectation E[aX + bY] = aE[X] + bE[Y] without independence requirements
Use the Law of the Unconscious Statistician (LOTUS) to compute E[g(X)] directly from the PMF
Define variance Var(X) = E[(X − μ)²], standard deviation, and their properties under linear transformations
Define covariance Cov(X, Y) and correlation, and explain how they measure linear dependence

Core Content

1. Definition of Expected Value

For a discrete random variable X with PMF p_X(x), the expected value (mean) is:

$E[X] = Σ_x x · p_X(x)
$

where the sum is over all values x in the support of X. For the sum to be well-defined, we require absolute convergence Σ |x| p_X(x) < ∞.

Interpretation: E[X] is the probability-weighted average of all possible values — the "long-run average" if the experiment were repeated infinitely many times.
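As a minimal sketch of the definition, the snippet below computes E[X] from a PMF stored as a dictionary. The helper name `expectation` and the two-coin-flip PMF are illustrative assumptions, not part of the lesson; exact fractions are used so the sum-to-one check needs no floating-point tolerance.

```python
from fractions import Fraction


def expectation(pmf):
    """Return E[X] = sum of x * p(x) for a PMF given as {value: probability}."""
    total = sum(pmf.values())
    if total != 1:
        raise ValueError(f"probabilities sum to {total}, expected 1")
    return sum(x * p for x, p in pmf.items())


# Hypothetical example: X = number of heads in two fair coin flips.
two_flips_pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
two_flips_mean = expectation(two_flips_pmf)
print(two_flips_mean)
```

With floating-point probabilities the equality check on the total would need a tolerance (for example `math.isclose`).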

Expected values of common discrete distributions:

Distribution	E[X]
Bernoulli(p)	p
Binomial(n, p)	np
Geometric(p)	1/p
NegBin(r, p)	r/p
Poisson(λ)	λ
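Hypergeometric(N, K, n)	n·(K/N)

Convention: Geometric(p) counts the number of trials up to and including the first success, and NegBin(r, p) counts the number of trials up to and including the r-th success.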

Derivation for Binomial: Let X ~ Binomial(n, p). Using the definition:

E[X] = Σ_{k=0}^{n} k · C(n,k) pᵏ (1−p)^{n−k}

The k = 0 term is zero, so the sum can start at k = 1. Using the identity k·C(n,k) = n·C(n−1, k−1) for k ≥ 1:

$E[X] = Σ_{k=1}^{n} n·C(n−1, k−1) pᵏ (1−p)^{n−k}
     = np Σ_{k=1}^{n} C(n−1, k−1) p^{k−1} (1−p)^{(n−1)−(k−1)}
     = np Σ_{j=0}^{n−1} C(n−1, j) pʲ (1−p)^{(n−1)−j} = np · 1 = np
$
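The derivation can be sanity-checked numerically. The sketch below evaluates the defining sum directly and also checks the identity k·C(n,k) = n·C(n−1, k−1); the function name `binomial_mean` and the values n = 12, p = 0.3 are arbitrary choices for illustration.

```python
from math import comb


def binomial_mean(n, p):
    """Evaluate the sum of k * C(n, k) * p^k * (1 - p)^(n - k) over k = 0..n."""
    return sum(k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))


n_trials, p_success = 12, 0.3
direct_sum = binomial_mean(n_trials, p_success)

# The combinatorial identity used in the derivation, checked for k = 1..n.
identity_holds = all(
    k * comb(n_trials, k) == n_trials * comb(n_trials - 1, k - 1)
    for k in range(1, n_trials + 1)
)

print(direct_sum, n_trials * p_success, identity_holds)
```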

Derivation for Geometric:

$E[X] = Σ_{k=1}^{∞} k (1−p)^{k−1} p = p · (1/p²) = 1/p
$

using the identity Σ_{k=1}^{∞} k q^{k−1} = 1/(1−q)² for |q| < 1 with q = 1−p.

2. Linearity of Expectation

Theorem: For any random variables X, Y with finite expectations and constants a, b:

$E[aX + bY] = aE[X] + bE[Y]
$

Crucially, this holds even when X and Y are DEPENDENT. This is one of the most powerful tools in probability.

Proof sketch:

$E[aX + bY] = Σ_x Σ_y (a x + b y) P(X=x, Y=y)
           = a Σ_x x Σ_y P(X=x, Y=y) + b Σ_y y Σ_x P(X=x, Y=y)
           = a Σ_x x P(X=x) + b Σ_y y P(Y=y)
           = a E[X] + b E[Y]
$

Corollary: E[Σ Xᵢ] = Σ E[Xᵢ] for any finite collection. This extends to countably infinite collections under absolute convergence.

Example — Indicator variables: Let I_A be the indicator of event A: I_A = 1 if A occurs, 0 otherwise. Then E[I_A] = 1·P(A) + 0·P(Aᶜ) = P(A). This simple fact combined with linearity is extremely powerful.

Application to Binomial mean (alternative derivation): X ~ Binomial(n, p) is the sum of n independent Bernoulli(p) random variables. By linearity: E[X] = Σ_{i=1}^{n} E[I_i] = Σ p = np. Much simpler than the direct sum!

Application to Hypergeometric mean: Let X be the number of successes in a sample of n items drawn without replacement from N items, K of which are successes. Write X = I₁ + ... + Iₙ, where Iᵢ indicates a success on draw i. Even though the draws are dependent, each draw is by symmetry equally likely to be any of the N items, so E[Iᵢ] = K/N. By linearity, E[X] = n·K/N.
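To see that the indicator argument agrees with the definition even though the draws are dependent, the sketch below builds the hypergeometric PMF exactly and compares the two means. The helper name `hypergeometric_pmf` and the parameters N = 20, K = 7, n = 5 are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb


def hypergeometric_pmf(N, K, n):
    """Exact PMF of the number of successes in n draws without replacement."""
    total = comb(N, n)
    lowest = max(0, n - (N - K))
    highest = min(n, K)
    return {
        k: Fraction(comb(K, k) * comb(N - K, n - k), total)
        for k in range(lowest, highest + 1)
    }


hyper_pmf = hypergeometric_pmf(20, 7, 5)

# Mean from the definition E[X] = sum of k * p(k).
mean_from_pmf = sum(k * p for k, p in hyper_pmf.items())

# Mean from linearity: n indicator variables, each with expectation K/N.
mean_from_linearity = Fraction(5 * 7, 20)

print(mean_from_pmf, mean_from_linearity)
```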

3. Law of the Unconscious Statistician (LOTUS)

For any function g(·):

$E[g(X)] = Σ_x g(x) · p_X(x)
$

"Unconscious" because you don't need to find the distribution of Y = g(X) — just plug X's PMF into g.

Example: E[X²] = Σ_x x² p_X(x).

Warning: E[g(X)] ≠ g(E[X]) in general. Jensen's inequality gives the direction of the gap for convex g: E[g(X)] ≥ g(E[X]).

Edge case: If X takes infinitely many values, the sum must converge absolutely for E[g(X)] to exist.
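A small sketch of LOTUS in action: it computes E[g(X)] directly from X's PMF, then again the long way by first building the PMF of Y = g(X), and finally evaluates g(E[X]) for contrast. The names `lotus` and `pushforward` and the four-point PMF are hypothetical, chosen so that g(x) = x² maps two different x values to the same y.

```python
from collections import defaultdict
from fractions import Fraction


def lotus(pmf, g):
    """E[g(X)] computed directly from the PMF of X."""
    return sum(g(x) * p for x, p in pmf.items())


def pushforward(pmf, g):
    """PMF of Y = g(X), merging x values that map to the same y."""
    out = defaultdict(Fraction)
    for x, p in pmf.items():
        out[g(x)] += p
    return dict(out)


def square(x):
    return x * x


quarter = Fraction(1, 4)
x_pmf = {-1: quarter, 0: quarter, 1: quarter, 2: quarter}

via_lotus = lotus(x_pmf, square)
y_pmf = pushforward(x_pmf, square)
via_y_pmf = sum(y * p for y, p in y_pmf.items())
g_of_mean = square(sum(x * p for x, p in x_pmf.items()))

print(via_lotus, via_y_pmf, g_of_mean)
```

Both routes give the same expectation, while g(E[X]) is smaller, consistent with Jensen's inequality for the convex function x².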

4. Variance and Standard Deviation

Definition: Let μ = E[X]. The variance of X is:

$Var(X) = E[(X − μ)²] = Σ_x (x − μ)² p_X(x)
$

Alternative computational formula:

$Var(X) = E[X²] − (E[X])²
$

Proof: Var(X) = E[(X−μ)²] = E[X² − 2μX + μ²] = E[X²] − 2μE[X] + μ² = E[X²] − 2μ² + μ² = E[X²] − μ².

Standard deviation: σ_X = √Var(X) — same units as X.

Variances of common distributions:

Distribution	Var(X)
Bernoulli(p)	p(1−p)
Binomial(n, p)	np(1−p)
Geometric(p)	(1−p)/p²
Poisson(λ)	λ
NegBin(r, p)	r(1−p)/p²

Properties of variance:

- Var(X) ≥ 0 (variance is non-negative)
- Var(aX + b) = a² Var(X) (adding a constant doesn't change variance)
- Var(X) = 0 if and only if P(X = c) = 1 for some constant c (degenerate distribution)
- For any constants a, b: Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y)
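The scaling property can be checked on a concrete PMF. This sketch assumes a three-point PMF, p(1) = 0.2, p(2) = 0.3, p(3) = 0.5, and illustrative helper names `pmf_mean`, `pmf_variance`, and `affine`; it compares the definition of variance with the computational formula and with Var(aX + b) for a = −3, b = 7.

```python
from fractions import Fraction


def pmf_mean(pmf):
    return sum(x * p for x, p in pmf.items())


def pmf_variance(pmf):
    """Var(X) from the definition E[(X - mu)^2]."""
    mu = pmf_mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def affine(pmf, a, b):
    """PMF of aX + b; assumes a != 0 so distinct x give distinct values."""
    return {a * x + b: p for x, p in pmf.items()}


base_pmf = {1: Fraction(1, 5), 2: Fraction(3, 10), 3: Fraction(1, 2)}

var_definition = pmf_variance(base_pmf)
var_shortcut = sum(x * x * p for x, p in base_pmf.items()) - pmf_mean(base_pmf) ** 2
var_transformed = pmf_variance(affine(base_pmf, -3, 7))

print(var_definition, var_shortcut, var_transformed)
```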

5. Covariance and Correlation

Covariance:

$Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
$

Properties:

- Cov(X, X) = Var(X)
- Cov(X, Y) = Cov(Y, X) (symmetric)
- Cov(aX + b, cY + d) = ac·Cov(X, Y)
- Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z) (bilinear)

Independence ⇒ zero covariance (but not conversely!): If X and Y are independent, then E[XY] = E[X]E[Y], so Cov(X, Y) = 0. Zero covariance does NOT imply independence — it only means no linear relationship.
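A concrete discrete counterexample makes the one-way implication tangible. In the sketch below, X is assumed uniform on {−1, 0, 1} and Y = X² (an illustrative choice, not taken from the lesson): the covariance is exactly zero, yet the joint probability at (0, 0) differs from the product of the marginals, so X and Y are not independent.

```python
from fractions import Fraction

third = Fraction(1, 3)

# Joint PMF of (X, Y) with Y = X^2 and X uniform on {-1, 0, 1}.
dependent_joint = {(-1, 1): third, (0, 0): third, (1, 1): third}


def joint_expect(joint, f):
    """E[f(X, Y)] under a joint PMF given as {(x, y): probability}."""
    return sum(f(x, y) * p for (x, y), p in joint.items())


cov_zero_example = joint_expect(dependent_joint, lambda x, y: x * y) - (
    joint_expect(dependent_joint, lambda x, y: x)
    * joint_expect(dependent_joint, lambda x, y: y)
)

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_x0 = sum(p for (x, _), p in dependent_joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in dependent_joint.items() if y == 0)
p_joint_00 = dependent_joint[(0, 0)]

print(cov_zero_example, p_joint_00, p_x0 * p_y0)
```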

Correlation coefficient:

$ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y)
$

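Always satisfies −1 ≤ ρ ≤ 1 (by Cauchy-Schwarz), provided σ_X and σ_Y are both positive so that ρ is defined. ρ = ±1 iff Y = aX + b almost surely for some a ≠ 0 (perfect linear relationship), with the sign of ρ matching the sign of a.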

Variance of sum:

$Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
$

For independent X, Y: Var(X + Y) = Var(X) + Var(Y).

General sum:

$Var(Σ Xᵢ) = Σ Var(Xᵢ) + 2 Σ_{i<j} Cov(Xᵢ, Xⱼ)
$
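The variance-of-a-sum formula can be verified by computing Var(X + Y) two ways: from the PMF of S = X + Y, and from Var(X) + Var(Y) + 2Cov(X, Y). The joint PMF below is a made-up example with nonzero covariance, and the names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF {(x, y): probability} with dependent X and Y.
sum_joint = {
    (0, 1): Fraction(1, 8),
    (1, 1): Fraction(3, 8),
    (1, 3): Fraction(1, 4),
    (2, 3): Fraction(1, 4),
}


def e_joint(f):
    return sum(f(x, y) * p for (x, y), p in sum_joint.items())


ex, ey = e_joint(lambda x, y: x), e_joint(lambda x, y: y)
var_x_part = e_joint(lambda x, y: x * x) - ex**2
var_y_part = e_joint(lambda x, y: y * y) - ey**2
cov_part = e_joint(lambda x, y: x * y) - ex * ey

# Direct route: build the PMF of S = X + Y, then apply the variance formula.
s_pmf = defaultdict(Fraction)
for (x, y), p in sum_joint.items():
    s_pmf[x + y] += p
es = sum(s * p for s, p in s_pmf.items())
var_sum_direct = sum(s * s * p for s, p in s_pmf.items()) - es**2

var_sum_formula = var_x_part + var_y_part + 2 * cov_part
print(var_sum_direct, var_sum_formula, cov_part)
```

Dropping the covariance term here would understate the variance, since the covariance is positive.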

Key Terms

expected value

Worked Examples

Example 1: Expected Value of Geometric Distribution

Let X ~ Geometric(p). Compute E[X] directly from the definition.

Solution:

$E[X] = Σ_{k=1}^{∞} k (1−p)^{k−1} p = p Σ_{k=1}^{∞} k q^{k−1}    where q = 1−p
$

Recall: Σ_{k=1}^{∞} k q^{k−1} = d/dq (Σ_{k=0}^{∞} qᵏ) = d/dq (1/(1−q)) = 1/(1−q)² = 1/p².

Therefore E[X] = p · (1/p²) = 1/p.

For p=0.5 (fair coin, waiting for first heads), E[X] = 2 flips on average. ✓
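The closed form can be checked against a truncated version of the series. This is a rough numerical sketch: the function name `geometric_mean_partial` and the cutoff of 500 terms are assumptions, and the cutoff is only adequate when p is not too close to 0, because the neglected tail shrinks like (1−p)^k.

```python
def geometric_mean_partial(p, terms=500):
    """Partial sum of k * (1 - p)^(k - 1) * p for k = 1..terms."""
    q = 1.0 - p
    return sum(k * q ** (k - 1) * p for k in range(1, terms + 1))


fair_coin_mean = geometric_mean_partial(0.5)
quarter_mean = geometric_mean_partial(0.25)
print(fair_coin_mean, quarter_mean)
```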

Example 2: Variance via LOTUS

Let X be the outcome of a fair die roll. Find Var(X).

Solution:

PMF: p(k) = 1/6 for k = 1, ..., 6.

E[X] = (1+2+3+4+5+6)/6 = 21/6 = 3.5.

E[X²] = (1+4+9+16+25+36)/6 = 91/6 ≈ 15.1667.

Var(X) = E[X²] − (E[X])² = 91/6 − (7/2)² = 91/6 − 49/4 = (182 − 147)/12 = 35/12 ≈ 2.917.

σ = √(35/12) ≈ 1.708.
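The arithmetic in this example can be reproduced exactly with rational numbers. The variable names below are illustrative; only the fair-die PMF comes from the example.

```python
from fractions import Fraction
from math import sqrt

faces = range(1, 7)
face_prob = Fraction(1, 6)

die_mean = sum(k * face_prob for k in faces)              # E[X]
die_second_moment = sum(k**2 * face_prob for k in faces)  # E[X^2] via LOTUS
die_variance = die_second_moment - die_mean**2            # E[X^2] - (E[X])^2
die_std = sqrt(float(die_variance))

print(die_mean, die_second_moment, die_variance, round(die_std, 3))
```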

Example 3: Covariance and Correlation

Joint PMF of (X, Y):

X\Y	0	1
0	0.2	0.1
1	0.3	0.4

Find Cov(X, Y) and ρ(X, Y).

Solution:

Marginals: P(X=0) = 0.3, P(X=1) = 0.7; P(Y=0) = 0.5, P(Y=1) = 0.5.

E[X] = 0·0.3 + 1·0.7 = 0.7
E[Y] = 0·0.5 + 1·0.5 = 0.5
E[XY] = 0·0·0.2 + 0·1·0.1 + 1·0·0.3 + 1·1·0.4 = 0.4

Cov(X, Y) = E[XY] − E[X]E[Y] = 0.4 − (0.7)(0.5) = 0.4 − 0.35 = 0.05

Var(X) = E[X²] − (E[X])². E[X²] = 0²·0.3 + 1²·0.7 = 0.7. Var(X) = 0.7 − 0.49 = 0.21. Var(Y) = E[Y²] − (E[Y])² = 0.5 − 0.25 = 0.25.

ρ = 0.05 / √(0.21 · 0.25) = 0.05 / √0.0525 = 0.05 / 0.229 ≈ 0.218

Weak positive linear relationship: ρ ≈ 0.22 is well below 1.
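The same computation can be scripted from the joint table. The sketch below encodes the four cells of the table as exact fractions; the helper name `table_expect` is an assumption made for illustration.

```python
from fractions import Fraction
from math import sqrt

# Joint PMF from the table, keyed by (x, y).
table_joint = {
    (0, 0): Fraction(2, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(3, 10),
    (1, 1): Fraction(4, 10),
}


def table_expect(f):
    """E[f(X, Y)] under the joint PMF above."""
    return sum(f(x, y) * p for (x, y), p in table_joint.items())


mean_x = table_expect(lambda x, y: x)
mean_y = table_expect(lambda x, y: y)
cov_xy = table_expect(lambda x, y: x * y) - mean_x * mean_y
var_x = table_expect(lambda x, y: x * x) - mean_x**2
var_y = table_expect(lambda x, y: y * y) - mean_y**2
rho = float(cov_xy) / sqrt(float(var_x * var_y))

print(cov_xy, var_x, var_y, round(rho, 3))
```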

Quiz

Q1: Linearity of expectation E[aX + bY] = aE[X] + bE[Y] holds:

A) Only when X and Y are independent
B) Only when X and Y are identically distributed
C) For any random variables X and Y, regardless of dependence
D) Only for discrete random variables

Correct: C)

If you chose C: Correct! Linearity of expectation is one of the most powerful tools in probability precisely because it requires NO assumptions about dependence or distribution.
If you chose A: This is a common misconception. Independence is NOT required for linearity of expectation.
If you chose B: The distributions can be completely different — the identity still holds.
If you chose D: Linearity holds for both discrete and continuous random variables.

Q2: E[X] for a Bernoulli(p) random variable equals:

A) p(1−p)
B) p
C) 1/p
D) √(p(1−p))

Correct: B)

If you chose B: Correct! X takes value 1 with probability p and 0 with probability 1−p. E[X] = 1·p + 0·(1−p) = p.
If you chose A: This is the variance Var(X) = p(1−p), not the mean.
If you chose C: This is E[X] for a Geometric(p) distribution, not Bernoulli.
If you chose D: This is the standard deviation √Var(X), not the mean.

Q3: The Law of the Unconscious Statistician (LOTUS) states that E[g(X)] equals:

A) g(E[X])
B) Σ g(x) p_X(x) for discrete X
C) g(Σ x p_X(x))
D) E[X] · E[g(X)]

Correct: B)

If you chose B: Correct! LOTUS says E[g(X)] = Σ g(x) p_X(x) — you apply g to each value and weight by its probability. No need to find the distribution of g(X) first.
If you chose A: This would be true ONLY if g is linear (g(x) = ax + b). In general, E[g(X)] ≠ g(E[X]), as Jensen's inequality shows.
If you chose C: Σ x p_X(x) is just E[X], so this is g(E[X]) again, the same mistake as option A.
If you chose D: This is circular and incorrect.

Q4: If X ~ Binomial(n, p), then E[X] equals:

A) np
B) n/p
C) p/n
D) np²

Correct: A)

If you chose A: Correct! E[X] = np. This can be derived using linearity: X = I₁ + I₂ + ... + Iₙ where each Iᵢ ~ Bernoulli(p), so E[X] = n·p.
If you chose B: This would be the mean of a NegativeBinomial(r,p) with r = n: E[X] = r/p.
If you chose C: p/n shrinks as n grows, but the expected number of successes should grow with n.
If you chose D: np² is not the mean; it is the amount subtracted from np to get the variance, since np(1−p) = np − np².

Q5: Var(aX + b) for constants a and b equals:

A) a·Var(X) + b
B) a²·Var(X)
C) a·Var(X)
D) Var(X) + b²

Correct: B)

If you chose B: Correct! Var(aX + b) = a²·Var(X). Adding a constant b shifts the distribution but doesn't change spread. The factor a gets squared.
If you chose A: Constants added don't affect variance, and the factor isn't squared.
If you chose C: The scaling factor must be squared — variance has squared units relative to X.
If you chose D: Adding b doesn't affect variance at all; only the multiplicative constant matters.

Q6: For which discrete distribution does E[X] = 1/p?

A) Binomial(n, p)
B) Poisson(p)
C) Geometric(p)
D) Bernoulli(p)

Correct: C)

If you chose C: Correct! For Geometric(p), the expected number of trials until the first success is 1/p.
If you chose A: E[Binomial(n,p)] = np, not 1/p.
If you chose B: A Poisson distribution with parameter p has mean p, not 1/p.
If you chose D: E[Bernoulli(p)] = p, not 1/p.

Q7: Covariance Cov(X, Y) measures:

A) The probability that X and Y are independent
B) The strength of linear dependence between X and Y
C) Whether X is always larger than Y
D) The ratio of E[X] to E[Y]

Correct: B)

If you chose B: Correct! Cov(X,Y) = E[(X−μₓ)(Y−μᵧ)] measures how X and Y vary together linearly. Positive covariance means they tend to move in the same direction.
If you chose A: Covariance is not a probability. Independence implies zero covariance, but the converse is false in general: zero covariance does not imply independence (jointly normal variables are a notable exception).
If you chose C: Covariance doesn't say anything about the relative magnitudes of X and Y.
If you chose D: This is a ratio of means, unrelated to covariance.

Practice Problems

1. Derive E[X] for X ~ Bernoulli(p) directly from the definition. Then derive Var(X).
2. Let X ~ NegBin(2, 0.3). Find E[X] directly from the formula r/p.
3. A random variable X has PMF: p(1) = 0.2, p(2) = 0.3, p(3) = 0.5. Find E[X], E[X²], Var(X), and E[1/X].
4. Using indicator variables, find the expected number of heads when 10 fair coins are flipped. Then find the variance.
5. Prove that Var(aX + b) = a² Var(X) for constants a, b.
6. If Var(X) = 3, Var(Y) = 5, and Cov(X, Y) = 2, find Var(2X − 3Y).
7. Show that if P(X = c) = 1, then E[X] = c and Var(X) = 0. Is the converse true?

Answers

1. E[X] = 0·(1−p) + 1·p = p. E[X²] = 0²·(1−p) + 1²·p = p. Var(X) = E[X²] − (E[X])² = p − p² = p(1−p).
2. E[X] = r/p = 2/0.3 ≈ 6.667. On average, it takes about 6.67 trials to get 2 successes.
3. E[X] = 1(0.2) + 2(0.3) + 3(0.5) = 0.2 + 0.6 + 1.5 = 2.3. E[X²] = 1(0.2) + 4(0.3) + 9(0.5) = 0.2 + 1.2 + 4.5 = 5.9. Var(X) = 5.9 − (2.3)² = 5.9 − 5.29 = 0.61. E[1/X] = 1(0.2) + (1/2)(0.3) + (1/3)(0.5) = 0.2 + 0.15 + 0.1667 = 0.5167.
4. Let Iᵢ = indicator of heads on flip i. E[Iᵢ] = 0.5. By linearity, E[X] = 10·0.5 = 5. Since flips are independent, Var(X) = Σ Var(Iᵢ) = 10·(0.5·0.5) = 10·0.25 = 2.5.
5. Var(aX+b) = E[(aX+b − E[aX+b])²] = E[(aX+b − aE[X]−b)²] = E[a²(X−E[X])²] = a² E[(X−E[X])²] = a² Var(X).
6. Var(2X−3Y) = 4Var(X) + 9Var(Y) − 12Cov(X,Y) = 4·3 + 9·5 − 12·2 = 12 + 45 − 24 = 33.
7. If P(X=c)=1, then E[X] = c·1 = c. E[X²] = c², so Var(X) = c² − c² = 0. Yes, the converse is true: if Var(X) = 0, then E[(X−μ)²] = 0. Since (X−μ)² ≥ 0 almost surely, the only way its expectation is zero is if P(X=μ) = 1.
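Two of these answers are easy to double-check by machine. The sketch below recomputes E[1/X] from answer 3 with LOTUS and Var(2X − 3Y) from answer 6; the function name `var_linear_combination` is a hypothetical helper that applies the formula Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y).

```python
from fractions import Fraction


def var_linear_combination(a, b, var_x, var_y, cov_xy):
    """Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)."""
    return a * a * var_x + b * b * var_y + 2 * a * b * cov_xy


# Answer 6: Var(X) = 3, Var(Y) = 5, Cov(X, Y) = 2, coefficients a = 2, b = -3.
answer_6 = var_linear_combination(2, -3, 3, 5, 2)

# Answer 3: E[1/X] by LOTUS with g(x) = 1/x, using exact fractions.
problem_3_pmf = {1: Fraction(1, 5), 2: Fraction(3, 10), 3: Fraction(1, 2)}
answer_3_inverse = sum(Fraction(1, x) * p for x, p in problem_3_pmf.items())

print(answer_6, answer_3_inverse, float(answer_3_inverse))
```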

Summary

E[X] = Σ x p(x) is the probability-weighted average; linearity of expectation E[aX+bY] = aE[X]+bE[Y] holds WITHOUT independence — it's one of probability's most powerful tools
LOTUS lets you compute E[g(X)] = Σ g(x) p(x) without finding the distribution of g(X)
Variance Var(X) = E[(X−μ)²] = E[X²] − (E[X])² measures spread; Var(aX+b) = a²Var(X)
Covariance Cov(X,Y) = E[XY]−E[X]E[Y] measures linear dependence; independence ⇒ Cov=0, but Cov=0 ⇏ independence
Correlation ρ = Cov/(σ_X σ_Y) ∈ [−1, 1] standardizes covariance; ρ = ±1 means perfect linear relationship

Pitfalls

Thinking linearity of expectation requires independence. E[X + Y] = E[X] + E[Y] holds ALWAYS, regardless of dependence. This is arguably probability's most useful tool precisely because it's unconditional. If you find yourself checking independence before summing expectations, you're adding unnecessary work.
Assuming E[g(X)] = g(E[X]). LOTUS says E[g(X)] = Σ g(x) p(x), which is NOT the same as plugging E[X] into g. For example, E[X²] ≠ (E[X])² — the gap is exactly Var(X). Only linear functions satisfy E[g(X)] = g(E[X]).
Forgetting to square the constant in variance scaling. Var(aX + b) = a²Var(X), not a·Var(X). The constant b shifts the distribution but doesn't affect spread. A common error is writing Var(2X) = 2Var(X) instead of 4Var(X).
Concluding independence from zero covariance. Cov(X, Y) = 0 means no LINEAR relationship, but X and Y can be perfectly functionally dependent yet uncorrelated (e.g., X ~ N(0,1), Y = X²). Zero correlation forces independence only in special cases, such as jointly normal variables.
Computing Var(X + Y) without the covariance term. Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y). Only when Cov = 0 (e.g., independent) does it reduce to Var(X) + Var(Y). Forgetting the cross-term leads to incorrect variances for correlated data.

Quiz

Linearity of expectation E[X + Y] = E[X] + E[Y] holds:
a) Only if X and Y are independent
b) Only if X and Y have the same distribution
c) Always (provided expectations exist)
d) Only for continuous random variables
Answer: c. Linearity is unconditional; it follows from the definition of expectation and the distributive property of sums.

If X ~ Binomial(100, 0.2), E[X] is:
a) 20
b) 80
c) 5
d) 50
Answer: a. E[X] = np = 100·0.2 = 20.

The Law of the Unconscious Statistician (LOTUS) states:
a) E[g(X)] = g(E[X])
b) E[g(X)] = Σ g(x) p_X(x)
c) E[g(X)] = ∫ g(x) dx
d) E[g(X)] = E[X] · E[g(X)]
Answer: b. You compute the expectation of g(X) by summing g(x) weighted by the PMF of X.

Var(X) = 0 implies:
a) X has a symmetric distribution
b) X is constant with probability 1
c) E[X] = 0
d) X is discrete
Answer: b. Zero variance means no variability; the random variable is degenerate (constant almost surely).

If Cov(X, Y) = 0, which must be true?
a) X and Y are independent
b) E[XY] = E[X]E[Y]
c) Var(X + Y) > Var(X) + Var(Y)
d) ρ = 1
Answer: b. Cov(X,Y) = E[XY] − E[X]E[Y] = 0 ⇔ E[XY] = E[X]E[Y]. Independence is sufficient but not necessary for zero covariance.

The variance of a sum Var(X + Y) equals:
a) Var(X) + Var(Y)
b) Var(X) + Var(Y) + Cov(X, Y)
c) Var(X) + Var(Y) + 2Cov(X, Y)
d) Var(X)Var(Y) + 2Cov(X, Y)
Answer: c. Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y). When independent, the covariance term drops.

For X ~ Geometric(0.25), E[X] = ?
a) 2
b) 4
c) 0.25
d) 8
Answer: b. E[X] = 1/p = 1/0.25 = 4.

Which is always true about the correlation coefficient ρ?
a) ρ > 0
b) −1 ≤ ρ ≤ 1
c) If ρ = 0, X and Y are independent
d) ρ = Cov(X,Y) · Var(X) · Var(Y)
Answer: b. By Cauchy-Schwarz, |Cov(X,Y)| ≤ σ_X σ_Y, so −1 ≤ ρ ≤ 1.

Next Steps

Continue to 10-07 Continuous Random Variables to learn about PDFs, CDFs, the uniform distribution, and the exponential distribution in the continuous setting.
