Phase 11: Probability Theory II
Subject 11-02: Covariance and Correlation
Prerequisites: 10-06 (Expectation), 10-10 (Joint Distributions), 10-08 (Normal Distribution)
Learning Objectives
- Define covariance and correlation for two random variables and compute them from joint distributions
- Prove the Cauchy-Schwarz inequality for expectations and derive −1 ≤ ρ ≤ 1
- Construct and interpret the covariance matrix Σ for random vectors
- Explain the geometric interpretation of correlation as the cosine of an angle in L² space
- Distinguish correlation from causation and recognize when ρ = 0 fails to imply independence (with counterexamples)
Core Content
1. Covariance: Definition and Properties
For random variables X and Y with finite second moments:
$Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]$
Properties (follow directly from linearity of expectation):
1. Symmetry: Cov(X, Y) = Cov(Y, X)
2. Bilinearity: Cov(aX + bY, Z) = a Cov(X, Z) + b Cov(Y, Z)
3. Variance as special case: Cov(X, X) = Var(X)
4. Constants: Cov(X, c) = 0 for any constant c
5. Scaling: Cov(aX + b, cY + d) = ac · Cov(X, Y)
⚠️ CRITICAL: Independence ⇒ Cov(X, Y) = 0, but Cov(X, Y) = 0 ⇏ Independence. Covariance only measures LINEAR dependence. Variables can be perfectly functionally related yet have zero covariance.
Covariance and variance of sums:
$Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)$
$Var(Σᵢ Xᵢ) = Σᵢ Var(Xᵢ) + 2 Σ_{i<j} Cov(Xᵢ, Xⱼ)$
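The variance-of-a-sum identity and the scaling property can be checked numerically. The sketch below treats a small made-up data set as an equally weighted population, so every expectation is a plain average. The helper names `mean` and `cov` and the data values are illustrative assumptions, not part of any library.

```python
import math

def mean(values):
    """Expectation over an equally weighted finite population."""
    return sum(values) / len(values)

def cov(xs, ys):
    """Population covariance E[(X - E[X])(Y - E[Y])]."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

# Hypothetical paired observations, chosen only for illustration.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [3.0, 1.0, 5.0, 6.0]

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y); note Var(X) = Cov(X, X).
sums = [x + y for x, y in zip(xs, ys)]
lhs = cov(sums, sums)
rhs = cov(xs, xs) + cov(ys, ys) + 2 * cov(xs, ys)

# Cov(aX + b, cY + d) = a * c * Cov(X, Y): shifts drop out, scales multiply.
a, b, c, d = 2.0, 5.0, -3.0, 1.0
scaled = cov([a * x + b for x in xs], [c * y + d for y in ys])

print(math.isclose(lhs, rhs))
print(math.isclose(scaled, a * c * cov(xs, ys)))
```

Both comparisons use `math.isclose` rather than `==` because the two sides are computed along different floating-point paths.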
2. Correlation Coefficient
The Pearson correlation coefficient standardizes covariance to [−1, 1]:
$ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y)$
Properties:
- −1 ≤ ρ ≤ 1 (by Cauchy-Schwarz)
- ρ = +1 ⇔ Y = aX + b with a > 0 (perfect positive linear relationship, with probability 1)
- ρ = −1 ⇔ Y = aX + b with a < 0 (perfect negative linear relationship, with probability 1)
- ρ = 0 ⇔ no linear relationship (but possibly nonlinear dependence!)
- ρ is unitless and invariant, up to sign, under affine transformations with a, c ≠ 0: ρ(aX+b, cY+d) = sign(ac) · ρ(X, Y)
Proof of −1 ≤ ρ ≤ 1 (equivalent to Cauchy-Schwarz): Assume σ_X, σ_Y > 0 and use the fact that a variance is never negative, applied to X/σ_X ± Y/σ_Y:
$Var(X/σ_X ± Y/σ_Y) = Var(X)/σ_X² + Var(Y)/σ_Y² ± 2Cov(X,Y)/(σ_Xσ_Y) = 1 + 1 ± 2ρ = 2(1 ± ρ) ≥ 0$
So 1 + ρ ≥ 0 and 1 − ρ ≥ 0, hence −1 ≤ ρ ≤ 1.
Equality ρ = ±1 occurs iff Var(X/σ_X ∓ Y/σ_Y) = 0, meaning X/σ_X ∓ Y/σ_Y is constant almost surely — i.e., Y is an exact linear function of X.
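A quick numerical check of these properties, as a sketch on made-up data treated as an equally weighted population. The helper `corr` is an illustrative assumption and requires both variances to be positive.

```python
import math

def mean(values):
    return sum(values) / len(values)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def corr(xs, ys):
    """Pearson correlation; assumes both variances are positive."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

# Hypothetical data, chosen only for illustration.
xs = [0.0, 1.0, 3.0, 4.0, 7.0]
ys = [2.0, 1.0, 5.0, 3.0, 6.0]

rho = corr(xs, ys)
rho_up = corr(xs, [2 * x + 3 for x in xs])        # exact line, positive slope
rho_down = corr(xs, [-0.5 * x + 1 for x in xs])   # exact line, negative slope

# Affine maps with a * c < 0 flip the sign of rho and change nothing else.
rho_flipped = corr([3 * x - 1 for x in xs], [-2 * y + 4 for y in ys])

print(rho, rho_up, rho_down, rho_flipped)
```

Up to floating-point rounding, `rho_up` and `rho_down` sit at the two ends of [−1, 1], and `rho_flipped` is the negative of `rho`.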
3. The Covariance Matrix
For a random vector X = (X₁, X₂, ..., Xₚ)ᵀ, the covariance matrix Σ is:
$Σ = E[(X − μ)(X − μ)ᵀ]$
where μ = E[X] is the mean vector.
Entry-wise:
- Σ_{ii} = Var(Xᵢ) (diagonal entries are variances)
- Σ_{ij} = Cov(Xᵢ, Xⱼ) for i ≠ j (off-diagonal entries are covariances)
Properties of Σ:
1. Symmetric: Σ = Σᵀ (since Cov(Xᵢ, Xⱼ) = Cov(Xⱼ, Xᵢ))
2. Positive semi-definite: For any vector a, aᵀΣa = Var(aᵀX) ≥ 0
3. Positive definite iff no non-trivial linear combination aᵀX (a ≠ 0) of the components is constant almost surely
For linear transformations: If Y = AX + b where A is a matrix and b is a vector:
$Cov(Y) = A Σ_X Aᵀ$
Example — Bivariate case:
$Σ = [[σ_X², ρσ_Xσ_Y], [ρσ_Xσ_Y, σ_Y²]]$
The correlation ρ determines the off-diagonal entries.
The correlation matrix R standardizes Σ: R_{ij} = Σ_{ij} / √(Σ_{ii} Σ_{jj}). All diagonal entries are 1, off-diagonals are the pairwise correlations.
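The standardization R_{ij} = Σ_{ij} / √(Σ_{ii} Σ_{jj}) and the quadratic form aᵀΣa = Var(aᵀX) are short enough to sketch directly. The function names and the numeric values σ_X = 2, σ_Y = 3, ρ = 0.25 below are illustrative assumptions.

```python
import math

def correlation_matrix(sigma):
    """R[i][j] = sigma[i][j] / sqrt(sigma[i][i] * sigma[j][j])."""
    p = len(sigma)
    return [[sigma[i][j] / math.sqrt(sigma[i][i] * sigma[j][j])
             for j in range(p)] for i in range(p)]

def quadratic_form(a, sigma):
    """a^T sigma a, which equals Var(a^T X)."""
    p = len(a)
    return sum(a[i] * sigma[i][j] * a[j] for i in range(p) for j in range(p))

# Bivariate covariance matrix built from sigma_X, sigma_Y and rho.
sigma_x, sigma_y, rho = 2.0, 3.0, 0.25
sigma = [[sigma_x**2, rho * sigma_x * sigma_y],
         [rho * sigma_x * sigma_y, sigma_y**2]]

R = correlation_matrix(sigma)
print(R)                                    # [[1.0, 0.25], [0.25, 1.0]]

# Var(X - Y) = 4 + 9 - 2(1.5) = 10, using a = (1, -1).
print(quadratic_form([1.0, -1.0], sigma))   # 10.0
```

The diagonal of R is 1 by construction, and the off-diagonal entry recovers the ρ used to build Σ.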
4. Geometric Interpretation: L² Space
Consider the vector space of random variables with finite second moments, equipped with the inner product:
$⟨X, Y⟩ = E[XY]$
Then:
- Norm: ‖X‖ = √E[X²]
- Centered variables: Let X̃ = X − E[X]. Then Cov(X, Y) = ⟨X̃, Ỹ⟩.
- Variance as squared norm: Var(X) = ‖X̃‖²
- Correlation as cosine: ρ(X, Y) = ⟨X̃, Ỹ⟩ / (‖X̃‖ ‖Ỹ‖) = cos(θ), where θ is the angle between X̃ and Ỹ in L² space
This geometric view yields powerful insights:
- ρ = 0 means X̃ ⟂ Ỹ (orthogonal — uncorrelated)
- ρ = ±1 means X̃ and Ỹ are collinear (angle 0° or 180°)
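- Projection formula: the best linear predictor of Y given X comes from the orthogonal projection of Ỹ onto X̃, which has coefficient ⟨X̃, Ỹ⟩ / ‖X̃‖² = ρ(σ_Y/σ_X):
Ŷ = μ_Y + ρ(σ_Y/σ_X)(X − μ_X). This equals E[Y|X] exactly when (X, Y) is bivariate normal; in general it is only the best linear approximation to E[Y|X].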
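The geometric picture can be made concrete on a finite population, where E[UV] is just an average of products. This is a minimal sketch on made-up data; `inner` is an illustrative helper name for the inner product ⟨U, V⟩ = E[UV].

```python
import math

def mean(values):
    return sum(values) / len(values)

def inner(us, vs):
    """L2 inner product <U, V> = E[UV] for an equally weighted population."""
    return mean([u * v for u, v in zip(us, vs)])

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

# Center both variables.
xc = [x - mean(xs) for x in xs]
yc = [y - mean(ys) for y in ys]

# Correlation as the cosine of the angle between the centered variables.
cos_theta = inner(xc, yc) / math.sqrt(inner(xc, xc) * inner(yc, yc))

# Projection coefficient of yc onto xc; it should equal rho * sigma_Y / sigma_X.
slope = inner(xc, yc) / inner(xc, xc)
sigma_x = math.sqrt(inner(xc, xc))
sigma_y = math.sqrt(inner(yc, yc))

# What is left after projecting is orthogonal to xc.
residual = [y - slope * x for x, y in zip(xc, yc)]

print(cos_theta, slope, inner(residual, xc))
```

The last printed value is zero up to rounding, which is the orthogonality condition that defines the projection.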
5. Correlation ≠ Causation, and Zero Correlation ≠ Independence
Counterexample 1: Zero correlation with perfect dependence
Let X ~ N(0, 1) and Y = X². Then:
- E[XY] = E[X³] = 0 (all odd moments of the standard normal are zero)
- E[X] = 0, so Cov(X, Y) = E[XY] − E[X]E[Y] = 0 and ρ = 0
- But Y is perfectly determined by X! (Knowing X gives Y exactly.)
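A simulation makes the point tangible. The sketch below draws standard normal samples with a fixed seed, sets Y = X², and computes the sample correlation. The sample size, the seed, and the extra comparison against |X| are illustrative choices, not part of the counterexample itself.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def corr(xs, ys):
    """Sample Pearson correlation (population-style normalization)."""
    mx, my = mean(xs), mean(ys)
    sxy = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    sxx = mean([(x - mx) ** 2 for x in xs])
    syy = mean([(y - my) ** 2 for y in ys])
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)                  # fixed seed so the run is repeatable
xs = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
ys = [x * x for x in xs]                # Y is a deterministic function of X

r_linear = corr(xs, ys)                 # near 0: no linear relationship
r_abs = corr([abs(x) for x in xs], ys)  # large: the dependence shows up via |X|

print(r_linear, r_abs)
```

The first value hovers near zero even though Y is a function of X, while correlating Y with |X| exposes the dependence that ρ(X, Y) misses.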
Counterexample 2: Strong correlation without causation
Ice cream sales and drowning deaths are positively correlated (~0.7 in summer months). The common cause is hot weather — not a causal link. This is a classic confounding variable scenario.
Counterexample 3: Correlation is not transitive
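If ρ(X, Y) = 0.9 and ρ(Y, Z) = 0.9, then ρ(X, Z) can be anywhere from 0.62 to 1.0. The 3×3 correlation matrix must be positive semi-definite, which forces ρ(X, Y)ρ(Y, Z) − √((1 − ρ(X, Y)²)(1 − ρ(Y, Z)²)) ≤ ρ(X, Z) ≤ ρ(X, Y)ρ(Y, Z) + √((1 − ρ(X, Y)²)(1 − ρ(Y, Z)²)); here that is 0.81 ± 0.19. Where ρ(X, Z) lands in that range depends on the partial correlation structure. With weaker links, such as ρ(X, Y) = ρ(Y, Z) = 0.5, the range is [−0.5, 1], so ρ(X, Z) can even be negative.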
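The feasible range for the third correlation can be computed from the other two. The sketch below assumes the standard bound obtained by requiring the 3×3 correlation matrix to have a non-negative determinant; the function name `third_correlation_bounds` is hypothetical.

```python
import math

def third_correlation_bounds(rho_xy, rho_yz):
    """Feasible range of rho(X, Z) given rho(X, Y) and rho(Y, Z).

    Comes from requiring det(R) >= 0 for the 3x3 correlation matrix:
    rho_xy * rho_yz -/+ sqrt((1 - rho_xy^2) * (1 - rho_yz^2)).
    """
    center = rho_xy * rho_yz
    half_width = math.sqrt((1 - rho_xy**2) * (1 - rho_yz**2))
    return center - half_width, center + half_width

low, high = third_correlation_bounds(0.9, 0.9)
print(round(low, 2), round(high, 2))     # 0.62 1.0

# Weaker links leave room for a negative third correlation.
low2, high2 = third_correlation_bounds(0.5, 0.5)
print(low2, high2)                       # -0.5 1.0
```

The second call shows why correlation is not transitive in general: two positive correlations of 0.5 are compatible with ρ(X, Z) as low as −0.5.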
⚠️ Common Pitfall: "Correlation implies causation" is the most prevalent statistical fallacy. Correlation establishes association, not direction. Controlled experiments, natural experiments, or causal inference methods (instrumental variables, difference-in-differences, regression discontinuity) are needed to establish causation.
Key Terms
- Pearson correlation coefficient
- Positive definite
- Correlation matrix
Worked Examples
Example 1: Computing Covariance from Joint PMF
| X\Y | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.10 | 0.05 | 0.05 |
| 1 | 0.15 | 0.25 | 0.10 |
| 2 | 0.05 | 0.15 | 0.10 |
Compute Cov(X, Y) and ρ(X, Y).
Solution:
Marginals: p_X(0) = 0.20, p_X(1) = 0.50, p_X(2) = 0.30. p_Y(0) = 0.30, p_Y(1) = 0.45, p_Y(2) = 0.25.
E[X] = 0·0.20 + 1·0.50 + 2·0.30 = 1.10
E[Y] = 0·0.30 + 1·0.45 + 2·0.25 = 0.95
E[XY] = Σ x·y·p(x,y). Every term with x = 0 or y = 0 vanishes, leaving
E[XY] = 1·1·0.25 + 1·2·0.10 + 2·1·0.15 + 2·2·0.10 = 0.25 + 0.20 + 0.30 + 0.40 = 1.15
Cov(X, Y) = 1.15 − (1.10)(0.95) = 1.15 − 1.045 = 0.105
E[X²] = 0·0.20 + 1·0.50 + 4·0.30 = 1.70, so Var(X) = 1.70 − (1.10)² = 1.70 − 1.21 = 0.49.
E[Y²] = 0·0.30 + 1·0.45 + 4·0.25 = 1.45, so Var(Y) = 1.45 − (0.95)² = 1.45 − 0.9025 = 0.5475.
ρ = 0.105 / √(0.49 · 0.5475) ≈ 0.105 / √0.2683 ≈ 0.105 / 0.518 ≈ 0.203. Weak positive correlation.
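The same arithmetic can be reproduced from the table in a few lines. This is a sketch in which `pmf[x][y]` holds P(X = x, Y = y) and `expect` is an illustrative helper for E[g(X, Y)].

```python
import math

# Joint PMF from the table: pmf[x][y] = P(X = x, Y = y).
pmf = [
    [0.10, 0.05, 0.05],
    [0.15, 0.25, 0.10],
    [0.05, 0.15, 0.10],
]

def expect(g):
    """E[g(X, Y)] under the joint PMF."""
    return sum(g(x, y) * p
               for x, row in enumerate(pmf)
               for y, p in enumerate(row))

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
cov_xy = expect(lambda x, y: x * y) - ex * ey
var_x = expect(lambda x, y: x * x) - ex**2
var_y = expect(lambda x, y: y * y) - ey**2
rho = cov_xy / math.sqrt(var_x * var_y)

print(round(cov_xy, 3), round(rho, 3))   # 0.105 0.203
```

Writing every moment as `expect` of a function keeps the code aligned with the definitions: marginal means are just expectations of x and y under the joint PMF.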
Example 2: Covariance Matrix for Linear Combinations
Let X = (X₁, X₂)ᵀ have μ = (2, 3)ᵀ and Σ = [[4, 1], [1, 9]]. Define Y₁ = X₁ + X₂ and Y₂ = 2X₁ − X₂. Find Cov(Y₁, Y₂).
Solution:
Y = AX where A = [[1, 1], [2, −1]].
Cov(Y) = A Σ Aᵀ = [[1,1],[2,−1]] [[4,1],[1,9]] [[1,2],[1,−1]]
Multiply the last two factors first: Σ Aᵀ = [[4(1)+1(1), 4(2)+1(−1)], [1(1)+9(1), 1(2)+9(−1)]] = [[5, 7], [10, −7]].
Then A (Σ Aᵀ) = [[1,1],[2,−1]] [[5, 7], [10, −7]] = [[1(5)+1(10), 1(7)+1(−7)], [2(5)+(−1)(10), 2(7)+(−1)(−7)]] = [[15, 0], [0, 21]].
So Var(Y₁) = 15, Var(Y₂) = 21, Cov(Y₁, Y₂) = 0, ρ(Y₁, Y₂) = 0. The linear combinations are uncorrelated!
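The matrix product can be confirmed with a few lines of plain Python. The helpers `matmul` and `transpose` are illustrative and work on lists of lists; integer inputs keep the arithmetic exact.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

A = [[1, 1], [2, -1]]          # Y1 = X1 + X2, Y2 = 2*X1 - X2
sigma_x = [[4, 1], [1, 9]]     # covariance matrix of X

# Cov(Y) = A * Sigma_X * A^T
sigma_y = matmul(matmul(A, sigma_x), transpose(A))
print(sigma_y)                 # [[15, 0], [0, 21]]
```

The mean vector μ and any shift b play no role here, since covariance ignores constant offsets.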
Example 3: Counterexample — Zero Correlation, Perfect Dependence
Let X ~ Uniform(−1, 1) and Y = |X|. Show Cov(X, Y) = 0 but X and Y are dependent.
Solution:
E[X] = 0 (symmetric about 0). E[Y] = E[|X|] = ∫_{-1}^{1} |x|·(1/2) dx = 2∫₀¹ x·(1/2) dx = 1/2.
E[XY] = E[X|X|]. Since the integrand x|x| is odd on [−1, 1], E[X|X|] = 0.
Cov(X, Y) = 0 − 0·(1/2) = 0.
But X and Y are clearly dependent: P(Y < 0.1 | X = 0.9) = P(|0.9| < 0.1) = 0, while P(Y < 0.1) = P(|X| < 0.1) = 0.1. The conditional probability differs from the marginal — dependent.
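The integrals in this example can be approximated with a midpoint rule on [−1, 1]. This is a rough numerical sketch; the grid size `n` is an arbitrary choice, and an even `n` keeps the kink of |x| at 0 on a cell boundary.

```python
n = 2000                        # number of midpoint cells on [-1, 1]
width = 2.0 / n
mids = [-1.0 + (i + 0.5) * width for i in range(n)]
density = 0.5                   # Uniform(-1, 1) density

def expect(g):
    """Midpoint-rule approximation of E[g(X)] for X ~ Uniform(-1, 1)."""
    return sum(g(x) * density * width for x in mids)

e_x = expect(lambda x: x)
e_y = expect(lambda x: abs(x))              # E[|X|], exactly 1/2 in theory
e_xy = expect(lambda x: x * abs(x))         # odd integrand, so 0 in theory
cov_xy = e_xy - e_x * e_y

# Marginal probability P(Y < 0.1) = P(|X| < 0.1), to compare with the
# conditional value of 0 given X = 0.9.
p_y_small = expect(lambda x: 1.0 if abs(x) < 0.1 else 0.0)

print(cov_xy, e_y, p_y_small)
```

The covariance comes out as zero up to rounding, while the marginal probability near 0.1 differs from the conditional probability of 0, matching the dependence argument above.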
Quiz
Q1: Cov(X, Y) = 0 implies:
A) X and Y are independent B) X and Y are uncorrelated but may be dependent C) Var(X + Y) = Var(X) + Var(Y) regardless D) X and Y are identically distributed
Correct: B)
- If you chose B: Correct! Zero covariance means no LINEAR relationship, but X and Y could have a nonlinear dependence (e.g., Y = X² with X symmetric about 0).
- If you chose A: Independence → zero covariance, but the converse is false in general (it does hold when X and Y are jointly normal).
- If you chose C: Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y). Zero covariance indeed makes this true, but the question asks what zero covariance IMPLIES, and B is the more fundamental answer.
- If you chose D: Correlation and identical distributions are unrelated concepts.
Q2: The correlation coefficient ρ(X, Y) always lies in which interval?
A) [0, 1] B) [−1, 1] C) [−∞, ∞] D) [0, ∞)
Correct: B)
- If you chose B: Correct! By Cauchy-Schwarz: |Cov(X,Y)| ≤ √Var(X)√Var(Y), so |ρ| = |Cov|/(σ_X σ_Y) ≤ 1.
- If you chose A: Correlation can be negative, indicating inverse linear relationship.
- If you chose C: Covariance is unbounded, but correlation is standardized.
- If you chose D: Correlation can be negative.
Q3: If Var(X) = 4, Var(Y) = 9, and Cov(X,Y) = 3, what is ρ(X,Y)?
A) 3/36 B) 3/6 = 0.5 C) 3 D) 1/3
Correct: B)
- If you chose B: Correct! ρ = Cov(X,Y)/(σ_X σ_Y) = 3/(2 · 3) = 3/6 = 0.5.
- If you chose A: This divides by the product of variances instead of standard deviations.
- If you chose C: This is the raw covariance, not standardized.
- If you chose D: This is Cov(X,Y)/Var(Y) = 3/9, which divides by a variance instead of the product of standard deviations.
Q5: The covariance matrix Σ of a random vector is always:
A) Diagonal B) Symmetric and positive semidefinite C) Orthogonal D) Invertible
Correct: B)
- If you chose B: Correct! Σ_{ij} = Cov(X_i, X_j), so Σ is symmetric. It's positive semidefinite because a^T Σ a = Var(a^T X) ≥ 0.
- If you chose A: Diagonal only when all components are uncorrelated.
- If you chose C: Orthogonal matrices have orthonormal columns; covariance matrices generally do not.
- If you chose D: The covariance matrix can be singular (e.g., if one variable is a linear combination of others).
Practice Problems
- If Var(X) = 4, Var(Y) = 9, and Cov(X, Y) = 3, find ρ(X, Y) and Var(2X − 3Y).
- Prove the bilinearity property: Cov(aX + bY, Z) = a Cov(X, Z) + b Cov(Y, Z).
- If X and Y are independent, prove Cov(X, Y) = 0. Show the converse is false with a counterexample.
- Let Σ = [[2, 0.5], [0.5, 1]]. Find the correlation matrix R. Are X₁ and X₂ positively or negatively correlated?
- For the bivariate normal with μ_X = 0, μ_Y = 0, σ_X = 1, σ_Y = 2, ρ = 0.5, write the 2×2 covariance matrix Σ.
- Show that for any random variables X and Y, |Cov(X, Y)| ≤ √(Var(X) Var(Y)).
- If ρ(X, Y) = 0.8 and you define U = (X − μ_X)/σ_X, V = (Y − μ_Y)/σ_Y, what are E[U], E[V], Var(U), Var(V), and Cov(U, V)?
Answers
1. ρ = 3/√(4·9) = 3/6 = 0.5. Var(2X−3Y) = 4(4) + 9(9) − 12(3) = 16 + 81 − 36 = 61.
2. Cov(aX+bY, Z) = E[(aX+bY − E[aX+bY])(Z − E[Z])] = E[(a(X−E[X]) + b(Y−E[Y]))(Z−E[Z])] = a E[(X−E[X])(Z−E[Z])] + b E[(Y−E[Y])(Z−E[Z])] = a Cov(X,Z) + b Cov(Y,Z).
3. If X⊥Y, then E[XY] = E[X]E[Y], so Cov(X,Y) = E[XY] − E[X]E[Y] = 0. Counterexample: X ~ N(0,1), Y = X². Then Cov = 0 but Y is perfectly determined by X.
4. σ_X = √2 ≈ 1.414, σ_Y = 1. ρ = 0.5/(√2·1) = 0.5/√2 ≈ 0.354. R = [[1, 0.354], [0.354, 1]]. Positively correlated.
5. Σ = [[σ_X², ρσ_Xσ_Y], [ρσ_Xσ_Y, σ_Y²]] = [[1, 1], [1, 4]]. (ρσ_Xσ_Y = 0.5·1·2 = 1).
6. This is Cauchy-Schwarz: (E[(X−μ_X)(Y−μ_Y)])² ≤ E[(X−μ_X)²] E[(Y−μ_Y)²] = Var(X)Var(Y). Taking square roots gives |Cov| ≤ σ_Xσ_Y.
7. U = (X−μ_X)/σ_X, V = (Y−μ_Y)/σ_Y are standardized: E[U] = E[V] = 0, Var(U) = Var(V) = 1, Cov(U,V) = Cov(X,Y)/(σ_Xσ_Y) = ρ = 0.8.
Summary
- Cov(X, Y) = E[XY] − E[X]E[Y] measures linear co-movement; it is bilinear, symmetric, and Cov(X, X) = Var(X)
- Correlation ρ = Cov/(σ_Xσ_Y) ∈ [−1, 1] standardizes covariance; ρ = ±1 iff perfect linear relationship
- The covariance matrix Σ = E[(X−μ)(X−μ)ᵀ] is symmetric, positive semi-definite, and captures all pairwise covariances for a random vector
- In L² space, centered RVs form an inner product space where Cov = inner product and ρ = cos(θ) — giving geometric meaning to correlation
- Cov(X, Y) = 0 does NOT imply independence (only no linear dependence); correlation measures association, NOT causation
Pitfalls
- Confusing correlation with causation. ρ(X, Y) = 0.7 means X and Y move together linearly — it says NOTHING about whether X causes Y, Y causes X, or both are caused by a third variable Z. The classic example: ice cream sales and drowning deaths are positively correlated, but both are driven by hot weather. Causal inference requires controlled experiments or specialized methods (instrumental variables, difference-in-differences).
- Believing zero correlation implies independence. Cov(X, Y) = 0 guarantees E[XY] = E[X]E[Y], nothing more. Y = X² with X ~ N(0, 1) gives Cov = 0 but Y is perfectly determined by X. Correlation only measures LINEAR dependence. To prove independence, you must show the joint distribution factors, not just compute correlation.
- Confusing the correlation coefficient ρ with the slope of the regression line. The slope of Y on X is ρ(σ_Y/σ_X), not ρ alone. A correlation of 0.9 with σ_X = 100 and σ_Y = 1 gives slope = 0.9(1/100) = 0.009 — very shallow despite strong correlation. ρ measures strength of linear relationship; the slope measures magnitude of change.
- Interpreting ρ as the "percentage of variation explained." ρ² = R² is the proportion of variance in Y explained by linear regression on X. ρ = 0.5 means R² = 0.25 — only 25% of variance explained, not 50%. ρ = 0.7 still leaves 51% of variance unexplained. The square makes a big difference.
- Thinking covariance is transitive. If Cov(X, Y) > 0 and Cov(Y, Z) > 0, Cov(X, Z) could be positive, zero, or even negative. Correlation is similarly not transitive. Each pairwise relationship must be evaluated individually from the joint distribution.
Quiz
- If Cov(X, Y) = 0, which of the following is guaranteed?
a) X and Y are independent b) E[XY] = E[X]E[Y] c) X and Y are normally distributed d) ρ(X, Y) = −1
Answer: b. Cov(X,Y) = 0 ⇔ E[XY] = E[X]E[Y]. Independence is sufficient but not necessary.
- ρ(X, Y) = 0.5 means:
a) 50% of Y is determined by X b) A moderate positive linear association c) X causes 50% of Y's variation d) The slope of Y on X is 0.5
Answer: b. Correlation measures strength of linear association, not explained variance or causal effect. (R² = ρ² = 0.25 would be 25% variance explained.)
- The covariance matrix Σ is always:
a) Diagonal b) Invertible c) Symmetric and positive semi-definite d) The identity matrix
Answer: c. By construction Σ = Σᵀ and aᵀΣa = Var(aᵀX) ≥ 0.
- For a 2×2 covariance matrix, the off-diagonal entries are:
a) Always zero b) Cov(X₁, X₂) = Cov(X₂, X₁) c) Var(X₁) d) Unrelated to correlation
Answer: b. By symmetry of covariance.
- If Y = 2X + 3, then ρ(X, Y) = ?
a) 2 b) 1 c) 0.5 d) Cannot be determined
Answer: b. Perfect positive linear relationship: ρ = +1 (slope positive).
- The correlation coefficient is unitless because:
a) It divides by the product of standard deviations b) It's always between −1 and 1 c) It's computed from standardized variables d) All of the above
Answer: d. Dividing by σ_Xσ_Y removes units, and using standardized variables U = (X−μ_X)/σ_X gives the same ρ directly. The bound in (b) is consistent with this: a quantity confined to [−1, 1] under every choice of units cannot carry units.
- Which is a valid covariance matrix?
a) [[1, 2], [2, 1]] b) [[4, 1], [1, 4]] c) [[1, −1], [−1, 0]] d) [[2, 0], [0, −1]]
Answer: b. [[4,1],[1,4]] gives correlation ρ = 1/(2·2) = 0.25 ∈ [−1,1] and is positive definite (eigenvalues 3, 5 > 0). (a) gives ρ = 2 > 1, invalid. (c) has Var(X₂) = 0, which forces Cov(X₁, X₂) = 0, yet the off-diagonal entry is −1 (determinant −1 < 0), so it is not positive semi-definite. (d) has a negative variance.
- Which would produce ρ ≈ 0 despite strong dependence?
a) Y = X² where X ~ N(0, 1) b) Y = 2X c) Y = X and X ~ Uniform(0, 1) d) Y = −X
Answer: a. Y = X² is perfectly determined by X, but X is symmetric about 0, so E[X³] = 0 and the linear correlation is zero.
Next Steps
Continue to 11-03 Conditional Expectation for advanced treatment of E[Y|X], the law of total expectation, and the law of total variance.