21-06 — Causality and Causal Inference

Phase: 21 — Probability & Statistics for ML (Advanced) Subject: 21-06 Prerequisites: 21-01 (Bayesian Inference), 10-02 (Conditional Probability), 12-05 (Hypothesis Testing), 11-01 (Markov Chains — for graphical models) Next subject: 22-01 — Autoencoders

Learning Objectives

By the end of this subject, you will be able to:

Distinguish correlation from causation using the do-operator — prove that P(Y | do(X=x)) ≠ P(Y | X=x) when confounders exist
Apply the backdoor criterion to identify causal effects from observational data — derive adjustment formulas
Construct and interpret Structural Causal Models (SCMs) as systems of structural equations with independent exogenous noise
Compute counterfactual quantities using the three-step abduction-action-prediction procedure
Explain why ML models that optimize P(Y|X) can fail catastrophically under distribution shift — connecting causality to robust ML

Core Content

1. Why Causality Matters for ML

Standard ML learns P(Y | X) — correlation. But correlation ≠ causation:

Ice cream sales correlate with drowning deaths. Does ice cream cause drowning? No — both are effects of a common cause (hot weather).
A model trained to predict hospital readmission from patient history may learn that "receiving palliative care" reduces readmission. Should we give everyone palliative care? No — palliative care is given to the sickest patients, and the correlation is due to confounding.

Correlation predicts the past. Causation predicts the future under INTERVENTIONS. When we deploy an ML model and act on its predictions, we are intervening — and correlation-based models can fail catastrophically.

⚠️ THIS IS CRITICAL: Causal reasoning is often cited as a key gap between current ML systems and human-level intelligence. A model that learns only correlations can break whenever the data-generating process changes, which is common in deployment. Models that capture the correct causal mechanisms stay valid under interventions and under shifts that leave those mechanisms unchanged.

2. The Do-Operator and Interventions

Pearl's do-operator formalizes the difference between observing and intervening:

P(Y | X=x): The probability of Y given that we OBSERVE X=x. This is the conditional distribution learned from observational data.
P(Y | do(X=x)): The probability of Y given that we INTERVENE to set X=x. This is the causal effect of X on Y.

The fundamental distinction:

$$P(Y \mid X=x) \neq P(Y \mid \mathrm{do}(X=x)) \quad \text{in the presence of confounders}$$

Example: Let X = "taking a drug," Y = "recovery," Z = "severity of illness" (confounder — affects both X and Y).

P(recovery | drug) = misleadingly low   [sicker patients are more likely to take the drug, and sicker patients recover at lower rates]
P(recovery | do(drug)) = ?    [the true causal effect, isolated from confounding]

To compute P(Y | do(X=x)), we must adjust for confounding — using the backdoor criterion.

3. Causal Graphical Models (DAGs)

A Directed Acyclic Graph (DAG) represents causal relationships:
- Nodes = variables
- Edges X → Y = "X directly causes Y"

Key structures:

Chain (mediator): X → Z → Y. X and Y are dependent, but conditioning on Z BLOCKS the path (X ⟂ Y | Z).

Fork (confounder): X ← Z → Y. X and Y are dependent (spurious correlation). Conditioning on Z blocks the path (X ⟂ Y | Z).

Collider: X → Z ← Y. X and Y are independent, but conditioning on Z OPENS the path (X ⟂̸ Y | Z, i.e. X and Y become dependent given Z). This is "explaining away": conditioning on a common effect creates dependence between independent causes.
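The three structures can be checked numerically. The sketch below is illustrative only: it assumes linear-Gaussian mechanisms with unit coefficients (not specified in the text) and uses partial correlation as a stand-in for a conditional independence test. The helper `partial_corr` is a hypothetical name introduced here.

```python
import numpy as np

def partial_corr(a, b, c):
    """Correlation between a and b after linearly regressing c out of both."""
    res_a = a - np.polyval(np.polyfit(c, a, 1), c)
    res_b = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(res_a, res_b)[0, 1]

rng = np.random.default_rng(0)
n = 100_000

# Chain: X -> Z -> Y
x = rng.normal(size=n)
z = x + rng.normal(size=n)
y = z + rng.normal(size=n)
chain = (np.corrcoef(x, y)[0, 1], partial_corr(x, y, z))

# Fork: X <- Z -> Y
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)
fork = (np.corrcoef(x, y)[0, 1], partial_corr(x, y, z))

# Collider: X -> Z <- Y
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(size=n)
collider = (np.corrcoef(x, y)[0, 1], partial_corr(x, y, z))

for name, (marginal, given_z) in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(f"{name:9s} corr(X,Y) = {marginal:+.2f}   corr(X,Y | Z) = {given_z:+.2f}")
```

Under these assumptions the chain and fork show a clear marginal correlation that vanishes once Z is regressed out, while the collider shows the opposite pattern: no marginal correlation, and a negative one after conditioning on Z.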

The Backdoor Criterion: A set of variables S satisfies the backdoor criterion for (X, Y) if:
1. S blocks all "backdoor" paths from X to Y (paths starting with an arrow INTO X)
2. No node in S is a descendant of X

Then:

$$P(Y \mid \mathrm{do}(X=x)) = \sum_s P(Y \mid X=x, S=s) \cdot P(S=s)$$

This is the adjustment formula — we can estimate causal effects from observational data by conditioning on (stratifying by) the right confounders.

4. Structural Causal Models (SCMs)

An SCM is a set of structural equations:

X_j = f_j(PA_j, U_j)    for j = 1, ..., d

where PA_j = causal parents of X_j, and U_j = exogenous (unexplained) noise variables, assumed jointly independent.

Example: Drug → Recovery, with confounding by Severity:

$$
\begin{aligned}
S &= U_S && \text{[severity: exogenous]} \\
X &= f_X(S, U_X) = I(S + U_X > 2) && \text{[drug prescription depends on severity]} \\
Y &= f_Y(S, X, U_Y) = \alpha X - \beta S + U_Y && \text{[recovery: caused by drug and severity]}
\end{aligned}
$$

With SCMs, computing P(Y | do(X=1)) means replacing the structural equation for X:

$$X = 1 \quad \text{[intervention: override the natural mechanism]}$$

Then:

$$P(Y \mid \mathrm{do}(X=1)) = P\big(f_Y(S, 1, U_Y)\big) \quad \text{[under the original distributions of } S \text{ and } U_Y\text{]}$$

This is fundamentally different from conditioning: we SET X=1 rather than selecting cases where X=1.
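A small simulation can make the difference between selecting and setting concrete. The sketch below implements the drug SCM from this section under assumptions the text does not fix: standard normal noise for U_S, U_X and U_Y, and the hypothetical values α = 1.0 and β = 0.5. The function name `simulate` is illustrative.

```python
import numpy as np

def simulate(n, alpha, beta, rng, do_x=None):
    """Draw n samples from the drug SCM; if do_x is given, override X's equation."""
    s = rng.normal(size=n)        # S = U_S (assumed standard normal)
    u_x = rng.normal(size=n)
    u_y = rng.normal(size=n)
    if do_x is None:
        x = (s + u_x > 2).astype(float)   # natural mechanism: X = I(S + U_X > 2)
    else:
        x = np.full(n, float(do_x))       # intervention: X is set, S is untouched
    y = alpha * x - beta * s + u_y
    return s, x, y

rng = np.random.default_rng(0)
alpha, beta, n = 1.0, 0.5, 200_000

# Observational contrast: select the cases where X happened to be 1 or 0.
s, x, y = simulate(n, alpha, beta, rng)
naive = y[x == 1].mean() - y[x == 0].mean()

# Interventional contrast: set X for everyone.
_, _, y_do1 = simulate(n, alpha, beta, rng, do_x=1)
_, _, y_do0 = simulate(n, alpha, beta, rng, do_x=0)
causal = y_do1.mean() - y_do0.mean()

print(f"E[Y|X=1] - E[Y|X=0]         = {naive:.2f}")
print(f"E[Y|do(X=1)] - E[Y|do(X=0)] = {causal:.2f}  (alpha = {alpha})")
```

With these assumed parameters the interventional contrast recovers α, while the observational contrast is noticeably smaller because the treated group is sicker on average.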

5. Counterfactuals

Counterfactuals answer "what would have happened if...?" questions — the most powerful form of causal reasoning.

Three-step procedure (Pearl):

Abduction: Given observed evidence E=e, compute the posterior distribution of exogenous variables U | E=e
Action: Modify the SCM by intervening (do(X=x)) — replace structural equations for intervened variables
Prediction: Compute the counterfactual outcome using the modified SCM and the posterior over U

Example: "If the patient had NOT taken the drug, would they have recovered?"

Given: Patient took drug (X=1), recovered (Y=1), severity S=0.8.

Step 1 (Abduction): Infer U_X, U_Y from observed X=1, Y=1, S=0.8.
- From X = I(0.8 + U_X > 2) = 1: U_X > 1.2, so U_X is known only up to the truncated distribution p(U_X | U_X > 1.2).
- From Y = α·1 − β·0.8 + U_Y = 1: U_Y = 1 − α + 0.8β.

Step 2 (Action): Set X=0 (counterfactual intervention).

Step 3 (Prediction): Y_cf = α·0 − β·0.8 + U_Y = −0.8β + (1 − α + 0.8β) = 1 − α.

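Since Y_cf = 1 − α, any α > 0 puts the counterfactual outcome below the observed Y = 1. If α > 1, Y_cf is negative; reading Y > 0 as recovery (an interpretation the model does not state explicitly), the patient would NOT have recovered without the drug, so the drug was necessary for this patient.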
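The three steps for this example can be written as a short function. This is a sketch for the linear outcome equation Y = αX − βS + U_Y only; the values α = 1.5 and β = 0.5 are hypothetical, chosen just to run the code, and `counterfactual_outcome` is an illustrative name.

```python
def counterfactual_outcome(alpha, beta, s, x_obs, y_obs, x_cf):
    """Abduction-action-prediction for Y = alpha*X - beta*S + U_Y."""
    # Abduction: solve the outcome equation for this patient's noise term U_Y.
    u_y = y_obs - alpha * x_obs + beta * s
    # Action: replace X by the counterfactual value x_cf.
    # Prediction: re-evaluate the outcome equation with the same S and U_Y.
    return alpha * x_cf - beta * s + u_y

alpha, beta = 1.5, 0.5   # hypothetical coefficients, not given in the text
y_cf = counterfactual_outcome(alpha, beta, s=0.8, x_obs=1, y_obs=1, x_cf=0)
print(y_cf)              # approximately 1 - alpha, independent of beta and S
```

U_X never enters the computation: once X is set by the intervention, its structural equation and its noise term are irrelevant to the prediction.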

6. Causal Discovery

Given observational data, can we LEARN the causal DAG? Sometimes.

Constraint-based methods (PC algorithm): Use conditional independence tests to narrow down the set of possible DAGs. The output is a Markov equivalence class — all DAGs that encode the same conditional independencies.

Limitation: Some causal structures are observationally equivalent. X→Y and X←Y encode the SAME conditional independence relations (none, in this bivariate case). Without interventions or strong parametric assumptions, we can only identify the causal DAG up to its Markov equivalence class.
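The bivariate ambiguity can be shown exactly in the linear-Gaussian case. The sketch below builds one model where X causes Y and one where Y causes X and checks that they imply the same covariance matrix, hence the same zero-mean Gaussian joint distribution. The coefficient a = 2 and the noise variance 1 are hypothetical choices, and the function names are illustrative.

```python
import numpy as np

def cov_x_causes_y(a, noise_var):
    """Covariance of (X, Y) when X ~ N(0, 1) and Y = a*X + E, Var(E) = noise_var."""
    return np.array([[1.0, a],
                     [a, a**2 + noise_var]])

def cov_y_causes_x(var_y, b, noise_var):
    """Covariance of (X, Y) when Y ~ N(0, var_y) and X = b*Y + E', Var(E') = noise_var."""
    var_x = b**2 * var_y + noise_var
    return np.array([[var_x, b * var_y],
                     [b * var_y, var_y]])

a, noise_var = 2.0, 1.0
forward = cov_x_causes_y(a, noise_var)

# Parameters of the reversed model, chosen to match the forward joint.
var_y = a**2 + noise_var
b = a / var_y
reverse_noise_var = 1.0 - a**2 / var_y
backward = cov_y_causes_x(var_y, b, reverse_noise_var)

print(forward)
print(backward)
print("same joint distribution:", np.allclose(forward, backward))
```

Because both models generate identical observational data, no amount of such data can choose between them; this is the situation where interventions or stronger parametric assumptions are needed.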

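Score-based methods: Search over DAGs, scoring each by how well it fits the data (with a penalty for complexity). The number of DAGs grows super-exponentially with the number of variables, so exact search is practical only up to roughly 30 variables; larger problems rely on greedy or approximate search.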

Interventional data is gold: If we can perform randomized experiments, the effect of X on Y can be read off directly: randomize X and observe Y. Randomization removes every arrow into X, so no backdoor path remains. This is why RCTs (Randomized Controlled Trials) are the gold standard.
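The effect of randomization can be illustrated with a simulated trial. The sketch assumes an outcome Y = αX − βS + U_Y with a confounder S, standard normal noise, and hypothetical values α = 1.0 and β = 0.5; treatment is assigned by a fair coin flip that ignores S.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta = 200_000, 1.0, 0.5   # hypothetical parameters

s = rng.normal(size=n)                # severity: still affects the outcome
x = rng.integers(0, 2, size=n)        # coin flip: X no longer depends on S
y = alpha * x - beta * s + rng.normal(size=n)

# With X randomized, the plain difference in means estimates the causal effect.
rct_estimate = y[x == 1].mean() - y[x == 0].mean()

# Balance check: severity should be (nearly) the same in both arms.
severity_gap = s[x == 1].mean() - s[x == 0].mean()

print(f"difference in means: {rct_estimate:.2f} (alpha = {alpha})")
print(f"severity gap between arms: {severity_gap:.3f}")
```

No adjustment is needed here: because the coin flip is independent of S, the two arms are comparable by construction, up to sampling noise.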

7. Causality in Modern ML

Domain generalization: Causal models identify invariant relationships across environments. If X CAUSES Y, then P(Y | do(X)) is stable under distribution shift, while P(Y | X) is not. Causal representations are more transferable.

Fairness: Causal definitions of fairness (counterfactual fairness) ask: would the decision have been different if the protected attribute were different, with everything else held constant? This is more principled than correlational fairness metrics.

Reinforcement learning: RL is inherently causal — actions are interventions. Model-based RL with causal world models can plan counterfactually (what if I had taken action a instead?).

Explainability: "Why did the model predict Y?" A causal explanation identifies the actual causes, not just correlated features. Feature attribution methods (SHAP, LIME) provide correlational explanations — causal explanations require a causal model.

Worked Examples

Example 1: Backdoor Adjustment

Problem: Given DAG: Z → X, Z → Y, X → Y. (Z confounds X and Y.) Derive P(Y | do(X=x)) using the backdoor criterion.

Solution:

Backdoor path from X to Y: X ← Z → Y. This path carries spurious association — Z is a common cause.

S = {Z} satisfies the backdoor criterion: 1. Z blocks X ← Z → Y ✓ 2. Z is not a descendant of X ✓

Adjustment formula:

$$P(Y \mid \mathrm{do}(X=x)) = \sum_z P(Y \mid X=x, Z=z) \cdot P(Z=z)$$

This is the "stratify by confounders" strategy. Compare to the observational:

$$P(Y \mid X=x) = \sum_z P(Y \mid X=x, Z=z) \cdot P(Z=z \mid X=x)$$

The difference: P(Z=z) vs P(Z=z | X=x). The observational conditional uses the distribution of Z GIVEN X, which is tainted by the X←Z backdoor. The do-formula uses the marginal distribution of Z, breaking the backdoor path.
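The two formulas can be compared on a small discrete example. The sketch below builds a joint distribution P(Z, X, Y) for binary variables from hypothetical conditional probabilities (none of these numbers come from the text) and evaluates both weightings. The function names are illustrative.

```python
import numpy as np

# Hypothetical mechanism for the DAG Z -> X, Z -> Y, X -> Y (all variables binary).
p_z = np.array([0.5, 0.5])                 # P(Z=z)
p_x1_given_z = np.array([0.2, 0.8])        # P(X=1 | Z=z)
p_y1_given_zx = np.array([[0.3, 0.5],      # P(Y=1 | Z=z, X=x), indexed [z, x]
                          [0.6, 0.8]])

# Joint table indexed as joint[z, x, y].
joint = np.zeros((2, 2, 2))
for z in range(2):
    for x in range(2):
        px = p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
        for y in range(2):
            py = p_y1_given_zx[z, x] if y == 1 else 1 - p_y1_given_zx[z, x]
            joint[z, x, y] = p_z[z] * px * py

def p_y1_do_x(joint, x):
    """Backdoor adjustment: sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    pz = joint.sum(axis=(1, 2))
    p_y1_given_xz = joint[:, x, 1] / joint[:, x, :].sum(axis=1)
    return float((p_y1_given_xz * pz).sum())

def p_y1_given_x(joint, x):
    """Observational conditional: implicitly weights strata by P(Z=z | X=x)."""
    return float(joint[:, x, 1].sum() / joint[:, x, :].sum())

causal_effect = p_y1_do_x(joint, 1) - p_y1_do_x(joint, 0)
naive_effect = p_y1_given_x(joint, 1) - p_y1_given_x(joint, 0)
print(f"adjusted effect: {causal_effect:.2f}, naive effect: {naive_effect:.2f}")
```

In this constructed example the naive contrast is almost twice the adjusted one, purely because Z=1 is overrepresented among units with X=1.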

Example 2: Simpson's Paradox Explained Causally

Problem: A treatment shows a much higher recovery rate than control overall (80% vs 43%), but the treated and control groups differ sharply in gender composition. Explain, using a causal DAG, why the aggregate comparison misstates the treatment effect and how to correct it.

Solution:

DAG: Gender → Treatment, Gender → Recovery, Treatment → Recovery.

Gender is a confounder: it affects both who gets treatment and recovery rates. Women are both more likely to get treatment AND have naturally higher recovery rates.

Table:

	Men (Treated)	Men (Control)	Women (Treated)	Women (Control)
Recovered	60/100 (60%)	20/50 (40%)	180/200 (90%)	10/20 (50%)
Total	100	50	200	20

Overall: Treatment = 240/300 = 80%, Control = 30/70 ≈ 43%. A naive comparison suggests a 37pp benefit.

Within each gender, treatment also helps (60% > 40% for men, 90% > 50% for women), but by 20pp and 40pp respectively. In this dataset the direction does not reverse; the aggregate gap is distorted because women (who recover more) are overrepresented in the treatment group (200 of 300 treated vs 20 of 70 controls). A full Simpson's reversal arises from the same mechanism when the imbalance is strong enough to flip the sign of the aggregate comparison.

Causal resolution:

$$P(\text{Recovery} \mid \mathrm{do}(\text{Treatment})) = \sum_g P(\text{Recovery} \mid \text{Treatment}, \text{Gender}=g) \cdot P(\text{Gender}=g)$$

The formula weights each stratum by P(Gender) (population proportions) rather than P(Gender | Treatment) (study proportions). If the population is 50/50:

For treatment: 0.5·60% + 0.5·90% = 75%
For control: 0.5·40% + 0.5·50% = 45%

Treatment effect = 75% − 45% = +30pp. This is the causal effect after adjusting for the confounded sampling, given the assumed 50/50 population and assuming Gender is the only confounder.
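The arithmetic of this example can be reproduced directly from the table. The sketch below uses the counts shown above and the 50/50 population split assumed in the text; the dictionary layout and function names are illustrative.

```python
# (recovered, total) per stratum, copied from the table above.
counts = {
    ("men", "treated"): (60, 100),
    ("men", "control"): (20, 50),
    ("women", "treated"): (180, 200),
    ("women", "control"): (10, 20),
}
population = {"men": 0.5, "women": 0.5}   # assumed population proportions

def naive_rate(arm):
    """Pooled recovery rate in one arm, ignoring gender."""
    recovered = sum(r for (g, a), (r, t) in counts.items() if a == arm)
    total = sum(t for (g, a), (r, t) in counts.items() if a == arm)
    return recovered / total

def adjusted_rate(arm):
    """Backdoor adjustment: stratum rates weighted by P(Gender)."""
    return sum(
        population[g] * counts[(g, arm)][0] / counts[(g, arm)][1]
        for g in population
    )

naive_effect = naive_rate("treated") - naive_rate("control")
adjusted_effect = adjusted_rate("treated") - adjusted_rate("control")
print(f"naive: {naive_effect:+.1%}, adjusted: {adjusted_effect:+.1%}")
```

Changing `population` shows how the adjusted effect depends on the target population, while the naive effect depends only on how the study happened to be composed.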

Example 3: Counterfactual Computation

Problem: SCM: Y = 2X + U, X = Z + U, where U ~ N(0,1), Z ~ N(0,1), and U ⟂ Z. We observe X=3, Y=8. What would Y have been if X had been 1 instead?

Solution:

Step 1 (Abduction): From X=3: 3 = Z + U. From Y=8: 8 = 2·3 + U ⇒ U = 8−6 = 2. So U=2, Z=1.

Step 2 (Action): Set X=1 (override X = Z+U).

Step 3 (Prediction): Y_cf = 2·1 + U = 2 + 2 = 4.

So: Y would have been 4 (instead of 8). The individual causal effect of changing X from 3 to 1 is 4 − 8 = −4, which is exactly 2×(1−3) = −4, matching the structural coefficient 2 on X.
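The same computation can be written as code. This is a minimal sketch specific to the SCM of this example (Y = 2X + U, X = Z + U); the function names are illustrative.

```python
def abduct(x_obs, y_obs):
    """Recover the exogenous values (U, Z) consistent with the observation."""
    u = y_obs - 2 * x_obs     # from Y = 2X + U
    z = x_obs - u             # from X = Z + U
    return u, z

def predict_counterfactual(u, x_cf):
    """Outcome under do(X = x_cf), keeping this individual's U fixed."""
    return 2 * x_cf + u       # the equation X = Z + U is overridden, so Z is unused

u, z = abduct(x_obs=3, y_obs=8)
y_cf = predict_counterfactual(u, x_cf=1)
print(u, z, y_cf)             # 2 1 4
```

Note that U is shared between the two equations, so it is pinned down exactly by the observation; in SCMs with more noise terms than observed variables, abduction would yield a distribution rather than a single value.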

Contrast with a purely statistical model that only knows P(Y|X): it might predict E[Y|X=1] using the regression line, which could be very different from the counterfactual Y(do(X=1)) if U and X are correlated in the observational distribution.
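For this SCM the gap can be computed explicitly. The derivation below is a worked sketch that uses only the stated assumptions (U and Z are independent standard normals, so X and U are jointly Gaussian):

$$
\begin{aligned}
\operatorname{Var}(X) &= \operatorname{Var}(Z) + \operatorname{Var}(U) = 2, \qquad \operatorname{Cov}(U, X) = \operatorname{Var}(U) = 1 \\
E[U \mid X = x] &= \frac{\operatorname{Cov}(U, X)}{\operatorname{Var}(X)}\, x = \frac{x}{2} \\
E[Y \mid X = 1] &= 2 \cdot 1 + E[U \mid X = 1] = 2.5 \\
E[Y \mid \mathrm{do}(X = 1)] &= 2 \cdot 1 + E[U] = 2 \\
Y_{cf}(X = 1) &= 2 \cdot 1 + 2 = 4 \quad \text{(this individual, with abducted } U = 2\text{)}
\end{aligned}
$$

The three quantities answer three different questions: 2.5 is the regression prediction for a unit observed at X=1, 2 is the population average under intervention, and 4 is the counterfactual for the specific individual observed at X=3, Y=8.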

Quiz

Q1: Which conditions must a set S satisfy to meet the backdoor criterion for (X, Y)?

A) S contains every variable correlated with X
B) S blocks all backdoor paths from X to Y and contains no descendant of X
C) S contains at least one collider on every path between X and Y
D) S contains all mediators between X and Y

Correct: B)

If you chose A: This is incorrect. Correlation with X is not the test; some correlated variables (colliders, descendants of X) must not be adjusted for.
If you chose B: Correct! These are the two conditions of the backdoor criterion, and they justify the adjustment formula.
If you chose C: This is incorrect. Conditioning on a collider opens a path rather than blocking it.
If you chose D: This is incorrect. Mediators are descendants of X, which the criterion excludes.

Q2: In what order are the three steps of a counterfactual computation in an SCM carried out?

A) Action, abduction, prediction
B) Abduction, action, prediction
C) Prediction, action, abduction
D) Abduction, prediction, action

Correct: B)

If you chose A: This is incorrect. The exogenous variables must be inferred from the evidence before the model is modified.
If you chose B: Correct! First infer U from the evidence (abduction), then replace the structural equation of the intervened variable (action), then compute the outcome with the inferred U (prediction).
If you chose C: This is incorrect. Prediction is the last step; it needs both the inferred U and the modified model.
If you chose D: This is incorrect. Prediction must use the modified model, so the action step comes before it.

Q3: Which statement about the basic DAG structures is TRUE?

A) Conditioning on the mediator Z in a chain X → Z → Y opens the path
B) Conditioning on the common cause Z in a fork X ← Z → Y opens the path
C) Conditioning on the collider Z in X → Z ← Y opens the path
D) Conditioning never changes whether a path is open or blocked

Correct: C)

If you chose A: This is incorrect. Conditioning on the mediator blocks a chain.
If you chose B: This is incorrect. Conditioning on the common cause blocks a fork.
If you chose C: Correct! A collider path is blocked by default and opens when we condition on the common effect ("explaining away").
If you chose D: This is incorrect. Conditioning blocks chains and forks and opens colliders.

Q4: In the DAG W → X, W → Z, Z → Y, X → Y (Practice Problem 1), which of the following is a backdoor path from X to Y?

A) X ← W → Z → Y
B) X → Y
C) X → Z → Y
D) W → X → Y

Correct: A)

If you chose A: Correct! The path starts with an arrow into X and reaches Y through W and Z.
If you chose B: This is incorrect. X → Y is the causal path; it starts with an arrow out of X.
If you chose C: This is incorrect. The DAG has no edge X → Z, so this is not a path in the graph.
If you chose D: This is incorrect. This path runs from W to Y, not from X to Y.

Q5: How does a confounder Z (Z → X, Z → Y) affect the relationship between P(Y | X=x) and P(Y | do(X=x))?

A) It has no effect; the two are always equal
B) It makes P(Y | do(X=x)) undefined
C) It makes them differ in general, because P(Y | X=x) weights strata by P(Z=z | X=x) while P(Y | do(X=x)) weights them by P(Z=z)
D) It makes P(Y | X=x) the better estimate of the causal effect

Correct: C)

If you chose A: This is incorrect. The two coincide only when there is no open backdoor path from X to Y.
If you chose B: This is incorrect. With Z observed, P(Y | do(X=x)) is given by the adjustment formula.
If you chose C: Correct! The difference between the two weightings is exactly the contribution of the backdoor path X ← Z → Y.
If you chose D: This is incorrect. P(Y | X=x) mixes the causal effect with the spurious association through Z.

Q6: What is a common pitfall when choosing variables to adjust for?

A) Conditioning on a collider (a common effect), which creates spurious dependence between its causes
B) Conditioning on a measured common cause of X and Y
C) Using the marginal P(S=s) in the adjustment formula
D) Checking the DAG before adding control variables

Correct: A)

If you chose A: Correct! Adding a collider as a control variable opens a path that was blocked and introduces bias.
If you chose B: This is incorrect. Adjusting for a common cause is what the backdoor criterion calls for.
If you chose C: This is incorrect. The adjustment formula is defined with the marginal P(S=s).
If you chose D: This is incorrect. Checking the DAG first is the recommended practice, not a pitfall.

Q7: When is the adjustment formula P(Y | do(X=x)) = Σ_s P(Y | X=x, S=s) · P(S=s) valid?

A) Whenever S is correlated with Y
B) When S satisfies the backdoor criterion for (X, Y) in a correct DAG
C) Only when the data come from a randomized experiment
D) Only when S is empty

Correct: B)

If you chose A: This is incorrect. Correlation with Y says nothing about whether S blocks the backdoor paths.
If you chose B: Correct! The formula is valid when S blocks all backdoor paths, contains no descendant of X, and the assumed DAG is right.
If you chose C: This is incorrect. The purpose of the formula is to estimate causal effects from observational data.
If you chose D: This is incorrect. An empty S works only when there are no backdoor paths; otherwise a non-empty S is needed.

Practice Problems

Problem 1

Identify all backdoor paths between X and Y in the DAG: W → X, W → Z, Z → Y, X → Y. What set of variables satisfies the backdoor criterion?

Answer

Backdoor paths from X to Y (paths starting with an arrow into X):
- X ← W → Z → Y

S = {W} blocks this path, because W is a non-collider on it (the fork X ← W → Z). Check: W is not a descendant of X ✓ (the only edge out of X is X → Y, so there is no directed path from X to W).

So: P(Y | do(X=x)) = Σ_w P(Y | X=x, W=w) · P(W=w).

S = {Z} is also valid: Z is a non-collider on the path (W → Z → Y) and is not a descendant of X. S = {W, Z} works too but is not minimal. Minimal sufficient adjustment sets are usually preferred because conditioning on fewer variables tends to preserve statistical power.
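For small graphs the check can be automated. The sketch below is a plain-Python illustration, not a library API: it enumerates simple paths in the undirected skeleton, keeps those whose first edge points into X, and applies the usual blocking rules (a non-collider in S blocks; a collider blocks unless it or one of its descendants is in S). All function names are hypothetical.

```python
def backdoor_paths(edges, x, y):
    """Simple paths from x to y in the skeleton whose first edge points INTO x."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    paths = []

    def walk(path):
        node = path[-1]
        if node == y:
            paths.append(path)
            return
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in path:
                walk(path + [nxt])

    walk([x])
    return [p for p in paths if (p[1], x) in edges]

def descendants(edges, node):
    """All nodes reachable from node along directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for a, b in edges:
            if a == current and b not in found:
                found.add(b)
                stack.append(b)
    return found

def is_blocked(edges, path, s):
    """True if conditioning on the set s blocks the given path."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, mid) in edges and (nxt, mid) in edges
        if is_collider:
            if mid not in s and not (descendants(edges, mid) & s):
                return True    # unconditioned collider blocks the path
        elif mid in s:
            return True        # conditioned chain or fork node blocks the path
    return False

def satisfies_backdoor(edges, x, y, s):
    s = set(s)
    if s & descendants(edges, x):
        return False           # condition 2: no descendants of x in s
    return all(is_blocked(edges, p, s) for p in backdoor_paths(edges, x, y))

edges = {("W", "X"), ("W", "Z"), ("Z", "Y"), ("X", "Y")}
print(backdoor_paths(edges, "X", "Y"))
for candidate in [set(), {"W"}, {"Z"}, {"W", "Z"}, {"Y"}]:
    print(sorted(candidate), satisfies_backdoor(edges, "X", "Y", candidate))
```

Path enumeration grows quickly with graph size, so this approach is suited only to textbook-sized DAGs like the one in this problem.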

Problem 2

Explain why conditioning on a collider creates spurious correlation. Give a concrete real-world example.

Answer

Structure: X → Z ← Y. X and Y are independent (no causal connection). Conditioning on Z means we only look at cases where Z takes a specific value.

If Z = X + Y + noise, then for Z to be high, either X was high, OR Y was high, OR the noise was high. If we condition on Z=high and observe that X is LOW, then Y is probably high (to compensate). This creates a negative correlation between X and Y in the conditional distribution, even though they're independent in the population.

**Real-world example:** Talent (X) and luck (Y) are independent causes of success (Z). Among SUCCESSFUL people (conditioning on Z=high), talent and luck are NEGATIVELY correlated: very talented successful people tend to have been less lucky, and very lucky successful people tend to be less talented. This is "explaining away."

Berkson's paradox is the classic example: among hospital patients (conditioned on being sick enough to be hospitalized), diseases can be negatively correlated even if they are independent in the general population.

Problem 3

Derive why P(Y | do(X=x)) = P(Y | X=x) when there are no confounders (i.e., no backdoor paths from X to Y).

Answer

If there are no backdoor paths, then in the mutilated graph (where incoming edges to X are removed), the distribution of Y given X is the same as in the original graph. Formally, with no backdoor paths, X and all potential confounders are independent. The adjustment formula with empty S becomes:

$$P(Y \mid \mathrm{do}(X=x)) = P(Y \mid X=x) \quad \text{[adjustment set } S = \emptyset\text{, so there is nothing to sum over]}$$

Intuitively: the only association between X and Y is the causal effect X→Y (and possibly some mediators that are descendants of X). There are no common causes creating spurious correlation. So conditioning and intervening give the same result. This is why RCTs work: randomization BREAKS all incoming edges to treatment X, eliminating backdoor paths. In an RCT, P(Y | do(X=x)) = P(Y | X=x).

Problem 4

A hiring algorithm uses ML to predict job performance Y from features X that include "years of experience." Experience is caused partially by gender discrimination (Gender → Experience). The company wants to use the model to make FAIR hiring decisions. Explain the causal fairness problem and propose an adjustment.

Answer

DAG: Gender → Experience → Job Performance. Gender may also directly affect Performance (through unmeasured discrimination channels).

**Problem:** If the model uses Experience as a predictor, it perpetuates historical discrimination. A female candidate with 3 years of experience and high potential will be ranked below a male candidate with 5 years of experience and similar potential, even though the experience gap was caused by discrimination.

**Counterfactual fairness (Kusner et al., 2017):** A decision is fair if it would have been the same in the counterfactual world where the protected attribute were different:

$$P(\text{Decision} \mid \mathrm{do}(\text{Gender}=\text{male}), \text{observed data}) = P(\text{Decision} \mid \mathrm{do}(\text{Gender}=\text{female}), \text{observed data})$$

**Adjustment:** Use only features that are NOT descendants of the protected attribute, i.e. features with no directed path from Gender in the DAG. Alternatively, explicitly model the causal pathways and adjust predictions to remove the discriminatory effect. This is fundamentally different from "demographic parity" (equal acceptance rates) or "equalized odds" (equal error rates), which are statistical, not causal, fairness criteria.

Problem 5

The "do-calculus" has three rules. State Rule 2 and explain when it allows removing the do-operator.

Answer

**Rule 2 (Action/observation exchange):**

$$P(Y \mid \mathrm{do}(X), \mathrm{do}(Z), W) = P(Y \mid \mathrm{do}(X), Z, W)$$

if Y ⟂ Z | X, W in the graph where incoming edges to X and outgoing edges from Z are removed.

This rule says: we can replace the intervention do(Z) with the observation Z when, after conditioning on X and W, the only open paths between Z and Y are the causal paths leaving Z. Removing Z's outgoing edges deletes exactly those causal paths, so the independence condition checks that no backdoor (non-causal) path from Z to Y remains open.

**Use case:** With no other intervention (X empty), Rule 2 reduces to P(Y | do(Z), W) = P(Y | Z, W) whenever W d-separates Z from Y in the graph with Z's outgoing edges removed, i.e. W blocks every backdoor path from Z to Y. In that case intervening on Z and observing Z carry the same information about Y, and the do-operator can be dropped. Rule 2 thus generalizes the backdoor criterion.

**Practical importance:** Rule 2 allows translating causal queries with multiple interventions into purely observational quantities when certain graphical conditions hold. This is how causal effects that seem to require experiments can sometimes be estimated from observational data alone.

Summary

Correlation ≠ causation — the do-operator P(Y | do(X)) captures causal effects, while P(Y | X) captures mere association, and they differ whenever confounders exist
The backdoor criterion identifies which variables to adjust for to recover causal effects from observational data — stratifying by confounders removes spurious associations
SCMs represent causal knowledge as structural equations with independent exogenous noise — they support both interventional and counterfactual queries
Counterfactuals use abduction-action-prediction to answer "what if" questions — the most powerful form of causal reasoning
Causal ML addresses the brittleness of purely correlational models — causal invariances are stable under distribution shift, interventions, and deployment

Pitfalls

Conditioning on colliders and introducing selection bias. A classic mistake in observational studies: conditioning on a common effect (collider) of two independent variables creates spurious correlation between them. For example, conditioning on "hospitalization" (an effect of both disease severity and access to care) makes disease severity and access to care appear negatively correlated. Always check your DAG before adding control variables: conditioning on a collider can be worse than controlling for nothing, because it introduces a bias that was not there before.
Assuming correlation implies causation without a causal model. P(Y|X) can be high because X causes Y, Y causes X, or a confounder Z causes both. Without a causal DAG (or an experiment), these are indistinguishable. ML models trained to predict Y from X learn P(Y|X) — the observational conditional — which is useless for answering "what happens if we change X?" Always specify (and justify) your causal assumptions before making causal claims.
Applying the backdoor criterion without verifying that the DAG is correct. The backdoor adjustment formula P(Y|do(X)) = Σ_s P(Y|X,S=s)P(S=s) is only valid if S blocks ALL backdoor paths and contains NO descendants of X. If you miss a confounder (e.g., an unmeasured variable that affects both X and Y), the adjustment is biased. Causal inference from observational data is only as good as the causal graph — always conduct sensitivity analyses for unmeasured confounding.
Confusing P(Y|do(X)) with P(Y|X) in the presence of confounders. The do-operator represents an intervention that breaks the natural causal mechanism. In observational data with confounders, P(Y|X) includes both the causal effect of X→Y AND the spurious association through backdoor paths. Acting on a model that predicts P(Y|X) (standard ML) when you need P(Y|do(X)) (causal effect) leads to systematically wrong decisions — this is why purely predictive models fail under policy changes.
Building predictive models that break under distribution shift. A model trained to predict Y from X on historical data may achieve excellent test-set performance but fail catastrophically when deployed — because deployment changes the distribution of X (an intervention) or shifts the confounders. Causal relationships P(Y|do(X)) are invariant across environments, while correlational relationships P(Y|X) are not. For robust ML systems that must work under changing conditions, causal modeling is not optional — it's essential.

Key Terms

Term	Definition
do-operator	P(Y | do(X=x)): the distribution of Y when we intervene to set X=x, as opposed to observing X=x
Confounder	A variable that causes both the treatment and the outcome — creates spurious correlation
Backdoor criterion	Graphical condition for identifying causal effects: block all backdoor paths without conditioning on descendants of X
Adjustment formula	P(Y | do(X=x)) = Σ_s P(Y | X=x, S=s) · P(S=s), valid when S satisfies the backdoor criterion
SCM	Structural Causal Model — a set of structural equations X_j = f_j(PA_j, U_j) with independent U's
Counterfactual	"What would have happened if..." — computed via abduction→action→prediction
Collider bias	Conditioning on a common effect creates spurious dependence between its independent causes
Markov equivalence class	Set of DAGs encoding the same conditional independencies — causal discovery's fundamental ambiguity
Causal invariance	P(Y | do(X)) is stable across environments and under distribution shift, while P(Y | X) is not

Next Steps

Continue to 22-01 — Autoencoders to learn about unsupervised representation learning — how neural networks can compress and reconstruct data, forming the basis for dimensionality reduction, denoising, and generative modeling.
