
21-06: Causality and Causal Inference

Phase: 21 (Probability & Statistics for ML, Advanced)
Subject: 21-06
Prerequisites: 21-01 (Bayesian Inference), 10-02 (Conditional Probability), 12-05 (Hypothesis Testing), 11-01 (Markov Chains, for graphical models)
Next subject: 22-01 (Autoencoders)


Learning Objectives

By the end of this subject, you will be able to:

  1. Distinguish correlation from causation using the do-operator: prove that P(Y | do(X=x)) ≠ P(Y | X=x) when confounders exist
  2. Apply the backdoor criterion to identify causal effects from observational data and derive adjustment formulas
  3. Construct and interpret Structural Causal Models (SCMs) as systems of structural equations with independent exogenous noise
  4. Compute counterfactual quantities using the three-step abduction-action-prediction procedure
  5. Explain why ML models that optimize P(Y|X) can fail catastrophically under distribution shift, connecting causality to robust ML

Core Content

1. Why Causality Matters for ML

Standard ML learns P(Y | X), which captures statistical association. But correlation ≠ causation.

Correlation predicts the past. Causation predicts the future under INTERVENTIONS. When we deploy an ML model and act on its predictions, we are intervening, and correlation-based models can fail catastrophically.

⚠️ THIS IS CRITICAL: Causal reasoning is widely regarded as a key gap between current ML systems and human-level intelligence. An ML model that can only learn correlations can break whenever the data-generating process changes, which is common in deployment. A correctly specified causal model stays valid under the interventions and distribution shifts it accounts for.


2. The Do-Operator and Interventions

Pearl's do-operator formalizes the difference between observing and intervening:

The fundamental distinction:

$P(Y \mid X=x) \neq P(Y \mid do(X=x)) \quad \text{in the presence of confounders}$

Example: Let X = "taking a drug," Y = "recovery," Z = "severity of illness" (a confounder: it affects both X and Y).

P(recovery | drug) = misleadingly low   [sicker patients are more likely to take the drug, and sicker patients recover at lower rates]
P(recovery | do(drug)) = ?              [the true causal effect, isolated from confounding]

To compute P(Y | do(X=x)), we must adjust for confounding β€” using the backdoor criterion.


3. Causal Graphical Models (DAGs)

A Directed Acyclic Graph (DAG) represents causal relationships:

- Nodes = variables
- Edges X → Y = "X directly causes Y"

Key structures:

Chain (mediator): X → Z → Y. X and Y are dependent, but conditioning on Z BLOCKS the path (X ⟂ Y | Z).

Fork (confounder): X ← Z → Y. X and Y are dependent (spurious correlation). Conditioning on Z blocks the path (X ⟂ Y | Z).

Collider: X → Z ← Y. X and Y are independent (X ⟂ Y), but conditioning on Z OPENS the path, so X and Y become dependent given Z. This is "explaining away": conditioning on a common effect creates dependence between independent causes.
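The collider case is the least intuitive of the three, so a small simulation may help. The sketch below is illustrative only: the Gaussian variables, the noise scale 0.5, the threshold Z > 1, and the seed are all assumptions made for the example, and "conditioning" is approximated by keeping samples above the threshold.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable (arbitrary choice)
n = 20_000

# Collider: X -> Z <- Y, with X and Y generated independently.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]
z = [xi + yi + rng.gauss(0, 0.5) for xi, yi in zip(x, y)]

corr_all = pearson(x, y)

# Approximate conditioning on the collider: keep only samples with large Z.
kept = [(xi, yi) for xi, yi, zi in zip(x, y, z) if zi > 1.0]
corr_given_z = pearson([p[0] for p in kept], [p[1] for p in kept])

print(f"corr(X, Y)         = {corr_all:+.3f}")
print(f"corr(X, Y | Z > 1) = {corr_given_z:+.3f}")
```

In the full sample the correlation should be close to zero, while in the selected subsample it should be clearly negative: selecting on the common effect makes the two independent causes compete to explain it.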

The Backdoor Criterion: A set of variables S satisfies the backdoor criterion for (X, Y) if:

  1. S blocks all "backdoor" paths from X to Y (paths starting with an arrow INTO X)
  2. No node in S is a descendant of X

Then:

$P(Y \mid do(X=x)) = \sum_s P(Y \mid X=x, S=s) \cdot P(S=s)$

This is the adjustment formula: we can estimate causal effects from observational data by conditioning on (stratifying by) the right confounders.
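To make the formula concrete, here is a minimal sketch on a binary confounder Z, treatment X, and outcome Y. Every probability in it is a made-up number chosen for illustration (none comes from the text), and exact fractions are used so the two quantities can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical numbers, chosen only for illustration.
p_z = {0: F(1, 2), 1: F(1, 2)}             # P(Z=z); Z=1 means "severe"
p_x1_given_z = {0: F(1, 5), 1: F(4, 5)}    # P(X=1 | Z=z): severe cases get treated more
p_y1_given_xz = {                          # P(Y=1 | X=x, Z=z), keyed by (x, z)
    (0, 0): F(7, 10), (1, 0): F(9, 10),
    (0, 1): F(2, 10), (1, 1): F(4, 10),
}

def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]

def observational(x):
    """P(Y=1 | X=x): each stratum is weighted by P(Z=z | X=x)."""
    p_x = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return sum(p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] / p_x
               for z in p_z)

def interventional(x):
    """P(Y=1 | do(X=x)) by backdoor adjustment: each stratum is weighted by P(Z=z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)

for x in (0, 1):
    print(f"x={x}: P(Y=1 | X=x) = {observational(x)}, "
          f"P(Y=1 | do(X=x)) = {interventional(x)}")
```

With these assumed numbers the observational comparison makes treatment look harmful (1/2 vs 3/5), while the adjusted comparison shows a benefit (13/20 vs 9/20). The only difference between the two functions is the weight applied to each stratum.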


4. Structural Causal Models (SCMs)

An SCM is a set of structural equations:

X_j = f_j(PA_j, U_j)    for j = 1, ..., d

where PA_j = causal parents of X_j, and U_j = exogenous (unexplained) noise variables, assumed jointly independent.

Example: Drug → Recovery, with confounding by Severity:

$\begin{aligned}
S &= U_S && \text{[severity, exogenous]} \\
X &= f_X(S, U_X) = I(S + U_X > 2) && \text{[drug prescription depends on severity]} \\
Y &= f_Y(S, X, U_Y) = \alpha X - \beta S + U_Y && \text{[recovery, caused by drug and severity]}
\end{aligned}$

With SCMs, computing P(Y | do(X=1)) means replacing the structural equation for X:

$X = 1 \quad \text{[intervention: override the natural mechanism]}$

Then:

$P(Y \mid do(X=1)) = P(f_Y(S, 1, U_Y)) \quad \text{[under the original distributions of } S \text{ and } U_Y\text{]}$

This is fundamentally different from conditioning: we SET X=1 rather than selecting cases where X=1.
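The difference between setting and selecting can be checked by simulating the SCM above. This is a sketch under assumptions the text does not state: all noise terms are standard normal, α = β = 1, and the seed and sample size are arbitrary.

```python
import random

ALPHA, BETA = 1.0, 1.0   # assumed coefficients, not given in the text

def sample(rng, do_x=None):
    """Draw one (S, X, Y) from the SCM; do_x replaces the equation for X."""
    s = rng.gauss(0, 1)                          # S = U_S
    u_x, u_y = rng.gauss(0, 1), rng.gauss(0, 1)
    x = int(s + u_x > 2) if do_x is None else do_x
    y = ALPHA * x - BETA * s + u_y
    return s, x, y

def mean(values):
    values = list(values)
    return sum(values) / len(values)

rng = random.Random(0)   # fixed seed for repeatability (arbitrary choice)
n = 50_000

# Conditioning: SELECT the cases where X happened to be 1 or 0.
obs = [sample(rng) for _ in range(n)]
naive = (mean(y for _, x, y in obs if x == 1)
         - mean(y for _, x, y in obs if x == 0))

# Intervening: SET X for everyone, leaving S and U_Y untouched.
ate = (mean(sample(rng, do_x=1)[2] for _ in range(n))
       - mean(sample(rng, do_x=0)[2] for _ in range(n)))

print(f"E[Y | X=1] - E[Y | X=0]         = {naive:+.2f}")
print(f"E[Y | do(X=1)] - E[Y | do(X=0)] = {ate:+.2f}")
```

Under these assumptions the interventional contrast should land near α = 1, while the naive contrast should come out negative, because the treated group is dominated by high-severity cases.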


5. Counterfactuals

Counterfactuals answer "what would have happened if...?" questions, the most powerful form of causal reasoning.

Three-step procedure (Pearl):

  1. Abduction: Given observed evidence E=e, compute the posterior distribution of exogenous variables U | E=e
  2. Action: Modify the SCM by intervening (do(X=x)), replacing the structural equations for the intervened variables
  3. Prediction: Compute the counterfactual outcome using the modified SCM and the posterior over U

Example: "If the patient had NOT taken the drug, would they have recovered?"

Given: Patient took drug (X=1), recovered (Y=1), severity S=0.8.

Step 1 (Abduction): Infer U_X, U_Y from observed X=1, Y=1, S=0.8. From X = I(0.8 + U_X > 2): with S=0.8, X=1 ⇒ U_X > 1.2, so U_X is only known up to the distribution p(U_X | U_X > 1.2). From Y = α·1 − β·0.8 + U_Y = 1: U_Y = 1 − α + 0.8β.

Step 2 (Action): Set X=0 (counterfactual intervention). This overrides the equation for X, so the remaining uncertainty about U_X does not affect the answer.

Step 3 (Prediction): Y_cf = α·0 − β·0.8 + U_Y = −0.8β + (1 − α + 0.8β) = 1 − α.

If α > 1, then Y_cf < 0. Reading Y ≤ 0 as "not recovered", the patient would NOT have recovered without the drug → the drug was necessary.
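The three steps can be written as a short function. This is a sketch for the linear outcome equation of this example only; the values α = 1.5 and β = 2.0 are hypothetical, picked to show a case with α > 1.

```python
import math

def counterfactual_outcome(alpha, beta, s_obs, x_obs, y_obs, x_cf):
    """Abduction-action-prediction for Y = alpha*X - beta*S + U_Y."""
    # Abduction: the equation for Y can be solved for U_Y given the evidence.
    u_y = y_obs - alpha * x_obs + beta * s_obs
    # Action: override X with x_cf.
    # Prediction: re-evaluate Y with the same S and the inferred U_Y.
    return alpha * x_cf - beta * s_obs + u_y

alpha, beta = 1.5, 2.0   # hypothetical coefficients
y_cf = counterfactual_outcome(alpha, beta, s_obs=0.8, x_obs=1, y_obs=1, x_cf=0)

print(f"Y_cf = {y_cf:.3f}")
print("matches 1 - alpha:", math.isclose(y_cf, 1 - alpha))
```

Whatever value is chosen for β, it cancels in the result, which mirrors the algebra in Step 3.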


6. Causal Discovery

Given observational data, can we LEARN the causal DAG? Sometimes.

Constraint-based methods (PC algorithm): Use conditional independence tests to narrow down the set of possible DAGs. The output is a Markov equivalence class: all DAGs that encode the same conditional independencies.

Limitation: Some causal structures are observationally equivalent. X→Y and X←Y encode the SAME conditional independence relations (none, in this bivariate case). Without interventions or strong parametric assumptions, we can only identify the causal DAG up to its Markov equivalence class.
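Markov equivalence can be checked mechanically using a standard graphical characterization (due to Verma and Pearl, not stated in the text above): two DAGs are Markov equivalent exactly when they share the same skeleton and the same v-structures, that is, colliders whose two parents are not adjacent. The sketch below assumes DAGs are given as lists of (parent, child) pairs; the function names are illustrative.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the graph: a set of unordered node pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Colliders a -> c <- b whose parents a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    found = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def markov_equivalent(edges1, edges2):
    """Same skeleton and same v-structures."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))

chain = [("X", "Z"), ("Z", "Y")]      # X -> Z -> Y
fork = [("Z", "X"), ("Z", "Y")]       # X <- Z -> Y
collider = [("X", "Z"), ("Y", "Z")]   # X -> Z <- Y

print(markov_equivalent([("X", "Y")], [("Y", "X")]))  # True
print(markov_equivalent(chain, fork))                 # True
print(markov_equivalent(chain, collider))             # False
```

The chain and the fork cannot be told apart from conditional independencies alone, whereas the collider leaves a distinct signature. Constraint-based methods exploit exactly this asymmetry to orient some edges.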

Score-based methods: Search over DAGs, scoring each by how well it fits the data (with a penalty for complexity). The number of DAGs grows super-exponentially with the number of variables, which limits exact search to roughly 30 variables; larger problems rely on heuristic search such as greedy equivalence search (GES).

Interventional data is gold: If we can perform randomized experiments, estimating the effect of X on Y becomes straightforward: randomize X and observe Y. This is why RCTs (Randomized Controlled Trials) are the gold standard.


7. Causality in Modern ML

Domain generalization: Causal models identify invariant relationships across environments. If X CAUSES Y, then P(Y | do(X)) is stable under distribution shifts that leave the mechanism generating Y unchanged, while P(Y | X) can change whenever the confounding structure changes. Causal representations therefore tend to be more transferable.

Fairness: Causal definitions of fairness (counterfactual fairness) ask: would the decision have been different if the protected attribute were different, with everything else held constant? This is more principled than correlational fairness metrics.

Reinforcement learning: RL is inherently causal because actions are interventions. Model-based RL with causal world models can plan counterfactually (what if I had taken action a instead?).

Explainability: "Why did the model predict Y?" A causal explanation identifies the actual causes, not just correlated features. Feature attribution methods (SHAP, LIME) provide correlational explanations; causal explanations require a causal model.


Worked Examples

Example 1: Backdoor Adjustment

Problem: Given DAG: Z → X, Z → Y, X → Y. (Z confounds X and Y.) Derive P(Y | do(X=x)) using the backdoor criterion.

Solution:

Backdoor path from X to Y: X ← Z → Y. This path carries spurious association because Z is a common cause.

S = {Z} satisfies the backdoor criterion:

  1. Z blocks X ← Z → Y ✓
  2. Z is not a descendant of X ✓

Adjustment formula:

$P(Y \mid do(X=x)) = \sum_z P(Y \mid X=x, Z=z) \cdot P(Z=z)$

This is the "stratify by confounders" strategy. Compare to the observational:

$P(Y \mid X=x) = \sum_z P(Y \mid X=x, Z=z) \cdot P(Z=z \mid X=x)$

The difference: P(Z=z) vs P(Z=z | X=x). The observational conditional uses the distribution of Z GIVEN X, which is tainted by the X←Z backdoor. The do-formula uses the marginal distribution of Z, breaking the backdoor path.
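For readers who want the derivation spelled out, one standard route is the truncated factorization, sketched here under the assumption that the three-node DAG of this example is the full causal model with no unmeasured variables. Intervening on X deletes the factor for X from the joint distribution and fixes X at x:

$\begin{aligned}
P(x, y, z) &= P(z)\, P(x \mid z)\, P(y \mid x, z) && \text{(observational factorization)} \\
P(y, z \mid do(X=x)) &= P(z)\, P(y \mid x, z) && \text{(drop the factor } P(x \mid z)\text{)} \\
P(y \mid do(X=x)) &= \sum_z P(y \mid x, z)\, P(z) && \text{(marginalize over } z\text{)}
\end{aligned}$

The last line is the adjustment formula with S = {Z}.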


Example 2: Simpson's Paradox Explained Causally

Problem: A treatment shows a much higher recovery rate overall (80% vs 43%). It also helps within each gender, but by a different margin than the pooled comparison suggests. Explain the discrepancy using causal DAGs, and relate it to Simpson's paradox.

Solution:

DAG: Gender → Treatment, Gender → Recovery, Treatment → Recovery.

Gender is a confounder: it affects both who gets treatment and recovery rates. Women are both more likely to get treatment AND have naturally higher recovery rates.

Table:

| | Men (Treated) | Men (Control) | Women (Treated) | Women (Control) |
|---|---|---|---|---|
| Recovered | 60/100 (60%) | 20/50 (40%) | 180/200 (90%) | 10/20 (50%) |
| Total | 100 | 50 | 200 | 20 |

Overall: Treatment = 240/300 = 80%, Control = 30/70 = 43%. Treatment looks amazing!

Within each gender, treatment also helps (60% > 40% for men, 90% > 50% for women), but the pooled gap of 37pp mixes the treatment effect with the gender effect, because women (who recover more) are overrepresented in the treatment group. In these data the distortion inflates the apparent benefit rather than reversing it. Simpson's paradox proper is the extreme case of the same mechanism, where the imbalance is strong enough that the pooled comparison points in the opposite direction from every stratum.

Causal resolution:

$P(\text{Recovery} \mid do(\text{Treatment})) = \sum_g P(\text{Recovery} \mid \text{Treatment}, \text{Gender}=g) \cdot P(\text{Gender}=g)$

This uses P(Gender) (population proportions) rather than P(Gender | Treatment) (study proportions). If the population is 50/50:

For treatment: 0.5·60% + 0.5·90% = 75%
For control: 0.5·40% + 0.5·50% = 45%

Treatment effect = 75% − 45% = +30pp. This is the causal effect, adjusted for the confounded sampling.
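The arithmetic can be reproduced directly from the table counts. In this sketch the dictionary layout and function names are illustrative choices, and the 50/50 population split is the same assumption used in the text.

```python
from fractions import Fraction as F

# (recovered, total) per gender and arm, copied from the table above.
counts = {
    ("men", "treated"): (60, 100), ("men", "control"): (20, 50),
    ("women", "treated"): (180, 200), ("women", "control"): (10, 20),
}

def pooled_rate(arm):
    """Recovery rate ignoring gender: implicitly weights by P(Gender | arm)."""
    recovered = sum(r for (g, a), (r, n) in counts.items() if a == arm)
    total = sum(n for (g, a), (r, n) in counts.items() if a == arm)
    return F(recovered, total)

def adjusted_rate(arm, p_gender):
    """Backdoor adjustment: weights each stratum's rate by P(Gender)."""
    return sum(F(*counts[(g, arm)]) * p for g, p in p_gender.items())

population = {"men": F(1, 2), "women": F(1, 2)}   # assumed 50/50 split

naive = pooled_rate("treated") - pooled_rate("control")
adjusted = (adjusted_rate("treated", population)
            - adjusted_rate("control", population))

print(f"pooled effect:   {float(naive):.3f}")     # 13/35, about 0.371
print(f"adjusted effect: {float(adjusted):.3f}")  # 3/10 = 0.300
```

Changing `population` to other proportions shows how much the adjusted effect depends on the target population, since the within-stratum effects differ (20pp for men, 40pp for women).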


Example 3: Counterfactual Computation

Problem: SCM: Y = 2X + U, X = Z + U, where U ~ N(0,1), Z ~ N(0,1), and U ⟂ Z. We observe X=3, Y=8. What would Y have been if X had been 1 instead?

Solution:

Step 1 (Abduction): From Y=8: 8 = 2·3 + U ⇒ U = 8 − 6 = 2. From X=3: 3 = Z + U ⇒ Z = 1. So U=2, Z=1.

Step 2 (Action): Set X=1 (override X = Z+U).

Step 3 (Prediction): Y_cf = 2·1 + U = 2 + 2 = 4.

So: Y would have been 4 (instead of 8). The individual causal effect of changing X from 3 to 1 is 4 − 8 = −4, which is exactly 2×(1−3) = −4, matching the structural coefficient 2 on X.

Contrast with a purely statistical model that only knows P(Y|X): it would predict E[Y|X=1] from the regression line, which differs from the counterfactual Y(do(X=1)) here because U enters both equations, so U and X are correlated in the observational distribution.
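A short sketch makes the contrast numeric. For this SCM, Cov(X, U) = 1 and Var(X) = 2, so the observational regression slope of Y on X is 2 + 1/2 = 2.5 rather than the structural coefficient 2. The code below computes the counterfactual exactly and estimates the slope by simulation; the seed and sample size are arbitrary choices.

```python
import random

def counterfactual_y(x_obs, y_obs, x_cf):
    """Abduction-action-prediction for the SCM Y = 2X + U, X = Z + U."""
    u = y_obs - 2 * x_obs      # abduction: invert Y = 2X + U
    z = x_obs - u              # abduction: invert X = Z + U (not used for Y_cf)
    y_cf = 2 * x_cf + u        # action (X := x_cf) and prediction
    return y_cf, u, z

y_cf, u, z = counterfactual_y(x_obs=3, y_obs=8, x_cf=1)
print(y_cf, u, z)              # 4 2 1

# Observational regression of Y on X, estimated from simulated data.
rng = random.Random(0)         # fixed seed (arbitrary choice)
xs, ys = [], []
for _ in range(50_000):
    u_i, z_i = rng.gauss(0, 1), rng.gauss(0, 1)
    x_i = z_i + u_i
    xs.append(x_i)
    ys.append(2 * x_i + u_i)

mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
         / sum((a - mx) ** 2 for a in xs))
print(f"regression slope of Y on X: {slope:.3f}")
```

Three different numbers answer three different questions at X = 1: the regression predicts about 2.5, the population-level E[Y | do(X=1)] is 2, and the counterfactual for this particular unit (with U = 2) is 4.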



Quiz

Q1: What does the backdoor criterion refer to in this subject?

A) A visual representation of a causal graph
B) A graphical condition on an adjustment set S: S blocks all backdoor paths from X to Y and contains no descendant of X
C) A computational error that arises in the adjustment formula
D) A historical anecdote about causal inference

Correct: B)

Q2: What is the primary purpose of counterfactuals?

A) They are used only in advanced research contexts
B) They answer "what would have happened if...?" questions about an observed case
C) They replace all other methods in this domain
D) They are primarily a historical notation system

Correct: B)

Q3: Which statement about causal ML is TRUE?

A) Causal ML is an advanced topic beyond this subject's scope
B) Causal ML is not related to this subject
C) Causal ML is a fundamental concept covered in this subject
D) Causal ML is mentioned only as a historical footnote

Correct: C)

Q4: In Practice Problem 1 (DAG: W → X, W → Z, Z → Y, X → Y), which path is the backdoor path from X to Y?

A) X ← W → Z → Y
B) X → Y
C) W → X → Y
D) There is no backdoor path

Correct: A)

Q5: How are causal ML and confounders related?

A) Causal ML and confounders are completely unrelated topics
B) Causal ML is a special case of confounding
C) They are closely related: confounders are what make P(Y | X) differ from P(Y | do(X))
D) Causal ML is the inverse of confounding

Correct: C)

Q6: What is a common pitfall when working with the backdoor criterion?

A) Confusing the backdoor criterion with "adjust for anything correlated", which can include colliders or descendants of X
B) Using the backdoor criterion when it is not needed
C) The backdoor criterion has no common misconceptions
D) The backdoor criterion is always applied the same way in all contexts

Correct: A)

Q7: When should you apply the adjustment formula?

A) Never; the adjustment formula is not practically useful
B) When estimating a causal effect from observational data, given a set S that satisfies the backdoor criterion
C) Only in pure mathematics contexts
D) Only when explicitly instructed

Correct: B)

Practice Problems

Problem 1

Identify all backdoor paths between X and Y in the DAG: W → X, W → Z, Z → Y, X → Y. What set of variables satisfies the backdoor criterion?

Answer: Backdoor paths from X to Y (paths starting with an arrow into X):

- X ← W → Z → Y

S = {W} blocks this path (W is the fork node, so conditioning on it separates W→X from W→Z→Y). Check: W is not a descendant of X ✓ (the only edge out of X is X→Y, so there is no directed path from X to W). So: P(Y | do(X=x)) = Σ_w P(Y | X=x, W=w) · P(W=w).

S = {Z} is also valid: Z is a chain node on the path and is not a descendant of X. S = {W, Z} works too but is not minimal. Minimal sufficient adjustment sets are preferred because they preserve more statistical power.
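For larger graphs, listing backdoor paths by hand gets error-prone. The following sketch enumerates them for a DAG given as (parent, child) pairs. The function names are illustrative, and the code only lists the paths; it does not check whether a given adjustment set blocks them.

```python
def backdoor_paths(edges, x, y):
    """All simple paths from x to y whose first edge points INTO x."""
    directed = set(edges)
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    paths = []

    def extend(path):
        node = path[-1]
        if node == y:
            paths.append(path)
            return
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in path:            # keep paths simple (no repeated nodes)
                extend(path + [nxt])

    for first in sorted(neighbors.get(x, ())):
        if (first, x) in directed:         # edge first -> x, i.e. an arrow into x
            extend([x, first])
    return paths

def render(path, edges):
    """Format a path with arrows showing each edge's direction in the DAG."""
    directed = set(edges)
    out = path[0]
    for a, b in zip(path, path[1:]):
        out += (" -> " if (a, b) in directed else " <- ") + b
    return out

dag = [("W", "X"), ("W", "Z"), ("Z", "Y"), ("X", "Y")]
for p in backdoor_paths(dag, "X", "Y"):
    print(render(p, dag))   # X <- W -> Z -> Y
```

The direct edge X → Y is not reported because it leaves X rather than entering it, so it belongs to the causal effect being estimated.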

Problem 2

Explain why conditioning on a collider creates spurious correlation. Give a concrete real-world example.

Answer: Structure: X → Z ← Y. X and Y are independent (no causal connection). Conditioning on Z means we only look at cases where Z takes a specific value.

If Z = X + Y + noise, then for Z to be high, either X was high, OR Y was high, OR the noise was high. If we condition on Z=high and observe that X is LOW, then Y is probably high (to compensate). This creates a negative correlation between X and Y in the conditional distribution, even though they're independent in the population.

**Real-world example:** Talent (X) and luck (Y) are independent causes of success (Z). Among SUCCESSFUL people (conditioning on Z=high), talent and luck are NEGATIVELY correlated: very talented successful people tend to have been less lucky, and very lucky successful people tend to be less talented. This is "explaining away."

Berkson's paradox is the classic example: among hospital patients (conditioned on being sick enough to be hospitalized), diseases can be negatively correlated even if they are independent in the general population.

Problem 3

Derive why P(Y | do(X=x)) = P(Y | X=x) when there are no confounders (i.e., no backdoor paths from X to Y).

Answer: If there are no backdoor paths, then in the mutilated graph (where incoming edges to X are removed), the distribution of Y given X is the same as in the original graph. Formally, with no backdoor paths the empty set satisfies the backdoor criterion, so the adjustment formula with S = ∅ has nothing to sum over:
$P(Y \mid do(X=x)) = P(Y \mid X=x)$
Intuitively: the only association between X and Y is the causal effect X→Y (and possibly some mediators that are descendants of X). There are no common causes creating spurious correlation. So conditioning and intervening give the same result. This is why RCTs work: randomization BREAKS all incoming edges to treatment X, eliminating backdoor paths. In an RCT, P(Y | do(X=x)) = P(Y | X=x).

Problem 4

A hiring algorithm uses ML to predict job performance Y from features X that include "years of experience." Experience is caused partially by gender discrimination (Gender → Experience). The company wants to use the model to make FAIR hiring decisions. Explain the causal fairness problem and propose an adjustment.

Answer: DAG: Gender → Experience → Job Performance. Gender may also directly affect Performance (through unmeasured discrimination channels).

**Problem:** If the model uses Experience as a predictor, it perpetuates historical discrimination. A female candidate with 3 years of experience and high potential will be ranked below a male candidate with 5 years of experience and similar potential, even though the experience gap was caused by discrimination.

**Counterfactual fairness (Kusner et al., 2017):** A decision is fair if it would have been the same in the counterfactual world where the protected attribute were different:
$P(\text{Decision} \mid do(\text{Gender}=\text{male}), \text{observed data}) = P(\text{Decision} \mid do(\text{Gender}=\text{female}), \text{observed data})$
**Adjustment:** Use only features that are NOT descendants of the protected attribute, that is, features with no directed path from Gender in the DAG. Alternatively, explicitly model the causal pathways and adjust predictions to remove the discriminatory effect. This is fundamentally different from "demographic parity" (equal acceptance rates) or "equalized odds" (equal error rates), which are statistical, not causal, fairness criteria.

Problem 5

The "do-calculus" has three rules. State Rule 2 and explain when it allows removing the do-operator.

Answer: **Rule 2 (Action/observation exchange):**
$P(Y \mid do(X), do(Z), W) = P(Y \mid do(X), Z, W)$
if Y ⟂ Z | X, W in the graph where incoming edges to X and outgoing edges from Z are removed.

This rule says: we can replace the intervention do(Z) with conditioning on the observation Z WHEN, after intervening on X, the variables {X, W} block every backdoor path from Z to Y. Removing the outgoing edges from Z deletes its causal paths, so any dependence left between Z and Y in that graph would be non-causal; the condition requires that none is left.

**Use case:** With no intervention on X, Rule 2 expresses the backdoor idea: if W blocks all backdoor paths from Z to Y, then P(Y | do(Z), W) = P(Y | Z, W), and the do-operator can be dropped.

**Practical importance:** Together with Rules 1 and 3, Rule 2 allows translating causal queries with interventions into purely observational quantities when certain graphical conditions hold. This is how causal effects that seem to require experiments can sometimes be estimated from observational data alone.

Summary

  1. Correlation ≠ causation: the do-operator P(Y | do(X)) captures causal effects, while P(Y | X) captures mere association, and they differ whenever confounders exist
  2. The backdoor criterion identifies which variables to adjust for to recover causal effects from observational data; stratifying by confounders removes spurious associations
  3. SCMs represent causal knowledge as structural equations with independent exogenous noise, and they support both interventional and counterfactual queries
  4. Counterfactuals use abduction-action-prediction to answer "what if" questions, the most powerful form of causal reasoning
  5. Causal ML addresses the brittleness of purely correlational models: causal invariances are stable under distribution shift, interventions, and deployment

Pitfalls


Key Terms

| Term | Definition |
|---|---|
| do-operator | $P(Y \mid do(X=x))$: the distribution of Y when X is set to x by intervention rather than observed |
| Confounder | A variable that causes both the treatment and the outcome, creating spurious correlation |
| Backdoor criterion | Graphical condition for identifying causal effects: block all backdoor paths without conditioning on descendants of X |
| Adjustment formula | $P(Y \mid do(X=x)) = \sum_s P(Y \mid X=x, S=s) \cdot P(S=s)$ |
| SCM | Structural Causal Model: a set of structural equations X_j = f_j(PA_j, U_j) with independent U's |
| Counterfactual | "What would have happened if...", computed via abduction→action→prediction |
| Collider bias | Conditioning on a common effect creates spurious dependence between its independent causes |
| Markov equivalence class | Set of DAGs encoding the same conditional independencies; causal discovery's fundamental ambiguity |
| Causal invariance | $P(Y \mid do(X))$ stays stable under distribution shift, while $P(Y \mid X)$ may not |

Next Steps

Continue to 22-01 (Autoencoders) to learn about unsupervised representation learning: how neural networks can compress and reconstruct data, forming the basis for dimensionality reduction, denoising, and generative modeling.