
Chapter 3: Causal Inference Beyond Regression

Chapter Introduction

Every method in Chapter 2 answered the same question: given the inputs \(X\), what is our best prediction of \(Y\)? That is the prediction question. Most decisions that matter in the real world ask a different question: if we intervene and change \(X\), how does \(Y\) change? That is the causal question, and a regression coefficient is, in general, the wrong answer to it.

The distinction is everywhere. A clinical trialist asks whether a new drug causes better outcomes — not whether patients who happen to take it have better outcomes. A marketing team asks whether their promotional email caused the lift in conversions — not whether customers who happened to receive it (because they were already engaged) converted at a higher rate. A policy economist asks whether a minimum-wage increase caused changes in employment — not whether states with higher minimum wages also have different employment patterns for a hundred other reasons. In each case the regression coefficient is a partial correlation. The causal effect is a different number, and getting it right requires different machinery.

This chapter teaches that machinery — the five classical workhorses of causal inference, in the order an analyst typically reaches for them. The Randomized Controlled Trial (RCT) is the gold standard; when you can run one, the analysis is essentially trivial. Propensity Score Matching is the workaround when randomisation is impossible but you have all the confounders measured. Instrumental Variables (IV) lets you estimate causal effects even when unmeasured confounding is present, by finding an exogenous “lottery” in the data. Regression Discontinuity (RD) exploits sharp cut-offs in policy or program eligibility to recover causal effects almost as cleanly as an RCT. Difference-in-Differences (DiD) uses a parallel control group through time to net out pre-existing trends.

These five methods underpin essentially every result in modern empirical economics, in pharmacoeconomics, in marketing-mix modelling, in behavioural science, and in any policy evaluation that takes its own conclusions seriously. They are also the conceptual bedrock for the AI-scale Double Machine Learning we will return to in Chapter 9. The chapter closes with a sensitivity-analysis section — because every causal estimate is one untested assumption away from being wrong, and a working analyst must know which assumption supports which estimate.


Table of Contents

  1. The Causal Question vs. The Correlation Question
  2. Randomized Controlled Trials — The Gold Standard
  3. Confounding and the Backdoor Path
  4. Propensity Score Matching
  5. Instrumental Variables (IV)
  6. Regression Discontinuity Design (RD)
  7. Difference-in-Differences (DiD)
  8. Sensitivity Analysis and Common Pitfalls

The Causal Question vs. The Correlation Question

The potential-outcomes framework (Rubin 1974) makes the distinction sharp. For each unit \(i\), define two potential outcomes:

\[ Y_i(1) = \text{outcome if treated},\qquad Y_i(0) = \text{outcome if not treated}. \]

The individual treatment effect is \(\tau_i = Y_i(1) - Y_i(0)\). The average treatment effect (ATE) in a population is \(\tau = \mathbb{E}[Y(1) - Y(0)]\).

The fundamental problem of causal inference: for any single unit you only observe one of \(Y_i(1)\) or \(Y_i(0)\) — the one corresponding to the treatment they actually received. You cannot observe the other. Causal inference is fundamentally a missing-data problem.

The observed difference of means between treated and untreated groups is

\[ \underbrace{\mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0]}_{\text{observable}} \;=\; \underbrace{\mathbb{E}[Y(1) - Y(0) \mid T = 1]}_{\text{ATT (causal)}} + \underbrace{\big( \mathbb{E}[Y(0) \mid T=1] - \mathbb{E}[Y(0) \mid T=0]\big)}_{\text{selection bias}}. \]

When treatment is randomly assigned, the selection bias term is zero by construction, and the difference of means is the causal effect. When treatment is not randomly assigned, the difference of means is contaminated by selection bias — exactly the contamination this chapter exists to remove.
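The decomposition can be checked numerically. The sketch below is a simulation under assumed parameters (the data-generating process, the constant effect of 1.5, and all variable names are illustrative, not taken from the text). Because it is a simulation, both potential outcomes are visible, which is never the case with real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed data-generating process: W raises both treatment uptake and the baseline outcome.
w = rng.normal(size=n)
y0 = 1.0 + 2.0 * w + rng.normal(size=n)        # potential outcome without treatment
y1 = y0 + 1.5                                   # potential outcome with treatment (constant effect 1.5)
t = (w + rng.normal(size=n) > 0).astype(int)    # self-selected treatment
y = np.where(t == 1, y1, y0)                    # only one potential outcome is ever observed

naive = y[t == 1].mean() - y[t == 0].mean()                # observable difference of means
att = (y1 - y0)[t == 1].mean()                             # causal effect on the treated
selection_bias = y0[t == 1].mean() - y0[t == 0].mean()     # baseline gap between the groups

# Under random assignment the selection-bias term vanishes in expectation.
t_rand = rng.integers(0, 2, size=n)
y_rand = np.where(t_rand == 1, y1, y0)
rct_diff = y_rand[t_rand == 1].mean() - y_rand[t_rand == 0].mean()

print(f"naive = {naive:.3f}, ATT + selection bias = {att + selection_bias:.3f}")
print(f"difference of means under randomisation = {rct_diff:.3f}")
```

The identity `naive == att + selection_bias` holds exactly in any sample; only the size of the bias term depends on how treatment was assigned.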

Randomized Controlled Trials — The Gold Standard

In a Randomized Controlled Trial, the treatment \(T\) is assigned by a random mechanism independent of every potential outcome and every covariate. Under randomisation:

\[ \mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0] \;=\; \mathbb{E}[Y(1) - Y(0)] \;=\; \tau. \]

A two-sample t-test (Chapter 1) is now a causal test, not a correlation test. The discipline of RCTs is what makes them the gold standard for FDA drug approval, for every Netflix product change, for Google’s search-ranking experiments, and for the most rigorous published findings in economics and social science.

Module reference — scipy.stats two-sample tests
  • stats.ttest_ind(a, b, equal_var=False) — Welch’s two-sample t-test for the ATE.
  • stats.mannwhitneyu(a, b) — non-parametric alternative.
  • For binary outcomes: statsmodels.stats.proportion.proportions_ztest(count, nobs) for the difference of proportions.
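As a minimal sketch of the analysis step, the example below applies Welch's test to simulated trial data. The group sizes, means, and true effect of 0.5 are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated RCT (assumed parameters): true ATE is 0.5.
control = rng.normal(loc=10.0, scale=2.0, size=500)
treated = rng.normal(loc=10.5, scale=2.0, size=500)

ate_hat = treated.mean() - control.mean()

# Welch's t-test: does not assume equal variances in the two arms.
res = stats.ttest_ind(treated, control, equal_var=False)

# Normal-approximation 95% confidence interval for the ATE.
se = np.sqrt(treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size)
ci = (ate_hat - 1.96 * se, ate_hat + 1.96 * se)

print(f"ATE estimate = {ate_hat:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```

Reporting the interval alongside the p-value keeps the focus on the size of the effect rather than on significance alone.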

Power analysis — designing the trial before you run it

Before running the trial, every experimenter computes the sample size required to detect an effect of plausible magnitude with adequate power (typically 80%). The standard formula for a two-sample t-test, with effect size \(d = (\mu_1 - \mu_0) / \sigma\), significance level \(\alpha\), and power \(1 - \beta\):

\[ n_{\text{per group}} \;\approx\; \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2 \cdot 2. \]

For \(\alpha = 0.05\), \(1-\beta = 0.80\), and \(d = 0.2\) (small effect), \(n \approx 393\) per group. For \(d = 0.5\) (medium), the formula gives \(n \approx 63\); exact calculations based on the \(t\) distribution give 64. The smaller the effect you wish to detect, the larger the required sample, and committing to the sample size before seeing the data is what makes the inference honest.
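The formula translates directly into a few lines of code. This sketch uses the normal approximation given above; the helper name `n_per_group` is hypothetical.

```python
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # quantile corresponding to the desired power
    return 2 * ((z_alpha + z_beta) / d) ** 2

n_small = math.ceil(n_per_group(0.2))    # 393
n_medium = math.ceil(n_per_group(0.5))   # 63 (exact t-based solvers give 64)

print(n_small, n_medium)
```

Halving the detectable effect size quadruples the required sample, which is why small effects are expensive to study.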

Confounding and the Backdoor Path

When randomisation is impossible — observational data, retrospective studies, archival records — selection bias re-enters the picture. The formal language for this is the Directed Acyclic Graph (DAG): nodes are variables, edges are causal influences. A confounder \(W\) is a variable that causally influences both the treatment \(T\) and the outcome \(Y\).

  W ──> T ──> Y
   └──────────^

The edge \(W \to T\) and the edge \(W \to Y\) together create a backdoor path from \(T\) to \(Y\) that flows through \(W\) without going through the causal arrow. A naive regression of \(Y\) on \(T\) picks up both the direct effect \(T \to Y\) and the spurious correlation through the backdoor \(T \leftarrow W \to Y\). Pearl’s backdoor criterion says: to recover the causal effect, condition on a set of variables that closes every backdoor path.

In observational practice this means: include the confounders as regressors. But you must include the right ones — including the wrong variables (mediators, colliders) can introduce bias rather than remove it.
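A small simulation makes the backdoor path concrete. It is a sketch under assumed linear relationships (all coefficients and the `ols` helper are illustrative): the true effect of \(T\) on \(Y\) is 1.0, and \(W\) drives both.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Assumed DAG: W -> T, W -> Y, T -> Y with a true causal effect of 1.0.
w = rng.normal(size=n)
t = 0.8 * w + rng.normal(size=n)
y = 1.0 * t + 2.0 * w + rng.normal(size=n)

def ols(y, *cols):
    """Least-squares coefficients with an intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

naive_coef = ols(y, t)[1]         # backdoor path through W is open
adjusted_coef = ols(y, t, w)[1]   # conditioning on W closes it

print(f"naive = {naive_coef:.3f}, adjusted for W = {adjusted_coef:.3f}")
```

The naive coefficient lands well above 1.0, while the adjusted one sits close to it. The adjustment only works here because \(W\) is observed and enters linearly, both of which are assumptions of the simulation.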

The colliders pitfall

A collider is a variable caused by both \(T\) and \(Y\). Conditioning on a collider opens a non-causal path between \(T\) and \(Y\), the opposite of what you want. The classic example is the relationship between height and basketball talent. Making the NBA depends on both, so NBA membership is a collider. Among NBA players (selection on a collider), height and talent are negatively correlated, because exceptional talent is the only way for a short person to make the NBA. A naive analysis of the conditioned sample can therefore reverse the sign of the relationship seen in the general population.
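The selection effect is easy to reproduce. In this sketch, height and talent are generated as independent (a simplifying assumption), and "making the NBA" is modelled by a hypothetical threshold on their sum.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

height = rng.normal(size=n)
talent = rng.normal(size=n)            # independent of height by construction
in_nba = (height + talent) > 2.5       # collider: caused by both height and talent

corr_all = np.corrcoef(height, talent)[0, 1]
corr_nba = np.corrcoef(height[in_nba], talent[in_nba])[0, 1]

print(f"correlation in the population: {corr_all:.3f}")
print(f"correlation among selected players: {corr_nba:.3f}")
```

Nothing causal links the two variables in this simulation; the negative correlation among selected players is produced entirely by conditioning on the collider.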

The takeaway: drawing the DAG before running the regression is not a formality. It is what tells you which variables belong in the model and which do not.

Propensity Score Matching

When randomisation is impossible but every confounder is measured, propensity score matching (Rosenbaum & Rubin 1983) constructs a balanced comparison group:

  1. Propensity score: for each unit, estimate \(e(W) = P(T = 1 \mid W)\) — the probability of being treated given the confounders. Usually a logistic regression of \(T\) on \(W\).
  2. Match: each treated unit is paired with one (or more) control unit(s) with similar \(e(W)\).
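  3. Compare: compute the difference in mean outcomes on the matched sample. If every confounder is measured and the propensity model is correctly specified, this difference estimates the causal effect. When each treated unit is matched to controls, it is strictly the effect on the treated (ATT), which equals the ATE only if effects do not vary with \(W\).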

PSM is the workhorse of observational pharmacoepidemiology, of policy evaluation when RCTs are infeasible, and of any marketing attribution analysis on customers who self-selected into a campaign.

Module reference — propensity-score tools
  • sklearn.linear_model.LogisticRegression() — fit the propensity model \(P(T \mid W)\).
  • scipy.spatial.distance.cdist or sklearn.neighbors.NearestNeighbors — nearest-neighbour matching on propensity score.
  • For production use: statsmodels matching utilities, or the pymatch and causalinference packages outside Pyodide.
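The three steps can be sketched end to end on simulated data. To stay dependency-light, this example fits the logistic propensity model with a few Newton-Raphson iterations in NumPy instead of `LogisticRegression`, and matches by brute-force distance instead of `NearestNeighbors`. The data-generating process and the true effect of 2.0 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4000

# Assumed data: two measured confounders raise both treatment probability and outcome.
w = rng.normal(size=(n, 2))
true_logit = 0.8 * w[:, 0] + 0.5 * w[:, 1]
t = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))
y = 2.0 * t + 1.5 * w[:, 0] + 1.0 * w[:, 1] + rng.normal(size=n)   # true effect 2.0

# Step 1: propensity score e(W) = P(T = 1 | W) via logistic regression (Newton-Raphson).
X = np.column_stack([np.ones(n), w])
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (t - p)
    hess = (X * (p * (1 - p))[:, None]).T @ X
    beta += np.linalg.solve(hess, grad)
e_hat = 1 / (1 + np.exp(-X @ beta))

# Step 2: 1-nearest-neighbour matching (with replacement) on the estimated score.
treated_idx = np.flatnonzero(t == 1)
control_idx = np.flatnonzero(t == 0)
dist = np.abs(e_hat[treated_idx][:, None] - e_hat[control_idx][None, :])
matched_controls = control_idx[dist.argmin(axis=1)]

# Step 3: compare outcomes within matched pairs.
naive_diff = y[t == 1].mean() - y[t == 0].mean()
att_matched = (y[treated_idx] - y[matched_controls]).mean()

print(f"naive difference = {naive_diff:.3f}")
print(f"matched estimate = {att_matched:.3f}")
```

A real analysis would also check covariate balance after matching and trim units with propensity scores near 0 or 1, both omitted here for brevity.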

When treated units have, on average, higher values of confounders that also raise the outcome, the naive difference in means is biased upward. After matching, the comparison is between treated units and control units that look statistically similar in \(W\), and the recovered effect is close to the truth.

PSM’s Achilles heel: it controls only for measured confounders. Any unmeasured confounder remains a threat. That is the gap IV is designed to close.

Instrumental Variables (IV)

An instrument \(Z\) is a variable that influences the outcome \(Y\) only through the treatment \(T\), never directly. Three conditions:

  1. Relevance: \(Z\) has a non-zero effect on \(T\).
  2. Exclusion: \(Z\) has no direct effect on \(Y\) (it only acts through \(T\)).
  3. Independence: \(Z\) is uncorrelated with the unobserved confounders.

When you have such an instrument, two-stage least squares (2SLS) recovers the causal effect even under unmeasured confounding:

  • Stage 1: regress \(T\) on \(Z\) (and any controls). Get \(\hat T\).
  • Stage 2: regress \(Y\) on \(\hat T\) (and any controls). The coefficient is the local average treatment effect (LATE) — the effect for “compliers” who change behaviour in response to the instrument.

Real-world instruments: distance to college as an instrument for years of education (Card 1995), draft lottery number as an instrument for military service (Angrist 1990), differential distance to a hospital offering cardiac catheterisation as an instrument for intensive heart-attack treatment (McClellan, McNeil & Newhouse 1994), rainfall as an instrument for economic growth in agricultural economies (Miguel et al. 2004).

Module reference — statsmodels 2SLS
  • from statsmodels.sandbox.regression.gmm import IV2SLS — instrumental variables 2SLS.
  • IV2SLS(y, X_with_T, X_with_Z).fit() — pass the design matrices for the structural and instrument regressions.
  • For more sophisticated IV (LIML, GMM, weak-IV-robust inference): the linearmodels package outside Pyodide.
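To show the mechanics, the sketch below runs the two stages by hand in NumPy on simulated data with an unobserved confounder. All coefficients and the `ols` helper are assumptions. Standard errors from a hand-rolled second stage are not valid, so for inference a dedicated routine such as `IV2SLS` is the appropriate tool.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

# Assumed structure: U confounds T and Y and is NOT available to the analyst.
u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument: affects Y only through T
t = 0.7 * z + u + rng.normal(size=n)
y = 1.0 * t + 2.0 * u + rng.normal(size=n)    # true causal effect 1.0

def ols(y, *cols):
    """Least-squares coefficients with an intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

ols_coef = ols(y, t)[1]          # contaminated by U

a0, a1 = ols(t, z)               # Stage 1: project T on Z
t_hat = a0 + a1 * z
iv_coef = ols(y, t_hat)[1]       # Stage 2: regress Y on the fitted values

print(f"naive OLS = {ols_coef:.3f}, 2SLS = {iv_coef:.3f}")
```

With a single instrument, the 2SLS coefficient equals the ratio \(\operatorname{Cov}(Z, Y) / \operatorname{Cov}(Z, T)\), which is a useful sanity check on any implementation.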

When an unobserved confounder \(U\) raises both \(T\) and \(Y\), naive OLS is biased upward. 2SLS uses only the variation in \(T\) that comes from \(Z\), which is exogenous by assumption, and so recovers the causal effect.

When IV fails

IV is fragile in two failure modes. Weak instruments (\(Z\) barely influences \(T\)) blow up standard errors and bias estimates. The rule of thumb is a first-stage \(F\) statistic above 10. Exclusion-restriction violations (\(Z\) has a direct effect on \(Y\)) destroy the entire identification strategy and cannot be statistically tested — they must be defended on substantive grounds.

For the distance-to-college instrument, the two checks read as follows:

  1. Relevance: does the instrument actually predict the treatment? Compute the first-stage F-statistic; above 10 is the conventional threshold.
  2. Exclusion: does distance to college affect wages only through years of education, and not through any other channel (e.g., distance proxies for urban/rural location, which in turn proxies for labour-market opportunities)? This is not testable from data alone; it requires a substantive argument about the economic mechanism.
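The relevance check is the only one of the two that can be computed. The following sketch calculates the first-stage \(F\) statistic for a single instrument, where it equals the squared \(t\)-statistic of the instrument's coefficient. The function name and the simulated strong and weak first stages are illustrative assumptions.

```python
import numpy as np

def first_stage_f(t, z):
    """F statistic for H0: the coefficient on a single instrument z is zero."""
    X = np.column_stack([np.ones(len(t)), z])
    beta, *_ = np.linalg.lstsq(X, t, rcond=None)
    resid = t - X @ beta
    sigma2 = resid @ resid / (len(t) - X.shape[1])   # residual variance
    var_beta = sigma2 * np.linalg.inv(X.T @ X)       # classical OLS covariance
    return beta[1] ** 2 / var_beta[1, 1]

rng = np.random.default_rng(6)
n = 2000
z = rng.normal(size=n)
noise = rng.normal(size=n)

f_strong = first_stage_f(0.5 * z + noise, z)    # instrument moves the treatment a lot
f_weak = first_stage_f(0.02 * z + noise, z)     # instrument barely moves the treatment

print(f"strong first stage: F = {f_strong:.1f}")
print(f"weak first stage:   F = {f_weak:.1f}")
```

This version uses classical (homoskedastic) standard errors; with heteroskedastic or clustered data a robust variant of the statistic would be the safer choice.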

Regression Discontinuity Design (RD)

A regression discontinuity design exploits sharp cut-offs in policy or programme eligibility. When treatment is assigned by whether some “running variable” \(X\) crosses a threshold \(c\):

\[ T_i = \mathbf{1}\{X_i \ge c\}, \]

units just above and just below the threshold are essentially exchangeable — they differ in treatment status but not in unobserved characteristics. The discontinuity in \(\mathbb{E}[Y \mid X]\) at \(X = c\) identifies the causal effect at the threshold.

Real-world RD examples: Florida’s school-grading policy (Chiang 2009), medication dose recommendations based on biomarker thresholds, employment-tax credits triggered by income cut-offs, draft eligibility at age 18, scholarship awards at a GPA threshold.

The analysis is a local linear regression on each side of the cut-off, and the causal effect is the difference of intercepts at \(X = c\).
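A minimal sketch of that estimator on simulated data follows. The bandwidth `h` is chosen by hand here purely for illustration (in practice it would be selected by a data-driven rule), and the true jump of 2.0 is an assumption of the simulation.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
c = 0.0                                    # cut-off

x = rng.uniform(-1, 1, size=n)             # running variable
t = (x >= c).astype(int)                   # sharp assignment rule
y = 1.0 + 0.8 * x + 2.0 * t + rng.normal(scale=0.5, size=n)   # true jump 2.0

h = 0.25                                   # hypothetical bandwidth

def intercept_at_cutoff(mask):
    """Fit a line on one side of the cut-off and return its value at x = c."""
    X = np.column_stack([np.ones(mask.sum()), x[mask] - c])
    beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return beta[0]

left = (x < c) & (x >= c - h)
right = (x >= c) & (x <= c + h)
rd_effect = intercept_at_cutoff(right) - intercept_at_cutoff(left)

print(f"estimated jump at the cut-off = {rd_effect:.3f}")
```

Centring the running variable at \(c\) is what makes each intercept the fitted value at the threshold, so the difference of intercepts is the estimated discontinuity.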

Difference-in-Differences (DiD)

DiD is the right tool when a policy or event affects one group at a known point in time and a comparable control group is unaffected. The estimator nets out (a) pre-existing differences between groups and (b) general time trends.

For two groups (treated, control) and two periods (before, after):

\[ \hat\tau_{\text{DiD}} = (\bar Y^{\text{treat}}_{\text{post}} - \bar Y^{\text{treat}}_{\text{pre}}) \;-\; (\bar Y^{\text{ctrl}}_{\text{post}} - \bar Y^{\text{ctrl}}_{\text{pre}}). \]

The identifying assumption is parallel trends — that the two groups would have followed parallel paths in the absence of treatment. You cannot test parallel trends in the post-treatment period (the treatment is in the way), but you should always plot the pre-treatment trends and verify they look parallel before running the analysis.
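The two-by-two estimator can be sketched on simulated data in which parallel trends hold by construction. The group gap of 2.0, common time trend of 1.0, and true effect of 3.0 are assumptions, and `cell` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5_000   # units per group-period cell

def cell(group, post):
    """Outcomes = baseline + group gap + common time trend + treatment effect + noise."""
    return 5.0 + 2.0 * group + 1.0 * post + 3.0 * group * post + rng.normal(size=n)

y_ctrl_pre, y_ctrl_post = cell(0, 0), cell(0, 1)
y_treat_pre, y_treat_post = cell(1, 0), cell(1, 1)

# Difference-in-differences from the four cell means.
did = (y_treat_post.mean() - y_treat_pre.mean()) - (y_ctrl_post.mean() - y_ctrl_pre.mean())

# Two naive comparisons, each contaminated by one of the terms DiD removes.
naive_post = y_treat_post.mean() - y_ctrl_post.mean()            # effect + group gap
naive_before_after = y_treat_post.mean() - y_treat_pre.mean()    # effect + time trend

# Equivalent regression: the interaction coefficient is the DiD estimate.
y = np.concatenate([y_ctrl_pre, y_ctrl_post, y_treat_pre, y_treat_post])
group = np.repeat([0, 0, 1, 1], n)
post = np.repeat([0, 1, 0, 1], n)
X = np.column_stack([np.ones(4 * n), group, post, group * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_regression = beta[3]

print(f"DiD = {did:.3f}, regression interaction = {did_regression:.3f}")
print(f"post-only comparison = {naive_post:.3f}, before-after = {naive_before_after:.3f}")
```

The regression form is the one that generalises: covariates, multiple periods, and clustered standard errors can all be added to it.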

Real-world DiD: minimum-wage studies comparing neighbouring states (Card & Krueger 1994), tobacco-tax effects on smoking rates, schooling-reform impacts, vaccination-policy rollouts, advertising-spend lift studies in marketing.

Sensitivity Analysis and Common Pitfalls

Every causal estimate rests on an untestable identifying assumption. The discipline is to enumerate the assumption, assess its plausibility, and quantify how much your conclusion would change if the assumption were violated by a plausible amount.

| Method | Identifying assumption | What breaks it |
|---|---|---|
| RCT | Random treatment assignment | Failed randomisation, attrition |
| PSM | All confounders measured (“unconfoundedness”) | Unmeasured confounding |
| IV | Instrument valid (relevance + exclusion + independence) | Weak instrument or exclusion violation |
| RD | Continuity of potential outcomes at the threshold | Manipulation of the running variable |
| DiD | Parallel trends in absence of treatment | Group-specific shocks at the treatment time |

Common pitfalls to memorise:

  • Controlling for a collider. Adding a variable downstream of both treatment and outcome introduces bias rather than removing it. Draw the DAG first.
  • Bad controls. Variables that are caused by the treatment (“post-treatment variables”) should not be in the regression.
  • Manipulation around the threshold (RD). If subjects can self-select into “just above the cut-off,” the RD assumption fails. Plot the density of the running variable around the threshold — a spike at \(c^+\) is the canonical warning sign.
  • Spillovers (RCT and DiD). Treatment of one unit affects untreated units (vaccine herd immunity, classroom-peer effects). This violates the stable-unit-treatment-value assumption (SUTVA). Use cluster-randomised designs or model the spillover explicitly.
  • Heterogeneity. Many methods estimate an average effect, but the effect may vary widely across subgroups. Report effect modification when economically meaningful.

The single most important habit: never report a causal estimate without naming the identifying assumption it rests on, and without checking how sensitive the estimate is to plausible violations of that assumption. This is what distinguishes a published causal study that survives replication from one that doesn’t.

Unmeasured confounding can be probed with a formal sensitivity analysis, for example Rosenbaum bounds or the E-value (VanderWeele & Ding 2017). The question is: how strong would an unmeasured confounder have to be, in its association with both the treatment and the outcome, to fully explain away the observed effect? If the answer is “modest” (an unmeasured factor correlated with both at the level of, say, the strongest confounder you already adjusted for), the causal claim is fragile. If the answer is “implausibly large,” the claim is robust.
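For an effect reported as a risk ratio, the E-value has a closed form, \(\text{RR} + \sqrt{\text{RR}(\text{RR} - 1)}\), applied to the reciprocal when \(\text{RR} < 1\). The sketch below implements only this point-estimate formula; the function name is hypothetical, and the example risk ratio of 2.0 is illustrative.

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio (point estimate only)."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1 / rr              # protective effects: work with the reciprocal
    return rr + math.sqrt(rr * (rr - 1))

ev = e_value(2.0)
print(f"E-value for RR = 2.0: {ev:.2f}")
```

An E-value of about 3.41 for an observed risk ratio of 2.0 means an unmeasured confounder would need risk-ratio associations of roughly that size with both treatment and outcome to fully explain the effect away. The same formula can be applied to the confidence limit closest to 1 to judge how robust the interval is.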

Chapter Wrap-up

The five workhorses of classical causal inference cover most of what an empirical analyst will encounter:

  1. RCT when you can randomise.
  2. PSM when you have all confounders measured.
  3. IV when you have an exogenous instrument.
  4. RD when there is a sharp threshold.
  5. DiD when you have a parallel control group through time.

In Chapter 9 we extend this toolkit to settings with high-dimensional, nonlinear confounding — Double Machine Learning, Doubly Robust estimation, and structural causal models — that use modern ML as the nuisance-model layer while preserving the causal-inference logic developed here.

Causal inference is the difference between a number you can publish and a number you can act on. Master it and you have moved one rung up from “applied statistician” to “applied scientist.”


Prof. Xuhu Wan · HKUST ISOM · Learning Statistics in Python