Chapter 9: Foundation Models, Causal AI & Symbolic Regression

Chapter Introduction

Chapter 8 installed the front-end of the AI-driven workflow: embeddings, vector search, GNNs, and LLM-aided extraction that turn unstructured data into a clean numeric feature matrix. This chapter installs the back-end — three machine-learning-scale techniques that operate on that matrix and produce decisions you can defend.

The three:

Foundation models for time series — large transformer or state-space models pretrained on hundreds of millions of time series across every domain (energy, web traffic, retail, healthcare, finance). They produce zero-shot forecasts on series they have never seen and act as a universal baseline against which every custom forecasting model must justify itself.
Causal AI at ML scale — Double Machine Learning (DoubleML) and structural causal models extend the classical causal-inference toolkit of Chapter 3 to settings with hundreds of nonlinear controls. This is the machinery that lets a research team estimate a clean treatment effect under high-dimensional confounding.
Symbolic regression — let the computer search the space of equations themselves. The output is a one-line formula a regulator or stakeholder can audit. This is the answer to “explainable AI” for any setting where a black-box neural network is unacceptable.

Real-world grounding: Nubank uses Amazon’s Chronos foundation model in production for credit-loss forecasting. Microsoft’s Causal Inference team and Amazon’s CausalLift built DoubleML into their experimentation platforms. NASA-JPL uses PySR to rediscover physical laws from sensor data, and recent quantitative-finance papers use it to derive options-pricing formulas. Bloomberg’s BloombergGPT and Google’s TimesFM are commercial foundation models trained on financial and general time-series data respectively. None of this is research speculation — all are deployed and producing decisions today.

A note on browser-honesty. Real foundation models are gigabytes and require GPUs; they don’t run in Pyodide. What runs in the browser is the workflow: a smaller surrogate model standing in for Chronos in the same harness; a hand-rolled DoubleML in a dozen lines of NumPy and sklearn; a LASSO-driven symbolic regression that demonstrates the same principle as PySR. In production you swap the surrogate for the real model; the surrounding code is identical.

Foundation Models for Time Series
Tabular Foundation Models — TabPFN and TabLLM
Causal AI — Double Machine Learning and Structural Models
Causal Discovery — Learning the DAG from Data
Heterogeneous Treatment Effects and Causal Forest
Mediation Analysis — Decomposing Direct and Indirect Effects
Symbolic Regression — Discovering the Equation Itself
SINDy — Sparse Identification of Nonlinear Dynamics
The AI-Driven Research Workflow and Its New Pitfalls

Foundation Models for Time Series

The 2022–2024 period brought a genuinely new class of model into mainstream production: foundation models for time series. These are large transformer or state-space models pretrained on hundreds of millions of time series across every domain — energy, web traffic, retail, finance, healthcare — released as off-the-shelf forecasters that produce zero-shot or few-shot predictions on series they have never seen.

The leading names:

TimeGPT (Nixtla, 2023) — closed-source API; the first commercial foundation forecaster.
Chronos (Amazon Science, 2024) — open-source, available on Hugging Face; runs locally on CPU/GPU. Nubank reports using Chronos in production for default-loss forecasting.
Lag-Llama (ServiceNow, 2024) — decoder-only autoregressive foundation TS model.
Moirai (Salesforce, 2024) — universal time-series forecasting model.
TimesFM (Google, 2024) — decoder-only, pretrained on 100 billion time points.

The remarkable property is zero-shot transfer: feed the model a series it has never seen — say, daily ridership of a transit system, hourly energy demand at a utility, or daily admissions at a hospital — and it produces a forecast that is competitive with a carefully-tuned ARIMA-GARCH or LSTM. The trade-off is interpretability: you cannot inspect the foundation model’s “AR(1) coefficient.” You can only inspect its forecasts.

In a working research pipeline these models are most useful as a baseline — if your hand-tuned model can’t beat Chronos zero-shot on out-of-sample MAE, you have not yet built a model. They are also useful for rapid prototyping on series where you don’t yet have the domain knowledge to specify ARIMA orders.

Foundation models are too large to run in Pyodide. The cell below simulates the conceptual comparison: a “frozen baseline” forecaster (exponential smoothing as a Chronos stand-in) versus a fitted ARIMA on a held-out tail. In production you would replace the surrogate with chronos.predict() from a real model, and observe that on many series the foundation model is competitive with an ARIMA that took an experienced analyst an hour to specify and fit.

Production rule of thumb: always run a foundation-model baseline before you ship any custom time-series model. If the custom one can’t beat the foundation model, the custom one isn’t a model — it’s a story.

Tabular Foundation Models — TabPFN and TabLLM

Foundation models started in NLP, expanded to images and time series, and reached tabular data in 2023. The two front-runners as of 2026:

TabPFN (Hollmann et al., 2023; v2 in 2024) — a transformer pre-trained on millions of synthetic tabular datasets. At inference time you pass your entire training set plus a query row, and the model returns a prediction in a single forward pass. No gradient training on your data. Competitive with — and on small datasets often better than — XGBoost.
TabLLM (Hegselmann et al., 2023) — turns rows into natural-language sentences, prompts a general LLM, parses out the prediction. Strong few-shot performance with as few as four labelled examples.

Two reasons these matter for a working analyst:

Speed of iteration. No grid search, no cross-validated hyperparameter tuning. You hand the model a small dataset, you get predictions in seconds. The right tool for rapid prototyping.
Few-shot regimes. Classical ML needs hundreds to thousands of labels. Foundation tabular models perform well with tens. This unlocks problems where labelling is expensive (medical, legal, scientific annotation).

The published benchmarks: TabPFN beats XGBoost on most of the OpenML AutoML benchmark suites at \(n < 1{,}000\) rows. Past \(\sim 10{,}000\) rows the gap closes; past \(\sim 100{,}000\) rows gradient-boosting still wins. So the rule is: for small tabular data, try TabPFN first.

Module reference — tabular foundation models

Outside Pyodide: - pip install tabpfn; then from tabpfn import TabPFNClassifier. Pretrained weights download on first use. - TabPFN handles classification (up to 100 features, 10 classes, 10,000 samples in v2). Regression variants are in development. - TabLLM is a research codebase rather than a pip package; production use requires a hosted LLM endpoint. - For non-foundation tabular ML, xgboost, lightgbm, catboost remain the standard.

Below is a conceptual demonstration in browser-friendly form: we show what “in-context learning” looks like on tabular data using a nearest-neighbour stand-in (since TabPFN itself doesn’t run in Pyodide). The structural property — predictions emerge from a single pass over training context, with no per-dataset training — is preserved.

The point of the cell is structural: a model that requires no training on your specific dataset can still be predictive if it has been pre-trained on enough representative tabular tasks. That is exactly the bet TabPFN makes — and on small data it pays off.

Causal AI — Double Machine Learning and Structural Models

Chapter 3 introduced the classical causal toolkit (RCTs, IV, RD, DiD, PSM). All of those methods carry an implicit assumption: the controls are a small, well-chosen set. In modern problems — predicting customer response under hundreds of nonlinear demographic, behavioural, and contextual variables; estimating treatment effects in massive observational health datasets — the right “control set” is high-dimensional and the relationships between controls and outcomes are nonlinear. Classical methods break under those conditions.

Double Machine Learning (DoubleML) (Chernozhukov et al., 2018) is the modern extension. Two-step procedure:

Step 1: Fit a flexible (ML) model of \(y\) on the controls \(W\), and another of \(x\) on \(W\). Compute the residuals \(\tilde y\) and \(\tilde x\).
Step 2: Run OLS of \(\tilde y\) on \(\tilde x\). The coefficient is the causal effect, robust to flexible nonlinear confounding.

DoubleML is the workhorse at Microsoft’s causal-inference team, at Amazon’s CausalLift, at Uber’s experimentation platform, and increasingly at any organisation that must estimate treatment effects from observational data. The Python package econml ships the full machinery; for a one-off you can implement DoubleML in twelve lines of NumPy / sklearn.

Module reference — causal libraries

Outside Pyodide: - econml (Microsoft) — DoubleML, instrumental variables, heterogeneous treatment effects. Production-grade. - dowhy (Microsoft) — DAG-first causal inference with explicit identification step. - causalnex — Bayesian-network causal graphs. - causalml (Uber) — uplift modelling. Inside Pyodide you can implement DoubleML by hand with sklearn.ensemble.RandomForestRegressor for the nuisance models and statsmodels.api.OLS for the final step — that is what the cell below does.

The naïve regression coefficient is biased because \(X\) and \(y\) are both functions of \(W\). DoubleML’s two-stage residualisation removes the contamination and recovers the true causal effect. In production this is the difference between deciding to deploy a model that appears to add value and one that actually does.

DoubleML’s relationship to Chapter 3:

Chapter 3 method	DoubleML extension
OLS with controls	DoubleML with ML nuisance models
Propensity score matching	Doubly-robust estimator (AIPW)
Instrumental variables	DMLIV (DoubleML for IV)
Difference-in-differences	DML-DiD (Chang 2020)

Pick the classical method when you have a small, well-defined set of controls and parametric assumptions you trust. Pick DoubleML when you have many controls and need flexible nonlinear adjustment.

The OLS β is the partial correlation — not the causal effect — unless every confounder of “saw the email” and “churn” is also in the regression. Likely confounders: customer tenure, recent engagement, plan type, time-since-last-purchase. Re-run with DoubleML (or PSM from Chapter 3) controlling for these. If the DoubleML estimate collapses to near zero, the original effect was confounded by the targeting rule that decided who got the email; if it remains negative the causal claim is supported.

Causal Discovery — Learning the DAG from Data

Chapter 3 assumed you already knew the DAG. Often you don’t. Causal discovery is the family of algorithms that learn the DAG — or at least its statistically identifiable part — from observational data. Three algorithm families:

Constraint-based (PC, FCI). Test conditional independencies between variables; rule out edges inconsistent with the observed independencies. Robust when the data is clean and the independence tests are powerful.
Score-based (GES, BIC search). Search over DAG structures; pick the one with the highest penalised likelihood.
Functional / continuous-optimisation (NoTears, Zheng et al. 2018; DAGMA, Bello et al. 2022). Reformulate the discrete DAG-search as a continuous optimisation problem with a “no cycles” smooth constraint. Used at Microsoft Research and DeepMind.

A DAG learned from observational data alone is identifiable only up to a Markov equivalence class — a family of DAGs with the same set of conditional independencies. The output is conventionally a CPDAG (partially-directed DAG): some edges are oriented; others are reversible. To break the equivalence and orient the remaining edges, you need an intervention (an RCT) or strong domain knowledge.

Causal discovery is in production at Google’s experimentation team, Microsoft’s AzureML “Why Labs,” healthcare research at Mayo Clinic, and increasingly at any firm trying to extract causal hypotheses from observational logs at scale.

Module reference — causal discovery

Outside Pyodide: - causal-learn — Python implementations of PC, GES, FCI, NoTears, GIN, and more (CMU-supported). - dowhy includes a dowhy.gcm module with structural-causal-model fitting and counterfactual reasoning. - pgmpy — Bayesian-network learning and inference. - For NoTears specifically: the notears package (linear and nonlinear variants).

A small PC-algorithm-style sketch in NumPy: test all pairwise correlations conditional on each candidate third variable; remove edges where conditional correlation is near zero. Production code does this systematically up to higher-order conditioning sets and orients edges using v-structures, but the conceptual core is the same.

The discovered skeleton — the undirected edges of the CPDAG — matches the true graph. With orientation rules (v-structures, Meek’s rules) you would then orient as many edges as the data allows. The discovered DAG is the starting point for any subsequent causal-effect estimation: it tells you which backdoor adjustment is valid before you run the DoubleML.

Run a causal discovery algorithm (PC or NoTears) to learn an approximate DAG. (2) Read the backdoor set for the treatment-of-interest from the learned DAG. (3) Run DoubleML with that backdoor set as controls. (4) Run sensitivity analysis (E-value, Rosenbaum bounds) to assess robustness to unmeasured confounding. The first step suggests a model; the second and third estimate the effect; the fourth audits the assumption.

Heterogeneous Treatment Effects and Causal Forest

Average treatment effects (ATE) tell you what happens to a typical unit. Heterogeneous (or Conditional) Average Treatment Effects (CATE) tell you how the effect varies across subgroups:

\[ \tau(x) \;=\; \mathbb{E}[Y(1) - Y(0) \mid X = x]. \]

Knowing \(\tau(x)\) is the bedrock of personalisation: in marketing, target the campaign at customers for whom \(\tau(x) > 0\) and skip the rest. In healthcare, give the treatment to patients with the largest positive effect. In policy, allocate scarce resources to recipients who benefit most.

Three algorithms dominate:

T-Learner — fit one outcome model on treated, another on control; difference of predictions is \(\hat\tau(x)\).
S-Learner — fit a single model on \((X, T)\); difference \(\hat f(x, 1) - \hat f(x, 0)\).
X-Learner (Künzel et al. 2019) — combines the above with cross-fitted residuals. Strong when one group is much smaller than the other.
Causal Forest (Wager & Athey 2018) — a random forest whose splits maximise heterogeneity of treatment effect. Built into Microsoft’s econml library.

Real-world: Uber Eats uses causal forests to decide which customers see promotions; Microsoft Ads uses CATE estimation for bid optimisation; Optum Labs uses causal forests to identify which patients benefit most from specific care-management interventions.

Once you have \(\hat\tau(x)\), the decision rule is direct: treat the units for whom \(\hat\tau(x) > c\) (the cost threshold). This is uplift modelling, the staple of any modern direct-marketing pipeline.

Mediation Analysis — Decomposing Direct and Indirect Effects

Sometimes the causal effect of \(X\) on \(Y\) flows through an intermediate variable \(M\) — the mediator. Mediation analysis decomposes the total effect into the part that goes through \(M\) (indirect effect) and the part that doesn’t (direct effect):

\[ \underbrace{\text{Total effect}}_{X \to Y} = \underbrace{\text{Direct effect}}_{X \to Y \text{ holding } M \text{ fixed}} + \underbrace{\text{Indirect effect}}_{X \to M \to Y}. \]

The classical approach (Baron & Kenny 1986) requires two regressions:

\(M = a_0 + a \cdot X + \epsilon_1\) (effect of \(X\) on the mediator).
\(Y = b_0 + b \cdot M + c \cdot X + \epsilon_2\) (effect of \(M\) on \(Y\) given \(X\)).

Indirect effect = \(a \cdot b\). Direct effect = \(c\). Total effect = \(a \cdot b + c\).

The modern approach uses the potential-outcomes framework (Imai et al. 2010) to define natural direct and natural indirect effects under explicit assumptions and to allow nonlinear mediators / outcomes.

Use cases: marketing wants to know if an ad campaign drives sales directly or via brand-awareness lift; epidemiology wants to know if a treatment effect operates through a measured biomarker; policy wants to know if a job-training programme raises wages directly or via the credentials it confers.

The decomposition recovers the true direct and indirect effects. In a marketing context the right reading would be “60% of the total ad-campaign effect on sales operates through brand-awareness lift; the remaining 40% is direct.” That decomposition tells the campaign team where to invest next.

Symbolic Regression — Discovering the Equation Itself

Symbolic regression searches not over the parameters of a fixed model class, but over the space of equations themselves. Given \((X, y)\), it returns an explicit mathematical expression — like \(y = 2.1 x_1 \sin(x_2) + 0.8 \log(x_3)\) — that fits the data well and is short enough to be interpretable.

The leading implementations:

PySR (Cranmer et al., 2023) — a Python wrapper around the Julia-backed SymbolicRegression.jl. Used at NASA-JPL, Princeton physics, and in recently published quantitative-finance papers to rediscover Black–Scholes-like option-pricing formulas from data.
gplearn — pure-Python genetic-programming symbolic regressor; slower but pip-installable everywhere.
AI Feynman (Tegmark group, 2020) — symbolic regression specialised for physics; recovered ~100 textbook physics equations from raw data.

Symbolic regression is the antithesis of black-box ML: it tries to write a paragraph the model can explain to you. Where neural networks produce predictions you can’t audit, symbolic regression produces equations you can.

A small symbolic regressor in plain NumPy fits in 60 lines. The example below uses a LASSO-driven feature search to find \(y = 2 x \sin(x) + 1\) from noisy data.

The recovered equation picks the right term (x*sin(x) with coefficient near 2) and the right constant (near 1), with the rest of the terms set to zero by the \(\ell_1\) penalty. In a production symbolic regressor (PySR), the search space includes nested operations and the algorithm is genetic programming rather than LASSO — but the principle is the one shown above: search a structured space of expressions and report the short, well-fitting ones.

Where symbolic regression beats neural networks in real production: physics, chemistry, engineering, options pricing, ESG-rating modelling, and any regulated setting where the auditor needs a one-line explanation of why the model produced its number. Chapter 10 will return to the broader question of explainability and how to communicate results from any model — including the ones whose equations you cannot write down.

SINDy — Sparse Identification of Nonlinear Dynamics

A specialised cousin of symbolic regression, SINDy (Brunton, Proctor & Kutz 2016) is designed for dynamical systems — settings where you observe a trajectory \(x(t)\) and want to recover the ODE \(\dot x = f(x)\) that produced it. The algorithm:

Compute numerical derivatives \(\dot x\) from the trajectory.
Build a library of candidate basis functions \(\Theta(x)\): polynomials, sines, products.
Solve a sparse linear system \(\dot x = \Theta(x) \,\xi\), where most entries of \(\xi\) are forced to zero by LASSO or sequential thresholding.

The output is a parsimonious differential equation. SINDy has been used to rediscover the Lorenz attractor, fluid-dynamics PDEs, climate-model equations, and metabolic pathway dynamics from data — sometimes from just a few thousand observations. Production use cases include digital-twin modelling at GE Aerospace, equation discovery at Pasteur Institute, and high-frequency tick-data dynamics at certain quant teams.

The Lasso-driven SINDy recovers the structural form of the Lorenz equations and gets the coefficients close to truth. The remarkable point is that the algorithm has no prior knowledge of which terms should appear — it tests a library of 30 candidates and the \(\ell_1\) penalty zeroes out 27 of them, leaving the right three terms for each equation.

The same procedure works on any sparsely-driven dynamical system. The reach of the method is limited mainly by the quality of the numerical derivatives and the completeness of the candidate library.

The AI-Driven Research Workflow and Its New Pitfalls

A working modern research pipeline looks like:

Ingest raw data — structured + unstructured.
Embed unstructured pieces into numeric vectors (Chapter 8).
LLM-extract structured fields where applicable (Chapter 8).
Classical statistical machinery (Chapters 1–7) applied to the combined numeric feature matrix.
Causal validation (Chapter 3 + DoubleML above) before any feature is allowed to drive a decision.
Foundation-model baselines as a sanity check at every forecasting step.
Symbolic / interpretable model layer for any output that must be explained to a stakeholder or regulator.
Explainability layer (Chapter 10) — even when the model is not symbolic, every prediction must be locally explainable for downstream decisions.

The architectural questions — how features attach to a domain ontology, how lineage is captured, how the pipeline is governed — belong to the practitioner’s discipline of domain modelling and are covered in the companion volume Domain Modelling in Python.

The discipline from Chapter 7’s closing list still applies — selection bias, look-ahead, survivorship, overfitting, regime change, operational unreality — and two new pitfalls have been added by the AI layer:

Embedding drift. A transformer trained in 2022 will produce different embeddings of the same headline in 2026 if you upgrade to a new version. Lock the embedding model with a version pin.
LLM hallucination in feature extraction. Require structured-output prompts and human spot-checks.

The fundamental honesty has not changed. AI gives you new feature pipelines and new baselines; the question — “would this pattern survive an out-of-sample test on data not used to find it?” — is the same question Chapter 7 ended on. Use the AI layer to generate candidate patterns faster; use the classical layer to kill the ones that don’t generalise.

Re-run the analysis with the LLM’s outputs replaced by a deterministic rule-based extractor on the same text. If the result collapses, the LLM is adding genuine information; if it stays high, the LLM is leaking future information through its prompt context. (2) Compute the result on data the LLM cannot have been pre-trained on — e.g., a held-out future quarter or a private dataset. If the result degrades sharply, the apparent uplift came from training-data contamination, not from real signal.

← Chapter 8 · Contents · Chapter 10: Interpretability & Explainable AI →