Contents

Tap any chapter to start reading.

Chapter 1 Distributions, Tails, and Anomalies

Empirical and theoretical distributions, kernel density estimation (KDE), bootstrap confidence intervals, hypothesis testing, experimental design and A/B testing (power, sequential, bandits), association, extreme value theory (EVT), anomaly detection, and multiple-testing correction.

Chapter 2 Statistical Predictive Models

Multiple linear regression, diagnostics and leverage, nonlinear transformations, variable selection, cross-validation, Ridge / LASSO / Elastic Net, PCA, tree ensembles.

Chapter 3 Causal Inference Beyond Regression

Randomized controlled trials, confounding and the backdoor path, propensity score matching, instrumental variables, regression discontinuity, difference-in-differences, sensitivity analysis.

Chapter 4 Rethinking Statistics with Bayesian Methods

Bayes’ theorem, conjugate Beta-Binomial and Normal-Normal updates, hand-coded Metropolis-Hastings MCMC, robust regression with Student-t errors, Bayesian linear regression, change-point detection, hierarchical shrinkage.

Chapter 5 Time Series Models

Pandas time-series methods, ADF + KPSS stationarity tests, ACF / PACF, ARIMA for the conditional mean, GARCH for volatility clustering, cointegration, Markov-switching regimes, a full mean-reversion backtest.

Chapter 6 Clustering for Unsupervised Pattern Discovery

K-means, hierarchical agglomerative clustering with dendrograms, DBSCAN, Gaussian mixture models, spectral clustering, cluster validation, and Hierarchical Risk Parity (HRP).

Chapter 7 Pattern Recognition

The full pipeline: framing, feature engineering, classifier zoo (KNN, logistic, SVM, gradient boosting), t-SNE visualisation, Hidden Markov Models, template matching, a worked signal hunt, and the six ways patterns lie.

Chapter 8 Embeddings, Vector Search, and LLM-Aided Features

Graph neural networks, embeddings of unstructured text, vector databases & retrieval-augmented generation, cross-modal embeddings (CLIP), domain fine-tuning, and LLM-aided structured-field extraction. The methods layer that turns unstructured data into statistical features.

Chapter 9 Foundation Models, Causal AI & Symbolic Regression

Zero-shot time-series forecasting with Chronos / TimeGPT, Double Machine Learning for causal effects under high-dimensional confounding, symbolic regression for interpretable equation discovery.

Chapter 10 Interpretability and Explainable AI

The interpretability spectrum, permutation feature importance, partial dependence plots and ICE, SHAP local explanations, LIME local surrogates, counterfactual explanations, and the discipline of communicating results to stakeholders.


How to read this book

Every Python code block in this book runs live in your browser. Click into any cell, edit it, press the ▶ Run button, and see the output. The Python engine (Pyodide) downloads once, when you open the first chapter; after that, cells run without the initial wait.

This book assumes you are already comfortable with pandas DataFrames, NumPy arrays, and basic plotting in Python. If those words make you nervous, work through an introductory Python-for-data-analysis book first and then return.

The arc of the book in one paragraph

Chapters 1–7 are the durable statistical core: distributions and inference, predictive models, causal inference, Bayesian methods, time series, clustering, and pattern recognition. These methods have been in production for decades and will be for decades more. Chapters 8–9 are the modern AI-driven superstructure: embeddings, vector search, and LLM-aided extraction that turn unstructured data into statistical features, plus foundation models, machine-learning-based causal inference, and symbolic regression. Chapter 10 is the interpretability layer that makes any of it usable in regulated, high-stakes settings.

What this book is not

This is not a course on neural networks per se, on reinforcement learning, or on production MLOps. Those are downstream of the foundations covered here. Get the foundations right; the rest is implementation.

Prof. Xuhu Wan · HKUST ISOM · Learning Statistics in Python