Chapter 10: Interpretability and Explainable AI

Chapter Introduction

A model that produces accurate predictions but cannot be explained to the people who must act on it has a hard ceiling on its usefulness. A bank cannot deploy a credit-scoring model whose decisions the regulator does not understand. A hospital cannot deploy a diagnosis-support tool whose reasoning the physician cannot inspect. A hiring team cannot use an algorithmic shortlist whose criteria a candidate cannot appeal. Interpretability — the ability to explain why a model produced a particular prediction — is no longer a research curiosity. It is a regulatory requirement, an ethical baseline, and increasingly a competitive advantage.

The 2024–2026 regulatory environment has made this concrete. The EU AI Act (in force since August 2024) classifies hiring, lending, education, and several other applications as “high risk” and attaches transparency, documentation, and human-oversight obligations to them, including a right to an explanation of individual decisions that rely on such systems. US bank supervisors, including the OCC (Office of the Comptroller of the Currency), expect banks to be able to explain their machine-learning credit models to examiners. GDPR Article 22 (applicable since 2018) restricts decisions based solely on automated processing, and Articles 13–15 entitle the people affected to “meaningful information about the logic involved.” FDA guidance for AI/ML-enabled medical devices increasingly asks for interpretability documentation.

The chapter teaches the standard interpretability toolkit, organised from simplest to most sophisticated. Feature importance answers “which features matter overall?” Partial dependence plots and ICE curves answer “how does the prediction change as one feature varies?” SHAP (Shapley Additive exPlanations) decomposes each individual prediction into per-feature contributions using a result from cooperative game theory. LIME (Local Interpretable Model-agnostic Explanations) fits a simple model around each prediction to approximate the complex one locally. Counterfactual explanations answer “what would have to change for the prediction to flip?” — the format a regulator or a denied applicant most naturally understands.

Most of these methods are model-agnostic: they work on any predictor, whether linear, tree-ensemble, neural network, or foundation model (TreeSHAP, which exploits tree structure, is the main exception). That generality is what lets one explanation pipeline serve every model a team ships. By the end of the chapter you will be able to take a black-box model from any of the previous chapters and produce the kind of explanation a regulator, a clinician, or a portfolio manager can sign off on.

The Interpretability Spectrum
Permutation Feature Importance
Partial Dependence Plots and ICE Curves
SHAP — Shapley Additive Explanations
TreeSHAP and SHAP Interaction Values
LIME — Local Surrogate Models
Anchors — High-Precision Local Rules
Counterfactual Explanations
Model Calibration — Brier Score and Reliability Diagrams
Fairness Metrics and Slice-Level Analysis
Model Cards and Regulatory Documentation
Communicating Results to Stakeholders

The Interpretability Spectrum

Two orthogonal distinctions structure the field.

Global vs. local

A global explanation describes the overall behaviour of the model — “the model relies most heavily on income and credit-utilisation.” A local explanation describes a single prediction — “for this applicant, the credit-utilisation feature contributed −0.18 to the predicted default probability.” Global explanations are for model auditors, regulators, and the team that built the model. Local explanations are for end users (the applicant, the clinician, the portfolio manager) who care about a single decision.

Intrinsic vs. post-hoc

Intrinsic interpretability comes from the model itself. A linear regression’s coefficients are intrinsic explanations. A small decision tree is intrinsically interpretable. Post-hoc interpretability is computed after the model is fit, using a separate explanation method. SHAP, LIME, partial dependence, and counterfactuals are post-hoc.

Pick the simplest possible model class consistent with the required accuracy. If a regularised linear regression performs nearly as well as a deep ensemble, use the regression — the intrinsic interpretability is free. Reach for post-hoc methods when the accuracy gap forces you to deploy a complex model.

Real-world stack at most regulated firms

Train the complex model (gradient-boosted trees, ensemble).
Compute global feature importance (permutation) for model documentation.
Compute partial dependence for top-K features to show monotonicity / shape.
For every production prediction, attach a SHAP explanation in the audit log.
For any prediction that drives an adverse action (denial, alert), generate a counterfactual the user can read.

Permutation Feature Importance

The simplest global explanation. For each feature $j$:

Measure the model’s baseline performance (RMSE, accuracy, log-loss) on a held-out set.
Randomly permute the values of feature $j$ across rows. This destroys whatever signal it carried.
Re-measure performance. The drop is the importance of $j$.
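
The three steps above can be sketched by hand in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the `predict` function is a hypothetical black box (linear effects on columns 0, 3, 5 and a sine term on column 7), and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black box: linear effects on columns 0, 3, 5 and a sine on column 7.
def predict(X):
    return 2.0 * X[:, 0] - 1.5 * X[:, 3] + 1.0 * X[:, 5] + np.sin(3.0 * X[:, 7])

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def permutation_importance_by_hand(predict_fn, X, y, n_repeats=10, rng=rng):
    baseline = rmse(y, predict_fn(X))                      # step 1: baseline error
    importances = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # step 2: shuffle column j
            importances[j, r] = rmse(y, predict_fn(X_perm)) - baseline  # step 3: error increase
    return importances.mean(axis=1), importances.std(axis=1)

# Synthetic held-out set with a little observation noise.
X_test = rng.normal(size=(500, 10))
y_test = predict(X_test) + rng.normal(scale=0.1, size=500)

imp_mean, imp_std = permutation_importance_by_hand(predict, X_test, y_test)
top4 = sorted(np.argsort(imp_mean)[-4:].tolist())
print("most important columns:", top4)
```

Columns the black box never reads get an importance of exactly zero, because shuffling them leaves every prediction unchanged. Repeating the shuffle several times gives a standard deviation to report alongside each mean.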

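Permutation importance is model-agnostic (any predictor with a .predict() method works) and is a common first sanity check on a new model. Its main weakness is correlated features: when two features carry the same signal, permuting one leaves the model free to lean on the other, so both can look unimportant, and the permuted rows may be unrealistic combinations the model never saw in training. Check for redundant features (for example, by clustering on correlation) before reading the ranking.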

Module reference — sklearn.inspection.permutation_importance

permutation_importance(model, X, y, n_repeats=10, random_state=0) returns an object with .importances_mean and .importances_std per feature. Always evaluate on a held-out set, not training data. For tree ensembles, prefer permutation importance over .feature_importances_ (the built-in tree-based measure is biased toward high-cardinality features).

The chart correctly identifies $x_1, x_4, x_6$ (the linear effects, indexed 0/3/5) and $x_8$ (the nonlinear sine term) as the important features. This is the first plot anyone reviewing a model should see.

Partial Dependence Plots and ICE Curves

A partial dependence plot (PDP) shows the marginal effect of one feature on the prediction, averaging over the values of all other features. Mathematically:

\[ \widehat{\text{PDP}}_j(z) = \frac{1}{n} \sum_{i=1}^{n} \hat f\big(x_{i,1}, \ldots, x_{i,j-1}, z, x_{i,j+1}, \ldots, x_{i,p}\big). \]

Read a PDP as “if I set feature $j$ to value $z$ for every observation in the dataset, the average predicted outcome would be…”.

PDPs hide heterogeneity — they average over the rest of the features and lose any information about subgroups. Individual Conditional Expectation (ICE) curves fix this: plot one curve per observation, showing how its prediction changes as $x_j$ varies. ICE curves that fan out (different slopes) indicate that the effect of $x_j$ depends on the other features — i.e., interactions are present.
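
A hand-rolled sketch makes the relationship between the two plots explicit: the PDP is simply the column-wise average of the ICE curves. The black box below is hypothetical (column 0 interacts with column 3, column 1 has a quadratic effect, column 2 is unused) and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical black box: col 0 interacts with col 3, col 1 is U-shaped, col 2 is unused.
def predict(X):
    return X[:, 0] * X[:, 3] + X[:, 1] ** 2

def ice_curves(predict_fn, X, j, grid):
    """One row per observation, one column per grid value z."""
    curves = np.empty((X.shape[0], grid.size))
    for g, z in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, j] = z                  # set feature j to z for every row
        curves[:, g] = predict_fn(X_mod)
    return curves

X = rng.uniform(-1, 1, size=(200, 4))
grid = np.linspace(-1, 1, 11)

ice0 = ice_curves(predict, X, 0, grid)
pdp0 = ice0.mean(axis=0)                 # the PDP is the average ICE curve
slopes0 = (ice0[:, -1] - ice0[:, 0]) / (grid[-1] - grid[0])

ice1 = ice_curves(predict, X, 1, grid)
pdp1 = ice1.mean(axis=0)
ice2 = ice_curves(predict, X, 2, grid)

print("spread of ICE slopes for column 0:", round(float(slopes0.std()), 3))
print("PDP for column 1 at z = -1, 0, 1:", pdp1[[0, 5, 10]].round(3))
```

For column 0 the ICE slopes differ from row to row (each equals that row's value of column 3), while the averaged PDP is close to flat, which is exactly the heterogeneity a PDP alone would hide.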

Module reference — sklearn.inspection

PartialDependenceDisplay.from_estimator(model, X, features=[0, 1], kind='average') — PDP.
Same call with kind='individual' — ICE.
kind='both' — overlay PDP on ICE curves.
For 2-D PDPs (interactions), pass a tuple in features: e.g., [(0, 1)].

The three panels tell a clear story: $x_1$ has an interaction (ICE curves fan), $x_2$ has a U-shape that PDP captures cleanly, $x_3$ has no effect. A regulator looking at this output can verify the model behaves the way the data-generating reasoning expects.

SHAP — Shapley Additive Explanations

SHAP (Lundberg & Lee 2017) decomposes each prediction into per-feature contributions using Shapley values from cooperative game theory. The Shapley value $\phi_j$ for feature $j$ on prediction $\hat f(x)$ is the average marginal contribution of $j$ across all possible orderings of the features. Three properties make SHAP the dominant local-explanation method:

Local accuracy: the per-feature contributions sum to the prediction minus the baseline. $\hat f(x) = \phi_0 + \sum_j \phi_j(x)$.
Consistency: if a feature’s contribution to the model increases (or stays the same) across all input subsets, its SHAP value cannot decrease.
Missingness: features that are not used in the model have zero SHAP value.

Exact Shapley computation is exponential in the number of features. Real-world SHAP uses TreeSHAP (an exact polynomial algorithm for tree ensembles) or KernelSHAP (a LIME-flavoured approximation for any model). The shap package implements both and is the most widely used implementation.
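
For a handful of features the exact computation is still feasible, and writing it out shows where the cost comes from. The sketch below enumerates every feature ordering; it assumes an interventional value function (features outside the coalition are filled in from background rows and averaged), and the `predict` function and data are hypothetical.

```python
import itertools
import math
import numpy as np

def exact_shapley(predict_fn, x, background):
    """Exact Shapley values by enumerating all p! feature orderings."""
    p = x.size

    def value(subset):
        # Features in `subset` take x's values; the rest come from background rows.
        X_mix = background.copy()
        cols = list(subset)
        if cols:
            X_mix[:, cols] = x[cols]
        return float(predict_fn(X_mix).mean())

    phi = np.zeros(p)
    for order in itertools.permutations(range(p)):
        present = []
        v_prev = value(present)
        for j in order:
            present.append(j)
            v_new = value(present)
            phi[j] += v_new - v_prev     # marginal contribution of j in this ordering
            v_prev = v_new
    return value([]), phi / math.factorial(p)

# Hypothetical black box; column 3 is never used.
def predict(X):
    return 3.0 * X[:, 0] + X[:, 1] * X[:, 2]

rng = np.random.default_rng(5)
background = rng.normal(size=(100, 4))
x = np.array([1.0, 2.0, -1.0, 0.5])

phi0, phi = exact_shapley(predict, x, background)
prediction = predict(x[None, :])[0]
print("baseline:", round(phi0, 4), "contributions:", phi.round(4), "prediction:", prediction)
```

The result satisfies local accuracy (baseline plus contributions equals the prediction) and missingness (the unused column gets zero). With 4 features there are 24 orderings; with 20 there are more than $10^{18}$, which is why the approximations exist.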

The package isn’t in Pyodide by default — we implement a small additive feature-attribution by hand below, which captures the core idea of SHAP without the full Shapley axiomatic machinery.

Module reference — shap (outside Pyodide)

pip install shap. Common patterns:
explainer = shap.TreeExplainer(model) for tree ensembles.
explainer = shap.KernelExplainer(model.predict, X_background) for any model.
shap_values = explainer.shap_values(X_test): returns per-feature, per-row contributions.
shap.summary_plot(shap_values, X_test): global feature importance plus direction (the famous “beeswarm” plot).
shap.force_plot(explainer.expected_value, shap_values[i], X_test.iloc[i]): local explanation for prediction $i$.
shap.dependence_plot(j, shap_values, X_test): feature interaction view.

The output reads like a regulator-friendly receipt: “the model started at the average prediction; feature $x_1$ pushed the prediction up by 1.4; feature $x_2$ pushed it down by 0.8; total prediction = …”. An itemised breakdown like this is something an examiner or auditor can check line by line.

SHAP local explanations produce a per-feature breakdown of why this particular applicant received the score — the format a denial letter (“your credit-utilisation rate contributed −0.18, your length-of-credit-history contributed +0.04 …”) naturally consumes. The limitation is that SHAP describes what the model did — not what was causally responsible. If the model has a confounded feature (e.g., zip code as a proxy for ethnicity), SHAP will faithfully report the feature’s contribution but will not flag it as illegitimate. Causal inference (Chapter 3 / Chapter 9) is needed for the deeper question.

TreeSHAP and SHAP Interaction Values

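Exact Shapley computation is exponential in the number of features, and KernelSHAP sidesteps that cost only by sampling feature coalitions, which makes it approximate and slow on large datasets. For tree ensembles (gradient boosting, random forests, the workhorses of tabular ML) a smarter exact algorithm exists.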

TreeSHAP

TreeSHAP (Lundberg, Erion & Lee 2018) computes exact Shapley values for tree ensembles in time $O(T \cdot L \cdot D^2)$, where $T$ is the number of trees, $L$ the maximum number of leaves per tree, and $D$ the maximum depth. For typical gradient-boosting models this is fast enough to explain every row of a large test set. The algorithm walks each tree once and accumulates the per-feature contributions using a dynamic-programming pass over the path probabilities. Production libraries: shap.TreeExplainer(model) (works for XGBoost, LightGBM, CatBoost, sklearn ensembles), shap.GPUTreeExplainer for very large batches.

The practical consequences:

Speed. A 500-tree LightGBM model with 10,000 test rows is fully explained in seconds.
Exactness. No approximation, no Monte-Carlo variance.
Additivity. Negative SHAP values are real contributions: TreeSHAP preserves the local accuracy identity $\hat f(x) = \phi_0 + \sum_j \phi_j(x)$.

SHAP interaction values

For each pair $(j, k)$ of features TreeSHAP also produces an interaction value $\Phi_{jk}$ — the part of the prediction explained by the joint presence of features $j$ and $k$ that neither feature explains alone. The decomposition is

\[ \hat f(x) = \phi_0 \;+\; \sum_j \Phi_{jj}(x) \;+\; \sum_{j \ne k} \Phi_{jk}(x), \]

where the diagonal $\Phi_{jj}$ is the “main effect” of feature $j$ and the off-diagonal $\Phi_{jk}$ is the pure interaction.
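
The decomposition can be checked by brute force on a toy function. The sketch below uses the Shapley interaction index, with the off-diagonal value split equally between $\Phi_{jk}$ and $\Phi_{kj}$ and the diagonal defined as the Shapley value minus the row's interactions. The function `f`, the query point, and the all-zeros baseline are hypothetical choices for illustration; this is not the TreeSHAP algorithm itself.

```python
import itertools
import math
import numpy as np

def shapley_values(value, p):
    """Shapley values from the subset formula."""
    phi = np.zeros(p)
    for j in range(p):
        rest = [m for m in range(p) if m != j]
        for size in range(p):
            w = math.factorial(size) * math.factorial(p - size - 1) / math.factorial(p)
            for S in itertools.combinations(rest, size):
                S = frozenset(S)
                phi[j] += w * (value(S | {j}) - value(S))
    return phi

def interaction_values(value, p):
    """Matrix Phi: off-diagonal = pairwise interaction, diagonal = main effect."""
    Phi = np.zeros((p, p))
    for j, k in itertools.combinations(range(p), 2):
        rest = [m for m in range(p) if m not in (j, k)]
        total = 0.0
        for size in range(len(rest) + 1):
            w = math.factorial(size) * math.factorial(p - size - 2) / (2 * math.factorial(p - 1))
            for S in itertools.combinations(rest, size):
                S = frozenset(S)
                total += w * (value(S | {j, k}) - value(S | {j}) - value(S | {k}) + value(S))
        Phi[j, k] = Phi[k, j] = total
    phi = shapley_values(value, p)
    for j in range(p):
        Phi[j, j] = phi[j] - Phi[j].sum()   # diagonal is still zero here, so this subtracts the interactions
    return Phi

# Toy model: features 0 and 1 interact, feature 2 acts alone. Baseline = all zeros.
x = np.array([2.0, 3.0, 5.0])

def f(z):
    return z[0] * z[1] + z[2]

def value(S):
    z = np.array([x[m] if m in S else 0.0 for m in range(x.size)])
    return f(z)

Phi = interaction_values(value, 3)
print(Phi)
```

The product term $x_0 x_1 = 6$ shows up entirely off the diagonal (3 in each of $\Phi_{01}$ and $\Phi_{10}$), feature 2 contributes its 5 on the diagonal, and the whole matrix sums to $\hat f(x) - \phi_0 = 11$.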

This is the strongest available interpretability output for tree ensembles: not only the per-feature contributions, but a heat-map of which features interact with which. Used at banks for explaining nonlinear credit-risk interactions (“the joint presence of high debt-to-income and short credit history is what drove the denial”), and at medical-AI teams for explaining biomarker interactions in diagnostic models.

Reading this two-panel diagnostic is now a routine part of model documentation at any firm shipping a tree ensemble into a regulated environment — the interaction surface is the visual proof that the model has learned the right pairwise structure (or hasn’t).

LIME — Local Surrogate Models

LIME (Ribeiro et al. 2016) takes a different approach to local explanation: around each prediction, fit a simple, interpretable model (linear regression or small decision tree) that locally approximates the complex one. The coefficients of the local surrogate are the explanation.

The recipe:

Pick the query point $x^\star$.
Sample nearby points $z$ — for tabular data, perturb each feature within its range.
Get the complex model’s predictions on the samples: $\hat f(z)$.
Weight each sample by its similarity to $x^\star$ (closer = more weight).
Fit a sparse linear regression of $\hat f(z)$ on $z$, weighted by similarity.
Report the linear coefficients as the local explanation.
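
The recipe can be sketched in NumPy as follows. Two simplifications are assumed here and are not part of the lime package: perturbations are Gaussian around the query point, and the surrogate is an ordinary weighted least-squares fit rather than a sparse one. The black box, the kernel width, and the perturbation scale are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical black box: nonlinear in columns 0 and 1, column 2 is unused.
def predict(X):
    return np.tanh(1.5 * X[:, 0] - 0.8 * X[:, 1])

def lime_explain(predict_fn, x_star, n_samples=1000, scale=0.1, kernel_width=0.25, rng=rng):
    Z = x_star + rng.normal(scale=scale, size=(n_samples, x_star.size))  # step 2: sample nearby
    f_z = predict_fn(Z)                                                  # step 3: query the model
    dist = np.linalg.norm(Z - x_star, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)                         # step 4: similarity weights
    A = np.column_stack([np.ones(n_samples), Z - x_star])                # intercept + centred features
    sw = np.sqrt(w)
    # Step 5: weighted least squares, solved by scaling rows with sqrt(weight).
    beta, *_ = np.linalg.lstsq(A * sw[:, None], f_z * sw, rcond=None)
    return beta[0], beta[1:]                                             # step 6: local coefficients

x_star = np.zeros(3)
intercept, coefs = lime_explain(predict, x_star)
print("local intercept:", round(float(intercept), 3))
print("local coefficients:", coefs.round(3))
```

Near the origin the surrogate's coefficients approximate the gradient of the black box, roughly 1.5 and -0.8 for the first two columns and about zero for the unused third. Re-running with a different seed changes the numbers slightly, which is the instability discussed next.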

LIME’s killer feature is its model-agnosticism — it works on any black box that exposes .predict(). Its weakness is instability: different random samples around the same query point can give different explanations. SHAP and LIME often agree on direction but disagree on magnitude; in production both are commonly reported.

The surrogate coefficients say: locally, the prediction behaves like $1.5\,x_1 - 0.8\,x_2 + \ldots$. That sentence is exactly the format an analyst, regulator, or end user can audit.

Anchors — High-Precision Local Rules

LIME gives a linear approximation in a neighbourhood. Anchors (Ribeiro et al. 2018) give a rule that pins down a prediction with high precision: “if income > $60K AND debt < $30K AND age > 35, the model predicts approval with ≥95% precision in the neighbourhood.” The output is an IF–THEN rule that any auditor can read.

The algorithm is a beam search over feature predicates. For each candidate predicate, perturb features outside the predicate and check whether the prediction stays the same. Pick the shortest predicate whose precision exceeds a threshold (e.g., 0.95) on the perturbation neighbourhood. The result is a compact, regulator-readable explanation.
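
A much-simplified sketch of the search is shown below. It is not the published algorithm: it uses a greedy search (a beam of width one) instead of a bandit-driven beam search, each predicate pins a feature to the query's exact value rather than to a discretised range, and perturbations are drawn from a background sample. The black box, feature names, and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical black box: approve (1) when income > 60 and age > 35; debt is ignored.
def predict(X):
    return ((X[:, 0] > 60) & (X[:, 2] > 35)).astype(int)

names = ["income", "debt", "age"]
background = np.column_stack([
    rng.uniform(20, 120, 2000),   # income, in thousands
    rng.uniform(0, 60, 2000),     # debt, in thousands
    rng.uniform(18, 70, 2000),    # age, in years
])
x_star = np.array([80.0, 20.0, 40.0])

def precision(anchor, predict_fn, x, background):
    """Share of perturbed rows that keep x's prediction when anchored features are held at x."""
    Z = background.copy()
    idx = list(anchor)
    if idx:
        Z[:, idx] = x[idx]
    return float(np.mean(predict_fn(Z) == predict_fn(x[None, :])[0]))

def greedy_anchor(predict_fn, x, background, threshold=0.95):
    anchor = []
    while precision(anchor, predict_fn, x, background) < threshold and len(anchor) < x.size:
        candidates = [j for j in range(x.size) if j not in anchor]
        # Add the single feature whose inclusion raises precision the most.
        best = max(candidates, key=lambda j: precision(anchor + [j], predict_fn, x, background))
        anchor.append(best)
    return sorted(anchor), precision(anchor, predict_fn, x, background)

anchor, prec = greedy_anchor(predict, x_star, background)
print("anchor features:", [names[j] for j in anchor], "precision:", prec)
```

The search stops at income and age: once those two are pinned, every perturbation of debt leaves the prediction unchanged, so debt never enters the rule.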

Production use: financial-services compliance teams use Anchor-style rule extractions to convert black-box model decisions into the kind of “reason codes” that adverse-action notices require.

The output is an IF–THEN rule that an auditor or denied applicant can read: “the model approves applicants with income > $60K AND age > 35 with at least 95% precision.” Pair Anchors with counterfactuals for adverse decisions and SHAP for full attribution; together they cover the explanation formats a regulator is most likely to ask for.

Counterfactual Explanations

A counterfactual explanation answers: “what is the smallest change to the inputs that would flip the prediction?” For a credit-rejected applicant, “if your annual income had been $5,000 higher and your credit-utilisation 10 points lower, your application would have been approved.” For a clinician’s diagnostic aid, “if the patient’s blood-pressure had been below 130, the model would not have flagged hypertensive risk.”

Counterfactuals are the format people consume most readily, because humans already reason in what-ifs. They are also actionable: the explanation tells the user what to change. This is why explanations written to satisfy the EU AI Act and GDPR often default to counterfactual format.

Generating a counterfactual is an optimisation problem: minimise the distance from the original input subject to the constraint that the model’s prediction crosses the threshold. In practice this is a small constrained-optimisation problem; for low-dimensional tabular data, a grid search is enough.
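
A grid-search version for two features might look like the sketch below. The scoring model, the applicant, the grid step, and the scaling of the distance (one unit per $20K of income, one unit per 10 points of utilisation) are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical scoring model: approval probability from income (thousands) and utilisation (%).
def approval_probability(income_k, utilisation_pct):
    logit = 0.08 * (income_k - 60.0) - 0.06 * (utilisation_pct - 40.0)
    return 1.0 / (1.0 + np.exp(-logit))

applicant = {"income_k": 50.0, "utilisation_pct": 55.0}
threshold = 0.5

# Candidate changes: income may only rise, utilisation may only fall (actionable directions).
d_income = np.arange(0.0, 40.5, 0.5)
d_util = np.arange(0.0, 40.5, 0.5)
DI, DU = np.meshgrid(d_income, d_util, indexing="ij")

prob = approval_probability(applicant["income_k"] + DI, applicant["utilisation_pct"] - DU)

# Distance from the original input, with each change scaled by an assumed typical spread.
cost = DI / 20.0 + DU / 10.0
cost = np.where(prob >= threshold, cost, np.inf)   # keep only changes that flip the decision

i, j = np.unravel_index(np.argmin(cost), cost.shape)
counterfactual = {
    "income_k": applicant["income_k"] + float(DI[i, j]),
    "utilisation_pct": applicant["utilisation_pct"] - float(DU[i, j]),
}
p_before = float(approval_probability(**applicant))
p_after = float(approval_probability(**counterfactual))
print(counterfactual, round(p_before, 3), "->", round(p_after, 3))
```

The choice of distance matters as much as the search: under this scaling the cheapest flip is an income increase alone, while a different scaling would trade income against utilisation. Restricting the grid to non-negative income changes and non-positive utilisation changes is the simplest way to encode actionability.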

The output is precisely the format a denial letter takes: “had your income been $15,000 higher, you would have been approved.” For the regulator, the auditor, and the applicant, this is the most useful explanation type.

Module reference — counterfactual libraries

Outside Pyodide:
dice-ml (Microsoft): generates diverse counterfactuals with constraints (immutable features, plausibility).
alibi (Seldon): counterfactuals, anchors, and other model-agnostic explanations.
aix360 (IBM): full interpretability toolkit including counterfactuals.
For simple cases, a few lines of grid search (as above) often suffice.

A counterfactual explanation. The clinician’s question is literally a counterfactual (“if X were Y, would the prediction change?”). SHAP and LIME would describe which features drove the current prediction, but a counterfactual tells the clinician which actionable change would alter it — exactly the form that supports a clinical decision (e.g., “if you reduce the medication’s dose by 25%, the model’s readmission risk falls below the high-risk threshold”). The constraint that counterfactuals only suggest actionable changes (avoid suggesting “if patient were younger”) is the key production refinement.

Model Calibration — Brier Score and Reliability Diagrams

A classifier can be accurate but uncalibrated. If your model says “70% probability” but is right only 50% of the time when it does, the probabilities mean less than they appear to. Calibration matters whenever the probability itself is a decision input — credit-score thresholds, medical risk stratification, ad-bid pricing, options-implied probabilities.

Two essential diagnostics:

Brier score — mean squared error between predicted probabilities and binary outcomes: $\mathrm{BS} = \frac{1}{n} \sum (p_i - y_i)^2$. Lower is better; perfect = 0.
Reliability diagram — bin predictions by predicted probability; plot mean predicted vs. mean observed. A well-calibrated model lies on the 45° line.
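
Both diagnostics are a few lines of NumPy. The sketch below computes the Brier score and the table of points that a reliability diagram plots; the equal-width binning and the tiny example arrays are illustrative assumptions.

```python
import numpy as np

def brier_score(y_true, y_prob):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def reliability_table(y_true, y_prob, n_bins=5):
    """Per-bin (mean predicted probability, observed frequency, count); empty bins are skipped."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to an equal-width bin; the clip keeps p = 1.0 in the last bin.
    bin_idx = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            rows.append((float(y_prob[mask].mean()), float(y_true[mask].mean()), int(mask.sum())))
    return rows

y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.7, 0.3, 0.1]

bs = brier_score(y_true, y_prob)
rows = reliability_table(y_true, y_prob)
print("Brier score:", round(bs, 4))
for mean_pred, observed, count in rows:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

Plotting the first column of the table against the second gives the reliability diagram. On real data, use enough rows per bin that the observed frequency is not dominated by noise.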

Two common fixes when miscalibration is found:

Platt scaling — fit a logistic regression of $y$ on raw scores; map raw scores through the logistic. Best when miscalibration is sigmoidal.
Isotonic regression — fit a non-parametric monotonic function. More flexible, requires more held-out data.

Module reference — calibration tools

sklearn.calibration.CalibratedClassifierCV(estimator, method='sigmoid') — Platt scaling wrapper.
sklearn.calibration.CalibrationDisplay.from_predictions(y_true, y_prob) — reliability diagram.
sklearn.metrics.brier_score_loss(y_true, y_prob) — single-number metric.
sklearn.metrics.log_loss(y_true, y_prob) — logarithmic scoring rule; rewards confident-correct predictions, punishes confident-wrong ones.
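
A minimal usage sketch of these tools follows, comparing a raw random forest with an isotonic-calibrated one on synthetic data. The dataset, the forest size, and the five-fold setting are arbitrary choices for illustration; whether recalibration helps, and by how much, depends on the model and the data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, split into train and held-out halves.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Raw model: probabilities come straight from the forest's vote shares.
raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Calibrated model: isotonic mapping learned with 5-fold cross-validation on the training set.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0), method="isotonic", cv=5
).fit(X_train, y_train)

p_raw = raw.predict_proba(X_test)[:, 1]
p_cal = calibrated.predict_proba(X_test)[:, 1]
brier_raw = brier_score_loss(y_test, p_raw)
brier_cal = brier_score_loss(y_test, p_cal)
print(f"Brier raw={brier_raw:.4f}  calibrated={brier_cal:.4f}")
```

Switching method to 'sigmoid' gives Platt scaling with the same call. Passing either probability vector to CalibrationDisplay.from_predictions(y_test, ...) draws the corresponding reliability diagram.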

Reading the diagram: the raw random forest is systematically over-confident below 0.5 (its dots sit above the 45° line) and under-confident above 0.5. The isotonic recalibration pulls the line onto the diagonal. Whichever model you ship, report its Brier score and attach its reliability diagram in the model documentation.

Fairness Metrics and Slice-Level Analysis

A model can be accurate overall while performing systematically worse for a protected subgroup (age, gender, race, geography). Regulators increasingly require evidence that the model’s behaviour is equitable — and “equitable” has several distinct mathematical definitions, which are sometimes mutually exclusive.

The most-used fairness criteria:

Demographic parity (statistical parity): $P(\hat Y = 1 \mid A = a)$ is equal across groups $a$. The model approves at the same rate for every subgroup.
Equal opportunity: $P(\hat Y = 1 \mid Y = 1, A = a)$ is equal — the true-positive rate is equal across groups.
Equalised odds: both TPR and FPR are equal across groups.
Predictive parity: $P(Y = 1 \mid \hat Y = 1, A = a)$ is equal — the precision is equal across groups.
Calibration within groups: the reliability diagram is on the 45° line for each group separately.
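
The first four criteria all reduce to comparing confusion-matrix rates across groups, which can be computed by hand. The helper name `group_metrics` and the eight-row toy dataset below are hypothetical.

```python
import numpy as np

def group_metrics(y_true, y_pred, group):
    """Selection rate, TPR, FPR and precision for each value of the group attribute."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    out = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        tp = int(np.sum((yp == 1) & (yt == 1)))
        fp = int(np.sum((yp == 1) & (yt == 0)))
        fn = int(np.sum((yp == 0) & (yt == 1)))
        tn = int(np.sum((yp == 0) & (yt == 0)))
        out[str(g)] = {
            "selection_rate": (tp + fp) / len(yt),                      # demographic parity
            "tpr": tp / (tp + fn) if tp + fn else float("nan"),         # equal opportunity
            "fpr": fp / (fp + tn) if fp + tn else float("nan"),         # with TPR: equalised odds
            "precision": tp / (tp + fp) if tp + fp else float("nan"),   # predictive parity
        }
    return out

# Toy audit: two groups of four applicants each.
group  = ["a", "a", "a", "a", "b", "b", "b", "b"]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

metrics = group_metrics(y_true, y_pred, group)
for g, row in metrics.items():
    print(g, {k: round(v, 3) for k, v in row.items()})
```

In this toy example both groups have the same overall accuracy (75%), yet group a is selected three times as often and group b has half the true-positive rate, so every criterion in the list is violated at once.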

A famous result (Chouldechova 2017; Kleinberg et al. 2016): when base rates differ across groups, you cannot satisfy predictive parity and equalised odds at the same time, except in degenerate cases such as a perfect predictor. You must pick which fairness criterion is appropriate for the application. This is the heart of the COMPAS recidivism-prediction controversy and similar high-profile disputes.
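
A short piece of algebra shows why. Write $p_a = P(Y = 1 \mid A = a)$ for the base rate in group $a$ and PPV for the precision. Expanding the definition of precision in terms of true and false positives and rearranging gives, within each group,

\[ \mathrm{FPR}_a = \frac{p_a}{1 - p_a} \cdot \frac{1 - \mathrm{PPV}_a}{\mathrm{PPV}_a} \cdot \mathrm{TPR}_a. \]

If PPV and TPR are equal across groups but the base rates differ, the right-hand side differs, so the FPR cannot be equal. As an illustrative calculation, with PPV = 0.8 and TPR = 0.6, a group with $p_a = 0.5$ has FPR $= 1 \cdot 0.25 \cdot 0.6 = 0.15$, while a group with $p_a = 0.2$ has FPR $= 0.25 \cdot 0.25 \cdot 0.6 = 0.0375$.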

Module reference — fairness toolkits

Outside Pyodide:
fairlearn (Microsoft): fairness assessment plus bias-mitigation algorithms.
aif360 (IBM): broad fairness toolkit including pre-, in-, and post-processing methods.
fairml, themis-ml, aequitas: alternatives.
Inside Pyodide it is straightforward to compute the metrics by hand with NumPy / pandas.

The output table is the standard slice-level fairness audit. Even when the model has the same overall accuracy across groups, the components (selection rate, TPR, FPR, precision) may disagree — and which component matters depends on the application. A hiring model usually optimises equal opportunity (don’t reject qualified people unequally); a lending model usually optimises predictive parity (the same model score means the same default rate).

Three mitigation strategies that get used in production:

Pre-processing: re-weight or re-sample to make the training distribution group-balanced.
In-processing: add a fairness penalty to the loss (the fairlearn.reductions.ExponentiatedGradient recipe).
Post-processing: threshold the predictions differently per group to enforce equality on the chosen metric (Hardt, Price & Srebro 2016).
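
The post-processing idea can be sketched with per-group thresholds chosen to hit a common true-positive rate. This is a simplified, deterministic illustration of equal opportunity, not the full randomised procedure of Hardt, Price & Srebro; the synthetic scores, the 0.80 target, and the group labels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def tpr(y_true, y_pred):
    return float(np.mean(y_pred[y_true == 1]))

# Synthetic scores: group "b" receives systematically lower scores for the same outcome.
n = 4000
group = rng.choice(["a", "b"], size=n)
y = rng.binomial(1, 0.4, size=n)
score = np.clip(0.35 + 0.3 * y - 0.1 * (group == "b") + rng.normal(scale=0.15, size=n), 0, 1)

target_tpr = 0.80
thresholds = {}
for g in ("a", "b"):
    pos_scores = score[(group == g) & (y == 1)]
    # The (1 - target) quantile of the positives' scores admits about `target` of them.
    thresholds[g] = float(np.quantile(pos_scores, 1.0 - target_tpr))

group_threshold = np.where(group == "a", thresholds["a"], thresholds["b"])
y_pred_single = (score >= 0.5).astype(int)           # one threshold for everyone
y_pred_group = (score >= group_threshold).astype(int)  # one threshold per group

tpr_single = {g: tpr(y[group == g], y_pred_single[group == g]) for g in ("a", "b")}
tpr_group = {g: tpr(y[group == g], y_pred_group[group == g]) for g in ("a", "b")}
print("single threshold TPR:", {g: round(v, 3) for g, v in tpr_single.items()})
print("per-group threshold TPR:", {g: round(v, 3) for g, v in tpr_group.items()})
```

In practice the thresholds would be fitted on a validation set and then checked on separate held-out data, and the change in false-positive rate and accuracy for each group should be reported alongside the equalised TPR.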

There is no free lunch: enforcing fairness costs some accuracy, and which metric you enforce changes which subgroup pays the accuracy cost. Document the choice in the model card.

Model Cards and Regulatory Documentation

A model card (Mitchell et al. 2019) is a one- to two-page document that accompanies every shipped model. Model cards, or close variants of them, are published by Google, Microsoft, OpenAI, Anthropic, and many regulated firms. Annex IV of the EU AI Act sets out technical-documentation requirements that cover much of the same ground.

A complete model card has these sections:

Model details — name, version, owner, training date.
Intended use — the use cases the model was designed for and the cases it is not validated for.
Factors — relevant demographic, environmental, or instrument factors.
Metrics — accuracy, calibration, fairness across slices.
Evaluation data — what was used to test the model.
Training data — sources, dates, demographic make-up.
Quantitative analyses — disaggregated performance tables.
Ethical considerations — known risks, limitations, mitigations.
Caveats and recommendations — what users should know.

Adjacent artifacts:

Datasheets for datasets (Gebru et al. 2018) — same idea applied to the training data.
System cards (Anthropic, OpenAI) — extend model cards to the full deployed system, including the prompt template, retrieval layer, and human-review path.
Annex IV technical documentation — the EU AI Act’s binding template for high-risk systems. Includes risk-management system, data governance, transparency, human oversight.

Model-card YAML skeleton (production template)

model_name: credit-risk-v2.3
version: 2.3.1
trained_on: 2026-02-15
intended_use:
  primary: estimate 12-month default probability on personal-loan applicants
  out_of_scope: corporate loans; international markets; products newer than 2 years
data:
  source: internal lending decisions 2018-2025; pii-redacted snapshot 2026-02-01
  demographic_breakdown: see appendix A
metrics:
  overall_auc: 0.86
  calibration_brier: 0.087
  slice_performance:
    age_18_25: auc=0.81, demographic_parity_gap=0.04
    age_26_45: auc=0.87, demographic_parity_gap=0.01
    age_46_65: auc=0.88, demographic_parity_gap=-0.02
known_limitations:
  - thin-file applicants (< 6 months credit history) have higher prediction error
  - performance degrades during the first 90 days after a macroeconomic regime shift
mitigations:
  - quarterly recalibration on rolling 12-month data
  - human review for all denials with predicted probability between 0.45 and 0.55
contact: [email protected]

The discipline that matters: publish the model card publicly inside the firm before deploying the model. If anyone — risk, legal, product, an internal critic — finds something they can’t sign off on, that is the moment to address it. Not after a regulator’s inquiry.

Communicating Results to Stakeholders

The technical work in this chapter produces numbers. The remaining work — turning those numbers into something a non-technical stakeholder can act on — is where most interpretability projects fail. Four practical rules:

1. Match the explanation to the audience.
Regulator / auditor: global feature importance + PDP for top-K features + sample SHAP explanations from edge cases.
Model owner / engineer: ICE curves, residual diagnostics, feature interactions, fairness slices.
End user (applicant, patient, customer): counterfactual (“if X had been Y, the decision would have been Z”).
Executive: one-line headline (“the model relies primarily on credit utilisation and recent payment history; performance is uniform across age and gender segments”).

2. Always pair an explanation with a confidence statement. SHAP values are sample-dependent; LIME is unstable; PDP averages over heterogeneity. State the uncertainty.

3. Anticipate the “but is it fair?” follow-up. Compute group-level SHAP averages across protected attributes (gender, race) and report disparities. The model documentation should describe how disparities, if any, were investigated.

4. Document the limits of the explanation. SHAP describes the model, not the world. A SHAP value of −0.2 for a feature does not mean “intervening on this feature in real life would reduce the outcome by 0.2.” That is a causal claim and needs Chapter 3’s machinery.

Chapter Wrap-up — and Book Wrap-up

The ten chapters now form a complete arc:

Chapters 1–7 are the durable statistical core: distributions, prediction, causal inference, Bayesian methods, time series, clustering, pattern recognition. Every method has been in production for decades.
Chapters 8–9 are the modern AI-driven superstructure: embeddings, vector search, GNNs, and LLM-aided extraction that turn unstructured data into statistical features; foundation models, AI-scale causal inference, and symbolic regression that operate on the resulting matrices.
Chapter 10 is the layer that makes any of the above usable in regulated, high-stakes settings: the explanation discipline that turns model output into decisions a regulator, a clinician, a portfolio manager, or a customer can sign off on.

The most important meta-lesson, the one that has appeared at the end of every chapter and is true everywhere: a pattern is real only if it survives an honest test on data not used to find it, and a model is usable only if its outputs can be explained to the people who must act on them. Master both habits and you have moved from “doing statistics” to practising statistics — the discipline of intellectual honesty under pressure that this book has tried to convey.
