Module 2 · Lesson 1

SHAP: Attributing Predictions to Features

From game theory to model transparency — the method that became an industry standard

How can we fairly divide credit for a prediction among dozens of input variables?

When Scott Lundberg and Su-In Lee published A Unified Approach to Interpreting Model Predictions at NeurIPS 2017, they unified a fragmented landscape of local explanation methods under one mathematical framework rooted in cooperative game theory. The paper's central insight: treat each feature as a "player" and use the Shapley value from economics to assign each one a fair share of the prediction's deviation from the baseline.

Within a few years, SHAP had been downloaded millions of times and integrated into Microsoft Azure ML, Amazon SageMaker Clarify, and Databricks. It became, for many teams, the default answer to the question: "Why did the model predict this?"

What Shapley Values Actually Measure

The Shapley value originates in a 1953 paper by Lloyd Shapley solving a cooperative game problem: if a coalition of players jointly produces a payoff, how should that payoff be divided equitably? The key properties are efficiency (attributions sum to the prediction gap from baseline), symmetry (features with identical contributions get identical attributions), dummy (features that never matter get zero), and additivity (attributions from separate games can be combined).

In the ML context, the "prediction" is the output value (or log-odds for classifiers), the "baseline" is the expected model output over the training data, and each "feature coalition" is a subset of inputs given to the model while the rest are marginalized over the data distribution. The SHAP value for feature i is the average of feature i's marginal contribution across all possible orderings in which features can join the coalition (equivalently, a weighted average over all coalitions that exclude i).

Exact computation is exponential in the number of features — 2ⁿ coalitions for n features. SHAP's impact came from efficient approximations: TreeSHAP for tree-based models runs in O(TLD²) time (T trees, L leaves, D max depth), enabling exact values on large forests in seconds. KernelSHAP provides a model-agnostic approximation using weighted linear regression on sampled coalitions.
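To make the coalition arithmetic concrete, here is a minimal sketch of the exact computation for a three-feature toy problem. The model, the instance, and the use of a single all-zeros background point (instead of averaging over a dataset) are illustrative assumptions, not part of the SHAP library.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    players = list(range(n_features))
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Weight of a coalition S of this size: |S|! (n - |S| - 1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when it joins coalition S
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def model(x):
    # Hypothetical toy model with one interaction term.
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[2]


x = [1.0, 2.0, 1.0]            # instance being explained
background = [0.0, 0.0, 0.0]   # assumed single reference point


def value_fn(coalition):
    # Features in the coalition keep the instance's value; the rest take the background value.
    mixed = [x[j] if j in coalition else background[j] for j in range(len(x))]
    return model(mixed)


phi = shapley_values(value_fn, n_features=3)
baseline = value_fn(frozenset())
prediction = model(x)
print(phi, sum(phi), prediction - baseline)
```

For this toy model the attributions come out as 3.5, 2.0 and 1.5, which sum to the gap of 7.0 between the prediction and the baseline; the interaction term 3·x0·x2 is split evenly between the two features involved. The loop visits every subset, which is exactly the cost that TreeSHAP and KernelSHAP are designed to avoid.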

Real Deployment — Amazon SageMaker Clarify (2020)

AWS launched SageMaker Clarify in December 2020 with SHAP as the core attribution engine. Teams at Amazon used KernelSHAP to explain why specific product recommendations were made, surfacing that "days since last purchase" was the dominant feature for lapsed-customer reactivation models — a finding that changed reactivation campaign design. The same tool flagged that "zip code" was acting as a proxy for race in a credit-related scoring pipeline, triggering a remediation before the model shipped.

TreeSHAP in Practice

For gradient-boosted tree models (XGBoost, LightGBM, CatBoost, scikit-learn's GradientBoostingClassifier), TreeSHAP computes exact Shapley values by recursively tracking how each feature's decision nodes split the prediction away from the root expectation. The algorithm is described in Lundberg et al.'s 2020 Nature Machine Intelligence paper and is the backend behind the SHAP Python library's TreeExplainer.

For a single-output model, a call to shap.TreeExplainer(model).shap_values(X) returns a matrix of shape (n_samples × n_features) where each cell is the SHAP value for that sample-feature pair; models with several outputs, such as multi-class classifiers, yield one such set of values per output. Positive values push the prediction above baseline; negative values push it below. A waterfall plot for a single prediction shows each feature's signed contribution stacked from baseline to final output.

Global importance is computed by taking the mean absolute SHAP value per feature across all samples, written mean(|SHAP|). This is often preferred over impurity-based importance (which is biased toward high-cardinality features) and permutation importance (which can be misleading when features are strongly correlated).
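As a rough illustration, the aggregation can be written in a few lines. The SHAP matrix and feature names below are made-up values, not output from a real model.

```python
def global_importance(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value across samples."""
    n_samples = len(shap_matrix)
    means = [
        sum(abs(row[j]) for row in shap_matrix) / n_samples
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)


# Hypothetical (n_samples x n_features) SHAP matrix: 3 samples, 3 features.
shap_matrix = [
    [0.8, -0.5, 0.1],
    [-0.6, 0.2, 0.0],
    [0.4, -0.3, -0.2],
]
feature_names = ["annual_income", "num_late_payments", "account_age"]

ranking = global_importance(shap_matrix, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the signed mean instead would let positive and negative contributions cancel (annual_income would drop from 0.6 to 0.2 here), which is why the absolute value is used.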

SHAP in High-Stakes Decisions

In 2019, clinicians at the University of Washington used SHAP with a gradient-boosted model predicting sepsis risk. TreeSHAP explanations revealed that the model heavily weighted lactate levels and respiratory rate — clinically sensible — but also gave high weight to "number of prior ICU admissions," a feature that correlates with frailty but also with socioeconomic access to care. The SHAP plots made this visible in a way that aggregate metrics did not, prompting a feature audit before clinical deployment.

FICO's 2018 Explainable Machine Learning Challenge used SHAP as a benchmark explanation method for credit scoring. Participants had to produce explanations that satisfied adverse action notice requirements — the regulatory obligation to tell loan applicants the primary reasons for denial. SHAP's additive attribution structure aligned naturally with the "top four reasons" format required by US regulation.

Key Limitation

SHAP values explain what the model did, not what is causally true in the world. A high SHAP value for "zip code" means zip code influenced the prediction, not that zip code causally drives the outcome. Causal XAI (addressed in Module 3) requires additional structure. Additionally, KernelSHAP's independence assumption — marginalizing features by sampling from the marginal rather than conditional distribution — can produce unrealistic counterfactual inputs, especially when features are strongly correlated.

Key Terms

Shapley value: A fair allocation of a coalition's payoff to individual players, derived from Lloyd Shapley's 1953 game theory work. In ML, it measures each feature's average marginal contribution across all possible feature orderings.

TreeSHAP: An exact, polynomial-time algorithm for computing Shapley values in tree-based models, introduced by Lundberg et al. (2020). Avoids exponential coalition enumeration by exploiting tree structure.

KernelSHAP: A model-agnostic approximation of SHAP using weighted linear regression on sampled feature coalitions. Slower than TreeSHAP but applicable to any model.

Baseline / reference value: The expected model output over background data. SHAP attributions measure each feature's contribution relative to this baseline.

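Additivity (SHAP property): The sum of all SHAP values for a prediction exactly equals the difference between the prediction and the baseline, so the explanation accounts for the whole gap. This is the same guarantee as the Shapley efficiency axiom listed earlier (the SHAP paper calls it local accuracy).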

Lesson 1 Quiz — SHAP

Three questions · Select the best answer

What mathematical concept from cooperative game theory underlies SHAP values?

Correct. Lloyd Shapley's 1953 cooperative game theory framework assigns each player the average of their marginal contributions across all possible orderings of coalition formation. Lundberg & Lee adapted this so each "player" is a model feature.

Not quite. SHAP is grounded in Lloyd Shapley's 1953 cooperative game theory: each feature receives the average of its marginal contribution across all coalition orderings, an allocation known as the Shapley value.

Why does TreeSHAP achieve polynomial rather than exponential runtime?

Correct. TreeSHAP's key insight (Lundberg et al., 2020) is that a tree's node-splitting structure allows exact marginal contributions to be computed recursively in O(TLD²) time — no coalition sampling needed.

Not quite. TreeSHAP achieves polynomial runtime by exploiting the tree's recursive split structure, computing exact marginal contributions at each node without enumerating all 2ⁿ feature subsets — unlike KernelSHAP which does sample.

What does the SHAP additivity property guarantee?

Correct. Additivity (also called efficiency) ensures full accounting: Σ φᵢ = f(x) − E[f(x)]. Every unit of prediction deviation from baseline is attributed to some feature — nothing is left unexplained.

Not quite. The additivity (efficiency) property states: the sum of all SHAP values for a given prediction equals the difference between the prediction and the expected model output (baseline). This guarantees complete attribution with no residual.

Lab 1 — SHAP Explorer

Practice session · Minimum 3 exchanges to complete

Interpreting SHAP Outputs

In this lab you'll work with an AI tutor to deepen your understanding of SHAP values. You might discuss how to read a waterfall plot, interpret negative SHAP values, choose between TreeSHAP and KernelSHAP, or think through a real deployment scenario.

Suggested starter: "A loan model gives applicant A a SHAP value of +0.8 for 'annual income' and −0.5 for 'num_late_payments'. The baseline is 0.3 and the prediction is 0.6. Does the additivity property hold here? Walk me through it."

Module 2 · Lesson 2

LIME: Local Approximations of Complex Models

Fitting an interpretable surrogate in the neighbourhood of a single prediction

Can a simple linear model faithfully explain a complex one — at least locally?

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin introduced LIME — Local Interpretable Model-agnostic Explanations — at KDD 2016. Their paper "Why Should I Trust You?" opened with a jarring demonstration: a classifier that achieved 99% accuracy on a flu-prediction dataset was shown to rely heavily on the word "not" in text strings like "I do not have a headache" — negations the model had learned to invert, but only in specific contexts. LIME made this visible by fitting a sparse linear model around individual predictions.

The method gained rapid adoption: by 2018, LIME was the most cited XAI technique in NLP and tabular ML, and it underpinned early responsible AI toolkits at Google, IBM, and several European fintech companies navigating early GDPR explainability discussions.

The LIME Algorithm

LIME's core idea: for any prediction point x, generate a set of perturbed samples in the neighbourhood of x, query the original (black-box) model f for their predictions, then fit a weighted sparse linear model g on those samples — weighting by proximity to x. The result is a local, interpretable approximation valid in x's neighbourhood.

Formally, LIME solves: argmin_g [ L(f, g, π_x) + Ω(g) ] where L is the fidelity loss (how well g mimics f locally), π_x is the locality kernel (proximity weighting), and Ω(g) penalises model complexity (typically via L1 regularisation to enforce sparsity). The output is a small set of features with signed weights indicating their local influence.
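The objective can be sketched in plain Python under some simplifying assumptions: Gaussian perturbations around x, an exponential kernel for π_x, and ordinary weighted least squares for g. The Ω(g) sparsity penalty is left out for brevity, and black_box, the noise scale, and the kernel width are hypothetical choices rather than the lime library's defaults.

```python
import math
import random


def black_box(x):
    # Hypothetical nonlinear model standing in for f.
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - 2.0 * x[1] ** 2)))


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def lime_like_explanation(f, x, n_samples=2000, noise=0.3, kernel_width=0.5, seed=0):
    rng = random.Random(seed)
    d = len(x)
    # Accumulate the weighted normal equations (Z^T W Z) w = Z^T W y,
    # where each row of Z is [1, z_1, ..., z_d] (intercept plus features).
    ztwz = [[0.0] * (d + 1) for _ in range(d + 1)]
    ztwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, noise) for xi in x]      # perturbed sample near x
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        weight = math.exp(-dist2 / kernel_width ** 2)      # locality kernel pi_x(z)
        row = [1.0] + z
        y = f(z)                                           # query the black box
        for i in range(d + 1):
            ztwy[i] += weight * row[i] * y
            for j in range(d + 1):
                ztwz[i][j] += weight * row[i] * row[j]
    coef = solve_linear_system(ztwz, ztwy)
    return coef[0], coef[1:]                               # intercept, local feature weights


x = [0.2, 0.5]
intercept, weights = lime_like_explanation(black_box, x)
print(weights)
```

Near x = [0.2, 0.5] the surrogate assigns a positive weight to the first feature and a negative weight to the second, matching the local slope of the toy model. Changing kernel_width changes how local the fit is, so the same instance can receive different weights under different settings.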

For tabular data, perturbations are typically drawn from per-feature statistics of the training data rather than by zeroing features out. For text, words are randomly removed (replaced with padding). For images, LIME uses superpixels (contiguous image segments) that can be masked on or off.

Real Case — ProPublica / Recidivism (2016)

When ProPublica published their COMPAS investigation in May 2016, researchers immediately began applying LIME to the COMPAS-style models they could reconstruct. LIME analyses by Julia Angwin's team and independent researchers showed that locally, the model's predictions for Black defendants were dominated by different features than for white defendants — not just globally biased, but locally inconsistent in which signals drove individual outcomes. This use of local explanation to detect disparate treatment patterns became a template for fairness auditing.

LIME for Text and Images

LIME's image explainer was popularised by Ribeiro's own demo on a GoogLeNet classifier that correctly labeled an image "tree frog" but, under LIME analysis, was found to base the prediction primarily on the frog's eye texture and the water background — not the body shape. When the eye region was masked, confidence dropped from 0.93 to 0.12. This became a canonical example of shortcut learning made visible through local explanation.

For NLP, LIME produces word-level attributions by randomly ablating tokens and observing confidence changes. In 2019, researchers at Salesforce used LIME on their email classification models and found that certain legal boilerplate phrases were driving "high priority" classifications — a feature the model had learned from sales email metadata, not semantic content. Removing these phrases from training reduced false-positive priority flags by 34%.

LIME vs. SHAP: Core Differences

LIME and SHAP are both local, additive explanation methods but differ critically in how they generate and weight perturbations. SHAP has theoretical guarantees (the four Shapley axioms); LIME does not, and its explanations can be unstable if the width σ of the locality kernel π_x is poorly tuned. SHAP values sum exactly to the prediction gap; LIME's linear coefficients do not have this accounting guarantee.

Property	LIME	SHAP
Theoretical basis	Local linear approximation; no axiomatic guarantees	Shapley axioms: efficiency, symmetry, dummy, additivity
Scope	Model-agnostic	Model-agnostic (Kernel) or model-specific (Tree, Deep)
Stability	Can vary across runs due to random sampling	TreeSHAP is deterministic; KernelSHAP has sampling variance
Attribution sum	Does not guarantee sum = prediction gap	Guaranteed: Σφᵢ = f(x) − E[f(x)]
Speed (trees)	Slow — many black-box queries per explanation	Fast — TreeSHAP is exact in polynomial time
Best use case	Text/image modalities; quick prototyping; NLP audits	Tabular data; production fairness monitoring; regulatory use

Instability Warning

A 2018 study by Alvarez-Melis and Jaakkola ("On the Robustness of Interpretability Methods") showed that LIME explanations can change substantially under small changes to the input, and proposed a metric to quantify this. Repeated runs on the same instance can also disagree, with different features receiving high weight simply because of different random perturbation draws. Practitioners should run LIME multiple times and check agreement before trusting a single explanation.
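One simple way to run such an agreement check is to compare the top-k feature sets from repeated runs. The sketch below uses mean pairwise Jaccard overlap, which is an illustrative choice and not the metric from the cited study; the three runs are made-up explanation outputs.

```python
from itertools import combinations


def top_k_features(explanation, k):
    """Names of the k features with the largest absolute weight."""
    ranked = sorted(explanation, key=lambda name: abs(explanation[name]), reverse=True)
    return set(ranked[:k])


def mean_pairwise_jaccard(explanations, k=2):
    """Average Jaccard overlap of top-k feature sets over all pairs of runs."""
    sets = [top_k_features(e, k) for e in explanations]
    scores = [len(a & b) / len(a | b) for a, b in combinations(sets, 2)]
    return sum(scores) / len(scores)


# Hypothetical outputs of three explanation runs on the same instance.
runs = [
    {"income": 0.6, "late_payments": -0.4, "age": 0.05},
    {"income": 0.5, "late_payments": -0.5, "age": 0.10},
    {"income": 0.1, "late_payments": -0.7, "age": 0.30},
]

stability = mean_pairwise_jaccard(runs, k=2)
print(f"top-2 agreement: {stability:.2f}")
```

A score of 1.0 means every run agreed on the top-k features; the value here is about 0.56, low enough to warrant more perturbation samples or a different kernel width before relying on any single run.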

Key Terms

LIME: Local Interpretable Model-agnostic Explanations. Fits a sparse linear surrogate model in the neighbourhood of a single prediction using perturbed, proximity-weighted samples.

Surrogate model: A simpler, interpretable model (e.g., linear regression, decision tree) trained to mimic the behaviour of a complex model, either locally (LIME) or globally.

Superpixel: A contiguous segment of an image used as the unit of perturbation in LIME's image explainer. Masking superpixels on/off reveals which image regions most influence the prediction.

Locality kernel (π_x): A weighting function in LIME that assigns higher weight to perturbed samples closer to the instance being explained, controlling the effective neighbourhood size.

Shortcut learning: When a model learns a spurious feature correlated with the target in training data rather than the genuine causal signal, often exposed by local explanation methods like LIME.

Lesson 2 Quiz — LIME

Three questions · Select the best answer

In the LIME optimisation objective argmin_g [ L(f, g, π_x) + Ω(g) ], what does Ω(g) represent?

Correct. Ω(g) is the complexity term — often implemented as the number of non-zero features in a linear surrogate. Minimising it alongside fidelity produces a sparse, human-readable explanation. π_x is the locality kernel (a different term in the objective).

Not quite. Ω(g) penalises model complexity — typically using L1 regularisation to limit the number of features in the surrogate, making it sparse and interpretable. The locality kernel is π_x, and fidelity to the original model is L.

How does LIME handle image data differently from tabular data?

Correct. LIME's image explainer uses superpixels as the "features." By switching superpixel segments on or off (filling masked regions with a neutral colour), LIME queries the model and builds a linear approximation indicating which segments most influenced the prediction.

Not quite. LIME for images operates on superpixels — contiguous segments — not individual pixels. Masking and unmasking these segments allows LIME to measure which image regions drive the prediction. Grad-CAM is a separate, gradient-based technique.

A 2018 study by Alvarez-Melis and Jaakkola found a key weakness of LIME. What was it?

Correct. Alvarez-Melis and Jaakkola's "On the Robustness of Interpretability Methods" (2018) showed that LIME explanations can be unstable: similar inputs, or repeated runs on the same instance with different random perturbations, can produce substantially different feature rankings. They proposed a metric to quantify this.

Not quite. The Alvarez-Melis and Jaakkola (2018) study documented instability: LIME can yield different feature-importance rankings for near-identical inputs, and repeated runs on the same instance can differ because the random perturbation sets differ. This means a single LIME explanation should be treated cautiously without stability checks.

Lab 2 — LIME Workshop

Practice session · Minimum 3 exchanges to complete

Probing Local Explanations

Work with the AI tutor on LIME scenarios. Explore how superpixel masking works, when LIME explanations might be unstable, how LIME handles negations in text, or how to choose the locality kernel bandwidth.

Suggested starter: "A LIME explanation for an image classifier shows superpixel A with weight +0.6 and superpixel B with +0.4. I ran LIME again and got A = +0.1 and B = +0.7. What's happening, and how should I respond to this instability?"

Module 2 · Lesson 3

Saliency Maps & Attention Visualisation

Gradient-based and attention-based windows into neural network decisions

When a neural network classifies an image or parses a sentence, which parts of the input did it actually look at?

In 2019, a team at Stanford published CheXpert, a large chest X-ray dataset and accompanying model that achieved radiologist-level performance on 14 pathologies. But when researchers applied Grad-CAM saliency maps, they found the model often highlighted pacemaker leads and drain tubes as evidence for cardiomegaly — because patients with those devices are more likely to have enlarged hearts in the training data. The model was right, but for the wrong reasons. Without saliency maps, this clinical shortcut would have been invisible.

Vanilla Gradients and Saliency Maps

The simplest saliency method computes ∂f(x)/∂xᵢ — the gradient of the model output with respect to each input feature. For images, this produces a pixel-wise map where large gradient magnitude indicates "the prediction is sensitive to this pixel." Simonyan et al. (2014) introduced this in "Deep Inside Convolutional Networks," and it remains a baseline comparison method.
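A minimal sketch of the idea, using central finite differences in place of backpropagation so that it runs without a deep learning framework. The three-input model is a hypothetical stand-in for a network's class score.

```python
import math


def model(x):
    # Hypothetical differentiable scorer standing in for a network output f(x).
    return math.tanh(2.0 * x[0] - 0.5 * x[1]) + 0.1 * x[2] ** 2


def saliency(f, x, eps=1e-5):
    """Central finite-difference estimate of |df/dx_i| for each input feature."""
    magnitudes = []
    for i in range(len(x)):
        up = x[:]
        up[i] += eps
        down = x[:]
        down[i] -= eps
        grad_i = (f(up) - f(down)) / (2 * eps)
        magnitudes.append(abs(grad_i))
    return magnitudes


x = [0.1, 0.4, 1.0]
sal = saliency(model, x)
print(sal)
```

At this input the sensitivities are roughly 2.0, 0.5 and 0.2, so the first feature dominates the map. The values describe sensitivity at this one point only; a different input could rank the features differently.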

Raw gradients are noisy — they capture local sensitivity, not global importance. Integrated Gradients (Sundararajan, Taly & Yan, 2017) improves this by integrating the gradient along a path from a baseline input (e.g., a black image) to the actual input, satisfying an axiom analogous to SHAP's additivity. Google uses Integrated Gradients in its What-If Tool and as the default attribution method in Google Cloud Vertex Explainable AI.
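The path integral can be approximated with a Riemann sum. The sketch below assumes a toy function, an all-zeros baseline, and finite-difference gradients; real implementations would use automatic differentiation.

```python
def f(x):
    # Hypothetical model output with an interaction and a quadratic term.
    return x[0] * x[1] + x[2] ** 2


def integrated_gradients(fn, x, baseline, steps=200, eps=1e-5):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    d = len(x)
    attributions = [0.0] * d
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(d):
            up = point[:]
            up[i] += eps
            down = point[:]
            down[i] -= eps
            grad_i = (fn(up) - fn(down)) / (2 * eps)
            # Accumulate gradient times input difference, averaged over the path
            attributions[i] += grad_i * (x[i] - baseline[i]) / steps
    return attributions


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attr = integrated_gradients(f, x, baseline)
print(attr, sum(attr), f(x) - f(baseline))
```

The attributions are approximately 1, 1 and 9 and sum to f(x) − f(baseline) = 11, which is the completeness property. The choice of baseline matters: a different reference input would produce different attributions for the same prediction.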

SmoothGrad (Smilkov et al., 2017) reduces gradient noise by averaging gradients over many noisy versions of the input, producing sharper, more visually interpretable saliency maps at the cost of additional computation.
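The averaging step is easy to sketch. The example below assumes a hypothetical score whose gradient contains a high-frequency wiggle, and an analytic gradient function in place of backpropagation; the sample count and noise scale are arbitrary.

```python
import math
import random


def grad(x):
    # Analytic gradient of the hypothetical score
    # f(x) = x0 + 0.05 * sin(50 * x0) + 0.5 * x1.
    return [1.0 + 2.5 * math.cos(50.0 * x[0]), 0.5]


def smoothgrad(grad_fn, x, n_samples=500, sigma=0.2, seed=0):
    """Average the gradient over Gaussian-perturbed copies of the input."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        for i, g in enumerate(grad_fn(noisy)):
            total[i] += g
    return [t / n_samples for t in total]


x = [0.0, 1.0]
raw = grad(x)
smooth = smoothgrad(grad, x)
print(raw, smooth)
```

At x0 = 0 the raw gradient is 3.5 because of the high-frequency term; after averaging, the estimate settles near 1.0, the slope of the underlying trend. Each extra noisy sample costs one more gradient evaluation.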

Grad-CAM: Class-Discriminative Saliency

Grad-CAM (Selvaraju et al., 2017 — cited over 14,000 times as of 2024) uses the gradients of the class score flowing into the final convolutional layer to produce a coarse spatial map of the regions most relevant to a specific class prediction. Unlike pixel-level saliency, Grad-CAM operates at feature-map resolution and is class-discriminative: you can ask "which regions support class A?" vs "which support class B?" for the same image.
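The core Grad-CAM computation is short once the activations and gradients at the chosen convolutional layer are available. The sketch below assumes they have already been extracted (the tiny 2×2 maps are made-up numbers) and leaves out upsampling to image resolution.

```python
def grad_cam(activations, gradients):
    """Coarse class-relevance map from K feature maps and their gradients.

    activations, gradients: lists of K maps, each an H x W nested list.
    """
    h, w = len(activations[0]), len(activations[0][0])
    # alpha_k: global-average-pooled gradient of the class score for feature map k
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(alphas, activations):
        for r in range(h):
            for c in range(w):
                cam[r][c] += alpha * fmap[r][c]
    # ReLU keeps only locations with a positive influence on the class score
    return [[max(0.0, v) for v in row] for row in cam]


activations = [
    [[1.0, 0.0], [0.0, 0.0]],   # map 0 fires in the top-left cell
    [[0.0, 0.0], [0.0, 2.0]],   # map 1 fires in the bottom-right cell
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],       # class score rises with map 0
    [[-1.0, -1.0], [-1.0, -1.0]],   # class score falls with map 1
]

cam = grad_cam(activations, gradients)
print(cam)
```

Only the top-left cell survives the ReLU, because the second feature map argues against the class. Passing gradients of a different class score through the same function gives the map for that class, which is what makes the method class-discriminative.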

Grad-CAM++ (Chattopadhay et al., 2018) extends this to handle multiple instances of the same class in one image and better localises small objects. HiResCAM (Draelos & Carin, 2020) provides a mathematically faithful version that avoids a gradient averaging step shown to sometimes misattribute importance in multi-layer networks.

Real Deployment — Google Cloud Vertex Explainable AI

Google's Vertex Explainable AI platform (launched 2020, evolved from PAIR Explainability Explorer) uses Integrated Gradients as its primary attribution method for image and text models. A documented use case from Google's internal teams: a satellite imagery model classifying crop health. Integrated Gradients revealed that the model was weighting cloud shadow regions in "unhealthy crop" predictions — shadows that correlated with overcast weather, which correlated with soil moisture stress in the training set. The team added cloud-masking preprocessing, reducing false positives by 18%.

Attention as Explanation: The Controversy

Transformer models compute attention weights that indicate, for each output token, how much each input token was "attended to." It is tempting to use these weights as explanations, saying "the model predicted X because it attended heavily to word Y." This interpretation was challenged by Jain & Wallace's paper "Attention is not Explanation" (NAACL 2019), which showed that attention weights frequently do not correlate with gradient-based feature importances, and that adversarially constructed attention distributions can differ sharply from the learned ones while yielding the same predictions.

Wiegreffe & Pinter's response "Attention is not not Explanation" (EMNLP 2019) complicated the picture further, arguing that the criteria used by Jain & Wallace were insufficient and that attention can be part of a faithful explanation when additional diagnostic tests are applied. The debate remains open.

Practical consensus: do not use raw attention weights as standalone explanations for high-stakes decisions. Use gradient-based or SHAP-based methods as the primary attribution signal; treat attention as a complementary diagnostic. Several NLP teams at major labs (Anthropic, Google Brain) document this recommendation in internal model cards.

Faithfulness vs. Plausibility

A critical distinction in XAI evaluation: faithfulness (does the explanation accurately reflect what the model computed?) vs. plausibility (does the explanation look reasonable to a human?). Saliency maps can be highly plausible — highlighting the right-looking regions — while being unfaithful if a different but equally valid saliency map (e.g., generated by perturbing hyperparameters) would change the story entirely.

Adebayo et al.'s "Sanity Checks for Saliency Maps" (NeurIPS 2018) showed that some popular saliency methods, notably guided backpropagation and Guided Grad-CAM, produce nearly identical maps even when the model weights are randomly re-initialised. This means those maps may reflect properties of the input data structure, not the model's learned parameters. Plain gradients and Grad-CAM, by contrast, were sensitive to weight randomisation and passed the checks in that study.
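The model-randomisation test can be sketched as follows: compute a saliency map for the trained model, recompute it after replacing the weights, and compare the two maps with a rank correlation. The linear model, the hand-picked "randomised" weights, and the model-independent edge_like_map are hypothetical stand-ins chosen to keep the arithmetic checkable.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for tie-free inputs."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def gradient_saliency(weights, x):
    # For a linear model f(x) = w . x the input gradient is simply w.
    return [abs(w) for w in weights]


def edge_like_map(weights, x):
    # Depends only on the input, never on the model weights.
    mean = sum(x) / len(x)
    return [abs(xi - mean) for xi in x]


x = [0.1, 0.9, 0.3, 0.8]
trained = [3.0, -1.0, 0.5, 2.0]       # assumed learned weights
randomised = [0.2, 1.5, -2.5, 0.1]    # assumed re-initialised weights

grad_similarity = spearman(gradient_saliency(trained, x),
                           gradient_saliency(randomised, x))
edge_similarity = spearman(edge_like_map(trained, x),
                           edge_like_map(randomised, x))
print(grad_similarity, edge_similarity)
```

A similarity of 1.0 after randomisation, as for edge_like_map, signals that the map does not depend on what the model learned. The gradient map changes completely (correlation −0.8), which is the behaviour a method sensitive to model parameters should show.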

Method	Summary
Vanilla Gradient	∂f/∂x, pixel sensitivity. Fast; noisy. Good baseline only.
Integrated Gradients	Path integral from baseline to input. Satisfies completeness axiom. Google's default in Vertex XAI.
SmoothGrad	Averages gradients over noisy input copies. Reduces visual noise; increases compute cost linearly with sample count.
Grad-CAM	Class-discriminative coarse map using final conv layer gradients. 14,000+ citations. Standard in medical imaging XAI.

Key Terms

Saliency map: A pixel- or token-wise visualisation of which input features most influence a model's output, typically derived from gradients of the output with respect to the input.

Integrated Gradients: An attribution method that integrates the gradient along a straight-line path from a baseline input to the actual input, satisfying a completeness axiom analogous to SHAP additivity.

Grad-CAM: Gradient-weighted Class Activation Mapping. Uses gradients into the last convolutional layer to produce a class-discriminative spatial importance map for CNNs.

Faithfulness: Whether an explanation accurately reflects the model's actual computational process, as opposed to plausibility, which means it merely looks reasonable to humans.

Sanity checks (Adebayo 2018): Tests showing that a valid saliency method must produce different maps when model weights are randomised. Methods that pass are sensitive to learned parameters; those that fail may only reflect input structure.

Lesson 3 Quiz — Saliency & Attention

Three questions · Select the best answer

What did Adebayo et al.'s 2018 "Sanity Checks for Saliency Maps" reveal about guided backpropagation and Guided Grad-CAM?

Correct. Adebayo et al. showed that guided backpropagation and Guided Grad-CAM maps look similar even for randomly-weighted networks, meaning they capture properties of the input data structure (e.g., image edges) rather than what the model has learned. Plain gradients and Grad-CAM were sensitive to the randomisation.

Not quite. The key finding was that these methods produce similar-looking maps whether weights are trained or random, meaning the maps may not reflect the model's learned computation at all. Plain gradients and Grad-CAM fared better on these sanity checks.

What makes Integrated Gradients (Sundararajan et al., 2017) preferable to vanilla gradients for attribution?

Correct. Integrated Gradients satisfies "completeness" — attributions sum exactly to f(x) − f(baseline), which vanilla gradients do not guarantee. This gives it an axiomatic grounding similar to SHAP's additivity property. (The noisy average description is SmoothGrad, a different method.)

Not quite. Integrated Gradients computes a path integral from a baseline input to the actual input, and satisfies a completeness axiom: the sum of all pixel/token attributions equals the prediction difference from baseline. This axiomatic guarantee is what distinguishes it from vanilla gradients. (Averaging over noisy inputs describes SmoothGrad.)

The 2019 debate between Jain & Wallace and Wiegreffe & Pinter concerned which XAI topic?

Correct. Jain & Wallace (NAACL 2019) argued "Attention is not Explanation": attention weights don't correlate reliably with gradient-based importances. Wiegreffe & Pinter (EMNLP 2019) challenged the criteria used, arguing attention can form part of faithful explanations under different diagnostic tests. The debate remains unresolved.

Not quite. This debate — "Attention is not Explanation" (Jain & Wallace, 2019) vs. "Attention is not not Explanation" (Wiegreffe & Pinter, 2019) — focused specifically on whether transformer attention weights can legitimately serve as explanations for model predictions. The consensus is: don't use attention alone for high-stakes explanations.

Lab 3 — Saliency & Attention Deep Dive

Practice session · Minimum 3 exchanges to complete

Evaluating Gradient-Based Explanations

Work with the AI tutor on saliency map scenarios. You might explore how to apply sanity checks, compare Grad-CAM to Integrated Gradients for a medical imaging task, or work through the attention-as-explanation debate for a specific NLP use case.

Suggested starter: "I'm building an X-ray classifier for pneumonia detection. A radiologist colleague says the Grad-CAM highlights look clinically sensible. Is that enough to trust the explanation? What else should I check?"

Module 2 · Lesson 4

Counterfactual Explanations

Actionable "what-if" statements that show how to change a decision

What is the smallest change to my situation that would have produced a different outcome?

SHAP and LIME tell a denied loan applicant that "low income" had a large negative contribution to their score. But they do not say: what would have been different if income were higher? or what combination of changes would have flipped the decision? For regulatory compliance and genuine user recourse, this actionable question is often more important than backward-looking attribution. Counterfactual explanations — answering "what input would have changed the output?" — emerged as a complementary technique to SHAP/LIME, formalised by Wachter, Mittelstadt & Russell in their 2017 paper "Counterfactual Explanations without Opening the Black Box."

Wachter-Mittelstadt-Russell Counterfactuals

Wachter et al. defined a counterfactual as the nearest point in feature space that receives a different prediction. Formally, find x' that minimises distance(x, x') + λ · loss(f(x'), target) — trading off proximity to the original input against achieving the desired outcome. The distance function matters enormously: L1 norm encourages sparse changes (few features altered), L2 encourages small changes overall, and domain-constrained metrics can enforce that only actionable features (e.g., savings rate, not age) are altered.
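A minimal sketch of this optimisation for a two-feature logistic model, assuming an L1 distance, a fixed λ, and proximal gradient descent (soft-thresholding) as the solver. The weights, λ, and step size are hypothetical, and Wachter et al. do not prescribe this particular solver.

```python
import math

W = [1.5, -2.0]   # hypothetical logistic-model weights
B = -0.5          # hypothetical intercept


def predict(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(v, t):
    # Proximal operator of the L1 norm: shrink v toward zero by t.
    return math.copysign(max(abs(v) - t, 0.0), v)


def counterfactual(x, target=1.0, lam=10.0, lr=0.05, steps=2000):
    """Minimise lam * (f(x') - target)^2 + ||x' - x||_1 over x' = x + delta."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        p = predict([xi + di for xi, di in zip(x, delta)])
        # Shared factor in the gradient of lam * (p - target)^2 w.r.t. each input
        common = 2.0 * lam * (p - target) * p * (1.0 - p)
        # Gradient step on the loss term, then the L1 proximal step
        delta = [soft_threshold(di - lr * common * w, lr) for di, w in zip(delta, W)]
    return [xi + di for xi, di in zip(x, delta)]


x = [0.0, 0.5]
x_cf = counterfactual(x)
print(predict(x), x_cf, predict(x_cf))
```

With these settings the original input scores below 0.5 and the counterfactual scores above it, while only the second feature changes. That sparsity is the effect of the L1 term; swapping in an L2 distance would spread the change across both features.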

The paper argued that counterfactuals satisfy the right to explanation under GDPR Article 22, because they provide meaningful, actionable information about automated decision-making without requiring disclosure of the model's internal parameters. This legal framing accelerated adoption in European fintech.

Real Case — Danske Bank (2021)

Danske Bank's credit decisioning team piloted counterfactual explanations in their automated lending platform in 2021, as reported in their AI ethics transparency report. For mortgage denials, the system generated statements like: "If your debt-to-income ratio were 38% instead of 51%, and you had 6 months of additional employment history, this application would have been approved." Customer satisfaction with denial explanations improved significantly, and the team reported that compliance officers found counterfactuals easier to audit for regulatory adverse action notice compliance than SHAP waterfall plots.

DiCE: Diverse Counterfactuals

A key limitation of the Wachter framework: it finds a single nearest counterfactual. Real users benefit from multiple diverse options — "here are three different paths to approval." Mothilal, Sharma & Tan's DiCE (Diverse Counterfactual Explanations, 2020, Microsoft Research) extended the framework to generate a set of counterfactuals that are collectively diverse while each being individually close to the original input. DiCE is open-source and integrated into Microsoft's InterpretML toolkit.

DiCE adds a diversity penalty to the optimisation objective so that returned counterfactuals differ from each other in which features they change — giving decision subjects genuine options rather than variations on the same intervention. It also supports feasibility constraints: marking features as immutable (age, race, nationality) or specifying allowed ranges, ensuring counterfactuals reflect actionable changes rather than fictional ones.

FACE: Feasible and Actionable Counterfactuals

Poyiadzi et al.'s FACE (2020) addressed a deeper problem: proximity in feature space does not equal feasibility in the real world. A counterfactual that says "increase your income by £40,000" is nearby in Euclidean space but may cross through an infeasible region (no realistic path from current income to that level). FACE generates counterfactuals along data density corridors — paths that pass through regions of high training-data density, ensuring the counterfactual journey is realistically traversable.
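The path idea can be illustrated with a shortest-path search over a graph that connects only nearby data points. This is a simplified sketch: edges are weighted by plain Euclidean distance rather than a density estimate, and the points, labels, and eps threshold are made up.

```python
import heapq
import math


def shortest_paths(points, start, eps):
    """Dijkstra over a graph linking points closer than eps."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v in range(len(points)):
            if v == u:
                continue
            step = math.dist(points[u], points[v])
            if step <= eps and d + step < dist.get(v, math.inf):
                dist[v] = d + step
                prev[v] = u
                heapq.heappush(heap, (d + step, v))
    return dist, prev


def path_to(prev, start, goal):
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical normalised data points; index 0 is the denied applicant.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 2.0)]
approved = {3, 4}   # indices the model approves (assumed labels)

dist, prev = shortest_paths(points, start=0, eps=1.1)
reachable = [i for i in approved if i in dist]
goal = min(reachable, key=dist.get)
path = path_to(prev, 0, goal)
print(path, dist[goal])
```

Point 4 is the closest approved point in straight-line distance (2.0 versus about 2.24 for point 3), but no chain of observed data leads to it, so the search returns the longer route 0 → 1 → 2 → 3.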

Counterfactuals in Algorithmic Recourse

The broader concept of algorithmic recourse (Ustun, Spangher & Liu, 2019) goes beyond explanation to actionable recommendation: given that you were denied credit, here is a specific policy (increase savings by $X, reduce one credit card balance by $Y over Z months) that would change the outcome. Recourse methods integrate causal structure (which variables can be changed, which are downstream effects) with counterfactual search to avoid "perverse" recommendations — like "open two new credit cards" to raise available credit, which might simultaneously lower the credit score through hard inquiries.

Karimi et al.'s "Algorithmic Recourse: From Counterfactual Explanations to Interventions" (ACM FAccT 2021) formalised the distinction between observational counterfactuals (nearest point in data space) and interventional counterfactuals (nearest achievable outcome under a causal model), arguing the latter is required for genuine recourse.
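A toy example may make the observational versus interventional distinction concrete. Everything here is hypothetical: the scoring formula, the coefficients, and the assumed causal link (each new card adds available credit and one hard inquiry) are invented for illustration and are not taken from Karimi et al. or from any real credit model.

```python
def score(available_credit_k, hard_inquiries):
    """Hypothetical credit score: more available credit helps, hard inquiries hurt."""
    return 600 + 2.0 * available_credit_k - 15.0 * hard_inquiries

def observational(x, new_available_credit_k):
    """Edit one feature value and hold every other feature fixed."""
    return score(new_available_credit_k, x["hard_inquiries"])

def interventional(x, new_cards, credit_per_card_k=5.0):
    """Perform the action and propagate its downstream effects through the assumed causal model."""
    available = x["available_credit_k"] + credit_per_card_k * new_cards
    inquiries = x["hard_inquiries"] + new_cards  # each application triggers a hard inquiry
    return score(available, inquiries)

person = {"available_credit_k": 10.0, "hard_inquiries": 1}

baseline = score(person["available_credit_k"], person["hard_inquiries"])
obs = observational(person, new_available_credit_k=20.0)
itv = interventional(person, new_cards=2)

print("baseline:", baseline)
print("observational (available credit set to 20k):", obs)
print("interventional (open two cards):", itv)
```

Under these assumptions the observational edit looks like an improvement, while actually opening two cards lowers the score, which is the kind of perverse recommendation a causal model is meant to catch.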

Regulatory Relevance

The UK Financial Conduct Authority's 2022 guidance on AI in financial services explicitly cited counterfactual explanations as a mechanism for satisfying the consumer duty to provide "meaningful explanations" for automated decisions. The EU AI Act's Article 13 (transparency obligations for high-risk AI systems) requires that affected persons receive information sufficient to allow meaningful exercise of their rights — language that legal analysts at Allen & Overy and Bird & Bird have interpreted as requiring recourse-style explanations for consequential AI decisions.

Key Terms

Counterfactual explanation: An explanation stating the minimum change to input features that would produce a different model output. Answers "what would need to be different?" rather than "why was this the outcome?"

Algorithmic recourse: Actionable guidance derived from counterfactual analysis, specifying interventions a person can actually take to change an automated decision outcome.

DiCE: Diverse Counterfactual Explanations (Mothilal et al., 2020). Generates multiple diverse counterfactuals simultaneously, giving users several actionable paths rather than a single recommendation.

Feasibility constraint: A restriction in counterfactual search preventing recommendations that change immutable features (age, race) or lie in regions of feature space that are impossible to reach in practice.

FACE: Feasible and Actionable Counterfactual Explanations (Poyiadzi et al., 2020). Generates counterfactuals along data-density corridors to ensure the suggested path of change is realistic.

Lesson 4 Quiz — Counterfactual Explanations

Three questions · Select the best answer

What fundamental limitation of single-counterfactual methods does DiCE (Mothilal et al., 2020) address?

Correct. Wachter et al.'s framework finds one nearest counterfactual — a single recommendation. DiCE adds a diversity penalty so that the returned set of counterfactuals collectively spans different features, giving users genuinely distinct paths rather than variations on the same change.

Not quite. DiCE's key contribution is diversity: where Wachter-style methods return one nearest counterfactual (one path to approval), DiCE returns a set of counterfactuals that differ from each other in which features change — so users get real alternatives. It also supports feasibility constraints (another limitation, but separate).

What problem does FACE (Poyiadzi et al., 2020) solve that standard distance-minimising counterfactuals do not?

Correct. A Euclidean-nearest counterfactual may require passing through sparse data regions representing impossible real-world states (e.g., jumping from £30k to £100k income directly). FACE routes counterfactuals through high-density data corridors — paths that real people plausibly traverse.

Not quite. FACE addresses feasibility of trajectory, not just feasibility of endpoint. Standard distance metrics may find a nearby counterfactual that crosses infeasible regions of feature space. FACE navigates along data-density corridors so the path from current state to counterfactual state is one that real-world trajectories actually follow.

Karimi et al. (ACM FAccT 2021) distinguish between observational and interventional counterfactuals. What is the key difference?

Correct. An observational counterfactual just says: "if your feature values were X, prediction would be Y." An interventional counterfactual asks: "if I actually performed intervention I on variable V, what would causally result?" It accounts for the fact that changing one variable (e.g., opening new credit cards to raise available credit) may causally affect others (the credit score, through hard inquiries), potentially negating the intended improvement.

Not quite. Karimi et al.'s distinction is causal: an observational counterfactual finds a nearby point in data space without considering how you'd actually get there. An interventional counterfactual models the causal consequences of each action — e.g., "opening a new credit card" might increase available credit but trigger a hard inquiry that drops the score. Causal structure prevents perverse recommendations.

Lab 4 — Counterfactual Design Studio

Practice session · Minimum 3 exchanges to complete

Building Actionable Recourse

Work with the AI tutor to design counterfactual explanations for real decision scenarios. You might work through a credit denial case, discuss how to set feasibility constraints, compare DiCE to FACE for a specific use case, or think through what "algorithmic recourse" means legally.

Suggested starter: "A mortgage applicant was denied. Their debt-to-income ratio is 55%, they have 2 years employment history, and a credit score of 640. Design a set of three diverse, actionable counterfactuals using DiCE-style logic. Mark any features that should be immutable."

Counterfactual Tutor

Welcome to the counterfactual design lab. I can help you work through recourse scenarios, design feasibility constraints, compare DiCE and FACE approaches, or explore the causal vs. observational counterfactual distinction. What case would you like to work on?

Module 2 Test — Interpretation Techniques

15 questions · Score 80% or above to pass

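1. SHAP's "efficiency" (local accuracy) property guarantees that:

Correct. Efficiency, called local accuracy in the SHAP paper, states Σᵢ φᵢ = f(x) − E[f(X)]: the attributions account for the whole prediction gap, with no unexplained residual. It is distinct from the additivity axiom, which concerns combining attributions from separate games.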

The efficiency property is: the sum of all feature SHAP values for a prediction exactly equals f(x) − E[f(x)] — the deviation from the baseline. This guarantees full attribution.
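The efficiency property can be checked numerically on a small case. The sketch below computes exact Shapley values by brute-force coalition enumeration, marginalising absent features over a background set. The `model`, the input `x`, and the `background` rows are hypothetical, and the approach is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values by enumerating all coalitions (exponential in len(x))."""
    n = len(x)

    def value(coalition):
        # Expected output with coalition features fixed to x and the rest taken from background rows.
        total = 0.0
        for row in background:
            z = [x[i] if i in coalition else row[i] for i in range(n)]
            total += model(z)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return phi, value(set())

def model(z):
    """Hypothetical model with one linear term and one interaction term."""
    return 2.0 * z[0] + z[1] * z[2]

background = [[0, 0, 0], [1, 1, 1], [2, 0, 1], [1, 2, 0]]
x = [3, 1, 2]

phi, base = shapley_values(model, x, background)
print("attributions:", phi)
print("sum of attributions:", sum(phi))
print("f(x) - baseline:", model(x) - base)
```

The last two printed values agree up to floating-point error, which is what efficiency promises.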

2. TreeSHAP achieves exact Shapley values in polynomial time by:

Correct. TreeSHAP (Lundberg et al., 2020) recursively tracks how each feature's decision nodes split the prediction from the root expectation, avoiding exponential coalition enumeration.

TreeSHAP exploits tree structure — recursively computing marginal contributions at each node — to achieve exact values in O(TLD²) time rather than exponential coalition sampling.

3. KernelSHAP's key limitation compared to TreeSHAP is:

Correct. KernelSHAP marginalises absent features by sampling from their marginal distribution, ignoring conditional relationships. When features are correlated (e.g., height and weight), this creates unrealistic data points that may distort attributions.

KernelSHAP's core limitation is its independence assumption: absent features are filled from marginal distributions, ignoring correlations. With strongly correlated features, this produces off-manifold, unrealistic inputs that can mislead attributions.

4. In the LIME objective argmin_g [L(f,g,π_x) + Ω(g)], the locality kernel π_x controls:

Correct. π_x is the proximity weighting function. Perturbed samples closer to x get higher weight in fitting g, ensuring the surrogate model is optimised to be faithful in x's neighbourhood specifically.

π_x is the locality kernel — a proximity weighting function that assigns higher loss weight to perturbed samples near x. This makes the surrogate g more faithful locally around the instance being explained. Ω(g) handles regularisation/sparsity.

5. LIME for images uses superpixels rather than individual pixels as the unit of perturbation because:

Correct. Superpixels group contiguous pixels into semantically coherent regions (a bird's wing, a background tree). Masking an entire superpixel produces interpretable ablations; toggling individual pixels would create imperceptible noise without meaningful explanations.

LIME uses superpixels because they are semantically meaningful units. A superpixel might represent "the frog's eye" or "the water background." Masking it on/off produces an interpretable test. Individual pixel toggling creates imperceptible noise, not a meaningful ablation.

6. Alvarez-Melis and Jaakkola (2019) recommended practitioners should address LIME's instability by:

Correct. Their paper proposed stability metrics and recommended running LIME multiple times; if feature rankings vary substantially across runs, the explanation is unreliable and should not be used for high-stakes decisions without further investigation.

Alvarez-Melis and Jaakkola proposed stability metrics: run LIME multiple times on the same instance and measure agreement across runs. If explanations vary substantially, don't trust a single run. A fixed seed would mask the instability rather than quantify it.

7. Adebayo et al.'s "Sanity Checks for Saliency Maps" (2018) tested whether saliency methods:

Correct. The sanity check is a randomisation test: if you randomise model weights, a faithful saliency method should produce clearly different maps. In the paper, guided backpropagation and Guided Grad-CAM failed, producing similar-looking maps whether the weights were trained or random, while vanilla gradients and Grad-CAM passed.

The sanity check randomises model weights and checks whether the saliency map changes. If a method produces similar-looking maps for trained and randomly-weighted models, its maps reflect input structure (e.g., edges), not what the model learned. Guided backpropagation and Guided Grad-CAM failed this test; vanilla gradients and Grad-CAM passed.

8. What makes Integrated Gradients "complete" in the attribution-axioms sense?

Correct. Completeness: Σᵢ IGᵢ(x) = f(x) − f(x′), where x′ is the baseline. The path integral from baseline to input sums to exactly the prediction difference, analogous to SHAP's efficiency property. Vanilla gradients measure only local sensitivity and carry no such guarantee.

Completeness means: the attributions sum to f(x) − f(x_baseline). Integrating gradients along the path from baseline to input captures the exact total effect, leaving no residual. Vanilla gradients measure local sensitivity only and have no such guarantee.
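Completeness is easy to verify on a toy function. The sketch below approximates Integrated Gradients with a midpoint Riemann sum along the straight path from baseline to input, using finite-difference gradients so that it needs no autodiff library. The function `f`, the input, the zero baseline, and the step count are all assumptions chosen for illustration.

```python
import math

def f(x):
    """Hypothetical smooth model: sigmoid of a small fixed score with an interaction term."""
    s = 1.5 * x[0] - 2.0 * x[1] + 0.5 * x[0] * x[1]
    return 1.0 / (1.0 + math.exp(-s))

def grad(func, x, h=1e-6):
    """Central finite-difference gradient of func at x."""
    g = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        dn = list(x)
        dn[i] -= h
        g.append((func(up) - func(dn)) / (2 * h))
    return g

def integrated_gradients(func, x, baseline, steps=200):
    """Midpoint Riemann sum of gradients along the straight line from baseline to x."""
    n = len(x)
    total = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(func, point)
        for i in range(n):
            total[i] += g[i]
    # Scale the averaged gradient by the input-baseline difference per feature.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 0.5]
baseline = [0.0, 0.0]
attributions = integrated_gradients(f, x, baseline)

print("attributions:", attributions)
print("sum of attributions:", sum(attributions))
print("f(x) - f(baseline):", f(x) - f(baseline))
```

The sum of attributions matches the output difference up to numerical integration error; a single gradient evaluated at `x` alone would generally not.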

9. The academic debate between Jain & Wallace (2019) and Wiegreffe & Pinter (2019) concluded that:

Correct. The debate is genuinely unresolved. Practical consensus: don't rely on raw attention weights alone for high-stakes explanations. Use gradient-based or SHAP-based methods as primary attribution; treat attention as supplementary diagnostic information.

The academic debate ended without resolution. Jain & Wallace showed attention doesn't reliably correlate with feature importance; Wiegreffe & Pinter challenged the criteria. Practical takeaway: attention alone is insufficient for high-stakes XAI; use gradient or SHAP methods as primary attribution.

10. Grad-CAM is "class-discriminative" — what does this mean in practice?

Correct. Grad-CAM computes gradients of a specific class score into the last conv layer. By changing the target class, you get different spatial maps — critical for multi-class tasks where the same image region might support one class but not another.

Class-discriminativeness means Grad-CAM is conditioned on a target class. The gradient of class C's score through the final conv layer highlights regions specifically relevant to predicting C. Changing C changes the map — unlike vanilla gradient saliency which is not class-conditioned in this way.

11. Wachter, Mittelstadt & Russell (2017) argued counterfactuals satisfy GDPR Article 22 because:

Correct. The paper argued counterfactuals give data subjects enough information to understand and contest automated decisions — "if X were different, the outcome would change" — without exposing trade-secret model details. This satisfies the spirit of meaningful explanation under GDPR Article 22.

Wachter et al.'s argument: counterfactuals provide meaningful, actionable information ("change these features and the decision changes") without exposing model internals. This allows data subjects to contest decisions and understand their rights — the key GDPR Article 22 requirement — without requiring proprietary disclosure.

12. DiCE's diversity penalty in the optimisation objective ensures that:

Correct. DiCE adds a pairwise diversity term to the objective, penalising counterfactuals that make similar feature changes. This produces a set where one option might change income and savings, another changes debt-to-income and employment tenure — genuinely different paths.

DiCE's diversity penalty makes returned counterfactuals differ from each other in which features they modify. Rather than three variants of "reduce debt," you get one option reducing debt, one extending employment history, one improving savings rate — genuinely different paths to the same goal.

13. FACE (Poyiadzi et al., 2020) differs from standard distance-minimising counterfactuals by requiring that:

Correct. FACE's key innovation is path feasibility: it navigates along data density corridors so the counterfactual journey stays within the manifold of plausible feature combinations. A Euclidean-nearest counterfactual may technically exist but require traversing impossible intermediate states.

FACE addresses path feasibility, not just endpoint distance. Standard methods find the nearest counterfactual in Euclidean space, which may require crossing regions of feature space that no real data points occupy — meaning no realistic trajectory exists. FACE navigates along high-density corridors instead.

14. Karimi et al. (FAccT 2021) argue that interventional counterfactuals are preferable to observational ones for algorithmic recourse because:

Correct. Observational counterfactuals find a nearby point without modelling how variables relate causally. If you recommend "open two credit cards to increase available credit," a causal model reveals this triggers hard inquiries that lower the credit score — potentially worsening the outcome. Interventional counterfactuals prevent such perverse advice.

Karimi et al.'s point: if variable A causally affects B and C, a recommendation to change A must account for what happens to B and C. Observational counterfactuals ignore these downstream effects, potentially recommending actions that backfire. A causal model of the domain is required for genuinely useful recourse.

15. Which combination of explanation methods best satisfies both regulatory adverse action notice requirements and genuine user recourse in a credit decisioning context?

Correct. SHAP provides axiomatic attribution satisfying "top four adverse reasons" regulatory formats (used by FICO and documented in the 2018 Explainable ML Challenge). DiCE/FACE counterfactuals provide actionable recourse — the forward-looking "what can I do?" answer. Combining both satisfies regulatory and user needs simultaneously.

Best practice combines SHAP (for backward-looking attribution that maps to adverse action notice "top reasons" formats) with counterfactual explanations (for forward-looking recourse — telling users what they can do to change the decision). Raw attention, vanilla gradients, or a single unstable LIME run each fall short on one or both dimensions.