When Scott Lundberg and Su-In Lee published "A Unified Approach to Interpreting Model Predictions" at NeurIPS 2017, they unified a fragmented landscape of local explanation methods under one mathematical framework rooted in cooperative game theory. The paper's central insight: treat each feature as a "player" and use the Shapley value from economics to assign each one a fair share of the prediction's deviation from the baseline.
Within two years, SHAP had been downloaded millions of times and integrated into Microsoft Azure ML, Amazon SageMaker Clarify, and Databricks. It became, for many teams, the default answer to the question: "Why did the model predict this?"
The Shapley value originates in a 1953 paper by Lloyd Shapley solving a cooperative game problem: if a coalition of players jointly produces a payoff, how should that payoff be divided equitably? The key properties are efficiency (attributions sum to the prediction gap from baseline), symmetry (features with identical contributions get identical attributions), dummy (features that never matter get zero), and additivity (attributions from separate games can be combined).
In the ML context, the "prediction" is the output value (or log-odds for classifiers), the "baseline" is the expected model output over the training data, and each "feature coalition" is a subset of inputs given to the model while the rest are marginalized over the data distribution. The SHAP value for feature i is the average marginal contribution of feature i over all possible orderings in which features can join the coalition (equivalently, a weighted average over all coalitions that exclude i).
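The definition above can be sketched by brute force. This is a minimal illustration, not the SHAP library's implementation: `toy_model`, `background`, and `x` are hypothetical, and absent features are marginalized by averaging over the background rows.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values of model(x) by enumerating every coalition."""
    n = len(x)

    def value(coalition):
        # Features in the coalition keep x's values; the rest are taken
        # from each background row, and the model outputs are averaged.
        total = 0.0
        for row in background:
            mixed = [x[j] if j in coalition else row[j] for j in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi

# Hypothetical toy model with an interaction between features 1 and 2.
def toy_model(v):
    return 2.0 * v[0] + v[1] * v[2]

background = [[0.0, 1.0, 1.0], [1.0, 0.0, 2.0], [2.0, 2.0, 0.0]]
x = [3.0, 1.0, 4.0]
phi = shapley_values(toy_model, x, background)
baseline = sum(toy_model(row) for row in background) / len(background)
```

By the efficiency property, `sum(phi)` should match `toy_model(x) - baseline` up to floating-point error. The loop visits every coalition, so this is only practical for a handful of features.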
Exact computation is exponential in the number of features: there are 2ⁿ coalitions for n features. SHAP's impact came from efficient algorithms: TreeSHAP for tree-based models runs in O(TLD²) time (T trees, L leaves, D max depth), enabling exact values on large forests in seconds. KernelSHAP provides a model-agnostic approximation using weighted linear regression on sampled coalitions.
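To make the cost and the regression weighting concrete, here is a small sketch. The weight formula is the Shapley kernel from the Lundberg and Lee paper; the function names are illustrative and not part of the SHAP library.

```python
from math import comb

def num_coalitions(n):
    """Number of feature subsets an exact computation must consider."""
    return 2 ** n

def shapley_kernel_weight(n, size):
    """Regression weight for a coalition with `size` of `n` features present."""
    if size == 0 or size == n:
        # The empty and full coalitions get infinite weight, which in
        # practice is enforced as a constraint on the regression.
        return float("inf")
    return (n - 1) / (comb(n, size) * size * (n - size))

weights = [shapley_kernel_weight(4, s) for s in range(1, 4)]
```

Very small and very large coalitions receive the most weight, because they isolate the effect of adding or removing a single feature most cleanly.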
AWS launched SageMaker Clarify in December 2020 with SHAP as the core attribution engine. Teams at Amazon used KernelSHAP to explain why specific product recommendations were made, surfacing that "days since last purchase" was the dominant feature for lapsed-customer reactivation models, a finding that changed reactivation campaign design. The same tool flagged that "zip code" was acting as a proxy for race in a credit-related scoring pipeline, triggering a remediation before the model shipped.
For gradient-boosted tree models (XGBoost, LightGBM, CatBoost, scikit-learn's GradientBoostingClassifier), TreeSHAP computes exact Shapley values by recursively tracking how each feature's decision nodes split the prediction away from the root expectation. The algorithm was described in Lundberg et al.'s 2020 Nature Machine Intelligence paper and became the default backend in the SHAP Python library's TreeExplainer.
For a single-output model, a call to shap.TreeExplainer(model).shap_values(X) returns a matrix of shape (n_samples × n_features) where each cell is the SHAP value for that sample-feature pair. Positive values push the prediction above baseline; negative values push it below. A waterfall plot for a single prediction shows each feature's signed contribution stacked from baseline to final output.
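The arithmetic behind a waterfall plot can be sketched without any plotting library. The baseline, SHAP values, and feature names below are hypothetical, and ordering by absolute contribution is an assumption about presentation rather than a requirement.

```python
def waterfall_steps(baseline, shap_row, feature_names):
    """Return (feature, contribution, running_total), largest |contribution| first."""
    order = sorted(range(len(shap_row)), key=lambda j: abs(shap_row[j]), reverse=True)
    running = baseline
    steps = []
    for j in order:
        running += shap_row[j]  # each bar starts where the previous one ended
        steps.append((feature_names[j], shap_row[j], running))
    return steps

# Hypothetical single-row explanation.
steps = waterfall_steps(0.30, [0.25, -0.10, 0.05], ["lactate", "age", "heart_rate"])
final_output = steps[-1][2]
```

The last running total equals the baseline plus the sum of the row's SHAP values, which is the model output for that sample.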
Global importance is computed by taking the mean absolute SHAP value per feature across all samples: mean(|SHAP|). This is often preferred over impurity-based importance (which is biased toward high-cardinality features) and permutation importance (which ignores feature interactions).
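As a minimal sketch of that aggregation, assuming the SHAP matrix is available as a list of rows (the values and feature names here are made up):

```python
def global_importance(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value across samples."""
    n_samples = len(shap_matrix)
    scores = {}
    for j, name in enumerate(feature_names):
        scores[name] = sum(abs(row[j]) for row in shap_matrix) / n_samples
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical SHAP matrix: three samples, two features.
shap_matrix = [[0.4, -0.1], [-0.6, 0.2], [0.2, -0.3]]
ranking = global_importance(shap_matrix, ["income", "tenure"])
```

Taking the absolute value first matters: a feature that pushes some predictions up and others down would otherwise average out toward zero.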
In 2019, clinicians at the University of Washington used SHAP with a gradient-boosted model predicting sepsis risk. TreeSHAP explanations revealed that the model heavily weighted lactate levels and respiratory rate, which is clinically sensible, but also gave high weight to "number of prior ICU admissions," a feature that correlates with frailty but also with socioeconomic access to care. The SHAP plots made this visible in a way that aggregate metrics did not, prompting a feature audit before clinical deployment.
FICO's 2018 Explainable Machine Learning Challenge used SHAP as a benchmark explanation method for credit scoring. Participants had to produce explanations that satisfied adverse action notice requirements: the regulatory obligation to tell loan applicants the primary reasons for denial. SHAP's additive attribution structure aligned naturally with the "top four reasons" format required by US regulation.
SHAP values explain what the model did, not what is causally true in the world. A high SHAP value for "zip code" means zip code influenced the prediction, not that zip code causally drives the outcome. Causal XAI (addressed in Module 3) requires additional structure. Additionally, KernelSHAP's independence assumption (marginalizing features by sampling from the marginal rather than the conditional distribution) can produce unrealistic counterfactual inputs, especially when features are strongly correlated.
In this lab you'll work with an AI tutor to deepen your understanding of SHAP values. You might discuss how to read a waterfall plot, interpret negative SHAP values, choose between TreeSHAP and KernelSHAP, or think through a real deployment scenario.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin introduced LIME (Local Interpretable Model-agnostic Explanations) at KDD 2016. Their paper "Why Should I Trust You?" opened with a jarring demonstration: a classifier that achieved 99% accuracy on a flu-prediction dataset was shown to rely heavily on the word "not" in text strings like "I do not have a headache": negations the model had learned to invert, but only in specific contexts. LIME made this visible by fitting a sparse linear model around individual predictions.
The method gained rapid adoption: by 2018, LIME was the most cited XAI technique in NLP and tabular ML, and it underpinned early responsible AI toolkits at Google, IBM, and several European fintech companies navigating early GDPR explainability discussions.
LIME's core idea: for any prediction point x, generate a set of perturbed samples in the neighbourhood of x, query the original (black-box) model f for their predictions, then fit a weighted sparse linear model g on those samples, weighting by proximity to x. The result is a local, interpretable approximation valid in x's neighbourhood.
Formally, LIME solves: argmin_g [ L(f, g, π_x) + Ω(g) ] where L is the fidelity loss (how well g mimics f locally), π_x is the locality kernel (proximity weighting), and Ω(g) penalises model complexity (typically via L1 regularisation to enforce sparsity). The output is a small set of features with signed weights indicating their local influence.
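The objective above can be sketched for tabular inputs as a proximity-weighted least-squares fit. This is a simplified illustration, not the LIME library: it assumes Gaussian perturbations and an exponential kernel exp(-d²/width²), it omits the sparsity penalty Ω(g), and `black_box` and all parameter defaults are hypothetical.

```python
import math
import random

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def lime_tabular(f, x, n_samples=500, scale=1.0, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear surrogate around x; return (intercept, coefs)."""
    rng = random.Random(seed)
    d = len(x)
    # Accumulate the weighted normal equations (Z^T W Z) beta = Z^T W y.
    ztwz = [[0.0] * (d + 1) for _ in range(d + 1)]
    ztwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]      # perturbed sample
        dist_sq = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        w = math.exp(-dist_sq / kernel_width ** 2)        # locality kernel
        row = [1.0] + z
        y = f(z)                                          # black-box query
        for i in range(d + 1):
            ztwy[i] += w * row[i] * y
            for j in range(d + 1):
                ztwz[i][j] += w * row[i] * row[j]
    beta = solve_linear_system(ztwz, ztwy)
    return beta[0], beta[1:]

# Hypothetical nonlinear black box; coefs approximate its local slopes near x.
def black_box(z):
    return z[0] ** 2 + z[1]

intercept, coefs = lime_tabular(black_box, [1.0, 2.0])
```

Narrowing `kernel_width` makes the surrogate more local but leaves fewer effectively weighted samples, which is one source of the instability discussed for LIME.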
For tabular data, perturbations are made by sampling feature values from the training distribution and zeroing out random subsets. For text, words are randomly removed (replaced with padding). For images, LIME uses superpixels (contiguous image segments) that can be masked on or off.
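For the text case, the perturbation step might look like the sketch below. It drops words outright rather than substituting a padding token, and the 50% keep probability is an assumption chosen for illustration.

```python
import random

def perturb_text(text, n_samples=5, keep_prob=0.5, seed=0):
    """Return (mask, perturbed_text) pairs; mask[i] is True if word i was kept."""
    rng = random.Random(seed)
    words = text.split()
    samples = []
    for _ in range(n_samples):
        mask = [rng.random() < keep_prob for _ in words]
        kept = " ".join(w for w, keep in zip(words, mask) if keep)
        samples.append((mask, kept))
    return samples

samples = perturb_text("I do not have a headache")
```

The binary masks, not the raw strings, are what the surrogate linear model is fitted on: each mask position becomes one interpretable feature.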
When ProPublica published their COMPAS investigation in May 2016, researchers immediately began applying LIME to the COMPAS-style models they could reconstruct. LIME analyses by Julia Angwin's team and independent researchers showed that locally, the model's predictions for Black defendants were dominated by different features than for white defendants: not just globally biased, but locally inconsistent in which signals drove individual outcomes. This use of local explanation to detect disparate treatment patterns became a template for fairness auditing.
LIME's image explainer was popularised by Ribeiro's own demo on a GoogLeNet classifier that correctly labeled an image "tree frog" but, under LIME analysis, was found to base the prediction primarily on the frog's eye texture and the water background, not the body shape. When the eye region was masked, confidence dropped from 0.93 to 0.12. This became a canonical example of shortcut learning made visible through local explanation.
For NLP, LIME produces word-level attributions by randomly ablating tokens and observing confidence changes. In 2019, researchers at Salesforce used LIME on their email classification models and found that certain legal boilerplate phrases were driving "high priority" classifications, a feature the model had learned from sales email metadata, not semantic content. Removing these phrases from training reduced false-positive priority flags by 34%.
LIME and SHAP are both local, additive explanation methods but differ critically in how they generate and weight perturbations. SHAP has theoretical guarantees (the four Shapley axioms); LIME does not, and its explanations can be unstable if the locality kernel π is poorly tuned. SHAP values sum exactly to the prediction gap; LIME's linear coefficients do not have this accounting guarantee.
| Property | LIME | SHAP |
|---|---|---|
| Theoretical basis | Local linear approximation; no axiomatic guarantees | Shapley axioms: efficiency, symmetry, dummy, additivity |
| Scope | Model-agnostic | Model-agnostic (Kernel) or model-specific (Tree, Deep) |
| Stability | Can vary across runs due to random sampling | TreeSHAP is deterministic; KernelSHAP has sampling variance |
| Attribution sum | Does not guarantee sum = prediction gap | Guaranteed: Σφᵢ = f(x) − E[f(x)] |
| Speed (trees) | Slow: many black-box queries per explanation | Fast: TreeSHAP is exact in polynomial time |
| Best use case | Text/image modalities; quick prototyping; NLP audits | Tabular data; production fairness monitoring; regulatory use |
A 2018 study by Alvarez-Melis and Jaakkola ("On the Robustness of Interpretability Methods") showed that LIME explanations for the same instance could vary substantially across runs, with different features receiving high weight simply due to different random perturbation draws. They proposed stability metrics; practitioners should run LIME multiple times and check agreement before trusting a single explanation.
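One simple way to "check agreement" is to compare the top-k features across repeated runs. The sketch below uses mean pairwise Jaccard overlap, which is an illustrative choice and not the metric proposed by Alvarez-Melis and Jaakkola; the `runs` data is hypothetical.

```python
from itertools import combinations

def top_k(weights, k):
    """Names of the k features with the largest absolute weight."""
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return {name for name, _ in ranked[:k]}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def mean_top_k_agreement(runs, k=2):
    """Average Jaccard overlap of top-k feature sets over all pairs of runs."""
    sets = [top_k(run, k) for run in runs]
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical LIME weights for the same instance from three separate runs.
runs = [
    {"income": -0.40, "age": 0.10, "tenure": 0.25},
    {"income": -0.35, "age": 0.30, "tenure": 0.05},
    {"income": -0.42, "age": 0.08, "tenure": 0.27},
]
agreement = mean_top_k_agreement(runs, k=2)
```

A score near 1.0 means the runs agree on which features matter; a low score suggests increasing the number of perturbation samples or revisiting the kernel width before reporting the explanation.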
Work with the AI tutor on LIME scenarios. Explore how superpixel masking works, when LIME explanations might be unstable, how LIME handles negations in text, or how to choose the locality kernel bandwidth.
In 2019, a team at Stanford published CheXpert, a large chest X-ray dataset and accompanying model that achieved radiologist-level performance on 14 pathologies. But when researchers applied Grad-CAM saliency maps, they found the model often highlighted pacemaker leads and drain tubes as evidence for cardiomegaly, because patients with those devices are more likely to have enlarged hearts in the training data. The model was right, but for the wrong reasons. Without saliency maps, this clinical shortcut would have been invisible.
The simplest saliency method computes ∂f(x)/∂xᵢ, the gradient of the model output with respect to each input feature. For images, this produces a pixel-wise map where large gradient magnitude indicates "the prediction is sensitive to this pixel." Simonyan et al. (2014) introduced this in "Deep Inside Convolutional Networks," and it remains a baseline comparison method.
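A framework-free sketch of gradient saliency is shown below. It estimates the gradient by central finite differences instead of backpropagation, which is an assumption made to keep the example self-contained; `toy_score` is a hypothetical stand-in for a model's class score.

```python
def numeric_gradient(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def gradient_saliency(f, x):
    """Saliency as the absolute gradient of the output with respect to each input."""
    return [abs(g) for g in numeric_gradient(f, x)]

# Hypothetical scalar "class score" over two inputs.
def toy_score(v):
    return v[0] ** 2 + 3.0 * v[1]

saliency = gradient_saliency(toy_score, [1.0, 2.0])
```

In a deep learning framework the same quantity comes from a single backward pass, which is what makes this method cheap enough to serve as a baseline.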
Raw gradients are noisy: they capture local sensitivity, not global importance. Integrated Gradients (Sundararajan, Taly & Yan, 2017) improves this by integrating the gradient along a path from a baseline input (e.g., a black image) to the actual input, satisfying a completeness axiom (attributions sum to the difference between the output at the input and at the baseline) analogous to SHAP's efficiency property. Google uses Integrated Gradients in its What-If Tool and as the default attribution method in Google Cloud Vertex Explainable AI.
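The path integral can be approximated with a Riemann sum. The sketch below assumes a straight-line path, a midpoint rule with 50 steps, and finite-difference gradients; `toy_score` and the all-zero baseline are hypothetical.

```python
def numeric_gradient(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def integrated_gradients(f, x, baseline, steps=50):
    """Approximate (x - baseline) * integral of grad f along the straight path."""
    n = len(x)
    total = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numeric_gradient(f, point)
        for i in range(n):
            total[i] += grad[i]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

# Hypothetical score with an interaction term.
def toy_score(v):
    return v[0] ** 2 + v[0] * v[1]

x = [1.0, 2.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(toy_score, x, baseline)
```

The sum of `attributions` should be close to `toy_score(x) - toy_score(baseline)`. The choice of baseline is part of the explanation: a different baseline generally yields different attributions.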
SmoothGrad (Smilkov et al., 2017) reduces gradient noise by averaging gradients over many noisy versions of the input, producing sharper, more visually interpretable saliency maps at the cost of additional computation.
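A minimal sketch of the averaging step follows, assuming Gaussian input noise and finite-difference gradients; the sample count, noise level, and `kinked_score` function are illustrative choices.

```python
import random

def numeric_gradient(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def smoothgrad(f, x, n_samples=25, noise_std=0.1, seed=0):
    """Average the gradient of f over noisy copies of x."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
        grad = numeric_gradient(f, noisy)
        for i in range(len(x)):
            total[i] += grad[i]
    return [t / n_samples for t in total]

# Hypothetical score whose gradient flips sign abruptly near v[0] = 0.
def kinked_score(v):
    return abs(v[0]) + v[1] ** 2

smoothed = smoothgrad(kinked_score, [0.01, 1.0])
```

Each noisy copy costs one extra gradient evaluation, so the computation scales linearly with `n_samples`.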
Grad-CAM (Selvaraju et al., 2017; cited over 14,000 times as of 2024) uses the gradients of the class score flowing into the final convolutional layer to produce a coarse spatial map of the regions most relevant to a specific class prediction. Unlike pixel-level saliency, Grad-CAM operates at feature-map resolution and is class-discriminative: you can ask "which regions support class A?" vs "which support class B?" for the same image.
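The core Grad-CAM computation can be sketched on plain nested lists, assuming the last convolutional layer's activations and the class-score gradients with respect to them have already been extracted from the network. The tiny arrays below are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from [K][H][W] activations and class-score gradients."""
    channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])
    # Channel weights: global average pooling of each channel's gradients.
    alphas = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(channels)
    ]
    cam = [[0.0] * width for _ in range(height)]
    for k in range(channels):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alphas[k] * activations[k][i][j]
    # ReLU keeps only regions with a positive influence on the class score.
    return [[max(0.0, v) for v in row] for row in cam]

# Hypothetical layer with two 2x2 feature maps.
activations = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(activations, gradients)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image. Running it with gradients for a different class gives the class-discriminative comparison described above.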
Grad-CAM++ (Chattopadhay et al., 2018) extends this to handle multiple instances of the same class in one image and better localises small objects. HiResCAM (Draelos & Carin, 2020) provides a mathematically faithful version that avoids a gradient averaging step shown to sometimes misattribute importance in multi-layer networks.
Google's Vertex Explainable AI platform (launched 2020, evolved from PAIR Explainability Explorer) uses Integrated Gradients as its primary attribution method for image and text models. A documented use case from Google's internal teams: a satellite imagery model classifying crop health. Integrated Gradients revealed that the model was weighting cloud shadow regions in "unhealthy crop" predictions: shadows that correlated with overcast weather, which correlated with soil moisture stress in the training set. The team added cloud-masking preprocessing, reducing false positives by 18%.
Transformer models compute attention weights that indicate, for each output token, how much each input token was "attended to." It is tempting to use these weights as explanations, to say "the model predicted X because it attended heavily to word Y." This interpretation was challenged decisively by Jain & Wallace's paper "Attention is not Explanation" (NAACL 2019), which showed that attention weights frequently do not correlate with gradient-based feature importances, and that adversarially perturbed attention patterns can produce the same predictions with entirely different attention distributions.
Wiegreffe & Pinter's response "Attention is not not Explanation" (EMNLP 2019) complicated the picture further, arguing that the criteria used by Jain & Wallace were insufficient and that attention can be part of a faithful explanation when additional diagnostic tests are applied. The debate remains open.
Practical consensus: do not use raw attention weights as standalone explanations for high-stakes decisions. Use gradient-based or SHAP-based methods as the primary attribution signal; treat attention as a complementary diagnostic. Several NLP teams at major labs (Anthropic, Google Brain) document this recommendation in internal model cards.
A critical distinction in XAI evaluation: faithfulness (does the explanation accurately reflect what the model computed?) vs. plausibility (does the explanation look reasonable to a human?). Saliency maps can be highly plausible, highlighting the right-looking regions, while being unfaithful if a different but equally valid saliency map (e.g., generated by perturbing hyperparameters) would change the story entirely.
Adebayo et al.'s "Sanity Checks for Saliency Maps" (NeurIPS 2018) showed that some popular saliency methods, notably guided backpropagation and Guided Grad-CAM, produce nearly identical maps even when the model weights are randomly re-initialised. This means those maps may reflect properties of the input data structure, not the model's learned parameters. In that study, vanilla gradients and Grad-CAM passed the sanity checks.
Work with the AI tutor on saliency map scenarios. You might explore how to apply sanity checks, compare Grad-CAM to Integrated Gradients for a medical imaging task, or work through the attention-as-explanation debate for a specific NLP use case.
SHAP and LIME tell a denied loan applicant that "low income" had a large negative contribution to their score. But they do not say what would have been different if income were higher, or what combination of changes would have flipped the decision. For regulatory compliance and genuine user recourse, this actionable question is often more important than backward-looking attribution. Counterfactual explanations, which answer "what input would have changed the output?", emerged as a complementary technique to SHAP/LIME, formalised by Wachter, Mittelstadt & Russell in their 2017 paper "Counterfactual Explanations without Opening the Black Box."
Wachter et al. defined a counterfactual as the nearest point in feature space that receives a different prediction. Formally, find x' that minimises distance(x, x') + λ · loss(f(x'), target), trading off proximity to the original input against achieving the desired outcome. The distance function matters enormously: L1 norm encourages sparse changes (few features altered), L2 encourages small changes overall, and domain-constrained metrics can enforce that only actionable features (e.g., savings rate, not age) are altered.
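The trade-off in that objective can be illustrated with a brute-force search. This sketch swaps the gradient-based optimisation used in practice for an exhaustive grid search; the logistic `predict` model, the squared-error loss, λ = 10, and the grid bounds are all hypothetical choices.

```python
import math

def predict(x, w=(1.0, 1.0), b=-3.0):
    """Hypothetical approval model: logistic score over two features."""
    return 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))

def objective(x, x_cf, target=1.0, lam=10.0):
    """L1 distance to the original input plus lam times squared prediction loss."""
    l1_distance = sum(abs(a - c) for a, c in zip(x, x_cf))
    return l1_distance + lam * (predict(x_cf) - target) ** 2

def nearest_counterfactual(x, grid_step=0.5, grid_max=4.0):
    """Pick the grid point that minimises the objective."""
    n = int(grid_max / grid_step) + 1
    candidates = [(i * grid_step, j * grid_step) for i in range(n) for j in range(n)]
    return min(candidates, key=lambda c: objective(x, c))

x = (1.0, 1.0)                  # currently scored below 0.5 (denied)
x_cf = nearest_counterfactual(x)
```

Raising `lam` pushes the search toward points with a higher approval score at the cost of larger changes; lowering it keeps the counterfactual closer to `x` but may leave the decision unchanged.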
The paper argued that counterfactuals satisfy the right to explanation under GDPR Article 22, because they provide meaningful, actionable information about automated decision-making without requiring disclosure of the model's internal parameters. This legal framing accelerated adoption in European fintech.
Danske Bank's credit decisioning team piloted counterfactual explanations in their automated lending platform in 2021, as reported in their AI ethics transparency report. For mortgage denials, the system generated statements like: "If your debt-to-income ratio were 38% instead of 51%, and you had 6 months of additional employment history, this application would have been approved." Customer satisfaction with denial explanations improved significantly, and the team reported that compliance officers found counterfactuals easier to audit for regulatory adverse action notice compliance than SHAP waterfall plots.
A key limitation of the Wachter framework: it finds a single nearest counterfactual. Real users benefit from multiple diverse options: "here are three different paths to approval." Mothilal, Sharma & Tan's DiCE (Diverse Counterfactual Explanations, 2020, Microsoft Research) extended the framework to generate a set of counterfactuals that are collectively diverse while each being individually close to the original input. DiCE is open-source and integrated into Microsoft's InterpretML toolkit.
DiCE adds a diversity penalty to the optimisation objective so that returned counterfactuals differ from each other in which features they change, giving decision subjects genuine options rather than variations on the same intervention. It also supports feasibility constraints: marking features as immutable (age, race, nationality) or specifying allowed ranges, ensuring counterfactuals reflect actionable changes rather than fictional ones.
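The two ideas, feasibility filtering and diversity, can be sketched with a simple greedy selection. This is a stand-in for illustration and not DiCE's actual optimisation or API: it assumes the candidates already flip the prediction, measures closeness with an unscaled L1 distance, and treats two counterfactuals as redundant when they change the same set of features. All data is hypothetical.

```python
def changed_features(x, cf):
    """Names of the features whose values differ between x and cf."""
    return {name for name in x if x[name] != cf[name]}

def is_feasible(x, cf, immutable):
    """A counterfactual is feasible if it leaves every immutable feature alone."""
    return not (changed_features(x, cf) & set(immutable))

def l1_distance(a, b):
    return sum(abs(a[k] - b[k]) for k in a)

def select_diverse(x, candidates, immutable, k=2):
    """Greedily keep the closest feasible candidates with distinct changed-feature sets."""
    feasible = [c for c in candidates if is_feasible(x, c, immutable)]
    feasible.sort(key=lambda c: l1_distance(x, c))  # closest first
    chosen = []
    for cf in feasible:
        feats = changed_features(x, cf)
        if all(feats != changed_features(x, other) for other in chosen):
            chosen.append(cf)
        if len(chosen) == k:
            break
    return chosen

# Hypothetical applicant and candidate counterfactuals.
x = {"age": 40, "income": 30, "debt": 20}
candidates = [
    {"age": 35, "income": 30, "debt": 20},  # changes an immutable feature
    {"age": 40, "income": 36, "debt": 20},
    {"age": 40, "income": 38, "debt": 20},  # same intervention as the previous one
    {"age": 40, "income": 30, "debt": 10},
]
options = select_diverse(x, candidates, immutable=["age"], k=2)
```

Here the selection returns one income-based and one debt-based option, so the applicant sees two genuinely different paths rather than two income targets.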
Poyiadzi et al.'s FACE (2020) addressed a deeper problem: proximity in feature space does not equal feasibility in the real world. A counterfactual that says "increase your income by £40,000" is nearby in Euclidean space but may cross through an infeasible region (no realistic path from current income to that level). FACE generates counterfactuals along data density corridors: paths that pass through regions of high training-data density, ensuring the counterfactual journey is realistically traversable.
The broader concept of algorithmic recourse (Ustun, Spangher & Liu, 2019) goes beyond explanation to actionable recommendation: given that you were denied credit, here is a specific policy (increase savings by $X, reduce one credit card balance by $Y over Z months) that would change the outcome. Recourse methods integrate causal structure (which variables can be changed, which are downstream effects) with counterfactual search to avoid "perverse" recommendations, like "open two new credit cards" to raise available credit, which might simultaneously lower the credit score through hard inquiries.
Karimi et al.'s "Algorithmic Recourse: From Counterfactual Explanations to Interventions" (ACM FAccT 2021) formalised the distinction between observational counterfactuals (nearest point in data space) and interventional counterfactuals (nearest achievable outcome under a causal model), arguing the latter is required for genuine recourse.
The UK Financial Conduct Authority's 2022 guidance on AI in financial services explicitly cited counterfactual explanations as a mechanism for satisfying the consumer duty to provide "meaningful explanations" for automated decisions. The EU AI Act's Article 13 (transparency obligations for high-risk AI systems) requires that affected persons receive information sufficient to allow meaningful exercise of their rights, language that legal analysts at Allen & Overy and Bird & Bird have interpreted as requiring recourse-style explanations for consequential AI decisions.
Work with the AI tutor to design counterfactual explanations for real decision scenarios. You might work through a credit denial case, discuss how to set feasibility constraints, compare DiCE to FACE for a specific use case, or think through what "algorithmic recourse" means legally.