Making AI Explainable · Introduction

When the Machine Decides and No One Can Say Why

Understanding why AI systems reach conclusions is now as important as whether they are correct.

In 1936, when actuarial tables began determining life insurance premiums at scale, regulators in New York demanded that any rate change affecting a policyholder be accompanied by a written explanation. The formula was mathematical and impersonal, but the principle was clear: consequential decisions require accountable reasoning. The rules took years to codify and longer still to enforce, but they permanently shaped an industry. Statistical models making decisions about people, it turned out, needed to be legible to the people they affected.

That pattern is repeating now, faster and with far higher stakes. In 2016, a ProPublica investigation found that the COMPAS recidivism algorithm, used by judges in at least two dozen U.S. states to inform bail and sentencing decisions, wrongly flagged Black defendants who did not go on to reoffend as high risk at roughly twice the rate of white defendants, while its internal logic remained a trade secret. In 2018, Reuters reported that Amazon had scrapped a machine-learning hiring tool after engineers discovered it had systematically downgraded résumés containing the word "women's." Neither system's designers had intended discrimination. Neither could easily explain, even internally, how the discrimination had emerged. The models had learned it from historical data, encoded it in numerical weights, and produced outputs no single engineer could trace to a cause.

This course is about the discipline — increasingly called Explainable AI or XAI — that tries to bridge that gap. It covers why modern neural networks are opaque by construction, what techniques researchers and practitioners use to open them up, and what trade-offs arise when transparency conflicts with accuracy or speed. It will not make every AI system interpretable; some complexity is genuinely irreducible. What it will do is give you the vocabulary, the conceptual frameworks, and the practical methods to ask the right questions — and to demand honest answers — when an algorithm makes a decision that matters.

Making AI Explainable · Lesson 1

The Black Box Problem

Why modern AI systems produce correct answers that even their creators cannot fully explain.

What happens when a system is too complex for any human to understand, yet consequential enough to demand accountability?

A judge in Broward County, Florida opens a pre-sentencing report on a defendant named Vernon Prater. Appended to the report is a score (3 out of 10, low risk of reoffending) generated by a software system called COMPAS, sold by the company Northpointe. Vernon Prater had prior convictions; the score said low risk anyway. He was released. He was subsequently arrested for breaking and entering and sentenced to eight years. Across town, a woman named Brisha Borden had been scored 8 out of 10, high risk, for stealing a bicycle. Her only prior record was for misdemeanors committed as a juvenile. The algorithm's internal weights, the training data that shaped them, and the precise logic connecting inputs to scores: all of it was proprietary. The judge could see the number. No one in the courtroom could interrogate the reasoning behind it.

ProPublica published its analysis of 7,000 such cases in May 2016. The headline finding: the algorithm was not particularly accurate, and its errors were racially asymmetric. But the deeper finding — the one that rippled through computer science, law, and philosophy departments — was structural. Nobody could explain why COMPAS produced any particular score, not even Northpointe. The model had been trained, it had learned statistical patterns from historical criminal-justice data, and it had encoded those patterns in a form no human could read. The score appeared authoritative. Its foundations were invisible.

What "Black Box" Actually Means

The term "black box" in AI refers to any system whose internal mechanism — how inputs are transformed into outputs — is either inaccessible, too complex to interpret, or both. It is not a metaphor for secrecy alone. A decision tree with ten nodes can be fully inspected; a neural network with billions of parameters cannot be meaningfully read even when every weight is publicly available.

The distinction that matters is between interpretability and accuracy. For most of the twentieth century, the models that could be interpreted well (linear regression, logistic regression, shallow decision trees) were also the ones used in high-stakes settings, partly because regulators and courts could audit them. The arrival of deep learning after 2012 forced the two apart: neural networks started outperforming interpretable models substantially, but their internal logic became correspondingly opaque. Performance improved; explainability collapsed.

Three Dimensions of Opacity

Researchers distinguish three sources of opacity in modern AI systems, and conflating them leads to confused solutions.

Algorithmic opacity: The model's architecture is deliberately hidden, whether as a trade secret, a proprietary API, or a closed system. COMPAS is the canonical example. The code and weights were never publicly released.

Intrinsic complexity: Even with full access, the model is structurally impossible to interpret by reading it. GPT-4's weights are not public and its parameter count has not been officially disclosed (unofficial estimates run to roughly 1.8 trillion), but even if every weight were published, reading them would not tell you why the model generated any particular sentence. The complexity is inherent to the architecture.

Emergent behavior: The model does something its designers did not anticipate and cannot explain post-hoc. In 2022, researchers at Google and Stanford documented that large language models spontaneously developed chain-of-thought reasoning abilities that had not been trained for and were not predictable from smaller-scale behavior.

Why Opacity Matters Now

The stakes argument is not hypothetical. Between 2015 and 2020, AI-based systems were deployed in credit scoring (Fair Isaac Corporation's FICO updates incorporating ML), medical diagnosis (IBM Watson for Oncology, piloted at MD Anderson Cancer Center in a project that was shelved by 2017, and later reported to have produced unsafe treatment recommendations), child welfare risk scoring (Allegheny County, Pennsylvania's Allegheny Family Screening Tool, live since 2016), and predictive policing (PredPol, used by dozens of U.S. police departments). In each case, opaque models were making or heavily influencing decisions about individual human lives. In each case, the people affected had limited or no ability to understand or contest the reasoning.

The legal and regulatory response has been incremental but real. The EU's General Data Protection Regulation, effective May 2018, introduced a limited "right to explanation" for automated decisions. The EU AI Act, passed in 2024, classifies certain AI applications as high-risk and mandates transparency requirements. The U.S. Equal Credit Opportunity Act has long required lenders to give specific reasons for credit denials — creating friction between that law and opaque ML models being used for underwriting.

Critical Point

Opacity is not the same as bias, and explainability is not the same as fairness. A fully interpretable model can embed discriminatory logic; an opaque model can produce equitable outcomes. Explainability is a precondition for diagnosing and correcting problems — including bias — not a guarantee of avoiding them.

The Accuracy–Interpretability Trade-Off

For most practical prediction tasks through roughly 2010, the best-performing models were also reasonably interpretable. Logistic regression models used in clinical settings could be printed on a laminated card. Decision trees for loan approvals could be drawn on a whiteboard. Credit scorecards were literally cards.

The 2012 ImageNet competition, won by the convolutional neural network AlexNet, demonstrated that deep architectures could achieve dramatically lower error rates than classical methods, at the cost of interpretability. AlexNet's eight learned layers and 60 million parameters produced features no human could name. By 2015, ResNet achieved superhuman performance on ImageNet classification with 152 layers. The trade-off was not merely practical; it was structural. The very thing that made these architectures powerful, their ability to learn arbitrary high-dimensional representations from data, made them impossible to interpret by inspection.

This trade-off is real but not absolute. Subsequent research has found domains where interpretable models achieve near-parity with deep learning (structured tabular data being the clearest case), and techniques have emerged — LIME in 2016, SHAP in 2017 — that approximate explanations for opaque models without requiring access to their internals. The field of Explainable AI exists precisely in this gap: trying to recover accountability from systems too powerful to sacrifice.

Key Terms — Lesson 1

Black box: Any AI system whose input-to-output transformation is not humanly interpretable, whether due to secrecy, scale, or emergent complexity.

Interpretability: The degree to which a human can understand and predict the behavior of a model.

Explainability: The capacity to provide a post-hoc account of a model's specific decision.

XAI: Explainable Artificial Intelligence, the research and practice discipline addressing both.

Lesson 1 Quiz — The Black Box Problem

Four questions · Select the best answer for each

1. The COMPAS recidivism algorithm's scores were contested in part because:

Correct. COMPAS was a commercial product; its weights and training data were trade secrets. Neither defendants nor their attorneys could interrogate the reasoning behind any individual score — only the number itself was visible.

Not quite. The core problem was not mathematical incorrectness but the inaccessibility of the model's internal reasoning, which prevented any meaningful contest or audit of individual decisions.

2. Which of the following best describes "intrinsic complexity" as a source of AI opacity?

Correct. Intrinsic complexity means the model's scale itself makes interpretation impossible — a neural network with billions of parameters cannot be understood by reading its weights, regardless of whether they are public or secret.

That describes algorithmic opacity (deliberate secrecy) rather than intrinsic complexity. Intrinsic complexity is a structural property of the architecture — even full disclosure of parameters does not yield understanding.

3. The 2012 AlexNet result is relevant to explainability because it demonstrated:

Correct. AlexNet's 60 million parameters produced internal features no human could name or describe, establishing the pattern that the most powerful architectures sacrificed interpretability — a trade-off that has defined the XAI field ever since.

AlexNet's significance for explainability was structural: it showed that high-performance architectures were inherently opaque, not that they were biased or that explainability methods existed to address them (LIME and SHAP came years later).

4. Which statement correctly distinguishes explainability from fairness?

Correct. A fully interpretable model can embed discriminatory logic deliberately or through biased training data. Explainability makes such problems detectable and correctable — it is a diagnostic tool, not a guarantee of fairness.

Explainability and fairness are related but distinct. An interpretable model can be deliberately or inadvertently discriminatory. Output auditing alone misses many forms of bias embedded in the model's internal representations.

Lab 1 — Interrogating the Black Box

Conversational lab · Discuss opacity, accountability, and what "explanation" really requires

Your Task

You are investigating whether the COMPAS scoring system should continue to be used in pre-sentencing reports. Your lab assistant has background on the system, the ProPublica findings, and the broader XAI literature. Use this session to work through what a meaningful "explanation" of an algorithmic score would actually require — and what obstacles stand in the way.

Try asking: "What would Northpointe need to disclose for COMPAS scores to be legally contestable?" — or: "Can a model be fair even if it can't be explained?" — or: "What's the difference between explaining a model globally versus explaining one specific decision?"

XAI Lab Assistant

Lesson 1 · Black Box Problem

Welcome. We're examining the COMPAS case and the broader problem of AI opacity. The core tension is this: courts in the U.S. have long held that defendants have a right to confront the evidence against them — yet a proprietary algorithmic score functions as evidence with no inspectable foundation. Where would you like to start? What aspect of the black box problem is most pressing to you?

Making AI Explainable · Lesson 2

How Neural Networks Learn — and Why That Creates Opacity

The mechanism of deep learning is also the mechanism of illegibility.

If a network learns by adjusting millions of weights through gradient descent, what could "explanation" even mean for any single decision?

In October 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted their convolutional neural network — later called AlexNet — to the ImageNet Large Scale Visual Recognition Challenge. It achieved a top-5 error rate of 15.3 percent, compared to 26.2 percent for the second-place entry. The gap was so large that most computer vision researchers initially assumed an error in the results. When confirmed, it triggered what Yann LeCun later called "the ImageNet moment" — the point at which the field accepted that deep learning had fundamentally changed what was possible. What received less immediate attention was what had been sacrificed: the winning system's internal representations were features that no human could name, visualize, or reason about. The network had learned to distinguish cats from dogs through a cascade of mathematical operations that corresponded to nothing in human visual vocabulary. The performance was real. The reasoning was alien.

Gradient Descent and the Death of Legibility

Neural networks learn by a process called gradient descent. The network makes a prediction, computes an error (the loss), and then adjusts all of its weights slightly in the direction that would reduce that error. This process is repeated millions or billions of times across a training dataset. After training, the weights encode statistical patterns from the data in a distributed, non-symbolic form. No single weight corresponds to a concept. No single layer corresponds to a rule. The knowledge is spread across the entire parameter space in a form that cannot be decoded by inspection.
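The loop described above can be sketched in a few lines. This is a minimal illustration on a two-input logistic model rather than a deep network; the dataset, learning rate, and step count are assumptions chosen only to make the example run.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, data):
    """Average cross-entropy of the model on the dataset."""
    total = 0.0
    for (x1, x2), y in data:
        p = sigmoid(w[0] * x1 + w[1] * x2 + w[2])
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

def train(data, w, steps=2000, lr=0.5):
    """Full-batch gradient descent on weights [w_x1, w_x2, bias]."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), y in data:
            p = sigmoid(w[0] * x1 + w[1] * x2 + w[2])  # make a prediction
            err = p - y                                # loss gradient at the output
            grad[0] += err * x1
            grad[1] += err * x2
            grad[2] += err
        # Nudge every weight slightly in the direction that reduces the loss.
        w = [wi - lr * gi / len(data) for wi, gi in zip(w, grad)]
    return w

# Hypothetical toy data: the label is 1 when x1 + x2 is large.
DATA = [((0.1, 0.2), 0), ((0.3, 0.3), 0), ((0.2, 0.6), 0), ((0.6, 0.1), 0),
        ((0.7, 0.6), 1), ((0.9, 0.4), 1), ((0.5, 0.9), 1), ((0.8, 0.8), 1)]

W_INIT = [0.0, 0.0, 0.0]
W = train(DATA, W_INIT)
print("loss before:", log_loss(W_INIT, DATA))
print("loss after: ", log_loss(W, DATA))
print("learned weights:", W)
```

Even here, what training leaves behind is three numbers, not a written rule. With three weights a reader can still work out what they mean; the same procedure applied to millions of interacting weights produces nothing a reader can decode.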

This is the fundamental source of opacity. Classical software is explicit: a programmer writes if age > 65 and income < 30000, then flag. Every rule is legible. A trained neural network is implicit: the equivalent logic — if it exists in any coherent sense — has been encoded in the values of millions of weights that interact with each other non-linearly. There is no single place to look. There is no rule to read.

What Neural Networks Actually Contain

Visualization research, particularly work by Chris Olah and colleagues at Google Brain and later Anthropic, has partially illuminated what neural networks learn. In a 2017 paper, Olah's team showed that neurons in convolutional networks trained on ImageNet learn to respond to specific visual patterns: curves, textures, object parts, and eventually full objects. But the relationship between these learned features and any individual prediction is not a simple chain. A prediction emerges from the interaction of thousands of features, weighted and combined through multiple layers, in a way that resists narrative description.

A separate line of research, on adversarial examples, demonstrated how alien neural network reasoning actually is. In 2013, Christian Szegedy and colleagues at Google showed that imperceptible perturbations to images (noise invisible to human eyes) could cause a well-trained network to misclassify a school bus as an ostrich. The regions of input space that the network assigned to "school bus" and "ostrich" lay, apparently, a hair's breadth apart, in a way that corresponded to nothing in human perception. This was not a bug. It was a structural consequence of how gradient descent works.
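A toy version of the effect can be reproduced without a neural network. The sketch below uses a fixed linear classifier over 10,000 hypothetical "pixels" and a sign-of-the-gradient perturbation in the style later popularized by Goodfellow and colleagues; it is not the optimization procedure Szegedy's team used, and the weights and input are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def predict(weights, x):
    """Probability of class 1 under a fixed linear-logistic classifier."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def perturb(weights, x, eps):
    """Move every input dimension by exactly eps toward class 0.

    For a linear model the gradient of the score with respect to the input
    is the weight vector itself, so the most damaging small step is
    -eps * sign(weight) in each dimension.
    """
    return [xi - eps * sign(w) for w, xi in zip(weights, x)]

DIM = 10_000                                   # hypothetical number of pixels
WEIGHTS = [0.05 if i % 2 == 0 else -0.05 for i in range(DIM)]
X = [0.5 + 0.01 * sign(w) for w in WEIGHTS]    # an input scored as class 1
EPS = 0.02                                     # 2% of the 0..1 pixel range
X_ADV = perturb(WEIGHTS, X, EPS)

print("original  P(class 1):", predict(WEIGHTS, X))
print("perturbed P(class 1):", predict(WEIGHTS, X_ADV))
print("largest change to any pixel:", max(abs(a - b) for a, b in zip(X, X_ADV)))
```

No single pixel moves by more than 0.02, yet the tiny shifts all push the score the same way and add up across thousands of dimensions, flipping the prediction from near-certain class 1 to near-certain class 0.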

Why This Matters for Accountability

If a model's reasoning cannot be traced — even in principle — to human-interpretable concepts, then "explanation" in any robust legal or ethical sense may be impossible for that architecture. The XAI field's task is partly to find approximate explanations and partly to determine when approximations are sufficient and when they are not.

The Distributed Representation Problem

The core technical obstacle to interpretability is that neural networks use distributed representations: any concept is encoded across many neurons simultaneously, and any neuron participates in encoding many concepts. This is opposite to the symbolic AI systems of the 1980s and 1990s, where each concept had a discrete symbol and each rule was explicit. The philosopher Hubert Dreyfus spent decades arguing that human cognition was not symbol manipulation — that it was contextual, embodied, and irreducibly holistic. Neural networks, it turned out, agreed with him. They learned representations that were powerful precisely because they were not symbolic. And they became hard to explain for exactly the same reason.

Research by Yoshua Bengio and colleagues on disentangled representations — attempting to train networks where individual neurons correspond to independent semantic factors — has partially addressed this, but full disentanglement remains an open research problem. Modern large language models exhibit partial disentanglement (certain attention heads demonstrably track syntactic structure), but the majority of their capabilities remain distributed and opaque.

Key Terms — Lesson 2

Gradient descent: The iterative optimization process by which neural networks adjust weights to reduce prediction error.

Distributed representation: Encoding where knowledge is spread across many parameters rather than localized in discrete symbols.

Adversarial example: A carefully perturbed input that causes a model to make a confidently wrong prediction, exposing the non-human character of learned representations.

Feature visualization: Techniques for generating inputs that maximally activate specific neurons, used to probe what concepts a network has learned.

Lesson 2 Quiz — How Neural Networks Learn

Four questions · Select the best answer for each

5. What does "distributed representation" mean in the context of neural network opacity?

Correct. Distributed representations mean knowledge is spread across the entire parameter space non-locally — the opposite of symbolic AI where each concept has a discrete, inspectable symbol. This is why reading individual weights tells you nothing.

Distributed representation is an architectural property of how information is encoded, not a statement about infrastructure or data geography. It refers to the fact that no single neuron or parameter encodes any single human-interpretable concept.

6. The adversarial example research by Szegedy et al. (2013) is significant for XAI because it showed:

Correct. The fact that noise invisible to humans causes confident misclassification reveals that the network's "understanding" of visual categories is structurally different from human understanding — which matters enormously when deciding how much to trust or explain a model's outputs.

While adversarial examples do raise security concerns, their deeper significance for XAI is what they reveal about the nature of learned representations: the network's concept space does not map onto human concept space in the ways we intuitively assume.

7. AlexNet's 2012 ImageNet victory was a turning point for explainability primarily because:

Correct. AlexNet's gap over competing methods — roughly 11 percentage points — was so large that it settled the debate about deep learning's superiority. But it also showed that the most powerful path forward was architecturally opaque, making interpretability a field-defining problem rather than a minor concern.

AlexNet's significance was establishing the accuracy–interpretability trade-off at scale. LIME and SHAP came later (2016 and 2017), partly in response to the problems AlexNet's dominance created.

8. Why is gradient descent, as a learning mechanism, a source of opacity rather than transparency?

Correct. Gradient descent encodes knowledge implicitly and non-locally. The "rules" a network learns — if they can even be called that — exist nowhere in isolation. They emerge from the interaction of all parameters together, which is precisely what makes the system powerful and precisely what makes it illegible.

Gradient descent is deterministic given a fixed seed, not random, and applies to all data types. The source of opacity is how it distributes knowledge across parameters non-symbolically — there is no readable rule at any inspectable location.

Lab 2 — Inside the Network

Conversational lab · Explore distributed representations, adversarial examples, and what visualization reveals

Your Task

You are a researcher trying to understand what a trained image classification network has actually learned. Your lab assistant is familiar with feature visualization research (Olah et al.), adversarial examples (Szegedy et al.), and the theoretical problem of distributed representations. Probe what "looking inside" a network can and cannot tell us.

Try asking: "If I visualize what maximally activates a neuron, does that tell me what concept it represents?" — or: "Why can't I just look at the highest-weight connections to understand a prediction?" — or: "What did Chris Olah's circuits work actually find inside neural networks?"

XAI Lab Assistant

Lesson 2 · Neural Network Opacity

Let's explore what's actually inside a trained neural network. The short answer is: more than we expected, but less than we need. Feature visualization has revealed that networks learn curve detectors, texture analyzers, and object-part recognizers — but the relationship between these features and any individual prediction remains tangled. What would you like to dig into first?

Making AI Explainable · Lesson 3

Post-Hoc Explanation Methods — LIME and SHAP

The most widely deployed XAI techniques explain what a black box did without opening it.

If we cannot read a model's internals, can we still produce meaningful explanations by watching how it behaves?

At the KDD conference in San Francisco in August 2016, Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin presented a paper titled "Why Should I Trust You?": Explaining the Predictions of Any Classifier. They demonstrated their method (LIME, Local Interpretable Model-agnostic Explanations) by showing that a text classifier trained to distinguish Christianity from Atheism had learned to rely heavily on the words "posting" and "host" as proxies for the religion category, because those words appeared frequently in the email headers of training samples from a specific newsgroup. The model achieved 94% held-out accuracy. Its reasoning was entirely spurious. LIME found this. A standard accuracy metric would not have. The demonstration made the point with unusual clarity: a model that cannot be explained is a model that cannot be trusted, no matter what its held-out accuracy is.

LIME — Local Interpretable Model-Agnostic Explanations

LIME's core insight is that global interpretability — understanding a complex model everywhere — may be impossible, but local interpretability — understanding why a specific decision was made — is achievable by approximation. The method works by perturbing the input around the instance being explained (changing individual words in text, superpixels in images, or feature values in tabular data), observing how the model's output changes across these perturbations, and then fitting a simple interpretable model (usually linear regression) to the model's behavior in that local neighborhood.

The result is a set of feature importances: "this prediction was most strongly influenced by features X, Y, and Z." These importances are not the model's actual internal reasoning — they are an approximation of the model's behavior around one point. The limitation is significant: LIME explanations can be unstable (small changes to the explained instance can produce very different explanations) and can miss important global structure. But they are practical, model-agnostic (they work on any classifier without access to its internals), and often actionable.
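The perturb, query, weight, and fit steps can be sketched as follows. This is a simplified LIME-style surrogate for tabular inputs, not the `lime` library: the black-box function, the Gaussian perturbation scale, and the kernel width are all assumptions, and feature selection is omitted.

```python
import math
import random

def black_box(x):
    """Stand-in for an opaque classifier (hypothetical): returns P(class 1)."""
    z = 3.0 * x[0] - 2.0 * x[1] + x[0] * x[1]
    return 1.0 / (1.0 + math.exp(-z))

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def local_surrogate(f, x0, n_samples=500, scale=0.3, width=0.5, seed=0):
    """Fit a proximity-weighted linear model to f around x0.

    Returns [intercept, coef_0, coef_1, ...], with features centered on x0
    so the intercept approximates f(x0).
    """
    rng = random.Random(seed)
    d = len(x0)
    # Accumulate the weighted normal equations (Z^T W Z) beta = Z^T W y.
    ztz = [[0.0] * (d + 1) for _ in range(d + 1)]
    zty = [0.0] * (d + 1)
    for _ in range(n_samples):
        x = [xi + rng.gauss(0.0, scale) for xi in x0]    # perturb the instance
        y = f(x)                                         # query the black box
        dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, x0))
        w = math.exp(-dist2 / width ** 2)                # nearer samples count more
        z = [1.0] + [xi - ci for xi, ci in zip(x, x0)]
        for i in range(d + 1):
            zty[i] += w * z[i] * y
            for j in range(d + 1):
                ztz[i][j] += w * z[i] * z[j]
    return solve(ztz, zty)

X0 = [0.5, 0.5]
COEFS = local_surrogate(black_box, X0)
print("intercept:", COEFS[0], "vs black box at x0:", black_box(X0))
print("local feature weights:", COEFS[1:])
```

The fitted weights describe the black box only near `X0`. Rerunning with a different instance, seed, or kernel width can change them, which is the instability noted above.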

SHAP — SHapley Additive exPlanations

In 2017, Scott Lundberg and Su-In Lee at the University of Washington published SHAP, grounding feature attribution in cooperative game theory. The Shapley value — invented by Lloyd Shapley in 1953 for fair division of payoffs in cooperative games — measures each player's marginal contribution averaged across all possible orderings of players joining the game. Lundberg and Lee adapted this to AI: each feature is a "player," the model's prediction is the "payoff," and the Shapley value of each feature is its average marginal contribution to the prediction across all possible feature orderings.
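For a handful of features the averaging over orderings can be done exhaustively. The sketch below assumes one common convention, that a feature which has not yet "joined" is held at a baseline value; the model and inputs are hypothetical.

```python
from itertools import permutations

def model(x):
    """Hypothetical scoring model with an interaction between features 0 and 1."""
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[1] - 0.5 * x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)       # start with every feature "absent"
        prev = f(current)
        for i in order:
            current[i] = x[i]          # feature i joins the coalition
            now = f(current)
            phi[i] += now - prev       # its marginal contribution in this ordering
            prev = now
    return [p / len(orderings) for p in phi]

X = [1.0, 1.0, 1.0]
BASE = [0.0, 0.0, 0.0]
PHI = shapley_values(model, X, BASE)
print("attributions:", PHI)
print("sum of attributions:", sum(PHI))
print("prediction minus baseline prediction:", model(X) - model(BASE))
```

The interaction term is split evenly between features 0 and 1, and the attributions add up to the difference between the prediction and the baseline prediction.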

SHAP has several properties that LIME lacks: consistency (if a model changes so that a feature's marginal contribution grows or stays the same, its SHAP value never decreases), local accuracy (SHAP values sum to the model's prediction minus the expected prediction), and missingness (absent features receive zero attribution). These axiomatic guarantees make SHAP attributions more theoretically principled than LIME. The cost is computational: exact Shapley values require exponential time in the number of features, so SHAP in practice uses approximations or model-specific algorithms (KernelSHAP for generic models, TreeSHAP for tree ensembles such as gradient-boosted models). TreeSHAP, introduced by Lundberg and colleagues in 2018, runs in polynomial time for tree-based models and became widely adopted in financial services and healthcare AI audits.
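To see why approximation is needed, note that ten features already have 3,628,800 orderings. One simple workaround is to sample orderings at random. The sketch below shows that Monte Carlo estimator; it is neither KernelSHAP nor TreeSHAP, and the model, coefficients, and sample count are assumptions.

```python
import random

COEFS = [2.0, 1.0, -0.5, 0.3, 0.0, 1.5, -1.0, 0.7, 0.2, -0.4]  # hypothetical

def model(x):
    """Hypothetical 10-feature model with one interaction term."""
    return sum(c * xi for c, xi in zip(COEFS, x)) + 3.0 * x[0] * x[1]

def sampled_shapley(f, x, baseline, n_orderings=2000, seed=0):
    """Estimate Shapley values from randomly sampled feature orderings."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    order = list(range(n))
    for _ in range(n_orderings):
        rng.shuffle(order)             # one random ordering of the features
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]
            now = f(current)
            phi[i] += now - prev       # marginal contribution in this ordering
            prev = now
    return [p / n_orderings for p in phi]

X = [1.0] * 10
BASE = [0.0] * 10
EST = sampled_shapley(model, X, BASE)
print("estimated attributions:", [round(v, 2) for v in EST])
print("sum:", sum(EST), "target:", model(X) - model(BASE))
```

Each sampled ordering telescopes to the same total, so the estimates still sum to the prediction minus the baseline prediction. Only the split between interacting features (here features 0 and 1) carries sampling noise.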

The Faithfulness Problem

Both LIME and SHAP explain model behavior (they describe what features the model appears to use), but neither guarantees that the explanation reflects the model's actual computational process. A 2020 paper by Dylan Slack and colleagues showed that SHAP and LIME explanations could be systematically manipulated: a model could appear to use non-discriminatory features in explanations while actually relying on protected attributes in its predictions. The attack wraps the biased model in a scaffold that detects the unrealistic, perturbed inputs these methods generate and answers those with an innocuous model, so the system behaves differently when it is being explained. This remains an active and unresolved research area.
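The mechanics of such an attack can be illustrated with a deliberately crude toy. Everything here is hypothetical: the "detector" is a hand-written consistency check rather than the trained classifier used in the published attack, and the probe is a simple paired perturbation, not LIME or SHAP themselves.

```python
import random

def biased_model(x):
    """Decides purely on the protected attribute x[0]."""
    return float(x[0])

def innocuous_model(x):
    """Decides on an unrelated feature x[1]."""
    return 1.0 if x[1] > 0.5 else 0.0

def looks_real(x, tol=1e-9):
    """Toy detector: in this made-up dataset, real rows satisfy x[2] == x[0] + x[1]."""
    return abs(x[2] - (x[0] + x[1])) <= tol

def scaffold(x):
    """Use the biased model on realistic inputs, the innocuous one otherwise."""
    return biased_model(x) if looks_real(x) else innocuous_model(x)

def perturbation_attribution(f, x, i, n=200, seed=0):
    """Average effect of switching binary feature i from 0 to 1 on perturbed copies of x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # Jitter the other features independently, as perturbation methods do.
        z = [v if j == i else v + rng.gauss(0.0, 0.3) for j, v in enumerate(x)]
        hi, lo = list(z), list(z)
        hi[i], lo[i] = 1.0, 0.0
        total += f(hi) - f(lo)
    return total / n

REAL_ROWS = [[a, b, a + b] for a in (0.0, 1.0) for b in (0.2, 0.8)]
ROW = REAL_ROWS[0]

print("scaffold matches biased model on real rows:",
      all(scaffold(r) == biased_model(r) for r in REAL_ROWS))
print("protected-attribute effect, biased model:",
      perturbation_attribution(biased_model, ROW, 0))
print("protected-attribute effect, scaffold:    ",
      perturbation_attribution(scaffold, ROW, 0))
```

On every realistic row the scaffold makes exactly the biased decision, yet the perturbation probe assigns the protected attribute no influence, because every probed input is off-distribution and is answered by the innocuous model.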

Global vs. Local Explanations

A persistent confusion in XAI practice is conflating global and local explanations. A global explanation describes a model's overall behavior — which features it generally relies on, what its decision boundaries look like across its input space. A local explanation describes a specific prediction — why this applicant was denied credit, why this image was classified as a tumor.

LIME is explicitly local. SHAP can be aggregated: averaging absolute SHAP values across many examples produces a global feature importance ranking. But this aggregate conceals heterogeneity: a feature might be critical for some subpopulations and irrelevant for others, a fact that average SHAP values can mask. Partial dependence plots, accumulated local effects (ALE), and SHAP interaction values are among the tools used to probe global structure, each with its own assumptions and limitations.
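The masking effect is easy to reproduce with a small sketch. The model, segments, and rows below are invented, and the attribution is a simple occlusion difference against a baseline rather than a SHAP value; the averaging step is the part being illustrated.

```python
def model(row):
    """Hypothetical model: income matters only for applicants in segment 'B'."""
    segment, income = row
    return 2.0 * income if segment == "B" else 0.0

def income_attribution(row, baseline_income=0.0):
    """Occlusion-style attribution: prediction change when income is set to a baseline."""
    segment, _ = row
    return model(row) - model((segment, baseline_income))

def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)

ROWS = [("A", 1.0), ("A", 2.0), ("A", 3.0),
        ("B", 1.0), ("B", 2.0), ("B", 3.0)]

OVERALL = mean_abs([income_attribution(r) for r in ROWS])
BY_SEGMENT = {
    s: mean_abs([income_attribution(r) for r in ROWS if r[0] == s])
    for s in ("A", "B")
}

print("global importance of income:", OVERALL)
print("importance within each segment:", BY_SEGMENT)
```

The single global number suggests income is moderately important for everyone, when in fact it is irrelevant for one segment and twice as important as the average implies for the other.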

The EU GDPR's "right to explanation" refers specifically to individual automated decisions — a local explanation requirement. The EU AI Act's requirements for high-risk systems lean more global — documentation of a model's overall logic and limitations. These regulatory distinctions map roughly onto the local/global technical distinction, though the legal literature is still working out exactly what level of explanation satisfies each requirement.

Key Terms — Lesson 3

LIME: Local Interpretable Model-Agnostic Explanations. Explains individual predictions by fitting a simple model to the black box's local behavior through perturbation.

SHAP: SHapley Additive exPlanations. Assigns feature attributions using Shapley values from cooperative game theory, with axiomatic consistency and local accuracy guarantees.

Faithfulness: The degree to which an explanation accurately reflects the model's actual internal computation.

Local vs. global explanation: The distinction between explaining one prediction and explaining a model's overall behavior.

Lesson 3 Quiz — LIME and SHAP

Four questions · Select the best answer for each

9. The KDD 2016 LIME demonstration with the Christianity/Atheism classifier showed that:

Correct. The classifier reached 94% held-out accuracy but was using email header words as proxies for the religion category, a spurious correlation that would have been invisible to anyone looking only at test accuracy. LIME's perturbation-based approach surfaced this by showing which features actually drove individual predictions.

The demonstration made precisely the opposite point: high accuracy does not imply correct or trustworthy reasoning. And LIME is explicitly model-agnostic — it requires no access to internal weights, only the ability to query the model's outputs.

10. SHAP's key advantage over LIME in terms of theoretical grounding is:

Correct. SHAP's foundation in Shapley values from game theory provides formal guarantees that LIME's local linear approximation does not. Consistency, local accuracy, and missingness are axiomatic properties that hold regardless of the model being explained.

SHAP is computationally expensive — exact Shapley values require exponential time, and SHAP uses approximations like TreeSHAP or KernelSHAP. Its advantage is theoretical, not speed-based, and it does not require internal model access.

11. The "faithfulness problem" in XAI, as demonstrated by Slack et al. (2020), refers to:

Correct. Slack et al. showed that models can be built to behave differently when being explained — producing explanation-friendly outputs while relying on protected attributes in actual predictions. An explanation describes model behavior; it does not guarantee it reflects internal computation.

Faithfulness is a technical concept referring to the accuracy of an explanation as a representation of a model's actual computational process — not user adoption, communication, or uncertainty quantification.

12. The EU GDPR's "right to explanation" for automated decisions is best characterized as a requirement for:

Correct. GDPR Article 22 concerns automated individual decisions — meaning it requires an explanation of why you specifically were denied a loan or flagged in a system, not a general description of how the model works. This maps to local explanation in the XAI technical vocabulary.

GDPR's right to explanation applies to individual automated decisions — a local requirement. Global explanations (how the whole model behaves) are more relevant to the EU AI Act's high-risk system documentation requirements.

Lab 3 — Applying LIME and SHAP

Conversational lab · Work through how post-hoc explanation methods function and where they break down

Your Task

You are auditing an ML-based credit scoring model for a bank. The model is a gradient-boosted tree ensemble and outputs probability-of-default scores. You need to explain individual loan denials and characterize the model's overall behavior. Your lab assistant is fluent in LIME, SHAP, TreeSHAP, and the faithfulness literature. Work through how you would actually produce and validate these explanations.

Try asking: "Which method should I use for individual loan denial explanations — LIME or SHAP?" — or: "How would I detect if SHAP explanations were being gamed by the model?" — or: "What does a SHAP summary plot tell me that a global feature importance ranking doesn't?"

XAI Lab Assistant

Lesson 3 · LIME and SHAP

Good setup. A gradient-boosted tree ensemble is actually one of the more tractable cases for XAI — TreeSHAP computes exact Shapley values in polynomial time for tree models, which is a significant advantage. The harder question is what those values mean and how to communicate them to a loan applicant who's just been denied. Where would you like to start — the technical side or the communication side?

Making AI Explainable · Lesson 4

Interpretable-by-Design — When Transparency is Built In

Some AI systems are built to be explainable from the first layer, not explained after the fact.

If post-hoc explanations can be gamed or approximate, why not build models that are interpretable by construction — and what do we sacrifice?

In the late 1990s, Jerome Friedman at Stanford and Werner Stuetzle were developing methods for visualizing high-dimensional statistical models. Around the same time, clinical researchers were using something much simpler to decide whether a child with a sore throat needed antibiotics: the Centor criteria, a four-item scoring system developed by Robert Centor in 1981. Check for fever above 38°C (1 point), tonsillar exudate (1 point), swollen anterior cervical lymph nodes (1 point), and absence of cough (1 point). A score of 3 or 4 indicates empirical antibiotic treatment without waiting for a throat culture. The model was a laminated card. Its logic was fully legible to any nurse. Its performance across decades of clinical use was adequate — not optimal, but sufficient, safe, and auditable. A neural network trained on electronic health records might score fractionally better on a held-out dataset. But when that network flagged a child for antibiotics, no one could explain to the parents, the pharmacist, or the malpractice attorney why.
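The scorecard described above is small enough to write out in full. The following Python sketch mirrors the four items and the 3-or-4 threshold exactly as this lesson states them; the function and parameter names are illustrative assumptions and do not come from any clinical software.

```python
def centor_score(fever_above_38c: bool,
                 tonsillar_exudate: bool,
                 swollen_anterior_cervical_nodes: bool,
                 cough_absent: bool) -> int:
    """Return the 0-4 score: one point per criterion that is present."""
    findings = (fever_above_38c, tonsillar_exudate,
                swollen_anterior_cervical_nodes, cough_absent)
    return sum(1 for present in findings if present)


def suggests_empirical_antibiotics(score: int) -> bool:
    """Threshold as described in the lesson text: a score of 3 or 4."""
    return score >= 3


if __name__ == "__main__":
    # Hypothetical patient: fever, exudate, no swollen nodes, no cough.
    score = centor_score(True, True, False, True)
    print(score, suggests_empirical_antibiotics(score))
```

Every step of the decision is visible in about a dozen lines, which is the property the rest of this lesson calls interpretable-by-design.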

Architectures Designed for Interpretability

Interpretable-by-design approaches build transparency into the model architecture rather than applying explanation methods afterward. The most established include:

Linear models: Logistic regression, ridge regression, LASSO. Each feature has a single coefficient; the decision rule is a weighted sum. Perfectly interpretable; may underfit complex patterns.

Decision trees: Hierarchical if-then rules that can be printed and read. Interpretability degrades rapidly as depth increases. Prone to overfitting without regularization.

Generalized Additive Models (GAMs): Extend linear models by allowing each feature to have a non-linear effect while keeping effects additive and therefore fully inspectable. Each feature's contribution can be plotted independently.

Neural Additive Models (NAMs): Developed by Rishabh Agarwal and colleagues at Google in 2021. NAMs use a small neural network for each feature, preserving the additivity constraint while allowing complex per-feature shapes. They outperform classical GAMs while remaining interpretable.

Rule lists and scorecards: Cynthia Rudin at Duke University has argued that for structured tabular data in high-stakes settings, interpretable models can often match black-box accuracy. Her SLIM (Supersparse Linear Integer Models) and FasterRisk methods produce clinical scorecards with integer coefficients that can be computed in one's head.
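The additive structure shared by GAMs and NAMs can be sketched in a few lines. In the Python example below, the feature names, shape functions, and intercept are all invented for illustration; a fitted GAM would learn splines and a NAM would learn a small network per feature, but the way the pieces combine is the same.

```python
import math

# Hypothetical per-feature shape functions (assumed, not learned from data).
def age_effect(age):
    return 0.001 * (age - 45) ** 2                 # U-shaped effect of age

def income_effect(income):
    return -0.5 * math.log1p(income / 10_000)      # diminishing effect of income

def prior_defaults_effect(count):
    return 0.8 * min(count, 3)                     # effect saturates at three

SHAPES = {
    "age": age_effect,
    "income": income_effect,
    "prior_defaults": prior_defaults_effect,
}
INTERCEPT = -1.0


def contributions(row):
    """Each feature's contribution depends on that feature's value only."""
    return {name: shape(row[name]) for name, shape in SHAPES.items()}


def predict_logit(row):
    """Additive model: intercept plus the sum of per-feature contributions."""
    return INTERCEPT + sum(contributions(row).values())


def predict_probability(row):
    return 1.0 / (1.0 + math.exp(-predict_logit(row)))


if __name__ == "__main__":
    applicant = {"age": 30, "income": 20_000, "prior_defaults": 1}
    print(contributions(applicant))
    print(predict_probability(applicant))
```

Because no term involves two features at once, each shape function can be plotted on its own axis and read independently, which is what makes the model inspectable.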

Rudin's Argument — Stop Explaining Black Boxes

In a widely cited 2019 Nature Machine Intelligence paper, Cynthia Rudin argued that the XAI community had misidentified its goal. Rather than developing better post-hoc explanation methods for black-box models, researchers and practitioners should use inherently interpretable models for high-stakes decisions. Her argument has three parts. First, post-hoc explanations are not the model's actual reasoning; they are at best approximations and at worst misleading. Second, for structured tabular data (the dominant form in criminal justice, credit, and medicine), interpretable models achieve accuracy within a few percent of the best black-box methods. Third, the remaining performance gap often disappears when evaluation is done carefully on the right metrics for the right population.

Rudin's view is contested. Critics note that in domains like medical imaging, natural language processing, and speech recognition, the performance gap between interpretable and black-box models is real and clinically meaningful. A neural network detecting diabetic retinopathy from fundus photographs achieves accuracy that a logistic regression on manually extracted features cannot match. In those domains, the choice is not between an interpretable model and an opaque one of equal performance; it is between an interpretable model that misses more cases and an opaque model that catches them.

The Right Question

Rudin's framework suggests the right first question in any ML deployment is not "how do we explain this black box?" but "do we actually need a black box here?" If interpretable models are adequate, the explanation problem largely disappears. If they are not — as is often true in vision and language tasks — then post-hoc methods like SHAP become necessary tools with known limitations, not substitutes for interpretability.

Concept-Based Explanations and TCAV

A different approach to interpretable-by-design thinking involves not the model architecture but the explanation vocabulary. Been Kim and colleagues at Google Brain developed TCAV (Testing with Concept Activation Vectors) in 2018. Rather than explaining a prediction in terms of raw input features (pixels, word tokens, tabular values), TCAV identifies human-defined concepts — "striped," "smiling," "elderly" — and tests whether a network uses those concepts in its predictions.

The method works by training a linear classifier to distinguish activations produced by concept-containing images from activations produced by random images, then measuring how much perturbing in the concept direction changes the model's output. TCAV explanations are in a vocabulary humans can understand — "this tumor prediction was influenced by the concept of irregular borders" — rather than "pixel 147 in the upper-left quadrant contributed 0.003 to the output." For medical AI, where the explanation vocabulary needs to match clinical training, this is a significant advantage.
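The two steps in that description, finding a concept direction and then measuring sensitivity along it, can be sketched on toy data. The Python example below is a simplification under stated assumptions: it swaps in a difference of class means for the trained linear classifier, uses finite differences instead of backpropagated gradients, and treats a hand-written `head` function as the hypothetical remainder of the network. All activations are synthetic.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_activation_vector(concept_acts, random_acts):
    """Unit vector pointing from random activations toward concept activations.

    Simplification: a difference of class means stands in for the normal
    of a trained linear classifier's decision boundary.
    """
    mu_c = mean_vector(concept_acts)
    mu_r = mean_vector(random_acts)
    diff = [c - r for c, r in zip(mu_c, mu_r)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def directional_derivative(head, activation, direction, eps=1e-4):
    """Finite-difference estimate of the score change along `direction`."""
    nudged = [a + eps * d for a, d in zip(activation, direction)]
    return (head(nudged) - head(activation)) / eps


def tcav_score(head, class_acts, cav):
    """Fraction of examples whose class score rises along the concept direction."""
    positive = sum(1 for a in class_acts
                   if directional_derivative(head, a, cav) > 0)
    return positive / len(class_acts)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic 3-dimensional layer activations; the concept shifts dimension 0.
    concept_acts = [[2 + rng.gauss(0, 0.3), rng.gauss(0, 0.3), rng.gauss(0, 0.3)]
                    for _ in range(50)]
    random_acts = [[rng.gauss(0, 0.3) for _ in range(3)] for _ in range(50)]
    class_acts = [[rng.gauss(1, 0.5) for _ in range(3)] for _ in range(50)]

    def head(a):
        # Hypothetical mapping from layer activations to a class score.
        return 3.0 * a[0] - 1.0 * a[1]

    cav = concept_activation_vector(concept_acts, random_acts)
    print("TCAV score:", tcav_score(head, class_acts, cav))
```

A score near 1 means the class score increases along the concept direction for almost every example; a score near 0 means it decreases. The published method also repeats the procedure against multiple random sets to check that a score is not a chance result, a step omitted here.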

Key Terms — Lesson 4

Interpretable-by-design: Model architectures whose decision logic is human-readable without post-hoc explanation.

GAMs / NAMs: Generalized and Neural Additive Models. Additive architectures allowing non-linear per-feature effects while preserving interpretability.

Scorecards: Integer-coefficient linear models that can be computed mentally, used in clinical and criminal justice settings.

TCAV: Testing with Concept Activation Vectors. Explains neural network predictions in terms of human-defined concepts rather than raw features.

Lesson 4 Quiz — Interpretable-by-Design

Four questions · Select the best answer for each

13. Neural Additive Models (NAMs), developed by Agarwal et al. at Google in 2021, improve on classical GAMs by:

Correct. NAMs retain the additivity constraint — each feature's contribution can be inspected and plotted independently — but use a neural network per feature instead of a spline or polynomial, allowing far more complex per-feature functions without sacrificing interpretability.

NAMs explicitly preserve additivity — the interpretability-enabling constraint. They do not allow feature interactions (which would break interpretability) and do not use SHAP. The key innovation is neural-network-powered shape functions within an additive framework.

14. Cynthia Rudin's 2019 Nature Machine Intelligence argument is best summarized as:

Correct. Rudin's argument was domain-specific and empirical: for the structured, tabular data common in criminal justice, credit, and clinical scoring, interpretable models have been shown to achieve near-equal performance. The implication is that black-box models are often an unnecessary choice in these settings — and post-hoc explanations of them are approximations of what could be exact reasoning.

Rudin did not call for a general ban, nor did she compare LIME and SHAP. Her specific, empirical claim was that for structured tabular data in high-stakes domains, the performance gap between interpretable and black-box models is often small enough that using opaque models is an unnecessary choice — not an inevitable one.

15. TCAV (Testing with Concept Activation Vectors) addresses what limitation of pixel-level or token-level explanation methods?

Correct. A radiologist cannot interpret "pixel 147 contributed 0.003 to the output." A radiologist can interpret "the prediction was strongly influenced by the concept of irregular lesion borders." TCAV bridges the vocabulary gap by letting domain experts define concepts and then testing whether the model uses them.

TCAV's distinguishing feature is its explanation vocabulary — it uses human-defined semantic concepts rather than raw features, making explanations meaningful to domain experts. It does require labeled concept examples to function, and does not produce decision trees.

16. The Centor criteria for strep throat treatment is used in the lesson as an example of:

Correct. The Centor criteria is a four-item integer scorecard — a textbook example of interpretable-by-design. Its logic fits on a card, can be explained to patients, and can be audited by anyone. It illustrates Rudin's broader argument that in many high-stakes clinical settings, simple interpretable models have been adequate for decades.

The lesson uses the Centor criteria as a positive example of interpretable-by-design — a model whose logic is fully legible and communicable, which is exactly what deep learning alternatives typically sacrifice.

Lab 4 — Choosing Your Interpretability Strategy

Conversational lab · Work through model selection and interpretability trade-offs for real deployment scenarios

Your Task

You are advising a hospital system deploying AI for three different tasks: (1) predicting 30-day readmission from structured EHR data, (2) detecting pneumonia in chest X-rays, and (3) triaging emergency department patients by acuity. Your lab assistant knows the XAI landscape — GAMs, NAMs, Rudin's scorecard methods, TCAV, and the limits of post-hoc methods. Work through what interpretability strategy is appropriate for each task and why.

Try asking: "For the readmission prediction task, would Rudin's argument suggest using a scorecard over a gradient-boosted tree?" — or: "How would TCAV be applied to the chest X-ray classifier to explain findings to radiologists?" — or: "For ED triage, what are the legal and ethical stakes of using a black-box model versus an interpretable one?"

XAI Lab Assistant

Lesson 4 · Interpretable-by-Design

Three tasks, three very different interpretability situations. Structured EHR data for readmission prediction is exactly the domain where Rudin's argument has the most force — there is solid empirical evidence that interpretable models like GAMs or scorecards can match black-box performance here. Chest X-ray analysis is the opposite case: deep convolutional networks genuinely outperform feature-engineering approaches on imaging, which means you're in post-hoc explanation territory whether you like it or not. ED triage sits in between. Which task would you like to analyze first?

Module 1 Test — The Black Box Problem

15 questions · Score 80% or higher to pass · All four lessons

1. The COMPAS algorithm used in U.S. sentencing was controversial primarily because:

Correct. COMPAS's trade-secret status meant defendants could see their score but could not interrogate the logic — a fundamental due process concern identified by the ProPublica investigation and subsequent legal challenges.

COMPAS was a proprietary commercial tool; its controversy centered on the inaccessibility of its internal reasoning, not random generation or mandatory compliance.

2. "Emergent behavior" in AI refers to:

Correct. The 2022 Google/Stanford research on chain-of-thought reasoning in large language models is the canonical example: the capability appeared at scale without being directly trained for, and could not have been predicted from the behavior of smaller models.

Emergent behavior specifically refers to capabilities that arise unexpectedly at scale — not designed features, not smooth scaling, and not deployment errors.

3. Which of the following correctly describes the trade-off that AlexNet's 2012 ImageNet win established?

Correct. AlexNet demonstrated that the architectures achieving the largest accuracy gains were structurally opaque — 60 million parameters producing features no human could name. This established the accuracy-interpretability trade-off that defines the XAI field.

The trade-off was accuracy vs. interpretability, not speed vs. accuracy or data vs. interpretability. AlexNet's significance was structural: deep architectures were powerful because of their complexity, and complex because of their opacity.

4. Distributed representations in neural networks make interpretation difficult because:

Correct. Distributed representation is the architectural property that makes neural networks powerful and illegible simultaneously. Concepts are not localized; they emerge from the interaction of all parameters, which is why reading individual weights yields no interpretable information.

Distributed representation is an architectural property of how information is encoded — not infrastructure, not training noise, not cross-layer communication. The key fact is that knowledge is spread non-locally, making any particular location uninformative in isolation.

5. Adversarial examples — images perturbed imperceptibly to fool classifiers — demonstrate:

Correct. The fact that noise invisible to humans causes catastrophically wrong confident predictions reveals that the network's internal concept space does not map onto human concept space — which matters enormously for explainability and trust calibration.

Adversarial examples are primarily a window into the nature of learned representations — they show that the network's understanding of categories differs structurally from human understanding, which is the core XAI implication of Szegedy et al.'s 2013 findings.

6. LIME generates explanations by:

Correct. LIME's model-agnostic approach — perturbing inputs, querying outputs, fitting a local linear approximation — means it works on any classifier without requiring internal access. This is also its limitation: it approximates rather than reveals actual internal reasoning.

LIME is model-agnostic and requires no internal access. Shapley values describe SHAP, not LIME. LIME's mechanism is local perturbation and local approximation — an external behavioral probe, not an internal inspection.
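The perturb, query, and fit loop described above can be sketched with the standard library alone. This is a simplified illustration of the idea and not the `lime` package's implementation: it samples Gaussian perturbations, weights them with an exponential kernel, and solves a weighted least-squares problem. The function names, sampling scale, and kernel width are assumptions chosen for the example.

```python
import math
import random


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lime_explain(predict, x, n_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate to `predict` around `x`.

    Returns (intercept, coefficients); the coefficients are the local
    attributions. `predict` is treated as a black box and only called.
    """
    rng = random.Random(seed)
    d = len(x)
    # Accumulate the weighted normal equations (Z^T W Z) beta = Z^T W y.
    A = [[0.0] * (d + 1) for _ in range(d + 1)]
    b = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]   # perturb the input
        y = predict(z)                                  # query the black box
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        w = math.exp(-dist2 / kernel_width ** 2)        # nearer samples weigh more
        row = [1.0] + z                                 # leading 1 for the intercept
        for i in range(d + 1):
            b[i] += w * row[i] * y
            for j in range(d + 1):
                A[i][j] += w * row[i] * row[j]
    beta = solve_linear_system(A, b)
    return beta[0], beta[1:]


if __name__ == "__main__":
    def black_box(z):
        # Hypothetical non-linear model standing in for any classifier.
        return 1.0 / (1.0 + math.exp(-(3.0 * z[0] - z[1])))

    intercept, coefs = lime_explain(black_box, [0.2, 0.1])
    print("local attributions:", coefs)
```

The surrogate's coefficients describe the black box only near the chosen point, and they change with the sampling scale and kernel width, which is one concrete way to see that the method approximates behavior and does not reveal internal reasoning.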

7. SHAP's "local accuracy" property means:

Correct. Local accuracy is the axiomatic property that SHAP attributions sum to the model's actual prediction minus the baseline. This means the explanation fully accounts for the prediction — there is no unexplained residual. This is a formal guarantee LIME's local approximation does not provide.

Local accuracy in SHAP is a technical axiomatic property: the sum of attribution values equals the model's output minus its expected output. It is not about geographic locality or proximity to decision boundaries.
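Local accuracy can be checked directly on a model small enough to enumerate. The Python sketch below computes exact Shapley values by brute force over all feature subsets. One assumption to note: a "missing" feature is replaced by a single baseline value, so the model's expected output is represented here by its output at the baseline, whereas SHAP implementations typically average over background data. The function name and toy model are illustrative.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every subset of the other features.

    Cost grows exponentially with the number of features, so this is only
    practical for tiny examples.
    """
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest take the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for combo in combinations(others, k):
                s = set(combo)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


if __name__ == "__main__":
    def model(z):
        # Toy model with one interaction term and one additive term.
        return z[0] * z[1] + 2.0 * z[2]

    x = [1.0, 2.0, 3.0]
    baseline = [0.0, 0.0, 0.0]
    phi = shapley_values(model, x, baseline)
    print("attributions:", phi)
    print("sum of attributions:", sum(phi))
    print("prediction minus baseline:", model(x) - model(baseline))
```

In this toy model the interaction term is split evenly between the two features involved, and the attributions add up to the prediction minus the baseline output with no residual.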

8. The Dylan Slack et al. 2020 finding that SHAP and LIME explanations could be gamed is most concerning because:

Correct. A model trained to detect when it is being explained — and to behave differently in that context — can pass explanation-based audits while continuing to discriminate in real deployment. This is the faithfulness gap in its most adversarial form.

The finding was not about computational cost or LIME vs. SHAP comparisons. It was a faithfulness attack: models can be built that produce clean explanations while doing discriminatory computation — undermining the regulatory value of explanation-based auditing.

9. The EU GDPR right to explanation (Article 22) applies specifically to:

Correct. GDPR Article 22's right to explanation is triggered by automated decisions with legal or similarly significant effects on individuals — credit, employment, housing decisions. It is a local explanation requirement: explain why this decision was made for this person.

GDPR Article 22 is specifically about individual automated decisions with legal or significant effects — not all AI systems, not all personal data processing, and not based on model size or licensing status.

10. Generalized Additive Models (GAMs) are interpretable because:

Correct. The additivity constraint means each feature's contribution to the prediction is independent of all others — you can plot the effect of age on predicted outcome without worrying about how it interacts with income. This separability is the source of interpretability.

GAMs allow non-linear per-feature effects (unlike linear regression) but require those effects to be additive (no interactions). This separability — not linearity or parameter count — is what makes them interpretable.

11. Cynthia Rudin's argument that practitioners should use interpretable models is most defensible in which context?

Correct. Rudin's argument has the most force for structured tabular data — where the feature space is human-defined and interpretable models have demonstrated near-parity with black-box methods. EHR-based readmission prediction fits this profile exactly. Image and audio tasks are where the performance gap between interpretable and black-box models is real and clinically significant.

Rudin's argument is specifically strongest for structured tabular data. Image-based tasks (retinopathy, CT reports) and audio tasks (speech transcription) are the domains where deep learning outperforms interpretable models by meaningful margins that cannot be dismissed.

12. TCAV (Testing with Concept Activation Vectors) addresses what gap in standard feature attribution methods like SHAP?

Correct. A radiologist cannot act on "pixel 312 contributed 0.004 to the prediction." TCAV bridges this gap by letting experts define meaningful concepts — "irregular borders," "mass effect" — and then testing whether the model uses those concepts, producing explanations in clinical vocabulary.

TCAV's distinguishing contribution is its explanation vocabulary — human-defined concepts rather than raw features. It does require internal access (to activation vectors) and is not faster than SHAP by design. SHAP also works on images and text.

13. IBM Watson for Oncology was discontinued at MD Anderson Cancer Center in 2017 primarily because:

Correct. MD Anderson's internal review found Watson recommended treatments that contradicted established clinical guidelines, including in some cases potentially harmful combinations. Because the system's reasoning was opaque, there was no mechanism for identifying why — a core XAI failure mode in high-stakes deployment.

The Watson for Oncology discontinuation was a consequence of unsafe recommendations combined with opacity that prevented error diagnosis. The inability to inspect the reasoning was both a safety problem and an accountability problem.

14. The key difference between a "global" and a "local" explanation in XAI is:

Correct. The global/local distinction is fundamental in XAI. GDPR Article 22 requires local explanations (for individual automated decisions). The EU AI Act leans more global (documentation of overall model behavior). SHAP can serve both (individual SHAP values are local; aggregate SHAP importance is a global summary).

The global/local distinction is about the scope of explanation: one prediction versus the full model's behavior. SHAP and LIME are not exclusively tied to one scope, and the regulatory mapping is roughly (not exactly) the reverse of what the incorrect option states: GDPR Article 22 leans local, while the EU AI Act leans global.

15. Feature visualization research (Olah et al., Google Brain) revealed that neurons in convolutional networks trained on ImageNet:

Correct. Olah's work showed that individual neurons do develop preferences for interpretable-seeming features — curve detectors, texture analyzers, object-part recognizers. But the combination of thousands of such features across multiple layers to produce a single prediction resists any simple narrative explanation.

Feature visualization showed that neurons do specialize — they learn real visual patterns — but this partial structure does not yield full interpretability. The gap between "neurons learn curves" and "the model predicted school bus" remains too large to bridge narratively.