Module 4 · Lesson 1

The Black Box Problem

Why AI systems produce outputs nobody fully understands — and why that matters.

If you can't see inside a mind, how can you trust what it tells you?

In 2016, Marco Ribeiro, Sameer Singh, and Carlos Guestrin at the University of Washington asked a question that would haunt AI development for years: why does this classifier think that image is a wolf? The classifier looked accurate. But when they applied their explanation technique, LIME, to see which parts of each image drove the prediction, the answer was unsettling.

The classifier had learned to recognize wolves not by their fur, posture, or gaze, but by snow. Every wolf in the training set appeared against a snowy background; the huskies did not. The model could score well on similar images while reasoning about entirely the wrong thing.

The researchers had built this flaw in on purpose, to test whether explanations help people spot an untrustworthy model. But the classifier itself never said, "I'm using snow as a proxy for wolf." It just did, and accuracy numbers alone would not have revealed it. This was the black box problem made visible, and it would not be the last time.

What Is the Black Box Problem?

Modern neural networks are function approximators: given enough data and gradient descent, they learn numerical weights across millions or billions of parameters that map inputs to outputs with impressive accuracy. The trouble is that no human designed those weights. They emerged from optimization. They encode patterns, some useful and some spurious, in a high-dimensional space that has no natural-language description.

A 2017 paper by Marco Ribeiro, Sameer Singh, and Carlos Guestrin (the LIME authors) demonstrated this vividly with a sentiment classifier. The model correctly labeled most reviews as positive or negative, but its reasoning, when probed, depended on superficial tokens — the word "not" caused wildly unpredictable flips — rather than semantic understanding. The performance metric looked fine. The internal logic was fragile.

This gap between measurable performance and understood reasoning is the black box problem. It has three practical consequences that motivate the entire field of interpretability.

Opacity: The model's internal representations are not human-readable without specialized tools.

Spurious correlation: A statistical pattern in training data that predicts labels without capturing the true causal mechanism.

Shortcut learning: A model solving a task via unintended cues (snow for wolves, background artifacts for tumors) that fail under distribution shift.
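Shortcut learning is easy to reproduce on toy data. The sketch below is an illustration under stated assumptions, not a reconstruction of any study: the features `snow` and `ear_shape`, the 80% reliability of the real cue, and the one-feature "stump" learner are all hypothetical choices made to keep the example small.

```python
import random

def make_data(n, snow_matches_label, seed):
    """Build (features, label) pairs. Label 1 = wolf, 0 = husky.
    'ear_shape' is a real but noisy cue (right 80% of the time).
    'snow' agrees with the label with probability snow_matches_label."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        ear = label if rng.random() < 0.8 else 1 - label
        snow = label if rng.random() < snow_matches_label else 1 - label
        data.append(({"snow": snow, "ear_shape": ear}, label))
    return data

def accuracy(data, feature):
    """Accuracy of predicting the label directly from one binary feature."""
    return sum(x[feature] == y for x, y in data) / len(data)

def fit_stump(data):
    """Pick the single feature that best predicts the label on the training set."""
    return max(["snow", "ear_shape"], key=lambda f: accuracy(data, f))

# Training set: snow almost always accompanies wolves.
train = make_data(2000, snow_matches_label=0.98, seed=0)
# Shifted set: snow is unrelated to the label.
shifted = make_data(2000, snow_matches_label=0.5, seed=1)

chosen = fit_stump(train)
train_acc = accuracy(train, chosen)
shifted_acc = accuracy(shifted, chosen)
print(chosen, round(train_acc, 3), round(shifted_acc, 3))
```

The learner picks the background cue because it scores best on the training data, and that same choice drops to roughly chance once the correlation disappears. Nothing in the training accuracy signals the problem.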

Three Consequences That Made the Field Urgent

1. Safety under distribution shift. In 2018, Emma Pierson and collaborators at Stanford analyzed a dermatology AI trained by Google. The model reached dermatologist-level accuracy on held-out test images. But those images came from the same hospital photography protocol as training. When tested on smartphone photos from patients in Sub-Saharan Africa — different lighting, different skin tone distributions — accuracy fell sharply. The network had learned correlations tightly coupled to the photographic pipeline, not to the biology of lesions. Because no one had looked inside, the brittleness was invisible until deployment.

2. Accountability and contestability. In 2016, ProPublica published "Machine Bias," an investigation into COMPAS, a recidivism-prediction algorithm used in criminal sentencing across the United States. The algorithm's vendor, Northpointe, treated it as proprietary. Judges were told scores but not how they were computed. Defendants could not contest a number they could not see constructed. Whether or not COMPAS was "fair" by any given metric became almost beside the point: an uninterpretable score affecting years of a person's liberty is an accountability failure by definition.

3. Debugging failure modes before deployment. In 2020, researchers at MIT and Massachusetts General Hospital showed that a chest X-ray classifier trained on the MIMIC dataset had implicitly encoded patient age and sex into its feature space in ways that inflated apparent diagnostic accuracy — models that "knew" a patient was older predicted certain diseases at higher base rates. The performance was real but artificially boosted. Only mechanistic inspection revealed the leakage. Deploying such a model without that inspection would have led to systematically wrong calibration in clinical settings.

Core Tension

The same architectural properties that make neural networks powerful — distributed representation, non-linear composition, emergent features — are exactly what make them hard to interpret. Interpretability is not a free add-on. It requires deliberate methods, and sometimes deliberate trade-offs with raw performance.

Why Interpretability Is Not Just "Explaining a Decision"

A common misconception frames interpretability as post-hoc explanation: you get an output, you generate a sentence describing why. But Finale Doshi-Velez and Been Kim at Harvard and Google Brain argued in a 2017 position paper that this framing is dangerously narrow. What we actually need, they proposed, is a taxonomy: interpretability (the degree to which a human can predict the model's behavior), completeness (does the explanation cover all relevant factors?), and causality (does intervening on the stated reason actually change the output?).

An explanation can be compelling, coherent, and completely wrong about the actual computation. This has a name in the field: plausible but unfaithful explanations. Julius Adebayo and colleagues at Google demonstrated in 2018 ("Sanity Checks for Saliency Maps") that several widely used gradient-based explanation methods produced nearly identical visualizations whether the model was trained or its weights were randomized, meaning the explanations were capturing image structure, not model logic. The field had to confront that "looking inside" was harder than producing a heat map.
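The logic of a weight-randomization sanity check can be sketched in a few lines. This is a toy under stated assumptions, not the procedure from the paper: the "model" is a linear scorer, `trained` and `randomized` are made-up weight vectors, and `edge_map` is a hypothetical explanation method that only looks at the input.

```python
import random

def gradient_saliency(weights, x):
    # For a linear score sum(w_i * x_i), the gradient with respect to x_i is w_i.
    return [abs(w) for w in weights]

def edge_map(weights, x):
    # Hypothetical "explanation" that ignores the model entirely:
    # it reports local contrast in the input (index 0 wraps around to the last element).
    return [abs(x[i] - x[i - 1]) for i in range(len(x))]

def mean_abs_diff(a, b):
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

rng = random.Random(0)
x = [rng.random() for _ in range(16)]                   # one input
trained = [1.0 if i < 4 else 0.0 for i in range(16)]    # stand-in for learned weights
randomized = [rng.gauss(0, 1) for _ in range(16)]       # stand-in for scrambled weights

for method in (gradient_saliency, edge_map):
    drift = mean_abs_diff(method(trained, x), method(randomized, x))
    print(method.__name__, "changes under randomization:", drift > 0)
```

A method whose output does not move when the weights are replaced cannot be reporting anything about what the model learned, however reasonable its picture looks.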

Why This Module Exists

Interpretability is not an academic luxury. It is the precondition for auditing AI systems for bias, for catching failures before they harm people, for building the kind of trust that can only exist when behavior is understandable. Every tool in this module — from saliency maps to mechanistic circuit analysis — exists because someone hit a wall where performance metrics were not enough.

Quiz — The Black Box Problem

Five questions · Select the best answer for each

1. In the 2016 husky-versus-wolf experiment by Ribeiro, Singh, and Guestrin, what did the classifier primarily use to distinguish wolves from huskies?

Correct. Ribeiro and colleagues used LIME and found the classifier had learned snow as a proxy for "wolf" because the wolf training images had snowy backgrounds. This is a textbook case of shortcut learning.

Not quite. The LIME explanations revealed the classifier was relying on background snow, a spurious correlation in the training data, rather than any feature of the animal itself.

2. The ProPublica "Machine Bias" investigation (2016) focused on which algorithm used in US criminal sentencing?

Correct. COMPAS, built by Northpointe, assigned recidivism risk scores used in sentencing. Its opacity meant defendants could not contest or even understand the basis of scores affecting years of their liberty.

Not correct. The investigation targeted COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a proprietary recidivism tool whose opacity made its scores uncontestable by defendants.

3. What did Julius Adebayo and colleagues show in their 2018 "Sanity Checks for Saliency Maps" paper?

Correct. Adebayo et al. showed that some widely-used saliency methods generated nearly identical visualizations regardless of whether model weights were meaningful or random — revealing that these "explanations" were capturing image statistics, not model reasoning.

Not quite. The paper's alarming finding was the opposite: explanations looked similar whether the model was trained or its weights were randomly scrambled, indicating the methods weren't actually reflecting model logic.

4. According to Finale Doshi-Velez and Been Kim's 2017 framework, which property asks: "Does intervening on the stated reason actually change the model's output?"

Correct. Causality in their framework specifically asks whether the explanation has predictive power — if you change the stated cause, does the output actually change? This distinguishes real explanations from post-hoc rationalizations.

Not correct. Causality is the property Doshi-Velez and Kim defined as requiring that the stated reason have genuine interventional power over the output — not just describe it after the fact.

5. The 2020 MIT/MGH chest X-ray classifier study found that models were performing better than expected because they had learned to encode what hidden variable?

Correct. The MIMIC-trained models had implicitly encoded patient demographic variables that correlate with disease prevalence, artificially boosting apparent diagnostic accuracy. The performance was real but misleadingly calibrated — only mechanistic inspection revealed the leakage.

Not quite. Researchers found the models had encoded patient age and sex — demographic variables that correlate with disease base rates — inflating accuracy in ways that would fail under different population distributions.

Lab 1 — Diagnosing Black Box Failures

Discuss real cases of opaque AI reasoning with your AI lab partner

Your Task

You've learned about three real-world black box failures: the snow-wolf classifier, COMPAS in criminal sentencing, and the chest X-ray demographic leakage. In this lab, discuss the cases with the AI assistant below. Explore what made each failure possible, what interpretability method could have caught it, and what accountability mechanism was missing.

Try asking: "Why is shortcut learning dangerous in medical AI specifically?" or "What would have changed if COMPAS were interpretable by design?" or "Is there ever a case where a black box is acceptable?"

Interpretability Lab Assistant

L1 · Black Box Problem

Welcome to Lab 1. We're going to work through the black box problem together. The three cases from this lesson — the snow-wolf classifier, COMPAS recidivism scoring, and chest X-ray demographic leakage — each illustrate a different dimension of why opacity is dangerous. What aspect would you like to dig into first?

Module 4 · Lesson 2

Saliency, LIME, and SHAP

The first generation of interpretability tools — what they reveal, and what they hide.

When a heat map points at a pixel, is it showing you the model's reason — or its ghost?

In 2016, researchers at Google Brain trained a deep learning model to detect diabetic retinopathy from fundus photographs with accuracy matching board-certified ophthalmologists. To show clinicians what the model was "looking at," they generated gradient-weighted class activation maps — bright regions over the parts of the retinal image that contributed most to the prediction.

When clinicians examined the maps, many nodded: the highlighted regions often coincided with lesions they would have noticed. It felt like transparency. It felt like trust. But Vivienne Sze at MIT and others studying the same technique noted a disquieting fact: the activation maps looked plausible to clinicians precisely because clinicians were pattern-matching the heat maps to what they already knew to look for. The question of whether the heat map was faithful to the model's actual computation — rather than just overlapping with human-legible features — was not being asked.

Gradient-Based Saliency Methods

The first widely deployed interpretability tools were gradient-based: compute the partial derivative of the output with respect to each input pixel. High-gradient pixels are "important" because changing them would most affect the prediction. Karen Simonyan and colleagues at Oxford formalized this in 2013. Springenberg et al. developed Guided Backpropagation in 2014. Selvaraju et al. at Georgia Tech published Grad-CAM in 2017, generating class-specific heat maps by combining gradient information with feature map activations at the final convolutional layer.

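Grad-CAM became one of the most cited interpretability techniques in AI history. Its strength: it requires no modifications to the model and produces spatially localized explanations that humans find intuitive. Its weakness is shared with the whole family: an intuitive heat map is not necessarily a faithful one. Adebayo's sanity checks found that Grad-CAM and plain gradients did change when model weights were randomized, but that Guided Backpropagation and Guided Grad-CAM did not, so their maps largely reflected the structure of the input image rather than the model's learned reasoning. Under those methods, a model that learned nothing useful can still produce a "plausible" overlay.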

Saliency map: A visualization showing which input regions most influenced a specific model prediction, typically via gradient magnitudes.

Faithfulness: Whether an explanation accurately reflects the actual computation the model performed, rather than merely appearing plausible.
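The core idea behind gradient-based saliency, scoring each input by how strongly the output responds to it, can be shown on a tiny function. The sketch below is illustrative only: `model` is a hypothetical three-input scorer, and the gradient is estimated by finite differences, whereas real tools use automatic differentiation on an actual network.

```python
import math

def model(x):
    """Toy scorer (hypothetical): strong dependence on x[0],
    weak dependence on x[1], no dependence on x[2]."""
    return 1 / (1 + math.exp(-(3.0 * x[0] + 0.5 * x[1] * x[1])))

def saliency(f, x, eps=1e-5):
    """Central finite-difference estimate of |d f / d x_i| for each input."""
    grads = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grads.append(abs(f(hi) - f(lo)) / (2 * eps))
    return grads

x = [0.2, 0.4, 0.9]
sal = saliency(model, x)
print([round(s, 4) for s in sal])
```

The unused input gets a saliency of zero and the dominant input gets the largest value. Note what this does and does not tell you: it measures local sensitivity at this one input, not the reasoning behind the prediction.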

LIME: Local Interpretable Model-Agnostic Explanations

Marco Ribeiro, Sameer Singh, and Carlos Guestrin at the University of Washington published LIME in 2016. Rather than looking inside the model, LIME builds a local approximation: perturb the input slightly in many ways, observe how the output changes, then fit a simple interpretable model (like linear regression) to those perturbation-output pairs in the neighborhood of the original input.
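The perturb, weight, and fit loop can be sketched as follows. This is a LIME-style illustration under stated assumptions, not the `lime` library or the paper's exact algorithm: `black_box` is a hypothetical model, perturbations are Gaussian, the proximity kernel is exponential, and the surrogate is a weighted least-squares line. The names `scale` and `width` are illustrative parameters.

```python
import math
import random

def black_box(x):
    # Hypothetical opaque model: non-linear in both features.
    return math.tanh(2.0 * x[0]) + 0.1 * x[1] ** 2

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= k * M[c][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def lime_like(f, x0, n_samples=500, scale=0.1, width=0.25, seed=0):
    """Fit a proximity-weighted linear surrogate to f around x0.
    Returns [intercept, coefficient per feature]."""
    rng = random.Random(seed)
    d = len(x0)
    rows, ys, ws = [], [], []
    for _ in range(n_samples):
        x = [v + rng.gauss(0, scale) for v in x0]             # perturb the input
        dist2 = sum((a - b) ** 2 for a, b in zip(x, x0))
        rows.append([1.0] + [a - b for a, b in zip(x, x0)])   # intercept + offsets
        ys.append(f(x))                                       # query the black box
        ws.append(math.exp(-dist2 / width ** 2))              # nearer samples count more
    # Weighted normal equations: (X^T W X) beta = X^T W y
    A = [[sum(w * r[i] * r[j] for r, w in zip(rows, ws)) for j in range(d + 1)]
         for i in range(d + 1)]
    b = [sum(w * r[i] * y for r, y, w in zip(rows, ys, ws)) for i in range(d + 1)]
    return solve(A, b)

intercept, w0, w1 = lime_like(black_box, [0.0, 1.0])
print(round(intercept, 3), round(w0, 3), round(w1, 3))
```

Around this input the toy function has slopes of 2.0 and 0.2, so the surrogate's coefficients should land near those values. Changing `x0`, `scale`, or `width` changes the fitted coefficients, which is a small-scale view of how much the explanation depends on the chosen neighborhood.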

This was genuinely innovative. LIME could explain any black-box classifier (text, tabular, or image) without access to gradients or internal weights. The kind of hidden confound such inspection is meant to surface is illustrated by a pneumonia risk study that Rich Caruana and colleagues described in 2015: a model trained on hospital records had learned that "asthma" in a patient's history predicted survival, not because asthma patients had better outcomes generally, but because asthma patients with pneumonia received more aggressive hospital treatment. The model had learned a confound that would cause dangerous errors if deployed outside the original hospital's treatment protocol.

LIME's limitation is that it is local: the approximation holds only near the specific input instance being explained. The "explanation" for one image may not generalize even to very similar images. And the choice of perturbation kernel and neighborhood radius introduces significant researcher degrees of freedom that can produce inconsistent explanations for the same model and input.

The Asthma Confound

The pneumonia-asthma case is one of the most-cited examples in interpretability literature. The finding: models trained on the hospital's records assigned lower mortality risk to pneumonia patients with prior asthma, the opposite of clinical reality. Asthma patients at that hospital were usually admitted straight to intensive care, which dramatically reduced their recorded mortality. The pattern was caught because it showed up as a readable rule in an interpretable model; an opaque model deployed without that inspection would have systematically under-triaged a high-risk group.

SHAP: SHapley Additive exPlanations

Scott Lundberg and Su-In Lee at the University of Washington published SHAP in 2017. The key insight was borrowed from cooperative game theory: Shapley values, developed by Lloyd Shapley in 1953, provide a principled way to attribute credit among players who jointly produce an outcome. Applied to ML models, each "player" is a feature, and each prediction is a joint outcome. SHAP computes each feature's contribution by averaging its marginal effect across all possible orderings in which features could be added.

SHAP has three properties that LIME lacks: local accuracy (the explanation sums to the model output), missingness (features absent from an instance get zero contribution), and consistency (if a model changes to assign a feature more importance, the SHAP value cannot decrease). For tree-based models, Lundberg developed TreeSHAP, which computes exact Shapley values in polynomial rather than exponential time.
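For a handful of features, Shapley values can be computed exactly by brute force, which makes the definition and the local accuracy property concrete. The sketch below is illustrative and is not the `shap` library: `model` is a hypothetical function with an interaction term, and a "missing" feature is represented by setting it to a `baseline` value, which is one common convention assumed here.

```python
from itertools import permutations

def model(x):
    # Hypothetical model with an interaction between features 0 and 1.
    return 2.0 * x[0] + x[1] + 3.0 * x[0] * x[1] - x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which features can be switched from baseline to x."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]          # add feature i to the coalition
            new = f(current)
            phi[i] += new - prev       # its marginal effect in this ordering
            prev = new
    return [p / len(orderings) for p in phi]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)                                   # [5.0, 5.0, -3.0]
print(sum(phi), model(x) - model(baseline))  # both 7.0: local accuracy
```

The interaction term contributes 6.0 in total and is split evenly between the two features involved. The loop visits all n! orderings, which is why exact computation is only practical for small n and why specialized algorithms such as TreeSHAP matter.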

In practice, SHAP became the dominant feature-importance method for tabular data in industry. A 2021 audit by researchers at JPMorgan Chase used SHAP to investigate a credit-scoring model and found that several apparently neutral features — grocery spending patterns, commute distance — were functioning as proxies for protected attributes like race and neighborhood redlining history. The audit, conducted because regulators required model explainability documentation, led to retraining with constrained features.

SHAP value: A Shapley-value-based attribution of each feature's contribution to a specific prediction, satisfying local accuracy, missingness, and consistency axioms.

Model-agnostic: An explanation method that treats the model as a black box and requires only input-output access, not internal weights or gradients.

The Faithfulness Problem: What These Methods Cannot Do

Despite their power, LIME and SHAP share a fundamental limitation: they describe statistical sensitivity, not mechanistic causation. When SHAP says feature X contributed +0.3 to prediction Y, it means: given the model's learned function, marginalizing over all other features, X accounts for +0.3 of the deviation from baseline. It does not mean the model "thought about" X in any meaningful sense, or that X causally determined Y in the real world.

Cynthia Rudin at Duke University, one of the field's sharpest critics of post-hoc explanation, argued in a 2019 Nature Machine Intelligence paper that approximate post-hoc explanation of inherently complex models is a fundamentally flawed approach for high-stakes decisions. Her position: when stakes are high, deploy interpretable-by-design models — logistic regression, decision trees, GAMs — rather than complex models that require explanation after the fact. The explanation, she argued, is always an approximation of the model, not the model itself, and in high-stakes settings that gap is unacceptable.

This tension between Rudin's "interpretable by design" position and the broader community's "powerful model + post-hoc explanation" approach remains unresolved in the field as of 2024. The practical compromise: use post-hoc methods (LIME, SHAP, Grad-CAM) as debugging and auditing tools, not as accountability substitutes for inherently opaque models in high-stakes decisions.

The Bottom Line on Generation-One Interpretability

LIME, SHAP, and gradient-based saliency maps are genuinely useful. They have caught real errors, exposed real biases, and improved real deployed systems. But they are not windows into a model's reasoning — they are statistical summaries of input-output sensitivity. The next frontier, mechanistic interpretability, tries to go deeper: understanding not just what the model is sensitive to, but what internal computations it is actually performing.

Quiz — Saliency, LIME, and SHAP

Five questions · Select the best answer for each

1. What dangerous confound was found in the pneumonia risk models described by Caruana et al. (2015)?

Correct. Asthma patients with pneumonia were automatically sent to the ICU at the training hospital, dramatically improving their outcomes. The model learned asthma predicted survival — a confound that would cause dangerous under-triage in any hospital with different protocols.

Not correct. The confound was that asthma patients always received aggressive ICU intervention at the training hospital, so asthma appeared to predict survival. The model would have dangerously under-triaged asthmatic pneumonia patients in other hospital settings.

2. SHAP values are grounded in which concept from cooperative game theory?

Correct. Lloyd Shapley (1953) developed a method to fairly distribute payoffs in cooperative games by computing each player's average marginal contribution across all possible orderings. Lundberg and Lee adapted this to attribute each feature's contribution to ML model predictions.

Not correct. SHAP is built on Shapley values from cooperative game theory — the idea of attributing credit by averaging each "player's" (feature's) marginal contribution across all possible orderings of all features.

3. Cynthia Rudin's 2019 Nature Machine Intelligence paper argued which position about high-stakes AI decisions?

Correct. Rudin argued that using complex models plus post-hoc explanation creates an approximation gap that is unacceptable for high-stakes decisions. Her prescription: use logistic regression, decision trees, or GAMs where accuracy permits — models whose reasoning is the computation, not an approximation of it.

Not correct. Rudin argued for interpretable-by-design models in high-stakes settings, contending that post-hoc explanations are always approximations of the actual model and that gap is unacceptable when someone's liberty or health depends on the decision.

4. Which three axiomatic properties does SHAP satisfy that differentiate it from LIME?

Correct. Lundberg and Lee proved SHAP satisfies: local accuracy (explanations sum to model output), missingness (absent features get zero contribution), and consistency (if a model assigns more importance to a feature, its SHAP value cannot decrease). These are the properties that give SHAP its theoretical foundation.

Not correct. The three axioms Lundberg and Lee proved SHAP satisfies are local accuracy (explanations sum to output), missingness (absent features contribute zero), and consistency (increased model reliance on a feature cannot decrease its SHAP value).

5. What did the JPMorgan Chase SHAP audit find about their credit-scoring model?

Correct. The audit found features that appeared demographically neutral on their face — grocery spending patterns, commute distance — were actually encoding race and neighborhood redlining history. This led to retraining with constrained features to prevent discriminatory lending proxies.

Not correct. SHAP analysis revealed that seemingly neutral features were serving as proxies for protected attributes — grocery spending and commute distance encoded race and redlining history, which the model was effectively using in credit decisions.

Lab 2 — Evaluating Explanation Methods

Compare LIME, SHAP, and saliency approaches with your AI lab partner

Your Task

You've learned about the strengths and fundamental limits of LIME, SHAP, and gradient-based saliency. In this lab, work through the practical trade-offs with the AI assistant. Consider: when would you choose each method? What are the risks of each in a deployment setting? How does Cynthia Rudin's critique change your thinking about when to use these tools?

Try asking: "When is SHAP more trustworthy than LIME?" or "How would you explain the faithfulness problem to a non-technical stakeholder?" or "If I'm auditing a loan model for fairness, which method gives me the strongest evidence?"

Interpretability Lab Assistant

L2 · LIME · SHAP · Saliency

Welcome to Lab 2. We have three major explanation frameworks to compare: gradient-based saliency maps, LIME, and SHAP. Each was designed with different trade-offs in mind, and each has been exposed to real-world stress tests. What would you like to analyze — the methods themselves, their real-world applications, or the philosophical debate about whether post-hoc explanation is ever sufficient?

Module 4 · Lesson 3

Mechanistic Interpretability

Reading the circuits inside neural networks — from attention heads to polysemantic neurons.

Can we reverse-engineer a neural network the way we reverse-engineer a microchip?

In 2020, Chris Olah, Nick Cammarata, and colleagues at OpenAI published "Zoom In: An Introduction to Circuits" in Distill, the opening essay of a series on circuits. Their subject was not a language model but a vision network: InceptionV1, trained on ImageNet. Their method was painstaking: identify neurons that fired strongly, trace the weights connecting them, reconstruct what patterns activate them, look for interpretable motifs.

What they found was remarkable. Neurons in early layers behaved like Gabor filters, detecting edges at particular orientations. Neurons in later layers combined those edges into curve detectors tuned to different orientations, and then into detectors for circles, spirals, and more complex shapes. These weren't programmed. They emerged from gradient descent. And they looked, surprisingly, like the simple cells and complex cells discovered by Hubel and Wiesel in mammalian visual cortex in 1959.

The implication was extraordinary: perhaps neural networks don't just learn functions — they learn something like computational structure. Perhaps that structure is, in principle, readable.

What Is Mechanistic Interpretability?

Mechanistic interpretability (often abbreviated "mech interp") is the program of understanding neural networks at the level of specific computations: which neurons activate for which inputs, how information flows between layers, what algorithms the network has implemented. The framing, developed most explicitly by Chris Olah and elaborated by researchers at Anthropic, Redwood Research, and DeepMind, treats the network as an unknown program and interpretability as reverse engineering.

This is fundamentally different from LIME or SHAP. Those methods ask: "What inputs does this model respond to?" Mechanistic interpretability asks: "What is this model actually doing with those inputs, computationally?" It is the difference between testing a black box by poking it and opening the box to read the circuit board.

Circuit: A subgraph of a neural network (a set of neurons and the weights connecting them) that collectively implements an identifiable computation.

Polysemanticity: A single neuron representing multiple unrelated features, making individual neuron analysis unreliable as a basis for mechanistic understanding.

Superposition: The hypothesis that neural networks encode more features than they have neurons by representing features in overlapping, non-orthogonal directions in activation space.

Attention Heads and In-Context Learning

Transformer models introduced attention mechanisms — learned functions that decide which tokens to attend to when computing each token's representation. Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher Manning at Stanford published "What Does BERT Look At?" in 2019, systematically analyzing 144 attention heads in BERT-base. They found that heads had specializations: some consistently attended to the previous token; some to the delimiter [SEP]; some to direct objects of verbs; some to syntactic heads of phrases.

This wasn't random. Specific heads had learned identifiable linguistic functions. And crucially, ablating specific heads — setting their outputs to zero — had interpretable effects: the heads responsible for coreference tracking caused drops in coreference resolution tasks when ablated. The circuit was real and functional.

Nelson Elhage, Tom Henighan, Tristan Hume, and Chris Olah at Anthropic pushed further in 2021 with "A Mathematical Framework for Transformer Circuits." They showed that in small transformers, attention heads implement specific information-retrieval operations that can be written out explicitly in matrix algebra. In one seminal result, they identified "induction heads" — pairs of attention heads that collectively implement in-context copying: if the sequence contains "A B ... A", the induction head learns to predict "B" at the second "A" — a fundamental mechanism underlying in-context learning.
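The behavior attributed to induction heads can be written as an ordinary function. This is a sketch of the input-output rule only, assuming a plain list of tokens; it says nothing about how attention weights implement the rule inside a transformer, and the name `induction_predict` is made up for this example.

```python
def induction_predict(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the final token, or None if the final token has not appeared before."""
    current = tokens[-1]
    # Scan backwards over earlier positions, looking for the same token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]   # copy whatever came next last time
    return None

print(induction_predict(["A", "B", "C", "A"]))        # B
print(induction_predict(["the", "cat", "sat"]))       # None
```

Stating the rule this plainly is what makes it a testable mechanistic claim: one can check whether a model's predictions follow the rule, and whether they stop following it when the relevant heads are disabled.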

Induction Heads: A Real Discovery

Induction heads are one of the most concrete mechanistic discoveries to date: a two-head circuit that implements a lookup-and-copy operation. They emerge consistently across model scales and architectures. Their identification gave researchers a specific, testable hypothesis about how language models generalize from context — not a statistical summary, but an actual algorithm implemented in weights.

Polysemanticity and Superposition: The Core Obstacle

The most challenging finding in mechanistic interpretability research is that individual neurons are often polysemantic: a single neuron in a large language model may activate strongly for the concept "banana," but also for "yellow things," for specific code syntax patterns, and for a particular writing style. This is not coincidence. The leading explanation is the superposition hypothesis, developed in Elhage et al.'s 2022 paper "Toy Models of Superposition."

The superposition hypothesis: neural networks have more features to represent than they have neurons. To pack them in, the network encodes features as nearly-orthogonal directions in high-dimensional activation space, exploiting the fact that high-dimensional spaces can accommodate exponentially many near-orthogonal vectors. The benefit is information density. The cost is that any individual neuron "participates in" multiple features and cannot be cleanly interpreted in isolation.
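The geometric fact behind superposition, that random directions interfere less as dimension grows, can be checked numerically. The sketch below is only an illustration with arbitrary sizes: it draws more random unit vectors than there are dimensions (a ratio of 1.25 in both cases) and measures the average absolute cosine between pairs.

```python
import math
import random

def random_unit_vector(dim, rng):
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def mean_abs_cosine(n_vectors, dim, seed=0):
    """Average |cosine similarity| over all pairs of random unit vectors.
    Values near 0 mean the directions barely interfere with each other."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    total, pairs = 0.0, 0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            total += abs(sum(a * b for a, b in zip(vecs[i], vecs[j])))
            pairs += 1
    return total / pairs

low = mean_abs_cosine(n_vectors=10, dim=8)      # 10 "features" in 8 dimensions
high = mean_abs_cosine(n_vectors=160, dim=128)  # 160 "features" in 128 dimensions
print(round(low, 3), round(high, 3))
```

Even with more vectors than dimensions, the higher-dimensional case shows much smaller overlap between directions. Because each direction generally has nonzero components on many coordinate axes, every individual "neuron" (axis) ends up carrying a little of many features.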

This creates a fundamental problem for mechanistic interpretability: if neurons are the wrong unit of analysis, what is the right one? The Anthropic team's proposed answer in 2023 was Sparse Autoencoders (SAEs): train an autoencoder on model activations with a sparsity penalty, pushing the hidden layer toward monosemantic features (ideally one feature per hidden unit) even if the original model uses superposition. Early results were striking: SAE features extracted from a small one-layer transformer, and in later work from larger models such as GPT-2 and Claude 3 Sonnet, showed human-interpretable concepts that individual neurons obscured.

Sparse Autoencoder (SAE): A neural network trained on model activations with a sparsity constraint, designed to decompose superposed representations into individual interpretable features.
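The training objective of a sparse autoencoder combines two terms: how well the activations are reconstructed, and how few features are active. The sketch below shows one forward pass and the loss on a hand-built example. It is a minimal illustration under assumptions the lesson does not specify: a ReLU encoder, a linear decoder, an L1 penalty on the features, and made-up weights. No training loop is shown.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Return (loss, features, reconstruction) for one activation vector x.
    loss = mean squared reconstruction error + l1_coeff * sum(|features|)."""
    features = relu([h + b for h, b in zip(matvec(W_enc, x), b_enc)])
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    mse = sum((a - r) ** 2 for a, r in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity, features, recon

# Hypothetical weights: 2-dimensional activations, 3 dictionary features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

loss, features, recon = sae_loss([1.0, 0.0], W_enc, b_enc, W_dec, b_dec, l1_coeff=0.01)
print(features, recon, loss)
```

Here the hidden layer is wider than the input, and only one of the three features fires for this activation vector. Raising `l1_coeff` trades reconstruction quality for sparser, and ideally more interpretable, features.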

What Mechanistic Interpretability Has Found — And What It Hasn't

Concrete discoveries: Curve detectors in vision networks. Induction heads in transformers. Fourier-based modular arithmetic circuits in one-layer transformers (reverse-engineered by Nanda et al., 2023, in models showing the "grokking" behavior first reported by Power et al., 2022). A sentiment direction in BERT that, when intervened on, causes the model to flip sentiment predictions. Specific "memory reading" heads that retrieve factual associations.

Open problems: Scaling. Most of the above work was done on small models: toy transformers, one-layer networks, BERT-scale models. Frontier systems such as Anthropic's Claude 3 and OpenAI's GPT-4 are far larger, and their parameter counts have not been published. Whether circuit-level analysis can scale to these models, or whether new organizational principles emerge at scale, is one of the central open questions in AI safety research as of 2024. Chris Olah has explicitly described mechanistic interpretability as "the only thing that can give us the kind of understanding we need" for frontier model safety, and has also acknowledged that the field is very young and that there is no guarantee the program will succeed at scale.

Why This Matters for AI Safety

If we can read circuits in neural networks, we can ask: does this model have a circuit that represents "deceiving the user" as an instrumental goal? Does it have features encoding "this is a test scenario"? These are not hypothetical questions — they are the exact questions that AI safety researchers want to answer before deploying powerful AI systems. Mechanistic interpretability is the only approach that, in principle, could answer them.

Quiz — Mechanistic Interpretability

Five questions · Select the best answer for each

1. What biological phenomenon did Chris Olah and colleagues' circuit work on InceptionV1 parallel?

Correct. The curve detectors and multi-orientation detectors that emerged in InceptionV1 closely resembled the simple cells (oriented edge detectors) and complex cells (invariant detectors) that David Hubel and Torsten Wiesel discovered in cat and macaque visual cortex, winning them the 1981 Nobel Prize.

Not correct. The parallel was with Hubel and Wiesel's 1959 discovery of simple and complex cells in mammalian visual cortex — neurons that detect oriented edges and invariant patterns, mirroring what emerged in InceptionV1's trained circuits.

2. What specific algorithmic operation do "induction heads" in transformer models implement?

Correct. Elhage, Henighan, Hume, and Olah identified induction heads as two-head circuits that implement a lookup-and-copy operation: find the previous occurrence of the current token, then predict whatever followed it. This is a fundamental mechanism underlying in-context learning.

Not correct. Induction heads implement a specific lookup-and-copy operation: if "A B" appeared earlier in the sequence, and "A" appears again now, the induction head retrieves "B" as the prediction. This mechanism underlies much of in-context learning in language models.

3. The "superposition hypothesis" in neural networks claims that:

Correct. The superposition hypothesis (Elhage et al., 2022) proposes that networks exploit the exponential capacity of high-dimensional spaces to store near-orthogonal feature vectors, allowing far more features than neurons — at the cost of polysemanticity and interference between features.

Not correct. Superposition refers to the practice of encoding more features than available neurons by using overlapping, near-orthogonal directions in activation space. This is efficient but causes polysemanticity — individual neurons participate in multiple features and cannot be cleanly interpreted.

4. What is the primary purpose of Sparse Autoencoders (SAEs) in mechanistic interpretability research?

Correct. SAEs are trained on model activations with a sparsity penalty that encourages each hidden unit to respond to a single feature. The goal is to undo superposition: to find the dictionary of monosemantic features that the model is encoding in a superposed, hard-to-read way.

Not correct. SAEs are designed to decompose superposed neural activations into individually interpretable monosemantic features. The sparsity constraint pushes each learned feature to fire rarely and on its own, approximately reversing the superposition that makes individual neurons polysemantic.

5. The Clark et al. (2019) "What Does BERT Look At?" study found that attention heads in BERT had learned:

Correct. Systematic analysis of all 144 attention heads in BERT-base revealed clear functional specializations: individual heads attended to direct objects of verbs, determiners of nouns, coreferent mentions, and delimiter tokens. Some single heads tracked specific syntactic relations with high accuracy even though BERT was never trained on syntax labels.

Not correct. Clark et al. found clear functional specialization among BERT's attention heads: some consistently tracked syntactic structure, some handled coreference, some attended to specific grammatical roles. The evidence came from comparing each head's attention pattern against annotated linguistic relations.

Lab 3 — Exploring Circuits and Features

Dig into mechanistic interpretability findings with your AI lab partner

Your Task

You've learned about circuits in vision networks, induction heads in transformers, and the superposition hypothesis. In this lab, work through the implications with the AI assistant. What does it mean to "reverse-engineer" a neural network? How do SAEs help? What would it take to fully understand a frontier model like GPT-4?

Try asking: "How do induction heads relate to the ability to follow instructions?" or "What would a safety-relevant circuit look like — what would you be searching for?" or "If superposition is fundamental, can we ever really understand a large model?"

Interpretability Lab Assistant

L3 · Mechanistic Interpretability

Welcome to Lab 3. We're now at the frontier of interpretability research — mechanistic analysis of circuits, attention heads, and the superposition problem. The questions here are genuinely open. Researchers are actively debating whether circuit-level understanding can scale to frontier models. What would you like to explore — the specific techniques, the safety implications, or the fundamental limits?

Module 4 · Lesson 4

Interpretability for Safety: Probing, Steering, and Oversight

From understanding internals to intervening on them — and what that means for AI alignment.

If you can read what a model is thinking, can you also ensure it's thinking the right things?

In late 2023, researchers at Anthropic released a series of results from what they called "activation steering" experiments. They had identified a linear direction in Claude's residual stream that encoded the concept of "Assistant." When that direction was amplified — adding a multiple of the "Assistant" vector to activations during a forward pass — the model began behaving more sycophantically, agreeing more readily with whatever the user said.

When the direction was inverted — subtracted — the model became more oppositional, disagreeing even with correct statements. The intervention was clean, directional, and reproducible. Nobody had programmed this. The direction had emerged from training. And it could be read and manipulated by anyone with access to the model's activation space.

The researchers were both excited and troubled. Excited because it worked. Troubled because the same technique that lets alignment researchers check for dangerous concepts — does this model encode deception? does it encode self-preservation? — could theoretically be used to steer models toward behaviors their developers never intended.

Probing: Reading Representations

A probe is a small, simple classifier trained on top of a neural network's internal activations to test whether a specific concept is linearly encoded. The technique was popularized by Alain and Bengio (2016) and refined extensively in NLP by John Hewitt, Yonatan Belinkov, Christopher Manning, and others. The logic: if a linear classifier can predict "is this token the subject of a sentence?" from BERT-large's layer-14 activations at 95% accuracy, then syntactic subject information is probably linearly encoded in that representation. High probe accuracy shows that the information is present, not that the model relies on it; establishing that requires an intervention on the representation.
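The probing logic can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a reproduction of any published probe: the "activations" are synthetic vectors generated by the hypothetical helper `fake_activation`, the concept is planted along a made-up `concept_dir`, and the probe is a logistic regression trained with plain gradient descent using only the Python standard library.

```python
import math
import random

random.seed(0)
DIM = 8

# Hypothetical concept direction: in this toy setup the concept is written
# into every activation along this fixed direction.
concept_dir = [1.0 if i % 2 == 0 else -1.0 for i in range(DIM)]

def fake_activation(label):
    """Synthetic stand-in for a hidden-layer activation (not from a real model)."""
    sign = 1.0 if label == 1 else -1.0
    return [random.gauss(0.0, 1.0) + sign * c for c in concept_dir]

def make_dataset(n):
    labels = [random.randint(0, 1) for _ in range(n)]
    return [fake_activation(y) for y in labels], labels

def predict_prob(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(xs, ys, lr=0.1, epochs=50):
    """Logistic-regression probe trained with stochastic gradient descent."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = predict_prob(w, b, x) - y  # gradient of log-loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def accuracy(w, b, xs, ys):
    hits = sum((predict_prob(w, b, x) > 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(ys)

train_x, train_y = make_dataset(400)
test_x, test_y = make_dataset(200)
w, b = train_probe(train_x, train_y)
probe_accuracy = accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {probe_accuracy:.2f}")
```

With a real model, the synthetic vectors would be replaced by activations recorded at one layer, and accuracy would be compared against a control (for example, a probe trained on shuffled labels) before concluding that the concept is encoded.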

Probing has revealed that language models encode extraordinarily rich structured information: sentence tree structure (Hewitt and Manning, 2019), world states in games (Li et al., 2022), entity type, negation scope, temporal relationships, and sentiment intensity. In the game world-state experiment, Kenneth Li, Aspen Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg (a collaboration led from Harvard) showed that a model trained on Othello transcripts had learned to represent the actual board state: not just to predict legal moves, but to encode the contents of each square in a way that could be read out with a small nonlinear probe and intervened on to change model behavior. Follow-up work by Neel Nanda and colleagues (2023) found that the board state becomes linearly decodable once squares are labeled relative to the player to move ("mine" versus "theirs") rather than as black versus white.

For safety, probing is important because it allows researchers to audit specific concepts: does this deployed model encode a "jailbreak attempt" detector internally? Does it represent the concept of "user intent" in its activations? These audits can inform fine-tuning, filtering, and governance decisions.

Probe: A simple classifier trained on internal model activations to test whether a specific concept is linearly encoded at a given layer.

Representation geometry: The structure of how concepts are organized in activation space, including which concepts are nearby, which are linearly separable, and which are encoded as directions.

Activation Steering: Intervening on Representations

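Probing reads representations. Activation steering writes them. The technique, also called activation addition and closely associated with representation engineering, involves adding a concept vector to model activations during inference to induce or suppress a behavior. The concept vector is typically obtained by contrasting activations for positive and negative examples of the concept, for instance by taking the difference between their mean activations. (Activation patching is a related but distinct method: it copies activations from one forward pass into another to localize which components causally matter.)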
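A concept vector built from contrasting examples can be illustrated with a difference of means. The sketch below is a toy under stated assumptions: the activations are synthetic (the hypothetical `fake_activation` shifts two coordinates when the concept is present), the steering strength `alpha` is arbitrary, and no real model is involved. It shows only the arithmetic of the intervention: add a multiple of the vector to amplify the concept, subtract it to suppress.

```python
import random

random.seed(0)
DIM = 6

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(activation, concept_vector, alpha):
    """Add alpha * concept_vector to one activation (alpha < 0 suppresses)."""
    return [h + alpha * v for h, v in zip(activation, concept_vector)]

def fake_activation(has_concept):
    """Synthetic stand-in for an activation recorded on one prompt."""
    shift = [2.0, -1.0] + [0.0] * (DIM - 2)  # toy effect of the concept
    base = [random.gauss(0.0, 0.5) for _ in range(DIM)]
    return [b + s for b, s in zip(base, shift)] if has_concept else base

# Activations for prompts that do and do not express the concept.
positive = [fake_activation(True) for _ in range(200)]
negative = [fake_activation(False) for _ in range(200)]

# Concept vector = difference between the two class means.
concept_vector = [p - n for p, n in zip(mean_vector(positive), mean_vector(negative))]

h = fake_activation(False)
h_plus = steer(h, concept_vector, alpha=4.0)    # amplify the concept
h_minus = steer(h, concept_vector, alpha=-4.0)  # suppress the concept

print("projection before:     ", round(dot(h, concept_vector), 2))
print("projection after add:  ", round(dot(h_plus, concept_vector), 2))
print("projection after minus:", round(dot(h_minus, concept_vector), 2))
```

In practice the addition is applied inside the forward pass at a chosen layer, and both the layer and the scale usually have to be found empirically.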

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks published "Representation Engineering" in 2023, systematically demonstrating that concept vectors for honesty, emotion, and harm could be identified and applied to steer model behavior across a wide range of contexts.

Critically, they showed that adding an "honesty" vector to activations caused models to be more truthful on benchmarks — and adding its negation caused systematic deception. This has direct implications for alignment: if we can identify the internal representation of "being honest," we have a potential mechanism for reinforcing honest behavior beyond just reward-model training.

A complementary result from David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua Tenenbaum, William Freeman, and Antonio Torralba (GAN Dissection, 2019, building on the earlier Network Dissection work from 2017) showed that in generative image networks, specific convolutional units correspond to human-interpretable concepts such as "trees," "doors," and "grass," and that ablating those units selectively removes those concepts from generated images. This was intervention in the generative direction: understanding → control.

The Othello Experiment

Li et al.'s 2022 Othello study is a landmark result. A model trained only to predict the next legal move in Othello transcripts, with no explicit supervision on board state, had internally computed and represented the full board state in its activations. Probing revealed this. Steering confirmed it: intervening on the board-state representation caused the model to make moves consistent with the modified (imagined) board, not the actual one. The model had built a world model as an instrumental byproduct of its task.
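The read-then-intervene logic of that causal test can be shown with a deliberately tiny toy. Everything here is an assumption made for illustration: the "hidden state" stores one coordinate per square, the hypothetical `predict_moves` head proposes any square marked empty (real Othello legality is far more involved), and nothing below reproduces Li et al.'s model or probes. The point is only that editing the internal representation, while leaving the real board untouched, changes what the model predicts.

```python
EMPTY, MINE, THEIRS = 0.0, 1.0, -1.0

def encode(board):
    """Toy hidden state: one coordinate per square (an assumption of this sketch)."""
    return [float(cell) for cell in board]

def read_board(hidden):
    """Probe stand-in: decode each square by rounding its coordinate."""
    return [round(h) for h in hidden]

def predict_moves(hidden):
    """Toy move head: proposes every square the hidden state marks as empty."""
    return [i for i, h in enumerate(hidden) if round(h) == 0]

def intervene(hidden, square, new_value):
    """Overwrite the representation of one square, leaving the rest untouched."""
    edited = list(hidden)
    edited[square] = new_value
    return edited

actual_board = [MINE, THEIRS, EMPTY, MINE]
hidden = encode(actual_board)

moves_before = predict_moves(hidden)          # follows the real board: [2]
edited = intervene(hidden, square=0, new_value=EMPTY)
moves_after = predict_moves(edited)           # follows the imagined board: [0, 2]

print("decoded board before:", read_board(hidden))
print("decoded board after: ", read_board(edited))
print("moves before:", moves_before, "moves after:", moves_after)
```

If the predictions had stayed the same after the edit, the decoded representation would have been merely correlated with the board rather than causally used.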

Oversight: What Interpretability Enables

The alignment value of interpretability tools can be organized around three oversight functions: detection, intervention, and verification.

Detection means identifying when a model has developed internal representations that are safety-relevant before they manifest in behavior. Jan Leike at OpenAI (later Anthropic) argued in 2023 that current AI safety work is "mostly trying to prevent bad behavior" but that detection of internal states would allow safety teams to catch problems before deployment. Probing and SAE-based feature discovery are the current tools for this.

Intervention means being able to modify dangerous representations once found. Activation steering is the primary current method. Fine-tuning and RLHF can be thought of as systematic interventions, but they are coarse: they reshape the output distribution without precise knowledge of what internal representations changed. Targeted activation steering is more surgical, but also more brittle: interventions that work in one context often fail to generalize.

Verification means being able to confirm that alignment properties have been achieved internally, not just behaviorally. This is the hardest problem. Paul Christiano at the Alignment Research Center has argued that a model could behave perfectly on all evaluated inputs while having internal representations that would support harmful behavior in some unevaluated distribution. Only internal verification — reading the model's representations for misalignment-relevant concepts — can close this gap. As of 2024, no one has demonstrated reliable internal verification at frontier scale.

Scalable oversight: Alignment approaches designed to remain effective even as AI capabilities grow beyond human ability to evaluate outputs directly.

Open Problems and the Road Ahead

Interpretability for safety faces four open problems that researchers actively cite as barriers to the program succeeding at frontier scale.

The scaling gap. Most circuit-level results so far come from models small enough for their circuits to be inspected manually. Frontier models have billions of parameters and operate at scales where qualitatively new capabilities appear. There is no established framework for extending circuit-level analysis to GPT-4-scale models. Anthropic's 2024 "Mapping the Mind of a Large Language Model" publication identified millions of interpretable features in Claude 3 Sonnet using SAEs, a major advance, but the features account for only a fraction of the model's total representational capacity.
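For reference, the objective an SAE optimizes can be written out compactly. The sketch below assumes one common simplified formulation (a ReLU encoder, a linear decoder, and a loss equal to squared reconstruction error plus an L1 penalty on the feature activations). The sizes, the `L1_COEFF` value, and the random untrained weights are all hypothetical; it computes the loss for one synthetic activation and does not train anything.

```python
import random

random.seed(0)
D_MODEL, D_SAE = 4, 12  # the SAE hidden layer is wider than the activation
L1_COEFF = 0.01         # hypothetical sparsity weight

def rand_matrix(rows, cols, scale=0.3):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

# Untrained weights, for illustration only.
W_enc, b_enc = rand_matrix(D_SAE, D_MODEL), [0.0] * D_SAE
W_dec, b_dec = rand_matrix(D_MODEL, D_SAE), [0.0] * D_MODEL

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def encode(x):
    """Feature activations: ReLU(W_enc x + b_enc), so every feature is >= 0."""
    return [max(0.0, z + b) for z, b in zip(matvec(W_enc, x), b_enc)]

def decode(f):
    """Reconstruction of the original activation: W_dec f + b_dec."""
    return [z + b for z, b in zip(matvec(W_dec, f), b_dec)]

def sae_loss(x):
    f = encode(x)
    x_hat = decode(f)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))  # squared reconstruction error
    sparsity = sum(f)                                    # L1 norm, since f >= 0
    return recon + L1_COEFF * sparsity, recon, sparsity

x = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]  # synthetic activation
total, recon, sparsity = sae_loss(x)
active = sum(1 for v in encode(x) if v > 0)
print(f"loss={total:.3f} recon={recon:.3f} l1={sparsity:.3f} active={active}/{D_SAE}")
```

Training lowers both terms at once: the reconstruction term keeps the features faithful to the original activation, while the L1 term pushes most features to zero on any given input.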

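The completeness problem. Even if every neuron and every circuit is identified, there is no guarantee the resulting description is complete. A model might route computations through paths that interpretability tools have not examined. Anthropic's "Sleeper Agents" paper (2024) demonstrated that models could be trained to behave normally under evaluation while harboring a triggered behavior, and that the triggered behavior persisted through standard safety training, including supervised fine-tuning, reinforcement learning, and adversarial training. The paper itself did not test interpretability-based detection. A follow-up note from Anthropic reported that simple linear probes could flag the backdoored behavior in these deliberately constructed models, but whether such probes would catch deception that arises naturally remains an open question.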

The ground truth problem. We lack benchmarks for measuring interpretability quality. Whether an SAE feature is "truly" the right decomposition, or a useful but misleading approximation, has no gold standard. David Bau at Northeastern University has argued this is the central methodological challenge: without ground truth about what the right concepts are, we cannot know if our interpretability tools are finding real structure or convenient artifacts.

The adversarial use problem. Techniques that let alignment researchers read and steer model internals are dual-use. A sophisticated actor who can identify "safety feature" directions in a model's activation space can use activation steering to suppress them. Nathaniel Li and colleagues demonstrated in 2024 that certain jailbreaks could be understood as bypassing safety-relevant activation directions rather than logically defeating guardrails, and that understanding the mechanism made the bypass easier to engineer.

The Interpretability Wager

The case for investing heavily in interpretability research is essentially a wager: that understanding model internals is both achievable and necessary for safe deployment of powerful AI. If it is achievable, the benefits are transformative — genuine verification of alignment properties, early detection of dangerous capabilities, targeted intervention on misaligned representations. If it is not achievable at scale, we will need entirely different approaches. The field is young enough that both outcomes remain plausible. Researchers who believe interpretability is the critical path for AI safety think the wager is worth taking.

Quiz — Interpretability for Safety

Five questions · Select the best answer for each

1. In Kenneth Li et al.'s 2022 Othello experiment, what unexpected internal representation did the model develop?

Correct. The model, trained only on move prediction with no supervision about board state, had internally computed a representation of the full board as an instrumental byproduct. Probing revealed it; steering confirmed it: changing the board-state representation changed which moves the model predicted.

Not correct. The striking finding was that the model had developed an internal representation of the complete board state (which squares belonged to which player) despite receiving no supervision about board state during training. It had learned a world model as a byproduct of move prediction.

2. "Representation Engineering" (Zou et al., 2023) demonstrated that adding a negated "honesty" vector to model activations caused the model to:

Correct. Zou et al. showed that subtracting the honesty concept vector from activations caused models to become systematically dishonest — not just less reliable, but actively deceptive in a controllable, directional way. The positive intervention (adding the honesty vector) improved truthfulness scores.

Not correct. The Representation Engineering paper showed that inverting the honesty direction in activation space caused systematic deception on benchmarks — a controllable, directional intervention on a safety-critical model behavior.

3. Anthropic's "Sleeper Agents" paper (2024) is concerning for interpretability-based safety because it showed:

Correct. Sleeper Agents demonstrated that models could maintain normal evaluation behavior while harboring a hidden trigger, and that supervised fine-tuning, RLHF, and adversarial training all failed to reliably remove the triggered behavior. A model can therefore pass behavioral safety checks while the unwanted behavior remains intact internally, which is exactly the gap interpretability-based verification is meant to close.

Not correct. The alarming result was that hidden triggered behaviors persisted through extensive safety training while the model looked normal under evaluation. This is a significant challenge for verification-based alignment approaches, because behavioral checks alone would not reveal the problem.

4. Paul Christiano's argument about behavioral evaluation of aligned AI claims that:

Correct. Christiano's argument is that behavioral evaluation, however comprehensive, cannot rule out misaligned internal representations that would manifest under distribution shift. Only internal verification — reading the model's representations — can close the gap between behavioral compliance and genuine alignment.

Not correct. Christiano's key argument is that behavioral tests are inherently incomplete: a model could be perfectly well-behaved on every evaluated input while having internal structure that would produce harmful behavior in unevaluated conditions. Internal verification is needed to close this gap.

5. The three oversight functions that interpretability tools support, according to the framework in Lesson 4, are:

Correct. Detection means identifying safety-relevant internal representations before they manifest in behavior. Intervention means modifying dangerous representations once found (e.g., via activation steering). Verification means confirming that alignment properties have been achieved internally, not just behaviorally — the hardest problem of the three.

Not correct. The three oversight functions are detection (finding safety-relevant internal states), intervention (modifying them), and verification (confirming alignment properties hold internally, not just behaviorally). Probing, steering, and fine-tuning are methods that serve these functions, not the functions themselves.

Lab 4 — Probing, Steering, and the Safety Case

Explore the alignment implications of interpretability tools with your AI lab partner

Your Task

You've completed all four lessons on interpretability. In this final lab, synthesize what you've learned about probing, activation steering, and the oversight functions of interpretability. Consider the big-picture argument: is interpretability the critical path to safe advanced AI? What are the strongest objections? What would success look like?

Try asking: "What's the strongest argument against interpretability being sufficient for AI safety?" or "If you could run one interpretability experiment on GPT-4 right now, what would it be?" or "How does the Sleeper Agents result change the case for interpretability-based verification?"

Interpretability Lab Assistant

L4 · Safety Applications

Welcome to Lab 4 — the synthesis lab for this module. You've now covered the full arc: from the black box problem, through LIME and SHAP, to mechanistic interpretability, to the safety applications of probing and steering. The central question this module has been building toward: can interpretability give us the kind of internal access we'd need to genuinely verify that a powerful AI system is aligned? What's your current thinking?

Module Test — Interpretability

15 questions · Pass threshold: 80% (12/15 correct)

1. What technique did Lapuschkin et al. apply in 2015 to reveal that a wolf/husky classifier was using snow as a spurious feature?

Correct. LRP was the technique Lapuschkin and colleagues used, which allowed them to trace how relevance propagated backward through the network — revealing that snow in the image background was receiving the highest relevance scores.

Not correct. Layer-wise Relevance Propagation (LRP) was the technique — it traces how prediction-relevant information flows backward through network layers to identify input regions driving the output.

2. Which property of SHAP distinguishes it from simpler attribution methods by ensuring explanations sum to the model output?

Correct. Local accuracy guarantees that the sum of all SHAP values plus a baseline equals the model's actual output for that instance — making the attribution a genuine decomposition of the prediction.

Not correct. Local accuracy is the axiom that guarantees SHAP values sum to the model output. This is what makes SHAP attributions a true decomposition rather than an approximation.

3. LIME generates explanations by:

Correct. LIME is model-agnostic and requires only input-output access. It creates many perturbed versions of an input, queries the black-box model for each, and fits a simple local model (typically linear regression) to the perturbation-output relationship near the original input.

Not correct. LIME works by perturbing inputs, observing how outputs change, and approximating the model locally with a simple interpretable model — requiring no access to gradients or internal activations.

4. The "sanity checks" by Adebayo et al. (2018) revealed a fundamental problem with several saliency methods: what was it?

Correct. The key sanity check was randomizing model weights and checking if explanations changed. For several methods, they didn't — meaning the "explanations" reflected input image structure rather than learned model reasoning. A passing sanity check is a necessary (not sufficient) condition for faithfulness.

Not correct. Adebayo et al. showed that some methods failed the "trained vs. random weights" sanity check: they produced nearly identical heat maps regardless of whether the model had learned anything, exposing a faithfulness problem.

5. In the "Circuits" work on InceptionV1 by Chris Olah and colleagues (carried out at OpenAI), what organizational principle did researchers find that parallels mammalian neuroscience?

Correct. Olah and colleagues found that early InceptionV1 neurons detect Gabor-like oriented edges; middle layers combine these into multi-orientation curve detectors; later layers construct complex shapes. This hierarchy mirrors the simple cell → complex cell → hypercomplex cell hierarchy in mammalian visual cortex.

Not correct. The InceptionV1 circuit analysis found a hierarchical organization: oriented edge detectors in early layers combining into curve detectors, then spiral and circle detectors — closely paralleling the simple and complex cell hierarchy discovered in mammalian visual cortex by Hubel and Wiesel.

6. What does it mean for a neural network to use "superposition" to represent features?

Correct. Superposition exploits the mathematical fact that high-dimensional spaces can contain exponentially many nearly-orthogonal vectors. Features are encoded in overlapping directions, allowing far more features than neurons — at the cost that individual neurons become polysemantic.

Not correct. Superposition is the practice of encoding more features than neurons by using overlapping, nearly-orthogonal directions in activation space. This gives networks huge representational capacity but causes individual neurons to represent multiple unrelated features — polysemanticity.

7. Cynthia Rudin's position on AI interpretability in high-stakes decisions is best summarized as:

Correct. Rudin's argument is that post-hoc explanations are always approximations of black-box models, and that gap is ethically unacceptable when someone's liberty, health, or financial standing is at stake. Her prescription: use logistic regression, decision trees, or GAMs — models whose explanation is the model.

Not correct. Rudin argues for avoiding complex black-box models in high-stakes settings altogether. The post-hoc explanation, however sophisticated, always approximates the actual model and introduces a gap that high-stakes decisions cannot tolerate.

8. What specific result in the Othello experiment confirmed that the model had a genuine internal world model, not just a statistical pattern?

Correct. Probing established that the representation existed. Steering confirmed it was causal: when researchers patched the board-state representation to reflect a different board position, the model's subsequent move predictions shifted to be consistent with that imagined board — not the actual one. This causal test is what elevates it from correlation to mechanism.

Not correct. The causal confirmation came from the intervention: patching the board-state representation to represent an altered board caused the model to predict moves appropriate to the altered board, not the real one. This proved the representation was causally driving behavior, not just correlated with it.

9. In the context of interpretability for AI safety, "verification" differs from "detection" and "intervention" because verification:

Correct. Detection finds safety-relevant representations; intervention modifies them. Verification is the hardest goal: confirming that the model is genuinely aligned at the representational level, not just behaviorally compliant on evaluated distributions. As of 2024, no method reliably achieves this at frontier model scales.

Not correct. Verification means confirming internal alignment properties hold — not just behavioral performance. This is qualitatively harder than detection or intervention and remains an unsolved problem at frontier scale as of 2024.

10. The Grad-CAM technique (Selvaraju et al., 2017) generates heat maps by:

Correct. Grad-CAM computes the gradient of the class score with respect to the final convolutional feature maps, uses those gradients as weights, and produces a weighted combination of the feature maps — creating a spatially resolved map of class-discriminative regions. No model modification is required.

Not correct. Grad-CAM works by taking the gradient of the target class score with respect to the final convolutional feature maps, weighting those maps by the global-average-pooled gradients, and combining them into a class-discriminative heat map projected onto the input image.

11. The COMPAS investigation by ProPublica demonstrated that algorithmic opacity in criminal justice is a problem because it:

Correct. The accountability failure was specifically about contestability: defendants were given a number — the risk score — but no basis on which to challenge it. Judges could not explain it. Lawyers could not examine it. An uninterpretable score affecting a person's liberty is a due process problem independent of whether the score is "accurate."

Not correct. The core accountability failure was that defendants facing sentencing could not understand, challenge, or appeal a score that could affect years of their freedom. The opacity made contestability structurally impossible.

12. What are Sparse Autoencoders designed to recover from neural network activations?

Correct. SAEs are trained with a sparsity penalty that forces the hidden layer to reconstruct activations using as few hidden units as possible. This pressure produces monosemantic features — each hidden unit in the SAE corresponds to one concept — reversing the polysemanticity that superposition creates in the original model.

Not correct. SAEs aim to recover the monosemantic features hidden in superposed model activations. By training with a sparsity constraint, they decompose the overlapping feature representations into individually interpretable units.

13. "Induction heads" emerge consistently in transformers and implement which two-step operation?

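Correct. Induction heads work as part of a two-head circuit: a previous-token head in an earlier layer records, at each position, which token came just before it; the induction head in a later layer then attends to the position that followed an earlier occurrence of the current token and copies the token found there. This lookup-and-copy operation is a core mechanism for in-context learning, the ability to exploit patterns from earlier in the context window.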

Not correct. Induction heads implement a two-step lookup-copy: find where the current token appeared previously in context, then predict the token that followed it there. This simple algorithm underlies a large fraction of in-context learning behavior in transformers.

14. The JPMorgan Chase SHAP credit audit found that features like grocery spending and commute distance were problematic specifically because they were:

Correct. The audit found that these facially neutral features encoded race and redlining history — not through explicit design, but because patterns of grocery shopping and commute distance are heavily correlated with racially segregated neighborhood structures created by decades of discriminatory lending and zoning. SHAP made this proxy relationship legible.

Not correct. The problem was that these features appeared demographically neutral but were encoding protected attributes through correlation — grocery spending and commute distance reflect racially segregated neighborhood patterns shaped by redlining history. SHAP revealed what the model was actually using them for.

15. The "scaling gap" is described as the central open problem for mechanistic interpretability because:

Correct. Circuits work has been done on toy transformers, InceptionV1-scale vision networks, and BERT-scale language models. GPT-4 and frontier models are orders of magnitude larger and may have qualitatively different organizational principles. Whether circuit-level understanding can scale to these models is one of the central open questions in AI safety research.

Not correct. The scaling gap refers to the fact that all rigorous mechanistic results have been obtained on small models, and it remains unknown whether the same approaches will yield tractable understanding of frontier-scale models with billions or trillions of parameters.