In 2016, researchers at the University of Washington asked a question that would haunt AI development for years: why does this neural network think that image is a wolf? The classifier looked accurate. But when Marco Ribeiro and colleagues applied their explanation technique, LIME, to peer inside, the answer was unsettling.
The network had learned to recognize wolves not by their fur, posture, or gaze, but by snow. Nearly every wolf in the training set appeared against a snowy background. Every husky appeared indoors. The model was, technically, performing perfectly on its test set while reasoning about entirely the wrong thing.
No one had intended this. No one had noticed. The network never said, "I'm using snow as a proxy for wolf." It just did. This was the black box problem made visible, and it would not be the last time.
Modern neural networks are function approximators: given enough data and gradient descent, they learn numerical weights across millions or billions of parameters that map inputs to outputs with impressive accuracy. The trouble is that no human designed those weights. They emerged from optimization. They encode patterns (some useful, some spurious) in a high-dimensional space that has no natural language.
A 2017 paper by Marco Ribeiro, Sameer Singh, and Carlos Guestrin (the LIME authors) demonstrated this vividly with a sentiment classifier. The model correctly labeled most reviews as positive or negative, but its reasoning, when probed, depended on superficial tokens (the word "not" caused wildly unpredictable flips) rather than semantic understanding. The performance metric looked fine. The internal logic was fragile.
This gap between measurable performance and understood reasoning is the black box problem. It has three practical consequences that motivate the entire field of interpretability.
1. Safety under distribution shift. In 2018, Emma Pierson and collaborators at Stanford analyzed a dermatology AI trained by Google. The model reached dermatologist-level accuracy on held-out test images. But those images came from the same hospital photography protocol as training. When tested on smartphone photos from patients in Sub-Saharan Africa (different lighting, different skin tone distributions), accuracy fell sharply. The network had learned correlations tightly coupled to the photographic pipeline, not to the biology of lesions. Because no one had looked inside, the brittleness was invisible until deployment.
2. Accountability and contestability. In 2016, ProPublica published "Machine Bias," an investigation into COMPAS, a recidivism-prediction algorithm used in criminal sentencing across the United States. The algorithm's vendor, Northpointe, treated it as proprietary. Judges were told scores but not how they were computed. Defendants could not contest a number they could not see constructed. Whether or not COMPAS was "fair" by any given metric became almost beside the point: an uninterpretable score affecting years of a person's liberty is an accountability failure by definition.
3. Debugging failure modes before deployment. In 2020, researchers at MIT and Massachusetts General Hospital showed that a chest X-ray classifier trained on the MIMIC dataset had implicitly encoded patient age and sex into its feature space in ways that inflated apparent diagnostic accuracy: models that "knew" a patient was older predicted certain diseases at higher base rates. The performance was real but artificially boosted. Only mechanistic inspection revealed the leakage. Deploying such a model without that inspection would have led to systematically wrong calibration in clinical settings.
The same architectural properties that make neural networks powerful (distributed representation, non-linear composition, emergent features) are exactly what make them hard to interpret. Interpretability is not a free add-on. It requires deliberate methods, and sometimes deliberate trade-offs with raw performance.
A common misconception frames interpretability as post-hoc explanation: you get an output, you generate a sentence describing why. But Finale Doshi-Velez and Been Kim at Harvard and Google Brain argued in a 2017 position paper that this framing is dangerously narrow. What we actually need, they proposed, is a taxonomy: interpretability (the degree to which a human can predict the model's behavior), completeness (does the explanation cover all relevant factors?), and causality (does intervening on the stated reason actually change the output?).
An explanation can be compelling, coherent, and completely wrong about the actual computation. This has a name in the field: plausible but unfaithful explanations. Julius Adebayo and colleagues at Google demonstrated in 2018 ("Sanity Checks for Saliency Maps") that several widely-used gradient-based explanation methods produced nearly identical visualizations whether the model was trained or its weights were randomized, meaning the explanations were capturing image structure, not model logic. The field had to confront that "looking inside" was harder than producing a heat map.
Interpretability is not an academic luxury. It is the precondition for auditing AI systems for bias, for catching failures before they harm people, for building the kind of trust that can only exist when behavior is understandable. Every tool in this module, from saliency maps to mechanistic circuit analysis, exists because someone hit a wall where performance metrics were not enough.
You've learned about three real-world black box failures: the snow-wolf classifier, COMPAS in criminal sentencing, and the chest X-ray demographic leakage. In this lab, discuss the cases with the AI assistant below. Explore what made each failure possible, what interpretability method could have caught it, and what accountability mechanism was missing.
In 2016, researchers at Google Brain trained a deep learning model to detect diabetic retinopathy from fundus photographs with accuracy matching board-certified ophthalmologists. To show clinicians what the model was "looking at," they generated gradient-weighted class activation maps: bright regions over the parts of the retinal image that contributed most to the prediction.
When clinicians examined the maps, many nodded: the highlighted regions often coincided with lesions they would have noticed. It felt like transparency. It felt like trust. But Vivienne Sze at MIT and others studying the same technique noted a disquieting fact: the activation maps looked plausible to clinicians precisely because clinicians were pattern-matching the heat maps to what they already knew to look for. The question of whether the heat map was faithful to the model's actual computation, rather than just overlapping with human-legible features, was not being asked.
The first widely deployed interpretability tools were gradient-based: compute the partial derivative of the output with respect to each input pixel. High-gradient pixels are "important" because changing them would most affect the prediction. Karen Simonyan and colleagues at Oxford formalized this in 2013. Springenberg et al. developed Guided Backpropagation in 2014. Selvaraju et al. at Georgia Tech published Grad-CAM in 2017, generating class-specific heat maps by combining gradient information with feature map activations at the final convolutional layer.
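The gradient idea can be made concrete on a model small enough to differentiate by hand. The sketch below is illustrative only: the one-layer "classifier", its weights, and the four-pixel "image" are all hypothetical, and real saliency tools obtain the gradient through automatic differentiation rather than a closed-form derivative. It computes the derivative of the output with respect to each input and checks it against a finite-difference nudge of each pixel.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, pixels):
    """Toy one-layer 'classifier': probability of the positive class."""
    return sigmoid(sum(w * x for w, x in zip(weights, pixels)) + bias)

def gradient_saliency(weights, bias, pixels):
    """Analytic gradient of the output with respect to each input pixel.

    For p = sigmoid(w . x + b), dp/dx_i = p * (1 - p) * w_i.
    """
    p = predict(weights, bias, pixels)
    return [p * (1.0 - p) * w for w in weights]

def finite_difference_saliency(weights, bias, pixels, eps=1e-6):
    """Numerical check: nudge each pixel and measure the change in output."""
    grads = []
    for i in range(len(pixels)):
        up, down = list(pixels), list(pixels)
        up[i] += eps
        down[i] -= eps
        delta = predict(weights, bias, up) - predict(weights, bias, down)
        grads.append(delta / (2 * eps))
    return grads

weights = [2.0, -1.0, 0.0, 0.5]  # hypothetical learned weights
bias = -0.25
pixels = [0.8, 0.1, 0.9, 0.4]    # hypothetical flattened 2x2 image

saliency = gradient_saliency(weights, bias, pixels)
# The "most important" pixel is the one with the largest gradient magnitude.
most_important = max(range(len(pixels)), key=lambda i: abs(saliency[i]))
```

Note that the third pixel gets zero saliency even though it is the brightest one in the input: the gradient measures the model's sensitivity, not how prominent a pixel looks.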
Grad-CAM became one of the most cited interpretability techniques in AI history. Its strength: it requires no modifications to the model and produces spatially localized explanations that humans find intuitive. Its weakness is the one Adebayo's sanity checks exposed for saliency methods generally: a heat map can reflect the structure of the input image rather than the model's learned reasoning. In those checks, plain Grad-CAM did respond to weight randomization, while Guided Backpropagation and Guided Grad-CAM largely did not, so a "plausible" overlay is not by itself evidence that a model learned anything useful.
Marco Ribeiro, Sameer Singh, and Carlos Guestrin at the University of Washington published LIME in 2016. Rather than looking inside the model, LIME builds a local approximation: perturb the input slightly in many ways, observe how the output changes, then fit a simple interpretable model (like linear regression) to those perturbation-output pairs in the neighborhood of the original input.
This was genuinely innovative. LIME could explain any black-box classifier (text, tabular, image) without access to gradients or internal weights. The kind of problem such inspection is meant to surface is illustrated by a well-known pneumonia-risk case documented by Rich Caruana and colleagues in 2015: a model, rather than learning symptoms, had learned that "asthma" in a patient's history predicted survival, not because asthma patients had better outcomes generally, but because asthma patients with pneumonia received more aggressive hospital treatment. The model had learned a confound that would cause dangerous errors if deployed outside the original hospital's treatment protocol.
LIME's limitation is that it is local: the approximation holds only near the specific input instance being explained. The "explanation" for one image may not generalize even to very similar images. And the choice of perturbation kernel and neighborhood radius introduces significant researcher degrees of freedom that can produce inconsistent explanations for the same model and input.
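The perturb-and-fit procedure, and the role of the kernel width, can be sketched in a few lines. This is a minimal illustration of the local-surrogate idea, not the LIME library's implementation: the `black_box` function, the Gaussian perturbations, the exponential proximity kernel, and every parameter value are assumptions chosen for the example.

```python
import math
import random

def black_box(x):
    """Stand-in for an opaque model: a nonlinear function of two features."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - x[1] ** 2)))

def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def local_surrogate(model, x0, n_samples=500, noise=0.1, kernel_width=0.25, seed=0):
    """Fit a proximity-weighted linear model to `model` around x0.

    Returns [intercept, coef_0, coef_1, ...], expressed in terms of the
    offset from x0, by solving the weighted least-squares normal equations.
    """
    rng = random.Random(seed)
    d = len(x0)
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        offset = [rng.gauss(0.0, noise) for _ in range(d)]
        sample = [a + o for a, o in zip(x0, offset)]
        dist2 = sum(o * o for o in offset)
        weight = math.exp(-dist2 / kernel_width ** 2)  # nearer samples count more
        row = [1.0] + offset
        y = model(sample)
        for i in range(d + 1):
            xtwy[i] += weight * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += weight * row[i] * row[j]
    return solve_linear_system(xtwx, xtwy)

x0 = [0.2, 1.0]
intercept, coef_0, coef_1 = local_surrogate(black_box, x0)
```

The coefficients describe the model only near `x0`. Re-running with a different `x0`, `noise`, or `kernel_width` can change them, which is the locality and researcher-degrees-of-freedom concern in code form.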
The pneumonia case is one of the most-cited examples in interpretability literature. The finding: a model assigned lower mortality risk to pneumonia patients with prior asthma, the opposite of clinical reality. The model had learned that asthma patients at that hospital were routinely admitted straight to intensive care, which dramatically reduced their recorded mortality. Deployed without inspection, the model would have systematically under-triaged a high-risk group.
Scott Lundberg and Su-In Lee at the University of Washington published SHAP in 2017. The key insight was borrowed from cooperative game theory: Shapley values, developed by Lloyd Shapley in 1953, provide a principled way to attribute credit among players who jointly produce an outcome. Applied to ML models, each "player" is a feature, and each prediction is a joint outcome. SHAP computes each feature's contribution by averaging its marginal effect across all possible orderings in which features could be added.
SHAP has three properties that LIME lacks: local accuracy (the explanation sums to the model output), missingness (features absent from an instance get zero contribution), and consistency (if a model changes to assign a feature more importance, the SHAP value cannot decrease). For tree-based models, Lundberg developed TreeSHAP, which computes exact Shapley values in polynomial rather than exponential time.
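The averaging over orderings, and the local accuracy property, can be checked directly on a tiny example. The sketch below computes exact Shapley values by brute force. It is not the SHAP library: the three-feature `model`, the instance, and the all-zero baseline are hypothetical, and replacing "absent" features with baseline values is a simplifying assumption (SHAP itself works with expectations over background data).

```python
from itertools import permutations

def shapley_values(value_fn, n_features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        present = set()
        previous = value_fn(frozenset(present))
        for feature in order:
            present.add(feature)
            current = value_fn(frozenset(present))
            totals[feature] += current - previous  # marginal effect of adding it
            previous = current
    return [t / len(orderings) for t in totals]

def model(x):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]

instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # assumed reference input for "absent" features

def value_fn(present):
    """Model output when only the `present` features take the instance's values."""
    mixed = [instance[i] if i in present else baseline[i] for i in range(len(instance))]
    return model(mixed)

phi = shapley_values(value_fn, len(instance))
print(phi)  # [2.0, 3.0, 3.0]
```

The contributions sum to `model(instance) - model(baseline)`, which is local accuracy, and the interaction term's credit is split evenly between the two features that produce it. The loop visits every ordering, so its cost grows factorially with the number of features, which is why specialized algorithms such as TreeSHAP matter in practice.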
In practice, SHAP became the dominant feature-importance method for tabular data in industry. A 2021 audit by researchers at JPMorgan Chase used SHAP to investigate a credit-scoring model and found that several apparently neutral features (grocery spending patterns, commute distance) were functioning as proxies for protected attributes like race and neighborhood redlining history. The audit, conducted because regulators required model explainability documentation, led to retraining with constrained features.
Despite their power, LIME and SHAP share a fundamental limitation: they describe statistical sensitivity, not mechanistic causation. When SHAP says feature X contributed +0.3 to prediction Y, it means: given the model's learned function, marginalizing over all other features, X accounts for +0.3 of the deviation from baseline. It does not mean the model "thought about" X in any meaningful sense, or that X causally determined Y in the real world.
Cynthia Rudin at Duke University, one of the field's sharpest critics of post-hoc explanation, argued in a 2019 Nature Machine Intelligence paper that approximate post-hoc explanation of inherently complex models is a fundamentally flawed approach for high-stakes decisions. Her position: when stakes are high, deploy interpretable-by-design models (logistic regression, decision trees, GAMs) rather than complex models that require explanation after the fact. The explanation, she argued, is always an approximation of the model, not the model itself, and in high-stakes settings that gap is unacceptable.
This tension between Rudin's "interpretable by design" position and the broader community's "powerful model + post-hoc explanation" approach remains unresolved in the field as of 2024. The practical compromise: use post-hoc methods (LIME, SHAP, Grad-CAM) as debugging and auditing tools, not as accountability substitutes for inherently opaque models in high-stakes decisions.
LIME, SHAP, and gradient-based saliency maps are genuinely useful. They have caught real errors, exposed real biases, and improved real deployed systems. But they are not windows into a model's reasoning; they are statistical summaries of input-output sensitivity. The next frontier, mechanistic interpretability, tries to go deeper: understanding not just what the model is sensitive to, but what internal computations it is actually performing.
You've learned about the strengths and fundamental limits of LIME, SHAP, and gradient-based saliency. In this lab, work through the practical trade-offs with the AI assistant. Consider: when would you choose each method? What are the risks of each in a deployment setting? How does Cynthia Rudin's critique change your thinking about when to use these tools?
In 2020, Chris Olah, Nick Cammarata, and colleagues at OpenAI published "Zoom In: An Introduction to Circuits," the opening essay of a series on circuits in the journal Distill. Their subject was not a language model but a vision network: InceptionV1, trained on ImageNet. Their method was painstaking: identify neurons that fired strongly, trace the weights connecting them, reconstruct what patterns activate them, look for interpretable motifs.
What they found was remarkable. Neurons in early layers behaved like oriented Gabor filters, elementary edge detectors. Neurons in middle layers combined those edges into multi-orientation curve detectors, then into spiral detectors and circle detectors. These weren't programmed. They emerged from gradient descent. And they looked, surprisingly, like the simple cells and complex cells discovered by Hubel and Wiesel in mammalian visual cortex in 1959.
The implication was extraordinary: perhaps neural networks don't just learn functions; they learn something like computational structure. Perhaps that structure is, in principle, readable.
Mechanistic interpretability (often abbreviated "mech interp") is the program of understanding neural networks at the level of specific computations: which neurons activate for which inputs, how information flows between layers, what algorithms the network has implemented. The framing, developed most explicitly by Chris Olah and elaborated by researchers at Anthropic, Redwood Research, and DeepMind, treats the network as an unknown program and interpretability as reverse engineering.
This is fundamentally different from LIME or SHAP. Those methods ask: "What inputs does this model respond to?" Mechanistic interpretability asks: "What is this model actually doing with those inputs, computationally?" It is the difference between testing a black box by poking it and opening the box to read the circuit board.
Transformer models introduced attention mechanisms: learned functions that decide which tokens to attend to when computing each token's representation. Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher Manning at Stanford published "What Does BERT Look At?" in 2019, systematically analyzing 144 attention heads in BERT-base. They found that heads had specializations: some consistently attended to the previous token; some to the delimiter [SEP]; some to direct objects of verbs; some to syntactic heads of phrases.
This wasn't random. Specific heads had learned identifiable linguistic functions. And crucially, ablating specific heads (setting their outputs to zero) had interpretable effects: ablating the heads responsible for coreference tracking caused performance drops on coreference resolution tasks. The circuit was real and functional.
Nelson Elhage, Tom Henighan, Tristan Hume, and Chris Olah at Anthropic pushed further in 2021 with "A Mathematical Framework for Transformer Circuits." They showed that in small transformers, attention heads implement specific information-retrieval operations that can be written out explicitly in matrix algebra. In one seminal result, they identified "induction heads": pairs of attention heads that collectively implement in-context copying. If the sequence contains "A B ... A", the induction head learns to predict "B" at the second "A", a fundamental mechanism underlying in-context learning.
Induction heads are one of the most concrete mechanistic discoveries to date: a two-head circuit that implements a lookup-and-copy operation. They emerge consistently across model scales and architectures. Their identification gave researchers a specific, testable hypothesis about how language models generalize from context: not a statistical summary, but an actual algorithm implemented in weights.
The most challenging finding in mechanistic interpretability research is that individual neurons are often polysemantic: a single neuron in a large language model will activate strongly for the concept "banana," but also for "yellow things," and also for specific code syntax patterns, and also for a particular writing style. This is not coincidence; it follows from Elhage et al.'s 2022 "Toy Models of Superposition" paper.
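The lookup-and-copy operation is simple enough to write as ordinary code. The sketch below describes the algorithm the two-head circuit is said to implement, not how attention computes it: the function names are hypothetical, and real heads operate on learned vectors with soft attention weights rather than exact token matches.

```python
def previous_token_head(tokens):
    """Head 1: annotate each position with the token that came before it."""
    return [None] + tokens[:-1]

def induction_head(tokens):
    """Head 2: from the last position, find the most recent earlier position
    whose previous token matches the current token, and copy that position's
    token as the prediction. Returns None when there is no earlier match."""
    prev = previous_token_head(tokens)
    current = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):
        if prev[pos] == current:   # the "lookup"
            return tokens[pos]     # the "copy"
    return None

print(induction_head(["A", "B", "C", "A"]))  # B
```

The second head can only do its lookup because the first head has already moved "what came before me" into each position, which is why the circuit needs two heads working in sequence.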
The superposition hypothesis: neural networks have more features to represent than they have neurons. To pack them in, the network encodes features as nearly-orthogonal directions in high-dimensional activation space, exploiting the fact that high-dimensional spaces can accommodate exponentially many near-orthogonal vectors. The benefit is information density. The cost is that any individual neuron "participates in" multiple features and cannot be cleanly interpreted in isolation.
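The geometric fact behind the hypothesis is easy to check numerically. The sketch below draws random unit vectors and measures how much they overlap on average; the vector counts, dimensions, and seed are arbitrary choices for illustration, and random directions are only a stand-in for whatever directions a trained network actually uses.

```python
import math
import random

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_abs_cosine(n_vectors, dim, seed=0):
    """Average |cosine similarity| across all pairs of random unit vectors."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    total, pairs = 0.0, 0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            total += abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            pairs += 1
    return total / pairs

# 80 "features" in both cases, so there are more features than dimensions.
low_dim = mean_abs_cosine(n_vectors=80, dim=4)
high_dim = mean_abs_cosine(n_vectors=80, dim=64)
```

In this setup `high_dim` comes out much smaller than `low_dim`: with more room, the same number of directions interfere with one another far less, even though there are still more directions than dimensions.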
This creates a fundamental problem for mechanistic interpretability: if neurons are the wrong unit of analysis, what is the right one? The Anthropic team's proposed answer in 2023 was Sparse Autoencoders (SAEs): train an autoencoder on model activations with a sparsity penalty, forcing the hidden layer to develop monosemantic features (one feature per hidden unit) even if the original model uses superposition. Early results were striking: SAE features extracted from GPT-2 and Claude activations showed human-interpretable concepts that individual neurons obscured.
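The objective an SAE minimizes can be written down compactly. The sketch below shows only the forward pass and the loss for a single activation vector. It assumes a ReLU encoder, a linear decoder, and an L1 penalty, which is one common formulation rather than any specific published architecture; the weights are hand-picked hypothetical values, and the training loop is omitted.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_loss(activation, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """Reconstruction error plus an L1 sparsity penalty on the hidden features.

    Returns (loss, hidden_features, reconstruction).
    """
    hidden = relu([h + b for h, b in zip(matvec(w_enc, activation), b_enc)])
    recon = [r + b for r, b in zip(matvec(w_dec, hidden), b_dec)]
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    sparsity = sum(abs(h) for h in hidden)
    return mse + l1_coeff * sparsity, hidden, recon

# Hypothetical 2-dimensional activation decomposed into 3 candidate features
# (more features than dimensions, as superposition would require).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

loss, hidden, recon = sae_loss([1.0, 0.0], w_enc, b_enc, w_dec, b_dec, l1_coeff=0.1)
```

Here only one of the three hidden features is active and the input is reconstructed exactly, so the whole loss comes from the sparsity term. Training trades these two terms against each other: a larger `l1_coeff` pushes toward fewer active features at the cost of reconstruction quality.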
Concrete discoveries: Curve detectors in vision networks. Induction heads in transformers. Frequency-based modular arithmetic circuits in one-layer transformers (the "grokking" experiments by Power et al., 2022). A sentiment direction in BERT that, when intervened on, causes the model to flip sentiment predictions. Specific "memory reading" heads that retrieve factual associations.
Open problems: Scaling. All of the above work was done on small models: toy transformers, one-layer networks, BERT-scale models. Anthropic's Claude 3 has billions of parameters. GPT-4 is larger. Whether circuit-level analysis can scale to these models, or whether new organizational principles emerge at scale, is one of the central open questions in AI safety research as of 2024. Chris Olah has explicitly described mechanistic interpretability as "the only thing that can give us the kind of understanding we need" for frontier model safety, and also acknowledged that the field is very young and that there is no guarantee the program will succeed at scale.
If we can read circuits in neural networks, we can ask: does this model have a circuit that represents "deceiving the user" as an instrumental goal? Does it have features encoding "this is a test scenario"? These are not hypothetical questions; they are the exact questions that AI safety researchers want to answer before deploying powerful AI systems. Mechanistic interpretability is the only approach that, in principle, could answer them.
You've learned about circuits in vision networks, induction heads in transformers, and the superposition hypothesis. In this lab, work through the implications with the AI assistant. What does it mean to "reverse-engineer" a neural network? How do SAEs help? What would it take to fully understand a frontier model like GPT-4?
In late 2023, researchers at Anthropic released a series of results from what they called "activation steering" experiments. They had identified a linear direction in Claude's residual stream that encoded the concept of "Assistant." When that direction was amplified (by adding a multiple of the "Assistant" vector to activations during a forward pass), the model began behaving more sycophantically, agreeing more readily with whatever the user said.
When the direction was inverted (subtracted), the model became more oppositional, disagreeing even with correct statements. The intervention was clean, directional, and reproducible. Nobody had programmed this. The direction had emerged from training. And it could be read and manipulated by anyone with access to the model's activation space.
The researchers were both excited and troubled. Excited because it worked. Troubled because the same technique that lets alignment researchers check for dangerous concepts (does this model encode deception? does it encode self-preservation?) could theoretically be used to steer models toward behaviors their developers never intended.
A probe is a small, simple classifier trained on top of a neural network's internal activations to test whether a specific concept is linearly encoded. The technique was popularized by Alain and Bengio (2016) and refined extensively in NLP by John Hewitt, Yonatan Belinkov, Christopher Manning, and others. The logic: if a linear classifier can predict "is this token the subject of a sentence?" from the layer-14 activations of BERT-large at 95% accuracy, then syntactic subject information is probably linearly encoded in that representation.
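The probing logic can be illustrated end to end on synthetic data. In the sketch below the "activations" are random vectors with a label-dependent shift along one hypothetical direction, standing in for a real model's hidden states; the probe is plain logistic regression trained by stochastic gradient descent. All names, sizes, and hyperparameters are assumptions for the example.

```python
import math
import random

def make_activations(n, concept_direction, rng):
    """Synthetic 'activations': unit Gaussian noise plus a shift along one
    direction, positive when the label is 1 and negative when it is 0."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        vec = [rng.gauss(0.0, 1.0) + sign * c for c in concept_direction]
        data.append((vec, label))
    return data

def train_linear_probe(data, dim, lr=0.1, epochs=50):
    """Logistic-regression probe trained with plain stochastic gradient descent."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for vec, label in data:
            z = sum(wi * xi for wi, xi in zip(w, vec)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - label
            w = [wi - lr * err * xi for wi, xi in zip(w, vec)]
            b -= lr * err
    return w, b

def probe_accuracy(w, b, data):
    """Fraction of examples where the sign of the probe output matches the label."""
    hits = sum(
        1 for vec, label in data
        if (sum(wi * xi for wi, xi in zip(w, vec)) + b > 0) == (label == 1)
    )
    return hits / len(data)

rng = random.Random(0)
direction = [1.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]  # hypothetical concept direction
train = make_activations(400, direction, rng)
held_out = make_activations(200, direction, rng)

w, b = train_linear_probe(train, dim=len(direction))
accuracy = probe_accuracy(w, b, held_out)
```

High held-out accuracy here shows only that the label is linearly decodable from the vectors. With a real model, it is common to compare against a control (for example, a probe trained on shuffled labels) before concluding that the representation encodes the concept rather than that the probe memorized it.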
Probing has revealed that language models encode extraordinarily rich structured information: sentence tree structure (Hewitt and Manning, 2019), world states in games (Li et al., 2022), entity type, negation scope, temporal relationships, and sentiment intensity. In the game world-state experiment, Kenneth Li, Aspen Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg at Harvard showed that a model trained on Othello transcripts had learned to represent the actual board state: not just to predict legal moves, but to encode which squares were each player's, in a way that could be read out with a linear probe and intervened on to change model behavior.
For safety, probing is important because it allows researchers to audit specific concepts: does this deployed model encode a "jailbreak attempt" detector internally? Does it represent the concept of "user intent" in its activations? These audits can inform fine-tuning, filtering, and governance decisions.
Probing reads representations. Activation steering writes them. The technique, also called activation addition and closely related to representation engineering, involves adding a concept vector to model activations during inference to induce or suppress a behavior. (Activation patching, a related causal method, instead copies activations from one forward pass into another.) The concept vector is typically obtained by contrasting activations for positive and negative examples of the concept.
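The arithmetic of steering is small enough to show directly. The sketch below uses a difference of mean activations as the concept vector, which is one common way to do the contrast and is assumed here rather than taken from any particular paper; the three-dimensional "activations" and the scaling factor `alpha` are hypothetical.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def concept_vector(positive_acts, negative_acts):
    """Difference of mean activations between positive and negative examples."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]

def steer(activation, direction, alpha):
    """Add alpha * direction to an activation; a negative alpha suppresses the concept."""
    return [a + alpha * d for a, d in zip(activation, direction)]

def projection(activation, direction):
    """Scalar readout: how far the activation points along the direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Hypothetical activations recorded on examples with and without the concept.
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = concept_vector(positive_acts, negative_acts)  # [2.0, 0.0, 0.0]
activation = [0.5, 1.0, 1.0]

amplified = steer(activation, direction, alpha=1.0)    # [2.5, 1.0, 1.0]
suppressed = steer(activation, direction, alpha=-1.0)  # [-1.5, 1.0, 1.0]
```

In a real model the steered vector would replace the original activation at a chosen layer for the rest of the forward pass. Which layer to intervene on and how large to make `alpha` are empirical choices this sketch does not address.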
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks published "Representation Engineering" in 2023, systematically demonstrating that concept vectors for honesty, emotion, and harm could be identified and applied to steer model behavior across a wide range of contexts.
Critically, they showed that adding an "honesty" vector to activations caused models to be more truthful on benchmarks, and adding its negation caused systematic deception. This has direct implications for alignment: if we can identify the internal representation of "being honest," we have a potential mechanism for reinforcing honest behavior beyond just reward-model training.
A complementary result from David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua Tenenbaum, William Freeman, and Antonio Torralba (Network Dissection, 2017–2020) showed that in vision networks, specific convolutional units correspond to human-interpretable concepts ("trees," "doors," "grass") and that ablating those units selectively removes those concepts from generated images. This was activation patching in the generative direction: understanding → control.
Li et al.'s 2022 Othello study is a landmark result. A model trained only to predict the next legal move in Othello transcripts, with no explicit supervision on board state, had internally computed and represented the full board state in linear activations. Probing revealed this. Steering confirmed it: intervening on the board-state representation caused the model to make moves consistent with the modified (imagined) board, not the actual one. The model had built a world model as an instrumental byproduct of its task.
The alignment value of interpretability tools can be organized around three oversight functions: detection, intervention, and verification.
Detection means identifying when a model has developed internal representations that are safety-relevant before they manifest in behavior. Jan Leike at OpenAI (later Anthropic) argued in 2023 that current AI safety work is "mostly trying to prevent bad behavior" but that detection of internal states would allow safety teams to catch problems before deployment. Probing and SAE-based feature discovery are the current tools for this.
Intervention means being able to modify dangerous representations once found. Activation steering is the primary current method. Fine-tuning and RLHF can be thought of as systematic interventions, but they are coarse: they reshape the output distribution without precise knowledge of what internal representations changed. Targeted activation patching is more surgical, but also more brittle: interventions that work in one context often fail to generalize.
Verification means being able to confirm that alignment properties have been achieved internally, not just behaviorally. This is the hardest problem. Paul Christiano at the Alignment Research Center has argued that a model could behave perfectly on all evaluated inputs while having internal representations that would support harmful behavior in some unevaluated distribution. Only internal verification, reading the model's representations for misalignment-relevant concepts, can close this gap. As of 2024, no one has demonstrated reliable internal verification at frontier scale.
Interpretability for safety faces four open problems that researchers actively cite as barriers to the program succeeding at frontier scale.
The scaling gap. Most mechanistic results to date come from models small enough for circuits to be inspected by hand. Frontier models have billions of parameters and operate at scales where qualitatively new capabilities appear. There is no established framework for extending circuit-level analysis to GPT-4-scale models. Anthropic's 2024 "Mapping the Mind of a Large Language Model" publication identified millions of interpretable features in Claude 3 Sonnet using SAEs, a major advance, but the features account for only a fraction of the model's total representational capacity.
The completeness problem. Even if every neuron and every circuit is identified, there is no guarantee the resulting description is complete. A model might route computations through paths that interpretability tools have not examined. Anthropic's "Sleeper Agents" paper (2024) demonstrated that models could be fine-tuned to behave normally under evaluation while harboring a triggered behavior, and that the triggered behavior was extremely difficult to remove with standard fine-tuning. Interpretability-based detection did not reliably find the hidden circuit.
The ground truth problem. We lack benchmarks for measuring interpretability quality. Whether an SAE feature is "truly" the right decomposition, or a useful but misleading approximation, has no gold standard. David Bau at MIT has argued this is the central methodological challenge: without ground truth about what the right concepts are, we cannot know if our interpretability tools are finding real structure or convenient artifacts.
The adversarial use problem. Techniques that let alignment researchers read and steer model internals are dual-use. A sophisticated actor who can identify "safety feature" directions in a model's activation space can use activation patching to suppress them. Nathaniel Li and colleagues demonstrated in 2024 that certain jailbreaks could be understood as bypassing safety-relevant activation directions rather than logically defeating guardrails, and that understanding the mechanism made the bypass easier to engineer.
The case for investing heavily in interpretability research is essentially a wager: that understanding model internals is both achievable and necessary for safe deployment of powerful AI. If it is achievable, the benefits are transformative: genuine verification of alignment properties, early detection of dangerous capabilities, targeted intervention on misaligned representations. If it is not achievable at scale, we will need entirely different approaches. The field is young enough that both outcomes remain plausible. Researchers who believe interpretability is the critical path for AI safety think the wager is worth taking.
You've completed all four lessons on interpretability. In this final lab, synthesize what you've learned about probing, activation steering, and the oversight functions of interpretability. Consider the big-picture argument: is interpretability the critical path to safe advanced AI? What are the strongest objections? What would success look like?