In 1936, when actuarial tables began determining life insurance premiums at scale, regulators in New York demanded that any rate change affecting a policyholder be accompanied by a written explanation. The formula was mathematical and impersonal, but the principle was clear: consequential decisions require accountable reasoning. The rules took years to codify and longer still to enforce, but they permanently shaped an industry. Statistical models making decisions about people, it turned out, needed to be legible to the people they affected.
That pattern is repeating now, faster and with far higher stakes. In 2016, a ProPublica investigation reported that the COMPAS recidivism algorithm, used by judges in at least two dozen U.S. states to inform bail and sentencing decisions, wrongly flagged Black defendants who did not go on to reoffend as high risk at roughly twice the rate of white defendants, while its internal logic remained a trade secret. In 2018, Reuters reported that Amazon had scrapped a machine-learning hiring tool after engineers discovered it had systematically downgraded résumés containing the word "women's." Neither system's designers had intended discrimination. Neither could easily explain, even internally, how the discrimination had emerged. The models had learned it from historical data, encoded it in millions of numerical weights, and produced outputs no single engineer could trace to a cause.
This course is about the discipline, increasingly called Explainable AI or XAI, that tries to bridge that gap. It covers why modern neural networks are opaque by construction, what techniques researchers and practitioners use to open them up, and what trade-offs arise when transparency conflicts with accuracy or speed. It will not make every AI system interpretable; some complexity is genuinely irreducible. What it will do is give you the vocabulary, the conceptual frameworks, and the practical methods to ask the right questions, and to demand honest answers, when an algorithm makes a decision that matters.
A judge in Broward County, Florida opens a pre-sentencing report on a defendant named Vernon Prater. Appended to the report is a score (3 out of 10, low risk of reoffending) generated by a software system called COMPAS, sold by the company Northpointe. Vernon Prater had prior convictions; the score said low risk anyway. He was released. He was subsequently arrested for breaking and entering and sentenced to eight years. Across town, a woman named Brisha Borden had been scored 8 out of 10, high risk, for stealing a bicycle. Her only prior record was for juvenile misdemeanors. The algorithm's internal weights, the training data that shaped them, and the precise logic connecting inputs to scores: all of it was proprietary. The judge could see the number. No one in the courtroom could interrogate the reasoning behind it.
ProPublica published its analysis of 7,000 such cases in May 2016. The headline finding: the algorithm was not particularly accurate, and its errors were racially asymmetric. But the deeper finding, the one that rippled through computer science, law, and philosophy departments, was structural. Nobody could explain why COMPAS produced any particular score, not even Northpointe. The model had been trained, it had learned statistical patterns from historical criminal-justice data, and it had encoded those patterns in a form no human could read. The score appeared authoritative. Its foundations were invisible.
The term "black box" in AI refers to any system whose internal mechanism (how inputs are transformed into outputs) is either inaccessible, too complex to interpret, or both. It is not a metaphor for secrecy alone. A decision tree with ten nodes can be fully inspected; a neural network with billions of parameters cannot be meaningfully read even when every weight is publicly available.
The distinction that matters is between interpretability and accuracy. For most of the twentieth century, the models that could be interpreted well (linear regression, logistic regression, shallow decision trees) were also the ones used in high-stakes settings, partly because regulators and courts could audit them. The arrival of deep learning after 2012 forced a trade-off: neural networks started outperforming interpretable models substantially, but their internal logic became correspondingly opaque. Performance improved; explainability collapsed.
Researchers distinguish three sources of opacity in modern AI systems: secrecy, scale, and emergent complexity. Conflating them leads to confused solutions.
The stakes argument is not hypothetical. Between 2015 and 2020, AI-based systems were deployed in credit scoring (Fair Isaac Corporation's FICO updates incorporating ML), medical diagnosis (IBM Watson for Oncology, deployed at MD Anderson Cancer Center before being quietly discontinued in 2017 after producing unsafe recommendations), child welfare risk scoring (Allegheny County, Pennsylvania's Allegheny Family Screening Tool, live since 2016), and predictive policing (PredPol, used by dozens of U.S. police departments). In each case, opaque models were making or heavily influencing decisions about individual human lives. In each case, the people affected had limited or no ability to understand or contest the reasoning.
The legal and regulatory response has been incremental but real. The EU's General Data Protection Regulation, effective May 2018, introduced a limited "right to explanation" for automated decisions. The EU AI Act, passed in 2024, classifies certain AI applications as high-risk and mandates transparency requirements. The U.S. Equal Credit Opportunity Act has long required lenders to give specific reasons for credit denials, creating friction between that law and the opaque ML models now used for underwriting.
Opacity is not the same as bias, and explainability is not the same as fairness. A fully interpretable model can embed discriminatory logic; an opaque model can produce equitable outcomes. Explainability is a precondition for diagnosing and correcting problems, bias included; it is not a guarantee of avoiding them.
For most practical prediction tasks through roughly 2010, the best-performing models were also reasonably interpretable. Logistic regression models used in clinical settings could be printed on a laminated card. Decision trees for loan approvals could be drawn on a whiteboard. Credit scorecards were literally cards.
The ImageNet competition, which AlexNet won in 2012 with a convolutional neural network, demonstrated that deep architectures could achieve dramatically lower error rates than classical methods, at the cost of interpretability. AlexNet's eight learned layers and 60 million parameters produced features no human could name. By 2015, ResNet achieved superhuman performance on ImageNet classification with 152 layers. The trade-off was not merely practical; it was structural. The very thing that made these architectures powerful, their ability to learn arbitrary high-dimensional representations from data, made them impossible to interpret by inspection.
This trade-off is real but not absolute. Subsequent research has found domains where interpretable models achieve near-parity with deep learning (structured tabular data being the clearest case), and techniques have emerged (LIME in 2016, SHAP in 2017) that approximate explanations for opaque models without requiring access to their internals. The field of Explainable AI exists precisely in this gap: trying to recover accountability from systems too powerful to sacrifice.
Black box: Any AI system whose input-to-output transformation is not humanly interpretable, whether due to secrecy, scale, or emergent complexity.
Interpretability: The degree to which a human can understand and predict the behavior of a model.
Explainability: The capacity to provide a post-hoc account of a model's specific decision.
XAI: Explainable Artificial Intelligence, the research and practice discipline addressing both.
You are investigating whether the COMPAS scoring system should continue to be used in pre-sentencing reports. Your lab assistant has background on the system, the ProPublica findings, and the broader XAI literature. Use this session to work through what a meaningful "explanation" of an algorithmic score would actually require, and what obstacles stand in the way.
In October 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted their convolutional neural network, later called AlexNet, to the ImageNet Large Scale Visual Recognition Challenge. It achieved a top-5 error rate of 15.3 percent, compared to 26.2 percent for the second-place entry. The gap was so large that most computer vision researchers initially assumed an error in the results. When confirmed, it triggered what Yann LeCun later called "the ImageNet moment": the point at which the field accepted that deep learning had fundamentally changed what was possible. What received less immediate attention was what had been sacrificed: the winning system's internal representations were features that no human could name, visualize, or reason about. The network had learned to distinguish cats from dogs through a cascade of mathematical operations that corresponded to nothing in human visual vocabulary. The performance was real. The reasoning was alien.
Neural networks learn by a process called gradient descent. The network makes a prediction, computes an error (the loss), and then adjusts all of its weights slightly in the direction that would reduce that error. This process is repeated millions or billions of times across a training dataset. After training, the weights encode statistical patterns from the data in a distributed, non-symbolic form. No single weight corresponds to a concept. No single layer corresponds to a rule. The knowledge is spread across the entire parameter space in a form that cannot be decoded by inspection.
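The loop described above can be sketched on the smallest possible model, a line with one weight and one bias. This is an illustrative sketch, not a neural network: the function name `train_linear`, the learning rate, and the toy data are all assumptions chosen for the example.

```python
def train_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Forward pass: prediction errors under the current weights.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Adjust each weight slightly in the direction that reduces the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # toy data generated by y = 2x + 1
w, b = train_linear(xs, ys)
print(round(w, 3), round(b, 3))
```

With two parameters, the learned values can be read directly as a slope and an intercept. The same update rule applied to millions of interacting weights leaves nothing comparably readable.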
This is the fundamental source of opacity. Classical software is explicit: a programmer writes if age > 65 and income < 30000, then flag. Every rule is legible. A trained neural network is implicit: the equivalent logic, if it exists in any coherent sense, has been encoded in the values of millions of weights that interact with each other non-linearly. There is no single place to look. There is no rule to read.
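The contrast can be made concrete with a small sketch. The first function is the explicit rule from the paragraph above. The second is a hypothetical two-unit network whose weights are arbitrary numbers invented for illustration (they are not trained and do not reproduce the rule); the point is only that its decision cannot be read off the code.

```python
import math

def rule_based_flag(age, income):
    # Every condition is legible and can be audited line by line.
    return age > 65 and income < 30000

def tiny_network_flag(age, income, w_hidden, w_out):
    # The same kind of decision, but the "logic" lives only in numeric weights.
    x = [age / 100.0, income / 100000.0, 1.0]  # scaled inputs plus a bias term
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden)) > 0

# Hypothetical, untrained weights chosen only for illustration.
W_HIDDEN = [[1.5, -2.0, -0.4], [-0.7, 0.9, 0.1]]
W_OUT = [1.2, -0.8]

print(rule_based_flag(70, 20000))
print(tiny_network_flag(70, 20000, W_HIDDEN, W_OUT))
```

Even at this size, answering "why was this person flagged?" for the second function means tracing arithmetic rather than reading a condition.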
Visualization research, particularly work by Chris Olah and colleagues at Google Brain and later Anthropic, has partially illuminated what neural networks learn. In a 2017 paper, Olah's team showed that neurons in convolutional networks trained on ImageNet learn to respond to specific visual patterns: curves, textures, object parts, and eventually full objects. But the relationship between these learned features and any individual prediction is not a simple chain. A prediction emerges from the interaction of thousands of features, weighted and combined through multiple layers, in a way that resists narrative description.
A separate line of research, on adversarial examples, demonstrated how alien neural network reasoning actually is. In 2013, Christian Szegedy and colleagues at Google showed that imperceptible perturbations to images, noise invisible to human eyes, could cause a well-trained network to misclassify a school bus as an ostrich. The regions the network assigned to "school bus" and "ostrich" were, apparently, nearby in input space in a way that corresponded to nothing in human perception. This was not a bug. It was a structural consequence of how gradient descent works.
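One way to see how tiny per-feature changes can flip a prediction is a toy linear classifier with many inputs. The sketch below uses a sign-of-gradient step, a technique from later work by Goodfellow and colleagues, not the optimization procedure Szegedy's paper used; the model, its weights, and the input are all hypothetical.

```python
import math

def predict_proba(weights, bias, x):
    """Logistic model: probability of class 1."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def sign_perturb(weights, x, epsilon):
    """Shift every feature by epsilon against the class-1 score.
    For a linear logit, the gradient with respect to x is the weight vector."""
    return [xi - epsilon * math.copysign(1.0, w) for w, xi in zip(weights, x)]

# Hypothetical model: 10,000 small weights with alternating sign.
n = 10000
weights = [0.01 if i % 2 == 0 else -0.01 for i in range(n)]
bias = 0.0
x = [0.51 if i % 2 == 0 else 0.49 for i in range(n)]

x_adv = sign_perturb(weights, x, epsilon=0.02)
p_clean = predict_proba(weights, bias, x)
p_adv = predict_proba(weights, bias, x_adv)
largest_change = max(abs(a - b) for a, b in zip(x, x_adv))
print(round(p_clean, 3), round(p_adv, 3), round(largest_change, 3))
```

No single feature moves by more than 0.02, yet the changes all push the score the same way, and across thousands of dimensions they add up to a different predicted class.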
If a model's reasoning cannot be traced, even in principle, to human-interpretable concepts, then "explanation" in any robust legal or ethical sense may be impossible for that architecture. The XAI field's task is partly to find approximate explanations and partly to determine when approximations are sufficient and when they are not.
The core technical obstacle to interpretability is that neural networks use distributed representations: any concept is encoded across many neurons simultaneously, and any neuron participates in encoding many concepts. This is the opposite of the symbolic AI systems of the 1980s and 1990s, where each concept had a discrete symbol and each rule was explicit. The philosopher Hubert Dreyfus spent decades arguing that human cognition was not symbol manipulation: that it was contextual, embodied, and irreducibly holistic. Neural networks, it turned out, agreed with him. They learned representations that were powerful precisely because they were not symbolic. And they became hard to explain for exactly the same reason.
Research by Yoshua Bengio and colleagues on disentangled representations, which attempts to train networks where individual neurons correspond to independent semantic factors, has partially addressed this, but full disentanglement remains an open research problem. Modern large language models exhibit partial disentanglement (certain attention heads demonstrably track syntactic structure), but the majority of their capabilities remain distributed and opaque.
Gradient descent: The iterative optimization process by which neural networks adjust weights to reduce prediction error.
Distributed representation: Encoding where knowledge is spread across many parameters rather than localized in discrete symbols.
Adversarial example: A carefully perturbed input that causes a model to make a confidently wrong prediction, exposing the non-human character of learned representations.
Feature visualization: Techniques for generating inputs that maximally activate specific neurons, used to probe what concepts a network has learned.
You are a researcher trying to understand what a trained image classification network has actually learned. Your lab assistant is familiar with feature visualization research (Olah et al.), adversarial examples (Szegedy et al.), and the theoretical problem of distributed representations. Probe what "looking inside" a network can and cannot tell us.
At the KDD conference in San Francisco in August 2016, Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin presented a paper titled "Why Should I Trust You?": Explaining the Predictions of Any Classifier. They demonstrated their method, LIME (Local Interpretable Model-agnostic Explanations), by showing that a text classifier trained to distinguish Christianity from Atheism had learned to rely heavily on the words "posting" and "host" as proxies for the religion category, because those words appeared frequently in the email headers of training samples from a specific newsgroup. The model was 94% accurate on held-out data. Its reasoning was entirely spurious. LIME found this. A standard accuracy metric would not have. The demonstration made the point with unusual clarity: a model that cannot be explained is a model that cannot be trusted, no matter what its held-out accuracy is.
LIME's core insight is that global interpretability (understanding a complex model everywhere) may be impossible, but local interpretability (understanding why a specific decision was made) is achievable by approximation. The method works by perturbing the input around the instance being explained (changing individual words in text, superpixels in images, or feature values in tabular data), observing how the model's output changes across these perturbations, and then fitting a simple interpretable model (usually linear regression) to the model's behavior in that local neighborhood.
The result is a set of feature importances: "this prediction was most strongly influenced by features X, Y, and Z." These importances are not the model's actual internal reasoning; they are an approximation of the model's behavior around one point. The limitation is significant: LIME explanations can be unstable (small changes to the explained instance can produce very different explanations) and can miss important global structure. But they are practical, model-agnostic (they work on any classifier without access to its internals), and often actionable.
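The perturb, query, and fit procedure can be sketched in plain Python for tabular inputs. This is a simplified illustration, not the LIME library: it uses Gaussian perturbations, an exponential proximity kernel, and ordinary weighted least squares, and the names `lime_tabular`, `solve`, and `black_box` as well as all parameter values are assumptions made for the example.

```python
import math
import random

def solve(a, b):
    """Solve the small linear system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def lime_tabular(black_box, x0, n_samples=2000, scale=0.1, kernel_width=0.25, seed=0):
    """Fit a proximity-weighted linear surrogate to black_box around x0.
    Returns [intercept, coef_1, ..., coef_d]."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x0]             # perturb the instance
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x0))
        weights.append(math.exp(-dist2 / kernel_width ** 2))      # nearer samples count more
        rows.append([1.0] + [zi - xi for zi, xi in zip(z, x0)])   # intercept + centred features
        targets.append(black_box(z))                              # query the black box only
    # Weighted least squares via the normal equations (X^T W X) beta = X^T W y.
    k = len(x0) + 1
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(w * r[i] * y for w, r, y in zip(weights, rows, targets)) for i in range(k)]
    return solve(xtwx, xtwy)

def black_box(x):
    # Hypothetical opaque model; the explainer never looks inside it.
    return math.sin(3 * x[0]) + x[1] ** 2

beta = lime_tabular(black_box, [0.0, 1.0])
print([round(v, 2) for v in beta])
```

The two coefficients approximate the black box's local slopes at the chosen point (about 3 and 2 for this function). Explaining a different point would generally yield different coefficients, which is exactly what "local" means here.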
In 2017, Scott Lundberg and Su-In Lee at the University of Washington published SHAP, grounding feature attribution in cooperative game theory. The Shapley value, introduced by Lloyd Shapley in 1953 for fair division of payoffs in cooperative games, measures each player's marginal contribution averaged across all possible orderings of players joining the game. Lundberg and Lee adapted this to AI: each feature is a "player," the model's prediction is the "payoff," and the Shapley value of each feature is its average marginal contribution to the prediction across all possible feature orderings.
SHAP has several properties that LIME lacks: consistency (if a model changes so that a feature's marginal contribution increases or stays the same, that feature's SHAP value does not decrease), local accuracy (SHAP values sum to the model's prediction minus the expected prediction), and missingness (absent features receive zero attribution). These axiomatic guarantees make SHAP attributions more theoretically principled than LIME. The cost is computational: exact Shapley values require exponential time in the number of features, so SHAP in practice uses approximations and specialized algorithms (TreeSHAP for tree ensembles such as gradient-boosted models, KernelSHAP for generic models). TreeSHAP, implemented by Lundberg in 2018, runs in polynomial time for tree-based models and became widely adopted in financial services and healthcare AI audits.
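For a handful of features, Shapley attributions can be computed exactly by brute force, which makes both the local accuracy property and the computational cost visible. The sketch below is not the SHAP library: it treats an "absent" feature as one set to a fixed baseline value (a simplifying assumption), and the `model` function is hypothetical.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley attributions by enumerating every feature ordering.
    'Absent' features are held at the baseline value (a simplifying assumption)."""
    d = len(x)
    phi = [0.0] * d
    orderings = list(permutations(range(d)))  # d! orderings: infeasible beyond a few features
    for order in orderings:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]          # feature i joins the coalition
            value = model(current)
            phi[i] += value - prev     # its marginal contribution in this ordering
            prev = value
    return [p / len(orderings) for p in phi]

def model(f):
    # Hypothetical scoring function with an interaction between features 0 and 1.
    return 2.0 * f[0] + f[1] * f[0] + 0.5 * f[2]

x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)                                      # [3.0, 1.0, 2.0]
print(sum(phi), model(x) - model(baseline))     # 6.0 6.0
```

The interaction term is split evenly between the two features that jointly produce it, and the attributions sum to the difference between the prediction and the baseline prediction.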
Both LIME and SHAP explain model behavior (they describe what features the model appears to use), but neither guarantees that the explanation reflects the model's actual computational process. A 2020 paper by Dylan Slack and colleagues showed that SHAP and LIME explanations could be systematically manipulated: a model could appear to use non-discriminatory features in explanations while actually relying on protected attributes in its predictions. The attack wraps the biased model in a scaffold that detects the out-of-distribution perturbed inputs these methods generate and answers those queries with an innocuous model instead. This remains an active and unresolved research area.
A persistent confusion in XAI practice is conflating global and local explanations. A global explanation describes a model's overall behavior: which features it generally relies on, what its decision boundaries look like across its input space. A local explanation describes a specific prediction: why this applicant was denied credit, why this image was classified as a tumor.
LIME is explicitly local. SHAP can be aggregated: averaging absolute SHAP values across many examples produces a global feature importance ranking. But this aggregate conceals heterogeneity: a feature might be critical for some subpopulations and irrelevant for others, a fact that average SHAP values can mask. Partial dependence plots, accumulated local effects (ALE), and SHAP interaction values are among the tools used to probe global structure, each with its own assumptions and limitations.
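A short sketch shows how a pooled importance figure can hide subgroup differences. The attribution values and subgroup labels below are invented for illustration; they do not come from any real model or dataset.

```python
def mean_abs(values):
    """Mean absolute attribution, a common global importance summary."""
    return sum(abs(v) for v in values) / len(values)

# Invented per-example attributions for ONE feature, split by a hypothetical subgroup label.
attributions = {
    "subgroup_1": [0.8, -0.9, 0.7, -0.8],
    "subgroup_2": [0.0, 0.05, -0.05, 0.0],
}

pooled = [v for vals in attributions.values() for v in vals]
global_importance = mean_abs(pooled)
per_group = {g: mean_abs(vals) for g, vals in attributions.items()}

print(round(global_importance, 4))
print({g: round(v, 4) for g, v in per_group.items()})
```

The pooled figure suggests a moderately important feature, while the breakdown shows it dominating one subgroup and doing almost nothing in the other. Reporting importance per subgroup alongside the global ranking is one inexpensive check.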
The EU GDPR's "right to explanation" refers specifically to individual automated decisions, which makes it a local explanation requirement. The EU AI Act's requirements for high-risk systems lean more global: documentation of a model's overall logic and limitations. These regulatory distinctions map roughly onto the local/global technical distinction, though the legal literature is still working out exactly what level of explanation satisfies each requirement.
LIME: Local Interpretable Model-Agnostic Explanations. Explains individual predictions by fitting a simple model to the black box's local behavior through perturbation.
SHAP: SHapley Additive exPlanations. Assigns feature attributions using Shapley values from cooperative game theory, with axiomatic consistency and local accuracy guarantees.
Faithfulness: The degree to which an explanation accurately reflects the model's actual internal computation.
Local vs. global explanation: The distinction between explaining one prediction and explaining a model's overall behavior.
You are auditing an ML-based credit scoring model for a bank. The model is a gradient-boosted tree ensemble and outputs probability-of-default scores. You need to explain individual loan denials and characterize the model's overall behavior. Your lab assistant is fluent in LIME, SHAP, TreeSHAP, and the faithfulness literature. Work through how you would actually produce and validate these explanations.
In the late 1990s, Jerome Friedman at Stanford and Werner Stuetzle were developing methods for visualizing high-dimensional statistical models. Around the same time, clinical researchers were using something much simpler to decide whether a child with a sore throat needed antibiotics: the Centor criteria, a four-item scoring system developed by Robert Centor in 1981. Check for fever above 38°C (1 point), tonsillar exudate (1 point), swollen anterior cervical lymph nodes (1 point), and absence of cough (1 point). A score of 3 or 4 indicates empirical antibiotic treatment without waiting for a throat culture. The model was a laminated card. Its logic was fully legible to any nurse. Its performance across decades of clinical use was adequate: not optimal, but sufficient, safe, and auditable. A neural network trained on electronic health records might score fractionally better on a held-out dataset. But when that network flagged a child for antibiotics, no one could explain to the parents, the pharmacist, or the malpractice attorney why.
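The scoring rule as described above fits in a few lines, which is the point: the whole model can be read at a glance. The sketch follows only the description in this text, and the function and parameter names are illustrative assumptions.

```python
def centor_score(temp_c, tonsillar_exudate, swollen_anterior_cervical_nodes, cough_present):
    """One point per criterion, as listed in the text above."""
    return (
        int(temp_c > 38.0)
        + int(tonsillar_exudate)
        + int(swollen_anterior_cervical_nodes)
        + int(not cough_present)
    )

def suggests_empirical_treatment(score):
    # Threshold as described in the text: a score of 3 or 4.
    return score >= 3

score = centor_score(temp_c=38.6, tonsillar_exudate=True,
                     swollen_anterior_cervical_nodes=False, cough_present=False)
print(score, suggests_empirical_treatment(score))  # prints: 3 True
```

Every point in the total can be traced to a named finding, so the explanation of any individual result is the calculation itself.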
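Interpretable-by-design approaches build transparency into the model architecture rather than applying explanation methods afterward. The most established include generalized additive models (GAMs) and their neural counterparts (NAMs), which allow non-linear per-feature effects while remaining additive, and scorecards, integer-coefficient linear models simple enough to compute mentally.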
In a widely cited 2019 Nature Machine Intelligence paper, Cynthia Rudin argued that the XAI community had misidentified its goal. Rather than developing better post-hoc explanation methods for black-box models, researchers and practitioners should use inherently interpretable models for high-stakes decisions. Her argument: post-hoc explanations are not the model's actual reasoning, they are at best approximations and at worst misleading; for structured tabular data (the dominant form in criminal justice, credit, and medicine), interpretable models achieve accuracy within a few percent of the best black-box methods; and the performance gap often disappears when evaluation is done carefully on the right metrics for the right population.
Rudin's view is contested. Critics note that in domains like medical imaging, natural language processing, and speech recognition, the performance gap between interpretable and black-box models is real and clinically meaningful. A neural network detecting diabetic retinopathy from fundus photographs achieves accuracy that a logistic regression on manually extracted features cannot match. In those domains, the choice is not between an interpretable model and an opaque one of equal performance; it is between an interpretable model that misses more cases and an opaque model that catches them.
Rudin's framework suggests the right first question in any ML deployment is not "how do we explain this black box?" but "do we actually need a black box here?" If interpretable models are adequate, the explanation problem largely disappears. If they are not, as is often true in vision and language tasks, then post-hoc methods like SHAP become necessary tools with known limitations, not substitutes for interpretability.
A different approach to interpretable-by-design thinking involves not the model architecture but the explanation vocabulary. Been Kim and colleagues at Google Brain developed TCAV (Testing with Concept Activation Vectors) in 2018. Rather than explaining a prediction in terms of raw input features (pixels, word tokens, tabular values), TCAV identifies human-defined concepts such as "striped," "smiling," or "elderly" and tests whether a network uses those concepts in its predictions.
The method works by training a linear classifier to distinguish activations produced by concept-containing images from activations produced by random images, then measuring how much perturbing in the concept direction changes the model's output. TCAV explanations are in a vocabulary humans can understand ("this tumor prediction was influenced by the concept of irregular borders") rather than "pixel 147 in the upper-left quadrant contributed 0.003 to the output." For medical AI, where the explanation vocabulary needs to match clinical training, this is a significant advantage.
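The two steps above can be sketched on toy data. This is a simplified illustration under stated assumptions, not the published TCAV implementation: activations are three-dimensional and synthetic, the linear classifier is a small logistic regression trained by gradient descent, sensitivity is estimated by finite differences rather than analytic gradients, and the statistical testing against multiple random sets is omitted. All names and values are hypothetical.

```python
import math
import random

def train_cav(concept_acts, random_acts, lr=0.1, epochs=500):
    """Fit a logistic classifier (concept vs. random) on layer activations;
    the unit-normalised weight vector serves as the concept activation vector."""
    d = len(concept_acts[0])
    w, b = [0.0] * d, 0.0
    data = [(a, 1.0) for a in concept_acts] + [(a, 0.0) for a in random_acts]
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for a, y in data:
            p = 1.0 / (1.0 + math.exp(-(sum(wi * ai for wi, ai in zip(w, a)) + b)))
            for j in range(d):
                gw[j] += (p - y) * a[j]
            gb += p - y
        w = [wi - lr * g / len(data) for wi, g in zip(w, gw)]
        b -= lr * gb / len(data)
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

def directional_derivative(head, act, direction, eps=1e-4):
    """Central finite-difference estimate of the logit's change along `direction`."""
    plus = head([a + eps * v for a, v in zip(act, direction)])
    minus = head([a - eps * v for a, v in zip(act, direction)])
    return (plus - minus) / (2 * eps)

def tcav_score(head, class_acts, cav):
    """Fraction of class examples whose logit increases along the concept direction."""
    positive = sum(1 for a in class_acts if directional_derivative(head, a, cav) > 0)
    return positive / len(class_acts)

rng = random.Random(0)
# Synthetic layer activations: the hypothetical concept shifts dimension 0 upward.
concept_acts = [[2.0 + rng.gauss(0, 0.3), rng.gauss(0, 0.3), rng.gauss(0, 0.3)]
                for _ in range(50)]
random_acts = [[rng.gauss(0, 0.3), rng.gauss(0, 0.3), rng.gauss(0, 0.3)]
               for _ in range(50)]
cav = train_cav(concept_acts, random_acts)

def head(act):
    # Hypothetical remainder of the network: maps layer activations to a class logit.
    return act[0] * act[2] + 0.2 * act[1]

class_acts = [[0.1, 0.0, 1.0], [0.1, 0.0, 1.0], [0.1, 0.0, 1.0], [0.1, 0.0, -1.0]]
score = tcav_score(head, class_acts, cav)
print(round(cav[0], 2), score)
```

In this toy setup the concept direction raises the class logit for three of the four examples, so the score is 0.75. A score near 0.5 across repeated random comparison sets would suggest the concept is not being used in any consistent direction.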
Interpretable-by-design: Model architectures whose decision logic is human-readable without post-hoc explanation.
GAMs / NAMs: Generalized and Neural Additive Models. Additive architectures allowing non-linear per-feature effects while preserving interpretability.
Scorecards: Integer-coefficient linear models that can be computed mentally, used in clinical and criminal justice settings.
TCAV: Testing with Concept Activation Vectors. Explains neural network predictions in terms of human-defined concepts rather than raw features.
You are advising a hospital system deploying AI for three different tasks: (1) predicting 30-day readmission from structured EHR data, (2) detecting pneumonia in chest X-rays, and (3) triaging emergency department patients by acuity. Your lab assistant knows the XAI landscape: GAMs, NAMs, Rudin's scorecard methods, TCAV, and the limits of post-hoc methods. Work through what interpretability strategy is appropriate for each task and why.