A taxonomy of AI failure and the epistemology of trust calibration.
Amazon built an AI recruiting tool trained on a decade of resume data. Since the tech industry had historically hired more men, the training data reflected that imbalance. The system systematically downgraded resumes that included the word "women's" β as in "women's chess club" β and penalized graduates of all-women's colleges. Amazon scrapped the tool in 2018.
The failure wasn't a bug. The model did exactly what it was trained to do: predict who the company had historically hired. The problem was that historical hiring decisions were themselves biased.
AI Failure Taxonomy
AI failures fall into predictable categories. Knowing the taxonomy helps you anticipate rather than react.
Data failures: Unrepresentative, biased, or stale training data.
Specification failures: The objective doesn't capture what you actually want (Goodhart's Law: when a measure becomes a target, it ceases to be a good measure).
Distribution shift: Real-world inputs differ from training distribution.
Brittleness: Small perturbations cause catastrophic output changes.
Emergent failures: Behaviors that appear only at scale, impossible to predict from small-scale testing.
Alignment failures: The model pursues proxy goals that diverge from human intent.
Trust Calibration
Appropriate trust in AI is neither blanket skepticism nor uncritical acceptance. It requires domain-by-domain calibration based on stakes, reversibility, adversarial exposure, and how novel the domain is relative to training data.
Key Principle
The higher the stakes and the less reversible the decision, the more independent verification is warranted β regardless of AI confidence scores.
Amazon's resume-screening AI penalized graduates of all-women's colleges. Which failure category best describes this?
β Data failure. The model faithfully learned from historical hiring data that reflected past discrimination. No malice required β the bias was already baked into the training set.
β Not quite. The developers intended no harm β the problem is that training data carries historical discrimination forward. This is a data failure.
Goodhart's Law states that "when a measure becomes a target, it ceases to be a good measure." Which AI failure type does this most directly describe?
β Specification failure. When you optimize for the measurable proxy instead of the underlying goal, the system games the metric. This is the essence of Goodhart's Law.
β Goodhart's Law describes what happens when a model optimizes for the wrong objective β that's a specification failure.
A self-driving car performs flawlessly in California but fails in snowy Finnish roads it was never trained on. This is an example of:
β Distribution shift. The deployment environment (Finland, snow, ice) falls outside the training distribution (California roads). Models can fail silently when real-world conditions differ from training conditions.
β This is distribution shift β the model is deployed in conditions that differ from where it was trained.
A language model unexpectedly develops the ability to perform multi-step arithmetic at a certain scale that wasn't present in smaller versions. This is:
β Emergent capabilities are behaviors that appear suddenly at scale and weren't predictable from evaluations of smaller models. This is one of the most difficult aspects of frontier AI safety research.
β This describes emergence β capabilities that appear unpredictably as model scale increases, making them hard to anticipate or test for in advance.
For which type of decision would lower trust in AI output be most warranted, all else equal?
β High stakes, irreversible, adversarial to error β cancer treatment decisions require the most rigorous independent verification. Trust calibration scales inversely with consequence severity.
β The higher the stakes and the less reversible the outcome, the less you should rely on AI without independent verification. Medical treatment decisions require the most caution.
Hallucination, confabulation, and the architectural limits of language models.
In 2023, attorneys in a real U.S. federal case submitted a legal brief citing six prior court cases β all generated by ChatGPT, none of which existed. The AI produced plausible case names, docket numbers, and fabricated quotes from fictional rulings. When the court demanded copies, the attorneys submitted more AI-generated text purporting to confirm the cases. The judge sanctioned all parties.
Why Language Models Hallucinate
Language models predict next-token probability distributions conditioned on prior context. They do not retrieve facts from a verified database β they generate what statistically fits. Hallucination is a structural consequence of this architecture.
The model has no internal truth oracle β it cannot distinguish generating true statements from plausible-sounding ones.
Legal citations, academic references, and statistics are high-density structured text patterns. The model learned the format; the content may be confabulated.
Retrieval-Augmented Generation (RAG) partially mitigates this by grounding responses in retrieved documents, but doesn't eliminate hallucination.
RLHF training may increase fluency and apparent confidence without improving factual accuracy.
The Confidence Problem
The most dangerous hallucinations are the ones that sound authoritative. Legal citations, medical statistics, and research paper quotes are precisely the content types where hallucination is hardest to detect β because you'd need domain expertise to spot the error, and the person asking AI is often doing so because they lack that expertise.
Key Insight
Hallucinations are hardest to detect on topics you know least about. The AI sounds equally confident whether it's right or wrong.
β Language models predict next tokens based on statistical patterns β they have no internal mechanism to verify whether generated content is factually true.
β Hallucination is structural: models predict statistically plausible text, not verified facts. They have no truth oracle.
In the Mata v. Avianca case, what was the fundamental error?
β The AI produced entirely fabricated case citations β names, docket numbers, quotes β that matched the format of real legal citations but referenced cases that did not exist.
β The AI invented the cases entirely β the citations were structurally correct but referenced fictional rulings.
Which content type is MOST vulnerable to AI hallucination going undetected?
β Specific citations are the most dangerous β the AI can produce a perfectly formatted reference to a paper that doesn't exist, and only someone who checks the original source would catch it.
β Specific citations to obscure or technical sources are most dangerous β they're hardest to verify and the AI produces them in perfect format regardless of whether they exist.
How does Retrieval-Augmented Generation (RAG) reduce hallucination?
β RAG retrieves relevant documents and provides them as context, giving the model source material to reference rather than generating purely from parametric memory.
β RAG works by retrieving relevant documents and supplying them as context β the model then generates based on those documents rather than from memory alone.
RLHF training (Reinforcement Learning from Human Feedback) is primarily designed to:
β RLHF aligns outputs with human preferences β it can improve helpfulness and reduce harmful content, but it doesn't directly improve factual accuracy and may even increase confident-sounding hallucinations.
β RLHF is about aligning with human preferences, not improving factual accuracy. It may actually increase confident-sounding outputs without reducing hallucination rates.
Distributed accountability in sociotechnical systems.
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a proprietary AI tool used in U.S. courts to assess recidivism risk. A 2016 ProPublica analysis found it was significantly more likely to incorrectly flag Black defendants as high-risk and white defendants as low-risk. Judges weren't required to follow COMPAS scores, but evidence suggests scores influenced sentencing. The algorithm's inner workings are protected as a trade secret β defendants cannot examine the tool that helped determine their sentence.
The Accountability Chain
Traditional legal frameworks assume discrete, identifiable human agents responsible for decisions. AI-assisted decision-making distributes causal responsibility across model developers, deploying institutions, individual operators, and regulators.
Model developers β design choices, training data, evaluation metrics.
Individual operators β how they weight AI output against other factors.
Regulators β what they require, audit, and permit.
Automation Bias
The "human in the loop" framing is frequently invoked as a safeguard. But research shows that when humans consistently defer to AI recommendations β automation bias β oversight becomes nominal. Formal responsibility and actual causal responsibility diverge. The human is legally responsible but effectively a rubber stamp.
Core Tension
The more reliable an AI system appears, the more humans defer to it β reducing the oversight that makes the human-in-the-loop framework meaningful.
What is "automation bias" in the context of AI oversight?
β Automation bias is the tendency to over-trust automated systems β deferring to AI output even when independent judgment should override it. This undermines human-in-the-loop safeguards.
β Automation bias describes human behavior: the tendency to defer to AI recommendations without sufficient critical evaluation.
In the COMPAS case, why can't defendants challenge the algorithm's decision?
β COMPAS is a proprietary commercial product. Its weights, training data, and decision logic are trade secrets β defendants cannot examine the system that influenced their sentencing.
β COMPAS is protected as a trade secret. This creates a profound tension between commercial IP rights and defendants' due process rights.
Which party in an AI deployment chain typically bears the LEAST formal legal accountability?
β In most current legal frameworks, AI model developers face limited direct liability β accountability tends to fall on deploying institutions and operators. This is a major gap in current AI governance.
β Under current law in most jurisdictions, model developers bear less direct liability than deploying institutions β though this is an active area of regulatory debate.
What does it mean for accountability when "formal responsibility and actual causal responsibility diverge"?
β When a human is formally responsible (legally liable) but an AI made the actual decision they rubber-stamped, accountability becomes hollow β you can assign blame but it doesn't reflect who made the choice.
β The divergence means the person legally on the hook didn't meaningfully control the outcome β accountability becomes nominal rather than substantive.
What is the core tension in the human-in-the-loop oversight model?
β This is the fundamental paradox: high-performing AI induces automation bias, which reduces meaningful oversight, which removes the safeguard that justified deploying AI in high-stakes decisions in the first place.
β The core tension is that reliable AI produces automation bias β humans defer more as AI improves, undermining the oversight that makes human-in-the-loop meaningful.
The full lifecycle of bias: origins, proxy variables, and feedback loops.
Medical AI systems trained mostly on lighter-skinned patients have shown lower accuracy for darker-skinned patients in detecting conditions including skin cancer and pulse oximetry errors. The patients who most need accurate diagnosis are the ones the system serves least well.
The Bias Pipeline
Historical bias: Training data reflects past discriminatory decisions.
Representation bias: Certain populations are underrepresented in training data.
Measurement bias: Proxy labels don't equally capture the construct across groups.
Aggregation bias: One model for a heterogeneous population obscures subgroup differences.
Deployment feedback loops: Biased outputs affect the world, generating new biased training data.
Equalized odds: Equal true positive rates AND equal false positive rates across groups.
Calibration: Among individuals scoring p, approximately p% have the outcome, regardless of group membership.
The Impossibility Theorem
Chouldechova (2017) and Kleinberg et al. (2016) independently proved: when base rates differ across groups, you cannot simultaneously satisfy calibration, equal false positive rates, and equal false negative rates. This is a mathematical proof, not a design limitation.
Political Dimension
Choosing which fairness metric to optimize is not a technical decision β it is a political one encoding a value judgment about which error is worse and whose interests are prioritized.
Prompt injection, jailbreaks, adversarial examples, and the security surface of AI systems.
The Adversarial Threat Surface
Adversarial examples: Imperceptible input changes causing misclassification. A stop sign with stickers reads as a speed limit sign to an AV classifier.
Prompt injection: Malicious text in data the LLM processes overrides intended instructions.
Jailbreaking: Prompt sequences designed to bypass safety training.
Data poisoning: Injecting malicious examples into training data to create backdoors triggered at deployment.
Model inversion: Querying a model to extract private training data.
Open Problem
Adversarial robustness and standard accuracy are often in tension β there is no current general solution.
β Prompt injection: adversarial instructions embedded in untrusted data (emails, docs) that the LLM treats as instructions.
β Prompt injection hides instructions in data the model processes, causing it to execute them instead of its intended behavior.
What makes adversarial examples in computer vision particularly dangerous?
β Humans see a normal image while the model sees something completely different β with high confidence.
β The danger is imperceptibility: tiny pixel changes invisible to humans flip a model's classification.
Data poisoning is most dangerous because:
β The backdoor is invisible in normal operation β it only activates when the attacker presents the specific trigger pattern.
β Data poisoning inserts a trigger during training β invisible until the attacker uses the specific trigger input in deployment.
Why is adversarial robustness difficult to achieve alongside high accuracy?
β Empirical accuracy-robustness tradeoff: features that maximize clean accuracy tend to be brittle β sensitive to adversarial perturbations.
β Research shows a fundamental tension: optimizing for standard accuracy creates representations more susceptible to adversarial attack.
Model inversion attacks are designed to:
β Model inversion exploits the fact that outputs leak information about training data β attackers can reconstruct approximate representations of private training examples.
β Model inversion uses outputs to reconstruct training data β a significant privacy risk when models are trained on sensitive personal records.
β Distribution shift + underspecification: the validation environment doesn't match deployment, and equally-performing models may diverge in real-world conditions.
β Two compounding problems: distribution mismatch and underspecification β standard metrics simply don't capture real-world robustness.
β When humans rubber-stamp AI decisions, formal accountability remains but substantive oversight disappears.
β Automation bias: humans defer β oversight exists on paper but not in practice.
Which hallucination type is most dangerous in a legal context?
β Citation hallucination: perfectly formatted fake references that only someone who verifies the source would catch β as Mata v. Avianca demonstrated.
β Citation hallucination: perfectly formatted fake references, hardest to detect without domain expertise.
Proxy bias allows discrimination to persist after a protected attribute is excluded because:
β Correlated proxies carry the discriminatory signal forward β excluding "race" while keeping ZIP code still produces racially disparate outcomes.
β Proxy bias: correlated variables carry the protected attribute's signal even when it's excluded.
Model cards are primarily designed to:
β Model cards are transparency artifacts: what a model is for, how it performs across subgroups, and known limitations.
β Model cards document what a model is, how it was evaluated, known limitations, and appropriate use contexts.
Underspecification is a problem because:
β Equally-scoring models may behave very differently in deployment β and you can't tell from standard evaluation alone.
β Underspecification: validation can't distinguish good generalizers from poor ones.
COMPAS disproportionately flagged Black defendants because:
β Proxy bias via prior arrests β over-policed communities generate more arrests, which the model treated as evidence of higher recidivism risk.
β Proxy bias: prior arrests proxied for race because of differential policing intensity.
Which audit type is most capable of detecting data poisoning backdoors?
β Backdoors are invisible in normal operation and undetectable via output probing β white-box access to training data and weights is required.
β Backdoors can't be detected through output probing alone β white-box access to training data and model internals is needed.
Language models predict next-token probability distributions β they don't retrieve facts from a verified database. Knowledge is distributed across parameters as statistical associations. When you ask about a legal case, the model predicts tokens that co-occur with legal citations; it doesn't look the case up.
This is why hallucinations follow the right format: the model learned the pattern of legal citations and academic references. The content within that pattern may be statistical interpolation rather than retrieved fact.
RAG (Retrieval-Augmented Generation) partially mitigates this by supplying retrieved documents as context β but doesn't eliminate hallucination when retrieved context is incomplete or ambiguous.
Proxy Variables and Structural Bias
A proxy variable correlates with a protected attribute without directly encoding it. ZIP code correlates with race due to historical residential segregation. A lending model excluding race but including ZIP code still produces racially disparate outcomes.
The COMPAS tool used prior arrests as a key input. Prior arrests correlate with race due to differential policing intensity. A model trained on arrest data systematically overestimates recidivism risk for individuals from over-policed communities β even without seeing race directly.
This is laundering bias: excluding a protected attribute while retaining correlated proxies creates a veneer of neutrality over a substantively biased process.