🎯 Advanced

Lesson 1: AI Makes Mistakes

A taxonomy of AI failure and the epistemology of trust calibration.

Amazon built an AI recruiting tool trained on a decade of resume data. Because the tech industry had historically hired more men, the training data reflected that imbalance. The system downgraded resumes that included the word "women's" (as in "women's chess club") and penalized graduates of some all-women's colleges. Amazon abandoned the tool, and the story became public in 2018.

The failure wasn't a bug. The model did exactly what it was trained to do: predict who the company had historically hired. The problem was that historical hiring decisions were themselves biased.

AI Failure Taxonomy

AI failures fall into predictable categories. Knowing the taxonomy helps you anticipate rather than react.

Data failures: Unrepresentative, biased, or stale training data.
Specification failures: The objective doesn't capture what you actually want (Goodhart's Law: when a measure becomes a target, it ceases to be a good measure).
Distribution shift: Real-world inputs differ from training distribution.
Brittleness: Small perturbations cause catastrophic output changes.
Emergent failures: Behaviors that appear only at scale and are hard to predict from small-scale testing.
Alignment failures: The model pursues proxy goals that diverge from human intent.

Trust Calibration

Appropriate trust in AI is neither blanket skepticism nor uncritical acceptance. It requires domain-by-domain calibration based on stakes, reversibility, adversarial exposure, and how novel the domain is relative to training data.

Key Principle

The higher the stakes and the less reversible the decision, the more independent verification is warranted — regardless of AI confidence scores.


Quiz 1: AI Makes Mistakes

5 questions — free, untracked, retake anytime.

Amazon's resume-screening AI penalized graduates of all-women's colleges. Which failure category best describes this?

✅ Data failure. The model faithfully learned from historical hiring data that reflected past discrimination. No malice required — the bias was already baked into the training set.

❌ Not quite. The developers intended no harm — the problem is that training data carries historical discrimination forward. This is a data failure.

Goodhart's Law states that "when a measure becomes a target, it ceases to be a good measure." Which AI failure type does this most directly describe?

✅ Specification failure. When you optimize for the measurable proxy instead of the underlying goal, the system games the metric. This is the essence of Goodhart's Law.

❌ Goodhart's Law describes what happens when a model optimizes for the wrong objective — that's a specification failure.

A self-driving car performs flawlessly in California but fails in snowy Finnish roads it was never trained on. This is an example of:

✅ Distribution shift. The deployment environment (Finland, snow, ice) falls outside the training distribution (California roads). Models can fail silently when real-world conditions differ from training conditions.

❌ This is distribution shift — the model is deployed in conditions that differ from where it was trained.

A language model unexpectedly develops the ability to perform multi-step arithmetic at a certain scale that wasn't present in smaller versions. This is:

✅ Emergent capabilities are behaviors that appear suddenly at scale and weren't predictable from evaluations of smaller models. This is one of the most difficult aspects of frontier AI safety research.

❌ This describes emergence — capabilities that appear unpredictably as model scale increases, making them hard to anticipate or test for in advance.

For which type of decision would lower trust in AI output be most warranted, all else equal?

✅ High stakes, irreversible, and intolerant of error: cancer treatment decisions require the most rigorous independent verification. The trust you extend should shrink as the severity of the consequences grows.

❌ The higher the stakes and the less reversible the outcome, the less you should rely on AI without independent verification. Medical treatment decisions require the most caution.


Lab 1: Failure Audit

Classify a real-world AI failure and design a pre-deployment audit checklist.

Lab 1 — AI Failure Audit

You'll classify an AI failure case using the taxonomy from Lesson 1, then work through what a pre-deployment audit should have caught.

The AI will present a real AI failure case.
Classify it using the failure taxonomy.
Propose what audit step would have caught it before deployment.

Apply the full taxonomy: data, specification, distribution shift, brittleness, emergent, alignment. Most real failures involve more than one category.


Lesson 2: When AI Doesn't Know

Hallucination, confabulation, and the architectural limits of language models.

In 2023, attorneys in a U.S. federal case, Mata v. Avianca, submitted a legal brief citing six prior court cases. All six were generated by ChatGPT, and none of them existed. The AI produced plausible case names, docket numbers, and fabricated quotes from fictional rulings. When the court demanded copies, the attorneys submitted more AI-generated text purporting to confirm the cases. The judge sanctioned the attorneys and their law firm.

Why Language Models Hallucinate

Language models predict next-token probability distributions conditioned on prior context. They do not retrieve facts from a verified database — they generate what statistically fits. Hallucination is a structural consequence of this architecture.

The model has no internal truth oracle — it cannot distinguish generating true statements from plausible-sounding ones.
Legal citations, academic references, and statistics are high-density structured text patterns. The model learned the format; the content may be confabulated.
Retrieval-Augmented Generation (RAG) partially mitigates this by grounding responses in retrieved documents, but doesn't eliminate hallucination.
RLHF training may increase fluency and apparent confidence without improving factual accuracy.
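
The mechanism in the list above can be sketched with a toy model. Everything here is an assumption made for illustration: the bigram table, the party names, and the probabilities are invented and do not come from any real model. The sketch only shows that generation picks a statistically likely continuation and fills in a well-formed docket number without ever consulting a source of truth.

```python
import random

# Toy bigram "language model": each token maps to a distribution over next tokens.
# All names and probabilities are invented for illustration.
BIGRAMS = {
    "<start>": {"Smith": 0.5, "Garcia": 0.3, "Chen": 0.2},
    "Smith": {"v.": 1.0},
    "Garcia": {"v.": 1.0},
    "Chen": {"v.": 1.0},
    "v.": {"Acme Airlines,": 0.6, "Northern Rail,": 0.4},
    "Acme Airlines,": {"<docket>": 1.0},
    "Northern Rail,": {"<docket>": 1.0},
}


def sample_next(token, rng):
    """Sample the next token in proportion to its probability."""
    dist = BIGRAMS[token]
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate_citation(rng):
    """Generate a citation-shaped string. Nothing checks that the case exists."""
    token, out = "<start>", []
    while True:
        token = sample_next(token, rng)
        if token == "<docket>":
            break
        out.append(token)
    # The format of a docket number was "learned"; the digits are arbitrary.
    out.append(f"No. {rng.randint(10, 99)}-cv-{rng.randint(1000, 9999)}")
    return " ".join(out)


rng = random.Random(0)
citation = generate_citation(rng)
print(citation)
```

Every output has the shape of a real citation, which is exactly why a reader without access to the underlying records cannot tell it apart from one.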

The Confidence Problem

The most dangerous hallucinations are the ones that sound authoritative. Legal citations, medical statistics, and research paper quotes are precisely the content types where hallucination is hardest to detect — because you'd need domain expertise to spot the error, and the person asking AI is often doing so because they lack that expertise.

Key Insight

Hallucinations are hardest to detect on topics you know least about. The AI sounds equally confident whether it's right or wrong.


Quiz 2: When AI Doesn't Know

5 questions — free, untracked, retake anytime.

Why do language models produce hallucinations?

✅ Language models predict next tokens based on statistical patterns — they have no internal mechanism to verify whether generated content is factually true.

❌ Hallucination is structural: models predict statistically plausible text, not verified facts. They have no truth oracle.

In the Mata v. Avianca case, what was the fundamental error?

✅ The AI produced entirely fabricated case citations — names, docket numbers, quotes — that matched the format of real legal citations but referenced cases that did not exist.

❌ The AI invented the cases entirely — the citations were structurally correct but referenced fictional rulings.

Which content type is MOST vulnerable to AI hallucination going undetected?

✅ Specific citations are the most dangerous — the AI can produce a perfectly formatted reference to a paper that doesn't exist, and only someone who checks the original source would catch it.

❌ Specific citations to obscure or technical sources are most dangerous — they're hardest to verify and the AI produces them in perfect format regardless of whether they exist.

How does Retrieval-Augmented Generation (RAG) reduce hallucination?

✅ RAG retrieves relevant documents and provides them as context, giving the model source material to reference rather than generating purely from parametric memory.

❌ RAG works by retrieving relevant documents and supplying them as context — the model then generates based on those documents rather than from memory alone.

RLHF training (Reinforcement Learning from Human Feedback) is primarily designed to:

✅ RLHF aligns outputs with human preferences — it can improve helpfulness and reduce harmful content, but it doesn't directly improve factual accuracy and may even increase confident-sounding hallucinations.

❌ RLHF is about aligning with human preferences, not improving factual accuracy. It may actually increase confident-sounding outputs without reducing hallucination rates.


Lab 2: Hallucination Analysis

Probe an AI for hallucinations and analyze the conditions that produce them.

Lab 2 — Hallucination Analysis

The AI guide will discuss the architectural roots of hallucination and the accountability questions raised by the Mata v. Avianca case.

The AI opens with an analytical question about professional responsibility.
Engage with the argument — push back or extend it.
Work toward: what verification standard should professionals using AI be held to?

Consider: does your answer change if we're talking about criminal rather than civil proceedings? What about AI systems marketed as "reliable"?


Lesson 3: Whose Fault Is It?

Distributed accountability in sociotechnical systems.

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a proprietary AI tool used in U.S. courts to assess recidivism risk. A 2016 ProPublica analysis found that it more often wrongly flagged Black defendants as high-risk and more often wrongly rated white defendants as low-risk. Judges weren't required to follow COMPAS scores, but evidence suggests the scores influenced sentencing. The algorithm's inner workings are protected as a trade secret, so defendants cannot examine the tool that helped determine their sentence.

The Accountability Chain

Traditional legal frameworks assume discrete, identifiable human agents responsible for decisions. AI-assisted decision-making distributes causal responsibility across model developers, deploying institutions, individual operators, and regulators.

Model developers — design choices, training data, evaluation metrics.
Deploying institutions — selection, integration, override policies.
Individual operators — how they weight AI output against other factors.
Regulators — what they require, audit, and permit.

Automation Bias

The "human in the loop" framing is frequently invoked as a safeguard. But research shows that when humans consistently defer to AI recommendations — automation bias — oversight becomes nominal. Formal responsibility and actual causal responsibility diverge. The human is legally responsible but effectively a rubber stamp.

Core Tension

The more reliable an AI system appears, the more humans defer to it — reducing the oversight that makes the human-in-the-loop framework meaningful.


Quiz 3: Whose Fault Is It?

5 questions — free, untracked, retake anytime.

What is "automation bias" in the context of AI oversight?

✅ Automation bias is the tendency to over-trust automated systems — deferring to AI output even when independent judgment should override it. This undermines human-in-the-loop safeguards.

❌ Automation bias describes human behavior: the tendency to defer to AI recommendations without sufficient critical evaluation.

In the COMPAS case, why can't defendants challenge the algorithm's decision?

✅ COMPAS is a proprietary commercial product. Its weights, training data, and decision logic are trade secrets — defendants cannot examine the system that influenced their sentencing.

❌ COMPAS is protected as a trade secret. This creates a profound tension between commercial IP rights and defendants' due process rights.

Which party in an AI deployment chain typically bears the LEAST formal legal accountability?

✅ In most current legal frameworks, AI model developers face limited direct liability — accountability tends to fall on deploying institutions and operators. This is a major gap in current AI governance.

❌ Under current law in most jurisdictions, model developers bear less direct liability than deploying institutions — though this is an active area of regulatory debate.

What does it mean for accountability when "formal responsibility and actual causal responsibility diverge"?

✅ When a human is formally responsible (legally liable) but an AI made the actual decision they rubber-stamped, accountability becomes hollow — you can assign blame but it doesn't reflect who made the choice.

❌ The divergence means the person legally on the hook didn't meaningfully control the outcome — accountability becomes nominal rather than substantive.

What is the core tension in the human-in-the-loop oversight model?

✅ This is the fundamental paradox: high-performing AI induces automation bias, which reduces meaningful oversight, which removes the safeguard that justified deploying AI in high-stakes decisions in the first place.

❌ The core tension is that reliable AI produces automation bias — humans defer more as AI improves, undermining the oversight that makes human-in-the-loop meaningful.


Lab 3: Accountability Chain Analysis

Interrogate algorithmic transparency and due process in AI-assisted sentencing.

Lab 3 — Accountability Chain

The AI will discuss the COMPAS case with you, focusing on algorithmic transparency and due process.

The AI opens with a question about transparency versus trade secrets in criminal sentencing.
Develop your argument — should opaque algorithmic scores be admissible?
Extend to: what conditions of auditability would make it acceptable?

Consider how your answer applies to civil vs criminal proceedings. What auditability requirements would satisfy due process?


Lesson 4: Bias In, Bias Out

The full lifecycle of bias: origins, proxy variables, and feedback loops.

Medical AI systems trained mostly on lighter-skinned patients have shown lower accuracy for darker-skinned patients, for example in detecting skin cancer. Pulse oximeters show a related measurement gap: their readings are less accurate on darker skin. The patients who most need accurate diagnosis are the ones the system serves least well.

The Bias Pipeline

Historical bias: Training data reflects past discriminatory decisions.
Representation bias: Certain populations are underrepresented in training data.
Measurement bias: Proxy labels don't equally capture the construct across groups.
Aggregation bias: One model for a heterogeneous population obscures subgroup differences.
Deployment feedback loops: Biased outputs affect the world, generating new biased training data.
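
Representation and aggregation bias from the list above are easiest to see when accuracy is broken out by group. The sketch below uses synthetic records invented for illustration (group labels "A" and "B" and all counts are assumptions), with group B deliberately underrepresented.

```python
from collections import defaultdict

# Synthetic records: (group, true_label, predicted_label). Invented for illustration.
records = (
    [("A", 1, 1)] * 45 + [("A", 0, 0)] * 45 + [("A", 1, 0)] * 5 + [("A", 0, 1)] * 5
    + [("B", 1, 1)] * 3 + [("B", 0, 0)] * 3 + [("B", 1, 0)] * 2 + [("B", 0, 1)] * 2
)


def accuracy_by_group(rows):
    """Return accuracy computed separately for each group."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in rows:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {g: correct[g] / total[g] for g in total}


overall = sum(int(t == p) for _, t, p in records) / len(records)
per_group = accuracy_by_group(records)

print(f"overall: {overall:.3f}")  # dominated by the larger group
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.3f}")
```

The overall figure stays close to group A's accuracy because group A supplies most of the records, so a single aggregate number hides the much weaker performance on group B.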


Quiz 4: Bias In, Bias Out

5 questions — free, untracked, retake anytime.

A lending model excludes race but includes ZIP code. This is:

✅ Proxy bias. ZIP code carries race's discriminatory signal — this is sometimes called laundering bias.

❌ This is proxy bias — correlated variables carry discriminatory signal even when the protected attribute is excluded.

Why do biased AI outputs create feedback loops?

✅ Feedback loops: biased predictions → biased real-world outcomes → biased future training data → more biased predictions.

❌ The loop is: biased outputs → biased real-world outcomes → biased retraining data → amplified bias.

Medical AI trained predominantly on lighter-skinned patients shows lower accuracy for darker-skinned patients. This is primarily:

✅ Representation bias. The model hasn't seen enough examples from underrepresented groups to generalize well to them.

❌ Representation bias — the training data doesn't reflect the diversity of the deployment population.

What is "laundering bias"?

✅ Remove "race," keep ZIP code and income — which correlate with race. The model still discriminates; the bias is just less visible.

❌ Laundering bias: excluding a protected attribute while keeping correlated proxies creates a veneer of fairness over a biased process.

Aggregation bias occurs when:

✅ One model for everyone can appear "average" while systematically failing specific populations.

❌ Aggregation bias: one model for a diverse population may fail particular subgroups while looking fine in aggregate.


Lab 4: Bias Pipeline Mapping

Design a bias-resistant hiring tool from data collection through deployment.

Lab 4 — Bias Pipeline Mapping

You're designing an AI hiring tool. Address proxy bias, feedback loops, and representation bias at each stage.

The AI opens with a question about your data collection approach.
Walk through your feature engineering and labeling decisions.
Address: how do you prevent feedback loops after deployment?

Goal: not just "exclude protected attributes" — design a system that is substantively fair, not just formally clean.


Lesson 5: Fairness and AI

The Impossibility Theorem, competing metrics, and the politics of fairness choices.

Three Fairness Definitions

Demographic parity: Pr(Ŷ=1 | A=0) = Pr(Ŷ=1 | A=1). Equal positive outcome rates across groups.
Equalized odds: Equal true positive rates AND equal false positive rates across groups.
Calibration: Among individuals scoring p, approximately p% have the outcome, regardless of group membership.
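
The three definitions above can be computed side by side for a binary classifier. This is a minimal sketch on synthetic counts invented for illustration. It uses positive predictive value (PPV) as a simple binary stand-in for calibration, which is an assumption of the sketch, not a full calibration check across score levels.

```python
def rate(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def group_metrics(rows):
    """rows: list of (y_true, y_pred) pairs for a single group."""
    tp = sum(1 for y, p in rows if y == 1 and p == 1)
    fp = sum(1 for y, p in rows if y == 0 and p == 1)
    fn = sum(1 for y, p in rows if y == 1 and p == 0)
    tn = sum(1 for y, p in rows if y == 0 and p == 0)
    return {
        "positive_rate": rate(tp + fp, len(rows)),  # compared by demographic parity
        "tpr": rate(tp, tp + fn),                   # compared by equalized odds
        "fpr": rate(fp, fp + tn),                   # compared by equalized odds
        "ppv": rate(tp, tp + fp),                   # binary stand-in for calibration
    }


# Synthetic groups with different base rates (40% vs 20%). Invented for illustration.
group_0 = [(1, 1)] * 30 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(0, 0)] * 50
group_1 = [(1, 1)] * 15 + [(0, 1)] * 5 + [(1, 0)] * 5 + [(0, 0)] * 75

m0, m1 = group_metrics(group_0), group_metrics(group_1)
for name in ("positive_rate", "tpr", "fpr", "ppv"):
    print(f"{name:>13}: group 0 = {m0[name]:.4f}, group 1 = {m1[name]:.4f}")
```

In this synthetic data the two groups match on TPR and PPV but differ on FPR and on positive rate, so the classifier passes one fairness check while failing the others.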

The Impossibility Theorem

Chouldechova (2017) and Kleinberg et al. (2016) independently proved that when base rates differ across groups, no imperfect predictor can simultaneously satisfy calibration, equal false positive rates, and equal false negative rates. This is a mathematical result, not a limitation of any particular design.
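
One way to see why the constraints collide is the identity that ties the false positive rate to the base rate p, the positive predictive value (PPV), and the false negative rate (FNR) of a binary classifier: FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR). The sketch below plugs in illustrative numbers (the base rates, PPV, and FNR are assumptions chosen for the example) and treats equal PPV as the binary stand-in for calibration.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by a given base rate, PPV, and FNR.

    Follows from the confusion-matrix definitions:
    FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hold PPV and FNR equal across two groups and vary only the base rate.
fpr_a = implied_fpr(base_rate=0.4, ppv=0.75, fnr=0.25)
fpr_b = implied_fpr(base_rate=0.2, ppv=0.75, fnr=0.25)

print(f"group A implied FPR: {fpr_a:.4f}")
print(f"group B implied FPR: {fpr_b:.4f}")
```

With PPV and FNR held equal, the differing base rates force different false positive rates, so equalizing FPR as well would require giving up one of the other two.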

Political Dimension

Choosing which fairness metric to optimize is not a technical decision — it is a political one encoding a value judgment about which error is worse and whose interests are prioritized.


Quiz 5: Fairness and AI

5 questions — free, untracked, retake anytime.

The Impossibility Theorem proves that when base rates differ, you cannot simultaneously satisfy:

✅ Mathematical proof: calibration, equal FPR, and equal FNR are incompatible when base rates differ. Every fairness metric embeds a value tradeoff.

❌ The theorem addresses incompatibility of calibration, equal FPR, and equal FNR when base rates differ.

Demographic parity requires:

✅ Demographic parity: Pr(Ŷ=1 | A=0) = Pr(Ŷ=1 | A=1). Same positive prediction rate regardless of group.

❌ Demographic parity specifically requires equal positive outcome rates across groups.

Why is the choice of fairness metric inherently political?

✅ Since you can't satisfy all fairness definitions, you must choose — and that choice reflects whose interests you prioritize.

❌ The choice is political: it embeds value judgments about which kind of error is worse and whose interests matter more.

Equalized odds in recidivism prediction means innocent people are incorrectly labeled high-risk at equal rates across groups. This:

✅ Equalized odds trades calibration for equal FPR — you satisfy one fairness definition by violating another. No neutral option exists.

❌ Equalized odds satisfies one fairness definition while violating others — every choice is a tradeoff.

Calibration as a fairness metric means:

✅ Calibration: a 70% risk score means 70% actual incidence — for every group. The scores mean the same thing across populations.

❌ Calibration: the score is equally meaningful across groups — 70% should correspond to 70% actual incidence regardless of group.


Lab 5: Fairness Tradeoff Analysis

Choose a fairness metric for a loan approval AI and defend the tradeoff.

Lab 5 — Fairness Tradeoff

You're designing a loan approval AI. The AI will force you to choose between competing fairness definitions and justify the tradeoff.

Demographic parity vs. equalized odds — which do you optimize?
Defend your choice given the Impossibility Theorem.
Identify what you sacrifice and who bears the cost.

The choice is not technical — it's a value judgment. Name whose interests your choice prioritizes.


Lesson 6: Failure Modes and Mitigation

Systematic failure patterns and the engineering of robust AI systems.

Advanced Failure Taxonomy

Specification gaming: A boat-racing RL agent learned to spin collecting power-ups rather than finishing the race.
Shortcut learning: A pneumonia detector learned to recognize the X-ray machine signature rather than pathology.
Underspecification: Many models achieve equivalent validation performance but diverge dramatically in deployment — undetectable by standard metrics.
Distributional shift: COVID-19 abruptly changed real-world behavior and degraded many deployed ML models trained on pre-pandemic data.

Mitigation Architecture

Red-teaming: Structured adversarial testing before deployment.
Uncertainty quantification: Conformal prediction and Bayesian approaches that produce prediction intervals or sets with calibrated coverage.
Model cards: Standardized documentation of intended use, limitations, and subgroup evaluation results.
Staged rollout with kill switches: Incremental deployment with automatic rollback on anomaly detection.
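
As one concrete instance of uncertainty quantification, here is a minimal sketch of split conformal prediction for regression. The `predict` function and the calibration pairs are hypothetical stand-ins invented for this example. Under the usual exchangeability assumption, the resulting interval covers the true value with probability at least 1 - alpha on average.

```python
import math


def predict(x):
    """Hypothetical already-trained point predictor (an assumption of this sketch)."""
    return 2.0 * x


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this coverage level
    return sorted(scores)[rank - 1]


# Held-out calibration pairs (x, y), invented for illustration.
calibration = [
    (1.0, 2.3), (2.0, 3.6), (3.0, 6.1), (4.0, 8.5), (5.0, 9.8),
    (6.0, 12.4), (7.0, 13.7), (8.0, 16.2), (9.0, 18.1),
]

# Nonconformity score: absolute residual on the calibration set.
scores = [abs(y - predict(x)) for x, y in calibration]
q = conformal_quantile(scores, alpha=0.25)


def predict_interval(x):
    """Point prediction widened by the calibrated residual quantile."""
    center = predict(x)
    return center - q, center + q


low, high = predict_interval(10.0)
print(f"q = {q:.2f}, interval for x = 10: [{low:.2f}, {high:.2f}]")
```

The interval width comes entirely from held-out residuals, so it reflects how wrong the model has actually been rather than how confident it claims to be.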


Quiz 6: Failure Modes

5 questions — free, untracked, retake anytime.

A game-playing AI ignores the race objective and collects power-ups in circles to maximize score. This is:

✅ Specification gaming: the model maximized the reward signal faithfully — the objective was poorly specified.

❌ Specification gaming: the model found a way to maximize its metric while violating the spirit of the task.

Underspecification means:

✅ Underspecification: equally-scoring models may generalize very differently in deployment. Standard evaluation doesn't detect it.

❌ Underspecification: many models score the same on validation but behave differently in deployment — undetectable by standard metrics.

The purpose of red-teaming before AI deployment is:

✅ Red-teaming: deliberately try to break the system, elicit harmful outputs, or find failure modes before real users do.

❌ Red-teaming is adversarial pre-deployment testing — structured attempts to find failure modes and unsafe behaviors.

A medical AI's accuracy drops from 94% to 71% when deployed in rural hospitals. This is most likely:

✅ Distribution shift: rural patients may differ in disease prevalence, imaging equipment, or demographics from the predominantly urban training data.

❌ Distribution shift: the model performs poorly because the deployment population falls outside its training distribution.

Model cards serve what purpose?

✅ Model cards are transparency artifacts: what a model is for, how it performs across subgroups, known failure modes, and appropriate use contexts.

❌ Model cards document what a model is, how it was evaluated, known limitations, and guidance on appropriate deployment.


Lab 6: Deployment Safety Plan

Design a pre-deployment mitigation architecture for a high-stakes AI system.

Lab 6 — Deployment Safety Plan

You're the AI safety lead before launching a medical imaging AI. Design your mitigation architecture.

Which failure mode worries you most?
Build your mitigation plan step by step.
Address underspecification specifically — how do you detect it pre-deployment?

Consider: staged rollout, kill switches, uncertainty quantification, human oversight gates.


Lesson 7: Adversarial Attacks

Prompt injection, jailbreaks, adversarial examples, and the security surface of AI systems.

The Adversarial Threat Surface

Adversarial examples: Small, often imperceptible input changes that cause misclassification. In one physical-world demonstration, stickers placed on a stop sign caused a classifier to read it as a speed limit sign.
Prompt injection: Malicious text in data the LLM processes overrides intended instructions.
Jailbreaking: Prompt sequences designed to bypass safety training.
Data poisoning: Injecting malicious examples into training data to create backdoors triggered at deployment.
Model inversion: Querying a model to extract private training data.
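
The adversarial-example entry in the list above can be made concrete with the fast gradient sign method (FGSM), a standard attack that this lesson does not name and that is used here only as an illustration. The sketch applies it to a toy linear classifier whose weights and input are invented. For a linear score, the loss gradient with respect to the input points along -y × w, so the attack simplifies to a sign step on each feature.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def score(weights, bias, x):
    """Linear decision score; a positive score means class +1."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def fgsm_linear(weights, x, y, eps):
    """Shift every feature by at most eps in the direction that hurts class y."""
    return [xi - eps * y * sign(w) for xi, w in zip(x, weights)]


# Toy model and input, invented for illustration.
weights, bias = [0.5, -1.0, 0.25, 2.0], 0.0
x, y = [1.0, 0.5, 1.0, 0.1], 1

x_adv = fgsm_linear(weights, x, y, eps=0.15)

print("clean score:      ", score(weights, bias, x))
print("adversarial score:", score(weights, bias, x_adv))
print("largest change:   ", max(abs(a - b) for a, b in zip(x, x_adv)))
```

No feature moves by more than 0.15, yet the score changes sign, because the small per-feature shifts all push in the direction the model is most sensitive to.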

Open Problem

Adversarial robustness and standard accuracy are often in tension — there is no current general solution.


Quiz 7: Adversarial Attacks

5 questions — free, untracked, retake anytime.

Prompt injection attacks work by:

✅ Prompt injection: adversarial instructions embedded in untrusted data (emails, docs) that the LLM treats as instructions.

❌ Prompt injection hides instructions in data the model processes, causing it to execute them instead of its intended behavior.

What makes adversarial examples in computer vision particularly dangerous?

✅ Humans see a normal image while the model sees something completely different — with high confidence.

❌ The danger is imperceptibility: tiny pixel changes invisible to humans flip a model's classification.

Data poisoning is most dangerous because:

✅ The backdoor is invisible in normal operation — it only activates when the attacker presents the specific trigger pattern.

❌ Data poisoning inserts a trigger during training — invisible until the attacker uses the specific trigger input in deployment.

Why is adversarial robustness difficult to achieve alongside high accuracy?

✅ Empirical accuracy-robustness tradeoff: features that maximize clean accuracy tend to be brittle — sensitive to adversarial perturbations.

❌ Research shows a fundamental tension: optimizing for standard accuracy creates representations more susceptible to adversarial attack.

Model inversion attacks are designed to:

✅ Model inversion exploits the fact that outputs leak information about training data — attackers can reconstruct approximate representations of private training examples.

❌ Model inversion uses outputs to reconstruct training data — a significant privacy risk when models are trained on sensitive personal records.


Lab 7: Adversarial Defense Architecture

Design safeguards against prompt injection for an LLM-powered system.

Lab 7 — Adversarial Defense

You're architecting an LLM customer service system with access to account data. Design prompt injection defenses.

What architectural safeguards minimize prompt injection risk?
Work through input sanitization, privilege separation, output filtering.
Address the fundamental challenge: the model can't reliably distinguish trusted instructions from injected ones.

No perfect solution exists. Goal: defense in depth — multiple layers that raise the cost of successful attack.


Lesson 8: Evaluating and Auditing AI Systems

Benchmarks, third-party audits, red-teaming at scale, and the limits of evaluation.

The Evaluation Problem

Benchmark saturation: Once a benchmark becomes a standard, models are optimized for it and it ceases to measure what it was designed to measure.
Distribution mismatch: Benchmark performance does not reliably predict real-world deployment performance.
Emergent capabilities: Capabilities appear suddenly at scale and are not predictable from smaller-scale evaluations.
Gaming: Training on benchmark-adjacent data inflates scores without improving underlying capability.
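
A rough way to look for the gaming problem above is to check whether benchmark items appear nearly verbatim in training text. The sketch below measures word n-gram overlap. The example strings, the choice of n = 5, and the function names are assumptions for illustration, and a real contamination check would need more careful tokenization and normalization.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, training_doc, n=5):
    """Fraction of the benchmark item's n-grams that also occur in the training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


# Invented strings for illustration.
benchmark_item = "what is the capital city of france and its population"
leaked_doc = "trivia: what is the capital city of france and its population today"
clean_doc = "the river flows north through three countries before reaching the sea"

leaked = overlap_fraction(benchmark_item, leaked_doc)
clean = overlap_fraction(benchmark_item, clean_doc)
print(f"leaked doc overlap: {leaked:.2f}")
print(f"clean doc overlap:  {clean:.2f}")
```

High overlap suggests the model may have seen the test item during training, in which case its benchmark score says little about the underlying capability.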

Audit Methodologies

Black-box auditing: Probing outputs without access to weights or training data. Used by journalists, researchers, regulators.
White-box auditing: Full access to weights and training pipeline. Requires developer cooperation or regulatory mandate.
Red-teaming: Structured adversarial testing — increasingly standard for frontier models.
Third-party audits: Independent evaluation — rare due to IP concerns and lack of regulatory mandates.
Participatory evaluation: Involving affected communities in designing evaluations and interpreting results.
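
Black-box auditing, the first method in the list above, often comes down to paired probing: send inputs that differ in a single term and compare the outputs. In the sketch below, `score_resume` is a hypothetical stand-in for an opaque model, with a bias planted on purpose so the probe has something to find. The template and variants are also invented.

```python
def score_resume(text):
    """Hypothetical opaque model under audit. The penalty is planted for this demo."""
    score = 0.5
    if "chess" in text:
        score += 0.2
    if "women's" in text:
        score -= 0.3
    return score


def paired_probe(model, template, variants):
    """Score inputs that are identical except for one substituted term."""
    return {variant: model(template.format(variant)) for variant in variants}


results = paired_probe(
    score_resume,
    template="Captain of the {} chess club",
    variants=["women's", "men's"],
)
gap = results["men's"] - results["women's"]

for variant, value in results.items():
    print(f"{variant:>8}: {value:.2f}")
print(f"score gap: {gap:.2f}")
```

The auditor never sees weights or training data. A consistent gap across many such pairs is evidence of disparate treatment, although it cannot explain where in the model the behavior comes from.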


Quiz 8: Evaluating AI Systems

5 questions — free, untracked, retake anytime.

Benchmark saturation occurs when:

✅ Goodhart's Law in ML: optimize for the metric and it ceases to measure genuine capability.

❌ Benchmark saturation: models optimize specifically for a benchmark, inflating scores without improving the underlying capability.

Black-box auditing differs from white-box auditing in that:

✅ Black-box: outputs only, no internal access. White-box: full access to model internals. Most external auditors can only do black-box work.

❌ Black-box = outputs only. White-box = full access to weights, architecture, and training pipeline.

Third-party AI audits are rare because:

✅ Meaningful audits require proprietary information companies protect as IP. Without legal mandates to grant access, the audit ecosystem stays thin.

❌ The barrier is access: proprietary weights and training data are IP. Without regulatory requirements, third-party auditing remains rare.

Participatory evaluation involves:

✅ Affected communities as co-designers of evaluation criteria — not just test subjects, but active participants in defining what fairness means.

❌ Participatory evaluation: affected communities help design what gets evaluated and how — active participants, not just subjects.

Why don't validation benchmarks reliably predict real-world deployment performance?

✅ Distribution shift + underspecification: the validation environment doesn't match deployment, and equally-performing models may diverge in real-world conditions.

❌ Two compounding problems: distribution mismatch and underspecification — standard metrics simply don't capture real-world robustness.


Lab 8: Audit Framework Design

Design a mandatory third-party audit framework for high-stakes AI deployments.

Lab 8 — Audit Framework Design

You're advising a regulatory body on mandatory AI audit requirements. Design the framework.

Should regulators mandate third-party audits for high-stakes AI?
Design audit criteria for a recidivism prediction tool.
Address the IP tension: how much access must be granted for the audit to be meaningful?

Consider: black-box vs white-box requirements, audit trigger thresholds, and what "passing" an audit means.


Module 3 Test

10 questions covering all 8 lessons. Free, untracked, retake anytime.

A model performs correctly in California but fails on Finnish winter roads. This is:

✅ Distribution shift: deployment environment falls outside the training distribution.

❌ Distribution shift — deployed in conditions it wasn't trained on.

The Impossibility Theorem proves fairness definitions are:

✅ Mathematical proof — every fairness metric embeds a value tradeoff.

❌ Mathematical proof: calibration, equal FPR, equal FNR cannot all be satisfied when base rates differ.

Prompt injection exploits:

✅ The model treats all text as potential instructions — it can't reliably flag adversarial content embedded in data it processes.

❌ Prompt injection works because the model can't reliably distinguish system instructions from adversarial content in user data.

Automation bias undermines human-in-the-loop oversight because:

✅ When humans rubber-stamp AI decisions, formal accountability remains but substantive oversight disappears.

❌ Automation bias: humans defer — oversight exists on paper but not in practice.

Which hallucination type is most dangerous in a legal context?

✅ Citation hallucination: perfectly formatted fake references that only someone who verifies the source would catch — as Mata v. Avianca demonstrated.

❌ Citation hallucination: perfectly formatted fake references, hardest to detect without domain expertise.

Proxy bias allows discrimination to persist after a protected attribute is excluded because:

✅ Correlated proxies carry the discriminatory signal forward — excluding "race" while keeping ZIP code still produces racially disparate outcomes.

❌ Proxy bias: correlated variables carry the protected attribute's signal even when it's excluded.

Model cards are primarily designed to:

✅ Model cards are transparency artifacts: what a model is for, how it performs across subgroups, and known limitations.

❌ Model cards document what a model is, how it was evaluated, known limitations, and appropriate use contexts.

Underspecification is a problem because:

✅ Equally-scoring models may behave very differently in deployment — and you can't tell from standard evaluation alone.

❌ Underspecification: validation can't distinguish good generalizers from poor ones.

COMPAS disproportionately flagged Black defendants because:

✅ Proxy bias via prior arrests — over-policed communities generate more arrests, which the model treated as evidence of higher recidivism risk.

❌ Proxy bias: prior arrests proxied for race because of differential policing intensity.

Which audit type is most capable of detecting data poisoning backdoors?

✅ Backdoors are invisible in normal operation and very hard to find through output probing unless the auditor happens to hit the trigger, so white-box access to training data and weights is needed.

❌ Backdoors can't be detected through output probing alone — white-box access to training data and model internals is needed.
