Module 7 · Lesson 1

What Is an AI Claim, Really?

Before you can audit anything, you need to know what you're actually auditing — and the line between a fact, a forecast, and marketing is blurrier than it looks.

When someone says "AI can do X," what exactly are they claiming — and who benefits if you believe it?

Your friend Dani forwards you a LinkedIn post. The headline reads: "AI is now better than the average radiologist at detecting cancer." Dani is a pre-med sophomore, and she's genuinely rattled. "Should I even finish the degree?" she texts you. Below the headline is a graph, a company logo, and a quote from a TED talk. It looks authoritative. It feels like a fact.

The thing is, you've seen this headline before, or something like it. It resurfaces every eighteen months or so. Sometimes it's radiology. Sometimes it's legal research, or coding, or therapy. The specifics change; the structure of the claim stays the same: AI has surpassed humans at [high-status profession]. And every time, people make real decisions based on it: whether to stay in school, which skills to develop, whether a career is worth pursuing.

That's not a hypothetical risk. That's what's happening right now, in your peer group, with real consequences. So before we get to whether any specific AI claim is true, we need to get clear on what kind of thing an AI claim actually is.

The Taxonomy of AI Claims

AI claims aren't a single type of statement. They get blurred together constantly — in headlines, in pitch decks, in Twitter threads — but they operate very differently and require different kinds of scrutiny. There are roughly four categories worth keeping separate in your head.

Capability claims say AI can do something: "GPT-4 passed the bar exam." These are empirically testable in principle, but the details matter enormously — which version, on what test, under what conditions, compared to whom. The radiologist headline above is a capability claim. So is "our AI writes better marketing copy than humans." The claim sounds concrete but usually hides a metric choice made by whoever ran the study.

Adoption claims say AI is being used at scale: "78% of Fortune 500 companies are deploying AI tools." These are survey-based, self-reported, and definitionally murky — "deploying" can mean anything from a chatbot on an HR portal to genuine process automation. The number sounds precise because it has a percent sign attached to it.
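
To see how much the definition of "deploying" can move the number, here is a minimal sketch using made-up survey answers from ten hypothetical companies. The `SurveyResponse` fields, the data, and the three definitions are all illustrative assumptions, not drawn from any real survey.

```python
from dataclasses import dataclass


@dataclass
class SurveyResponse:
    company: str
    any_employee_uses_ai: bool    # e.g. one person using a writing assistant
    has_ai_chatbot: bool          # e.g. a chatbot on an HR portal
    ai_in_core_operations: bool   # genuine process automation


def adoption_rate(responses, counts_as_deploying):
    """Percent of companies 'deploying AI' under one definition of the word."""
    hits = sum(1 for r in responses if counts_as_deploying(r))
    return 100 * hits / len(responses)


# Hypothetical answers: (any employee, chatbot, core operations) per company.
RESPONSES = [
    SurveyResponse("A", True, True, True),
    SurveyResponse("B", True, True, True),
    SurveyResponse("C", True, True, False),
    SurveyResponse("D", True, True, False),
    SurveyResponse("E", True, True, False),
    SurveyResponse("F", True, False, False),
    SurveyResponse("G", True, False, False),
    SurveyResponse("H", True, False, False),
    SurveyResponse("I", True, False, False),
    SurveyResponse("J", False, False, False),
]

# Three plausible readings of "deploying", from loosest to strictest.
DEFINITIONS = {
    "any use by anyone": lambda r: r.any_employee_uses_ai,
    "at least a chatbot": lambda r: r.has_ai_chatbot,
    "core operations": lambda r: r.ai_in_core_operations,
}

if __name__ == "__main__":
    for name, rule in DEFINITIONS.items():
        print(f"{name}: {adoption_rate(RESPONSES, rule):.0f}%")
```

Same ten companies, same answers: the headline figure comes out as 90%, 50%, or 20% depending only on which definition the survey writer picked.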

Trajectory claims say AI is heading somewhere: "AI will replace 40% of jobs by 2030." These are forecasts dressed up as facts. Major research institutions have published wildly different numbers. The range is so wide (some studies say 10%, others say 60%) that citing any single figure as authoritative is, at minimum, selective.

Value claims say AI will create or destroy economic value: "AI will add $15.7 trillion to the global economy." These almost always come from consulting firms whose business model depends on companies buying AI strategy services. That doesn't make them false. But it's relevant information.

Why This Matters Now

Most of your peers aren't sorting AI claims into categories. They're treating all of them as roughly equivalent to "facts I saw online." That's not a judgment — it's the default when you're moving fast and information is everywhere. The sorting habit is a learnable skill, and it takes maybe ten seconds per claim once you have the framework.

The Source Layer: Who Benefits?

Every AI claim has an origin point, and that origin point has interests. This doesn't mean every claim is propaganda — but it does mean the interests are relevant to your evaluation.

When OpenAI publishes a benchmark showing GPT-4 outperforms other models, they have a direct financial incentive for that result. That doesn't invalidate the benchmark. But it should prompt you to ask: was this independently replicated? Did third parties run the same test and get the same result? Pharmaceutical companies run their own drug trials too — and we still require independent verification before approving drugs. Why would AI benchmarks be different?

When a management consulting firm publishes a report saying AI will create enormous economic value, notice that their primary product is advising corporations on technology adoption. The report drives business. Again — it might be accurate. But "McKinsey says AI will add $4.4 trillion in value" is not a neutral scientific finding.

When a tech journalist writes a breathless headline about AI surpassing humans, notice that dramatic framing drives clicks and subscriptions. Nuanced, technically accurate headlines ("AI performs comparably to median radiologists on a specific subset of chest X-rays under controlled lab conditions") don't go viral. The incentive structure of media systematically amplifies the most dramatic version of any AI story.

Practical Move

The next time you encounter an AI claim (in your feed, in class, from a recruiter), pause and ask two questions before you form an opinion. First: what type of claim is this (capability, adoption, trajectory, or value)? Second: who produced it, and what do they get if I believe it? These two questions take fifteen seconds and will immediately put you ahead of most people in the conversation.

The Benchmark Problem

Let's go back to the radiologist headline. When researchers say "AI outperforms radiologists," they almost always mean: on a specific benchmark dataset, using a specific performance metric, compared to a specific group of radiologists, under specific testing conditions. Each of those specifics can be chosen in ways that favor the outcome the researchers are hoping to demonstrate.

The benchmark dataset might consist of clear, well-labeled images rather than the ambiguous ones that cause actual diagnostic difficulty. The performance metric might be sensitivity (the share of real cancers the model catches) while ignoring specificity (the share of healthy cases it correctly clears, which determines the false alarm rate). The comparison group might be radiologists working in low-resource settings, not trained specialists with full clinical context. The testing conditions remove the conversation with the patient, the clinical history, and the prior scans: the things that actually make a radiologist more accurate than an algorithm.

None of this means the AI isn't impressive, or that Dani should finish her pre-med degree without thinking hard about AI's trajectory. The technology is genuinely advancing fast. But "AI outperforms radiologists" as a headline is doing a lot of work that the underlying study cannot support. Understanding that gap, between what the benchmark measured and what the headline claims, is the core skill this module is building.

Benchmark Overfitting

When AI systems are optimized specifically to perform well on a particular test, so their benchmark scores become a poor proxy for real-world performance. Classic case: an AI that aces an LSAT practice set but fails on genuinely novel legal reasoning tasks.

Metric Selection Bias

Choosing the performance metric that most favorably represents your model's strengths while downplaying the metrics where it underperforms. Not always intentional — researchers naturally report what looks impressive.
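
A small worked example can make the metric choice concrete. The sketch below uses invented counts for a hypothetical screening model run on 1,000 scans, 50 of which contain cancer. The function name and every number are assumptions for illustration only.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute common metrics from confusion-matrix counts.

    tp: cancers correctly flagged      fn: cancers missed
    fp: healthy scans wrongly flagged  tn: healthy scans correctly cleared
    """
    return {
        "sensitivity": tp / (tp + fn),   # share of real cancers caught
        "specificity": tn / (tn + fp),   # share of healthy scans cleared
        "precision": tp / (tp + fp),     # share of flags that are real
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical model that flags aggressively: 950 healthy scans, 50 cancers.
metrics = screening_metrics(tp=48, fp=285, tn=665, fn=2)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.1%}")
```

A press release could truthfully say this model catches 96% of cancers. It would be leaving out that 30% of healthy patients get a false alarm, and that only about one in seven flagged scans is a real cancer.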

Distribution Shift

When an AI trained and tested on one type of data is deployed on a different type of data. The benchmark score doesn't transfer. A radiology AI trained on high-quality images from a top hospital can fail badly when deployed in a lower-resource setting with different equipment.
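
Here is a toy illustration of distribution shift, under heavy simplifying assumptions: each scan is reduced to a single made-up "contrast" score, the model is just a learned cutoff, and the clinic's older equipment is assumed to read 0.3 lower across the board. None of these numbers come from a real system.

```python
def fit_threshold(examples):
    """Learn a cutoff halfway between the mean scores of the two classes."""
    pos = [x for x, label in examples if label == 1]
    neg = [x for x, label in examples if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(examples, threshold):
    """Share of examples where 'score above threshold' matches the label."""
    correct = sum(1 for x, label in examples if (x > threshold) == (label == 1))
    return correct / len(examples)


# Hypothetical (score, label) pairs from a well-equipped hospital.
TOP_HOSPITAL = [
    (0.30, 0), (0.35, 0), (0.40, 0), (0.45, 0),
    (0.80, 1), (0.85, 1), (0.90, 1), (0.95, 1),
]

# The same cases as seen by older equipment that reads 0.3 lower (assumed).
COMMUNITY_CLINIC = [(round(x - 0.3, 2), label) for x, label in TOP_HOSPITAL]

threshold = fit_threshold(TOP_HOSPITAL)

if __name__ == "__main__":
    print(f"hospital accuracy: {accuracy(TOP_HOSPITAL, threshold):.1%}")
    print(f"clinic accuracy:   {accuracy(COMMUNITY_CLINIC, threshold):.1%}")
```

The cutoff that scores 100% on the hospital data scores 62.5% at the clinic, even though the patients are identical. Nothing about the model changed; only the data it sees did.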

Starting Your Audit: The First Three Questions

This module is built around a practical audit framework you'll apply to real claims — ones you've actually encountered this week, not hypotheticals invented for a classroom. The framework builds across all four lessons. Here's where it starts.

When you encounter an AI claim, the first three questions to ask are:

1. What type of claim is this? Capability, adoption, trajectory, or value? Getting clear on category helps you know what kind of evidence would actually validate or invalidate it.
2. What is the original source? Not the article you read about it, but the actual study, report, or data. Can you find it? If not, that's relevant. If yes, who funded it?
3. What would have to be true for this claim to be accurate? Work backward from the headline to the conditions. "AI is better than radiologists" requires: better at what task, measured how, in what context, at what cost, with what error profile? If those conditions are never specified, the claim is underdetermined: it sounds like a fact but isn't one yet.
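
If it helps to think of the three questions as a form you fill in, the sketch below records one audit and lists what is still unanswered. The `ClaimAudit` class, its field names, and the `REQUIRED_CONDITIONS` keys are hypothetical names chosen for this example; they simply mirror the questions above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ClaimType(Enum):
    CAPABILITY = "capability"
    ADOPTION = "adoption"
    TRAJECTORY = "trajectory"
    VALUE = "value"


# The conditions question 3 asks you to pin down (names are illustrative).
REQUIRED_CONDITIONS = ("task", "metric", "context", "cost", "error_profile")


@dataclass
class ClaimAudit:
    claim: str
    claim_type: ClaimType                      # question 1
    original_source: Optional[str] = None      # question 2
    funder: Optional[str] = None               # question 2, follow-up
    conditions: dict = field(default_factory=dict)  # question 3

    def open_questions(self):
        """List everything the audit has not yet answered."""
        gaps = []
        if self.original_source is None:
            gaps.append("original source not found")
        elif self.funder is None:
            gaps.append("funder unknown")
        gaps += [
            f"unspecified: {name}"
            for name in REQUIRED_CONDITIONS
            if name not in self.conditions
        ]
        return gaps


if __name__ == "__main__":
    audit = ClaimAudit(
        claim="AI is now better than the average radiologist at detecting cancer",
        claim_type=ClaimType.CAPABILITY,
    )
    for gap in audit.open_questions():
        print(gap)
```

A headline with nothing filled in produces six open questions. The claim starts to look like a fact only as that list shrinks.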

In the next three lessons, we'll add layers: evaluating the evidence itself, auditing the deployment claim (real-world performance, not lab performance), and synthesizing a personal judgment that holds up under pushback. The goal isn't skepticism as a lifestyle — it's calibration. Knowing when a claim deserves your confidence and when it deserves your patience.

Lesson 1 Quiz

What Is an AI Claim, Really? — 5 questions

1. A consulting firm publishes a report saying "AI will add $15.7 trillion to the global economy by 2030." Which category of AI claim is this?

Right — this is a value claim. It predicts economic impact, not capability or current usage. Worth noting: the firm publishing it sells AI strategy consulting. That doesn't make it false, but it's relevant context.

Not quite. A value claim specifically predicts economic impact — how much money AI will create or destroy. This one has a dollar figure attached to a future date, which is the signature of a value claim. Capability claims describe what AI can do; adoption claims describe current usage; trajectory claims describe directional change without necessarily specifying dollar amounts.

2. A headline reads: "AI passes the bar exam in the top 10% of test-takers." Your roommate says this means AI can now do lawyering. What's the most precise thing wrong with that inference?

Exactly. This is the benchmark-to-real-world gap. Lawyering involves client relationships, ethical judgment, novel fact patterns, negotiation, courtroom improvisation, and institutional navigation — none of which the bar exam tests. The AI passed a standardized knowledge test, not "law."

The core problem isn't about the exam's difficulty or whether the headline is fabricated (GPT-4 did pass the bar exam). The issue is that a standardized test measures a specific, narrow skill set under controlled conditions. Real legal practice involves client relationships, novel fact patterns, ethical judgment, and courtroom dynamics that the test simply doesn't capture.

3. "78% of Fortune 500 companies are deploying AI tools" — what is the most important word to interrogate in that claim?

Yes. "Deploying" can mean anything from a customer service chatbot to full process automation across core operations. If deploying means "at least one team used an AI tool in the past year," 78% sounds low. If it means "AI is driving core business decisions," 78% sounds impossibly high. The definition isn't neutral.

"Deploying" is the key word because it's doing all the definitional work. If deploying means "any use of any AI tool by any employee," even a single person using Grammarly might count. If it means "AI integrated into core operational decisions," the number would be much smaller. Surveys like this rarely define their terms precisely, which makes the percentage almost meaningless without that context.

4. You find a study showing that an AI correctly classified 94% of skin lesions as benign or malignant. The study was funded by the AI company that built the classifier. A dermatologist friend says "that's still impressive." Who's more right?

This is the calibrated answer. Company-funded research isn't automatically fraudulent — but it has a track record of favorable outcomes due to study design choices (dataset selection, metric choice, comparison groups) that independent researchers often can't replicate. 94% is potentially impressive and potentially cherry-picked. Both are true until independent replication happens.

The calibrated position is that both points have merit. The 94% figure may reflect genuine capability — but funding sources influence study design in ways that aren't always visible. Dataset selection, comparison groups, and metric choices can all be set up to maximize the reported accuracy. Independent replication is the standard that separates "impressive internal result" from "trustworthy clinical finding."

5. What does "distribution shift" mean in the context of AI benchmark results?

Correct. Distribution shift is one of the most common reasons impressive benchmark scores don't translate to impressive real-world performance. The radiology AI trained on ideal images from major research hospitals may perform badly at community clinics with older equipment and different patient demographics.

Distribution shift refers specifically to the gap between training/test data and real-world deployment data. If an AI is trained and benchmarked on carefully curated, high-quality data, its accuracy on messy, varied real-world inputs will often be significantly lower — even though the benchmark score looked strong.

Lab 1: Claim Taxonomy Field Test

Bring a real AI claim you've seen this week. Classify it. Defend your classification.

Your Role: Claim Auditor

You've seen AI claims this week — in your feed, in a class lecture, from a recruiter, on a company website, in a group chat. Pick one that actually caught your attention. It doesn't have to be dramatic. It just has to be real.

Describe the claim to the AI below. Your analyst will ask you to classify it by type (capability, adoption, trajectory, or value), identify the source, and explain what you'd need to know to evaluate it properly. The analyst will push back on vague answers — that's the point.

Start by describing the AI claim you encountered: where you saw it, roughly what it said, and what your initial reaction was. Be specific — "AI is getting better" isn't a claim, it's a vibe.

Claim Analyst

Tell me the AI claim you ran into this week. Give me the actual wording if you can remember it, or as close as you can get. Where did you see it — LinkedIn, a class, a YouTube ad, your group chat? And what was your gut reaction when you read it?

Module 7 · Lesson 2

Going to the Source: Evidence Quality and Red Flags

A claim is only as strong as its evidence — and most AI claims float entirely free of their actual evidence base.

If you can't find the original study in five minutes of searching, what does that tell you?

Marcus is a junior majoring in computer science. His professor mentions in lecture that "studies show AI coding assistants increase developer productivity by up to 55%." Marcus goes home and puts this in his cover letter for a summer internship: "I'm well-versed in tools that have been shown to increase developer productivity by over 50%." He gets the interview. The recruiter asks him about that stat. He can't name the source. He'd never looked it up — why would he? The professor said it.

The stat, for what it's worth, comes from a 2022 GitHub study about GitHub Copilot. The study was run by GitHub, which is owned by Microsoft, which owns a major stake in OpenAI, which makes Copilot possible. It measured how fast programmers completed a specific, pre-defined coding task in a controlled environment. It didn't measure code quality, debugging time, integration errors, or what happened when the AI generated plausible-looking but incorrect code that took an hour to find. It also didn't measure productivity across a full workday or a full project. The 55% figure is a real number from a real study, but it describes speed on one task, which is not what most people mean when they say productivity.

Marcus got lucky — the recruiter didn't press hard. But the experience rattled him. He'd cited a claim he couldn't defend, built on a study he'd never read, that measured something narrower than the claim implied. That's the gap this lesson is about.

The Evidence Hierarchy

Not all evidence is equal, and in the AI space, the quality distribution is particularly skewed toward the weak end. Here's a rough hierarchy, from strongest to weakest:

1. Independent replication: The study was run by researchers with no financial stake in the outcome, and it confirmed a result first reported by the company or by researchers with a stake in it. This is rare for AI claims. It's the gold standard.
2. Peer-reviewed academic study: Published in a journal with independent review. Not perfect, since peer review misses plenty of methodological problems, but it at least means someone checked the work. Note: many AI papers bypass journals entirely and live only on arXiv as preprints.
3. Company-funded academic study: Real researchers, real methods, but the funding relationship creates incentives that can shape study design. Should be treated as preliminary until replicated.
4. Internal company report: The company ran the study, published the results, controls the data. Useful for understanding what the product can do under ideal conditions. Not trustworthy as a neutral performance evaluation.
5. Analyst report or consulting paper: McKinsey, Goldman Sachs, Gartner, etc. These are often aggregated from other sources with added modeling. The "primary source" is often another report, not original data.
6. Journalist summary: May or may not accurately represent even the analyst report it's citing. Often the headline diverges significantly from the actual finding.
7. Social media post / influencer claim: Could be anything. Often the telephone-game endpoint of a chain that started with a real study and got amplified and simplified at each step.

Peer Reality Check

When your classmates cite an AI stat, most of them are citing a journalist's summary of a consulting report that aggregated internal company data. The claim may be four or five steps removed from anything that was actually measured. You're in that same position most of the time — the difference is knowing it.

Five Red Flags in AI Evidence

You don't need a PhD to spot weak evidence. These five patterns show up constantly in AI claims and take about thirty seconds to check for once you know what you're looking for.

1. No sample size or demographic info. "AI outperformed doctors": how many doctors? From where? On how many cases? A study of 50 images from one hospital is not the same as a study of 50,000 images from 200 hospitals. Headlines leave this information out by design, so look for it in the study itself.
2. Cherry-picked metric. When a study reports accuracy but not false positive rate, or reports speed but not error rate, or reports top-line performance but not performance on edge cases, that asymmetry is a signal. Good studies report the metrics that matter, not just the ones that look good.
3. No comparison baseline. "AI achieves 92% accuracy" is meaningless without knowing what humans achieve on the same task. If the human baseline is 97%, the AI result is underwhelming. If it's 71%, the AI result is impressive. The comparison baseline is essential and is often omitted.
4. Controlled lab conditions described as real-world performance. Lab conditions are designed to isolate variables: they strip away the noise and complexity of actual deployment. When a study claims real-world applicability from lab data without deployment validation, that's a flag. The GitHub productivity study is a classic example.
5. The claim gets bigger as it travels. Track the claim backward if you can. If the original study says "comparable to" and the consultant report says "outperforms" and the article says "replaces," you're watching simplification and amplification in real time. The original claim is almost always more hedged than the version you saw.
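
The five red flags can be written down as a rough checklist. The sketch below is one possible encoding, not a validated scoring method: the dictionary keys (`sample_size`, `metrics_reported`, and so on) are hypothetical field names, and the "fewer than two metrics" rule is a crude stand-in for judging whether a metric was cherry-picked.

```python
# Each check takes a dict describing what you found and returns True if flagged.
RED_FLAG_CHECKS = {
    "no sample size or demographic info":
        lambda s: s.get("sample_size") is None,
    "cherry-picked metric":
        lambda s: len(s.get("metrics_reported", [])) < 2,
    "no comparison baseline":
        lambda s: s.get("human_baseline") is None,
    "lab conditions described as real-world":
        lambda s: s.get("claims_real_world", False)
        and not s.get("deployment_validated", False),
    "claim got bigger as it traveled":
        lambda s: s.get("headline_verb") != s.get("study_verb"),
}


def red_flags(study):
    """Return the names of every red flag the study description trips."""
    return [name for name, check in RED_FLAG_CHECKS.items() if check(study)]


# A made-up study description, filled in after reading the methods section.
EXAMPLE_STUDY = {
    "sample_size": 50,
    "metrics_reported": ["accuracy"],
    "human_baseline": None,
    "claims_real_world": True,
    "deployment_validated": False,
    "study_verb": "comparable to",
    "headline_verb": "replaces",
}

if __name__ == "__main__":
    for flag in red_flags(EXAMPLE_STUDY):
        print("flag:", flag)
```

The made-up study trips four of the five flags. It reports a sample size, so only the first check passes. A count like this does not prove a claim is wrong; it tells you how much checking is left to do.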

How to Actually Find the Source in Under Five Minutes

Most people don't go to the source because it feels like a research project. It doesn't have to be. Here's a five-minute process that works for most AI claims you'll encounter.

Step 1: Copy the specific number or named result from the claim ("55% productivity increase," "94% accuracy on skin lesions," "top 10% of bar exam takers") and search it directly in Google Scholar or just Google. The original study usually surfaces in the first few results when you search the specific metric rather than a paraphrased headline.

Step 2: Find the "Methods" section of the study. This is where the study design lives: the sample, the comparison, the metric definitions. If the study is paywalled, the abstract usually tells you enough. If even the abstract is inaccessible, check arXiv, which hosts free preprints of many AI papers.

Step 3: Check the funding disclosure, usually at the end of the paper. It will say something like "this research was supported by a grant from [company]." That's your conflict-of-interest flag.

Step 4: Search for independent coverage or replication. Has another research group run a similar study? Has a science journalist written a skeptical follow-up? Search the study title plus "criticism," "replication," or "limitations."

If you can't locate a source after five minutes of actual searching, that itself is information. A claim with no traceable origin deserves zero confidence as a factual statement — it can still be an interesting hypothesis worth entertaining, but it shouldn't be cited as evidence.
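
If you want to make Steps 1 and 4 mechanical, a tiny helper can build the search strings for you. This is only a convenience sketch: the function name is made up, and it produces plain query text to paste into whichever search engine you use rather than calling any search API.

```python
FOLLOW_UP_TERMS = ("criticism", "replication", "limitations")


def source_hunt_queries(metric_phrase, study_title=None):
    """Build search strings for Step 1 and, once a title is known, Step 4."""
    # Step 1: search the exact number or named result, in quotes.
    queries = [f'"{metric_phrase}"']
    # Step 4: search the study title plus each follow-up term.
    if study_title:
        queries += [f'"{study_title}" {term}' for term in FOLLOW_UP_TERMS]
    return queries


if __name__ == "__main__":
    # "Example Study Title" is a placeholder, not a real paper.
    for q in source_hunt_queries("55% productivity increase", "Example Study Title"):
        print(q)
```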

The Practical Takeaway

Before you repeat an AI statistic — in a cover letter, in a class discussion, in a conversation with a recruiter or professor — spend five minutes locating the original source. You don't have to read the whole paper. You just need to be able to say: "This came from [study], funded by [who], measuring [what], in [what context]." That sentence is the difference between citing a claim and understanding it.

The Amplification Chain: How Claims Grow

There's a predictable lifecycle to AI claims. A research team publishes a paper with careful hedging: "Under controlled conditions, our model showed statistically significant improvement over baseline radiologist performance on a subset of chest X-ray classifications." That gets picked up by a science journalist who writes: "New AI beats doctors at reading X-rays." That gets shared on LinkedIn with the caption: "AI is replacing radiologists. Are you ready?" That becomes the version your friend forwards you, fully severed from the hedging that was in the original.

Each step in this chain is often done in good faith. The journalist isn't lying — they're simplifying for a general audience. The person who shared it on LinkedIn isn't fabricating — they're reacting to the article. But the cumulative effect is a claim that's traveled so far from its origins that it no longer represents what the researchers actually found. This is not a flaw in the system — it's a feature of how media and social networks work. Alarming, simple, consequential-sounding content travels faster than nuanced, caveated, methodologically careful content. Always has. The AI domain just gives this dynamic particularly high stakes because the claims are being used to make real decisions about careers, education, and policy.

Lesson 2 Quiz

Evidence Quality and Red Flags — 5 questions

1. You find an AI performance claim that traces back to an internal company report (no external peer review). What is the most accurate characterization of that evidence?

Right. Internal reports aren't worthless — they often use real data and real methods. But without external review or independent replication, you can't rule out that the study design was optimized to produce favorable results. "Useful but preliminary" is the calibrated position.

The calibrated position is that company research is useful as preliminary evidence but shouldn't be treated as a neutral finding. Companies have real incentives to design studies that support favorable conclusions — not through fraud, but through choices about datasets, metrics, and comparison baselines. Independent replication is what graduates the result from "interesting" to "trustworthy."

2. A study reports that an AI tutoring system improved test scores by 23% — but only reports accuracy, not how long students spent using it or how their performance compared to a control group. Which red flag does this illustrate?

Both apply. "23% improvement" without a baseline (what did the non-AI group score?) tells you nothing about whether AI caused the improvement. And reporting only accuracy without time-on-task or error rate hides important costs. Both omissions are convenient for the company presenting the result.

The key issues here are missing comparison baseline (no control group means we can't attribute the improvement to the AI) and selective metric reporting (time-on-task and error rate matter as much as accuracy when evaluating a tutoring tool). The 23% improvement could be real — but without these pieces, you can't know.

3. You search for the original source of an AI productivity claim and after five minutes of searching can't find any primary study — only articles referencing other articles. What does this tell you?

Exactly. A claim with no traceable source isn't necessarily false — but it can't be cited as evidence of anything. It might be a guess, a misremembered number, a telephone-game distortion, or a marketing fabrication. All of those possibilities require treating the claim as hypothesis, not fact.

No traceable source doesn't mean the claim is false or that a study was retracted. It means the claim has no evidentiary foundation you can evaluate. That's a specific kind of epistemic problem — you can't assess confidence without being able to examine the evidence. Treat it as an unverified hypothesis, not a fact.

4. A researcher publishes a paper on arXiv showing an AI model outperforms humans on a math reasoning task. The paper hasn't been peer-reviewed yet. What's the most accurate way to describe the status of this claim?

Right. arXiv is not peer-reviewed — it's a preprint server where researchers post work before (or instead of) formal review. Academic researchers can also have funding relationships, reputational incentives, and methodological blind spots. "Preliminary and interesting" is the correct posture, not dismissal and not full trust.

arXiv papers are preprints — they haven't been peer-reviewed. That doesn't make them worthless; some of the most important AI research appears there first. But it means the work hasn't been checked by independent experts. Academic researchers also aren't automatically neutral — they can have funding relationships and career incentives that shape their findings. "Preliminary" is the right word.

5. The original research finding is "AI is comparable to median radiologists on a specific subset of chest X-ray cases." The LinkedIn version is "AI is replacing radiologists." This illustrates which concept from Lesson 2?

Yes. "Comparable to median radiologists on a subset of cases" and "replacing radiologists" are dramatically different claims. Each retelling simplified and amplified. This isn't usually malicious — it's how media ecosystems function. Your job is to recognize when you're looking at the telephone-game endpoint rather than the original finding.

This is the amplification chain. "Comparable to median on a subset" became "replaces the profession" through a series of simplifications, each reasonable in isolation, collectively catastrophic for accuracy. Recognizing where you are in the chain — primary study vs. journalist summary vs. LinkedIn caption — is the critical skill.

Lab 2: Source Hunt

Track your claim back to its origin. Report what you find — and what you can't find.

Your Role: Evidence Investigator

Take the AI claim you identified in Lab 1, or pick a new one. Your task: spend five minutes trying to find the original source. Then report your findings to the analyst below — where you looked, what you found, what the original study actually says vs. what the claim implied, and what's still unclear.

The analyst will help you evaluate what you found and push back if you're being too credulous or too dismissive. The goal is calibration, not cynicism.

Tell me: what claim are you investigating, where did you search, and what did you actually find? If you hit a dead end, describe that too — "I couldn't find the source" is a legitimate and useful finding.

Evidence Analyst

What claim are you chasing, and what did you find when you went looking for the original source? Walk me through your search — what you tried, what came up, and what the actual study (if you found it) says compared to the headline version.

Module 7 · Lesson 3

Real-World vs. Lab: The Deployment Gap

A system that works brilliantly in a research paper can fail badly in the real world — and understanding why is the difference between informed optimism and expensive credulity.

When a company says their AI "works," what exactly are they claiming — and how would you know if it stopped working?

Priya lands a part-time job as a junior data analyst at a mid-size insurance company. In her first week, she notices that underwriters are still manually reviewing every auto insurance application flagged by the company's AI risk-scoring tool. She asks her manager why humans are still in the loop if the AI is supposed to handle this. The manager sighs. "The AI works great in the demo. In production, it flags about 40% of applications as high-risk. Most of them aren't. But the legal team said we can't just act on it without human review because of anti-discrimination rules."

So the AI is deployed. It runs on every application. The company reports to investors that they use "AI-driven underwriting." And every human underwriter is doing more work than they were before the AI was introduced, because now they're reviewing both the application and the AI's frequently wrong flag. The AI didn't replace anyone. It added a step. And nobody outside the company would know this from the press release, which reads: "We leverage advanced machine learning to streamline our underwriting process."

This is the deployment gap. Lab performance — or even a cherry-picked production metric — is not the same as real-world impact. Priya's company isn't lying in the press release. But the gap between "we use AI" and "AI is making us more efficient" is enormous, and it's a gap that almost no external claim about AI deployment will acknowledge.

Why Deployment Differs From the Lab

Lab conditions and real-world deployment differ in at least five structural ways that matter for evaluating AI claims.

Data quality. Labs use curated, clean data. Real-world data is messy, incomplete, inconsistently labeled, and often structured in ways that weren't anticipated when the model was trained. A customer service AI trained on formal ticket data will perform differently on voice transcripts, regional dialects, or rushed abbreviations that real customers use.

Edge cases. Labs optimize for average performance. Real-world deployment is defined by edge cases — the unusual inputs that happen 5% of the time but cause 50% of the serious errors. Benchmark accuracy of 95% sounds impressive until you realize the 5% error rate is concentrated in the most complex and consequential cases.
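
The arithmetic behind that last point is easy to check yourself. The sketch below uses invented results for 1,000 cases, 50 of them complex, and compares overall accuracy with accuracy per slice. The slice names and counts are assumptions chosen to mirror the 95% example above.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Share of all cases the model got right."""
    return sum(1 for _, ok in records if ok) / len(records)


def accuracy_by_slice(records):
    """Accuracy computed separately for each named slice of cases."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, ok in records:
        totals[slice_name] += 1
        correct[slice_name] += int(ok)
    return {name: correct[name] / totals[name] for name in totals}


# Hypothetical results: (slice, was the model correct?)
RECORDS = (
    [("routine", True)] * 940
    + [("routine", False)] * 10
    + [("complex", True)] * 10
    + [("complex", False)] * 40
)

if __name__ == "__main__":
    print(f"overall: {overall_accuracy(RECORDS):.1%}")
    for name, acc in accuracy_by_slice(RECORDS).items():
        print(f"{name}: {acc:.1%}")
```

The overall figure is 95%, which is what the benchmark headline would report. On the complex cases the same model is right only 20% of the time, and those 50 cases account for 40 of its 50 errors.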

Human-system interaction. Lab tests typically isolate the AI from human behavior. In deployment, humans interact with the AI in ways researchers didn't predict — they game it, they over-rely on it, they ignore it, they develop workarounds. These behavioral adaptations change what the system actually produces at scale.

Organizational friction. Real deployment involves legal compliance, IT integration, training, user adoption, and organizational politics. An AI that performs perfectly in isolation may be constrained by legal liability concerns (as in Priya's story), or hobbled by an IT infrastructure that can't support real-time inference, or ignored by employees who don't trust it.

Monitoring and drift. AI models degrade over time as the world changes and the data they were trained on becomes stale. Lab results don't capture this. A model that was 94% accurate at launch may be 78% accurate eighteen months later if the underlying data distribution shifted. Most companies don't have robust monitoring for this.
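
Monitoring for drift does not have to be elaborate. As a minimal sketch, assume the company can periodically measure accuracy on freshly labeled production cases (a real cost, and an assumption here), then flag any checkpoint that falls too far below the launch figure. The checkpoint values below are invented to match the 94% to 78% example, and the 5-point tolerance is arbitrary.

```python
def drift_alerts(checkpoints, tolerance=0.05):
    """Return (month, accuracy) pairs that fell more than `tolerance` below launch.

    `checkpoints` is a list of (months_since_launch, accuracy) pairs,
    with the launch measurement first.
    """
    launch_accuracy = checkpoints[0][1]
    return [
        (month, acc)
        for month, acc in checkpoints[1:]
        if launch_accuracy - acc > tolerance
    ]


# Hypothetical measurements on freshly labeled production data.
CHECKPOINTS = [(0, 0.94), (3, 0.93), (6, 0.91), (9, 0.88), (12, 0.84), (18, 0.78)]

if __name__ == "__main__":
    for month, acc in drift_alerts(CHECKPOINTS):
        print(f"month {month}: accuracy {acc:.0%} is well below launch")
```

With these numbers the first alert fires at month 9. A company that only ever quotes the launch figure would still be advertising 94% at month 18.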

The Peer Version of This Problem

When a classmate says "I use AI to write my essays and it's amazing," they're reporting lab-condition results on their best use case. What they're less likely to mention: the hours spent editing AI-generated prose that didn't match their voice, the times the AI confidently stated something factually wrong, the professor who noticed the writing style change. Self-reported AI productivity is almost always optimistic for the same reasons company press releases are.

Case Study: AI in Hiring

In 2018, Reuters reported that Amazon had scrapped an internal AI recruiting tool that had been in development since 2014. The system was designed to review resumes and score candidates. In testing, it appeared to work. But before the tool was ever trusted to make hiring decisions on its own, Amazon's own engineers discovered it was systematically downgrading resumes that included the word "women's" (as in "women's chess club") and downranking graduates of all-women's colleges. The model had learned from historical Amazon hiring data, which reflected the company's existing gender imbalance in technical roles. It was, in effect, trained to replicate past bias.

This is not an obscure corner case. It's a canonical example of what happens when a system that "works" on historical data is validated against that same historical data, which bakes in the problems the system was supposed to solve. Amazon's engineers caught it before the system went live. Most companies don't have the technical sophistication to run this kind of audit.

When you hear an AI claim about hiring efficiency, talent optimization, or candidate scoring, the relevant question isn't "is the AI accurate?" It's "accurate at predicting what — and is that what it should be predicting?" A model that predicts which candidates resemble past hires is accurate in the narrow sense. Whether "resembles past hires" is a valid proxy for "will be a good employee" is a different question entirely.

Proxy Variable Problem

The failure mode in which an AI optimizes for a measurable proxy of what you actually care about, rather than the thing itself. "Will perform like past hires" is a measurable proxy. "Will be a great employee" is what you care about. The gap between them is where bias, error, and strategic gaming live.
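
A tiny Python sketch makes the gap concrete. The eight candidates and both of their labels are invented, and `model_recommends` is a hypothetical stand-in for a model that learned only to match historical hiring. It is not a reconstruction of Amazon's system.

```python
# (resembles_past_hires, would_be_great_employee) for eight hypothetical candidates.
candidates = [
    (True, True), (True, True), (True, False), (True, False),
    (False, True), (False, True), (False, False), (False, False),
]


def model_recommends(resembles_past_hires):
    """Stand-in for a model that learned only to match historical hiring."""
    return resembles_past_hires


def agreement(predictions, labels):
    """Share of cases where the prediction matches the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


predictions = [model_recommends(resembles) for resembles, _ in candidates]
proxy_accuracy = agreement(predictions, [resembles for resembles, _ in candidates])
target_accuracy = agreement(predictions, [great for _, great in candidates])

print(f"accuracy against the proxy:  {proxy_accuracy:.0%}")
print(f"accuracy against the target: {target_accuracy:.0%}")
```

The same predictions score 100% against the proxy and 50% against the target in this toy data. Both numbers are "the accuracy" of the model; which one a vendor reports depends on which label they chose to measure against.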

Model Drift

The degradation of model performance over time as real-world data shifts away from the distribution the model was trained on. A credit risk model trained on 2019 data may perform poorly when deployed post-pandemic because borrower behavior changed. Drift is the rule, not the exception, and most systems don't monitor for it adequately.

Reading Deployment Claims Critically

When a company, employer, or institution claims to use AI, you're now equipped to ask questions that distinguish real deployment from marketing deployment. These questions aren't aggressive — they're the kind a thoughtful analyst or potential employee should ask.

What decision is the AI actually making? Is it making a final decision, or a recommendation that a human approves? "AI-driven" and "AI-assisted" are very different, and companies routinely conflate them in their public statements.
What happens when the AI is wrong? Who catches errors? How quickly? What's the error rate in production (not in the lab)? A system with no clear error-handling is a system that will systematically fail in undetected ways.
How is performance being measured after deployment? If there's no ongoing measurement, there's no way to know if the system is still working. Most companies have weaker post-deployment monitoring than they have pre-deployment testing.
What data is the system running on, and how was it collected? The Amazon case is a reminder that training data is never neutral. It always reflects the historical decisions, biases, and blind spots of whoever collected it.
What are the legal and liability constraints on the system? Particularly in regulated industries, legal constraints often severely limit what an AI system can actually do — even if the underlying model is capable. Priya's insurance company story is the norm, not the exception.

What You Can Actually Do With This

If you're interviewing for a job or internship and the company mentions AI systems, ask one of these questions. Not as a gotcha — as genuine curiosity. "How do you measure the system's performance after it's deployed?" or "What does the error-handling process look like?" A company that can answer these questions clearly is a company that's actually thinking about deployment. A company that can only show you the demo is a company that hasn't closed the loop between lab and reality.

The Good News Is Real

None of this is an argument that AI doesn't work. It does — in specific contexts, with good data, with appropriate human oversight, with honest measurement, and with realistic expectations about what "working" means. The deployment gap is real, but it's narrowing in some domains. Medical imaging AI is genuinely useful in clinical settings with appropriate human-in-the-loop validation. Code completion tools are saving developers real time on real tasks. Language models are making translation and accessibility tools dramatically better for underserved populations. These are meaningful gains, not hype.

The goal of this module isn't to make you skeptical of everything. It's to help you be skeptical of the right things — the gap between the claim and the evidence, the gap between the benchmark and the deployment, the gap between the press release and the analyst's notes. That calibration is what makes you useful to any organization or project that involves AI: you can see through the noise without dismissing the signal.

Lesson 3 Quiz

The Deployment Gap — 5 questions

1. A company press release says they use "AI-driven decision-making in their loan approval process." You later discover that every AI recommendation still requires a human loan officer to approve it before any action is taken. What is the most precise characterization?

Right. "AI-driven" implies AI is making the decisions. "AI-assisted" means AI is making recommendations that humans then evaluate. These are genuinely different systems with different implications for efficiency, accountability, and risk. The press release used the more impressive framing without being technically dishonest — which is exactly the kind of gap you need to read for.

The company isn't lying — they do use AI in the process. But "AI-driven" and "AI-assisted" are meaningfully different, and the distinction matters for understanding what the system actually does. Human-in-the-loop systems have different risk profiles, accountability structures, and efficiency implications than autonomous AI decision-making. The press release chose the more impressive framing.

2. Amazon's 2018 recruiting AI was scrapped because it was downgrading female candidates. What was the root cause of this failure?

Exactly. The model didn't have malicious intent; it didn't have any intent. It learned statistical patterns from historical data, and those patterns included the company's historical tendency to hire more men for technical roles. "Predict who will be hired" and "predict who should be hired" are different questions, and the model was only equipped to answer the first one.

The cause was training data, not engineer bias or legal failure. The model learned from past Amazon hiring decisions, which reflected a male-skewed technical workforce. It learned, accurately, that past successful hires tended to be male — and applied that pattern to new candidates. This is the proxy variable problem: optimizing for "resembles past hires" instead of "would be a great employee."

3. A hospital AI model achieved 96% accuracy in its clinical trial. Eighteen months after deployment, performance audits show accuracy has dropped to 81%. What is the most likely explanation?

Model drift is the most common explanation for this pattern. As patient populations change, as medical equipment is updated, as recording practices shift, the data the model sees in production diverges from the data it was trained on. Accuracy degrades. This is expected and manageable — but only if organizations are monitoring for it, which many aren't.

Model drift is the most likely culprit. Real-world data changes over time in ways the training data didn't anticipate. Patient demographics shift, equipment gets updated, documentation practices change. If the model isn't retrained or updated, its accuracy on current data can drop significantly from its trial performance. This isn't fraud — it's a known property of deployed ML systems that requires active management.

4. You're in an interview and the company mentions they use AI to match job candidates to roles. Which question is most useful for understanding whether the system is actually working well?

This is the deployment gap question in disguise. A company that can answer this clearly — with actual production metrics and a defined error-handling process — is demonstrating that they've closed the loop between lab performance and real-world accountability. A company that can only describe the system's capability without describing how they measure it is a company operating on faith.

The most diagnostic question is about post-deployment measurement and error-handling. "What model" tells you about architecture, not performance. "How many candidates" is a throughput metric, not an accuracy metric. "Vendor or internal" is interesting but not directly relevant to whether the system works. Whether they're measuring real-world performance and have a plan for when it fails is the crux.

5. An AI tutoring startup reports their tool improves student test scores by 31% compared to no tutoring at all. Why might this comparison be misleading?

Right. "Better than nothing" is a deliberately chosen comparison that makes the product look good. The question any educator or policymaker should ask is: compared to human tutoring of similar duration? Compared to structured self-study? Compared to better-resourced classrooms? If the AI tool is better than nothing but worse than a decently trained human tutor, that's relevant information the startup has no incentive to surface.

The comparison to "no tutoring" is the problem. Almost any structured educational intervention will show improvement over no intervention. The useful comparison is against existing alternatives — human tutoring, structured study programs, improved classroom instruction. If the AI tool beats nothing but underperforms human alternatives, the 31% figure is technically accurate and practically misleading.
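
The arithmetic behind this is simple enough to sketch. In the Python below, only the 31% gain over "no tutoring" mirrors the question; the score of 60, the human-tutor score of 84, and the function name are invented to show how the same result reads against a different baseline.

```python
def relative_change(new, baseline):
    """Fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


# Hypothetical average test scores. Only the 31% gap over "no tutoring"
# comes from the claim; the other numbers are made up.
scores = {"no tutoring": 60.0, "AI tutor": 78.6, "human tutor": 84.0}

gain_vs_none = relative_change(scores["AI tutor"], scores["no tutoring"])
gain_vs_human = relative_change(scores["AI tutor"], scores["human tutor"])

print(f"AI tutor vs no tutoring: {gain_vs_none:+.1%}")
print(f"AI tutor vs human tutor: {gain_vs_human:+.1%}")
```

Under these assumed scores the tool is 31% better than nothing and roughly 6% worse than the human alternative. Neither figure is false; the choice of baseline decides which story gets told.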

Lab 3: Deployment Gap Analysis

Find an AI deployment claim. Stress-test it against real-world conditions.

Your Role: Deployment Auditor

Take an AI deployment claim — from a company website, a job posting, a news article about an institution you interact with (your university, a bank, a healthcare provider, a retail brand). Something that says "we use AI to do X." Your job is to analyze that claim against the deployment gap framework: What decision is being made? Who handles errors? How is performance measured? What data is it running on?

Describe the claim and your analysis to the auditor below. You'll be pushed to go further than surface-level observation and take a real position on whether this deployment claim holds up.

Describe the deployment claim you're analyzing. Where does it come from, what does it say the AI is doing, and what's your initial read on whether it's substantiated?

Deployment Auditor: Walk me through the deployment claim you're looking at. What does the company or institution say the AI is doing, and what do you think is actually happening behind that claim? Don't just describe it; give me your read on whether it passes or fails the deployment gap test.

Module 7 · Lesson 4

Writing Your Verdict: Synthesizing an Audit

Skepticism without a conclusion is just anxiety. The audit only matters if it ends with a clear, defensible judgment about what you believe and why.

After you've done the work — classified the claim, found the source, stress-tested the deployment — what do you actually say?

Jordan is preparing for a final presentation in their media studies seminar on "AI and the Information Economy." They've spent two weeks researching an AI content moderation system used by a major social platform. They've read the company's transparency report, found two academic papers examining its false positive rate, and turned up a ProPublica investigation from 2024 showing that the system flags content from certain communities at three times the rate of equivalent content from others.

Now they're staring at their notes trying to figure out what to say. The technology is real and does remove some genuinely harmful content — that's not a lie. It also has a documented fairness problem that the company acknowledged but hasn't fixed in three years. Jordan doesn't want to sound naive by saying "AI moderation is fine" and doesn't want to sound paranoid by saying "this is a civil rights violation." But their professor is going to ask them: what do you actually think?

This is the moment the audit is for. Not to win an argument. Not to perform skepticism. But to synthesize what you've found into a position you can actually defend when someone pushes back on it.

The Audit Verdict Framework

An audit verdict isn't a binary — it's not "this claim is true" or "this claim is false." Most real AI claims are more complicated than that. What you're producing is a calibrated assessment: here's what the claim says, here's what the evidence supports, here's where the gap is, here's what I'd need to see to update my view. That's a useful thing to have in the world — much more useful than either credulity or blanket dismissal.

A complete audit verdict has four components:

The claim, precisely stated. Not the headline version — the actual claim with its implicit assumptions made explicit. "AI content moderation is effective" is vague. "This platform's AI moderation system correctly identifies and removes harmful content with a false positive rate comparable to human moderators" is specific enough to be evaluated.
The evidence, characterized honestly. What kind of evidence exists (internal report, peer-reviewed study, investigative journalism), who produced it, what it measured, and how strong it is. Be specific about what you found and what you couldn't find. Because companies have strong incentives to publish favorable results, the absence of supporting evidence is itself a signal worth noting.
The gap, identified precisely. What does the claim imply that the evidence doesn't support? This is where the deployment gap, the amplification chain, and the benchmark problems all cash out. Name the specific disconfirmations or uncertainties without overstating them.
Your verdict, with a confidence level. Not just "I'm skeptical" — that's a vibe. "Based on available evidence, this claim is plausible but not established, and I'd lower confidence by 30% because the only supporting evidence is company-funded" is a verdict. Include what would cause you to update: "If an independent academic study replicates the performance claim, I'd treat this as established."
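
If it helps to see the four components as a template, here is one possible way to write them down in Python. The `AuditVerdict` class, its field names, the "low / moderate / high" scale, and the draft contents are all hypothetical; treat this as a note-taking sketch, not an official format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AuditVerdict:
    """The four components of a verdict, plus what would change your mind."""
    claim: str            # the precise claim, assumptions made explicit
    evidence: List[str]   # each item: kind of evidence and who produced it
    gap: str              # what the claim implies that the evidence doesn't show
    verdict: str          # your conclusion in one sentence
    confidence: str       # hypothetical scale: "low", "moderate", "high"
    update_conditions: List[str] = field(default_factory=list)

    def missing_parts(self) -> List[str]:
        """Names of components that are still empty."""
        parts = {
            "claim": self.claim, "evidence": self.evidence, "gap": self.gap,
            "verdict": self.verdict, "confidence": self.confidence,
            "update_conditions": self.update_conditions,
        }
        return [name for name, value in parts.items() if not value]


# An invented draft, loosely based on the moderation example above.
draft = AuditVerdict(
    claim="This platform's AI moderation has a false positive rate "
          "comparable to human moderators.",
    evidence=["company transparency report (internal, not independent)"],
    gap="No independent comparison against human moderators.",
    verdict="Plausible but not established.",
    confidence="low",
)
print(draft.missing_parts())
```

The draft above is flagged as missing its update conditions, which is the part people most often skip. A verdict that doesn't say what would change it is closer to a vibe than a verdict.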

What Jordan Said

Jordan's final verdict on the content moderation system: "The system does remove a meaningful volume of harmful content — that claim is supported by company data and consistent with independent research. However, the fairness claim (that it applies consistent standards across communities) is not supported — it is specifically contradicted by ProPublica's 2024 analysis and the company's own transparency data. My overall assessment: effective at its core task, with a documented and unresolved fairness problem that the company has had three years to address. Confidence in the effectiveness claim: moderate. Confidence in the fairness claim: low, against the company's position."

Holding Two Things at Once

One of the trickier skills in AI literacy is holding contradictory-seeming truths simultaneously without collapsing them into a single narrative. Jordan's content moderation example is a good model: the system works, and the system is unfair. Both are true. The nuance is real, not a cop-out.

This pattern shows up everywhere in the AI landscape. AI language models are impressively capable at a wide range of tasks, and they hallucinate with confidence in ways that can cause real harm. Medical imaging AI is genuinely improving diagnostic accuracy in some settings, and it has documented performance disparities across racial groups that have not been adequately addressed. Generative AI is creating new opportunities for creative work, and it's creating real economic disruption for working artists and illustrators right now. All of these pairs are true at the same time.

The narrative pressure — from AI boosters and AI alarmists alike — is to pick a side. "AI is transformative and mostly positive" or "AI is dangerous and mostly negative." Both of those framings are simpler and more emotionally satisfying than the actual picture. They're also both wrong, or at minimum they're systematically omitting half the evidence. Your job after a good audit is to resist that pressure and say what you actually found.

The Real-World Use Case

You will be asked what you think about AI — in job interviews, in class discussions, in conversations with family, in decisions about your own work and career. Having done an actual audit on a real claim gives you something to say that almost nobody else has. Not "AI is the future" (boring, probably wrong). Not "AI is overhyped" (partially true, underspecified). But: "I looked at a specific claim about [X], here's what the evidence actually shows, and here's what I'd need to see to change my view." That's a response that makes people take you seriously.

Updating Your View: The Calibration Mindset

A verdict isn't a final answer — it's your current best read, held with appropriate confidence, open to revision when new evidence arrives. This is the calibration mindset, and it's more useful than either dogmatic belief or reflexive skepticism.

Calibration means: your confidence in a claim scales with the quality of the evidence. When a single company-funded study supports a capability claim, you hold it loosely. When three independent academic papers and two investigative journalism pieces all point in the same direction, you hold it more firmly. When evidence conflicts — one study shows X, another shows not-X — you hold both loosely and treat the question as genuinely open.

This is harder than it sounds in practice, because people with strong opinions will push back on your uncertainty. "So you think the AI doom people are wrong?" Yes, mostly, based on the current evidence. "So you think AI critics are just Luddites?" No — some of their specific concerns are well-documented. "So you don't have a strong view?" I have a calibrated view, which is different from not having a view. Learning to hold that ground — confident in your process without being overconfident in your conclusions — is the actual skill this course has been building.

Calibrated Confidence

Having a level of belief in a claim that matches the quality and quantity of evidence for it. Not more confident, not less. Calibration is a skill you can improve — it requires regularly checking whether your predictions are coming true and adjusting your process when they're not.
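
One common way to check your own calibration is the Brier score: the average squared gap between the probability you stated and what actually happened (0 is perfect, and always saying 50% earns 0.25). The sketch below uses five invented forecasts; the variable names are illustrative only.

```python
def brier_score(forecasts):
    """Mean squared gap between stated probability and outcome (0 = perfect)."""
    return sum(
        (p - (1.0 if happened else 0.0)) ** 2 for p, happened in forecasts
    ) / len(forecasts)


# Hypothetical track record: (probability you gave, whether it came true).
my_forecasts = [(0.9, True), (0.8, True), (0.7, False), (0.3, False), (0.6, True)]

my_score = brier_score(my_forecasts)
# Baseline: what you'd score by answering "50%" to everything.
coin_flip = brier_score([(0.5, happened) for _, happened in my_forecasts])

print(f"my Brier score:     {my_score:.3f}")
print(f"always-50% baseline: {coin_flip:.3f}")
```

Here the invented forecaster scores 0.158, better than the 0.25 baseline, with most of the penalty coming from the one confident miss (70% on something that didn't happen). Keeping a small log like this is one practical way to do the "regularly checking" the definition calls for.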

Verdict vs. Opinion

An opinion is what you feel. A verdict is what the evidence supports, characterized honestly. You can have an opinion that goes beyond the verdict — "I think this company is acting in bad faith even though I can't prove it" — but keep those separate, and label the opinion as opinion.

Putting the Full Audit Together

You've now built all four layers of the audit framework across this module. Here's how they fit together as a complete practice:

Classify the claim — What type of claim is this? What are its implicit assumptions? Who produced it and what do they gain if you believe it?
Find the source — Trace the claim to its origin. Where does the evidence sit in the hierarchy? What does the original study actually say vs. what the headline claims?
Assess the deployment gap — If this is a real-world performance claim, what conditions must hold for it to be accurate? Does the evidence address those conditions or is it lab-only?
Write the verdict — Precise claim, honest evidence characterization, specific gap identification, confidence-calibrated conclusion with update conditions.

This framework takes maybe fifteen minutes to run through for a claim you care about. It's not about doing this for every piece of information you encounter — that's not sustainable. It's about developing the habit of doing it for the claims that matter: the ones you're going to repeat, cite, act on, or build a decision around. That's a different population than every tweet you scroll past. It's a small enough set to be manageable and important enough to be worth the time.

The final lab will have you synthesize a complete written audit on a claim of your choice and defend your verdict when challenged. That's the capstone of everything this module has built.

Lesson 4 Quiz

Writing Your Verdict — 5 questions

1. You've audited an AI job-matching tool and found: one company-funded study supporting accuracy, and one independent paper showing disparate performance across racial groups. What is the correct verdict structure?

This is calibrated assessment. The accuracy and fairness claims are separate claims with separate evidence bases. Treating them as a single claim to accept or reject loses the precision. A good verdict characterizes each claim separately with its own confidence level — that's what makes the audit useful rather than just a gut check.

Conflicting evidence doesn't prevent a verdict — it informs one. The accuracy claim and the fairness claim are separate, and the evidence bears on them differently. One company-funded study supports accuracy: moderate confidence, pending replication. An independent study contradicts the fairness claim: low confidence in the fairness assertion. Hold both characterizations simultaneously — that's the audit.

2. Someone tells you: "You're just being contrarian — you always find problems with AI claims." What's the most accurate thing to say in response?

This is the right posture. You're not arguing for a position — you're describing a process. The offer to update based on new evidence is crucial; it distinguishes calibrated skepticism from motivated skepticism. If someone shows you evidence you hadn't seen, you should genuinely update. That's what makes the whole framework credible.

The right response is to describe your process and invite specific counter-evidence. "Calibrating to evidence" is different from "being contrarian" — a contrarian holds a position regardless of evidence; a calibrated thinker holds positions proportional to evidence and updates when shown something new. Making that distinction explicit is more persuasive than defending or abandoning your skepticism.

3. Which of the following is a verdict rather than an opinion?

B is the verdict. It specifies the claim being evaluated, the evidence with its quality characteristics, and a confidence level with a directional assessment. A is an opinion about intent (useful, but not derived from evidence about accuracy). C is a prior, not an assessment. D is anecdote. Only B could be updated by specific new evidence in a way that would change the conclusion.

A verdict is evidence-grounded, specific, and open to revision based on new evidence. B is the only option that specifies what evidence exists, characterizes its quality, and attaches a confidence level. The others are opinions, priors, or anecdote — all potentially useful inputs, none of them verdicts in the sense this framework uses the word.

4. An AI company releases a new benchmark showing their model outperforms competitors. Two months later, two academic teams publish studies showing they can't replicate the result. How should this change your confidence in the original benchmark?

Two independent failures to replicate are strong evidence: not conclusive (the replication teams might have made different methodological choices) but significant. The calibrated update is to substantially lower confidence in the original claim while remaining open to the possibility that the result holds under the specific conditions the company tested. "Significant downgrade" is right; "full dismissal" overclaims.

Two independent replication failures are meaningful evidence. They don't prove the original result was fabricated — the replication teams might have used different conditions — but they substantially lower confidence that the result generalizes. The appropriate move is a significant confidence downgrade, not dismissal and not unchanged confidence. Calibration means taking replication failures seriously as evidence.
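
For readers who like to see the update as arithmetic, here is a rough Bayesian sketch in Python. Every number is an assumption chosen for illustration: a starting confidence of 70% that the benchmark result generalizes, and a likelihood ratio of 1/3 per failed replication (meaning a failure is taken to be three times as likely if the result doesn't generalize). Real audits rarely have numbers this tidy.

```python
def update(prior, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = prior / (1 - prior)
    odds *= likelihood_ratio
    return odds / (1 + odds)


prior = 0.70                    # assumed confidence before the replications
failed_replication_lr = 1 / 3   # assumed strength of one failed replication

after_one = update(prior, failed_replication_lr)
after_two = update(after_one, failed_replication_lr)

print(f"before:            {prior:.0%}")
print(f"after one failure: {after_one:.0%}")
print(f"after two:         {after_two:.0%}")
```

Under these assumptions confidence falls from 70% to about 21%: a large downgrade, but not zero. That is the shape of the answer the quiz is pointing at, whatever exact numbers you would plug in.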

5. You finish an audit and conclude: "This claim is probably directionally correct but overstated by about 40%." Your friend says: "So you can't actually say whether it's true or false — what's the point?" What's the best response?

This is the real answer. Audits aren't academic exercises — they're decision support. Knowing a claim is directionally right but overstated by 40% changes how much you'd pay for a product claiming that benefit, how you'd weight it in a career decision, how you'd present it in a class argument. Binary true/false is almost never the right frame for complex empirical claims. Calibrated partial confidence is more useful.

The audit's value is decision support, not binary labeling. "Probably true but overstated by 40%" is actually highly useful information. If you're making a career decision based on "AI will replace 40% of jobs," knowing the evidence points closer to 24% (the 40% figure cut by 40%) changes your calculus. Real decisions happen on a spectrum of confidence, not a yes/no switch.
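
The 24% figure depends on how you read "overstated by 40%," so it is worth spelling out. The short Python sketch below shows two plausible readings; the function names are made up, and which reading applies is something your verdict should state explicitly.

```python
def discount_claim(claimed, overstatement):
    """Reading 1: the true value is the claim reduced by the overstatement."""
    return claimed * (1 - overstatement)


def deflate_claim(claimed, overstatement):
    """Reading 2: the claim equals the true value inflated by the overstatement."""
    return claimed / (1 + overstatement)


claimed = 40.0  # "AI will replace 40% of jobs"

reading_1 = discount_claim(claimed, 0.40)
reading_2 = deflate_claim(claimed, 0.40)

print(f"claim reduced by 40%:      {reading_1:.1f}%")
print(f"claim is 1.4x the truth:   {reading_2:.1f}%")
```

The first reading gives 24%, matching the text; the second gives roughly 28.6%. Either is defensible, and both are far from the headline 40%, but a precise verdict says which one it means.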

Lab 4: Full Audit — Write Your Verdict

Take the claim you've been working with all module. Write the complete audit. Defend your verdict under pressure.

Your Role: Lead Analyst — Final Report

This is the capstone lab. You're going to present a complete audit of an AI claim — the one you've been working with, or a new one if you prefer. The analyst below will receive your audit across four components: (1) the claim precisely stated, (2) the evidence characterized honestly, (3) the gap identified specifically, and (4) your verdict with a confidence level and update conditions.

The analyst will push back — not to be difficult, but to test whether your verdict holds under pressure. If you've done the work, it will. If there are weak spots, you'll find out here, where it's safe to find them.

Present your complete audit. Start with: here's the claim, here's what I found when I went looking for evidence, here's the gap between the claim and the evidence, and here's my verdict with confidence level. Be specific. "I'm skeptical" is not a verdict.

Audit Reviewer: Give me your full audit. Claim, evidence, gap, verdict: all four components. I'm going to push back on whatever part looks weakest, so be ready to defend it. If your verdict can't survive one round of pushback, it needs more work. Start whenever you're ready.

Module 7 Test

Apply It: Audit an AI Claim — 15 questions · Pass at 80%

1. A startup's website says: "Our AI increases employee productivity by 3x." What is the single most important piece of missing information that would let you evaluate this claim?

Without knowing what "productivity" means in this context (output volume? quality? hours saved?) and what the 3x is compared to (previous workflow? industry average? manual processes?), the claim is completely uninformative. The metric definition and comparison baseline are the two most critical elements missing from nearly all AI performance claims.

The model name and company financials don't help you evaluate the claim. The only way to assess "3x productivity" is to know what productivity means here and what it's being compared to. Without those two pieces, the number is decorative.

2. Which of the following is a trajectory claim?

Trajectory claims predict directional change over time. "Will automate 30% within a decade" is a forecast about future change — the signature of a trajectory claim. The others are respectively a capability claim, an adoption claim, and a value claim.

A trajectory claim predicts future directional change. Option C is the only one that describes where AI is heading over time — it's a forecast. A is a capability claim (what the AI can do), B is an adoption claim (how many people use it), and D is a market valuation (a type of value claim).

3. A peer in your class says "I read that AI can now pass the CPA exam." You ask where they read it. They say "I saw it on Twitter." What should your confidence level be in this claim?

A Twitter attribution with no traceable source puts you at the far end of the amplification chain. The claim might be real (an AI may well have passed some version of the CPA exam), but you have no way to evaluate the original result from this citation. Very low confidence: treat it as a hypothesis to investigate, not a fact to repeat.

The correct confidence is very low — not zero (it might be true) but not moderate (you have nothing to evaluate it against). Social media posts are typically the telephone-game endpoint of a chain that may have started with a real study. Without being able to trace it back, you can't know if the original claim was accurate, and you certainly can't know if the Twitter version is faithfully representing it.

4. What does "benchmark overhang" mean?

Benchmark overhang happens when models are tuned specifically to perform well on evaluation datasets — sometimes to the point where the test score measures the optimization, not the underlying capability. A model can ace a benchmark through specific training without being generalizable. This is why benchmark scores should never be the sole measure of AI capability.

Benchmark overhang describes models optimized for test scores in ways that don't reflect real-world performance. Think of it like teaching to the test — the score goes up, the underlying capability doesn't necessarily follow. The benchmark becomes a measure of benchmark optimization rather than the thing you actually care about.
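
A deliberately extreme toy example shows the "teaching to the test" effect. The "model" below has simply memorized the benchmark's answers; the questions, the `memorizer` function, and both datasets are invented, and real benchmark tuning is far subtler than a lookup table.

```python
# Hypothetical benchmark and fresh questions: a toy stand-in, not a real eval.
benchmark = {"2+2": "4", "3+5": "8", "7+1": "8"}
fresh = {"4+4": "8", "6+3": "9", "1+1": "2"}


def memorizer(question):
    """A 'model' that has only memorized the benchmark's answers."""
    return benchmark.get(question, "unknown")


def score(model, dataset):
    """Share of questions in `dataset` the model answers correctly."""
    return sum(model(q) == a for q, a in dataset.items()) / len(dataset)


benchmark_score = score(memorizer, benchmark)
fresh_score = score(memorizer, fresh)

print(f"benchmark score:       {benchmark_score:.0%}")
print(f"fresh-question score:  {fresh_score:.0%}")
```

The memorizer scores 100% on the benchmark and 0% on new questions of the same kind. That is why a score on held-out or freshly written questions tells you more about capability than a score on a test the model may have been tuned against.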

5. You find a study showing an AI coding assistant doubles developer speed. The study measured how fast participants completed a single, pre-defined function-writing task. What is the primary limitation of this finding?

This is the controlled lab condition problem. A pre-defined, isolated function-writing task strips out almost everything that makes real development complex: understanding requirements, debugging generated code, ensuring integration with existing systems, code review, and maintenance. The benchmark measured one slice of development and the claim implies all of it.

The core problem is that the task is not representative. Real development isn't just writing new functions to specification — it's understanding ambiguous requirements, debugging, integrating with legacy code, reviewing pull requests, and maintaining systems over time. A study that measures one isolated task and claims to measure "developer productivity" is conflating the two.

6. A company says their hiring AI is "audited for fairness." What would you need to know to evaluate whether that audit is meaningful?

"Audited for fairness" is a phrase that can mean almost anything. Who conducted it (independent or internal?) matters. Which fairness definition (equal accuracy, demographic parity, equalized odds?) matters, because there are many competing definitions and several of them cannot all be satisfied at once. Which demographic groups were examined matters. Whether results were disclosed matters. The phrase alone tells you nothing.

"Audited for fairness" is a hollow claim without specifics. Fairness has multiple mathematical definitions that are often mutually incompatible. An internal audit using a favorable fairness metric, examining only the easiest-to-satisfy demographic splits, and never publishing results tells you nothing useful. You need who, which definition, which groups, and what was found.

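The point about incompatible definitions is easier to see with numbers. The sketch below, in Python, uses invented toy data (two groups of ten applicants with different shares of qualified candidates) and a hypothetical helper named rates. It assumes a screening tool that selects exactly the qualified applicants, which is an idealization and not a description of any real hiring system.

```python
def rates(records):
    """Summarize one group. records: list of (qualified, selected) booleans."""
    qualified = sum(1 for q, _ in records if q)
    unqualified = len(records) - qualified
    selected = sum(1 for _, s in records if s)
    true_pos = sum(1 for q, s in records if q and s)
    false_pos = sum(1 for q, s in records if not q and s)
    return {
        "selection_rate": selected / len(records),  # compared under demographic parity
        "tpr": true_pos / qualified,                # compared under equalized odds
        "fpr": false_pos / unqualified,             # compared under equalized odds
    }

# Toy data (assumption): the tool selects exactly the qualified applicants.
group_a = [(True, True)] * 6 + [(False, False)] * 4   # 6 of 10 qualified
group_b = [(True, True)] * 2 + [(False, False)] * 8   # 2 of 10 qualified

a, b = rates(group_a), rates(group_b)
print("Group A:", a)
print("Group B:", b)
```

In this toy case the tool satisfies equalized odds (identical true positive and false positive rates in both groups) while violating demographic parity (selection rates of 0.6 versus 0.2). An audit could truthfully report "fair" under the first definition and stay silent about the second, which is why you need to ask which definition was used.
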
7. Model drift most directly threatens which aspect of an AI system's performance claims?

Model drift is a deployment problem, not a research or benchmark problem. A model that performed accurately at launch can degrade as the world changes and the input data drifts away from the training distribution. This is why post-deployment monitoring matters — the launch accuracy figure is only the starting point.

Model drift affects deployed system performance, not the original paper's results. The paper was accurate when it was written. The deployment accuracy at month 18 is a different question, and drift is why those two numbers can diverge significantly. Ongoing monitoring is the only way to catch drift before it causes serious failures.
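
What ongoing monitoring might look like can be sketched in a few lines of Python. This is a minimal illustration, not a production monitoring method: the function name drift_alerts, the window size, the tolerance, and the outcome history are all made up for the example, and it assumes you eventually learn whether each deployed prediction was correct.

```python
def drift_alerts(outcomes, launch_accuracy, window=100, tolerance=0.05):
    """Flag windows whose accuracy falls more than `tolerance` below launch accuracy.

    outcomes: list of booleans, True when a deployed prediction turned out correct.
    Returns a list of (window_start_index, window_accuracy) pairs.
    """
    alerts = []
    for start in range(0, len(outcomes) - window + 1, window):
        chunk = outcomes[start:start + window]
        accuracy = sum(chunk) / window
        if accuracy < launch_accuracy - tolerance:
            alerts.append((start, accuracy))
    return alerts

# Made-up history: 92 of 100 correct in the first window, 80 of 100 in the second.
history = [True] * 92 + [False] * 8 + [True] * 80 + [False] * 20
print(drift_alerts(history, launch_accuracy=0.92))
```

With this invented history, only the second window is flagged. The launch figure of 92% was true when it was measured, and the check shows that it no longer describes the system.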

8. You're reading a McKinsey report claiming AI will create $4.4 trillion in annual economic value. You find the original report and see that the number comes from McKinsey's own proprietary economic model, not external data. What is the correct characterization of this evidence?

Proprietary economic models from firms with commercial interests in the outcome sit near the bottom of the evidence hierarchy for a reason. The model assumptions aren't independently reviewable, the incentives point toward large impressive numbers, and the methodology can't be replicated. Treat it as one data point with wide uncertainty bands, not a finding.

A proprietary model from a consulting firm with incentive to project large AI values is weak-to-moderate evidence at best. The methodology can't be independently examined, the assumptions can't be tested, and the firm profits from advising companies on AI adoption. That doesn't make the number wrong — it makes it unverifiable and incentive-shaped. Wide uncertainty bands are appropriate.

9. An AI diagnostic tool has 95% sensitivity (catches 95% of actual cases) but 60% specificity (falsely flags 40% of people who do not have the condition). A press release says the tool achieves "95% diagnostic accuracy." What's wrong?

This is metric selection bias in action. Reporting only sensitivity as "accuracy" while hiding that the tool falsely flags 40% of patients who do not have the condition is a critical omission in any clinical context. When the condition is uncommon, that false positive rate means most flagged patients are false alarms, and each one faces unnecessary follow-up, with the associated cost, anxiety, and potential harm. The metric selected makes the tool look much better than the full picture supports.

The problem is that "accuracy" is being operationalized as sensitivity alone, while specificity — which is equally important in clinical settings — is conveniently omitted. A 40% false positive rate is a major clinical concern. Labeling sensitivity alone as "accuracy" cherry-picks the favorable metric and misrepresents the tool's overall diagnostic performance.
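
To see how far the headline number can sit from the full picture, here is a minimal sketch in Python. The 95% sensitivity and 60% specificity come from the question. The 2% prevalence and the population of 10,000 are assumptions chosen only for illustration, and screening_summary is a hypothetical helper name.

```python
def screening_summary(sensitivity, specificity, prevalence, population=10_000):
    """Turn sensitivity and specificity into outcomes for a hypothetical population."""
    sick = population * prevalence
    healthy = population - sick
    true_pos = sensitivity * sick        # sick patients correctly flagged
    true_neg = specificity * healthy     # healthy patients correctly cleared
    false_pos = healthy - true_neg       # healthy patients flagged anyway
    return {
        "accuracy": (true_pos + true_neg) / population,  # share of all calls that are right
        "ppv": true_pos / (true_pos + false_pos),        # share of flags that are real cases
        "false_alarms": false_pos,
    }

# Assumed prevalence of 2%; the question does not state one.
summary = screening_summary(0.95, 0.60, prevalence=0.02)
print(summary)
```

Under these assumptions, overall accuracy comes out at roughly 61%, not 95%, and fewer than 5% of flagged patients actually have the condition. Change the assumed prevalence and both numbers move, which is exactly why a single "accuracy" figure without context tells you so little.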

10. Your professor cites a 2023 study showing AI tutors improve test scores by 28% vs. human tutors. You look it up and find the study used a sample of 47 college students over two weeks. What is the most significant methodological concern?

47 students over two weeks is a very small sample over a very short time horizon. Small samples have high variance — the result might replicate or might not. Two weeks is insufficient to assess whether the improvement reflects genuine learning or just familiarity effects, Hawthorne effects (behavior change from being observed), or novelty motivation. Generalization requires much larger, longer studies.

The primary concern is sample size and study duration. 47 students is not enough to draw robust generalizations, and two weeks is too short to distinguish genuine learning improvements from novelty effects, Hawthorne effects, or temporary motivation changes. A result this preliminary should be described as interesting, not established.
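
To get a feel for how much noise a sample of 47 carries, here is a rough sketch in Python. It uses the standard normal-approximation margin of error for a sample mean. The standard deviation of 15 points is an assumption made up for illustration (the question does not report one), and margin_of_error is a hypothetical helper name. A real comparison of two tutoring groups would split the 47 students further, which makes the uncertainty larger still.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(std_dev, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a sample mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * std_dev / sqrt(n)

# Assumed standard deviation of 15 points; sample sizes chosen for comparison.
for n in (47, 500, 5000):
    print(n, round(margin_of_error(15, n), 2))
```

With the assumed spread, the margin of error is about 4.3 points at 47 students and about 0.4 points at 5,000. The exact figures depend entirely on the assumed standard deviation, but the pattern holds: uncertainty shrinks only with the square root of the sample size, so small studies leave a lot of room for chance.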

11. You finish auditing an AI claim and your verdict is: "The core capability claim is probably true based on two independent studies, but the company's assertion that this capability scales to enterprise deployment is unsupported." Your friend says your verdict is useless because it doesn't say yes or no. What's the right response?

Separating what's supported from what isn't is precisely the value of the audit. "Capability works in lab conditions, enterprise scaling is unproven" is decision-relevant information — it's a reason to pilot carefully rather than commit fully. Binary yes/no verdicts on complex empirical questions are usually false precision. Calibrated partial confidence is more accurate and more useful.

A verdict that distinguishes supported claims from unsupported ones is more useful than a binary yes/no, not less. If someone is deciding whether to adopt this technology at enterprise scale, knowing "the core capability works" vs. "enterprise scaling is unproven" is exactly the information they need. That's not an incomplete verdict — that's a precise one.

12. A social media post says "STUDY: AI beats humans at detecting fraud 94% vs 76%." You find the original study. It compared the AI to a single human analyst working alone, without access to the contextual information fraud investigators normally use. How does this change the claim?

Comparing AI to a human stripped of contextual tools is a deliberately favorable comparison. Real fraud investigation involves multiple analysts, case history, behavioral patterns over time, and organizational context. The AI's "outperformance" in this study tells you how it does against an artificially weakened human baseline — not whether it would outperform a real fraud team. The study is real; the comparison is cherry-picked.

The comparison baseline is the problem. A single analyst without access to contextual information is not a representative baseline for "humans at fraud detection." Real fraud teams use multiple investigators with full case context. The AI might or might not outperform that — this study doesn't tell you. The headline claim is much stronger than the study supports.

13. You're a pre-med student weighing whether AI will replace radiologists. You find: (1) a 2022 study showing AI comparable to median radiologists on specific chest X-ray tasks; (2) a 2024 deployment audit showing the same AI had 71% accuracy in a community hospital with older imaging equipment; (3) a McKinsey report projecting AI will handle 50% of radiology tasks by 2030. What's the most defensible conclusion?

This is calibrated synthesis. The research shows real capability; the deployment audit shows a real performance gap under real conditions; the McKinsey projection is speculative. "Real tool, real limitations, uncertain trajectory" is a more accurate and more useful framing than either "being replaced" or "doesn't work." It's also the honest answer about where the evidence actually sits.

The three sources tell a coherent story that's more nuanced than either extreme. The lab benchmark shows genuine capability. The deployment audit shows a meaningful real-world performance gap. The McKinsey projection is speculative. Together they say: AI is a real and advancing technology in this domain that currently underperforms outside controlled conditions, and how quickly it will spread to broader application is uncertain. That's your verdict.

14. "AI is being used by major banks to make loan decisions" — what additional information is most critical before you repeat this as a factual claim?

"Making loan decisions" is doing enormous definitional work. If AI is the final decision-maker, that's a very different world from AI generating a risk score that a human loan officer then evaluates as one factor among many. Banks routinely describe the latter as "AI-driven underwriting" when it's actually "AI-assisted human underwriting." The distinction has major implications for accountability, bias, and regulatory compliance.

The critical question is the AI's actual role in the decision. "Making decisions" can mean autonomous final approval/denial, or it can mean generating a risk score that a human reviews. These are fundamentally different systems with different regulatory implications, accountability structures, and error profiles. The phrase is commonly used to describe both — which is why you need to ask.

15. After completing this module, you're asked in a job interview: "What do you think about AI — is the hype real?" What's the most sophisticated answer you could give?

This is the answer that demonstrates genuine analytical capability rather than a stance. It's honest about complexity without retreating into "I can't say." It demonstrates a process, not just an opinion. It signals that you've engaged with the material seriously. An interviewer at any organization dealing with AI decisions will find this more impressive than either a boosterish or a cynical answer.

Options A and B are both more confident than the evidence supports, in opposite directions. Option C sounds thoughtful but is actually a dodge — "too complicated to generalize" is often how people avoid having to think rigorously. Option D demonstrates the actual skill: process-based analysis, calibrated confidence, and an honest acknowledgment of complexity without hiding behind it. That's what this module was built to produce.