Your friend Dani forwards you a LinkedIn post. The headline reads: "AI is now better than the average radiologist at detecting cancer." Dani is a pre-med sophomore, and she's genuinely rattled. "Should I even finish the degree?" she texts you. Below the headline is a graph, a company logo, and a quote from a TED talk. It looks authoritative. It feels like a fact.
The thing is — you've seen this headline before, or something like it. It circulates every eighteen months. Sometimes it's radiology. Sometimes it's legal research, or coding, or therapy. The specifics change; the structure of the claim stays the same. AI has surpassed humans at [high-status profession]. And every time, people make real decisions off it — whether to stay in school, which skills to develop, whether a career is worth pursuing.
That's not a hypothetical risk. That's what's happening right now, in your peer group, with real consequences. So before we get to whether any specific AI claim is true, we need to get clear on what kind of thing an AI claim actually is.
AI claims aren't a single type of statement. They get blurred together constantly — in headlines, in pitch decks, in Twitter threads — but they operate very differently and require different kinds of scrutiny. There are roughly four categories worth keeping separate in your head.
Capability claims say AI can do something: "GPT-4 passed the bar exam." These are empirically testable in principle, but the details matter enormously — which version, on what test, under what conditions, compared to whom. The radiologist headline above is a capability claim. So is "our AI writes better marketing copy than humans." The claim sounds concrete but usually hides a metric choice made by whoever ran the study.
Adoption claims say AI is being used at scale: "78% of Fortune 500 companies are deploying AI tools." These are survey-based, self-reported, and definitionally murky — "deploying" can mean anything from a chatbot on an HR portal to genuine process automation. The number sounds precise because it has a percent sign attached to it.
Trajectory claims say AI is heading somewhere: "AI will replace 40% of jobs by 2030." These are forecasts dressed up as facts. Different research institutions have published wildly different numbers. The range is so wide (some studies say 10%, others say 60%) that citing any single figure as authoritative is, at minimum, selective.
Value claims say AI will create or destroy economic value: "AI will add $15.7 trillion to the global economy." These almost always come from consulting firms whose business model depends on companies buying AI strategy services. That doesn't make them false. But it's relevant information.
Most of your peers aren't sorting AI claims into categories. They're treating all of them as roughly equivalent to "facts I saw online." That's not a judgment — it's the default when you're moving fast and information is everywhere. The sorting habit is a learnable skill, and it takes maybe ten seconds per claim once you have the framework.
Every AI claim has an origin point, and that origin point has interests. This doesn't mean every claim is propaganda — but it does mean the interests are relevant to your evaluation.
When OpenAI publishes a benchmark showing GPT-4 outperforms other models, they have a direct financial incentive for that result. That doesn't invalidate the benchmark. But it should prompt you to ask: was this independently replicated? Did third parties run the same test and get the same result? Pharmaceutical companies run their own drug trials too, and we still require independent regulatory review before approving drugs. Why would AI benchmarks be different?
When a management consulting firm publishes a report saying AI will create enormous economic value, notice that their primary product is advising corporations on technology adoption. The report drives business. Again — it might be accurate. But "McKinsey says AI will add $4.4 trillion in value" is not a neutral scientific finding.
When a tech journalist writes a breathless headline about AI surpassing humans, notice that dramatic framing drives clicks and subscriptions. Nuanced, technically accurate headlines ("AI performs comparably to median radiologists on a specific subset of chest X-rays under controlled lab conditions") don't go viral. The incentive structure of media systematically amplifies the most dramatic version of any AI story.
The next time you encounter an AI claim — in your feed, in class, from a recruiter — pause and ask two questions before you form an opinion: What type of claim is this? (capability, adoption, trajectory, or value) and Who produced it and what do they get if I believe it? These two questions take fifteen seconds and will immediately put you ahead of most people in the conversation.
Let's go back to the radiologist headline. When researchers say "AI outperforms radiologists," they almost always mean: on a specific benchmark dataset, using a specific performance metric, compared to a specific group of radiologists, under specific testing conditions. Each of those specifics can be chosen in ways that favor the outcome the researchers are hoping to demonstrate.
The benchmark dataset might consist of clear, well-labeled images rather than the ambiguous ones that cause actual diagnostic difficulty. The performance metric might be sensitivity (the share of real cancers the system catches) while ignoring specificity (the share of healthy cases it correctly clears, which determines the false alarm rate). The comparison group might be radiologists working in low-resource settings, not trained specialists with full clinical context. The testing conditions remove the conversation with the patient, the clinical history, and the prior scans: the things that actually make a radiologist more accurate than an algorithm.
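To see why the choice of metric matters so much, here is a minimal Python sketch. The counts are invented for illustration (1,000 scans, 50 of them with cancer), and the names `ConfusionCounts`, `flag_everything`, and `careful_reader` are hypothetical, not taken from any real study. It shows that a system which flags every scan scores perfectly on sensitivity while being useless on specificity.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Outcome counts for a yes/no diagnostic test (hypothetical data)."""
    true_pos: int   # cancers correctly flagged
    false_neg: int  # cancers missed
    true_neg: int   # healthy scans correctly cleared
    false_pos: int  # healthy scans wrongly flagged (false alarms)


def sensitivity(c: ConfusionCounts) -> float:
    """Share of real cancers that were caught."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ConfusionCounts) -> float:
    """Share of healthy scans that were correctly cleared."""
    return c.true_neg / (c.true_neg + c.false_pos)


# Assumed scenario: 1,000 scans, 50 with cancer, 950 healthy.
flag_everything = ConfusionCounts(true_pos=50, false_neg=0, true_neg=0, false_pos=950)
careful_reader = ConfusionCounts(true_pos=45, false_neg=5, true_neg=900, false_pos=50)

for name, counts in [("flag_everything", flag_everything),
                     ("careful_reader", careful_reader)]:
    print(f"{name}: sensitivity={sensitivity(counts):.2f}, "
          f"specificity={specificity(counts):.2f}")
```

A headline that reports only sensitivity would rank the flag-everything system above the careful reader, which is why you want to know both numbers before trusting the comparison.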
None of this means the AI isn't impressive, or that Dani should finish her pre-med degree without thinking hard about AI's trajectory. The technology is genuinely advancing fast. But "AI outperforms radiologists" as a headline is doing a lot of work that the underlying study cannot support. Understanding that gap, between what the benchmark measured and what the headline claims, is the core skill this module is building.
This module is built around a practical audit framework you'll apply to real claims — ones you've actually encountered this week, not hypotheticals invented for a classroom. The framework builds across all four lessons. Here's where it starts.
When you encounter an AI claim, the first three questions to ask are:
In the next three lessons, we'll add layers: evaluating the evidence itself, auditing the deployment claim (real-world performance, not lab performance), and synthesizing a personal judgment that holds up under pushback. The goal isn't skepticism as a lifestyle — it's calibration. Knowing when a claim deserves your confidence and when it deserves your patience.
You've seen AI claims this week — in your feed, in a class lecture, from a recruiter, on a company website, in a group chat. Pick one that actually caught your attention. It doesn't have to be dramatic. It just has to be real.
Describe the claim to the AI below. Your analyst will ask you to classify it by type (capability, adoption, trajectory, or value), identify the source, and explain what you'd need to know to evaluate it properly. The analyst will push back on vague answers — that's the point.
Marcus is a junior majoring in computer science. His professor mentions in lecture that "studies show AI coding assistants increase developer productivity by up to 55%." Marcus goes home and puts this in his cover letter for a summer internship: "I'm well-versed in tools that have been shown to increase developer productivity by over 50%." He gets the interview. The recruiter asks him about that stat. He can't name the source. He'd never looked it up — why would he? The professor said it.
The stat, for what it's worth, comes from a 2022 GitHub study of GitHub Copilot: a study run by GitHub, which is owned by Microsoft, which is a major investor in OpenAI, whose models power Copilot. The study measured how fast programmers completed a specific, pre-defined coding task in a controlled environment. It didn't measure code quality, debugging time, integration errors, or what happened when the AI generated plausible-looking but incorrect code that took an hour to find. It also didn't measure productivity across a full workday or a full project. The 55% figure is a real number from a real study, but it describes speed on one task. It is not what most people mean when they say productivity.
Marcus got lucky — the recruiter didn't press hard. But the experience rattled him. He'd cited a claim he couldn't defend, built on a study he'd never read, that measured something narrower than the claim implied. That's the gap this lesson is about.
Not all evidence is equal, and in the AI space, the quality distribution is particularly skewed toward the weak end. Here's a rough hierarchy, from strongest to weakest:
When your classmates cite an AI stat, most of them are citing a journalist's summary of a consulting report that aggregated internal company data. The claim may be four or five steps removed from anything that was actually measured. You're in that same position most of the time — the difference is knowing it.
You don't need a PhD to spot weak evidence. These five patterns show up constantly in AI claims and take about thirty seconds to check for once you know what you're looking for.
Most people don't go to the source because it feels like a research project. It doesn't have to be. Here's a five-minute process that works for most AI claims you'll encounter.
Step 1: Copy the specific number or named result from the claim ("55% productivity increase," "94% accuracy on skin lesions," "top 10% of bar exam takers") and search it directly in Google Scholar or just Google. The original study almost always surfaces in the first few results when you search the specific metric rather than a paraphrased headline.
Step 2: Find the "Methods" section of the study. This is where the study design lives: the sample, the comparison, the metric definitions. If the study is paywalled, the abstract usually tells you enough to identify who was studied and what was measured. If even the abstract is inaccessible, check arXiv, which hosts free preprints of a large share of AI research.
Step 3: Check the funding disclosure, usually at the end of the paper. It will say something like "this research was supported by a grant from [company]." That's your conflict-of-interest flag.
Step 4: Search for independent coverage or replication. Has another research group run a similar study? Has a science journalist written a skeptical follow-up? Search the study title plus "criticism," "replication," or "limitations."
If you can't locate a source after five minutes of actual searching, that itself is information. A claim with no traceable origin deserves zero confidence as a factual statement — it can still be an interesting hypothesis worth entertaining, but it shouldn't be cited as evidence.
Before you repeat an AI statistic — in a cover letter, in a class discussion, in a conversation with a recruiter or professor — spend five minutes locating the original source. You don't have to read the whole paper. You just need to be able to say: "This came from [study], funded by [who], measuring [what], in [what context]." That sentence is the difference between citing a claim and understanding it.
There's a predictable lifecycle to AI claims. A research team publishes a paper with careful hedging: "Under controlled conditions, our model showed statistically significant improvement over baseline radiologist performance on a subset of chest X-ray classifications." That gets picked up by a science journalist who writes: "New AI beats doctors at reading X-rays." That gets shared on LinkedIn with the caption: "AI is replacing radiologists. Are you ready?" That becomes the version your friend forwards you, fully severed from the hedging that was in the original.
Each step in this chain is often done in good faith. The journalist isn't lying — they're simplifying for a general audience. The person who shared it on LinkedIn isn't fabricating — they're reacting to the article. But the cumulative effect is a claim that's traveled so far from its origins that it no longer represents what the researchers actually found. This is not a flaw in the system — it's a feature of how media and social networks work. Alarming, simple, consequential-sounding content travels faster than nuanced, caveated, methodologically careful content. Always has. The AI domain just gives this dynamic particularly high stakes because the claims are being used to make real decisions about careers, education, and policy.
Take the AI claim you identified in Lab 1, or pick a new one. Your task: spend five minutes trying to find the original source. Then report your findings to the analyst below — where you looked, what you found, what the original study actually says vs. what the claim implied, and what's still unclear.
The analyst will help you evaluate what you found and push back if you're being too credulous or too dismissive. The goal is calibration, not cynicism.
Priya lands a part-time job as a junior data analyst at a mid-size insurance company. In her first week, she notices that underwriters are still manually reviewing every auto insurance application flagged by the company's AI risk-scoring tool. She asks her manager why, given that the AI is supposed to handle this. The manager sighs. "The AI works great in the demo. In production, it flags about 40% of applications as high-risk. Most of them aren't. But the legal team said we can't just act on it without human review because of insurance anti-discrimination laws."
So the AI is deployed. It runs on every application. The company reports to investors that they use "AI-driven underwriting." And every human underwriter is doing more work than they were before the AI was introduced, because now they're reviewing both the application and the AI's frequently-wrong flag. The AI didn't replace anyone. It added a step. And nobody outside the company would know this from the press release, which reads: "We leverage advanced machine learning to streamline our underwriting process."
This is the deployment gap. Lab performance — or even a cherry-picked production metric — is not the same as real-world impact. Priya's company isn't lying in the press release. But the gap between "we use AI" and "AI is making us more efficient" is enormous, and it's a gap that almost no external claim about AI deployment will acknowledge.
Lab conditions and real-world deployment differ in at least five structural ways that matter for evaluating AI claims.
Data quality. Labs use curated, clean data. Real-world data is messy, incomplete, inconsistently labeled, and often structured in ways that weren't anticipated when the model was trained. A customer service AI trained on formal ticket data will perform differently on voice transcripts, regional dialects, or rushed abbreviations that real customers use.
Edge cases. Labs optimize for average performance. Real-world deployment is defined by edge cases — the unusual inputs that happen 5% of the time but cause 50% of the serious errors. Benchmark accuracy of 95% sounds impressive until you realize the 5% error rate is concentrated in the most complex and consequential cases.
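The arithmetic behind that last point is worth seeing once. This sketch uses made-up numbers chosen to match the 95% figure above: 1,000 cases, 100 of them complex, and 50 errors of which 40 fall on the complex cases. The function name and the split between routine and complex cases are assumptions for illustration only.

```python
def accuracy(total_cases: int, errors: int) -> float:
    """Fraction of cases handled correctly."""
    return (total_cases - errors) / total_cases


# Hypothetical split: errors are concentrated in the complex cases.
routine_cases, routine_errors = 900, 10
complex_cases, complex_errors = 100, 40

overall_accuracy = accuracy(routine_cases + complex_cases,
                            routine_errors + complex_errors)
routine_accuracy = accuracy(routine_cases, routine_errors)
complex_accuracy = accuracy(complex_cases, complex_errors)

print(f"overall: {overall_accuracy:.1%}")
print(f"routine: {routine_accuracy:.1%}")
print(f"complex: {complex_accuracy:.1%}")
```

Under these assumed numbers the headline figure is 95%, yet the system is right only 60% of the time on the complex cases, which are the ones where mistakes matter most.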
Human-system interaction. Lab tests typically isolate the AI from human behavior. In deployment, humans interact with the AI in ways researchers didn't predict — they game it, they over-rely on it, they ignore it, they develop workarounds. These behavioral adaptations change what the system actually produces at scale.
Organizational friction. Real deployment involves legal compliance, IT integration, training, user adoption, and organizational politics. An AI that performs perfectly in isolation may be constrained by legal liability concerns (as in Priya's story), or hobbled by an IT infrastructure that can't support real-time inference, or ignored by employees who don't trust it.
Monitoring and drift. AI models degrade over time as the world changes and the data they were trained on becomes stale. Lab results don't capture this. A model that was 94% accurate at launch may be 78% accurate eighteen months later if the underlying data distribution shifted. Most companies don't have robust monitoring for this.
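Monitoring for this kind of decay does not have to be elaborate. The sketch below is one possible approach, not a description of how any particular company does it: keep a rolling window of recent predictions whose true outcome is now known, and raise a flag when accuracy in that window falls below a threshold. The class name `AccuracyMonitor`, the window size of 100, and the 0.85 threshold are all assumptions chosen for illustration; the 94% and 78% figures echo the example above.

```python
from collections import deque
from typing import Optional


class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions (illustrative only)."""

    def __init__(self, window_size: int, alert_below: float) -> None:
        # deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)
        self.alert_below = alert_below

    def record(self, prediction_was_correct: bool) -> None:
        self.outcomes.append(prediction_was_correct)

    def rolling_accuracy(self) -> Optional[float]:
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self) -> bool:
        # Only alert once the window is full, to avoid noisy early readings.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        acc = self.rolling_accuracy()
        return window_full and acc is not None and acc < self.alert_below


monitor = AccuracyMonitor(window_size=100, alert_below=0.85)

# At launch: 94 of 100 predictions turn out to be correct.
for i in range(100):
    monitor.record(i >= 6)
launch_accuracy = monitor.rolling_accuracy()
launch_alert = monitor.needs_attention()

# Eighteen months later: only 78 of the latest 100 are correct.
for i in range(100):
    monitor.record(i >= 22)
later_accuracy = monitor.rolling_accuracy()
later_alert = monitor.needs_attention()

print(launch_accuracy, launch_alert)
print(later_accuracy, later_alert)
```

The hard part in practice is not the code but getting the true outcomes to compare against, which is one reason a lab result says little about whether anyone is checking the system after launch.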
When a classmate says "I use AI to write my essays and it's amazing," they're reporting lab-condition results on their best use case. What they're less likely to mention: the hours spent editing AI-generated prose that didn't match their voice, the times the AI confidently stated something factually wrong, the professor who noticed the writing style change. Self-reported AI productivity is almost always optimistic for the same reasons company press releases are.
In 2018, Reuters reported that Amazon had scrapped an internal AI recruiting tool that had been in development since 2014. The system was designed to review resumes and score candidates. On the historical data it was tested against, it worked. But before it was relied on for real hiring decisions, Amazon's own engineers discovered it was systematically downgrading resumes that included the word "women's" (as in "women's chess club") and downranking graduates of all-women's colleges. The model had learned from historical Amazon hiring data, which reflected the company's existing gender imbalance in technical roles. It was, in effect, trained to replicate past bias.
This is not an obscure corner case. It's a canonical example of what happens when a system that "works" on historical data is validated against that same historical data, which bakes in the problems the system was supposed to solve. Amazon's engineers caught it before the system went live. Most companies don't have the technical sophistication to run this kind of audit.
When you hear an AI claim about hiring efficiency, talent optimization, or candidate scoring, the relevant question isn't "is the AI accurate?" It's "accurate at predicting what — and is that what it should be predicting?" A model that predicts which candidates resemble past hires is accurate in the narrow sense. Whether "resembles past hires" is a valid proxy for "will be a good employee" is a different question entirely.
When a company, employer, or institution claims to use AI, you're now equipped to ask questions that distinguish real deployment from marketing deployment. These questions aren't aggressive — they're the kind a thoughtful analyst or potential employee should ask.
If you're interviewing for a job or internship and the company mentions AI systems, ask one of these questions. Not as a gotcha — as genuine curiosity. "How do you measure the system's performance after it's deployed?" or "What does the error-handling process look like?" A company that can answer these questions clearly is a company that's actually thinking about deployment. A company that can only show you the demo is a company that hasn't closed the loop between lab and reality.
None of this is an argument that AI doesn't work. It does — in specific contexts, with good data, with appropriate human oversight, with honest measurement, and with realistic expectations about what "working" means. The deployment gap is real, but it's narrowing in some domains. Medical imaging AI is genuinely useful in clinical settings with appropriate human-in-the-loop validation. Code completion tools are saving developers real time on real tasks. Language models are making translation and accessibility tools dramatically better for underserved populations. These are meaningful gains, not hype.
The goal of this module isn't to make you skeptical of everything. It's to help you be skeptical of the right things — the gap between the claim and the evidence, the gap between the benchmark and the deployment, the gap between the press release and the analyst's notes. That calibration is what makes you useful to any organization or project that involves AI: you can see through the noise without dismissing the signal.
Take an AI deployment claim — from a company website, a job posting, a news article about an institution you interact with (your university, a bank, a healthcare provider, a retail brand). Something that says "we use AI to do X." Your job is to analyze that claim against the deployment gap framework: What decision is being made? Who handles errors? How is performance measured? What data is it running on?
Describe the claim and your analysis to the auditor below. You'll be pushed to go further than surface-level observation and take a real position on whether this deployment claim holds up.
Jordan is preparing for a final presentation in their media studies seminar on "AI and the Information Economy." They've spent two weeks researching an AI content moderation system used by a major social platform. They've read the company's transparency report, found two academic papers examining its false positive rate, and found a ProPublica investigation from 2024 showing that the system flags content from certain communities at three times the rate of equivalent content from others.
Now they're staring at their notes trying to figure out what to say. The technology is real and does remove some genuinely harmful content — that's not a lie. It also has a documented fairness problem that the company acknowledged but hasn't fixed in three years. Jordan doesn't want to sound naive by saying "AI moderation is fine" and doesn't want to sound paranoid by saying "this is a civil rights violation." But their professor is going to ask them: what do you actually think?
This is the moment the audit is for. Not to win an argument. Not to perform skepticism. But to synthesize what you've found into a position you can actually defend when someone pushes back on it.
An audit verdict isn't a binary — it's not "this claim is true" or "this claim is false." Most real AI claims are more complicated than that. What you're producing is a calibrated assessment: here's what the claim says, here's what the evidence supports, here's where the gap is, here's what I'd need to see to update my view. That's a useful thing to have in the world — much more useful than either credulity or blanket dismissal.
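A complete audit verdict has four components: the claim, precisely stated; the evidence, characterized honestly; the gap between them, identified specifically; and your verdict, with a confidence level and the conditions under which you would update it.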
Jordan's final verdict on the content moderation system: "The system does remove a meaningful volume of harmful content. That claim is supported by company data and consistent with independent research. However, the fairness claim (that it applies consistent standards across communities) is not supported: it is specifically contradicted by ProPublica's 2024 analysis and the company's own transparency data. My overall assessment: effective at its core task, with a documented and unresolved fairness problem that the company has had three years to address. Confidence that the system is effective: moderate. Confidence that it is fair: low, which puts me at odds with the company's position."
One of the trickier skills in AI literacy is holding contradictory-seeming truths simultaneously without collapsing them into a single narrative. Jordan's content moderation example is a good model: the system works, and the system is unfair. Both are true. The nuance is real, not a cop-out.
This pattern shows up everywhere in the AI landscape. AI language models are impressively capable at a wide range of tasks, and they hallucinate with confidence in ways that can cause real harm. Medical imaging AI is genuinely improving diagnostic accuracy in some settings, and it has documented performance disparities across racial groups that have not been adequately addressed. Generative AI is creating new opportunities for creative work, and it's creating real economic disruption for working artists and illustrators right now. All of these pairs are true at the same time.
The narrative pressure — from AI boosters and AI alarmists alike — is to pick a side. "AI is transformative and mostly positive" or "AI is dangerous and mostly negative." Both of those framings are simpler and more emotionally satisfying than the actual picture. They're also both wrong, or at minimum they're systematically omitting half the evidence. Your job after a good audit is to resist that pressure and say what you actually found.
You will be asked what you think about AI — in job interviews, in class discussions, in conversations with family, in decisions about your own work and career. Having done an actual audit on a real claim gives you something to say that almost nobody else has. Not "AI is the future" (boring, probably wrong). Not "AI is overhyped" (partially true, underspecified). But: "I looked at a specific claim about [X], here's what the evidence actually shows, and here's what I'd need to see to change my view." That's a response that makes people take you seriously.
A verdict isn't a final answer — it's your current best read, held with appropriate confidence, open to revision when new evidence arrives. This is the calibration mindset, and it's more useful than either dogmatic belief or reflexive skepticism.
Calibration means: your confidence in a claim scales with the quality of the evidence. When a single company-funded study supports a capability claim, you hold it loosely. When three independent academic papers and two investigative journalism pieces all point in the same direction, you hold it more firmly. When evidence conflicts — one study shows X, another shows not-X — you hold both loosely and treat the question as genuinely open.
This is harder than it sounds in practice, because people with strong opinions will push back on your uncertainty. "So you think the AI doom people are wrong?" Yes, mostly, based on the current evidence. "So you think AI critics are just Luddites?" No. Some of their specific concerns are well-documented. "So you don't have a strong view?" I have a calibrated view, which is different from not having a view. Learning to hold that ground, confident in your process without being overconfident in your conclusions, is the actual skill this module has been building.
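You've now built all four layers of the audit framework across this module. Here's how they fit together as a complete practice: classify the claim and identify who produced it, evaluate the evidence behind it, audit the gap between lab performance and real-world deployment, and synthesize a verdict you can defend under pushback.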
This framework takes maybe fifteen minutes to run through for a claim you care about. It's not about doing this for every piece of information you encounter — that's not sustainable. It's about developing the habit of doing it for the claims that matter: the ones you're going to repeat, cite, act on, or build a decision around. That's a different population than every tweet you scroll past. It's a small enough set to be manageable and important enough to be worth the time.
The final lab will have you synthesize a complete written audit on a claim of your choice and defend your verdict when challenged. That's the capstone of everything this module has built.
This is the capstone lab. You're going to present a complete audit of an AI claim — the one you've been working with, or a new one if you prefer. The analyst below will receive your audit across four components: (1) the claim precisely stated, (2) the evidence characterized honestly, (3) the gap identified specifically, and (4) your verdict with a confidence level and update conditions.
The analyst will push back — not to be difficult, but to test whether your verdict holds under pressure. If you've done the work, it will. If there are weak spots, you'll find out here, where it's safe to find them.