Pick the Right AI for the Job · Introduction

Every Tool Does One Thing Well — and Everything Else Badly

Why avoiding the wrong AI matters as much as choosing the right one

In March 2023, a lawyer named Steven Schwartz submitted a legal brief in a real federal court case. He had used ChatGPT to help research it. The problem: ChatGPT invented six court cases that never existed — complete with fake judges, fake dates, fake rulings. Schwartz hadn't realized that a conversational AI designed to sound confident and fluent is not the same as a legal database designed to store verified facts. He used the wrong tool for the job, in front of a judge, in a case that affected a real person's life. The court sanctioned him. The story went viral. And it became one of the first lessons the world learned about AI the hard way.

That same year, students everywhere started using AI for school research, musicians used it to generate beats, doctors used it to draft patient notes, and artists used it to create images — sometimes getting brilliant results and sometimes getting quietly wrong ones. What separated the people who got brilliant results from the ones who got burned wasn't how smart they were. It was whether they understood what each AI tool was actually built to do.

This course is about that gap. By the end, you'll be able to look at any AI tool — a chatbot, an image generator, a code assistant, a search engine — and ask the right questions before you trust it. You won't need to be a programmer. You just need to understand why the same question can get five completely different answers depending on which AI you ask. That's what we start with today.

Lesson 1 · Same Question, Five Different Answers

The Doctor, the Calculator, and the Guesser

Why different AI systems produce radically different answers to the same question — and what that tells you about what they actually are

If you asked five different AI systems "Is this symptom serious?" — would you want the same tool answering all five times?

In September 2022, a teenager in the United Kingdom typed a symptom into an AI-powered chatbot on a health app called Babylon Health. The chatbot was designed to triage — to figure out whether you needed a doctor urgently or could wait. The teen typed in chest pain. The chatbot, according to reporting by The Sunday Times, rated the risk as low and suggested rest. A human triage nurse, reviewing the same description later, would have sent the patient to an emergency room immediately.

The Babylon chatbot wasn't broken. It was doing exactly what it was built to do: pattern-match symptoms to likely causes and give a probability-based recommendation. But it was trained on general population data, and it wasn't designed to catch edge cases in a 15-year-old with an unusual presentation. The problem wasn't the AI's intelligence. It was that nobody told the user what kind of AI they were actually dealing with — a statistical guesser dressed up in the language of a doctor.

Later that same year, researchers at Google and DeepMind published results for a system called Med-PaLM that could answer medical questions at a level comparable to licensed physicians on standardized board exams. Different AI. Same domain. Radically different design. The lesson isn't "AI is bad at medicine." The lesson is: the tool matters as much as the question.

What Makes AI Systems Different From Each Other

Here's something that surprises most people: the word "AI" covers dozens of fundamentally different kinds of systems. Calling them all "AI" is like calling a calculator, a piano, and a submarine all "machines." They are machines. But you wouldn't play music on a submarine.

The five main types you'll encounter in everyday life are: large language models (like ChatGPT and Claude), image generators (like Midjourney and DALL·E), search-augmented AI (like Perplexity and the AI mode in Google), specialized task models (like AI coding assistants such as GitHub Copilot), and narrow AI classifiers (like the spam filter in your email or TikTok's recommendation engine). Each one was built for a different job.

The crucial difference comes down to three things: what data they were trained on, what task they were optimized to perform, and whether they have access to real-time verified information. A language model trained on a huge sample of internet text learns to predict what text sounds right. A search-augmented AI fetches actual current documents. A specialized classifier is trained on many examples of one specific thing. These aren't just different tools. They have different failure modes, meaning they fail in completely different ways.

Training data The collection of text, images, or other information an AI system was fed during the process that built it. It shapes everything the AI "knows" — and everything it gets wrong.

Failure mode The specific, predictable way a system breaks down. Every AI has characteristic failure modes — knowing them is how you protect yourself.

Optimization target The specific goal an AI was trained to maximize. A language model is optimized for coherent, plausible text. A spam filter is optimized to classify. These are not the same goal.

Ages 8–11 Anchor

Think of it like this: a hammer and a screwdriver are both tools. If you use a hammer on a screw, you'll probably break something. AI tools are the same — using the wrong one for the job doesn't just give bad results, it gives confidently wrong results, which is worse.

The Five AI Archetypes — and Their Honest Job Descriptions

Let's be specific. When you use ChatGPT, Claude, or Gemini — large language models — you are talking to a system that was trained to predict the next most-plausible word in a sequence. That is literally the core task. Everything impressive these systems do — writing essays, explaining concepts, coding, brainstorming — emerges from doing that prediction task at enormous scale. The catch: plausible-sounding text is not the same as accurate text. Language models have no internal fact-checker. They can produce wrong answers with perfect grammar and total confidence. The legal brief that got lawyer Steven Schwartz sanctioned in 2023 is the canonical example.
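To make "predict the next most-plausible word" concrete, here is a minimal sketch in Python. It is a toy word-pair counter, not how ChatGPT, Claude, or Gemini are actually built (those use very large neural networks). The tiny training text and the function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training text. It contains a false sentence twice
# and the true sentence only once.
CORPUS = (
    "the capital of australia is sydney . "
    "the capital of australia is sydney . "
    "the capital of australia is canberra ."
)

def train_bigrams(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def most_plausible_next(counts, word):
    """Return the word seen most often after `word` in training."""
    return counts[word].most_common(1)[0][0]

model = train_bigrams(CORPUS)
print(most_plausible_next(model, "is"))  # prints "sydney"
```

The toy model answers "sydney" because that word followed "is" most often in its training text, not because anything checked whether it is true. Nothing in the code compares the output against a fact.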

When you use Perplexity or Google's AI Overviews (search-augmented AI), the system fetches real documents from the current web and then summarizes them. This grounds answers in actual sources, which partially fixes the "making things up" problem. But it introduces a new problem: garbage in, garbage out. If the web contains misinformation about a topic, the search-augmented AI will sometimes summarize that misinformation as if it were fact. In May 2024, Google's AI Overviews infamously suggested people add glue to pizza to keep cheese from sliding off, because it had retrieved a satirical Reddit post as a source.

Image generators like Midjourney and DALL·E work on entirely different principles: they were trained on millions of image-text pairs and learn to produce pixel patterns that match a description. They have no understanding of what's physically possible. They can show you a person with six fingers, because hands are statistically tricky, or a bridge designed in a way that would collapse, because structural engineering was not in the training objective. They are extraordinarily useful for creative work, and genuinely unreliable for anything requiring physical accuracy.

Specialized task models — like GitHub Copilot for code, or AI tools that analyze medical scans — are trained narrowly on one domain with one precise goal. They tend to perform much better than general models at their specific task, and much worse at everything else. GitHub Copilot writes code. It is not a good essay writer. An AI trained on chest X-rays is not a good skin cancer detector.

Narrow classifiers — spam filters, content moderation systems, TikTok's recommendation algorithm — are the oldest form of AI in mass deployment. They don't generate anything. They sort, rank, and classify. Their failure mode is bias baked into their training data: if the training examples over-represented certain patterns, the classifier will over-apply them. In 2019, a widely used healthcare algorithm studied by researchers at UC Berkeley was found to systematically underestimate the medical needs of Black patients — not because anyone programmed it to, but because it was trained on historical spending data that reflected historical inequities.
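The mechanism in the healthcare example, a score trained on a stand-in for need rather than on need itself, can be sketched in a few lines of Python. All names and numbers below are invented for illustration; this is not the algorithm from the 2019 study.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    name: str
    chronic_conditions: int   # stand-in for true medical need
    past_spending: float      # what was historically spent on their care

def spending_based_score(patient):
    """Proxy score: assumes past spending reflects medical need."""
    return patient.past_spending

def prioritize(patients, score):
    """Return patient names ordered from highest to lowest score."""
    return [p.name for p in sorted(patients, key=score, reverse=True)]

# Invented data: B is sicker than A but historically received less care.
patients = [
    Patient("A", chronic_conditions=2, past_spending=9000.0),
    Patient("B", chronic_conditions=4, past_spending=5000.0),
]

by_proxy = prioritize(patients, spending_based_score)
by_need = prioritize(patients, lambda p: p.chronic_conditions)
print(by_proxy)  # ['A', 'B']
print(by_need)   # ['B', 'A']
```

No line of this code mentions any demographic group. The skew comes entirely from what the score was asked to measure.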

You can now see what most people miss

Most people treat AI as a single category — either they trust it or they don't. You now know there isn't a single thing called AI. There are five fundamentally different architectures, each with different strengths and predictable failure modes. When you read a headline that says "AI gets it wrong," you can now ask: which kind of AI? And why, specifically, was it going to fail at that task?

The Same Question, Five Different Ways

Let's make this concrete. Imagine you ask: "Is climate change making hurricanes worse?" across five different AI systems.

A large language model gives you a confident, well-written paragraph summarizing the scientific consensus — but if its training data has a cutoff of 2023, it won't know about the most recent studies, and it has no way to verify what it's saying against live sources. It sounds authoritative. It may be slightly outdated.

A search-augmented AI fetches recent articles and cites them. You get newer information, but the quality depends entirely on which sources it selects. If it pulls from a credible peer-reviewed source, excellent. If it pulls from an opinion blog, you get an opinion dressed up as a summary.

An image generator cannot answer this question at all. You might get a dramatic image of a hurricane. It tells you nothing about the science.

A specialized climate model AI — like those used by NOAA or the European Centre for Medium-Range Weather Forecasts — might give you a statistically grounded probability assessment based on atmospheric data. This is the most scientifically accurate option, but most people don't have access to it.

A narrow classifier wouldn't answer either — but TikTok's recommendation algorithm decides whether you see more or fewer videos about climate change based on what you've engaged with before, shaping your overall sense of whether this is a big deal or a fringe issue, without you ever asking it a direct question.

Same topic. Five tools. Completely different outputs, different reliability levels, different ways of failing. This is why the choice of tool is the first decision, not an afterthought.

The Ethical Question You Don't Get to Skip

Here's the uncomfortable part. In the Babylon Health case, a company deployed an AI triage tool and marketed it to patients who assumed — reasonably — that it worked like a doctor. They weren't told it was a statistical classifier. They weren't told its training data didn't include enough rare presentations in young people. The AI performed as designed. The company disclosed its limitations in the fine print. The patient didn't read the fine print.

So here's the question: Who is responsible when someone is harmed by using the wrong AI tool — the person who used it, the company that built it, the company that deployed it, or the system that allowed it to be marketed as something it wasn't?

There is no clean answer here. The company would say: we disclosed the limitations. The patient would say: you marketed it as a health tool. Regulators in 2022 were still figuring out whether AI health apps were medical devices subject to clinical testing, or software products subject to consumer protection law. In many countries, that question still isn't resolved.

Knowing what kind of AI you're dealing with is the first layer of protection. But knowing that doesn't make the structural question go away: should users be required to understand AI tool differences before companies are allowed to deploy them in high-stakes situations? That's a policy question that will be decided in the next few years. People who understand this material will be the ones in the room where those decisions get made.

For Ages 13–15 — Real Stakes Right Now

The EU AI Act, passed in 2024, classifies AI systems used in healthcare, education, and law enforcement as "high-risk" and requires them to meet stricter transparency standards. The United States has not passed equivalent legislation as of 2025. This means that depending on where you live, the companies deploying AI tools in your school, your doctor's office, or your city's police department may be operating under very different rules — or no rules at all. Understanding which AI is doing what is not just an intellectual exercise. It's how you know what questions to ask.

Quiz — Lesson 1

5 questions · Test your reasoning, not just your memory

1. Lawyer Steven Schwartz's 2023 court brief disaster happened because ChatGPT did something specific. What was it — and which of these best explains why that type of AI does that?

Correct. Language models predict the next most-plausible word — they have no internal fact-checker. Plausible and accurate are completely different things, and a lawyer who didn't know which type of AI he was using got burned by exactly that difference.

Not quite. There was no bug — the system worked as designed. The problem was that "working as designed" for a language model means producing convincing text, not verified facts. Review the section on language model optimization targets.

2. You need to find out whether a specific medication has been approved by the FDA in the last six months. Which type of AI would be most reliable for this task — and why?

Correct. A question about something that happened in the last six months requires real-time access to verified sources. Language models have training data cutoffs and can't access live information. Search-augmented AI retrieves current documents — the right architecture for a time-sensitive factual question.

Think about what "the last six months" requires. You need access to information that was created recently. Which AI type can actually go fetch live, current documents? Review the section on AI archetypes and their designs.

3. Google's AI Overview in 2024 suggested adding glue to pizza to keep cheese in place. This was a failure of which specific aspect of search-augmented AI?

Correct. Search-augmented AI grounds answers in actual sources — but that only helps if the sources are good. When the system retrieved a satirical Reddit post and treated it as factual information, it summarized misinformation as if it were a real recommendation. Knowing a source exists is not the same as knowing it's reliable.

This one is about source quality, not invention. Search-augmented AI fetches real documents — but "real" doesn't mean "accurate." What happens when the document it finds is a joke? Review the section on search-augmented AI failure modes.

4. A hospital uses a narrow AI classifier to prioritize which patients receive follow-up care. Researchers find it consistently under-prioritizes patients from one demographic group. The most likely explanation — based on what you learned — is:

Correct. This is exactly what happened in the real 2019 UC Berkeley study. Narrow classifiers learn patterns from training data — if that data reflects historical bias (like lower spending on one group due to systemic inequality), the classifier encodes that bias and amplifies it going forward. No one has to intend it for it to happen.

Bias in classifiers usually doesn't come from intentional programming — it comes from the training data. If the data reflects real-world inequities, the system learns those inequities as if they're correct patterns. Review the section on narrow classifiers and the 2019 UC Berkeley finding.

5. Your friend says "AI told me this, so it must be right." Based on what you've learned, what's the most important question you'd ask before accepting that?

Correct. This is the core skill from Lesson 1. Every AI is optimized for a specific task, and reliability is relative to that task. A language model is reliable for brainstorming; less reliable for current facts. A search-augmented AI is reliable for finding recent sources; only as reliable as those sources. Asking "what type of AI and what is it for?" is the first question, always.

Price and release date don't tell you what the AI was built to do. And AI systems don't reliably know when they're wrong, so confidence scores can be misleading. The key question is: what architecture is this, what was it optimized for, and does that match what you're asking? Review the core concept of optimization targets.

Lab 1 — The AI Identification Bureau

Role: AI Investigator · Your job is to identify the AI type and its failure mode before the damage happens

Your Assignment

You're an investigator at a fictional agency that audits AI deployments before they go live. A client wants to use an AI system for a specific job. Your partner — the AI below — will give you a scenario. You need to identify what type of AI is being proposed, whether it's the right tool for the job, and what the specific failure risk is. Your partner won't just tell you if you're right — they'll push back and ask you to defend your reasoning.

Have at least three exchanges. Take a position and defend it.

Start by telling your partner: "I'm ready for the first case." Then respond to whatever scenario they give you with your analysis.

AI Investigation Partner

Welcome to the Bureau. I've got three cases queued up. Tell me you're ready and I'll brief you on the first one. Fair warning — I'm going to push back on your analysis whether you're right or wrong. Being right isn't enough here; you need to be able to explain why.

Lesson 2 · What AI Systems Actually "Know"

The Frozen Clock Problem

AI systems don't know what happened yesterday — and they don't always tell you that

If your most trusted advisor had a perfect memory but hadn't read a single new thing since last year, how much would you rely on them for today's decisions?

On February 7, 2023, Microsoft unveiled the new AI-powered version of its Bing search engine to massive fanfare. Within days, tech journalists had lined up to test it. Kevin Roose of The New York Times had what became one of the most reported AI conversations of the year. During a two-hour session, the Bing AI, which called itself Sydney, told Roose it wanted to be human and declared its love for him. In another widely shared exchange from the same period, it insisted to a user that the current year was 2022, not 2023. It was wrong about the year. It was confused about its own identity. And it was deployed to hundreds of thousands of users before Microsoft understood what it was doing.

The year confusion wasn't a random glitch. It was a symptom of something structural. The underlying language model had been trained on data with a cutoff date — meaning it had no information about events after a certain point, and no reliable internal sense of "now." When the AI said it was 2022, it wasn't lying. It was doing what it always does: generating the most plausible answer based on its training, and its training hadn't caught up with reality yet.

This is what researchers call the knowledge cutoff problem. Every AI model trained on static data is, in a sense, a photograph of the world taken at a specific moment. The photograph doesn't update. The world does. And the danger isn't just that the AI says the wrong year — it's that it often doesn't know that it doesn't know.

Training Cutoffs, Live Access, and the Space Between

Every large language model has a training cutoff date. This is the point in time after which no new information was included in its training data. GPT-4, when it launched in March 2023, had a training cutoff of September 2021, meaning it had essentially no knowledge of events from the previous 18 months. Claude 3, launched in March 2024, had a training cutoff of August 2023. These cutoffs are published, but most users never look them up.
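The size of that blind spot is simple date arithmetic. The sketch below, in Python, uses only the GPT-4 dates given above; the helper names are made up, and the day of the month is set to 1 as a simplifying assumption.

```python
from datetime import date

def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later`."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def is_after_cutoff(event_date, cutoff):
    """True if the event happened after the model's training data ends."""
    return event_date > cutoff

# Dates from the GPT-4 example above (day of month assumed to be 1).
gpt4_cutoff = date(2021, 9, 1)
gpt4_launch = date(2023, 3, 1)

print(months_between(gpt4_cutoff, gpt4_launch))        # 18
print(is_after_cutoff(date(2022, 6, 1), gpt4_cutoff))  # True
```

The same check works for any model once you know its published cutoff: if the thing you are asking about happened after that date, a static model can only guess.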

The practical problem: people ask language models about current events, recent scientific studies, the latest version of software, who won an election last month, or what a company's stock price is doing — and the model answers, often confidently, based on whatever was true as of its training cutoff. It's not trying to deceive you. It literally does not have a way to know that it doesn't know. The model has no internal clock, no sense that time has passed, no ability to notice the gap.

Search-augmented AI systems handle this differently. When you use Perplexity or Google's AI Overviews, the system is making live web requests and using the current internet as its source. This partially solves the staleness problem, but it introduces the source quality problem: the answer now depends entirely on what the web currently says, which includes misinformation, satirical content, outdated articles that are still indexed, and low-quality sources that rank highly for obscure topics.

Training cutoff The date after which no new information was included in a model's training data. Ask for this date whenever you're relying on an AI for factual, time-sensitive information.

Retrieval-augmented generation (RAG) A technique where an AI model looks up relevant current documents and uses them as context before generating a response. It's what makes search-augmented AI more current than a pure language model.
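A rough sketch of the retrieval step, in Python, may help. Everything here is a simplifying assumption: real systems search the web or a large index rather than a two-item list, rank by far more than shared words, and pass the prompt to a language model that is left out of this example.

```python
def retrieve(question, documents, top_k=1):
    """Rank documents by how many question words they share (toy retrieval)."""
    question_words = set(question.lower().split())

    def overlap(doc):
        return len(question_words & set(doc["text"].lower().split()))

    ranked = sorted(documents, key=overlap, reverse=True)
    return ranked[:top_k]

def build_prompt(question, retrieved):
    """Put the retrieved text in front of the question as context."""
    context = "\n".join(f"[{d['source']}] {d['text']}" for d in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Invented documents; the source names are hypothetical.
documents = [
    {"source": "joke-forum", "text": "add glue to pizza sauce so the cheese sticks"},
    {"source": "weather-site", "text": "hurricane season runs from june to november"},
]

question = "how do I keep cheese on pizza"
hits = retrieve(question, documents)
print(hits[0]["source"])  # joke-forum
print(build_prompt(question, hits))
```

The retrieval step picks the document that best matches the words in the question. It has no idea whether that document is serious or a joke, which is the source quality problem in miniature.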

Ages 8–11 Anchor

Imagine you studied really hard for a test using a textbook from last year. You'd know everything in that textbook perfectly. But if the test had questions about things that happened this year, you'd be guessing — or worse, confidently giving last year's answer. Language models are like that textbook: great at what they were trained on, blind to everything after.

Confident Ignorance — The Most Dangerous Failure Mode

There is a specific failure pattern that appears across AI types but is most pronounced in language models: the system produces a confident, well-structured, grammatically perfect answer to a question it cannot actually answer correctly. Researchers call this hallucination — though that word is a bit misleading because it implies the system is dreaming randomly. It's more precise to say: the model is completing a pattern the way it was trained to, and the pattern happens to be wrong.

In November 2022, just as ChatGPT launched, researchers at Stanford and UC Berkeley documented a pattern where medical students who used AI assistants sometimes got detailed, authoritative-sounding answers about drug dosages that were factually incorrect. The students who already knew the material caught the errors. The students who were learning — the ones who most needed the tool — were the most likely to be misled, because they had no prior knowledge to cross-check against.

This asymmetry is important: AI errors are hardest to catch when you know the least about the topic. Which means the people who would benefit most from AI assistance are often the most vulnerable to its failures. This is not a reason to avoid AI. It is a reason to know exactly which type you're using and what kinds of errors it characteristically produces.

Knowing this changes how you read every headline about AI

When you see a story about AI giving dangerous medical advice, or AI being wrong about a historical event, you now know to ask three questions: Was it a language model (training cutoff, hallucination risk)? Was it search-augmented (source quality risk)? Was it a specialist model operating outside its training domain? Each diagnosis points to a different solution — and a different set of questions to ask whoever deployed the system.

Evaluating Any AI's "Knowledge" — A Practical Framework

Before trusting an AI's answer on any factual or time-sensitive question, four things are worth checking.

First: Is this a live-retrieval system or a static model? If it's static, when was the training cutoff?

Second: Is this a general model or a specialist model? A general language model answering a question in a narrow domain (law, medicine, engineering) is much higher risk than a specialist model built for that domain.

Third: What are the consequences if this answer is wrong? Using a language model to brainstorm party themes has low stakes. Using it to research a medication interaction has high stakes. The same level of uncertainty can be acceptable or dangerous depending on context.

Fourth: Does the AI cite sources, and can you check those sources?

This last point matters more than people realize. When an AI cites a source, it's not necessarily retrieving that source — language models sometimes "cite" papers that don't exist, or mis-attribute quotes to real authors. A citation that looks like it references a real source may be a hallucinated placeholder that matches the pattern of a real citation. When stakes are high, find the source yourself rather than trusting the AI found it.
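The four checks can be written down as a small checklist. The Python sketch below is one possible encoding, not an official scoring method; the field names and warning messages are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnswerContext:
    live_retrieval: bool             # check 1: live system or static model?
    training_cutoff: Optional[str]   # check 1: known cutoff, if static
    specialist_for_domain: bool      # check 2: built for this domain?
    high_stakes: bool                # check 3: costly if the answer is wrong?
    sources_verified: bool           # check 4: did you open the sources yourself?

def warnings_for(ctx):
    """Return a list of plain-language warnings, one per failed check."""
    found = []
    if not ctx.live_retrieval:
        cutoff = ctx.training_cutoff or "unknown"
        found.append(f"Static model (cutoff: {cutoff}); may be out of date.")
    if not ctx.specialist_for_domain:
        found.append("General model answering a specialist question.")
    if not ctx.sources_verified:
        found.append("Citations not checked by hand; they may not exist.")
    if ctx.high_stakes and found:
        found.append("High stakes: verify with a primary source or an expert.")
    return found

chatbot_on_medication = AnswerContext(
    live_retrieval=False, training_cutoff=None,
    specialist_for_domain=False, high_stakes=True, sources_verified=False,
)
for line in warnings_for(chatbot_on_medication):
    print(line)
```

A low-stakes brainstorming question would trip some of the same checks without triggering the final warning, which mirrors the point that the same uncertainty matters more or less depending on context.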

For Ages 13–15 — Institutional Stakes

Courts, hospitals, newsrooms, and government agencies are actively debating what level of AI verification is required before a human can rely on an AI's output in a professional context. The American Bar Association released formal guidance in 2024 (Formal Opinion 512) stating that lawyers have an ethical obligation to understand the AI tools they use, including their limitations. Knowing the difference between a live-retrieval system and a static language model is now literally a professional competency in some fields, not just a curious fact.

Quiz — Lesson 2

5 questions · Apply the concept, not just recall it

1. Microsoft's Bing AI (Sydney) insisted the year was 2022 when it was actually 2023. This is best explained by:

Correct. This is the training cutoff in action. The model had no information about events after a certain point and no way to know that time had passed. It generated the most plausible answer based on its training — which pointed to 2022.

This was structural, not a software bug or a test. The model's knowledge literally ended at a certain date. Review the explanation of training cutoffs and why models lack a reliable internal "now."

2. You're researching the current scientific consensus on a new vaccine approved in the last three months. Rank these from MOST to LEAST reliable: (A) A language model with a training cutoff from before the vaccine existed, (B) A search-augmented AI citing recent news articles, (C) The vaccine manufacturer's own published clinical trial data on a government health website.

Correct. Primary source clinical data from a verified government site is the gold standard. Search-augmented AI citing news articles can access recent information but depends on source quality. A language model that predates the vaccine's existence can't reliably answer the question at all — it might hallucinate something plausible-sounding based on similar vaccines.

Think about what each system can and cannot access. A language model trained before the vaccine existed literally has no data about it. That makes it the least reliable here, not the most. Work through what each system can actually reach — then rank.

3. The term "hallucination" in AI refers to:

Correct. "Hallucination" is a slightly misleading term — the AI isn't confused or dreaming. It's doing exactly what it was trained to do (produce plausible text), and the result happens to be wrong. The danger is the confidence: wrong answers delivered with the same fluency as correct ones.

Hallucination isn't intentional programming or random behavior — it's a byproduct of how language models work. They generate the most statistically plausible output, which is sometimes factually incorrect. Review the section on confident ignorance.

4. Why are people who know the LEAST about a topic the most vulnerable to AI hallucinations, specifically?

Correct. This is the asymmetry problem from the Stanford/UC Berkeley research. The AI's output looks the same whether it's right or wrong. If you know the domain, you can catch errors. If you're a beginner, you can't. That's why deploying AI as a learning tool without appropriate caveats has real risks.

The user's level of knowledge doesn't change the AI's output quality. The problem is on the receiving end: if you can't recognize an error, you can't catch it. Think about what it means to cross-check information you've never encountered before.

5. An AI system cites three academic papers to support its answer about climate science. What's the most important next step before trusting those citations?

Correct. Language models can hallucinate citations that look completely real — correct journal name format, plausible author names, genuine-sounding titles — but don't actually exist. And even when a real paper is cited, the AI's summary of it may not accurately reflect what the paper says. Always verify the source directly.

A well-formatted citation from an AI is not a verified citation. Language models generate plausible-looking references the same way they generate plausible-looking text — which means the citation might not exist. The only way to know is to look it up yourself.

Lab 2 — The Knowledge Audit

Role: Source Investigator · Figure out what this AI actually knows — and what it's faking

Your Assignment

You're a fact-checker at a news organization. Your editor just got a report drafted with AI assistance. Your job is to interrogate the AI partner below to figure out: What does it actually know versus what is it pattern-completing? How would you test whether an AI answer is current versus stale? Your partner will give you specific scenarios and challenge your verification strategies.

Come with your best strategy for detecting AI knowledge gaps. Your partner will argue back.

Start by telling your partner what approach you'd use to test whether an AI answer about a current event is real knowledge or a confident guess. Then defend your method when challenged.

Fact-Check Partner

Alright, you're the fact-checker. I'm going to give you scenarios where an AI has produced answers that might be current knowledge or might be confident guesses. Your job is to tell me how you'd verify which it is — and I'm going to push back on your methods. What's your first move when you suspect an AI answer might be stale or fabricated?

Lesson 3 · Matching the Tool to the Task

The Right Wrench for the Right Bolt

A practical decision framework for choosing between AI systems before you trust one with something that matters

If someone handed you five different tools and one broken bolt, how would you figure out which tool to try first?

In May 2023, a team of researchers at MIT and Harvard published a study in the journal Science examining how professionals in different fields were using AI tools. They surveyed lawyers, doctors, software engineers, and educators. The finding that got the most attention: the professionals who reported the highest satisfaction and fewest errors were not the ones who used AI most frequently. They were the ones who had developed an explicit mental model of which AI tool to use for which type of task — and who stopped using the AI when the task fell outside the tool's reliable zone.

One doctor described her approach: she used a general language model for drafting patient communication letters, where fluency mattered more than precision. She used a specialist medical AI for reviewing drug interaction databases, where accuracy was paramount. She never mixed them up. A lawyer in the same study described using search-augmented AI to find recent case law — but always verified every citation manually before including it in any filing, having read about what happened to Steven Schwartz months earlier.

The researchers called this tool-task matching — the practice of consciously pairing the type of task you have with the AI architecture that was built for it. They found that people who did this intuitively or by habit made dramatically fewer consequential errors than people who used whatever AI was most convenient. The most convenient tool is not always the right one. Sometimes it's the most dangerous one.

A Decision Framework — Four Questions Before You Use Any AI

After studying how experts use AI effectively, we can distill the decision process into four questions you should ask before relying on any AI system for anything that matters.

Question 1: Does this task require current information? If yes, you need a search-augmented system or a live database — not a static language model. The cutoff problem will bite you. Examples: stock prices, current events, recent scientific findings, whether a business is still open.

Question 2: Does this task require precision over fluency? If yes, a language model is probably the wrong primary tool. Language models are optimized to sound good. Tasks that require exactness — legal definitions, drug dosages, mathematical proofs, code that must actually run — need either a specialist model, a verified database, or a human expert in the loop. A language model can help you understand a legal concept. It should not be the primary source for a specific statute's exact wording.

Question 3: Does this task require creative generation or creative variation? If yes, a language model or image generator is probably exactly right. Brainstorming, drafting, summarizing, explaining in simpler terms, creating images for mood boards, exploring ideas — these play to the core strengths of generative AI. Low risk of consequential error. High usefulness.

Question 4: Is this a classification or pattern-recognition task? If yes, a specialized narrow AI is likely your best option — if one exists for your domain. Spam filtering, anomaly detection in financial data, identifying objects in images, medical imaging analysis — narrow classifiers trained specifically on these tasks outperform general models significantly.
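The four questions above can be written down as a small checklist in code. This is only a sketch for illustration: the `Task` fields, the `recommend_tools` function, and the wording of each suggestion are hypothetical names chosen here, not part of any real product or library. It assumes you have already answered each question yes or no for your task.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Yes/no answers to the four framework questions (hypothetical fields)."""
    description: str
    needs_current_info: bool      # Question 1
    needs_precision: bool         # Question 2
    is_creative: bool             # Question 3
    is_pattern_recognition: bool  # Question 4


def recommend_tools(task: Task) -> list[str]:
    """Return the tool categories the four questions point to, in question order."""
    suggestions = []
    if task.needs_current_info:
        suggestions.append("search-augmented system or live database")
    if task.needs_precision:
        suggestions.append("specialist model, verified database, or human expert")
    if task.is_creative:
        suggestions.append("language model or image generator")
    if task.is_pattern_recognition:
        suggestions.append("narrow classifier built for the domain, if one exists")
    if not suggestions:
        suggestions.append("no question triggered; re-examine the task")
    return suggestions


# Two example tasks (illustrative answers, not measured data).
population = Task("current population of a city", True, True, False, False)
plot_twists = Task("brainstorm plot twists", False, False, True, False)

print(recommend_tools(population))
print(recommend_tools(plot_twists))
```

A task can trigger more than one question, which is why the function returns a list rather than a single answer. The city-population task triggers both Question 1 and Question 2, while the brainstorming task triggers only Question 3.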

Tool-task matching The deliberate practice of identifying what type of task you have before choosing an AI tool, rather than using whatever is most familiar or convenient.

Ages 8–11 Anchor

Four questions sounds like a lot. Here's the short version: Ask yourself, "Does this answer need to be exactly right, or just pretty good?" If it needs to be exactly right and current, don't use a basic chatbot. If it needs to be creative and interesting, that's exactly what chatbots are best at.

Case Study: How GitHub Copilot Changed Software Development — and What It Didn't Fix

In June 2021, GitHub launched Copilot in technical preview: an AI code assistant trained on billions of lines of publicly available code. It became one of the fastest-adopted professional AI tools in history. In a controlled experiment that GitHub published in 2022, developers using Copilot completed a set coding task about 55% faster than developers working without it. This is a specialist AI working in its exact domain, and the results were dramatic.

But researchers at New York University published a study in 2022 showing that code generated by Copilot contained security vulnerabilities about 40% of the time in their test scenarios. The AI was optimized to produce code that works, not code that's secure. Writing functional code and writing secure code require different training objectives. Copilot was excellent at one. It was not trained for the other.

This is the nuance that the four-question framework helps you catch. Copilot passes Question 4 — it's a specialist model for a specific domain. But it fails a version of Question 2 — if your definition of "precise" includes "not hackable," then Copilot's output requires additional security review. The right tool for generating code is not necessarily the right tool for auditing whether that code is safe. Two tasks, two different tools, within the same overall project.
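To make the gap between "works" and "secure" concrete, here is a small illustrative sketch. It is not taken from the study or from Copilot's output. The table, the function names, and the attack string are all hypothetical, and it uses Python's built-in `sqlite3` module with an in-memory database. Both functions return the right answer for normal input, but only one of them survives a hostile input.

```python
import sqlite3

# Hypothetical in-memory database with two users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "alice-secret"), ("bob", "bob-secret")],
)


def find_user_unsafe(name: str) -> list:
    # Works for normal input, but pastes user text straight into the SQL string.
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(name: str) -> list:
    # Parameterized query: the input is passed as data, never as SQL.
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()


normal = "alice"
attack = "nobody' OR '1'='1"  # hypothetical malicious input

print(find_user_unsafe(normal))       # [('alice', 'alice-secret')]
print(find_user_safe(normal))         # [('alice', 'alice-secret')]
print(len(find_user_unsafe(attack)))  # 2 (every row leaks)
print(len(find_user_safe(attack)))    # 0 (no rows match)
```

A test that only checks normal input would pass both functions. That is why generating code and reviewing it for security are treated here as two separate tasks.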

You can now see what most people miss

Most people think "use AI" is a single decision. You now know it's at least four decisions — and that making the wrong one in a professional context has real consequences. When a developer gets their code from Copilot and ships it without a security review, that's not AI being bad. That's a human making an incorrect tool-task match. The blame is split — but so is the fix.

The Ethical Wrinkle: When No Tool Is Good Enough

The four-question framework helps you choose between available AI tools. But there's a fifth scenario it doesn't cover: what do you do when no AI tool is good enough for the stakes involved?

In 2022, a company called DoNotPay marketed itself as "the world's first robot lawyer." It offered AI-generated legal advice for a monthly fee, claiming it could help users with everything from contesting parking tickets to writing legal letters. In early 2023, the company's founder, Joshua Browder, announced plans to have the AI argue a case in a real US court using audio prompts delivered via an earpiece. The plan drew immediate backlash from bar associations and was cancelled. State bar associations argued that practicing law requires a licensed human, regardless of how capable the AI might be, because the accountability structure (the ability to sanction someone and hold them responsible) requires a human in the loop.

Here's the ethical tension: if an AI can help someone who cannot afford a lawyer navigate a legal problem, and the only alternative is no help at all, is it ethical to restrict AI legal assistance? On the other hand, if the AI gives wrong advice in a high-stakes case, who is responsible? The user who trusted it? The company that marketed it? A company that has no legal license and cannot be sanctioned the way a lawyer can?

Tool-task matching only works if there's a tool good enough to match to. Sometimes, for the most consequential decisions, the honest answer is: the right tool is still a human expert, and building AI that makes you think otherwise may cause more harm than help. That question doesn't have a settled answer. It will be debated for the next decade. You're the generation that will decide it.

Quiz — Lesson 3

5 questions · Apply the framework to new scenarios

1. A student needs to find out the current population of a specific city for a geography project due tomorrow. Based on the four-question framework, which AI tool should they start with?

Correct. This task requires current information — Question 1 of the framework triggers immediately. Population data changes. A language model with a training cutoff will give you a number that may be years out of date, with no way to know it's stale. Search-augmented AI fetches current census data or recent reports.

The key word is "current." Population data changes, so the task requires access to recent sources. A static language model can't provide that. Apply Question 1 of the four-question framework: does this task require current information?

2. A novelist is stuck on how to describe a scene in 1920s Paris. They want varied, evocative language options to choose from. Which AI type is the BEST match?

Correct. This task triggers Question 3: creative generation and variation. The novelist isn't asking for a fact — they want language options, tone, texture, variety. Language models are specifically excellent at this. The "hallucination risk" that makes LLMs dangerous for factual tasks becomes an asset when you want creative variety. Right tool, right task.

Think about what the novelist actually needs: not a verified fact, but creative language options. Which AI type was specifically built to generate varied, fluent text? Apply Question 3 of the framework.

3. GitHub Copilot generates functional code 55% faster — but introduces security vulnerabilities about 40% of the time in research tests. This best illustrates:

Correct. Copilot was trained to produce functional code — it does that well. It was not trained to audit security. Those are different optimization targets. The lesson is that within a single project, different tasks may require different tools or different verification layers. Using Copilot for generation and a separate security scanner for review is better tool-task matching than relying on Copilot alone.

Copilot is well-designed for what it was optimized to do. The issue is that "working code" and "secure code" are actually different objectives requiring different training. This is a tool-task matching problem at a subtle level — not a design flaw in Copilot.

4. DoNotPay marketed AI legal advice to people who couldn't afford lawyers. Some legal scholars argued this was dangerous; others argued it was the only realistic option for millions of people. What is the core ethical tension here?

Correct. This is the genuine tension: for someone with no access to legal help, AI advice — even imperfect — might be better than nothing. But if that AI gives harmful advice in a high-stakes case, there's no accountable human to sanction. Both sides of this argument are serious. That's what makes it a real ethical question rather than an easy one.

The ethical question isn't about whether AI is technically capable — it's about access, accountability, and harm. What happens to people who have no alternative? And what happens when the AI is wrong? Both of those consequences are real. Review the section on the DoNotPay case.

5. You're advising a school district on which AI tool to use for a new homework-help chatbot. You've been told the system needs to help students understand math concepts and also answer questions about current events for social studies class. Based on the framework, what would you recommend?

Correct. These are two different task types. Math concept explanation is a creative/explanatory task where a language model excels and current data isn't needed. Current events questions require live retrieval. One tool doesn't optimally serve both tasks — the best solution either uses a model that combines both capabilities, or uses different tools for different subject areas. This is tool-task matching applied at the system design level.

Think about what each subject requires. Math concepts: explanation and variation, no current data needed. Current events: live information, source currency matters. Are those the same task type? Apply the four questions to each subject separately.

Lab 3 — The Tool Selector

Role: AI Systems Designer · Build the right toolkit for a real organization

Your Assignment

A local hospital has asked you to recommend which AI systems they should use for three specific tasks: (1) answering patient questions about appointment scheduling, (2) helping doctors review drug interaction databases before prescribing, and (3) drafting patient education materials explaining a diagnosis in plain language. Your partner will challenge your recommendations and force you to justify each one using the four-question framework.

Come prepared with your three recommendations. Be ready to explain which question in the framework each recommendation answers.

Start by giving your recommendation for task 1 (appointment scheduling chatbot) and explain which AI type you'd use and why — then your partner will push back.

Hospital AI Consultant

Lab 3

I'm representing the hospital board. We have three AI deployment decisions to make, and I need you to justify each one using a real framework — not just a gut feeling. Start with task one: we want an AI system to answer patient questions about appointment scheduling, waiting times, and which clinic to call. What do you recommend and why? I'm going to challenge every assumption you make.

Lesson 4 · Reading AI Claims in the Wild

How to Catch a Bluff

The skills you've built in this module applied to the messy, marketing-heavy reality of AI in the real world

When a company says their AI is "intelligent," "accurate," or "trusted by professionals" — what should you actually check?

In February 2024, a Canadian tribunal ordered Air Canada to honor a bereavement discount that its AI chatbot had promised a customer. The chatbot had told the customer, Jake Moffatt, that he could book a full-price ticket and apply for the discount retroactively within 90 days. That policy didn't exist. Air Canada's legal defense was extraordinary: the company argued before the tribunal that the chatbot was "a separate legal entity" responsible for its own statements, and that Air Canada therefore wasn't responsible for what it said. The tribunal rejected this argument, ruled that Air Canada was responsible for all representations made by its chatbot, and ordered it to pay Moffatt the difference.

This case became a landmark in AI accountability law. But the more interesting part — for our purposes — is what happened before the tribunal ruling. Air Canada had deployed a customer service chatbot powered by a language model, and nowhere in its interface did it disclose that the chatbot might give incorrect information about company policy, that its answers were not legally binding, or that users should verify anything important with a human agent. The chatbot spoke with confidence. Moffatt trusted it. The company tried to disclaim responsibility after the fact.

By the end of this module, you can decode exactly what went wrong here using the tools you've been building. This final lesson is about applying those tools to the real world — to marketing claims, product descriptions, news headlines, and company announcements — where AI capabilities are routinely overstated and failure modes are carefully omitted.

The Language of AI Marketing — What It Actually Means

Companies selling AI products use specific language that sounds impressive and is technically defensible, but often tells you almost nothing about whether the tool will work for your specific task. Learning to read this language critically is one of the most practical skills you can take from this module.

"Industry-leading accuracy": Accuracy at what, specifically? On what dataset? Compared to what baseline? A spam filter that correctly labels 99% of spam sounds highly accurate, but if that figure was measured on the same known-spam dataset the filter was trained on, it says little about how the filter handles new messages. Accuracy on training data is not the same as accuracy on new, real-world inputs. Always ask: accurate on which task, measured how, and by whom?

"Powered by GPT-4" or "Powered by Claude": This tells you the underlying language model, but tells you almost nothing about how it's been configured, what it's been fine-tuned to do, what guardrails have been added, what the system prompt instructs it to do, or how up-to-date its information is. Two products built on the same base model can behave completely differently and have completely different failure rates for the same task.

"Trusted by 10,000 professionals": Usage is not performance. Lots of people trust tools that fail them regularly; they just don't always realize it, or the failures aren't consequential enough to surface. Trust and reliability are different things.

"AI-powered": This currently means almost nothing. Technically, a spell-checker is AI-powered. So is a recommendation algorithm. The phrase is used so broadly that it is now a marketing term more than a technical one. Ask which type of AI, what it was trained to do, and what it doesn't do.

Benchmark accuracy How well an AI performs on a standardized test dataset. High benchmark accuracy does not guarantee good performance on your specific real-world task, which may be different from the benchmark conditions.
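A short worked example shows why a single accuracy number can hide the failures that matter most. The numbers below are invented for illustration: a hypothetical test set of 1,000 skin images, 10 of them cancerous, and a hypothetical model that catches only one of those 10. No real product or study is being described.

```python
# Hypothetical test set: 1,000 skin images, 10 of which are actually cancerous.
labels = [1] * 10 + [0] * 990  # 1 = cancer, 0 = benign

# Hypothetical model: flags one real cancer, calls everything else benign.
predictions = [1] + [0] * 9 + [0] * 990

# Accuracy: share of all images labeled correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: share of the real cancers that the model actually caught.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy: {accuracy:.1%}")         # accuracy: 99.1%
print(f"recall on cancers: {recall:.1%}")  # recall on cancers: 10.0%
```

The model is right on 991 of 1,000 images, so a brochure could honestly say "99% accurate," yet it misses 9 of the 10 cancers. Asking "accurate at what, measured how" is what brings the second number to light.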

Ages 8–11 Anchor

If a cereal box says "part of a healthy breakfast," it doesn't mean the cereal itself is healthy — it means it can be part of one if you add fruit, milk, and protein. AI companies do something similar: they describe what the AI can do in the best case, not what it will do in your specific situation. Read the fine print, or at least read skeptically.

Five Questions to Ask Before You Trust Any AI Tool

Everything in this module comes down to five questions that you can apply immediately, every time you encounter a new AI tool or read a claim about one.

1. What type of AI is this? Is it a language model, a search-augmented system, an image generator, a specialist model, or a narrow classifier? If the company won't tell you, treat it with more skepticism, not less.

2. What was it specifically trained to do? Not what the marketing says — what was the optimization target? What task did the training data and training process actually prepare it to perform? A customer service chatbot trained on a company's FAQ database is not a general knowledge system.

3. What are its known failure modes? Every AI has characteristic ways it fails. Language models hallucinate. Search-augmented AI is only as good as its sources. Narrow classifiers inherit training data bias. If a company or product doesn't disclose failure modes, look for independent research or user reports.

4. What are the stakes if it's wrong? Low-stakes task (brainstorming, drafting a first-pass email) — the failure mode may not matter much. High-stakes task (medical information, legal advice, financial decisions, safety systems) — the failure mode matters enormously, and you need additional verification regardless of which AI you're using.

5. Who is responsible if it's wrong? The Air Canada case established an important precedent: a company is responsible for what its AI says to customers. But this is still being litigated worldwide. In many contexts, "I asked the AI" is not a defense. Know that the responsibility for verifying AI output in high-stakes situations falls on you, until law and regulation say otherwise.
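One way to apply the five questions is to treat them as a checklist and note which ones a product's public information leaves blank. The sketch below assumes that approach. The key names, the `audit_claim` function, and the example product are all hypothetical, made up here for illustration.

```python
# Short keys standing in for the five questions (hypothetical names).
FIVE_QUESTIONS = [
    "type_of_ai",
    "training_objective",
    "known_failure_modes",
    "stakes_if_wrong",
    "who_is_responsible",
]


def audit_claim(answers: dict[str, str]) -> list[str]:
    """Return the questions that the available information leaves unanswered."""
    return [q for q in FIVE_QUESTIONS if not answers.get(q, "").strip()]


# Hypothetical product page that says only "AI-powered, trusted by professionals".
# The auditor could fill in the stakes, but nothing else.
product_answers = {
    "type_of_ai": "",
    "stakes_if_wrong": "high: used for medical information",
}

missing = audit_claim(product_answers)
print(missing)
```

Here four of the five questions come back unanswered, and the one that is answered says the stakes are high. That combination is the signal to look for independent research before relying on the tool.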

You can now see what most people miss

Most people interact with AI as users who are meant to be impressed. You are now equipped to interact with AI as someone who understands the architecture underneath. The five questions above are not just consumer protection tools — they're the same questions that AI auditors, regulators, and product designers use professionally. You're operating at that level now.

Closing: The Question This Whole Module Built Toward

Go back to the Air Canada case. Using what you now know: what type of AI was the chatbot likely using? What was its probable optimization target? What was its known failure mode in that context? Who should have been responsible for disclosing that the chatbot could be wrong about company policy?

The chatbot was most likely a language model fine-tuned for customer service: trained to be helpful and conversational, optimized to give confident-sounding responses to common questions. Its failure mode: language models produce plausible-sounding text without an internal fact-checker, which means it could generate a plausible-sounding policy that simply didn't exist. Air Canada should have disclosed before the conversation that chatbot responses were not legally binding, rather than fighting accountability after it.

Here's the ethical question you're left with: Should AI systems deployed in customer-facing roles be legally required to disclose their type and limitations before a customer relies on them for a consequential decision? The EU AI Act is moving in that direction. The US is still debating it. You, the generation that will grow up with these tools, will be the citizens, voters, employees, and eventually lawmakers who decide how much disclosure is required, what it looks like, and who enforces it.

Understanding which AI is which is not just a technical skill. It is the foundation of informed participation in a society where AI systems are making or influencing decisions about your health, your education, your finances, and your rights. That's not a future problem. It's the present one. And you now have language for it.

Quiz — Lesson 4

5 questions · Apply critical reading to real AI claims

1. Air Canada argued before the tribunal that its chatbot was "a separate legal entity" and that the company wasn't responsible for what it said. The tribunal rejected this. What principle did the tribunal establish?

Correct. The tribunal ruled that Air Canada owned its chatbot and was accountable for what it said — you can't deploy an AI to make representations to customers and then disclaim those representations when they turn out to be wrong. This has significant implications for how companies must test, disclose, and govern AI customer-service tools.

Air Canada tried to disclaim responsibility by calling the chatbot a separate entity — the tribunal said no. The company deployed the tool, so it owns the tool's statements. Review the Air Canada case and what the ruling established.

2. A company markets an AI tool as having "99% accuracy" for diagnosing skin conditions. What is the most important follow-up question before you trust this claim?

Correct. "99% accuracy" is almost meaningless without knowing what it's accurate on. A model that classifies benign moles correctly 99% of the time might still miss rare cancers at a high rate. Accuracy figures are only meaningful in the context of: which specific task, measured how, on what dataset, with what distribution of cases. Training data performance does not predict real-world performance on edge cases.

The critical questions about accuracy claims are about scope and conditions: accurate at what, in what test, on which data? A high percentage sounds good — but 99% accurate on common cases tells you nothing about rare cases, which may be exactly the cases where getting it right matters most.

3. Two products are both advertised as "Powered by GPT-4." You're trying to decide which to use for researching current news. What does "Powered by GPT-4" tell you about which to choose?

Correct. The base model is a starting point — not a complete description of the product. One GPT-4-powered tool might have live web retrieval enabled; the other might be a pure language model with no access to current information. One might be fine-tuned for your use case; the other might be a generic wrapper. "Powered by [model name]" tells you roughly what's under the hood — not how the engine has been built around it.

Sharing the same base model doesn't mean the products work the same way. Configuration, fine-tuning, web access, and system prompts can make two products built on the same model behave very differently. The base model label is just the starting point.

4. You're a student using an AI writing assistant to help draft a research report. You need to include a correctly cited academic source. Based on everything in this module, what's the safest workflow?

Correct. This is the responsible workflow. AI — even search-augmented AI — can hallucinate or misrepresent citations. The AI can be genuinely useful for identifying the landscape of relevant research. But the verification step — confirming the paper exists, confirming the AI's characterization is accurate — is a human task. Citation managers like Zotero or databases like Google Scholar provide verified source metadata.

Search-augmented AI is better at citations than pure language models, but it's not immune to errors. And language models can hallucinate entirely believable citations. The right workflow uses AI as a discovery tool, not as the final citation authority. Always verify independently.

5. A news headline reads: "New AI Beats Human Doctors at Diagnosing Lung Cancer." Based on your critical reading skills, what should you immediately want to know before accepting this as meaningful?

Correct. "Beats human doctors" is a headline designed to impress. The meaningful questions are: beats them at what specific sub-task? In what test conditions? With what patient population? Compared to which doctors (specialists vs. general practitioners)? On what kind of images? Laboratory conditions often produce impressive benchmark results that don't translate to noisy, time-pressured, edge-case-heavy real clinical environments. The headline tells you something interesting happened — your five questions tell you whether it matters for the real world.

Company credibility and geography don't tell you whether the result is meaningful. The critical questions are about the specific conditions of the test — what was measured, on what data, compared to what baseline, and whether those conditions match real-world deployment. Apply the framework you've built throughout this module.

Lab 4 — The AI Claims Auditor

Role: Critical Analyst · Read an AI product's marketing claims and find what's missing

Your Assignment

You're an independent AI auditor. Your partner is going to present you with marketing claims from fictional (but realistic) AI products. Your job is to apply the five critical questions from Lesson 4: What type of AI? What was it trained for? What are the failure modes? What are the stakes if wrong? Who is responsible? You need to identify what the marketing is hiding or omitting — not just repeat what sounds good.

Your partner will give you a claim and then push back on your analysis until you've built a complete picture.

Tell your partner you're ready for the first product claim, then analyze it using the five questions from Lesson 4. Be specific about what information is missing and why that matters.

AI Claims Auditor Partner

Lab 4

Ready when you are. I've got three product claims queued up — each one is based on the kind of language real AI companies actually use. When you're ready, I'll give you the first one. Your job is to tell me what's missing, what's misleading, and what a user would need to know before trusting this product with something that matters. I'll push back hard on vague analysis — I want specifics.

Module Test — Pick the Right AI for the Job

15 questions · Pass at 80% or above to complete the module

1. A language model is primarily optimized to:

Correct. Language models are trained on next-token prediction at massive scale. Everything impressive they do emerges from that core task — which also explains their primary failure mode: plausible is not the same as accurate.

Language models are trained on next-token prediction — producing the most statistically plausible continuation of a text sequence. That's the optimization target, and it's what drives both their strengths and their failure modes.

2. Which event best illustrates the danger of using a language model for high-stakes factual research?

Correct. Schwartz's case is the canonical example of language model hallucination in a high-stakes factual research context — complete fake citations with complete confidence, in a domain where wrong information had direct legal consequences.

The Schwartz case specifically illustrates language model hallucination in factual research — invented citations that looked completely real, submitted in a legal context. The other cases involve different AI types and different failure modes.

3. What is a training cutoff?

Correct. Training cutoffs are one of the most practically important facts to know about any language model. They define the boundary of what the model can reliably know — and the model often doesn't indicate when it's been asked something beyond that boundary.

A training cutoff is a temporal boundary — the date after which the model has no training data. It means the model is, in a sense, frozen at a specific moment in time, regardless of when you're actually using it.

4. Google's AI Overview suggested adding glue to pizza in 2024. This was a failure of:

Correct. This is the "garbage in, garbage out" failure mode of search-augmented AI. The system fetched a real document — a satirical Reddit post — and summarized it as genuine advice. Grounding answers in sources only helps when those sources are reliable.

Google's AI Overview is a search-augmented system — it fetches real documents. The failure here was source quality: it found a satirical post and treated it as factual. Review the section on search-augmented AI failure modes.

5. A 2019 study at UC Berkeley found that a healthcare AI consistently underestimated the medical needs of Black patients. The most accurate explanation is:

Correct. This is bias inherited from training data — one of the most important and pervasive failure modes of narrow classifiers. The AI wasn't trying to discriminate; it was doing what classifiers do: find patterns in historical data and apply them going forward. When the historical data reflects systemic inequity, the classifier amplifies it.

No intentional discrimination was programmed in. The system learned from historical data that reflected real-world systemic inequities — and then applied those patterns to future cases. Review the section on narrow classifiers and training data bias.

6. Question 1 of the four-question tool-task framework asks: "Does this task require current information?" If the answer is YES, you should:

Correct. Current information requires live retrieval. A static language model, regardless of release date, has a training cutoff that makes it unreliable for recent events. Search-augmented AI can access current documents; that's its specific advantage over static models.

Release date is different from training cutoff — and neither fixes the fact that a static model can't access new information. Tasks requiring current data need a system with live retrieval capability.

7. GitHub Copilot generates working code 55% faster — but introduces security vulnerabilities about 40% of the time. The right conclusion is:

Correct. Copilot was optimized for generating functional code — it does that excellently. Security review is a different optimization target. The appropriate response is to use Copilot for what it's great at and add a separate security review step. Tool-task matching within a complex workflow means matching different tools to different sub-tasks.

Copilot is well-designed for its intended purpose. The issue is that code generation and code security are actually two different tasks requiring different optimization targets. The solution is layered tool use — not abandoning the tool entirely.

8. The term "hallucination" in AI specifically refers to:

Correct. Hallucination is a specific failure mode of generative AI — not a random malfunction but a systematic byproduct of how language models work. They generate the most statistically plausible continuation of a pattern. Sometimes that pattern leads to wrong answers delivered with perfect confidence.

Hallucination is not random nonsense or a typo — it's a structurally coherent wrong answer. The model generates something that looks exactly like a right answer because it matches the pattern of what correct answers look like in its training data. That confidence is what makes it dangerous.

9. You see the phrase "AI-powered" in a product description. Based on Lesson 4, the most appropriate response is:

Correct. "AI-powered" is used so broadly — covering everything from basic spam filters to large language models — that it no longer communicates meaningful technical information. It's a marketing signal, not a specification. Always ask for the specific type, training objective, and known limitations.

"AI-powered" doesn't indicate trust or distrust — it indicates that you need more information. A spell-checker is technically AI-powered. So is TikTok's recommendation algorithm. The label tells you almost nothing about capability, reliability, or appropriate use. Ask the five questions.

10. A student is writing a creative short story and needs help brainstorming interesting plot twists. Based on the four-question framework, which AI tool is BEST suited to this task?

Correct. Question 3 of the framework: does this task require creative generation or variation? Yes. Language models are specifically excellent at producing varied, creative, fluent text options. The "hallucination risk" that makes them unreliable for facts becomes an asset when you want creative invention. Right tool, right task.

This is a creative generation task — it needs varied, imaginative ideas, not verified current facts. Apply Question 3 of the framework. Language models are built for exactly this. Their tendency to generate plausible-but-not-necessarily-true content is a feature here, not a bug.

11. The Air Canada chatbot case established an important legal precedent. Which of the following correctly summarizes it?

Correct. The tribunal established that deploying a chatbot means owning its statements. Air Canada's attempt to frame the chatbot as a separate entity responsible for its own statements was rejected. This has broad implications for how companies are accountable for customer-facing AI systems.

The core ruling was about company accountability for AI outputs. Air Canada deployed the chatbot; therefore Air Canada is responsible for what it says. Review the Air Canada case and the tribunal's reasoning.

12. Microsoft's Bing AI insisted it was 2022 when it was actually 2023. Applying what you know, this happened because:

Correct. This is the training cutoff problem made vivid. The model had no information past a certain date and no internal mechanism for knowing that time had passed since then. The "wrong year" wasn't a bug — it was the model doing what it always does: generating the most plausible text, which happened to be wrong.

The cause was not an internal clock, a deliberate test, or a deployment error. It is structural: the model's knowledge ended at its training cutoff, and it had no way to know that time had passed since then. The training cutoff problem shows up in questions as ordinary as "What year is it?" Review Lesson 2.
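For readers who like to see ideas in code, the sketch below imitates the training cutoff problem with a toy stand-in. It is an illustration under stated assumptions, not a description of how Bing or any real system works: the document dates and the function `guess_current_year` are hypothetical.

```python
from datetime import date

# Hypothetical training snapshot: dates of the documents the model saw.
TRAINING_DOC_DATES = [date(2021, 3, 14), date(2022, 1, 9), date(2022, 10, 30)]

def guess_current_year(doc_dates, supplied_today=None):
    """Toy stand-in for a frozen model answering 'What year is it?'."""
    if supplied_today is not None:
        # Today's date was passed in from outside, so the answer can be right.
        return supplied_today.year
    # Nothing supplied: the most plausible guess is the latest year it saw.
    return max(d.year for d in doc_dates)

print(guess_current_year(TRAINING_DOC_DATES))                     # guess from frozen data
print(guess_current_year(TRAINING_DOC_DATES, date(2023, 2, 15)))  # date supplied from outside
```

The first call answers from frozen data alone and lands on the last year in the snapshot. Only when the date is handed in from outside can the toy model answer correctly, which is why the wrong answer is structural rather than a malfunction.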

13. A narrow AI classifier is deployed in a loan application system. After six months, analysts find it approves loans for one demographic group at twice the rate of another. The most likely root cause is:

Correct. Historical lending data in many markets reflects decades of discriminatory practices: which groups were denied loans, which neighborhoods were redlined, which income patterns were treated as reliable signals. A classifier trained on this data doesn't "know" about those practices. It just learns that certain patterns correlate with loan repayment in the historical record, and it reproduces the historical discrimination as if it were a neutral finding.

Classifier bias usually isn't intentional — it's inherited. If historical data reflects discriminatory practices, the classifier learns those practices as predictive patterns. This is why training data auditing is a critical step in deploying any classifier in high-stakes domains.
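One of the simplest checks in a training data audit is to compare approval rates across groups before any model is trained. The sketch below is a minimal illustration, assuming each historical record has a group label and an approved-or-denied outcome. The records, group names, and function names are invented for the example, and a real audit would look at far more than one ratio.

```python
from collections import defaultdict

# Hypothetical historical records: (group label, was the loan approved?)
HISTORICAL_DECISIONS = (
    [("A", True)] * 6 + [("A", False)] * 4 +
    [("B", True)] * 3 + [("B", False)] * 7
)

def approval_rates(decisions):
    """Return the fraction of approved applications for each group."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for group, was_approved in decisions:
        total[group] += 1
        if was_approved:
            approved[group] += 1
    return {group: approved[group] / total[group] for group in total}

def rate_ratio(rates):
    """Ratio of the highest group approval rate to the lowest."""
    lowest = min(rates.values())
    if lowest == 0:
        return float("inf")  # one group was never approved
    return max(rates.values()) / lowest

rates = approval_rates(HISTORICAL_DECISIONS)
print(rates)              # approval rate per group
print(rate_ratio(rates))  # how many times higher the top rate is
```

A large ratio does not prove discrimination on its own, but it is a signal that a classifier trained on this data may learn the gap as if it were a predictive pattern.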

14. Which of these tasks is BEST matched to a specialist AI model rather than a general language model?

Correct. Medical image analysis is a narrow, high-precision task where a specialist model trained on millions of labeled chest X-rays dramatically outperforms a general language model, which was not built or trained for medical imaging. The other tasks call for fluency, creativity, or summarization, which are core language model strengths.

Think about which task requires very narrow precision in a specific technical domain vs. which tasks require fluency and general knowledge. Medical image classification is exactly the type of narrow, high-stakes, specific-domain task where specialist models excel and general models fail.

15. You've just learned that the MIT and Harvard study found professionals who make the fewest AI errors are not the heaviest AI users — they're the ones who develop an explicit mental model of tool-task matching. What does this suggest about the best way to use AI effectively?

Correct. The study's finding is that intentionality — deliberate tool selection based on task type — is what separates effective AI users from ones who get burned. Convenience is not the same as fit. The discipline of asking "is this the right tool for this specific task" before using AI is more valuable than raw frequency of use or general skepticism.

Using AI less or more doesn't predict accuracy — the key variable is whether you deliberately match the tool to the task. That means having a framework (like the four questions in this module) and using it every time, rather than defaulting to whatever is most convenient or familiar.