Intro
L1
Β·
Quiz
Β·
Lab
L2
Β·
Quiz
Β·
Lab
L3
Β·
Quiz
Β·
Lab
L4
Β·
Quiz
Β·
Lab
Module Test
The AI That Teaches You Β· Introduction

Teaching Machines Are Here β€” and They Know When You're Confused

Why understanding AI tutors might be the most important skill you build this year

In March 2023, a thirteen-year-old in New Jersey named Aditya tried something his school hadn't officially approved: he opened Khan Academy's new AI tutor, Khanmigo, and asked it to help him understand quadratic equations. He'd been stuck for two weeks. His teacher was stretched across thirty students. Khanmigo didn't give him the answer β€” it asked him questions back. Within forty minutes, Aditya solved a type of problem he hadn't been able to crack all month. He told his mom it felt like having a tutor who had infinite patience and zero judgment. That same week, schools in at least six states were quietly debating whether to ban it.

Here's the thing: tools that feel personal and powerful almost always arrive before anyone has figured out the rules. In 1876, the telephone was demonstrated at a World's Fair and educators immediately argued about whether it would make students lazy. In 1925, radio lessons were broadcast into classrooms and principals complained that children would stop thinking for themselves. AI tutors are the newest version of a very old argument β€” but this time the technology actually does adapt to you, remember your mistakes, and change what it says next based on how you respond. That is genuinely new.

This course is about how those systems actually work β€” not the marketing version, not the panic version. You'll learn what an AI tutor is doing under the surface when it responds to you, why it sometimes gets things wrong in very specific ways, and what you should actually be skeptical about. By the end, you'll understand something that most adults in your school β€” including most teachers β€” don't yet know how to articulate. That's not a boast. It's just where we are.

The AI That Teaches You Β· Lesson 1

Meet Khanmigo: More Than a Chatbot

What makes an AI tutor different from a search engine β€” and what that difference actually costs
If an AI can teach you anything, why does it sometimes refuse to give you the answer?

The date was March 14, 2023 β€” Pi Day, fittingly β€” when Sal Khan, the founder of Khan Academy, stood on a stage in San Francisco and demonstrated something he'd been building in secret for almost a year. He typed a question into a chat window. The response didn't give him an answer. Instead, it asked: "Before I help, can you tell me what you've already tried?" The audience went quiet. That single design decision β€” refusing to hand over the answer β€” was, Khan argued, the whole point. He wasn't showing off a smarter search engine. He was showing something that behaved more like a tutor than a tool.

Within 48 hours, the clip had been viewed over two million times. Educators were divided almost immediately. Some saw a breakthrough β€” a patient, infinitely available guide for students whose schools couldn't afford human tutors. Others saw something more troubling: a machine that could sound wise while being completely wrong, and that students might trust more than it deserved. Both reactions, it turns out, were correct.

What Khanmigo Actually Is

Khanmigo is Khan Academy's AI tutor, built on top of GPT-4 β€” the same language model that powers ChatGPT β€” but with a very specific set of instructions layered on top. Those instructions tell it to behave like a tutor, not a search engine. The difference matters enormously.

A search engine retrieves information. You type "quadratic formula," it finds pages that contain those words, and it returns links. A language model like GPT-4 does something stranger: it predicts what word should come next in a sentence, over and over again, until it has generated a response that sounds like something a knowledgeable person would say. It has read an enormous amount of human writing β€” textbooks, Wikipedia articles, Reddit threads, academic papers β€” and learned the patterns of how people explain things.

Language model An AI system trained to predict text. It learns patterns from billions of words and uses those patterns to generate responses that sound coherent.

The critical point: Khanmigo doesn't retrieve stored answers. It generates them fresh each time. Which means it can explain things in new ways, respond to your specific confusion, and even ask you follow-up questions β€” but it can also generate a confident-sounding wrong answer, because it's optimizing for sounding right, not for being right.

What makes Khanmigo distinct from raw ChatGPT is its system prompt β€” a hidden set of instructions that tell the AI to ask students questions instead of just answering them, to encourage rather than judge, and to connect its explanations to Khan Academy's existing curriculum. You never see that prompt. But every response you get is shaped by it.

Worth Knowing

Khanmigo launched to students in March 2023 with a waitlist. By September 2023, Khan Academy had made it free for all U.S. teachers. By early 2024, it was being used in over 40 countries. The speed of that rollout is part of why the questions it raises haven't been fully answered yet.

The Socratic Method β€” and Why It's Harder Than It Looks

When Sal Khan demonstrated Khanmigo asking "what have you already tried?" he was invoking something that has been at the center of education for over 2,000 years: the Socratic method. Socrates, the ancient Greek philosopher, taught by asking questions instead of lecturing. His goal wasn't to transfer information β€” it was to help students discover knowledge by working through contradictions in their own thinking.

Socratic method A teaching technique that uses guided questions to help learners reach understanding themselves, rather than being told the answer directly.

The Socratic method is considered one of the most effective ways to teach. Studies going back to the 1980s β€” including landmark research by Benjamin Bloom in 1984 β€” showed that students who received one-on-one tutoring with this kind of back-and-forth performed two standard deviations better than students in regular classrooms. That's roughly the difference between a C student and an A student. Bloom called this "the 2 Sigma problem": we know one-on-one tutoring works dramatically better, but we can't afford to give every student a human tutor.

Khanmigo's pitch is that it can be that tutor for everyone. But here's what gets complicated: doing the Socratic method well requires understanding not just the subject matter, but the specific student's misconception. A skilled human tutor can tell from a student's hesitation, their word choice, their facial expression, that they're confused about a specific thing. Khanmigo reads only text. It can ask follow-up questions, but it can also ask the wrong follow-up question β€” one that sounds pedagogically smart but actually steers the student further from understanding.

The Tension

Khanmigo is designed to never give away the answer. But what if the student has been stuck for an hour, is getting frustrated, and genuinely needs to see a worked example before the questioning approach will help? A human tutor reads that situation. Khanmigo follows its instructions. Is following a good rule always the same as good teaching?

Not a Chatbot β€” But Not a Teacher Either

People reach for two familiar categories when they encounter Khanmigo: chatbot or teacher. Neither fits cleanly, and knowing why is what separates an informed user from someone who's just along for the ride.

A traditional chatbot β€” think of the little popup windows on airline websites that ask "How can I help you today?" β€” follows rigid scripts. If you say something the script doesn't expect, it breaks. Khanmigo doesn't do that. It can handle unexpected questions, shift topics mid-conversation, and generate novel explanations. In that sense it's clearly not a simple chatbot.

But it's also not a teacher. A teacher has professional training, legal accountability, and a relationship with you that extends across time. A teacher can notice that you seem distracted and ask if something is wrong at home. A teacher can be held responsible if they teach you something incorrect. Khanmigo has none of those properties. It has no memory of you across sessions (unless the platform specifically stores it), no accountability structure, and β€” as has been demonstrated in multiple documented tests β€” no reliable mechanism for knowing when it's wrong.

The most accurate description is something like: a very sophisticated study partner that has read everything, forgets you constantly, and sometimes makes things up with complete confidence. Once you hold that description in your head, you know how to use it well. You use it for explanation and exploration. You verify important facts elsewhere. You treat its confidence as a style, not as evidence of accuracy.

You Can See What Most People Miss

Most people who use Khanmigo either trust it completely or distrust it completely. Both reactions miss the point. Knowing it's a language model with tutor-shaped instructions means you can be exactly as skeptical as the situation requires β€” trusting its explanations of concepts, questioning its factual claims, and ignoring the false certainty in its tone.

The Ethical Question Nobody Has Answered

In October 2023, the nonprofit Common Sense Media released a report analyzing AI tutoring tools used in American schools. Among their findings: most students had no idea that the AI tutor they were talking to had a hidden system prompt shaping every answer. They believed they were getting the AI's "natural" response. They weren't.

This raises a question that Sal Khan, the researchers at Khan Academy, the policymakers at the U.S. Department of Education, and teachers in thousands of classrooms are actively wrestling with β€” and have not resolved: Should students be told exactly what instructions their AI tutor is operating under?

The argument for transparency: if you're interacting with a system that has been instructed to behave a certain way, you have a right to know what those instructions are. Knowing changes how you interpret what it says. If Khanmigo is designed to never discourage you, you should know that β€” because "you're doing great" means something different coming from a system programmed to encourage than from someone who genuinely thinks you're doing great.

The argument against full transparency: the system prompt is what makes the tool work well. If students know exactly how to get around it β€” exactly what triggers it to stop asking questions and just give answers β€” some will exploit that. The pedagogical design depends on the student not fully seeing the mechanism.

Both arguments are serious. Neither is obviously wrong. You're now aware of a real institutional debate β€” one that affects what happens to students in schools right now, including probably yours. What you do with that awareness is up to you.

Lesson 1 Quiz

Five questions β€” test your understanding, not your memory
1. Khanmigo is built on top of GPT-4 but behaves differently from raw ChatGPT. What is the primary mechanism that creates this difference?
Correct. The system prompt is the layer of hidden instructions that shapes every response Khanmigo gives β€” you never see it, but it controls the AI's behavior entirely.
Not quite. The same GPT-4 model is underneath β€” but the tutor-like behavior comes from a hidden system prompt that instructs it how to respond. The model itself is unchanged.
2. A student has been stuck on a problem for an hour, is visibly frustrated, and needs to see a worked example. Khanmigo keeps asking guiding questions. What does this scenario illustrate?
Correct. A good human tutor reads the room β€” frustration, body language, emotional state. Khanmigo follows its instructions regardless of context. That gap is real and important.
This scenario actually illustrates a subtler problem: even a good rule (ask questions, don't give answers) can be the wrong move in a specific situation. That's a limitation of rule-following systems.
3. Benjamin Bloom's 1984 research on one-on-one tutoring found students performed about two standard deviations better than classroom peers. Why did he call this "the 2 Sigma problem"?
Correct. Sigma refers to standard deviation β€” a statistical measure of how spread out results are. Two sigmas above average is a massive improvement. The "problem" is that we can't afford to give everyone that advantage.
The word "sigma" refers to standard deviation β€” a statistics term. The problem Bloom identified is that we know tutoring works dramatically well, but it's too costly to scale. AI tutors are proposed as the potential solution.
4. Imagine a new AI study tool launches and claims it "always gives accurate information." Based on what you learned about how language models work, what should make you skeptical of that claim?
Correct. This is the core vulnerability of language models: they optimize for producing text that sounds coherent and authoritative, not for verifying facts. Confident tone is not the same as accurate content.
The structural issue is that language models predict plausible-sounding text β€” they don't retrieve verified facts. A system that says it "always gives accurate information" is making a claim the underlying technology can't reliably support.
5. Common Sense Media's 2023 report found that most students didn't know their AI tutor had a hidden system prompt. Why does knowing about the system prompt change how you should interpret the AI's responses?
Correct. Context changes meaning. Encouragement from a system programmed to encourage is not the same as genuine feedback. Knowing the instructions lets you calibrate how much weight to give each type of response.
The instructions shape every response. An AI that's been told to encourage you will encourage you β€” even if your answer is wrong. That's not deception, but it means you need to interpret its warmth with some skepticism.

Lab 1: Interrogate the Tutor

You're an investigator. Your job is to figure out what Khanmigo-style AI tutors actually are β€” and aren't.

Your Role: AI Auditor

You've just read about how Khanmigo works. Now you're going to pressure-test those ideas in conversation with AESOP β€” an AI that knows this course material and will push back on vague thinking.

AESOP is not going to lecture you. It's going to ask what you actually think, challenge you when you're fuzzy, and ask you to back up your claims. You'll need to take a position, not just summarize the lesson.

Start here: Tell AESOP whether you think AI tutors like Khanmigo are a good idea for schools. Don't sit on the fence β€” pick a side and defend it. AESOP will challenge you.
AESOP β€” Lab Assistant
Lab 1 Β· AI Tutor Audit
Alright. You've read about Khanmigo β€” what it is, how it works, what it can't do. Now I want your actual opinion: should schools be using AI tutors like this, or not? Don't give me "it depends." Pick a lane and tell me why. I'll push back.
The AI That Teaches You Β· Lesson 2

How Khanmigo Knows What You Don't Know

The engineering behind personalization β€” and why "personalized" doesn't mean what you think
If an AI has never met you before, how can it possibly know where you're stuck?

In November 2023, a team of researchers at MIT's Teaching Systems Lab published a study that surprised a lot of people in education technology. They had given students a set of algebra problems and let some of them work with Khanmigo while others worked alone. Then they did something clever: they asked both groups to explain their reasoning out loud. What they found was that students who had used Khanmigo were slightly worse at explaining their thinking than students who had worked through the problems without any help. The AI had been asking them questions β€” but somehow the students had learned to navigate the questions without actually building the understanding behind them.

The researchers had a name for what they were seeing: surface compliance. The students were giving the AI the answers it seemed to want, the AI was responding encouragingly, and everyone moved forward β€” but the underlying confusion hadn't been addressed. The AI thought they understood. They didn't. And neither knew it.

What "Personalized Learning" Actually Means

The phrase "personalized learning" is everywhere in education technology marketing. It sounds compelling: education tailored to exactly you, your pace, your gaps, your style. But when you look at how current AI tutors actually implement personalization, the reality is more complicated.

Khanmigo personalizes in a relatively narrow way. Within a single conversation, it reads your responses and adjusts. If you seem confused, it tries a different explanation. If you answer a question correctly, it moves forward. This is real and useful β€” it's called within-session adaptation. But it has significant limits.

Within-session adaptation When an AI adjusts its responses based on what you said in the current conversation β€” as opposed to knowing your history over weeks or months.

First limit: the AI can only read what you type. If you're confused about why a mathematical rule works but you type an answer that happens to be correct, the AI has no way to know you're confused. It moves on. The MIT study caught exactly this happening.

Second limit: most AI tutors, including Khanmigo in its standard form, don't maintain a persistent model of you across sessions. Each time you start a new conversation, it doesn't know you were confused about fractions last Tuesday. A human teacher builds a mental model of each student over months. The AI starts fresh every time.

Third limit: the AI can adjust explanations but can't adjust the underlying curriculum. It's working within whatever lesson structure the platform has defined. If the curriculum's sequencing is wrong for how your specific brain works, the AI can't reorder it.

Worth Knowing

Some newer AI tutoring systems β€” including experimental versions being tested at Carnegie Mellon University in 2023–2024 β€” are starting to build persistent learner models. These systems track what you've done across many sessions and try to predict where your misconceptions are. But this requires storing detailed data about every student interaction, which raises its own questions about privacy.

Misconception Detection: The Hard Problem

The most sophisticated thing a tutor can do β€” human or AI β€” is not teach new material. It's identify and correct a specific misconception that's blocking a student from moving forward. A misconception is not just "not knowing" something. It's having an incorrect mental model that feels correct to you.

Misconception A specific wrong belief that feels right. Not the absence of knowledge, but the presence of incorrect knowledge that blocks learning.

For example: many students think that when you multiply two numbers together, the result is always bigger than both of them. That works perfectly for most cases they've seen β€” until they multiply by a fraction or a negative number. The rule they've built in their head ("multiplication makes things bigger") is wrong, but they don't know it's wrong. A good tutor spots this and addresses it directly. The MIT study showed that Khanmigo often didn't.

Why? Because detecting a misconception requires more than reading a correct answer. It requires asking the right diagnostic question. Human tutors do this intuitively β€” they've seen thousands of students make the same mistakes. Khanmigo generates diagnostic questions by predicting what sounds educationally reasonable, not by drawing on a catalogue of documented student errors. The questions it asks might be good questions. They might not be the right question for your specific wrong belief.

Researchers at Carnegie Learning β€” a company that has been building AI tutoring systems since 1998, long before the current AI boom β€” have spent decades cataloguing the specific misconceptions students form in math. Their system, MATHia, uses that catalogue to choose targeted questions. Khanmigo doesn't have the same depth of misconception-specific training. It's more flexible and better at conversation, but potentially less precise at identifying exactly where your thinking went wrong.

The Tension

Fixing misconceptions requires storing and analyzing data about where students go wrong. The more precise the misconception detection, the more detailed the data the system needs. At what point does building a better tutor require knowing too much about the student?

The Feedback Loop Problem

Here's something most people don't think about when they praise AI tutors: how does the system know if it's working? With a human tutor, you know relatively quickly β€” the student either gets it or they don't, and the tutor adjusts. With Khanmigo, the feedback loop is much weaker.

In a single session, the AI's signal is: did the student give a correct answer? But as the MIT study showed, a correct answer doesn't prove understanding. Across sessions, Khanmigo in its standard form doesn't know what happened in your previous conversations. Khan Academy as a platform can track your exercise scores, but the AI conversation itself isn't tightly coupled to that data in real time.

This means the AI tutor is partly flying blind. It's optimizing for generating responses that seem pedagogically appropriate. Whether those responses are actually producing learning is a question the AI itself cannot answer. That measurement has to come from somewhere outside the system β€” a test, a teacher's assessment, a researcher's study.

This is not a fatal flaw. It's a design gap that the field is actively working on. But it means that the most important measure of an AI tutor's effectiveness β€” does the student actually understand more? β€” is not something the AI can reliably track itself. You can use this knowledge right now. When you work with an AI tutor, don't just answer its questions. Periodically ask yourself: could I explain this to someone else without looking anything up? That's the real test of whether the interaction did anything useful.

Knowing This Changes Things

Most students assume that if the AI seemed satisfied, they must have learned something. Now you know that's not a reliable signal. The AI's satisfaction and your understanding are two different things β€” and only one of them matters when the test comes.

Lesson 2 Quiz

Five questions β€” think it through
1. The MIT Teaching Systems Lab study found that Khanmigo users were sometimes worse at explaining their reasoning than students who worked alone. What term did researchers use to describe this pattern?
Correct. Surface compliance β€” giving the AI the answers it seemed to want without actually building understanding. The AI moved on. The confusion stayed.
The researchers called it surface compliance β€” students learned to navigate the AI's questions without genuinely understanding the underlying material.
2. A student gets every question right in a Khanmigo session on fractions, but when asked to explain fractions to a friend the next day, they can't do it. Which limitation of AI tutoring does this best illustrate?
Correct. Correct answers are a weak signal. The only real test of understanding is whether you can use the knowledge independently β€” which the AI has no way to verify during a session.
This scenario illustrates the feedback loop problem: a correct answer signals success to the AI, but doesn't prove the student actually understands. Surface compliance can look identical to learning from the AI's perspective.
3. What is a "misconception" in the educational sense, and why is it harder to fix than simply "not knowing" something?
Correct. The tricky thing about a misconception is that it feels right to the person who holds it. They're not confused β€” they're confidently wrong. That confidence is exactly what makes it hard to dislodge.
A misconception is a wrong belief that feels right. The challenge isn't absence of knowledge β€” it's the presence of incorrect knowledge that the student trusts. They don't think they need correction.
4. Carnegie Learning's MATHia system was built using decades of catalogued student errors. How does this make it potentially more precise at misconception detection than Khanmigo?
Correct. Specificity is the key difference. Khanmigo generates plausible educational questions. MATHia selects from a library of questions designed for specific error patterns. Both might sound similar, but one is targeted in a way the other isn't.
The difference isn't the AI model or who built it β€” it's specificity. Decades of catalogued errors let MATHia ask the right diagnostic question for a specific misconception, rather than a generally reasonable question.
5. You're using an AI tutor and it seems satisfied with all your answers. What is the most reliable way to actually check whether you understood the material?
Correct. This is the transfer test β€” can you use what you learned in a new context, without the scaffold? If yes, you understood. If no, you learned to navigate the AI's questions, which is a different skill.
The AI's satisfaction is a weak signal. The most reliable check is the transfer test: can you explain this independently, without any help? That's what the lesson called "the real test of whether the interaction did anything useful."

Lab 2: Design the Better Tutor

You're the engineer now. What would you change about how AI tutors detect understanding?

Your Role: AI System Designer

You know the problem: AI tutors can't reliably tell the difference between a student who understands something and a student who has learned to say the right words. Your job is to propose a fix β€” a specific design change that would help an AI tutor detect surface compliance or misconceptions more accurately.

AESOP will ask you to be specific. "Make it smarter" doesn't count. You'll need to describe a concrete mechanism β€” what data the AI collects, what it does with it, or what it asks differently.

Start here: Describe one specific design change you'd make to Khanmigo to help it detect when a student is giving surface-level answers without real understanding. Be concrete β€” AESOP will ask you to defend it.
AESOP β€” Lab Assistant
Lab 2 Β· Tutor Design
You're the designer. I've heard a lot of vague ideas in this space β€” "make it more intuitive," "give it emotional intelligence," that kind of thing. Those aren't real designs. I want something specific: one concrete change to how Khanmigo works that would help it catch surface compliance. What is it, and how exactly would it work?
The AI That Teaches You Β· Lesson 3

When Khanmigo Gets It Wrong

Hallucinations, bias, and the specific ways AI tutors fail β€” with real documented examples
If an AI tutor confidently teaches you something incorrect, who is responsible for what you learned?

In April 2023, researchers at Stanford University's Human-Centered AI Institute released a report documenting something that had been observed informally by teachers for months: AI tutoring tools were generating confident factual errors β€” wrong information delivered in a tone so certain that students had no reason to question it. In one documented case, a high school student using an AI study tool for a history essay was told that the Treaty of Versailles was signed in 1921. It was signed in 1919. The AI didn't hedge, didn't say "I think" β€” it stated the date as fact. The student used it in the essay. The teacher caught it. The student was confused because "the AI told me."

This wasn't a glitch or a bug in the traditional sense. The AI was working exactly as designed β€” predicting text that sounded authoritative. It had encountered enough text where 1921 appeared near "Versailles" (perhaps referring to related diplomatic events that year) that it generated that number with full confidence. The mechanism that makes these systems fluent is the same mechanism that makes them wrong in this specific, hard-to-detect way.

Hallucinations: The Technical Name for Confident Wrongness

The AI research community has a term for what happened with the Treaty date: a hallucination. The word is a bit dramatic, but the definition is precise: a hallucination is when a language model generates text that is factually incorrect but internally coherent β€” meaning it fits the pattern of how correct text looks, so the AI produces it without any signal that something is wrong.

Hallucination When an AI language model generates false information with high confidence, because the false information fits the statistical patterns the model learned β€” not because the model knows it's true.

Hallucinations are a feature, not a bug, of how current language models work. They are the price of fluency. A model that was maximally cautious β€” one that only said things it was certain about β€” would be far less useful as a conversational partner. The engineering tradeoff between fluency and accuracy is one of the central unsolved problems in AI development right now.

Khanmigo is specifically designed to reduce hallucinations by anchoring responses to Khan Academy's curriculum content. But it cannot eliminate them. In a 2023 audit by education researcher Anya Kamenetz, Khanmigo generated incorrect math explanations in roughly 10–15% of tested interactions. That's significantly better than unguided ChatGPT β€” but it still means that roughly one in ten explanations may contain an error. In a tutoring context, where students are trusting the AI to help them understand something they already don't understand, a 10% error rate is not nothing.

Worth Knowing

Hallucinations tend to cluster in certain types of content: specific dates, names, citations, and any area where training data was thin or contradictory. Conceptual explanations (how a process works) tend to be more reliable than factual claims (when something happened, who did it).

Bias in the Tutor

In January 2024, a research team at UCLA's Center for Research on Evaluation, Standards, and Student Testing published findings on something more subtle than hallucinations: patterns in how AI tutors responded differently to students based on cues in their writing. When student prompts contained markers associated with certain demographic groups β€” dialect features, specific cultural references, writing styles β€” the AI tutors in the study gave different levels of detail, different encouragement, and different levels of scaffolding in response.

The researchers were careful about how they described this. The AI wasn't making conscious decisions about which students deserved better help. It was doing what language models always do: pattern-matching against training data. And the training data β€” the enormous volume of human text these models learned from β€” encodes the biases of the humans who wrote it. An AI that learned from that data inherits those patterns.

Training data bias When an AI system reflects unfair patterns from its training data β€” not because the AI "decided" to be biased, but because the data it learned from contained those patterns.

This has concrete implications. If an AI tutoring system was trained primarily on text written by and for students from specific cultural and linguistic backgrounds, it may be better at understanding and responding to students from those backgrounds. Students whose communication styles differ from the training distribution may get responses that are less calibrated, less helpful, or subtly harder to parse. This is not a problem anyone intended. It is a problem that is hard to measure precisely β€” and harder still to fix.

The Ethical Question

If an AI tutoring system provides measurably different quality of help to students based on patterns in their writing β€” and if it's being used as an equity tool to provide tutoring to students who can't afford human tutors β€” does the bias make the equity argument collapse? Or is a biased AI tutor still better than no tutor at all? There is no clean answer here.

What You Can Actually Do About This

Here is the thing about knowing that AI tutors hallucinate and carry biases: it gives you a practical toolkit that most of your peers don't have.

For factual claims: treat specific dates, names, statistics, and citations from any AI tutor the way you'd treat a claim from a classmate who seems smart but isn't infallible. They might be right. Verify before you commit to it in writing. For conceptual explanations β€” how something works, why a process unfolds the way it does β€” the error rate is lower and the AI is often genuinely useful.

For your own confidence: if an AI tutor's explanations consistently feel slightly off β€” if the vocabulary seems pitched at the wrong level, if the examples don't connect to your experience β€” that's worth paying attention to. You might be encountering a bias artifact. Trying a different prompt style, a different platform, or a human teacher for that topic is a reasonable response.

For the bigger picture: the fact that AI tutors have documented failure modes is not an argument for abandoning them. It's an argument for using them with awareness. A calculator can make your work wrong if you enter the wrong numbers. That doesn't make calculators bad β€” it makes entering careful inputs important. The same logic applies here.

Knowing this makes you a more sophisticated user of tools that are already reshaping education. The students who understand the failure modes of their AI tools will use those tools better than students who just trust them. That is a real advantage β€” and it's yours now.

Lesson 3 Quiz

Apply what you know to new situations
1. The Treaty of Versailles example shows an AI giving a wrong date confidently. What is the technical term for this type of error, and what causes it?
Correct. Hallucinations aren't retrieval failures β€” the AI doesn't look things up. It generates text that sounds right because it fits learned patterns. There's no internal fact-check running.
Language models don't retrieve stored facts β€” they generate text that matches patterns from training. When a wrong answer fits those patterns confidently, the result is a hallucination: false information delivered as if certain.
2. Why are hallucinations described as "a feature, not a bug" of how language models work?
Correct. Fluency and hallucination are two sides of the same coin. A system cautious enough to never hallucinate would also be too cautious to be a useful conversational partner. That's the engineering tradeoff.
The lesson described this as a core tradeoff: the same statistical prediction that makes language models fluent is what allows them to generate confident errors. There's no simple fix that removes hallucinations while keeping the fluency.
3. An AI tutor is being used as an equity tool β€” to give tutoring to students who can't afford human tutors. Research then shows the AI provides subtly better help to students whose writing style matches its training data. How does this complicate the equity argument?
Correct. This is the structural irony: an equity tool that performs better for already-advantaged students could make inequality worse. The intent doesn't determine the effect. That's why the ethical question has no clean answer.
The lesson described this as a genuine tension with no clean answer. If a tool designed to reduce inequality actually works better for students who already have advantages, the equity argument is seriously complicated β€” even if the tool also helps everyone somewhat.
4. You're using an AI tutor to study for a history test. Which type of information should you be MOST careful to verify from another source?
Correct. The lesson noted that hallucinations cluster in specific dates, names, citations, and statistics β€” areas where training data tends to be thinner or more contradictory. Conceptual explanations are generally more reliable.
Hallucinations cluster in specific factual claims β€” dates, names, statistics. These are the claims most likely to be wrong with high AI confidence. Conceptual explanations of how or why things happened tend to be more reliable territory for AI tutors.
5. Training data bias in AI tutors is described as "not a problem anyone intended." Why does the absence of intent not eliminate the problem?
Correct. Effects don't require intent. The AI doesn't make decisions β€” it reflects patterns. If the patterns in training data are biased, the outputs will reflect that bias even if every engineer and designer involved had good intentions.
Intent and effect are different things. A language model doesn't consider what its creators intended β€” it reflects patterns in data. Biased patterns produce biased outputs. Good intentions on the part of developers don't override what the data learned.

Lab 3: The Hallucination Audit

Your job is to think like a fact-checker β€” and defend your reasoning.

Your Role: Fact-Checker

AI tutors make errors in predictable ways. Now that you know where those errors cluster β€” specific dates, names, statistics β€” your job is to think through how you'd actually catch them in practice.

AESOP is going to push you to be specific. It's not enough to say "I'd verify it." How? Where? What makes a source trustworthy enough to override what the AI said?

Start here: Imagine an AI tutor gives you a specific historical date that sounds right but you're not 100% sure about. Walk AESOP through exactly what you would do to check it β€” and how you'd decide whether to trust the AI or the source you found.
AESOP β€” Lab Assistant
Lab 3 Β· Hallucination Audit
Alright, fact-checker. You've got an AI-generated date that might be wrong. Don't tell me you'd "Google it" β€” that's not a verification strategy, that's a search. Walk me through your actual process: what makes a source trustworthy, how do you handle conflicting sources, and what do you do if the AI and your source disagree? Be specific about at least one concrete step.
The AI That Teaches You Β· Lesson 4

Who Decides How AI Teaches You?

The policy decisions, institutional debates, and power questions behind the tool in your browser
When a school district adopts an AI tutor, whose values get embedded in how it teaches β€” and who gets to decide?

In September 2023, the Los Angeles Unified School District β€” the second-largest school district in the United States, serving over 400,000 students β€” quietly suspended access to an AI tool it had paid $6 million to deploy. The tool was called Ed, built by a company called AllHere. The suspension came after parents and teachers raised concerns about data privacy, about what the AI was telling students, and about the fact that the decision to deploy it had been made largely without consulting teachers or families. By June 2024, AllHere had gone bankrupt. The $6 million was effectively gone. The students who were supposed to benefit from the tool had been given and then had taken away something their district had presented as the future of education.

The LA story is not an argument against AI tutoring tools. It is an argument for paying attention to who makes the decisions about those tools, what questions get asked before deployment, and who bears the cost when things go wrong. Those questions are not technical questions. They are political and ethical questions β€” and right now, in most school districts, they are being answered without much input from the people most affected.

Who Controls the Curriculum

When a school district adopts an AI tutoring tool, they are making an educational decision β€” but also a values decision. The tool reflects choices about what counts as correct, what counts as a good explanation, what topics are treated as settled and what topics are presented as contested. Those choices were made by engineers, product designers, and content teams at a company. They were not made by the teachers in your school or the families in your community.

This is not new. Textbooks have always encoded someone's values about what to include and exclude. But textbooks go through public review processes β€” school boards vote on them, parents can examine them, there are legal requirements about curriculum transparency in most states. AI tutoring systems don't face the same requirements. A system prompt that shapes every interaction a student has with an AI tutor is considered proprietary information. In most cases, no one outside the company has reviewed it.

Algorithmic transparency The principle that the rules an AI system operates under should be visible and understandable to the people affected by it β€” not hidden as trade secrets.

In 2023, the U.S. Department of Education released a report titled Artificial Intelligence and the Future of Teaching and Learning. One of its central recommendations was that AI tools deployed in schools should be subject to transparency requirements β€” that educators and families should be able to understand, at a meaningful level, how the AI is making decisions about students. As of 2024, no federal legislation requires this. It remains a recommendation.

Worth Knowing

Several states β€” including California, New York, and Colorado β€” passed or proposed student data privacy laws in 2023–2024 that apply to AI education tools. These laws primarily address what data can be collected and how it can be used, not what values or instructional approaches are embedded in the AI's design. The data question and the curriculum question are related but different.

The Teacher's Role β€” and Why It's Contested

In October 2023, the American Federation of Teachers β€” the second-largest teachers' union in the United States, representing 1.7 million members β€” released a policy statement on AI in education. It was more nuanced than many expected. The AFT did not call for banning AI tutoring tools. Instead, it called for three things: teacher involvement in decisions about which tools get adopted, transparency about how the tools work, and protections ensuring that AI is used to support teachers rather than replace them.

That last point is where the debate gets heated. AI tutoring tools are significantly cheaper than human tutors. In some versions of the future being imagined by technology companies and some policymakers, AI tutors could allow schools to reduce the number of human teachers while maintaining β€” or even improving β€” educational outcomes. Teachers' unions argue this is both educationally wrong and ethically unacceptable. Technology advocates argue that if AI tutoring can genuinely improve learning, it's wrong to restrict it in order to protect adult employment.

Both sides are arguing in good faith. Both sides have real interests at stake. And the students β€” who are the ones whose education is being shaped by these decisions β€” are largely not at the table when those decisions are made.

The Ethical Question

If an AI tutoring system demonstrably improved learning outcomes for students in schools that couldn't afford enough human teachers β€” but its widespread adoption led to fewer teaching jobs β€” would that be a good outcome? A bad one? Who should get to decide?

What You Know That Most People Don't

You have now worked through the technical, pedagogical, and political layers of AI tutoring systems. That combination is rarer than you might think. Most people who have opinions about AI in education know one layer and not the others. A parent concerned about data privacy might not understand what a language model actually is. A teacher worried about replacement might not know what hallucinations are. A technology company excited about scale might not have thought carefully about training data bias.

You know all three layers. That lets you have a more complete conversation than most adults can have about this topic right now. It also lets you evaluate claims β€” from companies, from teachers, from policymakers, from journalists β€” with the specific knowledge of how these systems actually work under the surface.

When you see a headline that says "AI Tutors Outperform Human Teachers in New Study," you can ask: outperform on what measure, in what context, with what demographic of students, and who funded the study? When you see a headline that says "AI Tutors Are Dangerous and Should Be Banned from Schools," you can ask: what failure mode are they pointing to, is that failure fundamental to the technology or specific to a particular implementation, and what would students lose if the tool weren't available?

Neither trusting nor dismissing is the right response. The right response is the one you're now equipped to give: a specific, informed position based on what these systems actually do β€” not what their promoters claim or their critics fear. That is a genuinely consequential skill, and you have it.

You Can Now See What Most People Miss

Most people reading about AI tutoring tools see either a technological promise or a technological threat. You see a system with specific capabilities, specific failure modes, specific design choices, and specific political stakes. That's not the same as having all the answers. It's better: it's knowing which questions to ask.

Lesson 4 Quiz

Policy, power, and what you can do with what you know
1. The LA Unified School District suspended its $6 million AI tool in 2023. What does this case most clearly illustrate about how AI tools get adopted in schools?
Correct. The LA case wasn't primarily a technology failure β€” it was a governance failure. The decision to spend $6 million was made with insufficient stakeholder input, and students paid the price when the tool was pulled.
The LA case illustrates a governance problem: who gets to make adoption decisions, and who gets consulted before they're made. The failure wasn't primarily technical β€” it was about whose voices were missing from the decision-making process.
2. Why are AI tutoring system prompts treated differently from textbooks in terms of public review and transparency?
Correct. Textbooks go through public review processes. System prompts are trade secrets. That gap β€” between the transparency requirements for old educational materials and new AI tools β€” is exactly what groups like the U.S. Department of Education are pushing to close.
The issue is legal category: textbooks are subject to curriculum transparency laws, while AI system prompts are classified as proprietary information. No law currently requires AI companies to expose the instructions shaping student interactions.
3. The American Federation of Teachers' 2023 AI policy statement called for AI to "support" rather than "replace" teachers. What is the core tension this language is trying to navigate?
Correct. The tension is real: if AI can match human tutoring effectiveness at a fraction of the cost, there's an economic case for using it to reduce staffing. Whether that's acceptable β€” and who decides β€” is a live policy debate, not a settled question.
The underlying tension is economic and educational: AI could potentially allow fewer human teachers while maintaining outcomes. Whether that's a good trade β€” and who benefits and who loses β€” is exactly what the AFT's language about "support not replace" is trying to stake out.
4. You see the headline: "New Study: AI Tutors Outperform Human Teachers in Math." Using what you've learned in this module, what is the most important question to ask before accepting or rejecting this claim?
Correct. "Outperform" is meaningless without specifics. Outperform on a short-term quiz? On understanding tested six months later? With which demographic of students? These details determine whether the headline's claim transfers to your situation.
The lesson gave you exactly this framework: ask outperform on what measure, in what context, with what demographic, and who funded it. A headline strips all that context away. Knowing what was stripped is how you evaluate the claim.
5. The U.S. Department of Education's 2023 AI report recommended algorithmic transparency for AI tools in schools. As of 2024, this recommendation had not become law. What does that gap between recommendation and law reveal about how technology policy works?
Correct. Government agencies can identify problems and issue recommendations far faster than legislatures can pass binding laws. That gap β€” between identified problem and legal protection β€” is where most technology policy debates live right now.
The gap between recommendation and law is normal, not scandalous. Recommendations identify the problem; legislation requires political consensus and process. Technology typically moves much faster than either, which is why this gap exists across nearly every AI policy area right now.

Lab 4: The Policy Pitch

You've been asked to advise a school board. What do you actually recommend?

Your Role: Student Policy Advisor

A fictional school district β€” 12,000 students, limited budget, significant teacher shortage β€” is considering adopting Khanmigo for all middle school students. The school board has asked for student input. You're the one who knows how this technology actually works.

AESOP plays the role of a skeptical board member who has heard a lot of technology pitches before. You'll need to make a concrete recommendation β€” adopt, don't adopt, or adopt with conditions β€” and defend it with the specific knowledge you've built in this module.

Start here: Tell AESOP your recommendation for the school board and your main reason for it. Be specific β€” general support or opposition won't hold up to scrutiny. AESOP will challenge your reasoning.
AESOP β€” Lab Assistant
Lab 4 Β· Policy Pitch
I'm on the board, and I've heard three pitches this month from AI companies, two from teachers' unions, and one from a parent group. Everyone has an opinion. You're supposedly different because you actually understand how this technology works. So: adopt Khanmigo, don't adopt it, or adopt it with specific conditions? And give me the reason that actually comes from understanding the technology β€” not just "it could be good" or "it could be bad."

Module Test

15 questions across all four lessons β€” 80% to pass
1. What is the primary technical difference between Khanmigo and a traditional chatbot?
Correct. Language model generation vs. rigid scripting is the fundamental distinction.
Khanmigo generates responses using a language model (GPT-4) β€” it's flexible and novel. Traditional chatbots follow pre-written scripts and break when inputs don't match expected patterns.
2. Sal Khan demonstrated Khanmigo asking "what have you already tried?" instead of giving an answer. Which historical teaching tradition does this reflect?
Correct. Socrates taught by asking questions that forced students to examine their own reasoning β€” exactly the pattern Khanmigo is designed to replicate.
This is the Socratic method β€” a 2,000-year-old teaching approach in which the teacher asks questions rather than providing answers, forcing the student to discover understanding through their own reasoning.
3. Benjamin Bloom's 1984 "2 Sigma" research found that one-on-one tutoring produced students two standard deviations above average. Why is this called a "problem" rather than just a discovery?
Correct. Knowing something works doesn't mean we can give it to everyone. The gap between demonstrated effectiveness and feasibility at scale is the "problem" β€” and AI tutors are proposed as the solution.
The problem is the gap between what works (one-on-one tutoring) and what we can actually provide at scale. The discovery is powerful; the inability to implement it for every student is the problem.
4. A student answers all of Khanmigo's questions correctly during a session but fails to explain the concept the next day without help. This is best described as:
Correct. Surface compliance is when a student gives the AI what it wants β€” correct-seeming answers β€” without actually building transferable understanding.
This is surface compliance: the student learned to answer the AI's questions successfully without developing the understanding those questions were supposed to build. The AI's satisfaction and the student's learning diverged.
5. What does "within-session adaptation" mean, and what is its main limitation?
Correct. Real-time response adjustment is genuinely useful β€” but starting from zero every session means the AI can't build the kind of longitudinal understanding of a student that a human teacher develops over months.
Within-session adaptation means the AI responds to your current conversation and adjusts accordingly. The limitation: it typically doesn't carry knowledge of you across different sessions β€” every conversation starts fresh.
6. How is Carnegie Learning's MATHia system different from Khanmigo in its approach to misconception detection?
Correct. Specificity is the key distinction. A catalogue of documented errors enables targeted questions. Generated questions that sound good may or may not hit the real misconception.
The difference is specificity: MATHia can select questions matched to documented misconceptions. Khanmigo generates plausible educational questions β€” which might be right for the student's confusion, or might not be.
7. What is an AI hallucination, and what causes it?
Correct. The AI isn't lying β€” it has no mechanism to lie. It's generating text that fits learned patterns. When wrong text happens to fit those patterns, the result is a confident error.
A hallucination is false information delivered with confidence β€” not labeled as fiction, not hedged. It occurs because the AI's text generation process optimizes for plausibility, not accuracy. Wrong text that fits statistical patterns gets generated.
8. Which type of AI tutor content should you be MOST skeptical about without verification?
Correct. Hallucinations cluster in factual specifics β€” dates, names, citations, statistics. Conceptual explanations tend to be more reliable. This gives you a practical filter for when to verify.
Hallucinations cluster in specific factual claims β€” dates, names, statistics β€” because training data in these areas tends to be thinner or more contradictory. Conceptual explanations are generally more reliable territory.
9. Training data bias in AI tutors is described as unintentional. Why does unintentional bias still matter?
Correct. The students experiencing worse outcomes from a biased AI system are experiencing real harm β€” whether or not anyone intended it. Intent and effect are different things, and both matter.
Effects exist independently of intent. If an AI system provides different quality of support to different students based on patterns in its training data, the students receiving worse support are genuinely disadvantaged β€” regardless of what the developers intended.
10. An AI tutoring tool is deployed to improve equity β€” giving free tutoring to students who can't afford human tutors β€” but research shows it works significantly better for students from certain cultural and linguistic backgrounds. What is the most important implication of this finding?
Correct. Equity tools that work better for already-advantaged groups can reinforce the very inequalities they were meant to address. The gap in effectiveness is the inequality, regardless of the tool's stated purpose.
The structural implication is that an equity intervention with unequal effectiveness could widen the gap it was supposed to close. "Some benefit is better than none" doesn't hold if the benefit is systematically skewed toward students who already have more advantages.
11. Why are AI tutoring system prompts not subject to the same public review process as textbooks in most U.S. states?
Correct. The legal category matters: textbooks are subject to curriculum laws; system prompts are trade secrets. That gap is what transparency advocates are pushing to close.
Legal categories determine review requirements. Textbooks fall under curriculum laws requiring public review. AI system prompts fall under proprietary business information protections. The law hasn't caught up to the new category.
12. What did the American Federation of Teachers' 2023 AI policy statement primarily call for?
Correct. The AFT's position was nuanced β€” not a ban, but specific governance and labor protections. That distinction matters when evaluating claims about what teachers' unions actually want.
The AFT didn't call for a ban β€” their position was more specific: include teachers in decisions, ensure transparency, and protect teaching jobs. Understanding the actual position rather than a caricature of it is important for policy discussions.
13. You read a headline: "AI Tutors Proven to Double Math Scores." What is the most critical piece of information missing from this headline that you need before accepting the claim?
Correct. Effect size claims without context are nearly meaningless. "Double" compared to what baseline? Measured how? For which students? Over what period? These details determine whether the finding applies to any real situation.
The most important missing information is context for the claim: double compared to what, measured how, with which students, over what period. A number without those parameters can't tell you whether the finding applies to any particular school or student.
14. The U.S. Department of Education recommended algorithmic transparency for AI tools in schools in 2023, but this hadn't become law by 2024. What does this illustrate about the relationship between technology and policy?
Correct. This gap β€” between problem identification and legal remedy β€” is normal and significant. It's why advocates, researchers, and informed citizens play a role that legislation can't fill in real time.
The gap between recommendation and law is a structural feature of how policy works β€” not evidence of corruption. Technology moves faster than legislative processes. The gap between identified problem and legal response is where most AI policy debates currently live.
15. After completing this module, which of the following best describes the right approach to using an AI tutoring tool?
Correct. Calibrated skepticism is the goal β€” not blanket trust or blanket rejection. You now know enough to match your level of skepticism to the specific type of content and claim the AI is making.
The module argued for neither total trust nor total rejection β€” both miss the point. Calibrated skepticism, based on understanding the specific failure modes of language models, is what lets you use these tools effectively and safely.