In March 2023, a thirteen-year-old in New Jersey named Aditya tried something his school hadn't officially approved: he opened Khan Academy's new AI tutor, Khanmigo, and asked it to help him understand quadratic equations. He'd been stuck for two weeks. His teacher was stretched across thirty students. Khanmigo didn't give him the answer β it asked him questions back. Within forty minutes, Aditya solved a type of problem he hadn't been able to crack all month. He told his mom it felt like having a tutor who had infinite patience and zero judgment. That same week, schools in at least six states were quietly debating whether to ban it.
Here's the thing: tools that feel personal and powerful almost always arrive before anyone has figured out the rules. In 1876, the telephone was demonstrated at a World's Fair and educators immediately argued about whether it would make students lazy. In 1925, radio lessons were broadcast into classrooms and principals complained that children would stop thinking for themselves. AI tutors are the newest version of a very old argument β but this time the technology actually does adapt to you, remember your mistakes, and change what it says next based on how you respond. That is genuinely new.
This course is about how those systems actually work β not the marketing version, not the panic version. You'll learn what an AI tutor is doing under the surface when it responds to you, why it sometimes gets things wrong in very specific ways, and what you should actually be skeptical about. By the end, you'll understand something that most adults in your school β including most teachers β don't yet know how to articulate. That's not a boast. It's just where we are.
The date was March 14, 2023 β Pi Day, fittingly β when Sal Khan, the founder of Khan Academy, stood on a stage in San Francisco and demonstrated something he'd been building in secret for almost a year. He typed a question into a chat window. The response didn't give him an answer. Instead, it asked: "Before I help, can you tell me what you've already tried?" The audience went quiet. That single design decision β refusing to hand over the answer β was, Khan argued, the whole point. He wasn't showing off a smarter search engine. He was showing something that behaved more like a tutor than a tool.
Within 48 hours, the clip had been viewed over two million times. Educators were divided almost immediately. Some saw a breakthrough β a patient, infinitely available guide for students whose schools couldn't afford human tutors. Others saw something more troubling: a machine that could sound wise while being completely wrong, and that students might trust more than it deserved. Both reactions, it turns out, were correct.
Khanmigo is Khan Academy's AI tutor, built on top of GPT-4 β the same language model that powers ChatGPT β but with a very specific set of instructions layered on top. Those instructions tell it to behave like a tutor, not a search engine. The difference matters enormously.
A search engine retrieves information. You type "quadratic formula," it finds pages that contain those words, and it returns links. A language model like GPT-4 does something stranger: it predicts what word should come next in a sentence, over and over again, until it has generated a response that sounds like something a knowledgeable person would say. It has read an enormous amount of human writing β textbooks, Wikipedia articles, Reddit threads, academic papers β and learned the patterns of how people explain things.
The critical point: Khanmigo doesn't retrieve stored answers. It generates them fresh each time. Which means it can explain things in new ways, respond to your specific confusion, and even ask you follow-up questions β but it can also generate a confident-sounding wrong answer, because it's optimizing for sounding right, not for being right.
What makes Khanmigo distinct from raw ChatGPT is its system prompt β a hidden set of instructions that tell the AI to ask students questions instead of just answering them, to encourage rather than judge, and to connect its explanations to Khan Academy's existing curriculum. You never see that prompt. But every response you get is shaped by it.
Khanmigo launched to students in March 2023 with a waitlist. By September 2023, Khan Academy had made it free for all U.S. teachers. By early 2024, it was being used in over 40 countries. The speed of that rollout is part of why the questions it raises haven't been fully answered yet.
When Sal Khan demonstrated Khanmigo asking "what have you already tried?" he was invoking something that has been at the center of education for over 2,000 years: the Socratic method. Socrates, the ancient Greek philosopher, taught by asking questions instead of lecturing. His goal wasn't to transfer information β it was to help students discover knowledge by working through contradictions in their own thinking.
The Socratic method is considered one of the most effective ways to teach. Studies going back to the 1980s β including landmark research by Benjamin Bloom in 1984 β showed that students who received one-on-one tutoring with this kind of back-and-forth performed two standard deviations better than students in regular classrooms. That's roughly the difference between a C student and an A student. Bloom called this "the 2 Sigma problem": we know one-on-one tutoring works dramatically better, but we can't afford to give every student a human tutor.
Khanmigo's pitch is that it can be that tutor for everyone. But here's what gets complicated: doing the Socratic method well requires understanding not just the subject matter, but the specific student's misconception. A skilled human tutor can tell from a student's hesitation, their word choice, their facial expression, that they're confused about a specific thing. Khanmigo reads only text. It can ask follow-up questions, but it can also ask the wrong follow-up question β one that sounds pedagogically smart but actually steers the student further from understanding.
Khanmigo is designed to never give away the answer. But what if the student has been stuck for an hour, is getting frustrated, and genuinely needs to see a worked example before the questioning approach will help? A human tutor reads that situation. Khanmigo follows its instructions. Is following a good rule always the same as good teaching?
People reach for two familiar categories when they encounter Khanmigo: chatbot or teacher. Neither fits cleanly, and knowing why is what separates an informed user from someone who's just along for the ride.
A traditional chatbot β think of the little popup windows on airline websites that ask "How can I help you today?" β follows rigid scripts. If you say something the script doesn't expect, it breaks. Khanmigo doesn't do that. It can handle unexpected questions, shift topics mid-conversation, and generate novel explanations. In that sense it's clearly not a simple chatbot.
But it's also not a teacher. A teacher has professional training, legal accountability, and a relationship with you that extends across time. A teacher can notice that you seem distracted and ask if something is wrong at home. A teacher can be held responsible if they teach you something incorrect. Khanmigo has none of those properties. It has no memory of you across sessions (unless the platform specifically stores it), no accountability structure, and β as has been demonstrated in multiple documented tests β no reliable mechanism for knowing when it's wrong.
The most accurate description is something like: a very sophisticated study partner that has read everything, forgets you constantly, and sometimes makes things up with complete confidence. Once you hold that description in your head, you know how to use it well. You use it for explanation and exploration. You verify important facts elsewhere. You treat its confidence as a style, not as evidence of accuracy.
Most people who use Khanmigo either trust it completely or distrust it completely. Both reactions miss the point. Knowing it's a language model with tutor-shaped instructions means you can be exactly as skeptical as the situation requires β trusting its explanations of concepts, questioning its factual claims, and ignoring the false certainty in its tone.
In October 2023, the nonprofit Common Sense Media released a report analyzing AI tutoring tools used in American schools. Among their findings: most students had no idea that the AI tutor they were talking to had a hidden system prompt shaping every answer. They believed they were getting the AI's "natural" response. They weren't.
This raises a question that Sal Khan, the researchers at Khan Academy, the policymakers at the U.S. Department of Education, and teachers in thousands of classrooms are actively wrestling with β and have not resolved: Should students be told exactly what instructions their AI tutor is operating under?
The argument for transparency: if you're interacting with a system that has been instructed to behave a certain way, you have a right to know what those instructions are. Knowing changes how you interpret what it says. If Khanmigo is designed to never discourage you, you should know that β because "you're doing great" means something different coming from a system programmed to encourage than from someone who genuinely thinks you're doing great.
The argument against full transparency: the system prompt is what makes the tool work well. If students know exactly how to get around it β exactly what triggers it to stop asking questions and just give answers β some will exploit that. The pedagogical design depends on the student not fully seeing the mechanism.
Both arguments are serious. Neither is obviously wrong. You're now aware of a real institutional debate β one that affects what happens to students in schools right now, including probably yours. What you do with that awareness is up to you.
You've just read about how Khanmigo works. Now you're going to pressure-test those ideas in conversation with AESOP β an AI that knows this course material and will push back on vague thinking.
AESOP is not going to lecture you. It's going to ask what you actually think, challenge you when you're fuzzy, and ask you to back up your claims. You'll need to take a position, not just summarize the lesson.
In November 2023, a team of researchers at MIT's Teaching Systems Lab published a study that surprised a lot of people in education technology. They had given students a set of algebra problems and let some of them work with Khanmigo while others worked alone. Then they did something clever: they asked both groups to explain their reasoning out loud. What they found was that students who had used Khanmigo were slightly worse at explaining their thinking than students who had worked through the problems without any help. The AI had been asking them questions β but somehow the students had learned to navigate the questions without actually building the understanding behind them.
The researchers had a name for what they were seeing: surface compliance. The students were giving the AI the answers it seemed to want, the AI was responding encouragingly, and everyone moved forward β but the underlying confusion hadn't been addressed. The AI thought they understood. They didn't. And neither knew it.
The phrase "personalized learning" is everywhere in education technology marketing. It sounds compelling: education tailored to exactly you, your pace, your gaps, your style. But when you look at how current AI tutors actually implement personalization, the reality is more complicated.
Khanmigo personalizes in a relatively narrow way. Within a single conversation, it reads your responses and adjusts. If you seem confused, it tries a different explanation. If you answer a question correctly, it moves forward. This is real and useful β it's called within-session adaptation. But it has significant limits.
First limit: the AI can only read what you type. If you're confused about why a mathematical rule works but you type an answer that happens to be correct, the AI has no way to know you're confused. It moves on. The MIT study caught exactly this happening.
Second limit: most AI tutors, including Khanmigo in its standard form, don't maintain a persistent model of you across sessions. Each time you start a new conversation, it doesn't know you were confused about fractions last Tuesday. A human teacher builds a mental model of each student over months. The AI starts fresh every time.
Third limit: the AI can adjust explanations but can't adjust the underlying curriculum. It's working within whatever lesson structure the platform has defined. If the curriculum's sequencing is wrong for how your specific brain works, the AI can't reorder it.
Some newer AI tutoring systems β including experimental versions being tested at Carnegie Mellon University in 2023β2024 β are starting to build persistent learner models. These systems track what you've done across many sessions and try to predict where your misconceptions are. But this requires storing detailed data about every student interaction, which raises its own questions about privacy.
The most sophisticated thing a tutor can do β human or AI β is not teach new material. It's identify and correct a specific misconception that's blocking a student from moving forward. A misconception is not just "not knowing" something. It's having an incorrect mental model that feels correct to you.
For example: many students think that when you multiply two numbers together, the result is always bigger than both of them. That works perfectly for most cases they've seen β until they multiply by a fraction or a negative number. The rule they've built in their head ("multiplication makes things bigger") is wrong, but they don't know it's wrong. A good tutor spots this and addresses it directly. The MIT study showed that Khanmigo often didn't.
Why? Because detecting a misconception requires more than reading a correct answer. It requires asking the right diagnostic question. Human tutors do this intuitively β they've seen thousands of students make the same mistakes. Khanmigo generates diagnostic questions by predicting what sounds educationally reasonable, not by drawing on a catalogue of documented student errors. The questions it asks might be good questions. They might not be the right question for your specific wrong belief.
Researchers at Carnegie Learning β a company that has been building AI tutoring systems since 1998, long before the current AI boom β have spent decades cataloguing the specific misconceptions students form in math. Their system, MATHia, uses that catalogue to choose targeted questions. Khanmigo doesn't have the same depth of misconception-specific training. It's more flexible and better at conversation, but potentially less precise at identifying exactly where your thinking went wrong.
Fixing misconceptions requires storing and analyzing data about where students go wrong. The more precise the misconception detection, the more detailed the data the system needs. At what point does building a better tutor require knowing too much about the student?
Here's something most people don't think about when they praise AI tutors: how does the system know if it's working? With a human tutor, you know relatively quickly β the student either gets it or they don't, and the tutor adjusts. With Khanmigo, the feedback loop is much weaker.
In a single session, the AI's signal is: did the student give a correct answer? But as the MIT study showed, a correct answer doesn't prove understanding. Across sessions, Khanmigo in its standard form doesn't know what happened in your previous conversations. Khan Academy as a platform can track your exercise scores, but the AI conversation itself isn't tightly coupled to that data in real time.
This means the AI tutor is partly flying blind. It's optimizing for generating responses that seem pedagogically appropriate. Whether those responses are actually producing learning is a question the AI itself cannot answer. That measurement has to come from somewhere outside the system β a test, a teacher's assessment, a researcher's study.
This is not a fatal flaw. It's a design gap that the field is actively working on. But it means that the most important measure of an AI tutor's effectiveness β does the student actually understand more? β is not something the AI can reliably track itself. You can use this knowledge right now. When you work with an AI tutor, don't just answer its questions. Periodically ask yourself: could I explain this to someone else without looking anything up? That's the real test of whether the interaction did anything useful.
Most students assume that if the AI seemed satisfied, they must have learned something. Now you know that's not a reliable signal. The AI's satisfaction and your understanding are two different things β and only one of them matters when the test comes.
You know the problem: AI tutors can't reliably tell the difference between a student who understands something and a student who has learned to say the right words. Your job is to propose a fix β a specific design change that would help an AI tutor detect surface compliance or misconceptions more accurately.
AESOP will ask you to be specific. "Make it smarter" doesn't count. You'll need to describe a concrete mechanism β what data the AI collects, what it does with it, or what it asks differently.
In April 2023, researchers at Stanford University's Human-Centered AI Institute released a report documenting something that had been observed informally by teachers for months: AI tutoring tools were generating confident factual errors β wrong information delivered in a tone so certain that students had no reason to question it. In one documented case, a high school student using an AI study tool for a history essay was told that the Treaty of Versailles was signed in 1921. It was signed in 1919. The AI didn't hedge, didn't say "I think" β it stated the date as fact. The student used it in the essay. The teacher caught it. The student was confused because "the AI told me."
This wasn't a glitch or a bug in the traditional sense. The AI was working exactly as designed β predicting text that sounded authoritative. It had encountered enough text where 1921 appeared near "Versailles" (perhaps referring to related diplomatic events that year) that it generated that number with full confidence. The mechanism that makes these systems fluent is the same mechanism that makes them wrong in this specific, hard-to-detect way.
The AI research community has a term for what happened with the Treaty date: a hallucination. The word is a bit dramatic, but the definition is precise: a hallucination is when a language model generates text that is factually incorrect but internally coherent β meaning it fits the pattern of how correct text looks, so the AI produces it without any signal that something is wrong.
Hallucinations are a feature, not a bug, of how current language models work. They are the price of fluency. A model that was maximally cautious β one that only said things it was certain about β would be far less useful as a conversational partner. The engineering tradeoff between fluency and accuracy is one of the central unsolved problems in AI development right now.
Khanmigo is specifically designed to reduce hallucinations by anchoring responses to Khan Academy's curriculum content. But it cannot eliminate them. In a 2023 audit by education researcher Anya Kamenetz, Khanmigo generated incorrect math explanations in roughly 10β15% of tested interactions. That's significantly better than unguided ChatGPT β but it still means that roughly one in ten explanations may contain an error. In a tutoring context, where students are trusting the AI to help them understand something they already don't understand, a 10% error rate is not nothing.
Hallucinations tend to cluster in certain types of content: specific dates, names, citations, and any area where training data was thin or contradictory. Conceptual explanations (how a process works) tend to be more reliable than factual claims (when something happened, who did it).
In January 2024, a research team at UCLA's Center for Research on Evaluation, Standards, and Student Testing published findings on something more subtle than hallucinations: patterns in how AI tutors responded differently to students based on cues in their writing. When student prompts contained markers associated with certain demographic groups β dialect features, specific cultural references, writing styles β the AI tutors in the study gave different levels of detail, different encouragement, and different levels of scaffolding in response.
The researchers were careful about how they described this. The AI wasn't making conscious decisions about which students deserved better help. It was doing what language models always do: pattern-matching against training data. And the training data β the enormous volume of human text these models learned from β encodes the biases of the humans who wrote it. An AI that learned from that data inherits those patterns.
This has concrete implications. If an AI tutoring system was trained primarily on text written by and for students from specific cultural and linguistic backgrounds, it may be better at understanding and responding to students from those backgrounds. Students whose communication styles differ from the training distribution may get responses that are less calibrated, less helpful, or subtly harder to parse. This is not a problem anyone intended. It is a problem that is hard to measure precisely β and harder still to fix.
If an AI tutoring system provides measurably different quality of help to students based on patterns in their writing β and if it's being used as an equity tool to provide tutoring to students who can't afford human tutors β does the bias make the equity argument collapse? Or is a biased AI tutor still better than no tutor at all? There is no clean answer here.
Here is the thing about knowing that AI tutors hallucinate and carry biases: it gives you a practical toolkit that most of your peers don't have.
For factual claims: treat specific dates, names, statistics, and citations from any AI tutor the way you'd treat a claim from a classmate who seems smart but isn't infallible. They might be right. Verify before you commit to it in writing. For conceptual explanations β how something works, why a process unfolds the way it does β the error rate is lower and the AI is often genuinely useful.
For your own confidence: if an AI tutor's explanations consistently feel slightly off β if the vocabulary seems pitched at the wrong level, if the examples don't connect to your experience β that's worth paying attention to. You might be encountering a bias artifact. Trying a different prompt style, a different platform, or a human teacher for that topic is a reasonable response.
For the bigger picture: the fact that AI tutors have documented failure modes is not an argument for abandoning them. It's an argument for using them with awareness. A calculator can make your work wrong if you enter the wrong numbers. That doesn't make calculators bad β it makes entering careful inputs important. The same logic applies here.
Knowing this makes you a more sophisticated user of tools that are already reshaping education. The students who understand the failure modes of their AI tools will use those tools better than students who just trust them. That is a real advantage β and it's yours now.
AI tutors make errors in predictable ways. Now that you know where those errors cluster β specific dates, names, statistics β your job is to think through how you'd actually catch them in practice.
AESOP is going to push you to be specific. It's not enough to say "I'd verify it." How? Where? What makes a source trustworthy enough to override what the AI said?
In September 2023, the Los Angeles Unified School District β the second-largest school district in the United States, serving over 400,000 students β quietly suspended access to an AI tool it had paid $6 million to deploy. The tool was called Ed, built by a company called AllHere. The suspension came after parents and teachers raised concerns about data privacy, about what the AI was telling students, and about the fact that the decision to deploy it had been made largely without consulting teachers or families. By June 2024, AllHere had gone bankrupt. The $6 million was effectively gone. The students who were supposed to benefit from the tool had been given and then had taken away something their district had presented as the future of education.
The LA story is not an argument against AI tutoring tools. It is an argument for paying attention to who makes the decisions about those tools, what questions get asked before deployment, and who bears the cost when things go wrong. Those questions are not technical questions. They are political and ethical questions β and right now, in most school districts, they are being answered without much input from the people most affected.
When a school district adopts an AI tutoring tool, they are making an educational decision β but also a values decision. The tool reflects choices about what counts as correct, what counts as a good explanation, what topics are treated as settled and what topics are presented as contested. Those choices were made by engineers, product designers, and content teams at a company. They were not made by the teachers in your school or the families in your community.
This is not new. Textbooks have always encoded someone's values about what to include and exclude. But textbooks go through public review processes β school boards vote on them, parents can examine them, there are legal requirements about curriculum transparency in most states. AI tutoring systems don't face the same requirements. A system prompt that shapes every interaction a student has with an AI tutor is considered proprietary information. In most cases, no one outside the company has reviewed it.
In 2023, the U.S. Department of Education released a report titled Artificial Intelligence and the Future of Teaching and Learning. One of its central recommendations was that AI tools deployed in schools should be subject to transparency requirements β that educators and families should be able to understand, at a meaningful level, how the AI is making decisions about students. As of 2024, no federal legislation requires this. It remains a recommendation.
Several states β including California, New York, and Colorado β passed or proposed student data privacy laws in 2023β2024 that apply to AI education tools. These laws primarily address what data can be collected and how it can be used, not what values or instructional approaches are embedded in the AI's design. The data question and the curriculum question are related but different.
In October 2023, the American Federation of Teachers β the second-largest teachers' union in the United States, representing 1.7 million members β released a policy statement on AI in education. It was more nuanced than many expected. The AFT did not call for banning AI tutoring tools. Instead, it called for three things: teacher involvement in decisions about which tools get adopted, transparency about how the tools work, and protections ensuring that AI is used to support teachers rather than replace them.
That last point is where the debate gets heated. AI tutoring tools are significantly cheaper than human tutors. In some versions of the future being imagined by technology companies and some policymakers, AI tutors could allow schools to reduce the number of human teachers while maintaining β or even improving β educational outcomes. Teachers' unions argue this is both educationally wrong and ethically unacceptable. Technology advocates argue that if AI tutoring can genuinely improve learning, it's wrong to restrict it in order to protect adult employment.
Both sides are arguing in good faith. Both sides have real interests at stake. And the students β who are the ones whose education is being shaped by these decisions β are largely not at the table when those decisions are made.
If an AI tutoring system demonstrably improved learning outcomes for students in schools that couldn't afford enough human teachers β but its widespread adoption led to fewer teaching jobs β would that be a good outcome? A bad one? Who should get to decide?
You have now worked through the technical, pedagogical, and political layers of AI tutoring systems. That combination is rarer than you might think. Most people who have opinions about AI in education know one layer and not the others. A parent concerned about data privacy might not understand what a language model actually is. A teacher worried about replacement might not know what hallucinations are. A technology company excited about scale might not have thought carefully about training data bias.
You know all three layers. That lets you have a more complete conversation than most adults can have about this topic right now. It also lets you evaluate claims β from companies, from teachers, from policymakers, from journalists β with the specific knowledge of how these systems actually work under the surface.
When you see a headline that says "AI Tutors Outperform Human Teachers in New Study," you can ask: outperform on what measure, in what context, with what demographic of students, and who funded the study? When you see a headline that says "AI Tutors Are Dangerous and Should Be Banned from Schools," you can ask: what failure mode are they pointing to, is that failure fundamental to the technology or specific to a particular implementation, and what would students lose if the tool weren't available?
Neither trusting nor dismissing is the right response. The right response is the one you're now equipped to give: a specific, informed position based on what these systems actually do β not what their promoters claim or their critics fear. That is a genuinely consequential skill, and you have it.
Most people reading about AI tutoring tools see either a technological promise or a technological threat. You see a system with specific capabilities, specific failure modes, specific design choices, and specific political stakes. That's not the same as having all the answers. It's better: it's knowing which questions to ask.
A fictional school district β 12,000 students, limited budget, significant teacher shortage β is considering adopting Khanmigo for all middle school students. The school board has asked for student input. You're the one who knows how this technology actually works.
AESOP plays the role of a skeptical board member who has heard a lot of technology pitches before. You'll need to make a concrete recommendation β adopt, don't adopt, or adopt with conditions β and defend it with the specific knowledge you've built in this module.