In March 2023, Khan Academy sent an email to a small group of U.S. teachers. It said: "You have been selected to test Khanmigo — an AI that will talk to your students about math, history, and science, in real time." The email mentioned that the system was powered by GPT-4, the same model behind ChatGPT. But it also said something unusual: "Khanmigo will never give students the answer."
That phrase stopped a lot of teachers cold. The whole point of an AI tutor, they assumed, was to get answers faster. But Khan Academy's founder, Sal Khan, had a different theory. He believed the most valuable thing an AI could do for a student was ask the right question — not supply the right answer. Khanmigo was deliberately designed to frustrate students just enough to make them think.
Six months later, in September 2023, Duolingo launched something called Duolingo Max in the U.S., Japan, and the UK. It also ran on GPT-4. But it felt nothing like Khanmigo. Duolingo Max gave you instant explanations. It praised you. It used your name. It turned grammar lessons into something that felt closer to a text message conversation with a patient friend than a classroom session. Millions of users adopted it within weeks.
Same underlying model. Two completely different experiences. And buried in that difference is one of the most important questions in AI education: Should an AI tutor tell you, or should it ask you?
Khanmigo is built around what educators call the Socratic method — named after the ancient Greek philosopher Socrates, who famously never lectured. Instead, he asked questions that forced his students to examine their own assumptions until they figured things out themselves. Socrates believed that real understanding only happens when you arrive at an idea on your own. Being told an answer leaves no trace; discovering it yourself leaves a mark.
Khan Academy encoded this philosophy directly into Khanmigo's system prompt — the set of instructions that tells the AI how to behave. When a student asks Khanmigo "What's the answer to this algebra problem?", the system is specifically instructed to respond with a question like "What do you think the first step should be?" or "Let's look at what we already know — what does the equation tell us?"
This is a deliberate friction design. The AI is engineered to create a tiny obstacle. Not a wall — just enough resistance to make you pause. The theory is that pausing activates a different kind of brain processing: you're no longer passively receiving information, you're actively constructing it.
In pilot studies run by Khan Academy with U.S. students during the 2023–2024 school year, students who used Khanmigo for at least 30 minutes per week showed measurable improvements on standardized math assessments compared to students using traditional Khan Academy videos alone. The gains were modest — roughly 13% better performance — but they were real, and they were consistent with what decades of human tutoring research already showed: being pushed to think is more effective than being given answers.
Duolingo Max operates on a completely different theory. Its designers weren't primarily trying to optimize how deeply you understand Spanish grammar. They were trying to solve a different problem: most people quit.
Duolingo has published its own internal data showing that the average new user abandons the app within two weeks. Learning a language takes hundreds of hours. Most people, no matter how motivated they are on Day 1, simply stop. Duolingo Max was built to address dropout — the gap between wanting to learn and actually continuing to learn.
To do this, Duolingo Max uses GPT-4 for two specific features. The first, called Explain My Answer, lets users ask why their translation was wrong and get a conversational, personalized explanation instead of a generic grammar rule. The second, called Roleplay, lets users have open-ended AI conversations as characters — ordering coffee in Paris, booking a hotel in Tokyo — to practice language in context.
What Duolingo Max doesn't do is push you toward discomfort. It rewards. It celebrates streaks. It uses your name. Its mascot, the green owl Duo, sends you cheerful notifications. The AI is warm, immediate, and frictionless. The theory is that consistency beats intensity — that showing up every day for a short, enjoyable session produces more actual language acquisition than an occasional difficult deep dive.
A 2021 study in the journal Language Learning & Technology found that Duolingo users who engaged daily for 34 hours total showed vocabulary gains equivalent to one semester of college-level Spanish instruction. But the same study found those gains were shallow — strong on recognition, weak on production and grammar. The students knew words they'd seen; they struggled to construct sentences they'd never practiced.
This exposes a fundamental tension in learning tool design. Engagement — keeping someone using a tool — is not the same as learning. A tool can score very high on one and mediocre on the other. The question of which matters more depends on who you're designing for: the student who might not come back at all, or the student who needs to go deep.
Here's what most people never think about: before a single student touches either tool, hundreds of decisions have already been made. Someone at Khan Academy decided that math understanding was worth short-term frustration. Someone at Duolingo decided that daily streaks were worth shallower learning. These decisions came from educational philosophies, business models, and assumptions about users — not from the AI itself.
GPT-4, the underlying model, has no opinion about how learning should work. It just follows its instructions. The instructions are written by humans, and those humans bring their own ideas about what education is for.
Khan Academy is a nonprofit. It doesn't need users to pay to stay alive. It can afford to frustrate you a little because its mission is learning, not retention. Duolingo is a publicly traded company that reported $531 million in revenue in 2023, with stock analysts watching monthly active user numbers closely. It cannot afford for users to quit. Its mission is learning — but its survival depends on engagement.
Duolingo knows its most engaging design features don't always produce the deepest learning. Khan Academy knows its friction-heavy design causes some students to give up entirely. Neither company fully discloses these tradeoffs to users. Is that a problem? Who should decide what "good enough" learning looks like — the company, the teacher, the student, or someone else? There's no clean answer here. Sit with it.
You are now in a position most adults never reach: you can look at any AI learning tool and ask the real question first — not "is this well-designed?" but "what theory of learning is baked into the design, and who decided that was the right theory?" That question reshapes everything you'll read about educational AI from here on.
You've been hired as an independent learning-design auditor. A school district is considering purchasing a new AI homework helper called "StudyPal." You've been given a one-paragraph description of how it works, and you need to evaluate it before the district spends $200,000 on a three-year contract.
StudyPal description: "StudyPal answers student questions instantly and completely, explains every step of the solution, offers encouragement after each correct answer, tracks how many questions a student completes per session, and sends weekly reports to parents showing session length and questions answered."
In 1998, researchers at Carnegie Mellon University published a paper describing a system they called a "cognitive tutor." It wasn't like the AI tutors that would come twenty years later — it didn't use large language models or natural conversation. It was built on something called cognitive modeling, a technique where researchers built a precise mathematical map of how a human expert solves a problem, then compared every student decision against that map in real time.
The system was designed for algebra. Every time a student solved an equation step, the software noted not just whether the answer was right, but which reasoning pathway the student had followed. Over time, it built what researchers called a "knowledge component map" — essentially a fingerprint of exactly which mathematical skills a student had mastered, was developing, or had consistently misunderstood.
By 2002, this system had spun out into a company called Carnegie Learning, and their product — eventually called MATHia — was deployed in real schools. By 2023, MATHia was being used by approximately 600,000 students annually across the United States, in schools from rural Tennessee to urban Chicago. It had accumulated data on student problem-solving behavior stretching back over two decades. No other AI tutoring system in existence has anything close to that data history.
MATHia doesn't feel like ChatGPT. It doesn't have friendly conversation. It doesn't use your name warmly or celebrate your streaks. But underneath its plain interface runs something that Khanmigo and Duolingo Max simply don't have yet: the ability to predict, with documented accuracy, exactly which concept you will struggle with next.
Here's a concrete example of how MATHia's cognitive model works in practice. Suppose you're a seventh-grader working on solving linear equations. You solve ten problems. MATHia doesn't just track your scores. It tracks every intermediate step — every time you moved a variable to the wrong side, every time you divided before subtracting, every time you correctly applied the distributive property but then made an arithmetic error in the next step.
From those ten problems, MATHia has built a micro-profile of your mathematical reasoning. It knows, probabilistically, that you understand what "solving for x" means, that you reliably apply inverse operations, but that you have a systematic error: when negative signs appear on both sides of an equation, you consistently make a sign error. You don't just sometimes get it wrong — you get it wrong in the same direction, every time.
This distinction matters enormously. Random errors usually mean a student wasn't paying attention. Systematic errors mean a student has learned something incorrectly and needs to actively un-learn it. Those two situations require completely different responses from a tutor. MATHia can tell them apart. Most human teachers, managing 30 students simultaneously, cannot reliably do so for every student every day.
A 2019 RAND Corporation study — one of the most rigorous independent evaluations of an AI tutoring system ever conducted — found that students who used MATHia for at least 45 minutes per week showed statistically significant gains equivalent to 6.5 additional months of math learning over a school year compared to control groups. This is among the largest effect sizes ever documented for an educational technology product.
MATHia's power comes from data — and that same data raises questions that school districts, parents, and privacy advocates have been debating seriously since at least 2014, when the state of New York cancelled a data-sharing agreement with an educational technology consortium called inBloom after parents raised concerns about what student behavioral data was being collected, how long it was retained, and who could access it.
When you use MATHia, the system logs timestamps, response times, error patterns, and decision sequences — not just for one session, but across your entire school career. A student who starts using MATHia in fifth grade and continues through eighth grade has generated thousands of data points about their specific cognitive patterns. That profile is extremely detailed. It is also, potentially, very revealing — not just about math ability, but about things like attention, persistence, frustration tolerance, and academic self-confidence.
MATHia tracks "hint abuse" — when students click for hints repeatedly without attempting problems — as a distinct behavioral pattern. It tracks "gaming the system" behaviors like random clicking. It tracks session abandonment rates. These are not just learning metrics; they are behavioral and psychological indicators. The company uses them to improve the product. They are also retained in student records.
In 2020, Carnegie Learning published a privacy policy clarification stating that student data is not sold to third parties and is covered by FERPA, the U.S. federal student privacy law passed in 1974. But FERPA was written before AI-driven behavioral profiling existed. Its protections were designed for paper records and grade transcripts — not for systems that record thousands of micro-decisions per session.
MATHia can predict, based on your behavioral patterns at age 11, how likely you are to struggle with algebra at age 14. That prediction might be accurate. But should a school system be allowed to use it? If a teacher sees your MATHia profile before meeting you, does that help them support you — or does it prejudge you before you've had a chance to surprise anyone? No clean answer. Think about what you want yours to say.
MATHia is, in many ways, the proof of concept that the AI tutoring world is still catching up to. It demonstrated, with real data over real years, that a machine could identify a student's specific cognitive gaps more reliably than most classroom-based assessment. It demonstrated that adaptive pacing — letting each student move at their own speed through a curriculum, not the class average — produces real learning gains. And it demonstrated that you don't need a conversational AI to do this; structured interaction data is enough.
The newer generation of AI tutors — Khanmigo, Duolingo Max, and others covered in this module — are powerful in different ways. They can hold conversations. They can adapt tone. They feel more human. But as of 2024, none of them have MATHia's longitudinal data depth or its decades of documented outcome evidence.
You now understand something that shapes every serious policy debate about AI in education: the most effective AI tutoring system in documented existence is not a chatbot — it's a cognitive model running quietly behind a plain interface in 600 school districts, accumulating data about how millions of children actually think. That's not a small thing to know.
You're a student representative on a school district's newly formed AI Ethics Committee. The district is renewing its MATHia contract and the vendor has offered to give teachers access to a new "full behavioral dashboard" — including hint-abuse patterns, system-gaming flags, session abandonment rates, and a predictive score showing each student's probability of struggling with algebra in two years.
The committee needs to decide: which parts of this data should teachers see, which should be restricted, and who should have the authority to make that call?
In December 2019, the journal Nature Human Behaviour published a paper with an unusual title: "A Randomized Experiment in China Shows AI Can Improve Learning for Struggling Students." The researchers — from Carnegie Mellon University, Zhejiang University, and the Chinese company Squirrel AI — had run one of the largest randomized controlled trials of an AI tutoring system ever conducted.
They recruited 1,000 middle school students across 28 schools in China and randomly assigned them to two groups: one group received instruction from human teachers in the normal way; the other received instruction from Squirrel AI's adaptive tutoring system. Both groups covered the same math and science curriculum over the same period. After the study, both groups took standardized tests.
The result: students using Squirrel AI significantly outperformed students taught exclusively by human teachers. Not slightly — the gains were statistically large. The AI group also showed improvements for struggling students that were especially pronounced, suggesting the system was particularly effective at reaching students who typically fall behind in traditional classrooms.
The paper was peer-reviewed and published in one of the world's most respected scientific journals. It was also almost completely ignored by mainstream Western media and education policy circles. Derek Lomas, a learning scientist at Delft University of Technology who reviewed the study, wrote in 2020: "If a drug showed these effect sizes, we'd be talking about it on every front page. Because it's an AI education product, we're barely talking about it at all."
Squirrel AI was founded in 2014 by Derek Haoyang Li, a former education executive who set out to build what he described as "a clone of the world's best human tutor, available to every student." By 2023, Squirrel AI operated learning centers across more than 2,000 locations in China, with over 3 million registered students. It is by some measures the largest AI tutoring operation in the world.
The system works through an approach called fine-grained knowledge decomposition. Where MATHia might track a few hundred "knowledge components" in algebra, Squirrel AI has reportedly decomposed a single high school math curriculum into over 10,000 distinct micro-concepts. Before a student ever solves a problem, the system runs a diagnostic that maps their current knowledge state against this 10,000-node map and identifies which concepts they know, which they almost know, and which they have never encountered.
From that starting point, Squirrel AI constructs an individualized learning path through the curriculum. Each student's path is different — not because the destination is different, but because the route is chosen based on their specific knowledge gaps. Two students sitting side-by-side in a Squirrel AI learning center might be working on completely different concepts at any given moment, each moving toward the same exam objective from a different angle.
This is fundamentally different from what Khanmigo or Duolingo Max do. Those tools adapt their tone and their scaffolding — they adjust how hard they push, or how they explain something. Squirrel AI adapts what it teaches next, in a more granular way than any other system currently deployed at scale.
By this point in the module, you've studied three very different AI learning systems. Here's a direct comparison of how they differ on the dimensions that matter most:
| Dimension | Khanmigo | Duolingo Max | MATHia | Squirrel AI |
|---|---|---|---|---|
| Core method | Socratic questioning | Engagement + conversation | Cognitive modeling | Knowledge decomposition + adaptive paths |
| What it adapts | How it responds (tone, questions) | Explanations and practice type | Difficulty and concept sequencing | Which concept is taught next |
| Strongest evidence | Early pilots (2023–24) | Engagement/retention data | 2019 RAND RCT (6.5 months gain) | 2019 Nature HB RCT (vs. human teachers) |
| Main risk | Some students disengage from friction | Shallow learning despite high engagement | Behavioral data profiling | Scale of data collection; replacement of teachers |
| Business model | Nonprofit | Publicly traded, subscription | B2B school contracts | Consumer learning centers (China) |
No single system is best at everything. Squirrel AI has the most impressive outcome data — but it also operates in a context where it functions more as a replacement for classroom instruction than a supplement to it. That's a meaningful difference in what the system is for.
Squirrel AI's founder has said publicly that he believes AI tutoring will eventually make classroom instruction with human teachers unnecessary for most academic subjects. His reasoning is blunt: a human teacher managing 30 students simultaneously cannot provide individualized instruction to each child. An AI can. On measurable learning outcomes for academic content, an AI that tracks 10,000 knowledge components will eventually outperform a human teacher for most students, most of the time.
This is not a fringe view. A 2023 paper in the journal Educational Researcher surveyed 150 leading learning scientists and found that roughly 40% agreed that "AI systems will outperform average human teachers on academic outcome measures within 15 years." Roughly 35% disagreed. The remaining 25% said the question was unanswerable because it depended on what "outperform" meant.
If an AI system genuinely produces better academic outcomes than a human teacher — measurably, reliably, for most students — is that sufficient justification for replacing teachers? What does a teacher do that isn't captured in academic outcome measurements? And who gets to decide what school is for: measurable learning, or something else? These questions are being debated in education ministries and school boards right now. There is no consensus. There may not be one anytime soon.
You now understand something most adults — including most education policymakers — haven't fully grappled with: the most rigorous evidence in AI tutoring research comes not from Silicon Valley or from American classrooms, but from a Chinese company that's been running controlled trials since 2014. The conversation about AI replacing teachers isn't hypothetical. It's already happening. Knowing that makes you a more informed reader of every headline about the "future" of AI in education — because for millions of students, that future is already the present.
You're a student advisor to a national education committee. A government minister has just proposed a pilot program to replace one-third of classroom teaching time with Squirrel AI in 500 schools over five years. The minister cites the 2019 Nature study and argues this will dramatically improve academic outcomes, especially for struggling students in under-resourced schools.
Your committee needs a written critique of the proposal — not a rejection of the evidence, but a serious analysis of what the proposal gets right, what it misses, and what conditions would need to be true for such a policy to be ethical.
In the fall of 2023, a high school junior named Amara in suburban Atlanta was assigned to use Khanmigo for SAT prep every evening for six weeks. She had her laptop open, the interface loaded, and she was diligently answering Khanmigo's Socratic questions on reading comprehension. She was making measurable progress on the practice problems. But she was also quietly struggling with something Khanmigo couldn't see: she didn't believe she was a math person.
The belief wasn't irrational. It had been built over years — a fifth-grade teacher who called her out when she got an answer wrong, a sixth-grade class where the boys seemed to get called on more, a middle school where the "gifted" track was mostly white students and she was one of very few Black girls in the advanced group. By eleventh grade, she had developed what psychologists call math anxiety — a real cognitive phenomenon in which the stress of math problems actually impairs working memory, making the math harder than it would otherwise be.
Khanmigo gave Amara hints. It asked her guiding questions. It waited patiently. But it never asked: why do you hesitate for 30 seconds before every problem? It never noticed the pattern. It had no model for what was happening in her head outside the math itself. The system was measuring her knowledge gaps. It had no instrument for measuring what Claude Steele — the Stanford psychologist who identified the concept in 1995 — calls stereotype threat: the way being a member of a group that's stereotyped as less capable actually reduces performance in the moment, regardless of underlying ability.
Amara eventually got a human tutor — a Black woman who had navigated the same SAT in the same suburb fifteen years earlier. Within three sessions, Amara's performance on practice tests improved significantly. The content knowledge wasn't the bottleneck. The belief system was. No AI tutor currently deployed addresses that.
After studying Khanmigo, Duolingo Max, MATHia, and Squirrel AI, a pattern emerges. Each system is sophisticated in its own way. Each has documented evidence of effectiveness. And all four share the same three blind spots.
Gap 1: Identity and belonging. All four systems treat the learner as a cognitive agent — a brain processing information. None of them have a model of the learner as a social person whose belief about whether they belong in a subject affects how their brain processes information about that subject. Stereotype threat, impostor syndrome, and math identity are real, documented phenomena with real effects on academic performance. No current AI tutor tracks them or responds to them. A system that measures ten thousand knowledge components but zero identity components is missing something significant.
Gap 2: Transfer and application. Every system in this module is good at teaching something in a specific context. Duolingo Max is good at Spanish vocabulary. MATHia is good at algebra procedure. Squirrel AI is good at exam-targeted content. What none of them have demonstrated is helping students apply learning to genuinely novel contexts — situations that look completely different from anything practiced in the system. This is called far transfer, and it's arguably the most important kind of learning. It's also the hardest to teach and the hardest to measure. Current AI tutors largely avoid the problem.
Gap 3: Metacognition. Metacognition means thinking about your own thinking — knowing how you learn, recognizing when you're confused versus when you just think you're confused, understanding your own error patterns. Decades of learning research show that metacognitive awareness is one of the strongest predictors of long-term academic success. Students who can accurately judge their own understanding outperform equally intelligent students who can't. Of the four systems studied, only MATHia even attempts to build a metacognitive model — and it does so indirectly, by tracking error patterns rather than directly developing the learner's self-awareness.
A 2022 meta-analysis in Educational Psychology Review examined 93 studies of AI tutoring systems and found that while AI tools showed consistent gains on near-transfer tasks (problems similar to those practiced), effect sizes for far-transfer tasks were near zero. The authors concluded: "Current AI tutoring systems are optimized for performance on measurable tasks in structured domains. They have not demonstrated the ability to develop the flexible thinking required for genuinely novel problem-solving."
Understanding what AI tutors can't do is not an argument against using them. It's an argument for using them with clear eyes — knowing what you're getting and what you still need to supply yourself.
Here's a practical framework. When you're using any AI learning tool, you're getting: accurate knowledge-gap identification, patient and infinitely available practice, personalized sequencing, and feedback without judgment. These are genuinely valuable, and no human tutor is consistently better at all of them.
What you're not getting: someone who sees you as a whole person, someone who can identify when your problem is belief rather than knowledge, someone who can push you to apply ideas in genuinely unfamiliar contexts, or someone who helps you build an accurate map of your own thinking.
Those missing pieces are what a good teacher, mentor, or thinking partner does. AI tutors don't replace them. They offload the parts of learning that are about information transfer and practice. The parts that are about identity, meaning, and flexible thinking still require a human — or, in some cases, just time and experience.
Schools in under-resourced communities are more likely to adopt AI tutors as substitutes for human instruction — not because they believe AI is better, but because they can't afford enough qualified human teachers. This means the limitations described in this lesson fall disproportionately on students who are already disadvantaged. If you know that AI tutors can't address stereotype threat, identity, or far transfer — and you know that wealthier schools will use AI as a supplement while poorer schools use it as a replacement — what obligation does that knowledge create? For researchers? For policymakers? For companies building these tools?
You have now completed a comparative analysis that working education researchers take years to build. You can name the mechanisms behind four major AI tutoring systems, identify what evidence exists for each, locate the specific gaps they share, and recognize how business models, learning theory, and data ethics all intersect in a tool that looks, from the outside, like just an app that helps you with homework.
That is not a small thing. Every time someone tells you "AI is going to transform education," you now know the right questions to ask: Which AI? What theory of learning? For whom? At what cost? Measured how? Those questions are the difference between being a passive consumer of a tool and being someone who can actually evaluate whether it's doing what it claims.
You've been invited to pitch a new AI tutoring concept to a foundation that funds educational technology. Your pitch needs to directly address at least two of the three gaps identified in Lesson 4 — identity and belonging, far transfer, or metacognition. You also need to explain which existing tool your design is most similar to, and why yours does something none of them currently do.
You have three minutes of pitch time. Your AI colleague has read every criticism of AI tutoring in existence and will challenge every claim you make.