Module 4 · Lesson 1

Khan Academy's Khanmigo vs. Duolingo's Max

Two of the most-used AI tutors on Earth — built on similar technology, designed around completely different theories of how people learn.

When two tools are powered by the same AI engine, why do they feel so completely different to use?

In March 2023, Khan Academy sent an email to a small group of U.S. teachers. It said: "You have been selected to test Khanmigo — an AI that will talk to your students about math, history, and science, in real time." The email mentioned that the system was powered by GPT-4, the same model behind ChatGPT. But it also said something unusual: "Khanmigo will never give students the answer."

That phrase stopped a lot of teachers cold. The whole point of an AI tutor, they assumed, was to get answers faster. But Khan Academy's founder, Sal Khan, had a different theory. He believed the most valuable thing an AI could do for a student was ask the right question — not supply the right answer. Khanmigo was deliberately designed to frustrate students just enough to make them think.

Six months later, in September 2023, Duolingo launched something called Duolingo Max in the U.S., Japan, and the UK. It also ran on GPT-4. But it felt nothing like Khanmigo. Duolingo Max gave you instant explanations. It praised you. It used your name. It turned grammar lessons into something that felt closer to a text message conversation with a patient friend than a classroom session. Millions of users adopted it within weeks.

Same underlying model. Two completely different experiences. And buried in that difference is one of the most important questions in AI education: Should an AI tutor tell you, or should it ask you?

The Socratic Engine: How Khanmigo Actually Works

Khanmigo is built around what educators call the Socratic method — named after the ancient Greek philosopher Socrates, who famously never lectured. Instead, he asked questions that forced his students to examine their own assumptions until they figured things out themselves. Socrates believed that real understanding only happens when you arrive at an idea on your own. Being told an answer leaves no trace; discovering it yourself leaves a mark.

Khan Academy encoded this philosophy directly into Khanmigo's system prompt — the set of instructions that tells the AI how to behave. When a student asks Khanmigo "What's the answer to this algebra problem?", the system is specifically instructed to respond with a question like "What do you think the first step should be?" or "Let's look at what we already know — what does the equation tell us?"

This is a deliberate friction design. The AI is engineered to create a tiny obstacle. Not a wall — just enough resistance to make you pause. The theory is that pausing activates a different kind of brain processing: you're no longer passively receiving information, you're actively constructing it.

Socratic method — A teaching technique that uses questions instead of explanations, forcing the learner to reason toward an answer rather than receive one.

Friction design — Intentionally adding small obstacles to a learning experience so the learner has to work slightly harder — which research shows improves memory and understanding.

In pilot studies run by Khan Academy with U.S. students during the 2023–2024 school year, students who used Khanmigo for at least 30 minutes per week showed measurable improvements on standardized math assessments compared to students using traditional Khan Academy videos alone. The gains were modest — roughly 13% better performance — but they were real, and they were consistent with what decades of human tutoring research already showed: being pushed to think is more effective than being given answers.

The Engagement Engine: How Duolingo Max Works

Duolingo Max operates on a completely different theory. Its designers weren't primarily trying to optimize how deeply you understand Spanish grammar. They were trying to solve a different problem: most people quit.

Duolingo has published its own internal data showing that the average new user abandons the app within two weeks. Learning a language takes hundreds of hours. Most people, no matter how motivated they are on Day 1, simply stop. Duolingo Max was built to address dropout — the gap between wanting to learn and actually continuing to learn.

To do this, Duolingo Max uses GPT-4 for two specific features. The first, called Explain My Answer, lets users ask why their translation was wrong and get a conversational, personalized explanation instead of a generic grammar rule. The second, called Roleplay, lets users have open-ended AI conversations as characters — ordering coffee in Paris, booking a hotel in Tokyo — to practice language in context.

What Duolingo Max doesn't do is push you toward discomfort. It rewards. It celebrates streaks. It uses your name. Its mascot, the green owl Duo, sends you cheerful notifications. The AI is warm, immediate, and frictionless. The theory is that consistency beats intensity — that showing up every day for a short, enjoyable session produces more actual language acquisition than an occasional difficult deep dive.

What the Research Says

A 2021 study in the journal Language Learning & Technology found that Duolingo users who engaged daily for 34 hours total showed vocabulary gains equivalent to one semester of college-level Spanish instruction. But the same study found those gains were shallow — strong on recognition, weak on production and grammar. The students knew words they'd seen; they struggled to construct sentences they'd never practiced.

This exposes a fundamental tension in learning tool design. Engagement — keeping someone using a tool — is not the same as learning. A tool can score very high on one and mediocre on the other. The question of which matters more depends on who you're designing for: the student who might not come back at all, or the student who needs to go deep.

The Design Decision You Don't See

Here's what most people never think about: before a single student touches either tool, hundreds of decisions have already been made. Someone at Khan Academy decided that math understanding was worth short-term frustration. Someone at Duolingo decided that daily streaks were worth shallower learning. These decisions came from educational philosophies, business models, and assumptions about users — not from the AI itself.

GPT-4, the underlying model, has no opinion about how learning should work. It just follows its instructions. The instructions are written by humans, and those humans bring their own ideas about what education is for.

Khan Academy is a nonprofit. It doesn't need users to pay to stay alive. It can afford to frustrate you a little because its mission is learning, not retention. Duolingo is a publicly traded company that reported $531 million in revenue in 2023, with stock analysts watching monthly active user numbers closely. It cannot afford for users to quit. Its mission is learning — but its survival depends on engagement.

The Ethical Question

Duolingo knows its most engaging design features don't always produce the deepest learning. Khan Academy knows its friction-heavy design causes some students to give up entirely. Neither company fully discloses these tradeoffs to users. Is that a problem? Who should decide what "good enough" learning looks like — the company, the teacher, the student, or someone else? There's no clean answer here. Sit with it.

You are now in a position most adults never reach: you can look at any AI learning tool and ask the real question first — not "is this well-designed?" but "what theory of learning is baked into the design, and who decided that was the right theory?" That question reshapes everything you'll read about educational AI from here on.

Lesson 1 Quiz

Khanmigo vs. Duolingo Max — test your reasoning, not just your recall.

1. Khanmigo was intentionally designed to avoid giving students direct answers. What learning principle drives this design choice?

Correct. Khanmigo's "never give the answer" rule is a direct implementation of Socratic teaching — the AI asks questions rather than supplying answers, forcing the learner to construct understanding actively.

Not quite. Review the section on Khanmigo's design philosophy — the key term is the teaching method named after an ancient Greek philosopher who never lectured.

2. Both Khanmigo and Duolingo Max run on GPT-4. Yet they feel completely different to use. What accounts for this difference?

Exactly right. The AI model itself has no built-in teaching philosophy. It follows instructions. The humans who write those instructions are the ones who decide how the tool behaves.

Not quite. The lesson specifically explains that the same GPT-4 model powers both tools — the difference comes entirely from the instructions humans give it.

3. A student uses Duolingo Max for three months and builds a 90-day streak. She scores well on vocabulary recognition tests but struggles to write original sentences. What does this scenario best illustrate?

Yes. This is precisely the tension the lesson surfaces — Duolingo's 2021 research study found exactly this pattern. Engagement tools can produce real but shallow gains.

Think about the broader point the lesson makes about the difference between engagement and learning. The student did learn something — but what kind of learning, and what's missing?

4. Why might Khan Academy's nonprofit status affect how its AI tutor is designed, compared to Duolingo's design as a publicly traded company?

Correct. Business model shapes product design. When a company's survival depends on monthly active users, it will design for retention. When it doesn't, it can design purely for learning outcomes — even if that means some users quit.

The lesson connects business structure to design decisions directly. Think about what Duolingo has to track for investors versus what Khan Academy has to track for its mission.

5. "Friction design" in an AI learning tool means intentionally making the experience harder in some way. Based on the lesson, when is friction most likely to be beneficial?

Right. Friction is a tool for depth, not speed. It works best when the priority is durable understanding — when you need a student to actually internalize something, not just recognize it.

Think about what friction does to the learner's brain. The lesson says pausing "activates a different kind of brain processing." What kind — passive receiving, or active constructing?

Lab 1: The Design Auditor

You're auditing a new AI learning tool. Your job is to figure out what theory of learning is buried in its design — before the company tells you.

Your Role

You've been hired as an independent learning-design auditor. A school district is considering purchasing a new AI homework helper called "StudyPal." You've been given a one-paragraph description of how it works, and you need to evaluate it before the district spends $200,000 on a three-year contract.

StudyPal description: "StudyPal answers student questions instantly and completely, explains every step of the solution, offers encouragement after each correct answer, tracks how many questions a student completes per session, and sends weekly reports to parents showing session length and questions answered."

Your opening move: Tell your AI colleague what you think StudyPal's underlying theory of learning is — and whether the metrics it tracks actually measure learning. Then push the analysis further together.

AI Colleague — Design Analysis Lab

Lab 1

You're the auditor. I'm your research partner — I'll push back on your analysis and ask harder questions, but I won't do your job for you. What's your read on StudyPal's learning theory? Start with what you notice about what it measures and what it doesn't.

Module 4 · Lesson 2

Carnegie Learning's MATHia: The Oldest AI Tutor Still Running

Before ChatGPT, before Khanmigo, there was a system that had been tracking your thinking errors for twenty years — and it's still in 600 U.S. school districts today.

What does an AI tutor know about you after tracking 10,000 of your decisions? And should it be allowed to know that much?

In 1998, researchers at Carnegie Mellon University published a paper describing a system they called a "cognitive tutor." It wasn't like the AI tutors that would come twenty years later — it didn't use large language models or natural conversation. It was built on something called cognitive modeling, a technique where researchers built a precise mathematical map of how a human expert solves a problem, then compared every student decision against that map in real time.

The system was designed for algebra. Every time a student solved an equation step, the software noted not just whether the answer was right, but which reasoning pathway the student had followed. Over time, it built what researchers called a "knowledge component map" — essentially a fingerprint of exactly which mathematical skills a student had mastered, was developing, or had consistently misunderstood.

By 2002, this system had spun out into a company called Carnegie Learning, and their product — eventually called MATHia — was deployed in real schools. By 2023, MATHia was being used by approximately 600,000 students annually across the United States, in schools from rural Tennessee to urban Chicago. It had accumulated data on student problem-solving behavior stretching back over two decades. No other AI tutoring system in existence has anything close to that data history.

MATHia doesn't feel like ChatGPT. It doesn't have friendly conversation. It doesn't use your name warmly or celebrate your streaks. But underneath its plain interface runs something that Khanmigo and Duolingo Max simply don't have yet: the ability to predict, with documented accuracy, exactly which concept you will struggle with next.

What Cognitive Modeling Actually Does

Here's a concrete example of how MATHia's cognitive model works in practice. Suppose you're a seventh-grader working on solving linear equations. You solve ten problems. MATHia doesn't just track your scores. It tracks every intermediate step — every time you moved a variable to the wrong side, every time you divided before subtracting, every time you correctly applied the distributive property but then made an arithmetic error in the next step.

From those ten problems, MATHia has built a micro-profile of your mathematical reasoning. It knows, probabilistically, that you understand what "solving for x" means, that you reliably apply inverse operations, but that you have a systematic error: when negative signs appear on both sides of an equation, you consistently make a sign error. You don't just sometimes get it wrong — you get it wrong in the same direction, every time.

This distinction matters enormously. Random errors usually mean a student wasn't paying attention. Systematic errors mean a student has learned something incorrectly and needs to actively un-learn it. Those two situations require completely different responses from a tutor. MATHia can tell them apart. Most human teachers, managing 30 students simultaneously, cannot reliably do so for every student every day.

Cognitive modeling — Building a mathematical map of how an expert solves a problem, then comparing every step of a learner's work against that map to identify exactly where their understanding breaks down.

Knowledge component — A single, specific skill or concept in a subject — like "applying the distributive property" or "recognizing equivalent fractions." MATHia tracks mastery of hundreds of these individually.

A 2019 RAND Corporation study — one of the most rigorous independent evaluations of an AI tutoring system ever conducted — found that students who used MATHia for at least 45 minutes per week showed statistically significant gains equivalent to 6.5 additional months of math learning over a school year compared to control groups. This is among the largest effect sizes ever documented for an educational technology product.

The Data Trail You Leave Behind

MATHia's power comes from data — and that same data raises questions that school districts, parents, and privacy advocates have been debating seriously since at least 2014, when the state of New York cancelled a data-sharing agreement with an educational technology consortium called inBloom after parents raised concerns about what student behavioral data was being collected, how long it was retained, and who could access it.

When you use MATHia, the system logs timestamps, response times, error patterns, and decision sequences — not just for one session, but across your entire school career. A student who starts using MATHia in fifth grade and continues through eighth grade has generated thousands of data points about their specific cognitive patterns. That profile is extremely detailed. It is also, potentially, very revealing — not just about math ability, but about things like attention, persistence, frustration tolerance, and academic self-confidence.

What the System Sees

MATHia tracks "hint abuse" — when students click for hints repeatedly without attempting problems — as a distinct behavioral pattern. It tracks "gaming the system" behaviors like random clicking. It tracks session abandonment rates. These are not just learning metrics; they are behavioral and psychological indicators. The company uses them to improve the product. They are also retained in student records.

In 2020, Carnegie Learning published a privacy policy clarification stating that student data is not sold to third parties and is covered by FERPA, the U.S. federal student privacy law passed in 1974. But FERPA was written before AI-driven behavioral profiling existed. Its protections were designed for paper records and grade transcripts — not for systems that record thousands of micro-decisions per session.

The Ethical Question

MATHia can predict, based on your behavioral patterns at age 11, how likely you are to struggle with algebra at age 14. That prediction might be accurate. But should a school system be allowed to use it? If a teacher sees your MATHia profile before meeting you, does that help them support you — or does it prejudge you before you've had a chance to surprise anyone? No clean answer. Think about what you want yours to say.

Why MATHia Matters to the Future of AI Tutoring

MATHia is, in many ways, the proof of concept that the AI tutoring world is still catching up to. It demonstrated, with real data over real years, that a machine could identify a student's specific cognitive gaps more reliably than most classroom-based assessment. It demonstrated that adaptive pacing — letting each student move at their own speed through a curriculum, not the class average — produces real learning gains. And it demonstrated that you don't need a conversational AI to do this; structured interaction data is enough.

The newer generation of AI tutors — Khanmigo, Duolingo Max, and others covered in this module — are powerful in different ways. They can hold conversations. They can adapt tone. They feel more human. But as of 2024, none of them have MATHia's longitudinal data depth or its decades of documented outcome evidence.

You now understand something that shapes every serious policy debate about AI in education: the most effective AI tutoring system in documented existence is not a chatbot — it's a cognitive model running quietly behind a plain interface in 600 school districts, accumulating data about how millions of children actually think. That's not a small thing to know.

Lesson 2 Quiz

MATHia and cognitive modeling — apply what you've learned to new scenarios.

1. MATHia identifies "systematic errors" and treats them differently from "random errors." Why does this distinction matter for how a tutor should respond?

Correct. A systematic error means you've learned something the wrong way — that's actually harder to fix than a careless mistake, because you have to un-learn a pattern before building the right one.

Re-read the section on cognitive modeling. The lesson explains why random and systematic errors require "completely different responses" — what's the difference in what they reveal about the student?

2. The 2019 RAND Corporation study found MATHia produced learning gains equivalent to 6.5 additional months of math learning. What condition was required for this result?

Right. Dosage matters — even powerful tools have a threshold below which they don't produce measurable effects. 45 minutes per week was the documented minimum in this study.

The lesson states the specific condition directly. Check the RAND study section — what minimum usage was required?

3. A school district gives all seventh-grade teachers access to their students' MATHia profiles before the school year begins. A student named Priya had a strong MATHia record in sixth grade. Her new teacher immediately seats her in the "advanced" group. What risk does this scenario illustrate?

Exactly. The ethical question in the lesson asks this directly: if a teacher sees your profile before meeting you, does it help them support you — or does it prejudge you "before you've had a chance to surprise anyone"?

This scenario connects to the ethical question the lesson poses about data profiles. The issue isn't accuracy — it's about what happens when a prediction about you becomes a fact about you before you've acted.

4. FERPA is a U.S. privacy law designed to protect student records. The lesson argues it may not adequately protect students in AI tutoring systems. What is the core reason for this gap?

Correct. Laws often lag behind technology. FERPA protects the kinds of records that existed when it was passed — grades, transcripts, disciplinary files. It wasn't built to handle AI behavioral profiling.

The lesson explains this gap specifically. Think about when FERPA was written and what kinds of records it was designed to protect.

5. MATHia has two decades of documented outcome data, while Khanmigo and Duolingo Max do not. What does this mean for a school making a purchasing decision in 2024?

Right. Evidence of effectiveness is a genuine advantage, but it's not the only thing that matters. Conversational AI can reach students in ways cognitive models cannot. A sophisticated buyer weighs both, rather than dismissing one.

The lesson doesn't declare one tool "best" — it frames this as a genuine tradeoff. What does MATHia have that newer tools lack? What do newer tools offer that MATHia doesn't?

Lab 2: The Data Investigator

MATHia tracks thousands of behavioral data points per student. You're deciding what a school district should and shouldn't be allowed to see.

Your Role

You're a student representative on a school district's newly formed AI Ethics Committee. The district is renewing its MATHia contract and the vendor has offered to give teachers access to a new "full behavioral dashboard" — including hint-abuse patterns, system-gaming flags, session abandonment rates, and a predictive score showing each student's probability of struggling with algebra in two years.

The committee needs to decide: which parts of this data should teachers see, which should be restricted, and who should have the authority to make that call?

Start by stating your position: which data points would you allow teachers to see, and which would you restrict — and why? Your AI colleague will challenge your reasoning.

AI Colleague — Data Ethics Lab

Lab 2

I've read the committee brief. Before you give me your position, tell me this: do you think there's a difference between data that helps a teacher teach better and data that just tells a teacher what to expect? Because that distinction might matter more than which specific fields you allow. What's your opening position?

Module 4 · Lesson 3

Squirrel AI: The System China Built to Replace the Classroom

In 2019, a Chinese company claimed their AI tutor outperformed human teachers in a randomized controlled trial. The results were published in a major journal. Almost no one in the West noticed.

If an AI can outperform a human teacher on measurable learning outcomes — should it replace the teacher?

In December 2019, the journal Nature Human Behaviour published a paper with an unusual title: "A Randomized Experiment in China Shows AI Can Improve Learning for Struggling Students." The researchers — from Carnegie Mellon University, Zhejiang University, and the Chinese company Squirrel AI — had run one of the largest randomized controlled trials of an AI tutoring system ever conducted.

They recruited 1,000 middle school students across 28 schools in China and randomly assigned them to two groups: one group received instruction from human teachers in the normal way; the other received instruction from Squirrel AI's adaptive tutoring system. Both groups covered the same math and science curriculum over the same period. After the study, both groups took standardized tests.

The result: students using Squirrel AI significantly outperformed students taught exclusively by human teachers. Not slightly — the gains were statistically large. The AI group also showed improvements for struggling students that were especially pronounced, suggesting the system was particularly effective at reaching students who typically fall behind in traditional classrooms.

The paper was peer-reviewed and published in one of the world's most respected scientific journals. It was also almost completely ignored by mainstream Western media and education policy circles. Derek Lomas, a learning scientist at Delft University of Technology who reviewed the study, wrote in 2020: "If a drug showed these effect sizes, we'd be talking about it on every front page. Because it's an AI education product, we're barely talking about it at all."

What Squirrel AI Is — And How It Works

Squirrel AI was founded in 2014 by Derek Haoyang Li, a former education executive who set out to build what he described as "a clone of the world's best human tutor, available to every student." By 2023, Squirrel AI operated learning centers across more than 2,000 locations in China, with over 3 million registered students. It is by some measures the largest AI tutoring operation in the world.

The system works through an approach called fine-grained knowledge decomposition. Where MATHia might track a few hundred "knowledge components" in algebra, Squirrel AI has reportedly decomposed a single high school math curriculum into over 10,000 distinct micro-concepts. Before a student ever solves a problem, the system runs a diagnostic that maps their current knowledge state against this 10,000-node map and identifies which concepts they know, which they almost know, and which they have never encountered.

From that starting point, Squirrel AI constructs an individualized learning path through the curriculum. Each student's path is different — not because the destination is different, but because the route is chosen based on their specific knowledge gaps. Two students sitting side-by-side in a Squirrel AI learning center might be working on completely different concepts at any given moment, each moving toward the same exam objective from a different angle.

Fine-grained knowledge decomposition — Breaking a subject down into thousands of very small, specific sub-concepts — so small that the system can identify not just "student doesn't understand fractions" but "student understands equivalent fractions but not fraction division."

Adaptive learning path — A customized route through a curriculum built specifically for one student, based on their current knowledge state — as opposed to a standard sequence that every student follows in the same order.

This is fundamentally different from what Khanmigo or Duolingo Max do. Those tools adapt their tone and their scaffolding — they adjust how hard they push, or how they explain something. Squirrel AI adapts what it teaches next, in a more granular way than any other system currently deployed at scale.

The Comparison Table: Where Each System Wins

By this point in the module, you've studied three very different AI learning systems. Here's a direct comparison of how they differ on the dimensions that matter most:

Dimension	Khanmigo	Duolingo Max	MATHia	Squirrel AI
Core method	Socratic questioning	Engagement + conversation	Cognitive modeling	Knowledge decomposition + adaptive paths
What it adapts	How it responds (tone, questions)	Explanations and practice type	Difficulty and concept sequencing	Which concept is taught next
Strongest evidence	Early pilots (2023–24)	Engagement/retention data	2019 RAND RCT (6.5 months gain)	2019 Nature HB RCT (vs. human teachers)
Main risk	Some students disengage from friction	Shallow learning despite high engagement	Behavioral data profiling	Scale of data collection; replacement of teachers
Business model	Nonprofit	Publicly traded, subscription	B2B school contracts	Consumer learning centers (China)

No single system is best at everything. Squirrel AI has the most impressive outcome data — but it also operates in a context where it functions more as a replacement for classroom instruction than a supplement to it. That's a meaningful difference in what the system is for.

The Replacement Question

Squirrel AI's founder has said publicly that he believes AI tutoring will eventually make classroom instruction with human teachers unnecessary for most academic subjects. His reasoning is blunt: a human teacher managing 30 students simultaneously cannot provide individualized instruction to each child. An AI can. On measurable learning outcomes for academic content, an AI that tracks 10,000 knowledge components will eventually outperform a human teacher for most students, most of the time.

This is not a fringe view. A 2023 paper in the journal Educational Researcher surveyed 150 leading learning scientists and found that roughly 40% agreed that "AI systems will outperform average human teachers on academic outcome measures within 15 years." Roughly 35% disagreed. The remaining 25% said the question was unanswerable because it depended on what "outperform" meant.

The Ethical Question

If an AI system genuinely produces better academic outcomes than a human teacher — measurably, reliably, for most students — is that sufficient justification for replacing teachers? What does a teacher do that isn't captured in academic outcome measurements? And who gets to decide what school is for: measurable learning, or something else? These questions are being debated in education ministries and school boards right now. There is no consensus. There may not be one anytime soon.

You now understand something most adults — including most education policymakers — haven't fully grappled with: the most rigorous evidence in AI tutoring research comes not from Silicon Valley or from American classrooms, but from a Chinese company that's been running controlled trials since 2014. The conversation about AI replacing teachers isn't hypothetical. It's already happening. Knowing that makes you a more informed reader of every headline about the "future" of AI in education — because for millions of students, that future is already the present.

Lesson 3 Quiz

Squirrel AI and the replacement question — apply the concepts to real scenarios.

1. The 2019 Nature Human Behaviour study comparing Squirrel AI to human teachers was notable partly because of how it was designed. What made it scientifically rigorous?

Correct. Random assignment is the gold standard in research because it eliminates selection bias — you can't explain away the results by saying "well, the motivated students chose the AI." The random assignment means the groups should have been equivalent at the start.

Think about what "randomized controlled trial" means. Why does random assignment matter — what problem does it solve?

2. Squirrel AI uses "fine-grained knowledge decomposition" with over 10,000 micro-concepts. How is this different from what MATHia does?

Right. Both use knowledge modeling, but the granularity is very different. More micro-concepts means finer-grained identification of exactly where a student is and isn't ready — like the difference between a map with country outlines and one with every street.

Both MATHia and Squirrel AI use knowledge-component tracking. The comparison table in the lesson shows they use similar methods but at very different scales of granularity.

3. Two students sit next to each other in a Squirrel AI learning center. Student A is working on fraction multiplication; Student B is working on integer division. Both are in the same math class. What does this tell you about how Squirrel AI sequences content?

Exactly. Squirrel AI adapts what is taught next, not just how it's taught. Two students moving toward the same exam objective may need to travel very different routes to get there based on their individual knowledge gaps.

The lesson describes this directly. Squirrel AI doesn't adjust tone or difficulty — it adjusts which concept is taught next. What does that mean for sequencing?

4. A learning scientist argues: "We should deploy Squirrel AI in all schools immediately — the evidence clearly shows it outperforms human teachers on academic outcomes." What is the strongest counter-argument?

This is the strongest counter-argument because it doesn't deny the evidence — it questions what the evidence measures. If the test only measures academic content knowledge, and teachers do other valuable things, then "outperforms on this test" doesn't mean "should replace."

Think about the ethical question the lesson raises: "What does a teacher do that isn't captured in academic outcome measurements?" The strongest counter-argument engages with what the study measured — and what it didn't.

5. Derek Lomas compared ignoring Squirrel AI's results to ignoring a drug with large effect sizes. What point was he making about how education research is treated?

Correct. Lomas was pointing out a selective attention bias — we don't treat all evidence equally, and our choices about what to amplify and what to ignore have real consequences for what gets implemented and what gets ignored.

Lomas wasn't making an argument about regulation — he was making an argument about how we pay attention. What does it mean that the same size of effect gets very different coverage depending on where it appears?

Lab 3: The Policy Critic

A school minister just proposed replacing 30% of classroom teachers with Squirrel AI within five years. You have to write the public response — and actually defend it.

Your Role

You're a student advisor to a national education committee. A government minister has just proposed a pilot program to replace one-third of classroom teaching time with Squirrel AI in 500 schools over five years. The minister cites the 2019 Nature study and argues this will dramatically improve academic outcomes, especially for struggling students in under-resourced schools.

Your committee needs a written critique of the proposal — not a rejection of the evidence, but a serious analysis of what the proposal gets right, what it misses, and what conditions would need to be true for such a policy to be ethical.

Start by telling your AI colleague what you think the minister got right — then push into what the proposal misses. Don't just oppose it. Engage with the evidence first.

AI Colleague — Policy Analysis Lab

Lab 3

Before you critique the minister, I want to make sure you're engaging with the actual evidence. The 2019 study is real, peer-reviewed, and published in a top journal. A lot of people dismiss AI-in-education proposals without engaging with the evidence at all — that's not a serious critique, it's just resistance. So: what does the minister have right? Start there, then we'll dig into the gaps.

Module 4 · Lesson 4

What No Tool Does Well — And What That Means for You

After three tools, three theories, and three sets of evidence — the most important question isn't which AI tutor is best. It's what every AI tutor is missing.

If you know what a tool can't do, you can decide when not to use it. That might be the most powerful skill in this entire course.

In the fall of 2023, a high school junior named Amara in suburban Atlanta was assigned to use Khanmigo for SAT prep every evening for six weeks. She had her laptop open, the interface loaded, and she was diligently answering Khanmigo's Socratic questions on reading comprehension. She was making measurable progress on the practice problems. But she was also quietly struggling with something Khanmigo couldn't see: she didn't believe she was a math person.

The belief wasn't irrational. It had been built over years — a fifth-grade teacher who called her out when she got an answer wrong, a sixth-grade class where the boys seemed to get called on more, a middle school where the "gifted" track was mostly white students and she was one of very few Black girls in the advanced group. By eleventh grade, she had developed what psychologists call math anxiety — a real cognitive phenomenon in which the stress of math problems actually impairs working memory, making the math harder than it would otherwise be.

Khanmigo gave Amara hints. It asked her guiding questions. It waited patiently. But it never asked: why do you hesitate for 30 seconds before every problem? It never noticed the pattern. It had no model for what was happening in her head outside the math itself. The system was measuring her knowledge gaps. It had no instrument for measuring what Claude Steele — the Stanford psychologist who identified the concept in 1995 — calls stereotype threat: the way being a member of a group that's stereotyped as less capable actually reduces performance in the moment, regardless of underlying ability.

Amara eventually got a human tutor — a Black woman who had navigated the same SAT in the same suburb fifteen years earlier. Within three sessions, Amara's performance on practice tests improved significantly. The content knowledge wasn't the bottleneck. The belief system was. No AI tutor currently deployed addresses that.

The Three Gaps No Current AI Tutor Fills

After studying Khanmigo, Duolingo Max, MATHia, and Squirrel AI, a pattern emerges. Each system is sophisticated in its own way. Each has documented evidence of effectiveness. And all four share the same three blind spots.

Gap 1: Identity and belonging. All four systems treat the learner as a cognitive agent — a brain processing information. None of them have a model of the learner as a social person whose belief about whether they belong in a subject affects how their brain processes information about that subject. Stereotype threat, impostor syndrome, and math identity are real, documented phenomena with real effects on academic performance. No current AI tutor tracks them or responds to them. A system that measures ten thousand knowledge components but zero identity components is missing something significant.

Gap 2: Transfer and application. Every system in this module is good at teaching something in a specific context. Duolingo Max is good at Spanish vocabulary. MATHia is good at algebra procedure. Squirrel AI is good at exam-targeted content. What none of them have demonstrated is helping students apply learning to genuinely novel contexts — situations that look completely different from anything practiced in the system. This is called far transfer, and it's arguably the most important kind of learning. It's also the hardest to teach and the hardest to measure. Current AI tutors largely avoid the problem.

Far transfer — The ability to apply a concept or skill learned in one context to a completely different situation — like using algebra reasoning to understand a news story about economic statistics. This is harder than "near transfer," which is just applying the same skill to slightly different problems.

Gap 3: Metacognition. Metacognition means thinking about your own thinking — knowing how you learn, recognizing when you're confused versus when you just think you're confused, understanding your own error patterns. Decades of learning research show that metacognitive awareness is one of the strongest predictors of long-term academic success. Students who can accurately judge their own understanding outperform equally intelligent students who can't. Of the four systems studied, only MATHia even attempts to build a metacognitive model — and it does so indirectly, by tracking error patterns rather than directly developing the learner's self-awareness.

What the Research Shows

A 2022 meta-analysis in Educational Psychology Review examined 93 studies of AI tutoring systems and found that while AI tools showed consistent gains on near-transfer tasks (problems similar to those practiced), effect sizes for far-transfer tasks were near zero. The authors concluded: "Current AI tutoring systems are optimized for performance on measurable tasks in structured domains. They have not demonstrated the ability to develop the flexible thinking required for genuinely novel problem-solving."

How to Use This Knowledge Right Now

Understanding what AI tutors can't do is not an argument against using them. It's an argument for using them with clear eyes — knowing what you're getting and what you still need to supply yourself.

Here's a practical framework. When you're using any AI learning tool, you're getting: accurate knowledge-gap identification, patient and infinitely available practice, personalized sequencing, and feedback without judgment. These are genuinely valuable, and no human tutor is consistently better at all of them.

What you're not getting: someone who sees you as a whole person, someone who can identify when your problem is belief rather than knowledge, someone who can push you to apply ideas in genuinely unfamiliar contexts, or someone who helps you build an accurate map of your own thinking.

Those missing pieces are what a good teacher, mentor, or thinking partner does. AI tutors don't replace them. They offload the parts of learning that are about information transfer and practice. The parts that are about identity, meaning, and flexible thinking still require a human — or, in some cases, just time and experience.

The Ethical Question

Schools in under-resourced communities are more likely to adopt AI tutors as substitutes for human instruction — not because they believe AI is better, but because they can't afford enough qualified human teachers. This means the limitations described in this lesson fall disproportionately on students who are already disadvantaged. If you know that AI tutors can't address stereotype threat, identity, or far transfer — and you know that wealthier schools will use AI as a supplement while poorer schools use it as a replacement — what obligation does that knowledge create? For researchers? For policymakers? For companies building these tools?

You have now completed a comparative analysis that working education researchers take years to build. You can name the mechanisms behind four major AI tutoring systems, identify what evidence exists for each, locate the specific gaps they share, and recognize how business models, learning theory, and data ethics all intersect in a tool that looks, from the outside, like just an app that helps you with homework.

That is not a small thing. Every time someone tells you "AI is going to transform education," you now know the right questions to ask: Which AI? What theory of learning? For whom? At what cost? Measured how? Those questions are the difference between being a passive consumer of a tool and being someone who can actually evaluate whether it's doing what it claims.

Lesson 4 Quiz

What AI tutors can't do — apply the gaps framework to new situations.

1. Amara was making measurable progress on Khanmigo practice problems but her performance wasn't improving the way expected. What did Khanmigo's design fail to detect?

Correct. Khanmigo was measuring what it was built to measure — knowledge components. It had no instrument for detecting the cognitive and psychological effects of stereotype threat, which is a real phenomenon with documented effects on working memory.

Think about what changed when Amara switched to a human tutor. The content didn't change. What changed was the tutor's ability to see something Khanmigo couldn't. What was that something?

2. A student uses MATHia to master algebra and scores perfectly on all practice problems. Two weeks later, her economics teacher asks her to use algebra to model a supply-and-demand graph. She struggles. Which AI tutoring gap does this scenario best illustrate?

Right. Far transfer — applying learned skills to genuinely novel contexts — is exactly what current AI tutors struggle to develop. She can do algebra when it looks like algebra. The hard part is recognizing it when it doesn't look like algebra yet.

Review the definition of "far transfer" in the lesson. She clearly knows the algebra when it's presented as algebra. The problem is something specific about how she's applying it in an unfamiliar domain.

3. The 2022 meta-analysis found AI tutors showed strong gains on "near-transfer" tasks but near-zero effect sizes on "far-transfer" tasks. What does this suggest about the way most AI tutoring systems are built?

Exactly. There's a tendency in any measurable system to optimize for what gets measured. If your success metric is performance on structured practice problems, you'll build a tool that improves structured practice problem performance — not necessarily deeper flexible thinking.

The meta-analysis finding is about what AI tutors are built to optimize for. What kind of tasks are easiest to measure progress on? And what does it mean if those aren't the same as the most important tasks?

4. The lesson argues that the limitations of AI tutors fall "disproportionately on students who are already disadvantaged." What is the mechanism behind this claim?

Right. The equity issue isn't about the tool itself being biased — it's about how the tool is deployed. When a tool is a supplement, its limits are covered by what surrounds it. When it's a replacement, its limits are the limits of the student's whole education.

Think about the supplement vs. replacement distinction in the lesson. The tool itself isn't different — what's different is the context in which it operates and what it's expected to cover.

5. After completing this module, someone tells you: "AI tutors are amazing — I used one every day and my test scores improved by 20%." Using the framework from Lesson 4, what is the most thoughtful response?

This is the sophisticated response. It takes the evidence seriously — AI tutors do improve test performance — while asking the harder question about what test scores measure and what they don't. That's the thinking this module was designed to develop.

The lesson isn't arguing that AI tutors don't work — the evidence shows they do improve measurable outcomes. The question is what those outcomes measure and what they leave out. What would a genuinely complete picture of "improved learning" look like?

Lab 4: The Tool Designer

You know what three real AI tutors do well and what all of them miss. Now design something better — and defend every choice.

Your Role

You've been invited to pitch a new AI tutoring concept to a foundation that funds educational technology. Your pitch needs to directly address at least two of the three gaps identified in Lesson 4 — identity and belonging, far transfer, or metacognition. You also need to explain which existing tool your design is most similar to, and why yours does something none of them currently do.

You have three minutes of pitch time. Your AI colleague has read every criticism of AI tutoring in existence and will challenge every claim you make.

Start your pitch. Name your tool, describe what gap it addresses, and explain one specific mechanism — not just a vague goal — for how it would work. Your colleague will push back hard.

AI Colleague — Design Challenge Lab

Lab 4

I've read pitches for AI learning tools from 50 teams this year. Most of them say they'll "address the whole student" or "build metacognition" — and when I ask how, specifically, they give me a vague answer about personalization or adaptive feedback. I'm not interested in vague. Tell me what gap you're addressing, what specific mechanism your tool uses to address it, and why that mechanism would actually work. I'll be the skeptical foundation reviewer. Start your pitch.

Module 4 Test

15 questions — all four lessons. 80% to pass. Reasoning over recall.

1. Khanmigo and Duolingo Max both use GPT-4. A classmate says "they must work the same way." What's the most accurate response?

Right. The model is the substrate; the system prompt is the design. Same material, completely different architecture.

The key insight from Lesson 1: same model, different instructions, completely different tools.

2. What is "friction design" and why would a learning tool deliberately use it?

Correct. Friction = active construction of knowledge. Frictionless = passive receipt. The second feels easier but produces shallower learning.

Friction design is a deliberate pedagogical choice based on how memory works. What does having to think a little harder do to learning?

3. Why might Duolingo's engagement-first design actually serve some learners better than Khanmigo's friction-first design?

Right. Zero sessions produces zero learning. If the alternative is dropout, consistent shallow engagement has genuine value — the question is always "compared to what?"

Think about who Duolingo was designed for. The dropout problem is real. What does that mean for the value of engagement-first design?

4. MATHia was developed from research at which university, and approximately how long has it been operating in real schools?

Correct. Carnegie Mellon, 1998 research, schools from ~2002. This longevity is exactly what gives it data depth no newer tool has.

Review Lesson 2's opening story — where was the cognitive tutor research published, and when?

5. MATHia tracks "hint abuse" and "system-gaming" behaviors. Why might this data be ethically sensitive even if Carnegie Learning doesn't sell it to third parties?

Right. The issue isn't just external data sale — it's the existence of a detailed behavioral profile that can shape how people see you before they've interacted with you. Data retained is data that can be accessed.

Think about the ethical question in Lesson 2 — the issue isn't just who gets the data, but what happens when it's used by people who will interact with the student.

6. What is the significance of FERPA having been written in 1974 for understanding AI tutoring data risks?

Correct. Law is written for the world that exists when it's written. When the world changes faster than the law, gaps appear — and people can be harmed in those gaps.

The lesson makes a specific argument about the mismatch between what FERPA was designed to protect and what modern AI systems actually collect.

7. Squirrel AI's "fine-grained knowledge decomposition" decomposes a curriculum into over 10,000 micro-concepts. What is the primary learning advantage this provides?

Right. Precision identification enables precision routing. The more granular your map of what someone knows, the more accurately you can direct them toward what they don't know yet.

Think about what 10,000 micro-concepts gives you that 100 broad topics doesn't — what kind of decisions does that precision enable?

8. The 2019 Squirrel AI study was published in Nature Human Behaviour and largely ignored by Western media. According to the module, what does this selective attention suggest?

Right. Lomas's point is about selective amplification — what we pay attention to shapes what gets implemented. If we ignore strong evidence because of where it comes from, we make worse decisions.

The lesson doesn't argue the study was flawed — it was peer-reviewed in a top journal. The issue is about how we selectively pay attention to evidence.

9. A student completes all MATHia assignments perfectly and passes every unit test. Three months later, she reads a newspaper article about income inequality and can't connect it to the percentages and ratios she mastered in MATHia. Which gap does this illustrate?

Correct. Far transfer is the leap from "I can do this in math class" to "I can use this to understand the world." That leap is what AI tutors currently struggle to develop.

She clearly knows the math — she passed everything. The problem is applying it in an unfamiliar context. Which gap from Lesson 4 is that?

10. "Metacognition" is described as one of the strongest predictors of long-term academic success. What is metacognition?

Exactly. Students who can accurately judge their own understanding — who know when they understand something versus when they just think they do — consistently outperform equally intelligent students who can't.

Metacognition is specifically about self-awareness of your own learning processes. Re-read the Lesson 4 definition.

11. Both MATHia and Squirrel AI showed large effect sizes in rigorous studies. Khanmigo has early positive pilot data. Duolingo has strong engagement data. How should a school choose between them?

Right. This is the practitioner's answer — not "which is best" but "what are we trying to do, and for whom, and in what context?" Tool selection is always contextual.

The module never declares one tool the winner. It builds a framework for contextual evaluation — different tools for different priorities and contexts.

12. Stereotype threat, as identified by Claude Steele in 1995, has what specific effect on student performance in academic tasks?

Correct. Stereotype threat isn't just a feeling — it has a measurable cognitive mechanism: impaired working memory. That's why it affects performance on tasks that require active mental processing.

Lesson 4 describes this mechanism specifically. Stereotype threat doesn't just affect motivation — it has a direct cognitive effect on working memory. What is that effect?

13. Khan Academy's nonprofit status and Duolingo's public company status both influence their AI design choices. What general principle does this illustrate?

Right. Follow the incentives. A company that needs monthly active users will design for retention. A nonprofit that needs to demonstrate learning outcomes will design for learning. Neither is dishonest — they're responding to their situations.

Neither company is behaving badly — they're responding to their incentive structures. What does that tell you about how to analyze any product or institution?

14. The equity concern in Lesson 4 is that AI tutors deployed as replacements in under-resourced schools create worse outcomes than AI tutors deployed as supplements in well-resourced schools. What is the core mechanism of this inequity?

Right. The same tool, deployed differently, produces different equity outcomes. The tool isn't the problem — the gap between supplement and replacement is where inequity enters.

The tool itself isn't different. What changes is whether it's surrounded by human instruction that fills its gaps — or whether it is the instruction.

15. You've studied four AI tutoring systems across four lessons. A friend says: "AI tutors will replace teachers within 10 years — the data proves it." What is the most sophisticated response based on everything in this module?

This is the full answer. It takes the evidence seriously, acknowledges what it shows, but locates the real question: what are we measuring, and what are we not measuring? That's the question that actually decides the issue — and it's a values question, not just an empirical one.

The module gives you evidence on both sides of this question. The sophisticated answer doesn't dismiss the evidence — it asks what the evidence measures and what it leaves out, and recognizes that the "replacement" question is ultimately about values.