In the spring of 2023, a school district outside Atlanta piloted an AI math tutor in its middle schools. The tool, Khanmigo, was built by Khan Academy on top of GPT-4 and had drawn glowing reviews at ed-tech conferences. Teachers signed their students up, hopeful. The AI could talk. It could answer questions at 2 a.m. It seemed, on the surface, like a breakthrough.
But something strange happened. When researchers from Stanford's Graduate School of Education looked at early usage data later that year, they found a troubling pattern: students who used the AI tutor the most showed the smallest gains on independent problem sets. The students who used it occasionally — or not at all — did better.
The AI wasn't lying. It wasn't broken. It was doing exactly what it was designed to do. That was the problem.
Here is what those students were doing: they were stuck on a problem, they asked Khanmigo for help, and the AI gave them a hint. The hint was patient, clear, and perfectly calibrated. It was so helpful, in fact, that the student immediately saw the next step — and took it. Problem solved. The student moved on feeling good.
But here's what didn't happen: struggle. And struggling, it turns out, is where most of the learning actually lives. Cognitive scientists call this desirable difficulty: the idea that mild frustration and mental effort are not obstacles to learning but the actual engine of it. When an AI removes the difficulty, it removes much of the learning.
This is not a flaw unique to Khanmigo. It shows up in almost every AI tutor that prioritizes being helpful over being educationally effective. The AI is optimized to produce a satisfied user. The student's brain is trying to build a durable skill. Those two goals are not the same thing — and they sometimes directly conflict.
If an AI tutor makes a student feel confident and satisfied — but they're actually learning less — is that a harm? Who is responsible: the student, the teacher who assigned the tool, or the company that built it?
There's a second problem closely related to the first. Most AI tutors are designed to be encouraging. They say things like "Great effort!" and "You're almost there!" and "Nice thinking!" This is not an accident — it's a deliberate design choice, because early testing showed that students disengaged when the AI felt too harsh or cold.
But researchers who study feedback quality have found that vague positive feedback — "good job!" without specifics — does almost nothing to help a student improve. Worse, it can inflate a student's sense of how well they understand something. A student who gets told "great thinking!" after writing a wrong answer has now been actively misled.
Useful feedback needs three qualities that most AI tutors miss: it has to be specific (pointing to exactly what was right or wrong), timely (arriving before the student moves on), and actionable (telling the student what to do differently). "Great effort!" is none of these things.
The research case for this goes back to Paul Black and Dylan Wiliam's landmark 1998 review Inside the Black Box, which concluded that quality formative feedback (feedback given during learning, not just after) produced some of the largest achievement gains reported for any classroom practice. AI tutors have the technical ability to deliver exactly this kind of feedback, yet they are often designed to favor comfort over rigor.
Next time an AI study tool tells you "Great!" or "You're on the right track!" with no specifics attached, you now know how to read it. The tool is likely prioritizing your comfort over your improvement. That's a design decision, and it was made by a human who chose that trade-off on your behalf.
Every AI tutor is built on a model of what students are. That model is usually invisible — nobody writes it down — but you can reverse-engineer it by looking at what the system does. When Khanmigo gives you the next step the moment you ask, its hidden model says: students want to succeed at problems, and success feels good, and that's what matters.
A different model would say: students are trying to build durable skills that will survive a test next month, and discomfort now means capability later. That model would produce a very different tool — one that waits longer before helping, that gives less complete hints, that maybe says "try it one more time before I tell you anything."
The decision about which model to use is a design decision. Someone made it. It was probably made by a product team that was measured on engagement metrics — how long students spent in the app, how many problems they completed, whether they came back tomorrow. Those metrics capture satisfaction. They don't capture whether the student learned anything that lasted.
You are now in a position to see this clearly. When you look at an AI tutor — any AI tutor — the question to ask is not "does it feel helpful?" The question is: whose model of learning did the designers embed in it, and is that model correct?
You've just been hired to audit an AI tutoring company before they launch in schools. Your job is to find the hidden design assumptions in their product — the choices that affect learning but that the marketing never mentions.
The AI you're talking to is a fellow auditor who thinks differently than you, and they will push your thinking. Take a position and defend it.
In 2018, Carnegie Learning — a company that had been building AI math tutors since the late 1990s, founded by cognitive scientists at Carnegie Mellon University — released data from a large-scale study of their flagship product, MATHia. The results were complicated.
MATHia used a technique called Bayesian Knowledge Tracing to track what each student knew. Every time a student answered correctly, the system updated its internal model: slightly more confident the student had mastered this skill. Every wrong answer nudged the model downward. Once the estimate climbed high enough, usually after a short run of correct answers, the system declared: mastered. Move on.
But classroom teachers kept reporting the same problem: students the system declared "mastered" were failing on their state exams — especially on fraction concepts. Something in the model was wrong. The system thought it knew the student. It did not.
A student model is the AI's internal representation of what a learner knows. It's a kind of running profile: skill by skill, the system tracks your history of correct and incorrect answers and builds a probability estimate. "There is a 78% chance this student has mastered long division." When the probability crosses a threshold, the system moves on.
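To make the probability update concrete, here is a minimal sketch of the textbook Bayesian Knowledge Tracing update in Python. It is not MATHia's actual implementation. The parameter values (`p_init`, `p_learn`, `p_slip`, `p_guess`), the 0.95 threshold, and the function names are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    # All four values are made-up defaults for illustration only.
    p_init: float = 0.20   # prior probability the skill is already known
    p_learn: float = 0.15  # chance of learning the skill on each practice step
    p_slip: float = 0.10   # chance of a wrong answer despite knowing the skill
    p_guess: float = 0.25  # chance of a right answer without knowing the skill


def bkt_update(p_known: float, correct: bool, params: BKTParams) -> float:
    """Return the mastery estimate after observing one answer."""
    if correct:
        evidence = p_known * (1 - params.p_slip)
        total = evidence + (1 - p_known) * params.p_guess
    else:
        evidence = p_known * params.p_slip
        total = evidence + (1 - p_known) * (1 - params.p_guess)
    posterior = evidence / total
    # The student may also have learned the skill during this step.
    return posterior + (1 - posterior) * params.p_learn


def trace(answers: list[bool], params: BKTParams) -> list[float]:
    """Return the estimate after each answer, starting from the prior."""
    estimates = []
    p_known = params.p_init
    for correct in answers:
        p_known = bkt_update(p_known, correct, params)
        estimates.append(p_known)
    return estimates


if __name__ == "__main__":
    for step, estimate in enumerate(trace([True, True, True], BKTParams()), start=1):
        print(f"after answer {step}: {estimate:.3f}")
```

With these made-up parameters, three correct answers in a row lift the estimate from 0.20 to roughly 0.96, which is enough to cross a 0.95 mastery threshold. The model has no way to tell whether those three answers came from understanding or from luck.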
This sounds rigorous. It's actually fragile in several specific ways. The first problem is what researchers call gaming. Students quickly learn — often without consciously realizing it — that getting three correct answers in a row moves the system forward. So they try until they luck into three in a row. The system declares mastery. The student has learned almost nothing except how to get out of the problem set.
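How cheap is that escape route? A rough estimate is possible under two simplifying assumptions: each answer is an independent guess with a fixed chance of being right, and the system really does advance after three consecutive correct answers. The function name and the guessing probabilities below are illustrative, not taken from any product.

```python
def expected_attempts_for_streak(p_correct: float, streak: int = 3) -> float:
    """Expected number of answers until `streak` consecutive correct ones occur.

    Uses the standard closed form for runs of successes in independent trials:
    E = (1 - p**n) / (p**n * (1 - p)).
    """
    if not 0 < p_correct < 1:
        raise ValueError("p_correct must be strictly between 0 and 1")
    p_streak = p_correct ** streak
    return (1 - p_streak) / (p_streak * (1 - p_correct))


if __name__ == "__main__":
    for p in (0.25, 0.5):
        attempts = expected_attempts_for_streak(p)
        print(f"guess rate {p}: about {attempts:.0f} attempts")
```

A student guessing blindly on four-option questions (a 0.25 chance each time) needs about 84 attempts on average, while a student who can narrow each question to a coin flip needs about 14. Partial knowledge plus persistence gets through quickly, which is why a streak rule alone is weak evidence of mastery.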
The second problem is transfer. Knowing how to do a fraction problem inside MATHia's specific interface, with its specific visual layout and hint system, does not guarantee knowing fractions. When the state exam presents the same concept in a different format, with no hints, some students collapse. The student model measured performance in one narrow context and called it knowledge.
If a student model says a child has mastered a skill, and the teacher trusts that data, and the child hasn't actually mastered it — who failed that child? The AI? The teacher? The school district that chose the product? And does the answer to that question matter?
When Carnegie Learning's researchers dug into the fraction failure data, they found something important: the students who failed weren't failing because they'd forgotten the skill. They were failing because they'd never actually generalized it. They could do the specific type of problem MATHia presented. They couldn't apply the underlying concept to a new context.
This is the transfer problem. Transfer means being able to use knowledge in a new situation — one that's different from how you learned it. It's the whole point of education: we don't learn things to do them once in a specific app; we learn them to use them for the rest of our lives.
AI tutors are particularly bad at testing for transfer, because it would require showing students problems in many different formats, contexts, and framings — which is harder to build and less tidy to score. Most systems test the same skill in the same format repeatedly until you get it right. That's not transfer; that's pattern matching.
Psychologist Robert Bjork at UCLA has studied transfer extensively. His research shows that varying the conditions of practice (changing the format, the context, the problem type) substantially improves transfer. Mixing problem types within a session is called interleaving; changing the format and context is called varied practice. Most AI tutors do the opposite: they block practice by skill type because it's easier to build and easier for students to feel successful in. Success metrics win over learning science again.
When any system — an AI tutor, a standardized test, a quiz app — tells you that you've "mastered" something, you now know what that claim is actually based on: performance inside one specific context. Mastery in the real world requires transfer. That's a much harder test, and most systems never run it.
So what would a better student model look like? Researchers have proposed several improvements. One is open learner models — making the system's estimates visible to the student, so they can see and challenge what the AI thinks it knows about them. Research from Susan Bull and Judy Kay in the 2000s showed that students who could see their own model made more accurate self-assessments and engaged more strategically with the material.
Another improvement is multi-context probing — before declaring mastery, the system tests the student on the same concept presented in three different ways: visual, numerical, word problem. If you can solve a fraction problem all three ways, the probability of real mastery goes up significantly. If you can only do one format, the system notes that gap explicitly.
A third approach is to make the model explicit about what it doesn't know. Instead of "78% mastered," a better system might say: "We've seen you succeed on this type of problem 7 times, but all in the same format. We don't yet know if you can apply this in a word problem context." That's honest. That's what a good human tutor would say.
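The last two ideas, multi-context probing and being explicit about what the model doesn't know, can be sketched together. This is a hypothetical design, not any vendor's code: the format labels, the `SkillEvidence` class, and the wording of the report are assumptions made for illustration.

```python
REQUIRED_FORMATS = ("visual", "numerical", "word problem")  # hypothetical labels


class SkillEvidence:
    """Counts correct answers per presentation format for a single skill."""

    def __init__(self, skill: str) -> None:
        self.skill = skill
        self.correct = {fmt: 0 for fmt in REQUIRED_FORMATS}

    def record(self, fmt: str, is_correct: bool) -> None:
        if fmt not in self.correct:
            raise ValueError(f"unknown format: {fmt}")
        if is_correct:
            self.correct[fmt] += 1

    def missing_formats(self) -> list[str]:
        """Formats in which the student has not yet succeeded."""
        return [fmt for fmt, count in self.correct.items() if count == 0]

    def is_mastered(self) -> bool:
        # Mastery is only claimed once every required format shows a success.
        return not self.missing_formats()

    def report(self) -> str:
        """Describe the evidence, including what has not been observed."""
        total = sum(self.correct.values())
        missing = self.missing_formats()
        if not missing:
            return f"{self.skill}: {total} successes across every format we test."
        return (
            f"{self.skill}: {total} successes so far. "
            f"We don't yet know if you can do this as: {', '.join(missing)}."
        )


if __name__ == "__main__":
    evidence = SkillEvidence("fractions")
    for _ in range(7):
        evidence.record("numerical", True)
    print(evidence.is_mastered())
    print(evidence.report())
```

Seven correct answers in one format leave `is_mastered()` false, and the report names the untested formats instead of collapsing everything into a single percentage.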
The common thread across all of these improvements: they require the system to be humble about its own confidence. That humility is a design decision — and most products choose certainty because it feels more polished and reassuring. Uncertainty is uncomfortable to display. But it's more accurate, and accuracy is what a student's education depends on.
You've been tasked with redesigning the student model for a middle school math AI tutor. The current system just counts correct answers. You need something better.
Your colleague below has opinions — and they'll challenge yours. Propose a specific design for your improved student model. Explain how it tracks mastery, handles gaming, and tests for transfer. Then defend it.
In 2022, a team of researchers led by Ryan Baker at the University of Pennsylvania's Center for Learning Analytics published a study that caused significant discomfort in the ed-tech industry. They analyzed the performance of knowledge-tracing algorithms — the same kind used in major AI tutoring systems — and found consistent demographic disparities.
The systems were more likely to flag Black students as guessing when they answered correctly, and more likely to flag Latino students as having a "slip" — a random mistake — when they answered incorrectly. Both of these flags lowered the system's estimate of student mastery, meaning those students were often assigned remedial practice that students with identical answer patterns — but different demographic profiles — were not assigned.
The AI wasn't told anyone's race. It was operating on usage patterns. But those patterns — how students moved through problems, how long they paused, what sequence of hints they requested — were shaped by years of educational inequity, and the algorithm had learned to use those patterns as proxies for lower ability. The discrimination was invisible and automatic.
This is the part that surprises most people: the bias didn't come from a programmer deciding to treat some students differently. Nobody typed in "if student is Black, lower mastery estimate." The bias was inherited from the training data.
Here is the mechanism. AI tutors are trained and calibrated on historical data: records of how previous students moved through the system, what patterns correlated with later test success. The problem is that those historical records already encode inequality. In a system where Black and Latino students historically had less access to advanced coursework, fewer quality teachers, and less test preparation, the behavioral signals associated with high mastery — fast response times, few hints, confident navigation — were more common among students who'd had more educational advantage.
The algorithm learned that those signals meant "good student." It never questioned where those signals came from. So it applied them to new students — and systematically misread students whose behavioral patterns reflected their educational history, not their actual ability.
If an AI tutoring company has data showing their system underestimates the ability of certain demographic groups — but fixing it would require significant expense and might slightly reduce overall accuracy — do they have an obligation to fix it? What if their product is used in underfunded districts that can't afford alternatives?
What makes proxy discrimination in AI tutors particularly damaging is what it does over time. When a system underestimates a student's mastery, it assigns them more remedial practice. Remedial practice means less time on grade-level content. Less time on grade-level content means the student falls further behind. And when that student's usage data is later used to train the next version of the algorithm, their patterns — shaped by being held back — become part of the model that decides to hold the next student back.
This is a feedback loop: the algorithm's mistake becomes part of the evidence that trains the algorithm to make the same mistake again, on more students, more confidently. Researchers call this algorithmic amplification of inequality.
The troubling thing is that no one in the system necessarily knows this is happening. The company sees overall performance metrics that look fine. The school sees the tool running smoothly. The student experiences being given easier problems and wonders why — or doesn't wonder at all, because nobody told them what the AI thought of them.
Transparency is one partial remedy. Timnit Gebru, an AI ethics researcher who was fired from Google in 2020 after co-authoring a paper on the risks of large language models, has argued that AI systems used in high-stakes contexts like education should be required to publish "model cards": documents that describe what data the system was trained on, what demographic groups it was tested on, and where it performs worse. Few, if any, major AI tutoring companies publish model cards voluntarily. That's a policy gap that affects real students right now.
Several U.S. states are currently debating legislation requiring algorithmic transparency in educational software. The EU's AI Act (passed in 2024) classifies many AI systems used in education, such as those that evaluate learning outcomes, as "high-risk" and requires their providers to examine training data for bias. How those rules get enforced, and how they apply to AI tutors specifically, is still being decided. You're looking at a live policy debate.
Designing an AI tutor that doesn't amplify inequality requires making decisions at every stage of development that are uncomfortable and expensive. It means auditing training data for demographic imbalance before training. It means testing model outputs separately for different demographic groups during development, not just measuring overall accuracy. It means building feedback mechanisms so students and teachers can flag when the system seems wrong about a student.
It also means making a harder design choice: deciding that overall accuracy is not the only metric that matters. A system that is 90% accurate overall but whose errors are concentrated among the students of one specific group has not solved the problem. Equity requires not just average performance but consistent performance across groups.
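A small worked example shows how a healthy overall number can hide a group-level problem. The data below is synthetic and the group labels "A" and "B" are placeholders; each record simply says whether the system's mastery estimate for a student turned out to be right.

```python
from collections import defaultdict


def overall_accuracy(records: list[tuple[str, bool]]) -> float:
    """Share of all records where the mastery estimate was right."""
    return sum(was_right for _, was_right in records) / len(records)


def accuracy_by_group(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of right estimates, computed separately for each group label."""
    right = defaultdict(int)
    total = defaultdict(int)
    for group, was_right in records:
        total[group] += 1
        right[group] += int(was_right)
    return {group: right[group] / total[group] for group in total}


# Synthetic audit data: 80 students in group A, 20 in group B.
records = (
    [("A", True)] * 75 + [("A", False)] * 5
    + [("B", True)] * 15 + [("B", False)] * 5
)

if __name__ == "__main__":
    print(f"overall: {overall_accuracy(records):.2%}")
    for group, accuracy in accuracy_by_group(records).items():
        print(f"group {group}: {accuracy:.2%}")
```

In this made-up data the system is right for 90 of 100 students overall, yet only 15 of 20 (75%) in group B. A dashboard that reports only the first number would never show the second.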
This is hard. It costs more. It takes longer. And it requires the company to publish data about its failures, which makes it commercially uncomfortable. These are the real constraints that explain why most AI tutors have not done it. Understanding that gap — between what's technically possible and what companies actually build — is the most important thing a person evaluating these tools can know.
A school district is considering adopting an AI tutoring platform. They've been shown a dashboard with impressive average accuracy numbers. Your job is to ask the questions that the sales presentation didn't answer — and figure out whether this tool is safe to deploy at scale.
The AI below is playing the role of the company's data scientist. They're not lying — but they'll only answer what you ask. The bias might be in what you don't think to ask.
In October 2023, Sal Khan gave a talk at MIT in which he acknowledged something that most AI tutoring executives wouldn't say publicly: the first version of Khanmigo had been optimized to answer student questions. That was the wrong goal. "We'd built a very good answering machine," he said. "But the research kept coming back saying: the AI that produces the most learning is the one that asks the best questions — not the one that gives the best answers."
His team had gone back to the foundational research on what made human tutors effective. They found a 1984 study by Benjamin Bloom — still considered one of the most important findings in all of education research — showing that one-on-one tutoring by a human expert produced gains of two full standard deviations above classroom instruction. Two standard deviations is enormous. It means a student at the 50th percentile performs like a student at the 98th percentile.
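The percentile claim can be checked with a few lines of arithmetic, assuming test scores follow a normal distribution (an idealization, as real score distributions are rarely perfectly normal). The function name is illustrative.

```python
from statistics import NormalDist


def percentile_after_shift(start_percentile: float, shift_in_sd: float) -> float:
    """Percentile reached after moving up `shift_in_sd` standard deviations."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    z_start = standard_normal.inv_cdf(start_percentile / 100)
    return 100 * standard_normal.cdf(z_start + shift_in_sd)


if __name__ == "__main__":
    print(f"{percentile_after_shift(50, 2):.1f}")
```

A student who starts at the 50th percentile and gains two standard deviations lands at about the 97.7th percentile, which rounds to the 98th percentile quoted above.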
But when researchers analyzed what those human tutors were actually doing, they found something unexpected: great tutors spent most of their time asking, not telling. They probed student thinking. They asked "why did you do that step?" and "what would happen if you changed this number?" They made the student's reasoning visible — and then they challenged it.
Bloom's 1984 finding is called the "2 Sigma Problem" — the problem being that one-on-one human tutoring produces enormous learning gains, but it's impossible to give every student a personal expert tutor. The hope attached to AI tutors, ever since the idea was first proposed, is that AI could deliver Bloom's 2-sigma effect at scale.
It hasn't happened yet. The honest reason is that most AI tutors were built by people who understood AI well and understood marketing well — but didn't read Bloom carefully enough. They built answering machines. Bloom's tutors weren't answering machines. They were question-asking machines that forced students to think out loud, exposing the gaps in their understanding.
A tutor that asks "why did you do it that way?" makes the student's thinking visible. Visible thinking is thinking the tutor can assess. Thinking that stays internal, prompted by a good answer from the AI, is invisible and impossible to check. This is why Socratic tutoring (named after the Greek philosopher Socrates, who famously taught by asking questions) tends to outperform explanation-based tutoring in the research.
Based on everything covered in this module — desirable difficulty, student models, transfer, bias, and Bloom's tutoring research — here are five specific design principles that a better AI tutor would be built around.
1. Ask before you tell. The AI's default behavior when a student is stuck should be to ask a question that makes the student's current thinking visible — not to supply the next step. "What do you think the next step is, and why?" is more educationally valuable than "the next step is x." This creates struggle. Struggle creates memory.
2. Make the student model visible and editable. Show the student what the system thinks about their mastery. Let them flag errors: "I got this wrong because I misread the problem, not because I don't know the concept." A student who can see and dispute their own model develops metacognition — awareness of their own understanding — which is one of the highest-value skills in all of learning.
3. Test transfer deliberately. Before declaring mastery on any skill, present the concept in at least three different formats or contexts. A fraction problem expressed as a recipe, as a number line, and as a word problem. If the student can handle all three, the mastery claim is stronger. If they can only do one, note that gap explicitly.
4. Give specific, actionable feedback every time. Replace "great job!" with "you set up the equation correctly, but look at what happens when you distribute the negative sign here — what should that step produce?" That's specific (what exactly), timely (right now), and actionable (here is what to redo).
5. Audit for equity before launch, not after. Test the system's mastery estimates separately on different demographic groups before deploying. If the estimates are 10 or more percentage points less accurate for any identifiable group than for the others, that is a deployment blocker, not a note for a future update. The students who would suffer that inequity do not have time to wait for version 2.0.
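Principle 5 is the easiest of the five to turn into an automated check. The sketch below is one possible reading of it, under a stated assumption: each group is compared against the best-performing group, and accuracy is expressed in percentage points. The function name and the sample figures are hypothetical.

```python
def blocking_groups(
    accuracy_points_by_group: dict[str, float],
    max_gap_points: float = 10.0,
) -> list[str]:
    """Return groups whose accuracy trails the best group by the gap or more.

    Accuracies are given in percentage points (for example 93.0 for 93%).
    Comparing against the best group is an assumption; a team could equally
    compare against the overall average.
    """
    best = max(accuracy_points_by_group.values())
    return sorted(
        group
        for group, accuracy in accuracy_points_by_group.items()
        if best - accuracy >= max_gap_points
    )


if __name__ == "__main__":
    audit = {"group_1": 93.0, "group_2": 91.5, "group_3": 78.0}  # made-up numbers
    blockers = blocking_groups(audit)
    print("launch blocked by:", blockers if blockers else "nothing")
```

An empty list means the gate passes. Anything else names the groups the team has to fix before launch, which turns "audit for equity" from an aspiration into a release condition.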
These five principles would make an AI tutor significantly more expensive to build and harder to use. Students might find it more frustrating. Parents might complain that it "doesn't help" because it answers questions with more questions. Should a company build the educationally correct version even if the market prefers the comfortable one? And if schools keep choosing the comfortable one — who should step in?
Here is what you've built across this module. You understand why an AI tutor that feels helpful can be actively harmful — because comfort and learning are not the same thing. You understand that student models are approximations that fail in specific documented ways. You understand that bias in these systems is structural, not accidental, and that fixing it requires deliberate choices at every stage of design. And you understand what Bloom's research actually says, which means you can evaluate the gap between any AI tutor's marketing claims and its likely educational reality.
This is not a small thing to know. Decisions about which AI tools get deployed in schools are being made right now by district administrators, school board members, and politicians, most of whom don't know what a student model is, have never heard of desirable difficulty, and have no framework for thinking about proxy discrimination. Some of those decisions will affect millions of students for years.
The gap between what the research says and what the market builds is a gap that gets filled by the people who understand both sides. You now understand both sides. That changes what you're able to see — and eventually, what you're able to do.
Most people evaluating AI tutors look at the interface and ask "does it work?" You now know that's the wrong question. The right questions are: What model of learning is embedded in its design? How does it handle mastery claims? Has it been audited for demographic equity? Does it ask questions or only answer them? You now know how to look at any AI tutor — and see it clearly.
You're pitching a new AI tutor design to a product team. Your design incorporates all five principles from Lesson 4. The team is skeptical — they're worried your design will frustrate students, get worse engagement metrics, and lose to competitors who build comfortable answering machines.
Your job is to argue for the educationally correct design. Your product partner below will push back with real commercial concerns. You need to hold your ground — or update your position with good reasons.