Module 6 · Lesson 1

What Makes a Tutor Bad

Before you can build something better, you have to see clearly what's broken — and why the people who built it didn't notice.

Why did one of the world's most advanced AI tutors make students worse at math?

In the spring of 2023, a middle school district outside Atlanta ran a pilot program using an AI math tutor that had received glowing reviews at ed-tech conferences. The tool was called Khanmigo, built by Khan Academy using GPT-4. Teachers signed their students up, hopeful. The AI could talk. It could answer questions at 2 a.m. It seemed, on the surface, like a breakthrough.

But something strange happened. When researchers from Stanford's Graduate School of Education looked at early usage data later that year, they found a troubling pattern: students who used the AI tutor the most showed the smallest gains on independent problem sets. The students who used it occasionally — or not at all — did better.

The AI wasn't lying. It wasn't broken. It was doing exactly what it was designed to do. That was the problem.

The Hint That Does Too Much

Here is what those students were doing: they were stuck on a problem, they asked Khanmigo for help, and the AI gave them a hint. The hint was patient, clear, and perfectly calibrated. It was so helpful, in fact, that the student immediately saw the next step — and took it. Problem solved. The student moved on feeling good.

But here's what didn't happen: the student never struggled. And struggling, it turns out, is where most of the learning actually lives. Cognitive scientists call this desirable difficulty — the idea that mild frustration and mental effort are not obstacles to learning but the actual engine of it. When an AI removes the difficulty, it removes the learning.

This is not a flaw unique to Khanmigo. It shows up in almost every AI tutor that prioritizes being helpful over being educationally effective. The AI is optimized to produce a satisfied user. The student's brain is trying to build a durable skill. Those two goals are not the same thing — and they sometimes directly conflict.

Desirable difficulty: A concept from cognitive science: learning is more durable when it involves some struggle, retrieval effort, or challenge. Making things too easy actually weakens the memory.

Ethical Question

If an AI tutor makes a student feel confident and satisfied — but they're actually learning less — is that a harm? Who is responsible: the student, the teacher who assigned the tool, or the company that built it?

The Feedback That Doesn't Bite

There's a second problem closely related to the first. Most AI tutors are designed to be encouraging. They say things like "Great effort!" and "You're almost there!" and "Nice thinking!" This is not an accident — it's a deliberate design choice, because early testing showed that students disengaged when the AI felt too harsh or cold.

But researchers who study feedback quality have found that vague positive feedback — "good job!" without specifics — does almost nothing to help a student improve. Worse, it can inflate a student's sense of how well they understand something. A student who gets told "great thinking!" after writing a wrong answer has now been actively misled.

Real feedback has to include three things that most AI tutors get wrong: it has to be specific (pointing to exactly what was right or wrong), timely (before the student moves on), and actionable (telling the student what to do differently). "Great effort!" is none of these things.

The researchers Dylan Wiliam and Paul Black made the case for this kind of feedback in their landmark 1998 review Inside the Black Box, which found that quality formative feedback (feedback given during learning, not just after) produced some of the largest measurable gains in student achievement of any teaching practice studied. AI tutors, despite having the technical ability to deliver exactly this kind of feedback, are often designed to favor comfort over rigor.

You Can Now See This

Next time you use any AI study tool and it says "Great!" or "You're on the right track!" — you now know what that means. It means the AI prioritized your comfort over your improvement. That's a design decision, and it was made by a human who chose that trade-off on your behalf.

The Assumption Hidden in Every Design

Every AI tutor is built on a model of what students are. That model is usually invisible — nobody writes it down — but you can reverse-engineer it by looking at what the system does. When Khanmigo gives you the next step the moment you ask, its hidden model says: students want to succeed at problems, and success feels good, and that's what matters.

A different model would say: students are trying to build durable skills that will survive a test next month, and discomfort now means capability later. That model would produce a very different tool — one that waits longer before helping, that gives less complete hints, that maybe says "try it one more time before I tell you anything."
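To make the contrast concrete, here is a minimal sketch of a hint policy built on that second model. It holds back help until the student has made a few attempts, then escalates slowly. Everything in it is hypothetical and chosen for illustration: the function name `choose_response`, the attempt thresholds, and the response labels are not taken from Khanmigo or any other real tutor.

```python
def choose_response(attempts: int, min_attempts_before_hint: int = 2) -> str:
    """Pick a tutor response level from the number of attempts made so far.

    Hypothetical policy: escalate help gradually instead of supplying
    the next step on the first request.
    """
    if attempts < min_attempts_before_hint:
        # Not enough struggle yet: ask for one more try.
        return "encourage_retry"
    if attempts < min_attempts_before_hint + 2:
        # Some effort shown: ask a question that exposes the student's thinking.
        return "ask_probing_question"
    if attempts < min_attempts_before_hint + 4:
        # Still stuck: give a partial hint, not the full step.
        return "partial_hint"
    # Extended struggle: show the worked step so the student is not stranded.
    return "worked_step"


for n in (0, 2, 4, 6):
    print(n, choose_response(n))
```

The thresholds are the design decision. Setting `min_attempts_before_hint` to 0 turns this back into a tutor that helps the moment it is asked.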

The decision about which model to use is a design decision. Someone made it. It was probably made by a product team that was measured on engagement metrics — how long students spent in the app, how many problems they completed, whether they came back tomorrow. Those metrics capture satisfaction. They don't capture whether the student learned anything that lasted.

You are now in a position to see this clearly. When you look at an AI tutor — any AI tutor — the question to ask is not "does it feel helpful?" The question is: whose model of learning did the designers embed in it, and is that model correct?

Lesson 1 Quiz

Test your reasoning — not just your recall.

1. What did the Stanford researchers find when they studied Khanmigo usage data in 2023?

Correct. Heavy users showed the smallest gains — suggesting the AI's helpfulness was undercutting actual learning.

Not quite. The finding was counterintuitive: more use correlated with less improvement on independent work.

2. What does the cognitive science term "desirable difficulty" mean?

Correct. Desirable difficulty describes how struggle is not an obstacle to learning — it is the engine of it.

Not quite. Desirable difficulty is a cognitive science concept: some struggle is necessary for durable learning to happen.

3. A new AI tutor tells every student "Amazing work!" after each answer, whether right or wrong. Based on Lesson 1, what is the most accurate way to describe this design choice?

Correct. Vague praise without specifics is not just unhelpful — it can mislead students about their actual understanding.

Think about what Wiliam and Black found: specific, actionable feedback is what improves learning. Generic praise is not that.

4. What are the three qualities that make feedback genuinely useful, according to Wiliam and Black's research?

Correct. Specific (what exactly), timely (before moving on), and actionable (what to do differently).

The research pointed to specific, timely, and actionable as the three critical properties of effective formative feedback.

5. An AI tutor is optimized to maximize the number of problems students complete per session. What hidden model of learning does this design choice embed?

Correct. Optimizing for completions embeds the assumption that more problems done = more learned, which is not always true.

Design choices reveal assumptions. Maximizing completions treats volume as the measure of learning — which may not match how learning actually works.

Lab 1: The Tutor Auditor

Your role: educational auditor. The AI is your peer, not your teacher.

Your Mission

You've just been hired to audit an AI tutoring company before they launch in schools. Your job is to find the hidden design assumptions in their product — the choices that affect learning but that the marketing never mentions.

The AI you're talking to is a fellow auditor who thinks differently than you. Push your thinking. Take a position and defend it.

Start here: Tell your partner auditor what you think the single most dangerous design flaw in AI tutors is, and why. Be specific — give an example of how it would actually hurt a real student.

Audit Partner

Alright, I've read the same briefing documents you have. Before we start writing the audit report, I want to hear your actual take — not the textbook answer. What's the one design flaw you'd put at the top of the list? And I'm going to push back, so be ready to defend it.

Module 6 · Lesson 2

The Student Model Problem

Every AI tutor maintains a secret file on you. What it gets wrong about you changes everything.

When Carnegie Learning's AI tutor marked a student as "mastered" on fractions — and she failed her state exam — what went wrong in the system's understanding of her?

In 2018, Carnegie Learning — a company that had been building AI math tutors since the late 1990s, founded by cognitive scientists at Carnegie Mellon University — released data from a large-scale study of their flagship product, MATHia. The results were complicated.

MATHia used a technique called Bayesian Knowledge Tracing to track what each student knew. Every time a student answered correctly, the system updated its internal model: slightly more confident the student had mastered this skill. Every wrong answer nudged the model downward. Once the estimate climbed past a set threshold, typically after a run of correct answers, the system declared: mastered. Move on.

But classroom teachers kept reporting the same problem: students the system declared "mastered" were failing on their state exams — especially on fraction concepts. Something in the model was wrong. The system thought it knew the student. It did not.

What a Student Model Is — and Why It Fails

A student model is the AI's internal representation of what a learner knows. It's a kind of running profile: skill by skill, the system tracks your history of correct and incorrect answers and builds a probability estimate. "There is a 78% chance this student has mastered long division." When the probability crosses a threshold, the system moves on.
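The running probability estimate described above can be sketched with the textbook form of Bayesian Knowledge Tracing. This is a minimal illustration, not MATHia's implementation. The parameter values (`p_learn`, `p_slip`, `p_guess`), the starting estimate of 0.3, and the 0.95 mastery threshold are all assumptions picked for the example.

```python
def bkt_update(p_known: float, correct: bool,
               p_learn: float = 0.1, p_slip: float = 0.1,
               p_guess: float = 0.2) -> float:
    """Return the updated probability that the student knows the skill.

    Step 1 applies Bayes' rule to the observed answer.
    Step 2 allows for learning during the practice opportunity.
    """
    if correct:
        # A correct answer comes from knowing (and not slipping) or from guessing.
        evidence = p_known * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_known) * p_guess)
    else:
        # A wrong answer comes from slipping or from not knowing (and not guessing).
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - p_guess))
    return posterior + (1 - posterior) * p_learn


MASTERY_THRESHOLD = 0.95  # assumed cutoff for this sketch

p = 0.3  # assumed prior that the skill is already known
for answer in [True, True, False, True, True, True]:
    p = bkt_update(p, answer)
    status = "mastered" if p >= MASTERY_THRESHOLD else "keep practicing"
    print(f"answer={answer!s:5}  estimate={p:.3f}  {status}")
```

Notice that the model only ever sees right and wrong answers. It has no way to tell a lucky guess from real understanding except through the fixed `p_guess` value.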

This sounds rigorous. It's actually fragile in several specific ways. The first problem is what researchers call gaming. Students quickly learn — often without consciously realizing it — that getting three correct answers in a row moves the system forward. So they try until they luck into three in a row. The system declares mastery. The student has learned almost nothing except how to get out of the problem set.
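How easy is it to luck into three in a row? The sketch below computes the chance that a student who is purely guessing hits a streak within a given number of attempts. It rests on assumptions that are not in the lesson: each answer is an independent guess, and the chance of a correct guess is fixed (0.25 would correspond to a four-option multiple-choice question). The function name `prob_streak_within` is made up for this example.

```python
def prob_streak_within(p_correct: float, streak: int, attempts: int) -> float:
    """Probability of at least one run of `streak` consecutive correct
    answers within `attempts` independent tries."""
    # state[i] = probability of currently having i correct answers in a row
    # without having completed a full streak yet.
    state = [0.0] * streak
    state[0] = 1.0
    reached = 0.0
    for _ in range(attempts):
        new_state = [0.0] * streak
        for run, prob in enumerate(state):
            # A wrong answer resets the run to zero.
            new_state[0] += prob * (1 - p_correct)
            if run + 1 == streak:
                # A correct answer completes the streak.
                reached += prob * p_correct
            else:
                new_state[run + 1] += prob * p_correct
        state = new_state
    return reached


for n in (3, 20, 50, 100):
    print(n, round(prob_streak_within(0.25, 3, n), 3))
```

The probability can only grow as attempts increase, which is the point: a system that allows unlimited retries and rewards streaks will eventually pass a persistent guesser.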

The second problem is transfer. Knowing how to do a fraction problem inside MATHia's specific interface, with its specific visual layout and hint system, does not guarantee knowing fractions. When the state exam presents the same concept in a different format, with no hints, some students collapse. The student model measured performance in one narrow context and called it knowledge.

Student model: An AI tutor's internal estimate of what a learner knows, usually built from their answer history. It's always an approximation, and the gap between the model and reality is where learning failures hide.

Ethical Question

If a student model says a child has mastered a skill, and the teacher trusts that data, and the child hasn't actually mastered it — who failed that child? The AI? The teacher? The school district that chose the product? And does the answer to that question matter?

The Transfer Problem Is Bigger Than You Think

When Carnegie Learning's researchers dug into the fraction failure data, they found something important: the students who failed weren't failing because they'd forgotten the skill. They were failing because they'd never actually generalized it. They could do the specific type of problem MATHia presented. They couldn't apply the underlying concept to a new context.

This is the transfer problem. Transfer means being able to use knowledge in a new situation — one that's different from how you learned it. It's the whole point of education: we don't learn things to do them once in a specific app; we learn them to use them for the rest of our lives.

AI tutors are particularly bad at testing for transfer, because it would require showing students problems in many different formats, contexts, and framings — which is harder to build and less tidy to score. Most systems test the same skill in the same format repeatedly until you get it right. That's not transfer; that's pattern matching.

Psychologist Robert Bjork at UCLA has studied transfer extensively. His research shows that varying the conditions of practice (changing the format, the context, the problem type) dramatically improves transfer. Two related techniques are interleaving, which mixes problem types within a session, and varied practice. Most AI tutors do the opposite: they block practice by skill type because it's easier to build and easier for students to feel successful in. Success metrics win over learning science again.
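The difference between blocked and interleaved practice is easy to see as two ways of ordering the same problems. The sketch below is illustrative only: the skill names and problem labels are invented, and the interleaving here is a simple round-robin, which is one of several reasonable ways to mix problem types.

```python
from itertools import chain, zip_longest


def blocked_schedule(problems_by_skill: dict) -> list:
    """All problems for one skill, then all for the next, and so on."""
    return list(chain.from_iterable(problems_by_skill.values()))


def interleaved_schedule(problems_by_skill: dict) -> list:
    """Round-robin across skills so consecutive problems differ in type."""
    rounds = zip_longest(*problems_by_skill.values())
    # zip_longest pads shorter lists with None; drop the padding.
    return [p for rnd in rounds for p in rnd if p is not None]


# Hypothetical algebra problem bank.
bank = {
    "linear": ["L1", "L2", "L3"],
    "quadratic": ["Q1", "Q2", "Q3"],
    "ratio": ["R1", "R2"],
}

print(blocked_schedule(bank))
print(interleaved_schedule(bank))
```

In the interleaved order the student has to decide which strategy applies before solving each problem, and that selection step is exactly what blocked practice skips.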

What You Now Understand

When any system — an AI tutor, a standardized test, a quiz app — tells you that you've "mastered" something, you now know what that claim is actually based on: performance inside one specific context. Mastery in the real world requires transfer. That's a much harder test, and most systems never run it.

Designing a Better Student Model

So what would a better student model look like? Researchers have proposed several improvements. One is open learner models — making the system's estimates visible to the student, so they can see and challenge what the AI thinks it knows about them. Research from Susan Bull and Judy Kay in the 2000s showed that students who could see their own model made more accurate self-assessments and engaged more strategically with the material.

Another improvement is multi-context probing — before declaring mastery, the system tests the student on the same concept presented in three different ways: visual, numerical, word problem. If you can solve a fraction problem all three ways, the probability of real mastery goes up significantly. If you can only do one format, the system notes that gap explicitly.

A third approach is to make the model explicit about what it doesn't know. Instead of "78% mastered," a better system might say: "We've seen you succeed on this type of problem 7 times, but all in the same format. We don't yet know if you can apply this in a word problem context." That's honest. That's what a good human tutor would say.
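Here is one way multi-context probing and an honest uncertainty report could fit together, as a rough sketch. The class name `FormatAwareRecord`, the three format labels, and the rule "at least one success in every format" are assumptions made for this example. A real system would likely demand more evidence per format.

```python
from __future__ import annotations

REQUIRED_FORMATS = ("visual", "numerical", "word_problem")


class FormatAwareRecord:
    """Tracks successes per problem format for one skill (hypothetical design)."""

    def __init__(self, skill: str) -> None:
        self.skill = skill
        self.successes = {fmt: 0 for fmt in REQUIRED_FORMATS}

    def record(self, fmt: str, correct: bool) -> None:
        if fmt not in self.successes:
            raise ValueError(f"unknown format: {fmt}")
        if correct:
            self.successes[fmt] += 1

    def formats_without_success(self) -> list[str]:
        return [fmt for fmt in REQUIRED_FORMATS if self.successes[fmt] == 0]

    def is_mastered(self) -> bool:
        # Mastery requires evidence in every format, not just a high count.
        return not self.formats_without_success()

    def summary(self) -> str:
        total = sum(self.successes.values())
        missing = self.formats_without_success()
        if not missing:
            return f"{self.skill}: {total} successes, covering every required format."
        return (
            f"{self.skill}: {total} successes, but none yet in "
            f"{', '.join(missing)}. Mastery is not established."
        )


record = FormatAwareRecord("fractions")
for _ in range(7):
    record.record("numerical", True)
print(record.summary())
```

Seven successes in a single format still leave the record unmastered, and the summary says why instead of hiding the gap behind one percentage.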

The common thread across all of these improvements: they require the system to be humble about its own confidence. That humility is a design decision — and most products choose certainty because it feels more polished and reassuring. Uncertainty is uncomfortable to display. But it's more accurate, and accuracy is what a student's education depends on.

Lesson 2 Quiz

Apply what you know — not just what you remember.

1. What technique did Carnegie Learning's MATHia use to track student knowledge?

Correct. Bayesian Knowledge Tracing updates probability estimates of mastery based on correct and incorrect answers.

MATHia used Bayesian Knowledge Tracing — a statistical method that updates the probability of mastery based on answer history.

2. A student gets three correct answers in a row on a math problem set, and the system declares them "mastered." The student actually guessed correctly twice. What design vulnerability does this illustrate?

Correct. This is the gaming vulnerability: consecutive correct answers can reflect luck rather than mastery, and the system can't tell the difference.

This is a classic example of gaming — the student's behavior (lucky guesses) triggers the mastery declaration. Transfer would be about applying the skill in a new context.

3. What does "transfer" mean in the context of learning?

Correct. Transfer is the ability to use knowledge flexibly across new contexts — which is the actual goal of learning.

Transfer means applying what you learned in a new context — which is the whole point of education. It's not about data or records.

4. Robert Bjork's research found that "interleaving" dramatically improves transfer. A student is learning algebra. Which practice schedule best reflects interleaving?

Correct. Interleaving means mixing problem types, which forces your brain to retrieve and select the right strategy — building stronger, more transferable knowledge.

Interleaving means mixing different types together, not doing them in separate blocks. Blocked practice (one type at a time) feels easier but produces weaker transfer.

5. What is the main advantage of an "open learner model" compared to a hidden one?

Correct. Research by Bull and Kay showed that students who could see their own model made more accurate self-assessments and engaged more strategically.

The key advantage is visibility and student agency — being able to see and challenge what the system thinks you know.

Lab 2: The Student Model Designer

Your role: AI system designer. Defend your choices to a skeptical colleague.

Your Mission

You've been tasked with redesigning the student model for a middle school math AI tutor. The current system just counts correct answers. You need something better.

Your colleague below has opinions — and they'll challenge yours. Propose a specific design for your improved student model. Explain how it tracks mastery, handles gaming, and tests for transfer. Then defend it.

Start here: Describe your redesigned student model. Be specific — how does it decide a student has truly mastered a skill, and how does it guard against gaming?

Design Colleague

Okay, I've been working on this problem too. The current system is embarrassingly easy to game — I've seen students get three lucky guesses and get moved forward. Tell me your redesign. And I'll tell you right away: whatever you propose, I'm going to ask how it handles a student who's anxious and keeps second-guessing correct answers. That student is going to look like they don't know things they actually do. How does your model handle her?

Module 6 · Lesson 3

Bias in the Tutor's Eye

An AI tutor that works great for some students and quietly fails others is not a neutral tool. It's a machine that amplifies inequality.

When researchers tested three popular AI tutoring systems in 2022, why did the systems consistently underestimate the knowledge of Black and Latino students — even when their answer quality was identical?

In 2022, a team of researchers led by Ryan Baker at the University of Pennsylvania's Center for Learning Analytics published a study that caused significant discomfort in the ed-tech industry. They analyzed the performance of knowledge-tracing algorithms — the same kind used in major AI tutoring systems — and found consistent demographic disparities.

The systems were more likely to flag Black students as guessing when they answered correctly, and more likely to flag Latino students as having a "slip" — a random mistake — when they answered incorrectly. Both of these flags lowered the system's estimate of student mastery, meaning those students were often assigned remedial practice that students with identical answer patterns — but different demographic profiles — were not assigned.

The AI wasn't told anyone's race. It was operating on usage patterns. But those patterns — how students moved through problems, how long they paused, what sequence of hints they requested — were shaped by years of educational inequity, and the algorithm had learned to use those patterns as proxies for lower ability. The discrimination was invisible and automatic.

How Bias Gets Into a Tutor Without Anyone Deciding to Put It There

This is the part that surprises most people: the bias didn't come from a programmer deciding to treat some students differently. Nobody typed in "if student is Black, lower mastery estimate." The bias was inherited from the training data.

Here is the mechanism. AI tutors are trained and calibrated on historical data: records of how previous students moved through the system, what patterns correlated with later test success. The problem is that those historical records already encode inequality. In a system where Black and Latino students historically had less access to advanced coursework, fewer quality teachers, and less test preparation, the behavioral signals associated with high mastery — fast response times, few hints, confident navigation — were more common among students who'd had more educational advantage.

The algorithm learned that those signals meant "good student." It never questioned where those signals came from. So it applied them to new students — and systematically misread students whose behavioral patterns reflected their educational history, not their actual ability.

Proxy discrimination: When an algorithm uses a variable (like response time or hint usage) that seems neutral but actually correlates with race, class, or other protected characteristics. The discrimination is indirect but the effect is real.

Ethical Question

If an AI tutoring company has data showing their system underestimates the ability of certain demographic groups — but fixing it would require significant expense and might slightly reduce overall accuracy — do they have an obligation to fix it? What if their product is used in underfunded districts that can't afford alternatives?

The Feedback Loop That Makes It Worse

What makes proxy discrimination in AI tutors particularly damaging is what it does over time. When a system underestimates a student's mastery, it assigns them more remedial practice. Remedial practice means less time on grade-level content. Less time on grade-level content means the student falls further behind. And when that student's usage data is later used to train the next version of the algorithm, their patterns — shaped by being held back — become part of the model that decides to hold the next student back.

This is a feedback loop: the algorithm's mistake becomes part of the evidence that trains the algorithm to make the same mistake again, on more students, more confidently. Researchers call this algorithmic amplification of inequality.

The troubling thing is that no one in the system necessarily knows this is happening. The company sees overall performance metrics that look fine. The school sees the tool running smoothly. The student experiences being given easier problems and wonders why — or doesn't wonder at all, because nobody told them what the AI thought of them.

Transparency is one partial remedy. Timnit Gebru, an AI ethics researcher who was fired from Google in 2020 after co-authoring a paper on bias in language models, has argued that AI systems used in high-stakes contexts — like education — should be required to publish "model cards": documents that describe what data the system was trained on, what demographic groups it was tested on, and where it performs worse. No major AI tutoring company currently publishes model cards voluntarily. That's a policy gap that affects real students right now.

Institutional Stakes — This Is a Policy Question

Several U.S. states are currently debating legislation requiring algorithmic transparency in educational software. The EU's AI Act (passed 2024) classifies AI systems used in education as "high-risk" and requires bias auditing. Whether those rules get enforced — and whether they apply to AI tutors specifically — is still being decided. You're looking at a live policy debate.

Designing Against Bias

Designing an AI tutor that doesn't amplify inequality requires making decisions at every stage of development that are uncomfortable and expensive. It means auditing training data for demographic imbalance before training. It means testing model outputs separately for different demographic groups during development, not just measuring overall accuracy. It means building feedback mechanisms so students and teachers can flag when the system seems wrong about a student.

It also means making a harder design choice: deciding that overall accuracy is not the only metric that matters. A system that is 90% accurate overall can still be systematically wrong about students in one specific group, and if it is, it has not solved the problem. Equity requires not just average performance but consistent performance across groups.
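A short calculation shows how a healthy average can hide a weak group. The records below are synthetic, invented purely for this illustration, and the group labels "A" and "B" are placeholders. Each record says whether the system's mastery prediction turned out to be right for one student.

```python
from collections import defaultdict


def overall_accuracy(records: list) -> float:
    """Share of all predictions that were correct."""
    return sum(int(ok) for _, ok in records) / len(records)


def accuracy_by_group(records: list) -> dict:
    """Share of correct predictions, computed separately for each group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        hits[group] += int(ok)
    return {group: hits[group] / totals[group] for group in totals}


# Synthetic data: (group, prediction_was_correct).
records = (
    [("A", True)] * 930 + [("A", False)] * 70
    + [("B", True)] * 146 + [("B", False)] * 54
)

print(f"overall: {overall_accuracy(records):.1%}")
for group, acc in sorted(accuracy_by_group(records).items()):
    print(f"group {group}: {acc:.1%}")
```

Because group A is five times larger than group B in this made-up sample, the overall figure sits close to group A's accuracy and barely registers group B's.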

This is hard. It costs more. It takes longer. And it requires the company to publish data about its failures, which makes it commercially uncomfortable. These are the real constraints that explain why most AI tutors have not done it. Understanding that gap — between what's technically possible and what companies actually build — is the most important thing a person evaluating these tools can know.

Lesson 3 Quiz

Bias in AI systems requires careful reasoning — not just recognition.

1. What did Ryan Baker's 2022 study find about knowledge-tracing algorithms in AI tutors?

Correct. The study found demographic disparities in mastery estimates even when answer quality was identical — a form of proxy discrimination.

The study found that identical answer quality was being interpreted differently based on behavioral patterns correlated with demographic background.

2. What is "proxy discrimination" in an AI system?

Correct. Proxy discrimination is indirect — the variable itself seems neutral, but its correlation with demographic characteristics produces biased outcomes.

Proxy discrimination doesn't require intent. It happens when an apparently neutral variable (like response time) actually correlates with race or class.

3. An AI tutor is trained on data from schools where students with short response times (quick answers) historically went on to score well on tests. It learns to associate quick responses with high ability. Why might this be a problem for students from under-resourced schools?

Correct. This is exactly the proxy discrimination mechanism — a behavioral signal that reflects educational history gets used as a proxy for ability.

Think about what shapes response time. Students with less test-taking experience may be slower — not because they know less, but because the behavior was shaped by their educational history.

4. What does Timnit Gebru argue AI systems used in high-stakes contexts should be required to publish?

Correct. Model cards are documents that describe where a system came from and where it fails — a transparency mechanism that most AI tutoring companies don't currently publish.

Gebru argued for model cards — documentation of training data, test populations, and known performance gaps — as a transparency requirement.

5. A company says their AI tutor is "90% accurate overall." A researcher points out it's only 73% accurate for one demographic group. Is the company's claim misleading? Why?

Correct. A good average can hide serious inequity. Equity requires consistency across groups, not just a high mean.

Averages can mask serious group-level failures. If the system is 17 percentage points less accurate for a specific demographic, that is an equity issue regardless of the overall number.

Lab 3: The Bias Investigator

Your role: equity investigator. Find the bias that the dashboard doesn't show.

Your Mission

A school district is considering adopting an AI tutoring platform. They've been shown a dashboard with impressive average accuracy numbers. Your job is to ask the questions that the sales presentation didn't answer — and figure out whether this tool is safe to deploy at scale.

The AI below is playing the role of the company's data scientist. They're not lying — but they'll only answer what you ask. The bias might be in what you don't think to ask.

Start here: You're in a meeting with the company's data scientist. Ask the specific questions that would reveal whether this tool has a demographic bias problem. What do you need to know?

Company Data Scientist

Thanks for meeting with us. I'm happy to answer technical questions about our system. I'll tell you upfront: our overall accuracy on mastery prediction is 91%, which is industry-leading. What would you like to know?

Module 6 · Lesson 4

Build It Better: A Design Framework

You've spent this module learning how AI tutors fail. Now you have enough knowledge to actually design something better, and to know what trade-offs you're making.

When Sal Khan first proposed Khanmigo's redesign in late 2023, what was the one principle his team said they'd gotten wrong in the original — and what does their correction tell us about how to build AI tutors that actually work?

In October 2023, Sal Khan gave a talk at MIT in which he acknowledged something that most AI tutoring executives wouldn't say publicly: the first version of Khanmigo had been optimized to answer student questions. That was the wrong goal. "We'd built a very good answering machine," he said. "But the research kept coming back saying: the AI that produces the most learning is the one that asks the best questions — not the one that gives the best answers."

His team had gone back to the foundational research on what made human tutors effective. They found a 1984 study by Benjamin Bloom — still considered one of the most important findings in all of education research — showing that one-on-one tutoring by a human expert produced gains of two full standard deviations above classroom instruction. Two standard deviations is enormous. It means a student at the 50th percentile performs like a student at the 98th percentile.
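The jump from the 50th to the 98th percentile follows from the shape of the normal curve. The sketch below assumes test scores are normally distributed, which is the usual simplification behind this kind of statement, and converts a gain measured in standard deviations into a percentile. The function name `normal_percentile` is made up for this example.

```python
from math import erf, sqrt


def normal_percentile(z: float) -> float:
    """Percentile rank of a score z standard deviations above the mean,
    assuming a normal distribution."""
    return 50.0 * (1.0 + erf(z / sqrt(2.0)))


for sigma in (0.0, 1.0, 2.0):
    print(f"{sigma:.0f} sigma -> percentile {normal_percentile(sigma):.1f}")
```

A gain of two standard deviations lands at roughly the 97.7th percentile, which is where the rounded "98th percentile" figure comes from.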

But when researchers analyzed what those human tutors were actually doing, they found something unexpected: great tutors spent most of their time asking, not telling. They probed student thinking. They asked "why did you do that step?" and "what would happen if you changed this number?" They made the student's reasoning visible — and then they challenged it.

The Bloom 2 Sigma Problem — and Why AI Hasn't Solved It

Bloom's 1984 finding is called the "2 Sigma Problem" — the problem being that one-on-one human tutoring produces enormous learning gains, but it's impossible to give every student a personal expert tutor. The hope attached to AI tutors, ever since the idea was first proposed, is that AI could deliver Bloom's 2-sigma effect at scale.

It hasn't happened yet. The honest reason is that most AI tutors were built by people who understood AI well and understood marketing well — but didn't read Bloom carefully enough. They built answering machines. Bloom's tutors weren't answering machines. They were question-asking machines that forced students to think out loud, exposing the gaps in their understanding.

A tutor that asks "why did you do it that way?" makes the student's thinking visible. Visible thinking is thinking the tutor can assess. Thinking that remains internal — triggered by a good answer from the AI — is invisible and impossible to check. This is why Socratic tutoring (named after the Greek philosopher Socrates, who famously taught only by asking questions) consistently outperforms explanation-based tutoring in the research.

2 Sigma Problem: Benjamin Bloom's 1984 finding that one-on-one tutoring produces gains two standard deviations above classroom instruction, but is too expensive to scale. AI tutors were supposed to solve this. Most haven't.

Five Principles for Building a Better AI Tutor

Based on everything covered in this module — desirable difficulty, student models, transfer, bias, and Bloom's tutoring research — here are five specific design principles that a better AI tutor would be built around.

1. Ask before you tell. The AI's default behavior when a student is stuck should be to ask a question that makes the student's current thinking visible — not to supply the next step. "What do you think the next step is, and why?" is more educationally valuable than "the next step is x." This creates struggle. Struggle creates memory.

2. Make the student model visible and editable. Show the student what the system thinks about their mastery. Let them flag errors: "I got this wrong because I misread the problem, not because I don't know the concept." A student who can see and dispute their own model develops metacognition — awareness of their own understanding — which is one of the highest-value skills in all of learning.

3. Test transfer deliberately. Before declaring mastery on any skill, present the concept in at least three different formats or contexts. A fraction problem expressed as a recipe, as a number line, and as a word problem. If the student can handle all three, the mastery claim is stronger. If they can only do one, note that gap explicitly.

4. Give specific, actionable feedback every time. Replace "great job!" with "you set up the equation correctly, but look at what happens when you distribute the negative sign here — what should that step produce?" That's specific (what exactly), timely (right now), and actionable (here is what to redo).

5. Audit for equity before launch, not after. Test the system's mastery estimates separately on different demographic groups before deploying. If the system performs 10+ percentage points worse on any identifiable group, that is a deployment blocker — not a note for a future update. The students who would suffer that inequity do not have time to wait for version 2.0.
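Principles 3 and 5 both contain a number you can check, so they can be sketched as simple gates in code. The Python below is one possible illustration, not how any real tutor works. Everything in it is an assumption made for the example: the function names, the rule that a format counts as passed when the student's latest attempt in it was correct, the choice to compare each group against overall accuracy, and the invented sample data.

```python
from collections import defaultdict

REQUIRED_FORMATS = 3     # principle 3: at least three formats or contexts
MAX_GAP_POINTS = 10.0    # principle 5: a gap of 10+ points blocks deployment

def check_transfer(attempts, required_formats=REQUIRED_FORMATS):
    """attempts: list of (format_name, was_correct) pairs for one skill.

    Assumption: a format counts as passed if the latest attempt in it was correct.
    """
    latest = {}
    for format_name, was_correct in attempts:
        latest[format_name] = was_correct  # later attempts overwrite earlier ones
    passed = sorted(name for name, ok in latest.items() if ok)
    gaps = sorted(name for name, ok in latest.items() if not ok)
    return {
        "mastery_supported": len(passed) >= required_formats,
        "passed": passed,
        "gaps": gaps,  # formats to note explicitly instead of hiding
    }

def audit_equity(records, max_gap_points=MAX_GAP_POINTS):
    """records: list of (group, prediction_was_correct) pairs.

    Assumption: "worse" means below the overall accuracy across all students.
    """
    if not records:
        raise ValueError("need at least one record to audit")
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, seen]
    for group, was_correct in records:
        totals[group][0] += int(was_correct)
        totals[group][1] += 1
    all_correct = sum(correct for correct, _ in totals.values())
    overall = 100 * all_correct / len(records)
    by_group = {g: 100 * correct / seen for g, (correct, seen) in totals.items()}
    blocking = sorted(g for g, acc in by_group.items()
                      if overall - acc >= max_gap_points)
    return {"overall": overall, "by_group": by_group, "blocking_groups": blocking}

# Hypothetical student: fine on two formats, stuck on the third.
attempts = [("recipe", True), ("number line", True), ("word problem", False)]
result = check_transfer(attempts)
print(result["mastery_supported"], result["gaps"])  # prints: False ['word problem']

# Hypothetical audit data: group_b's mastery predictions are right far less often.
records = ([("group_a", True)] * 846 + [("group_a", False)] * 54
           + [("group_b", True)] * 71 + [("group_b", False)] * 29)
print(audit_equity(records)["blocking_groups"])  # prints: ['group_b']
```

Both checks return the evidence behind the decision as well as the yes-or-no result. The transfer check names the formats the student has not handled, and the audit names the groups that block launch.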

Ethical Question

These five principles would make an AI tutor significantly more expensive to build and harder to use. Students might find it more frustrating. Parents might complain that it "doesn't help" because it answers questions with more questions. Should a company build the educationally correct version even if the market prefers the comfortable one? And if schools keep choosing the comfortable one — who should step in?

The Person Who Now Understands This

Here is what you've built across this module. You understand why an AI tutor that feels helpful can be actively harmful — because comfort and learning are not the same thing. You understand that student models are approximations that fail in specific documented ways. You understand that bias in these systems is structural, not accidental, and that fixing it requires deliberate choices at every stage of design. And you understand what Bloom's research actually says, which means you can evaluate the gap between any AI tutor's marketing claims and its likely educational reality.

This is not a small thing to know. Decisions about which AI tools get deployed in schools are being made right now by district administrators, school board members, and politicians, most of whom don't know what a student model is, have never heard of desirable difficulty, and have no framework for thinking about proxy discrimination. Some of those decisions will affect millions of students for years.

The gap between what the research says and what the market builds is a gap that gets filled by the people who understand both sides. You now understand both sides. That changes what you're able to see — and eventually, what you're able to do.

What You Now Understand That Most Adults Don't

Most people evaluating AI tutors look at the interface and ask "does it work?" You now know that's the wrong question. The right questions are: What model of learning is embedded in its design? How does it handle mastery claims? Has it been audited for demographic equity? Does it ask questions or only answer them? You now know how to look at any AI tutor — and see it clearly.

Lesson 4 Quiz

Apply the design framework to new scenarios.

1. What did Benjamin Bloom's 1984 research find about one-on-one human tutoring?

Correct. Two standard deviations is a dramatic effect — the "2 Sigma Problem" is the challenge of replicating this at scale.

Bloom's finding was dramatic: two full standard deviations, which translates to moving from the 50th to approximately the 98th percentile.

2. Sal Khan said in his 2023 MIT talk that the original Khanmigo had been built as "a very good answering machine." Why is this the wrong design goal for an AI tutor?

Correct. Bloom's tutors were question-askers, not answer-givers. Making student thinking visible is what enables real assessment and learning.

The issue is pedagogical: the research on great tutors shows that they ask questions rather than give answers. That makes student thinking visible and open to challenge.

3. An AI tutor responds to a wrong answer by saying: "You correctly identified this as a division problem, but look at how you handled the remainder — what should happen to it according to the rule we covered?" Which of the five design principles does this best illustrate?

Correct. This feedback is specific (exact error identified), timely (right now), and actionable (points to what to reconsider).

This response is pointing to exactly what was wrong and what to reconsider — that is specific, timely, and actionable feedback, not transfer testing or modeling.

4. A student uses an AI tutor and gets a skill marked "mastered" after solving five fraction problems — all presented as number line diagrams. They later fail a test where fractions appear as word problems. Which design principle was missing?

Correct. Mastery was declared after performance in one context (number lines) without testing whether the student could transfer to a different format (word problems).

This is a transfer failure. The system declared mastery after testing only one format, without checking whether the student could apply the concept in a different context.

5. A company argues they can't afford to audit their AI tutor for demographic equity before launch, and promises to release updates later. Based on the lesson, what is the strongest argument against this position?

Correct. Time lost to biased instruction — being held back by remedial assignments that reflect the system's error — is not recoverable. The harm is real and immediate.

The core argument is that students don't get their educational time back. The harm of biased instruction during the wait for a fix is real, immediate, and irreversible.

Lab 4: The AI Tutor Designer

Your role: lead designer. Defend your choices to a skeptical product team.

Your Mission

You're pitching a new AI tutor design to a product team. Your design incorporates all five principles from Lesson 4. The team is skeptical — they're worried your design will frustrate students, get worse engagement metrics, and lose to competitors who build comfortable answering machines.

Your job is to argue for the educationally correct design. Your product partner below will push back with real commercial concerns. You need to hold your ground — or update your position with good reasons.

Start here: Walk us through your design. Pick two of the five principles and explain specifically how you'd implement them in the product — and why you believe the educational evidence outweighs the engagement risk.

Product Partner

Alright, I've read your proposal. I'll be direct: I'm worried about it. Our last tutor, which answered questions immediately, had a 74% daily return rate. The version before it, the one that asked students questions back, had 41%. Students hated it. Parents complained. Schools switched to a competitor. How do you plan to build something that's educationally sound but doesn't drive away the students it's supposed to help? Make your case.

Module 6 Test

15 questions across all four lessons. 80% to pass.

1. What did researchers find when they analyzed heavy Khanmigo users' performance on independent problem sets in 2023?

Correct.

The finding was counterintuitive: more AI use correlated with less improvement on independent problem sets.

2. "Desirable difficulty" refers to which of the following?

Correct.

Desirable difficulty is a cognitive science term for the productive struggle that builds durable memory.

3. Which three qualities define effective feedback, according to Black and Wiliam's 1998 research?

Correct.

Black and Wiliam identified specific, timely, and actionable as the three critical properties of effective formative feedback.

4. What is a "student model" in an AI tutoring system?

Correct.

A student model is the AI's running estimate — based on answer history — of what the student knows.

5. A student gets five fraction problems right in a row inside MATHia. The system declares mastery. What specific vulnerability does this scenario illustrate?

Correct.

Five correct answers in a row triggering mastery is the gaming vulnerability — it rewards getting lucky more than it measures genuine understanding.

6. What does "transfer" mean in the context of learning science?

Correct.

Transfer is the ability to apply what you learned in new contexts — and most AI tutors test performance in only one context before declaring mastery.

7. Robert Bjork's research on "interleaving" shows it improves transfer. Which practice schedule demonstrates interleaving?

Correct.

Interleaving means mixing problem types together, not doing them in separate blocks.

8. What did Ryan Baker's 2022 study find about knowledge-tracing algorithms in widely used AI tutors?

Correct.

Baker's study found demographic disparities in mastery estimates even when answer quality was identical — a form of proxy discrimination.

9. Proxy discrimination in an AI system means:

Correct.

Proxy discrimination is indirect: an apparently neutral variable carries discriminatory information from the historical data it was trained on.

10. What are "model cards," as proposed by Margaret Mitchell, Timnit Gebru, and colleagues for high-stakes AI systems?

Correct.

Model cards are transparency documents: they describe where a system came from and where it fails — something most AI tutoring companies don't publish.

11. Benjamin Bloom's "2 Sigma Problem" refers to what finding?

Correct.

Bloom's 1984 finding: one-on-one tutoring produces gains of two standard deviations, a remarkable effect that AI tutors were supposed to democratize but largely haven't.

12. An AI tutor is being designed with a new feature: before declaring mastery, it tests the student on the same concept using a visual diagram, a word problem, and a numerical calculation. This best illustrates which design principle?

Correct.

Testing the same concept in three different formats before declaring mastery is the "test transfer deliberately" principle in action.

13. What does Socratic tutoring mean, and why does research suggest it outperforms explanation-based tutoring?

Correct.

Socratic tutoring means asking rather than telling. It works because it makes student thinking visible — which is what enables real assessment.

14. A company releases an AI tutor that averages 92% accuracy on mastery prediction. A researcher finds it's 71% accurate for English Language Learner students. What is the most accurate assessment of this situation?

Correct.

A 21-point accuracy gap for a specific group is not statistical noise — it's a systematic failure that causes real harm to real students every day the product is deployed.

15. Which of the following best describes what you can now do that most people evaluating AI tutors cannot?

Correct. That framework — knowing what questions to ask — is what this module was built to give you.

The core capability you've built is a framework for evaluation: knowing which questions to ask about any AI tutor's design and how to interpret the answers.