What Is Your AI Tutor Doing?

1. What problem arises when an AI tutor gives a student harder material based on a student model that was "fooled" by gaming behavior?

Correct. A student model inflated by gaming sends the student into harder material without the skills to handle it — and the system doesn't understand why they're suddenly struggling, because it thinks they already know the prerequisites.

The danger is that the system moves the student into harder content they're not ready for — because the student model was updated with fake "learning" signals from gaming behavior. The mismatch between model and reality causes harm.
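
To see how gamed answers can inflate a student model, here is a minimal sketch using Bayesian Knowledge Tracing, a common way to estimate mastery from right and wrong answers. The source does not say which student model the tutor uses, so BKT itself, the function names, and every parameter value below are assumptions chosen for illustration.

```python
def bkt_update(p_known, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """One Bayesian Knowledge Tracing step (all parameter values are illustrative)."""
    if correct:
        evidence = p_known * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_known) * p_guess)
    else:
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - p_guess))
    # After each attempt, the model also assumes some chance the skill was learned.
    return posterior + (1 - posterior) * p_learn


def estimate_mastery(observations, p_init=0.2):
    """Run the update over a sequence of True/False answers."""
    p = p_init
    for correct in observations:
        p = bkt_update(p, correct)
    return p


# A student who copies the bottom-out hint still logs "correct" every time,
# so the model cannot tell these answers from real understanding.
gamed = estimate_mastery([True] * 5)

# A student who actually works the problems shows a mixed record.
honest_struggle = estimate_mastery([False, False, True, False, True])

print(f"gamed estimate:  {gamed:.2f}")
print(f"honest estimate: {honest_struggle:.2f}")
```

Under these assumed parameters, five gamed "correct" answers push the estimate above 0.95, which is the kind of value a tutor might treat as mastery before unlocking harder material.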

2. What is "model collapse"?

Correct. Documented by Oxford and Cambridge researchers in 2024, model collapse is the slow loss of nuance and diversity when AI trains on its own outputs across successive generations.

Model collapse is a specific, documented phenomenon: the gradual loss of rare and nuanced knowledge when models train on AI-generated data, becoming more confident and more generic over successive generations.
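
The loss of rare knowledge can be illustrated with a toy simulation. This is not the method from the research mentioned above; it is a simplified sketch in which a "model" is just a table of token probabilities, and each generation is refit on samples drawn from the previous one. The token names, sample size, and generation count are made up for the example.

```python
import random
from collections import Counter


def next_generation(probs, n_samples, rng):
    """Sample from the current model, then refit by counting frequencies."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(sample)
    # A token that was never sampled gets probability zero and cannot return.
    return {t: counts[t] / n_samples for t in tokens if counts[t] > 0}


def simulate(probs, generations=30, n_samples=200, seed=0):
    """Return the number of surviving tokens after each generation."""
    rng = random.Random(seed)
    support_sizes = [len(probs)]
    for _ in range(generations):
        probs = next_generation(probs, n_samples, rng)
        support_sizes.append(len(probs))
    return support_sizes


# Hypothetical starting distribution: 5 common items and 50 rare ones.
original = {f"common_{i}": 0.18 for i in range(5)}
original.update({f"rare_{i}": 0.002 for i in range(50)})

sizes = simulate(original)
print("distinct tokens per generation:", sizes)
```

Because a dropped token can never be sampled again, the count of distinct tokens can only stay flat or shrink from one generation to the next, and the rare items are the first to go.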

3. A school district signs a contract with an AI tutoring company. The company's terms allow use of de-identified student interaction data for model improvement. Under current U.S. law, is this likely legal?

Right. The gap in existing law means this practice is generally legal under current federal frameworks — which is precisely why advocates have pushed for new state-level protections.

Under current federal law, institutional (district) consent combined with de-identification typically satisfies FERPA requirements. The risks exist in the gap between what the law was designed for and what AI tutoring systems actually do.

4. The MIT study found that students who used AI coding tools to generate code without understanding it got faster at submitting but didn't improve. This is best explained by which concept?

Correct. Amplification explains divergent outcomes from the same tool. Students who brought genuine learning intent improved; students who brought a shortcut intent just got faster at shortcuts.

The amplification principle is the key here. The same tool produced opposite outcomes based entirely on what the students were trying to do with it. The AI amplified each student's existing approach.

5. In May 2023, lawyer Steven Schwartz used ChatGPT to research a legal brief. What happened?

Correct. The cases were entirely invented — convincing names, dates, and summaries, none of it real. Schwartz was fined and publicly reprimanded by a federal judge.

The cases were completely fabricated — not misquoted, not misattributed, simply invented. This is a landmark example of AI hallucination causing real legal consequences.

6. Amazon shut down its AI hiring tool in 2018 after discovering it penalized female applicants. What was the root cause?

Correct. The bias was implicit in the historical data — not programmed in. The AI faithfully learned from a biased past and projected that past forward as a rule.

This is training data bias: the system learned from historical patterns that reflected gender inequity, and then enshrined those patterns as predictive rules for future decisions.

7. John Anderson's work on Intelligent Tutoring Systems in the 1980s introduced the foundational idea that:

Correct. This pairing — domain model plus learner model — is still the theoretical core of modern adaptive tutoring, now operating at scales Anderson couldn't have imagined.

Anderson's foundational insight was the combination: know the domain deeply, know the individual learner's current state, and use both to choose the next best instructional move.

8. A student asks an AI tutor: "Give me a historical event I've never studied and ask me to apply what I know about the causes of WWI to explain it." This is using which prompting strategy?

Right. Asking to apply known concepts to an unfamiliar scenario is a transfer test, one of the most demanding comprehension checks. It combines active retrieval with novel application.

This is a transfer test — applying known concepts to a novel situation — which tests whether the student has genuine understanding (transferable) or surface-level pattern recognition (not transferable).

9. What is "gaming the system" as identified by Ryan Baker's 2008 research?

Right. Gaming means exploiting the hint system's structure — clicking through the ladder at high speed to get to the bottom-out hint — without engaging with any of the scaffolding.

Gaming specifically refers to exploiting the hint ladder — rapid clicking through hints to extract an answer without reading or using them. Baker found this accounted for over 20% of interactions in some studies.
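
A rough sketch of how rapid hint clicking might be flagged from a tutor's log follows. This is a toy heuristic, not Baker's detector; the `HintEvent` record, the field names, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HintEvent:
    seconds_on_hint: float  # time the hint stayed on screen before the next click
    is_bottom_out: bool     # True if this hint reveals the answer


def looks_like_gaming(events, min_read_seconds=3.0, min_rapid_hints=2):
    """Flag an attempt where earlier hints were skipped on the way to the answer."""
    reached_bottom = any(e.is_bottom_out for e in events)
    rapid = sum(
        1 for e in events
        if not e.is_bottom_out and e.seconds_on_hint < min_read_seconds
    )
    return reached_bottom and rapid >= min_rapid_hints


# Clicked through two hints in under a second each, then took the answer.
clicker = [HintEvent(0.8, False), HintEvent(0.6, False), HintEvent(5.0, True)]

# Spent time on each hint and never asked for the bottom-out hint.
reader = [HintEvent(12.0, False), HintEvent(9.5, False)]

print(looks_like_gaming(clicker), looks_like_gaming(reader))
```

A real system would need to tune these thresholds against labeled data, since a fast reader and a student who is gaming can look alike on timing alone.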

10. Which question type has the HIGHEST risk of being affected by an outdated training cutoff?

Right. Interest rates change frequently and are set by policy decisions. An AI's answer reflects the rate as of its training cutoff, which could be significantly outdated.

The word "current" is the giveaway. Questions about things that change regularly — rates, laws, leadership positions, prices — carry the highest cutoff risk.

11. A student notices her AI tutor uses phrases like "As of my last update..." and "It's worth noting..." regularly. Should she treat these as reliable signals that the AI is genuinely less certain about specific claims?

Correct. Hedging phrases can be genuine uncertainty signals or learned stylistic habits. A current AI system's actual confidence cannot be read reliably from its hedging language alone.

Hedging phrases aren't meaningless — but they're also not reliable uncertainty signals. They may simply be patterns learned from academic writing, appearing regardless of whether the model is actually less certain about that claim.

12. Which of these describes the "multimodal detection" approach to hint timing?

Correct. Multimodal detection means using several data streams together — not relying on any single signal — to make a more accurate judgment about whether a student needs a hint.

Multimodal detection is about combining behavioral data streams — timing, error patterns, hint-request frequency — to produce a more reliable judgment than any single signal alone.
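
One simple way to combine the three streams named above is a voting rule, sketched below. The function name, the thresholds, and the "at least two signals must agree" rule are all assumptions for illustration; the source does not describe how any particular tutor combines its signals.

```python
def needs_hint(seconds_idle, recent_errors, hints_requested,
               idle_limit=45.0, error_limit=3, hint_limit=2):
    """Offer a hint only when at least two of the three signals agree."""
    signals = [
        seconds_idle >= idle_limit,     # long pause on the current step
        recent_errors >= error_limit,   # repeated wrong attempts
        hints_requested >= hint_limit,  # already asking for hints often
    ]
    return sum(signals) >= 2


# A long pause alone is ambiguous: the student may simply be thinking.
print(needs_hint(seconds_idle=60, recent_errors=0, hints_requested=0))

# A long pause combined with repeated errors is stronger evidence.
print(needs_hint(seconds_idle=60, recent_errors=4, hints_requested=0))
```

Requiring agreement between streams is what keeps a single noisy signal, such as a student pausing to think, from triggering an unneeded hint.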

13. The "assistance dilemma" describes the tension between:

Right. Too little help causes disengagement; too much prevents the productive struggle where real learning happens. Both are harmful in different ways.

The assistance dilemma is about help quantity vs. learning depth — the challenge of keeping students engaged without removing the cognitive work that produces actual learning.

14. RLHF teaches an AI model to produce outputs that are:

Correct. RLHF optimizes for human approval ratings, which generally point toward useful answers but can diverge — especially in the direction of sycophancy.

RLHF optimizes for human preference ratings. These usually correlate with accuracy but not always — and the gap is where sycophancy and other failure modes emerge.

15. Kurt VanLehn's original ANDES system at the University of Pittsburgh failed because it gave students immediate correct steps. What was the result of the redesign that gave graduated hints instead?

Correct. The redesigned ANDES system — which gave graduated hints and withheld final answers — matched human tutor outcomes in the 2001 results.

The redesigned ANDES matched human tutors in learning outcomes. Graduated hints, despite being less immediately "helpful," produced better retention than direct answers.
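
The idea of graduated hints can be sketched as a small "hint ladder" that gives one rung per request and never hands over the final answer. This is an illustrative sketch only, not the ANDES implementation; the class name, the fallback message, and the example hints are invented.

```python
class HintLadder:
    """Serve hints from most general to most specific, one rung per request."""

    def __init__(self, hints):
        self.hints = list(hints)
        self.next_rung = 0

    def request(self):
        # Once the ladder is exhausted, send the student back to the problem
        # instead of revealing the answer.
        if self.next_rung >= len(self.hints):
            return "No more hints. Try the step again using the last hint."
        hint = self.hints[self.next_rung]
        self.next_rung += 1
        return hint


ladder = HintLadder([
    "Which physical principle applies to this situation?",
    "Think about Newton's second law.",
    "Write the net force on the block in terms of its mass and acceleration.",
])

for _ in range(4):
    print(ladder.request())
```

Each request moves the student one step closer to the solution while leaving the final step for them to complete.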

16. Percy Liang's HELM benchmark research revealed a problem with how AI models were evaluated. What was that problem?

Correct. Benchmark evaluations using polished prompts overestimate real-world performance, because actual students write vague, ambiguous, or poorly framed prompts that produce much weaker responses.

Liang's concern was that evaluations measured best-case prompt performance, not realistic-case performance. Real users write messy prompts, and models respond less well to those.

17. A lawyer submitted fabricated court cases to a federal judge because ChatGPT generated them. What does this illustrate about hallucination?

Correct. Hallucination means false content delivered with the same confident fluency as true content — which is exactly what makes it dangerous without verification habits.

This isn't a bug that has since been fixed, and it isn't an isolated incident. Hallucination, meaning confidently fluent false output, is a structural feature of LLMs, not a defect in one version.

18. Why does internet text "overrepresent" certain perspectives in AI training data?

Exactly. It's structural, not conspiratorial. Unequal internet access and unequal text production naturally skew what a web-scraped training corpus contains.

The overrepresentation isn't the result of deliberate selection — it reflects who had access to the internet and who produced large volumes of text over the decades when most of the training data was generated.

19. Which of these is an example of "role specification" in prompting?

Right. Role specification tells the AI what kind of teacher or expert to embody, which shapes the level, vocabulary, and focus of the entire response.

Role specification assigns the AI a persona and context that shapes the register of its response. The other options describe constraint setting, error injection, and Socratic prompting respectively.
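
As a small illustration, a role-specified prompt can be assembled from a persona, an audience, and the question itself. The helper below is hypothetical and only builds a string; it does not call any AI service, and the wording of the template is one possible choice among many.

```python
def build_role_prompt(role, audience, question):
    """Prefix a question with the persona and audience the AI should adopt."""
    return (
        f"You are {role}. Explain your answer for {audience}.\n\n"
        f"Question: {question}"
    )


prompt = build_role_prompt(
    role="a patient high-school chemistry teacher",
    audience="a student seeing this topic for the first time",
    question="Why does salt lower the freezing point of water?",
)
print(prompt)
```

Changing only the `role` and `audience` arguments, while keeping the same question, is an easy way to see how the persona shifts the level and vocabulary of the response.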

20. Microsoft's Tay chatbot began posting harmful content within hours of launch in 2016. What fundamental principle about training data does this demonstrate?

Correct. Tay reflected its inputs faithfully — which is exactly how training works. What goes in shapes what comes out, without independent moral judgment.

The issue is specific: AI trained on inputs reflects those inputs. Tay had no filter for "appropriate" — it learned from what it received and reproduced it. That's the training data principle.

Final Exam