In early 2023, a lawyer named Steven Schwartz was working on a routine aviation lawsuit against Avianca Airlines. He needed case citations: references to past court decisions that would support his client's argument. Under time pressure, he turned to ChatGPT, which had launched only a few months earlier. The AI produced six beautiful, professional-sounding citations: case names, docket numbers, court names, dates. The language was impeccable. Every citation read exactly the way a real legal citation should read.
Not one of them existed. The cases were completely fabricated. The courts had never heard them. The judges named had never written those opinions. But the text sounded so authoritative, so exactly right, that Schwartz submitted them to a federal court in March 2023 without checking. When the judge discovered the fabrications, Schwartz faced a sanctions hearing and, in the end, a fine. He told the court he had no idea an AI could produce fake citations that sounded real. The judge was not entirely sympathetic.
The case made international headlines. Legal scholars called it a warning. But the deeper question wasn't really about lawyers. It was about all of us. Why does polished language feel more trustworthy than it should?
Here is something counterintuitive: humans are wired to associate smooth, fluent language with knowledge and truth. This isn't stupidity. It's a mental shortcut that usually works. If someone speaks with precision and confidence, they've usually put in the work to know what they're talking about. That shortcut evolved over thousands of years of human communication, and it's generally pretty reliable — when you're dealing with humans.
AI language models break this shortcut completely. They are trained to produce fluent, grammatically correct, confident-sounding text. That is literally the skill they were built for. A large language model like GPT-4 or Claude doesn't know whether a fact is true before deciding how confidently to state it. It knows what confident, professional language patterns look like, and it reproduces them regardless of whether the underlying claim has any basis in reality.
This creates a specific problem: the text that comes out of an AI is calibrated for readability, not accuracy. There is no signal in the prose style that tells you whether the content is real or invented. A fabricated legal citation and a real one come out of the same model in the same voice.
Think about someone you know who is genuinely an expert at something: a musician, a coder, a chef. When they explain their subject, they often stumble a little. They say "actually, wait" and correct themselves. They pause to think. They say "this part is complicated and I'm not sure I can explain it well." That stumbling is a signal: it means they're navigating real complexity, not just pattern-matching to what an explanation sounds like.
AI text almost never does this. It glides. It transitions smoothly from paragraph to paragraph. Every claim gets a confident follow-up sentence. There are no corrections, no "hmm, let me reconsider." The text has what researchers call surface coherence — it hangs together beautifully — even when the underlying facts are scattered or wrong.
Researchers at Stanford and MIT have documented this in studies from 2023. When readers were shown two versions of the same incorrect fact — one written awkwardly, one written fluently — they rated the fluent version as significantly more credible, even when told both versions had the same accuracy. The prose was doing the persuading, not the evidence.
This is not a small thing. When you read a news article, a Wikipedia edit, a school report, or a product review, you are constantly — unconsciously — using language quality as a proxy for trustworthiness. AI has made that proxy unreliable.
The next time you read something that sounds impressively authoritative, ask yourself: am I trusting the information, or am I trusting the writing style? These are completely different things, and AI has made it urgent to tell them apart.
The Schwartz case was dramatic because it ended up in front of a federal judge. But the same dynamic plays out in quieter ways every day. A student submits an essay using AI-generated facts about a historical event — the essay sounds authoritative, the teacher grades it well, the wrong facts lodge in everyone's memory. A journalist uses an AI summary to double-check a story — the summary sounds accurate, so they skip the original source. A doctor reads an AI-produced literature review — it cites studies that don't exist, but the citations look right.
In each of these cases, the problem isn't that someone was careless. The problem is that the usual signal for "this is reliable" — confident, fluent, well-structured prose — has been completely decoupled from actual reliability. The signal still fires in our brains. It just doesn't mean anything anymore.
Researchers who study misinformation call this a "credibility laundering" problem. Raw, poorly sourced information becomes credible-seeming when it gets rewrapped in professional language. AI is the most efficient credibility launderer ever built.
Every time you read a piece of writing and feel that automatic "this sounds right" response, you're experiencing fluency bias in action. Most people never name it or notice it. You just did. That awareness is a real skill — one that matters in school, in news, in arguments, and eventually in professional decisions that affect other people.
Here's a tension that doesn't have a clean answer: most of the time, good writing really does reflect careful thinking. Clear, organized prose often comes from someone who has done the work to understand what they're saying. The fluency bias shortcut exists because it usually works.
So if we train people to distrust fluent, polished writing, to become suspicious of text that sounds too good, are we also training them to distrust legitimate expertise? Are we making people worse at recognizing genuine scholarship? Could overcorrecting against AI-polished text, ironically, leave us less equipped to evaluate real knowledge?
And there's a second layer: AI tools are now used by people for whom English is not a first language, or who have learning differences that make writing harder. Their polished AI-assisted text might actually reflect serious thought — they just needed help with the execution. If we penalize "too-perfect" writing, who gets penalized most?
These are real tensions. They don't resolve. The right move is to hold them, not solve them.
Below is a passage written in polished, confident AI-style prose about a scientific topic. Your job isn't to say whether it's AI or not — it's to identify which specific claims you would need to verify independently, and explain why the writing style can't tell you whether those claims are true.
Your lab partner will push back on your reasoning, ask you to be more specific, and won't let you off with vague answers. You need at least 3 exchanges to complete this lab.
Start by telling your lab partner which part of this passage you'd verify first, and why.
In January 2023, a group of researchers at the Royal Danish Academy submitted a batch of student essays to a study on AI detection. Half the essays were written by human students; half were generated by GPT-3.5. They recruited 79 experienced teachers — people who had spent years reading student work — and asked them to label each essay as human or AI. The teachers were confident. They averaged 64 years of combined teaching experience in the room.
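Their accuracy rate was 38%. Because the essays were split evenly between human and AI, guessing at random would have been right about 50% of the time. The teachers would have done better flipping a coin.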
The researchers noted something specific in their analysis: the teachers who performed worst tended to rely on a single criterion, namely whether the writing "sounded like a student." They knew what student prose felt like. The problem was that GPT-3.5 had also been trained on enormous quantities of student prose, and it had learned to sound like a student. The fingerprints the teachers were looking for had been smoothed away.
But a smaller group of teachers performed much better. What did they do differently? They looked at structure, not surface. They asked: does this essay's paragraph logic make sense for someone who actually wrestled with this question, or does it read like a comprehensive list?
When a human writer is working through an idea they genuinely find complicated, their paragraph structure tends to be uneven. They might spend three sentences on the thing that surprised them, then rush through the background they know well. They linger where they're uncertain. They repeat themselves when they haven't quite worked something out yet. Their organization reflects their actual thinking process, which is not perfectly linear.
AI paragraph structure is different. It follows a highly optimized template: introduce the topic, state the key point, provide supporting evidence, conclude with a transition to the next idea. Every paragraph does this. Every paragraph is roughly the same length. The transitions are smooth: "Furthermore," "In addition," "It is also worth noting," "This highlights the importance of." The essay reads like a well-organized report on a topic, not like someone actually thinking.
Researchers who study AI detection call this template adherence. The text follows its implicit organizational template so faithfully that it feels frictionless. Real human writing has friction — places where the writer changed their mind, added a thought awkwardly, got briefly lost and found their way back.
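For readers who like to see an idea as code, the evenness of paragraph lengths can be roughly measured. The Python sketch below is only an illustration, not a detector. The function name `paragraph_length_spread` is hypothetical, it assumes paragraphs are separated by blank lines, and it uses the coefficient of variation (standard deviation divided by mean) as one possible stand-in for "friction."

```python
import statistics


def paragraph_length_spread(text: str) -> float:
    """Return the coefficient of variation of paragraph word counts.

    Hypothetical helper for illustration. Assumes paragraphs are
    separated by blank lines. A value near 0 means the paragraphs are
    all about the same length; larger values mean uneven paragraphs.
    """
    # Split on blank lines and ignore empty chunks.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    lengths = [len(p.split()) for p in paragraphs]

    # With fewer than two paragraphs there is nothing to compare.
    if len(lengths) < 2:
        return 0.0

    mean_length = statistics.mean(lengths)
    # Population standard deviation, scaled by the mean so that long
    # and short essays can be compared on the same footing.
    return statistics.pstdev(lengths) / mean_length
```

A low number only says the paragraphs are similar in length. Plenty of careful human writing is even, and uneven AI text can be prompted into existence, so the result is a reason to look closer at the paragraph logic, not a verdict.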
One of the most consistent findings in AI text research is a pattern in how AI uses hedging language — words and phrases that soften or qualify a claim. Phrases like "it is important to note," "research suggests," "this underscores the need for," "many experts believe," "a nuanced approach is required." These phrases perform carefulness without actually being careful.
Human writers hedge too. But they hedge about specific things they're actually uncertain about. An expert writing about climate science might say "the exact feedback timelines are still contested" — a real hedge about a real disagreement. An AI writing about climate science might say "it is important to note that climate change presents complex challenges" — a hedge about nothing in particular, placed there because it sounds appropriately measured.
In 2023, researchers at the University of Pennsylvania analyzed over 50,000 AI-generated texts and found that certain phrases appeared at statistically anomalous rates compared to human writing. Phrases like "in conclusion," "it is worth noting," "delve into," "multifaceted," and "underscores the importance" appeared in AI writing roughly 3 to 8 times more frequently than in comparable human texts. The AI wasn't using these phrases because they fit — it was using them because it had learned that formal writing uses them.
This is a detectable fingerprint, but it's fading. As people publish more AI-generated text on the internet, future AI models train on it and become less obviously formulaic. The window for catching AI by its hedge phrases is closing.
Phrases that appear disproportionately in AI writing: "delve into," "it is important to note," "nuanced," "multifaceted," "underscores," "in today's rapidly evolving landscape," "a comprehensive understanding," "crucial to recognize." Not proof of AI — but worth pausing on when they cluster.
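One way to make "worth pausing on when they cluster" concrete is to count how often the listed phrases appear per 1,000 words. The Python sketch below is a reading aid under stated assumptions, not a detection method. The function names are hypothetical, the phrase list is copied from the paragraph above, and no threshold is suggested because none is given here.

```python
import re
from typing import Dict

# Phrases copied from the list above. Matching is case-insensitive.
FLAGGED_PHRASES = [
    "delve into",
    "it is important to note",
    "nuanced",
    "multifaceted",
    "underscores",
    "in today's rapidly evolving landscape",
    "a comprehensive understanding",
    "crucial to recognize",
]


def flagged_phrase_counts(text: str) -> Dict[str, int]:
    """Count whole-word occurrences of each flagged phrase (hypothetical helper)."""
    # Lowercase the text and normalize curly apostrophes so "today's" matches.
    lowered = text.lower().replace("\u2019", "'")
    counts: Dict[str, int] = {}
    for phrase in FLAGGED_PHRASES:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[phrase] = hits
    return counts


def flagged_per_1000_words(text: str) -> float:
    """Return flagged-phrase occurrences per 1,000 words (hypothetical helper)."""
    word_count = len(text.split())
    if word_count == 0:
        return 0.0
    total_hits = sum(flagged_phrase_counts(text).values())
    return 1000 * total_hits / word_count
```

Calling `flagged_phrase_counts(essay_text)` on a draft shows which phrases appear and how often. A high rate is a prompt to reread the passage for real content, not evidence of who or what wrote it, and the same phrases turn up in ordinary human writing.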
Teachers who can tell human writing from AI writing often say they're looking for "voice." This is real, but it's vague unless you break it down. What they're actually detecting is a collection of specific things: the presence of concrete, specific personal detail; the occasional sentence that doesn't quite work but tries something; the sense that the writer has a particular relationship to this material rather than surveying it from above.
Human writers make idiosyncratic choices — unexpected word picks, comparisons that are a little strange, sentences that break the rules in a way that works. These aren't errors; they're signatures. They prove that a mind with a specific history and set of associations was at work, not a system optimizing for general readability.
An interesting test: ask yourself whether the writing could be about any topic, or only this one. AI essays about the French Revolution and AI essays about photosynthesis often have the same tone, the same structure, the same emotional register. Human essays about things people care about tend to feel different from human essays about things they were assigned. The caring shows up in the texture of the language.
You now have a structural lens, not just a surface one. When you read something, you can ask: does this paragraph organization reflect actual thinking, or template execution? Does the hedging refer to real uncertainty, or is it decorative caution? These questions work on AI writing, but they also make you a better reader of all writing — including your own.
Here's an uncomfortable fact: every time researchers publish a new list of AI writing fingerprints, those fingerprints start disappearing. The reason is straightforward. If a paper says "AI overuses the word 'delve,'" then people who want to hide AI writing tell the AI not to use "delve." Turnitin and GPTZero publish detection methods; AI developers update their models; the detectors update their algorithms; the cycle continues.
By late 2024, the most sophisticated AI writing tools, when prompted carefully, produce text that defeats commercial AI detectors at rates above 85%. This is documented in research from Stanford's HAI group. It means you can't outsource your detection to a software tool and feel safe.
The real skill isn't running text through a detector. It's understanding what you're actually evaluating when you read: Is this person demonstrating their thinking, or producing the appearance of thinking? That question applies regardless of whether AI was involved. And it's a question that software can't answer for you.
If AI writing fingerprints keep disappearing, and detection tools keep failing, what obligations do we have? Should writers be required to disclose AI use? If someone uses AI to polish their writing but all the ideas are their own, have they done something wrong? What if they used it to help them with a language barrier? There's no consensus on any of these questions — which means the rules you encounter in school and work right now are being improvised in real time.
Read the passage below carefully. Then tell your lab partner what structural features you notice — things like paragraph organization, transition words, hedging language, and whether the writing feels like someone actually thinking versus someone producing a well-organized summary. Commit to a judgment: does this feel like AI? What's your evidence?
List the specific structural features you notice, then give your verdict. Your lab partner will challenge your reasoning.
In September 2023, the U.S. Federal Trade Commission began issuing warnings about a specific practice it called "AI-generated fake reviews." The agency had tracked hundreds of thousands of product reviews on major platforms, including Amazon, Walmart, and Yelp, that were demonstrably AI-generated. These reviews didn't just exist; they deployed the specific vocabulary of each product category. A fake review of running shoes would include correct technical terms like "stack height," "heel-to-toe drop," and "carbon fiber plate." A fake medical device review would correctly use clinical language about efficacy, biocompatibility, and FDA classification.
Consumers reading these reviews couldn't tell the difference from reviews written by real domain experts. The reviews sounded more knowledgeable than most genuine user reviews. The FTC estimated that fake AI reviews were influencing billions of dollars in purchasing decisions. In August 2024, they formally banned the practice and began levying fines — but enforcement experts noted that detection is nearly impossible at scale.
The deeper issue: specialized vocabulary had always been a proxy for expertise. If someone used the right terms correctly, they probably knew what they were talking about. AI erased that assumption. And it erased it first in the places where people most rely on expert guidance — health, finance, legal advice, technical products.
When a large language model is trained, it ingests enormous quantities of text from every domain imaginable — medical journals, law reviews, technical manuals, academic papers, financial filings. It learns the vocabulary of each domain and, crucially, it learns the syntactic patterns that domain uses. Not just "what words do cardiologists use" but "how do cardiologists structure a diagnostic assessment sentence."
This means AI can produce text that reads like it was written by a cardiologist, a securities lawyer, a structural engineer, or a philosophy professor — and do it without any actual expertise in those fields. It's not reasoning like an expert; it's pattern-matching to what expert writing in that domain looks like. The distinction matters enormously, but the output is often indistinguishable on the surface.
A striking demonstration: researchers at the University of Chicago in 2023 fed GPT-4 the entire bar exam and had it produce answers. It passed, scoring in the 90th percentile. Then they asked it follow-up questions that required genuine legal reasoning about a novel hypothetical not covered in any training data. Performance dropped sharply. The AI knew how lawyer language works; it didn't know how law works.
In 2023 and 2024, a cluster of studies documented a specific and alarming version of domain mimicry in medical contexts. Researchers found AI-generated health information spreading on social media, YouTube descriptions, and health forums that was technically fluent — it used correct anatomical terms, cited real concepts in pharmacology, described legitimate-sounding treatment protocols — but contained dangerous errors. The errors were invisible to non-experts because the surrounding language was so convincing.
One documented case from 2023: a popular TikTok health account posted AI-generated summaries of studies on magnesium supplementation. The summaries used correct biochemistry vocabulary and cited actual published journals. But the dosage recommendations were wrong — in some cases, suggesting amounts that could cause cardiac arrhythmias. The account had 2.3 million followers. The error was caught by a cardiologist in the comments section who recognized that the language was correct but the reasoning was off.
The cardiologist's detection method is instructive: she didn't flag the vocabulary. She flagged the reasoning structure. Expert domain knowledge isn't just about having the right words — it's about knowing which considerations need to be weighed against each other, what the counterarguments are, and where the uncertainties in the field actually lie. AI text often has the words without the weighing.
When evaluating domain-specific AI content, don't ask "does this use the right terms?" Ask: "Does this text know what it doesn't know? Does it identify the genuine uncertainties and trade-offs in this field, or does it just list the main points?" Expertise is visible in what's left unsaid as much as in what's stated.
The people most harmed by AI domain mimicry are not the people with the most expertise. They're the people with the least, and specifically people who are turning to authoritative-sounding text because they don't have access to real experts. Someone who can't afford a lawyer reading an AI-generated legal summary. Someone in a country with limited healthcare access following an AI-generated medical protocol. A first-generation college student reading AI-generated advice about financial aid that sounds completely authoritative and is partially wrong.
This is a genuine equity issue, not just a technical one. Access to real experts has always been unequally distributed. AI was supposed to help democratize that access. Instead, it's in some cases delivering a convincing simulation of expert knowledge that can actually widen the harm gap — the appearance of expert guidance without the safety net that real expertise provides.
Researchers at the Brookings Institution documented this pattern in 2024, specifically analyzing AI use for legal and medical advice among lower-income populations. They found that while AI tools did provide genuinely useful information in the majority of cases, the error cases — where the AI was confidently, fluently wrong — tended to cluster around edge cases that were also the most consequential for the people asking.
In 2024, the U.S. Congress held three separate hearings on AI-generated medical misinformation. The EU's AI Act includes specific provisions about high-stakes domains including healthcare and legal advice. These aren't future problems — they're problems that policymakers are trying to solve right now, with imperfect tools, in real time. Understanding the mechanics of domain mimicry puts you ahead of most adults following this debate.
Here is a real tension that policymakers, doctors, and technologists are actively arguing about: AI-generated medical information, even imperfect AI-generated medical information, is in many cases better than no information at all. For someone in a remote area with no access to a doctor, a mostly-right AI summary of medication side effects might genuinely save their life. The alternative — no information — might be worse than imperfect information.
But the people most vulnerable to being harmed by AI domain mimicry are also the people who most need accessible expert guidance. The solution of "get a real expert to verify" is not available to everyone equally. And warning labels on AI medical content — "this is not professional advice" — are not reliably read or understood.
If you think about this carefully, you'll notice that no clean position exists. "AI health information should be restricted to protect people" harms access for the most vulnerable. "AI health information should be unrestricted to maximize access" exposes the most vulnerable to dangerous errors. Every institutional policy in this space is currently a bet on which harm is worse — and no one actually knows.
Below is a passage written in the style of a medical/nutrition expert. The vocabulary is largely correct. Your job is to apply the cardiologist's method from Lesson 3: look at the reasoning structure, not the vocabulary. Does this text know what it doesn't know? Where does it present contested or uncertain information as settled fact? Where are the trade-offs missing?
Tell your lab partner specifically where the reasoning fails, even though the vocabulary is correct. What would a real nutrition researcher say about the claims that need caveats?
In October 2023, a Wall Street Journal investigation documented something remarkable happening inside Amazon's trust and safety team. The company had developed sophisticated AI detection tools to catch fake AI-generated reviews — tools that analyzed sentence structure, vocabulary patterns, and writing style. The tools had an accuracy rate that Amazon described as "above 90%."
Within six weeks of internal deployment, the tools' accuracy had dropped to below 70%. What happened? The people generating fake reviews had access to the same public research on AI writing signatures that Amazon's detectors used. They updated their prompts. They told their AI to vary sentence length, avoid signature phrases, and include deliberate "authenticity markers" — misspellings, colloquial expressions, first-person anecdotes about using the product. The detectors couldn't keep up.
The Amazon engineer who spoke to the Journal, anonymously, said something that has stuck with researchers ever since: "The more we taught our detectors to look for specific patterns, the more we trained the adversaries to remove those patterns. The only thing that doesn't get gamed is asking whether the content makes sense — whether there's a real person's thinking behind it."
That's the insight this lesson builds on. Not a list of patterns. A way of asking whether thinking happened.
Over three lessons, you've built up a set of specific observations about how AI writing works. Now let's consolidate them into a portable checklist — five questions you can ask about any piece of writing, in order. None of these rely on specific vocabulary patterns that can be patched. They all point at something deeper: whether a mind with genuine experience was actually at work.
Question 1: Does this text have a specific point of view, or does it survey from above? Real writers have a relationship to their material. They take a position, or they acknowledge that they're uncertain about taking one. AI text often describes a landscape of opinions without landing in it.
Question 2: Does the writing linger where things are complicated? Genuine expertise shows up as asymmetric attention — the writer spends more time on the hard parts. AI distributes attention evenly across easy and hard. Every point gets equal treatment because the system doesn't know which points are actually harder.
Question 3: Does the text acknowledge what it doesn't know? This is the cardiologist's question. Real expertise is calibrated — it knows where the evidence is strong and where it isn't. AI text is often uniformly confident, even across claims that experts would hedge significantly.
Question 4: Is there any specific, irreplaceable detail? Human writers anchor in personal experience or specific cases that only someone who was there could know. AI often uses detail that feels specific ("In a 2019 study...") but could have been generated from a pattern rather than from actual knowledge of that study.
Question 5: Does the conclusion follow from the preceding argument, or does it just re-describe it? AI conclusions frequently summarize rather than conclude — they restate what was just said rather than drawing a genuine inference from it. Human writers who have actually worked through an argument arrive somewhere new at the end.
Here's how the checklist looks in practice. Consider a paragraph from a 2023 magazine article about a musician, written by a human critic:
"There's a moment, about two and a half minutes into 'Midnight Rain,' where Swift pauses the synth line and lets the vocal melody sit on top of silence for exactly one beat too long. It shouldn't work. And then it does. That gap is what makes her an interesting songwriter instead of a competent one — she knows where the fall is and she jumps anyway."
Run the checklist: specific point of view — yes, a real claim about what makes her interesting. Asymmetric attention — yes, lingers on one specific moment. Acknowledges uncertainty — yes, "it shouldn't work." Irreplaceable detail — yes, the specific track and specific moment. Genuine conclusion — yes, arrives at a characterization that wasn't stated at the start.
Now compare to AI-generated music criticism: "Taylor Swift's songwriting demonstrates her deep understanding of musical dynamics. Her use of contrast between silence and sound creates memorable emotional moments. Research and critical consensus suggest she is one of the most successful artists of her generation. In conclusion, Swift's approach to songwriting underscores her significant influence on contemporary pop music."
Checklist: specific point of view — no, surveying consensus. Asymmetric attention — no, every sentence weighted equally. Acknowledges uncertainty — no, everything is equally confident. Irreplaceable detail — no, nothing that couldn't have been generated. Genuine conclusion — no, restates the introduction.
The checklist identifies the absence of genuine thinking; it doesn't prove AI involvement. Bad human writing fails these tests too. A rushed student essay, a corporate press release, or a bureaucratic report can fail every one of these checks. What the checklist flags is writing that didn't involve genuine thought, which is not necessarily writing that came from a machine.
The checklist is a tool for evaluating whether writing reflects genuine thinking. It's most useful in high-stakes situations: when you're deciding whether to trust a piece of information that might affect a decision you make; when you're evaluating whether a source has genuine expertise or is pattern-matching; when you're trying to determine if something is worth sharing or citing.
It's not a useful tool for low-stakes, informal contexts. If your friend sends you an AI-drafted birthday message, running the five-question checklist is probably not the move. The tool should match the stakes.
There's also a version of this checklist you can apply to your own writing. If you've drafted something and want to know whether it actually represents your thinking, ask the five questions. If your conclusion just restates your introduction, you haven't finished thinking yet. If every paragraph gets equal space, you haven't identified what's actually complicated about your topic. The checklist isn't just for catching AI — it's a description of what real thinking looks like on the page.
Most readers — including most adults, most teachers, most journalists — still rely on fluency, vocabulary, and confident tone as signals of trustworthiness. You now have a framework that goes beneath all three. You're asking whether genuine thinking happened, not whether the result sounds like it did. That's a fundamentally different skill, and it's increasingly rare.
A final honest accounting. The five-question checklist works well right now, but it won't stay that way. AI models are improving rapidly, and within the next two to three years the most sophisticated AI systems, when prompted by skilled users, will likely produce writing that passes the checklist more reliably. The asymmetric attention, the acknowledgment of uncertainty, the irreplaceable specific detail: these can all be prompted into existence by someone who knows to ask for them.
What this means is that detection isn't a destination you reach once and then stay at. It's a practice you maintain. New AI capabilities require updating your framework. The underlying principle, asking whether genuine thinking happened, remains stable even as specific indicators shift. That's the thing to hold onto: not the checklist, but the question the checklist is pointing at.
And there's an even harder question underneath all of this: if AI can eventually produce text that demonstrates all the markers of genuine thinking — specific points of view, calibrated uncertainty, irreplaceable detail — does that text then deserve the same epistemic status as human thinking? What is "genuine thinking," anyway? These are questions that philosophers, AI researchers, and educators are actively arguing about right now, with no consensus in sight. They're also, increasingly, questions that have legal and institutional consequences. Where you land on them — or whether you're willing to sit with not landing — matters.
Below is a passage. Apply all five questions from the checklist to it: specific point of view, asymmetric attention, acknowledges uncertainty, irreplaceable detail, genuine conclusion. Give a verdict on each question, then give your overall judgment. Your lab partner will push back on any weak reasoning.
Go through each of the five checklist questions, give a verdict (pass/fail) with one sentence of reasoning for each, then deliver your overall judgment.