Module 6 · Lesson 1

Why Different Tools Give You Different Answers

Same prompt, different AI, and sometimes a completely different result. Here's why that's not random.
If you're already getting decent outputs from one AI, what do you actually gain by understanding the others?

Priya is a junior at a state university, finishing a cover letter at midnight for a UX research internship at a product studio in Austin. She types the same prompt into both ChatGPT and Claude: "Write a professional cover letter for a UX research internship, emphasizing user empathy and a background in psychology." The ChatGPT version comes back polished, confident, a little generic: the kind of letter that reads like it could belong to any of fifty applicants. The Claude version is longer, more nuanced, and includes a paragraph that almost sounds like it's asking the hiring manager a question. Neither letter is wrong. But they're clearly not the same.

Priya doesn't know why they're different, so she combines them manually and submits. She gets an interview. But she's left wondering: was that the best she could have done? If she'd understood why those tools responded differently, she could have made a deliberate choice, not just a paste job.

The Architecture Isn't the Difference: Training Is

Every major AI chatbot (ChatGPT, Claude, Gemini, Mistral, Llama-based products) is built on a transformer architecture. That part is roughly the same across the board. The reason they behave differently comes from three distinct layers: what data they trained on, how they were fine-tuned, and what their RLHF (reinforcement learning from human feedback) optimized for.

OpenAI's GPT-4o was trained and fine-tuned with a heavy emphasis on being useful across a wide range of tasks quickly. It was also shaped by massive user feedback loops, which means it got really good at producing things that feel polished and satisfying, even if they lack depth. Anthropic built Claude with a framework called Constitutional AI, which trained the model to evaluate its own outputs against a set of principles. That makes Claude more likely to hedge, qualify, or push back on things it finds poorly framed. Google's Gemini was trained with especially strong emphasis on factual retrieval and integration with real-time information; in effect, it's a research engine that learned to talk.

Understanding these design philosophies isn't just trivia. It directly predicts how each tool will respond to different prompt structures, tones, and task types.

Why This Matters to You Right Now

If you're using one AI for everything (cover letters, code debugging, brainstorming, fact-checking), you're almost certainly leaving capability on the table. Not because the other tools are "better," but because different tasks map to different strengths, and the strengths are real and consistent.

A Mental Model: Personality Types for AI Tools

Here's a framing that's a little reductive but genuinely useful: think of the major AI tools as having distinct working styles, the way collaborators do.

ChatGPT (GPT-4o) is the fast, confident generalist. It produces output immediately, with high surface polish. It's trained to please, which means it will rarely tell you your idea is bad; it will build on it. That's great for brainstorming, drafting, and getting from zero to a first version. It's less great when you actually need someone to push back on your assumptions.

Claude (claude-3.5/3.7 Sonnet, claude-opus) is the careful analyst who writes in full sentences. It asks clarifying questions, adds nuance you didn't request, and sometimes writes more than you wanted. It's been trained to reason about ethics, uncertainty, and framing. If you give Claude a poorly constructed argument, it will often notice and say so, sometimes helpfully, sometimes annoyingly. It excels at long-form reasoning, complex editing, and tasks where you want a genuinely considered opinion, not just agreement.

Gemini (Google's model) is the researcher with a live internet connection. When you need a synthesis of current information (recent policy changes, recent science, what happened last week), Gemini is frequently more reliable than models with training cutoffs. Its prose is sometimes more mechanical, but its factual grounding is a genuine asset for anything that requires being current.

Smaller/open models (Mistral, Llama-based tools like Meta AI, Perplexity's model) are more unpredictable but often faster, cheaper, or available in contexts where the big three aren't. Their consistency depends heavily on the specific deployment, but they're worth knowing about.

Tool | Strongest at | Watch out for | Prompt style it rewards
GPT-4o | Fast drafts, code, creative work, wide task range | Over-confident, rarely challenges bad premises | Direct, task-focused, specific format requests
Claude | Long reasoning, editing, nuanced argument, analysis | Can be verbose, sometimes over-qualifies | Context-rich, conversational, reasoning-forward
Gemini | Current info, research synthesis, Google integration | Less creative, can be dry prose | Research-style questions, "as of [date]" framing
Mistral/Llama | Speed, local deployment, customization | Inconsistent quality, less instruction-following | Concise, simple, structured prompts
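
One way to keep the table handy is as a lookup. The sketch below is only illustrative: the task category names and the `recommend_tool` helper are hypothetical, and the mapping restates the table's claims rather than any measured benchmark.

```python
# Hypothetical task categories, each mapped to the tool the table lists as strongest.
STRONGEST_TOOL = {
    "fast_draft": "GPT-4o",
    "code": "GPT-4o",
    "long_reasoning": "Claude",
    "editing": "Claude",
    "current_info": "Gemini",
    "research_synthesis": "Gemini",
    "local_deployment": "Mistral/Llama",
}

# The "Watch out for" column, keyed by tool.
WATCH_OUT_FOR = {
    "GPT-4o": "over-confident, rarely challenges bad premises",
    "Claude": "can be verbose, sometimes over-qualifies",
    "Gemini": "less creative, can be dry prose",
    "Mistral/Llama": "inconsistent quality, less instruction-following",
}


def recommend_tool(task_category):
    """Return (tool, caveat) for a known task category; raise for unknown ones."""
    if task_category not in STRONGEST_TOOL:
        known = ", ".join(sorted(STRONGEST_TOOL))
        raise ValueError(f"Unknown task category {task_category!r}; try one of: {known}")
    tool = STRONGEST_TOOL[task_category]
    return tool, WATCH_OUT_FOR[tool]


tool, caveat = recommend_tool("current_info")
print(f"Use {tool}, but watch out: {caveat}")
```

Returning the caveat alongside the tool is a deliberate choice: a recommendation without the matching "watch out for" note invites the one-tool-for-everything habit this lesson argues against.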
The Peer Situation: Everyone's Using One Tool

Here's what's actually happening among people your age right now: the overwhelming majority are using exactly one AI tool, usually whatever they first signed up for, which for most people was ChatGPT. Not because it's objectively the best tool for every situation, but because it was first, it's easy, and switching feels like extra effort with no obvious payoff.

That's understandable. It's also a mild but real disadvantage. The people getting the most out of AI right now (in classes, internships, personal projects) aren't necessarily using more sophisticated prompts. Some of them have just figured out that a task that produces mediocre output on one tool might produce excellent output on another. Priya's cover letter problem wasn't really about prompt engineering. It was about not knowing that Claude was probably the better choice for a nuanced, reflective writing task, while ChatGPT would have been better for quickly generating five structural variations she could pick from.

The practical takeaway from this lesson: Next time you get output you're not satisfied with, before you spend 20 minutes re-prompting, ask yourself: would a different tool do this better by design? Give it five minutes with a competitor and see.

Try This Tomorrow

Take a task you use AI for regularly: summarizing readings, drafting messages, brainstorming ideas. Run the same prompt on two tools. Don't try to make one better than the other. Just observe: what did each one emphasize? What did each one skip? What does that tell you about their training priorities?

Lesson 1 Quiz

5 questions: apply what you just read
1. What is the primary reason ChatGPT, Claude, and Gemini produce different outputs from the same prompt?
Correct. All three use transformer architectures; the differences come from training data, fine-tuning strategies, and RLHF optimization targets, not the base math.
Not quite. The architectural differences are minor compared to the training and fine-tuning differences. RLHF optimization is especially important because it shapes the model's "personality."
2. You're a freelancer writing a strategic analysis memo for a client: 1,500 words, layered argument, needs to hold up to scrutiny. Which tool is most likely to serve you best by design?
Right. Claude's design prioritizes careful reasoning, self-evaluation, and nuance, exactly what a scrutiny-worthy strategic memo needs. Speed matters less than argument quality here.
Think about what the task actually demands: sustained reasoning, argument structure, the ability to hold up to scrutiny. Which tool is trained to care about those things most?
3. Anthropic's "Constitutional AI" approach primarily affects Claude by making it:
Correct. Constitutional AI trains the model to self-critique against a set of principles, which manifests as more qualifications, more willingness to say "this framing has a problem," and sometimes more verbosity.
Constitutional AI is a self-evaluation framework: it makes Claude assess its own outputs, which generally makes it more cautious, not faster or more agreeable.
4. Your friend tells you: "I just use ChatGPT for everything because it's the best." Based on what you learned in this lesson, what's the most accurate response?
Exactly. No single tool is best at everything. ChatGPT is genuinely excellent for many things. But the claim that it's "the best" for all tasks ignores consistent, design-level differences in what each tool optimizes for.
This isn't about which tool is universally "best"; that framing is the mistake. Different tools have genuine strengths in different categories, and those strengths are consistent because they come from training design.
5. Which of the following is the best description of what Gemini is specifically optimized to do well?
Correct. Gemini was built by Google with strong emphasis on factual retrieval and real-time information; it's essentially a search engine that learned to reason and write.
Think about who built Gemini and what their core product already does. Google's DNA is information retrieval, and that shows up in what Gemini is optimized for.

Lab 1: Tool Selection Consultant

You're advising a peer on which AI tool to use for their task. Make a real judgment call.

Your Role: AI Tool Advisor

Your peer is about to use AI for a task. They don't know which tool to use and they're going to ask you. Your job isn't to say "it depends"; it's to make a specific recommendation and defend it based on what you know about how these tools are designed.

The AI in this lab will play your peer: direct, a little skeptical, and willing to push back if your reasoning is weak. After you've given your recommendation, justify it. If you're not sure, say why, but make a call.

Try starting with: "What's your task?" and let the peer describe what they're trying to do. Then recommend a tool and explain the design logic behind your choice.
Lab Assistant
Tool Selection
Hey, I need to use AI for something and I honestly have no idea which tool to use. You just did this module, right? Walk me through it. Tell me your task first, or just ask me what I'm doing and help me pick.
Module 6 · Lesson 2

How ChatGPT Actually Works, and What That Means for Your Prompts

GPT-4o is the tool most people use. It's also one of the most misunderstood.
Why does ChatGPT give you a confident, polished answer even when it's wrong, and how do you account for that?

Marcus is a sophomore studying communications, and he's using ChatGPT to research a persuasion paper on social proof. He asks: "What are the most important studies on social proof from the last five years?" ChatGPT gives him five citations, complete with author names, journal titles, and publication years. Marcus skims them (they look legit) and drops them into his paper. His professor flags three of them as non-existent. The citations were fabricated.

Marcus is furious at ChatGPT. But here's the thing: ChatGPT didn't lie to him in the way a person lies. It did exactly what it was designed to do: predict the next most plausible token in the sequence. "Author Name. Journal Title. Year. Page numbers." looks exactly like what a real citation looks like. From the model's perspective, it generated a valid-seeming continuation of the text pattern. It had no way to know whether those specific papers existed. Marcus's mistake wasn't trusting AI; it was using ChatGPT for a task that requires factual accuracy about specific real-world objects.

The Fundamental Thing About GPT-4o: It Predicts, It Doesn't Retrieve

Every response from any GPT model is a prediction. On its own, without a browsing or search tool attached, the model doesn't look things up. It doesn't have a database of facts it checks against. It generates text by predicting what tokens (words, parts of words) should come next given everything it's seen before, weighted by patterns learned during training. This is powerful. It's also a profound source of unreliability for specific factual claims.
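
To make "prediction, not retrieval" concrete, here is a toy sketch. It is a bigram counter, vastly simpler than a transformer, and the four-line corpus is invented for illustration (none of these are real papers). It shows only the failure pattern: picking the most likely next token can assemble a citation-shaped string that never appeared in the training text.

```python
from collections import Counter, defaultdict

# Invented mini-corpus of citation-like lines; these are NOT real references.
CORPUS = [
    "lee 2020 review of marketing",
    "smith 2020 journal of persuasion",
    "jones 2020 journal of persuasion",
    "kim 2021 journal of persuasion",
]


def train_bigrams(lines):
    """Count, for each token, which tokens followed it in the corpus."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_tokens=5):
    """Greedily append the most frequent follower of the last token."""
    out = [start]
    for _ in range(max_tokens - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing ever followed this token in the corpus
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


model = train_bigrams(CORPUS)
citation = generate(model, "lee")
print(citation)            # lee 2020 journal of persuasion
print(citation in CORPUS)  # False: plausible pattern, but no such line exists
```

The generator never checks whether "lee" wrote anything in a "journal of persuasion"; it only follows the most common pattern, which is the same structural gap behind Marcus's fabricated citations.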

GPT-4o is especially likely to produce confident, fluent, polished-sounding text, because that's what it was rewarded for. Human feedback during RLHF consistently rated fluent, confident responses higher than uncertain or hedging ones, which trained the model to produce exactly that: confident text, whether or not it's accurate. This isn't OpenAI being malicious. It's a known consequence of training for satisfaction.

The practical implication: ChatGPT is excellent for tasks where fluency and pattern are more important than factual precision. First drafts. Brainstorming. Restructuring existing text. Writing code from clear specifications. Creating outlines. All of these are "generate a plausible continuation of this pattern" tasks, which is exactly what GPT-4o is built for.

Hallucination: When a language model generates factually incorrect information with high confidence. Not a malfunction, but a predictable consequence of token prediction optimized for fluency.
RLHF: Reinforcement Learning from Human Feedback. The process that shapes model "personality": human raters signal which outputs are better, training the model to produce more of those.

What GPT-4o Actually Does Well, and How to Prompt for It

GPT-4o's strongest use cases fall into a few clear categories, each with prompting strategies that extract better output:

Fast structural drafting. When you need a first draft, outline, or structure quickly, GPT-4o is often the fastest path. The key prompt move here is to specify format explicitly. "Write a 5-section outline for an essay arguing X, using bullet points with 2 sub-points each" will produce better structure than "help me outline my essay." Format instructions are one of the things GPT-4o follows most reliably.

Creative iteration. GPT-4o is good at generating multiple versions of something. If you ask for five different approaches to the same email opening, it will actually give you meaningfully different versions. Contrast this with Claude, which tends to give you one considered version and explain why it made the choices it did. For creative tasks where you want raw options to pick from, GPT-4o's "give me variety" mode is an asset.

Code and technical scaffolding. GPT-4o writes competent code quickly for standard tasks. The caveat is that it will also confidently write wrong code. Always test. Never trust output for functions that touch real data, authentication, or payments without review.

Tone matching and rewriting. Paste in some existing text and ask GPT-4o to write something new in the same style. It's unusually good at picking up on tone, register, and voice, better than most other models for this specific task.

The Prompt Pattern That Unlocks GPT-4o

Give it explicit constraints: format, length, tone, audience, and any content limits. GPT-4o is trained to follow detailed instructions closely. The more specific your constraints, the less creative latitude it takes, and the more reliable the output becomes. Vague prompts give it room to hallucinate. Constrained prompts channel its prediction engine in a useful direction.

The Peer Reality: How People Are Actually Using (and Abusing) ChatGPT

The dominant way people use ChatGPT right now is: open it, type what they want, and either accept or re-prompt once. Most people treat it like a more capable Google search. That works fine for a lot of tasks. It fails predictably for specific factual claims, for tasks that require genuine nuance or pushback, and for long documents where the model loses context.

The subtle thing most people miss is the confidence calibration problem. ChatGPT will give you the same tone whether it's very certain or completely guessing. It doesn't say "I'm not sure about this" the way Claude sometimes does. That means you have to calibrate yourself: you have to know which categories of claims to verify externally. Specific citations, specific statistics, specific dates, names of real people in niche fields: always check. General explanations of concepts, structural outlines, rewrites of existing content: generally reliable.

The practical takeaway: Use ChatGPT like a fast, confident collaborator who produces great rough drafts but sometimes makes up sources. The solution isn't to distrust it; it's to know which outputs need fact-checking and which don't.
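
The "always check" categories can be turned into a rough first-pass scan of a draft. This is a minimal sketch under loud assumptions: the regular expressions and the `flag_for_verification` helper are hypothetical, they catch only obvious surface patterns (four-digit years, percentages, "et al." and "Journal of" phrasing), and they cannot detect names of real people or judge whether a claim is true.

```python
import re

# Hypothetical surface patterns for claim types the lesson says to verify.
# Non-capturing groups keep findall() returning the whole matched snippet.
VERIFY_PATTERNS = {
    "specific date": re.compile(r"\b(?:19|20)\d{2}\b"),
    "specific statistic": re.compile(r"\b\d+(?:\.\d+)?\s?%"),
    "citation": re.compile(r"\bet al\.|\bJournal of\b"),
}


def flag_for_verification(text):
    """Return {category: [matched snippets]} for text worth checking externally."""
    found = {}
    for category, pattern in VERIFY_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[category] = matches
    return found


draft = "Social proof raised sign-ups by 34% (Smith et al., 2021)."
for category, snippets in flag_for_verification(draft).items():
    print(f"Check {category}: {snippets}")
```

An empty result does not mean the text is reliable; it only means none of these particular patterns appeared.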

Prompting Upgrade: The Constraint Stack

Before sending your next ChatGPT prompt, add three constraints you weren't planning to include: (1) a specific output format, (2) a specific audience, and (3) one thing to avoid. Watch what changes. The additional specificity almost always improves output quality, not because ChatGPT needs more context philosophically, but because it needs a narrower target to aim at.
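
The constraint stack can be sketched as a small helper that bolts the three constraints onto a base request. The function name `add_constraint_stack` and the "Format / Audience / Avoid" labels are assumptions for illustration; any clear wording of the three constraints would serve.

```python
def add_constraint_stack(task, output_format, audience, avoid):
    """Append the three constraints (format, audience, one exclusion) to a task."""
    for name, value in [("task", task), ("output_format", output_format),
                        ("audience", audience), ("avoid", avoid)]:
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    return "\n".join([
        task.strip(),
        f"Format: {output_format.strip()}",
        f"Audience: {audience.strip()}",
        f"Avoid: {avoid.strip()}",
    ])


prompt = add_constraint_stack(
    task="Write an outline for an essay arguing that social proof drives online reviews.",
    output_format="5 sections, bullet points with 2 sub-points each",
    audience="a communications professor grading for argument structure",
    avoid="citing specific studies or statistics",
)
print(prompt)
```

The "avoid" line in the usage example is chosen on purpose: excluding specific citations steers the model away from the category of output most likely to be fabricated.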

Lesson 2 Quiz

5 questions: understand ChatGPT's design deeply enough to use it well
1. What is the core mechanism behind ChatGPT's text generation?
Correct. ChatGPT is fundamentally a prediction engine: it generates text by predicting the next token, not by retrieving facts or performing logical deductions.
ChatGPT doesn't retrieve or look anything up. Its entire output is generated by predicting what tokens should come next based on patterns learned during training.
2. Marcus's fake citation problem happened because:
Exactly right. "Author. Journal. Year." is a plausible citation pattern; the model generates it confidently because it matches what citations look like, not because it verified the paper exists.
There's no intent involved. It's a structural problem: citations look like a specific text pattern, and the model generates that pattern without any way to verify whether real objects behind the pattern exist.
3. You're helping a friend write five different opening lines for a pitch email. Which approach best uses ChatGPT's actual strengths?
Right. GPT-4o's strength for creative tasks is generating variety quickly. Asking for explicit tone variants leverages its ability to explore option space. You pick the best; it does the generation.
ChatGPT is actually quite good at creative drafting, especially when you ask it for multiple variants. The "give me five versions with different tones" prompt is a classic way to use its generative range well.
4. Why does ChatGPT sound equally confident whether it's correct or hallucinating?
Correct. This is one of the most important things to understand about GPT-4o. Human raters during RLHF preferred fluent, confident answers, so the model learned to produce them. Confidence is a style choice baked into the training, not a reliability signal.
The answer is in the training process. RLHF shaped the model to produce the outputs humans rated as best, and humans tended to rate confident, fluent answers higher. That's why confidence doesn't track accuracy.
5. The "constraint stack" prompting technique works for ChatGPT primarily because:
Exactly. Vague prompts give GPT-4o a wide prediction space; it fills it with whatever seems most plausible, which often includes problematic assumptions. Constraints narrow that space, directing the generation toward something you'll actually find useful.
Think about what constraints do to a prediction problem: they narrow the target. Less latitude means the model can't wander into plausible-but-wrong territory as easily. That's the mechanism.

Lab 2: The ChatGPT Constraint Challenge

You're going to build a constrained prompt, and the lab AI is going to push back if your constraints are weak.

Your Role: Prompt Architect

You have a task that you'd normally give to ChatGPT. Your job is to build a constraint-stacked prompt for it, specifying format, audience, length, tone, and at least one explicit exclusion. Then explain to your lab partner (the AI here) why each constraint is doing useful work.

The lab AI will evaluate your prompt design: are the constraints real and useful, or are they just padding? It will ask you to justify or revise. This is the kind of critical feedback ChatGPT itself almost never gives you, which is part of why it's worth practicing here.

Start by telling the lab AI what task you're designing the prompt for. Then share your constraint-stacked prompt draft and explain your choices.
Lab Assistant
Prompt Architecture
Alright, let's do this. What task are you designing a constrained ChatGPT prompt for? Give me the task, then show me your first draft of the prompt, and be ready to defend each constraint you included.
Module 6 · Lesson 3

Prompting Claude: Getting the Most Out of a Model That Thinks Out Loud

Claude is built differently. The prompts that work on ChatGPT often underperform here, and vice versa.
What changes in how you write prompts when the model is designed to push back, qualify, and reason rather than just comply?

Deja is a pre-law junior and she's been using Claude to help stress-test arguments for a moot court competition. She typed her argument into Claude and asked: "Is this argument strong?" She expected either validation or a list of counterarguments she could prepare for. Instead, Claude gave her a response that started with: "This argument has real structural clarity, but there are two premises where I'd push back before recommending it to a hostile examiner." Then it actually pushed back. Hard. On premises she thought were solid.

Her first reaction was annoyance: she hadn't asked for criticism. Her second reaction, after she worked through the feedback, was that Claude had just identified the two weakest points in her argument, which were exactly the two points the opposing team attacked during the competition. She finished in the top three. The lesson wasn't just "Claude is better for this." It was: Claude rewards prompts that invite genuine analysis rather than prompts that fish for confirmation.

Why Claude Behaves Differently in Conversation

Claude was built by Anthropic using a training approach called Constitutional AI, where the model is taught to evaluate its outputs against a set of explicit principles before committing to them. This creates a model that thinks about what it's saying more than it thinks about whether you'll like what it says. The practical consequence is that Claude is significantly more likely than ChatGPT to:

- Add unsolicited qualifications when it thinks your premise is shaky
- Point out ambiguity in your request before answering
- Produce longer responses because it includes reasoning, not just conclusions
- Decline or reframe requests it finds ethically problematic rather than just complying

This can feel annoying if you're used to ChatGPT's compliance. But it's a feature, not a bug, if you're using Claude for the right tasks. The trick is knowing how to structure prompts that work with this orientation instead of against it.

Prompts That Actually Work With Claude's Design

Give Claude context and let it reason. Unlike ChatGPT, which responds well to tightly constrained format instructions, Claude actually performs better when you give it context and invite it to reason. A prompt like "Here is my draft argument. I'm presenting to a skeptical audience. Identify the weakest points and explain why they're weak" will get you a more genuinely useful analysis than "List five weaknesses in this argument." The first invites reasoning. The second invites a list.

Ask for steelman and then pushback. Claude is exceptionally good at the intellectual move of "give me the strongest version of the opposing position, then tell me how to respond to it." This is genuinely hard for most AI tools because it requires the model to take a position it doesn't necessarily hold. Claude's training makes it comfortable doing this without compromising on accuracy.

Explicitly invite disagreement. Prompts that say "tell me what's wrong with this" or "where would this fail?" unlock Claude's critical mode better than neutral prompts. Because it's trained to be diplomatically honest rather than dishonestly diplomatic, giving explicit permission for criticism produces more direct responses.

Use Claude for long-document tasks. Claude has one of the longest context windows of any deployed model, and it actually uses context it was given earlier in a conversation more reliably than most competitors. For tasks like reviewing long documents, maintaining consistency across a multi-section piece, or having a sustained analytical conversation, Claude's memory of the conversation tends to be more reliable.

The Verbosity Problem, and How to Handle It

Claude's responses are often longer than you need. It will explain its reasoning when you just wanted the output. The fix: add a direct instruction like "Be direct and concise: give me the answer without explaining your reasoning unless I ask" or "Give me only the output, no meta-commentary." Claude takes these instructions seriously. It won't feel offended. It will comply.

When Claude Is Not the Right Tool

It's worth being honest about where Claude underperforms, because the peer instinct is often to treat each new AI discovery as universally better. Claude is not the right choice when:

You need speed above quality. Claude's responses are frequently longer and more considered, and that takes time. For rapid-fire brainstorming or quick rewrites, ChatGPT is faster.

You want pure creative compliance. If you want the AI to just write what you asked for without questioning your concept, Claude's tendency to evaluate and push back can slow you down. Creative tasks where the directive is "just do it, I'll judge quality myself" are often faster on ChatGPT.

You need current information. Claude has a training cutoff and (in its default state without tools) no real-time web access. For anything where currency matters (recent developments, current stats, what happened last month), Gemini or ChatGPT with browsing is more reliable.

The practical takeaway: Use Claude specifically for tasks where you want a model that will evaluate its own output, challenge weak premises, and produce considered rather than immediate answers. Brief it with context. Invite disagreement. Explicitly request concision if you need it. The output reward for doing this well is real.

A Prompting Formula for Claude

[Context] + [Task] + [Standard it should apply] + [Permission to critique] + [Output format]. Example: "I'm applying to a competitive graduate program in urban planning. Here is my statement of purpose. Evaluate it against what top programs say they're looking for. Tell me what's weak, what's missing, and what's strongest, then give me a revised opening paragraph." That prompt structure uses every dimension of Claude's strengths.
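
The five-part formula can be sketched as a small structure that refuses to render until every part is filled in. The `ClaudePrompt` class and its field names are hypothetical, chosen to mirror the bracketed labels above; the example text is the lesson's own, split across the five slots as one plausible reading.

```python
from dataclasses import dataclass, fields


@dataclass
class ClaudePrompt:
    """One field per bracket in the formula: context, task, standard, permission, format."""
    context: str
    task: str
    standard: str
    critique_permission: str
    output_format: str

    def render(self):
        """Join the five parts in formula order, rejecting any empty part."""
        parts = [getattr(self, f.name).strip() for f in fields(self)]
        empty = [f.name for f, text in zip(fields(self), parts) if not text]
        if empty:
            raise ValueError(f"Missing prompt parts: {', '.join(empty)}")
        return " ".join(parts)


prompt = ClaudePrompt(
    context="I'm applying to a competitive graduate program in urban planning.",
    task="Here is my statement of purpose.",
    standard="Evaluate it against what top programs say they're looking for.",
    critique_permission="Tell me what's weak, what's missing, and what's strongest.",
    output_format="Then give me a revised opening paragraph.",
)
print(prompt.render())
```

Failing on an empty part is the point of the design: the slot most often left blank is the permission to critique, and that is the one the lesson says unlocks Claude's critical mode.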

Lesson 3 Quiz

5 questions: can you apply Claude's design logic to real prompting decisions?
1. What is the primary reason Claude often adds qualifications and pushes back on user premises?
Correct. Constitutional AI teaches Claude to evaluate its own outputs before producing them, a process that surfaces problems in the user's premise as a natural side effect.
The mechanism is Constitutional AI, a training approach that has Claude evaluate its own outputs against principles. That self-evaluation is what surfaces premise problems and generates qualifications.
2. You want Claude to analyze a business plan you've written and identify its weaknesses. Which prompt is most likely to get genuinely useful critical feedback?
Right. This prompt gives Claude context (skeptical investors), invites genuine critique (weakest assumptions), asks for reasoning (why each is vulnerable), and signals that pushback is welcome. That's the structure Claude rewards.
Compare the prompts: which one gives Claude context, invites genuine reasoning, and signals that critique is welcome rather than just requested? Claude performs best when the prompt treats it as an analyst, not a task-completer.
3. Claude's responses tend to be longer than you need. The most effective way to address this is:
Correct. Claude takes explicit concision instructions seriously. Adding "output only, no meta-commentary" or similar framing reliably reduces verbosity without losing quality.
The simplest solution is also the most effective: just tell Claude explicitly what you want. "Be direct: give me the output without explaining your reasoning" is a direct instruction Claude will follow.
4. Deja's moot court experience illustrates which core principle about prompting Claude?
Exactly. Deja got more value by inviting honest evaluation than she would have gotten by asking for validation. Claude is designed to provide the former, and it will, if the prompt doesn't implicitly discourage it.
The story's real lesson is about what kind of prompt unlocks Claude's actual capability. She didn't ask for criticism explicitly; she asked "is this strong?" and Claude evaluated rather than validated. The prompt that invites analysis gets analysis.
5. You're working on a creative short story and need an AI to write the next three scenes exactly as you specified: no editorial notes, no suggestions, just the scenes. Which tool is better suited for this task?
Right. When the task is pure creative execution (just do what I specified), ChatGPT's tendency to comply without evaluating is an advantage. Claude's critical orientation slows you down when you want compliance, not analysis.
Think about what "pure creative execution" demands: compliance, not evaluation. Which tool is trained to comply vs. evaluate? That's your answer.

Lab 3: The Claude Critic

You're going to bring something you've written and get it genuinely stress-tested.

Your Role: Analysis-Seeker

Bring something real: a paragraph from a paper, a professional bio, an argument you've been making, an idea for a project. Use the Claude-optimized prompt structure from the lesson: context, task, standard, permission to critique, output format. Then push back on the lab AI's feedback if you disagree. See what happens when you defend your choices vs. when you revise them.

The lab AI here is playing the role of a rigorous analyst: direct, specific, willing to be convinced but not easily. It won't validate for the sake of being nice.

Paste what you want stress-tested and use the full Claude prompt formula: context + task + standard + explicit permission to critique + output format. The more specific your prompt structure, the more useful the feedback.
Lab Assistant
Critical Analysis Mode
Ready when you are. Use the full prompt structure from the lesson: give me context, the thing you want analyzed, the standard I should hold it to, and tell me explicitly that you want real critique. Don't fish for compliments. If you do, I'll call it out.
Module 6 · Lesson 4

Gemini, Specialized Tools, and Building a Personal AI Stack

Knowing what exists is only half the problem. The other half is building a workflow you'll actually use.
How do you go from "I know these tools exist" to actually having a reliable system that makes your work better, not just faster?

Leo is a junior studying environmental policy, and he's trying to write an analysis of a federal rule that was finalized in late February 2025, less than two months ago. He opens ChatGPT and asks about the rule. ChatGPT gives a confident summary that's based on the proposed rule from 2023, not the final rule. The summary is authoritative-sounding and wrong. He opens Claude and asks the same question. Claude hedges: "My training data may not include the final version of this rule as finalized in early 2025. I'd recommend verifying with a primary source." At least Claude told him it didn't know. He opens Gemini with web access. Gemini pulls the actual Federal Register summary and gives him the accurate, current version. Right tool, right task.

Leo had been using all three tools for months. But he'd been treating them as interchangeable, just picking whichever one was already open on his laptop. This moment clarified something he'd understood abstractly but never internalized: the tool you use isn't just a style preference. It's a decision with real consequences for the quality of your work.

Gemini's Actual Strengths and How to Prompt for Them

Gemini (Google's flagship model, available at gemini.google.com) was built with a genuinely different orientation than the other major tools. Google's core product is information retrieval at scale, and that DNA shows in Gemini's design. The model's strongest features include:

Real-time web access. Unlike ChatGPT and Claude in their default states, Gemini regularly incorporates current web search results. For anything that needs to be current (recent regulation, current market prices, recent scientific papers, what a company announced last month), Gemini is typically more reliable than models working only from training data.

Google ecosystem integration. Gemini integrates natively with Google Docs, Gmail, Drive, and Search. If you're working in the Google ecosystem (which most students are), Gemini can access your actual documents, summarize your emails, and work within your existing files. None of the other major tools do this natively without third-party connectors.

Multimodal input. Gemini handles images, PDFs, and other file types well. Upload a dense policy document and ask it to extract the key provisions. Upload a chart from a paper and ask it to explain what it shows. Other platforms can do these tasks too, but in practice Gemini's file handling tends to be reliable.

The prompt strategies that work well with Gemini lean into its research orientation. Ask it to compare current data. Ask it to summarize documents you upload. Ask it to fact-check claims against what it can currently find on the web. Ask it to find the most recent version of something. Prompts that leverage its real-time access are where Gemini outperforms models without it.

Specialized Tools Worth Knowing About

Beyond the big three, there are a handful of specialized tools that outperform general-purpose AI in specific contexts. You don't need to use all of them, but knowing they exist means you're not limited to ChatGPT when a better-fit tool is available.

Perplexity AI is essentially a research assistant that always shows its sources. Every claim is linked to a retrievable document. If you're doing research where you need to trace claims back to primary sources, and you don't want to manually verify every output, Perplexity's source-linking is a genuine practical advantage over models that generate without attribution.

GitHub Copilot and Cursor are coding assistants that are integrated directly into development environments. For programming tasks, these tools typically outperform asking a general AI assistant about code, because they have context about your codebase, can see your errors in real time, and are tuned specifically for programming tasks.

NotebookLM (Google) is designed for working with a specific set of documents you provide. Upload your course readings, upload your research papers, and then ask questions about them specifically. Unlike asking a general model to "remember" a document, NotebookLM grounds all responses in exactly what you gave it, which dramatically reduces hallucination for document-based research tasks.

Tool | Best for | Avoid for
ChatGPT | Fast drafts, brainstorming, creative variants, code scaffolding, tone matching | Fact-specific claims, citation generation, current events without browsing
Claude | Long reasoning, argument analysis, editing, complex writing tasks, ethical dilemmas | Speed-critical tasks, pure creative compliance, real-time information needs
Gemini | Current events, research synthesis, Google ecosystem tasks, document analysis | Nuanced creative writing, prolonged analytical reasoning
Perplexity | Research with verifiable sources, claim checking, academic topic overview | Long-form writing, creative tasks, code
NotebookLM | Analyzing documents you provide, course reading synthesis, reducing hallucination | Tasks not grounded in provided documents, creative generation

Building a Personal AI Stack You'll Actually Use

Here's the honest version of what "having an AI stack" means at your stage: it's not about subscribing to ten different services and maintaining a complex workflow. Most people your age, navigating real tasks with real time constraints, need something simple enough to actually use consistently. The goal is decision clarity, not comprehensiveness.

A workable three-layer personal stack looks like this:

Layer 1: Your daily driver. One tool you use by default for most tasks. For most people, this is ChatGPT or Claude. Pick the one that fits your primary use cases (drafting and brainstorming → ChatGPT; analysis and editing → Claude). Don't switch daily drivers on every task; build fluency with one first.

Layer 2: Your research layer. One tool you reach for when currency or sources matter. Gemini or Perplexity. When a task involves claims about what's currently true (not patterns or concepts, but specific facts), you go here instead of your daily driver. Make this a habit, not a fallback.

Layer 3: Your specialist. One tool for a specific high-frequency task in your own life. If you code, this is Copilot or Cursor. If you study from readings, this is NotebookLM. If you write music, this is something else entirely. One specialist that makes a specific repetitive task meaningfully better.

The peer reality is that most people skip layers 2 and 3 entirely and use ChatGPT for everything, and then wonder why their AI-assisted research sometimes produces wrong information or why their study sessions don't feel as efficient as they should. The answer usually isn't "get better at prompting." The answer is usually "use the right tool."

The practical takeaway: Before next week, define your three layers. Write them down. Not an aspirational list, but a realistic one based on what you actually do. What tasks do you use AI for most? Which tool fits those tasks best? Which one do you reach for when facts matter? Which one would improve one specific thing you do regularly? That's your stack.

The Stack Decision Heuristic

Ask yourself three questions before opening any AI tool: (1) Does this task require current facts? If yes, go to your research layer. (2) Does this task require deep reasoning or critique? If yes, use Claude. (3) Everything else? Daily driver. This takes about three seconds and will meaningfully improve your output quality over weeks of consistent application.
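
The three questions can be sketched as a short function that checks them in order. This is a minimal illustration, not a prescribed implementation: `pick_tool`, its parameter names, and the default tool names are hypothetical stand-ins for whatever you chose as your own layers.

```python
def pick_tool(
    needs_current_facts: bool,
    needs_deep_reasoning: bool,
    daily_driver: str = "ChatGPT",      # assumption: your Layer 1 choice
    research_layer: str = "Perplexity",  # assumption: your Layer 2 choice
) -> str:
    """Apply the three-question heuristic in the order the lesson gives it."""
    # (1) Current facts take priority: go to the research layer.
    if needs_current_facts:
        return research_layer
    # (2) Deep reasoning or critique: the lesson points to Claude.
    if needs_deep_reasoning:
        return "Claude"
    # (3) Everything else goes to the daily driver.
    return daily_driver


print(pick_tool(needs_current_facts=True, needs_deep_reasoning=True))    # Perplexity
print(pick_tool(needs_current_facts=False, needs_deep_reasoning=True))   # Claude
print(pick_tool(needs_current_facts=False, needs_deep_reasoning=False))  # ChatGPT
```

Order matters here: a task that needs both current facts and deep reasoning still goes to the research layer first, because the heuristic asks about current facts before anything else.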

Lesson 4 Quiz

5 questions: tool selection in realistic scenarios
1. Leo's federal rule research problem was solved by Gemini because:
Correct. Gemini's real-time web access let it retrieve the actual, current document. ChatGPT predicted based on old training data; Claude honestly admitted uncertainty. Real-time access is the differentiator for current-fact tasks.
The key is real-time access, not just training data size or specialization. Gemini can go find current documents on the web. The other tools (without browsing enabled) can only predict from what they were trained on.
2. You're writing a research paper and need to cite specific claims with traceable sources. Which tool should be your research layer?
Right. Perplexity's source-linking for every claim is the specific feature that makes it the research layer for traceable citations. Claude's honesty is valuable, but it doesn't show you sources; it just tells you when it's uncertain.
The specific need here is traceable sources: claims you can verify. NotebookLM is great for document-based tasks but requires you to supply the documents. Perplexity links everything to retrievable web sources automatically.
3. A student uses the same AI tool for everything: drafting essays, checking current statistics, debugging code, and analyzing arguments. What is the most accurate assessment of this approach?
Correct. The consistent theme of this module is that tool strengths are real, design-level, and predictable. Using one tool for everything isn't catastrophically wrong, but it accepts predictable underperformance on the tasks other tools are specifically better at.
The design differences between tools are real and consistent. A single-tool approach isn't neutral; it means accepting predictable underperformance on tasks where another tool's design is a better fit.
4. NotebookLM is specifically designed to reduce hallucination by:
Correct. NotebookLM's anti-hallucination advantage comes from grounding: it answers from what you gave it, not from general training data. That's a fundamentally different approach than asking a general model to "remember" a document.
NotebookLM's strength is grounding responses in user-provided documents. It doesn't use the web or general training data for its answers, which makes it far less likely to make things up about content that isn't in your documents.
5. The "three-layer personal AI stack" framework suggests that most people's Layer 1 (daily driver) should be:
Correct. The framework prioritizes decision clarity over comprehensiveness. A daily driver you know deeply outperforms three tools you use superficially. Build fluency first, then add layers for specific needs.
The framework is about building real fluency, not just having access to multiple tools. Rotating daily or picking the "most featured" tool misses the point: consistency builds the skill of knowing how to get good output from a specific tool.

Lab 4: Build Your Personal AI Stack

Not theoretical. You're going to make real decisions about your real workflow, and defend them.

Your Role: Workflow Architect

Tell the lab AI about your actual life: what you're studying or working on, what tasks you use AI for most, and what you're currently using. Then together, work through defining your three-layer stack: daily driver, research layer, specialist. The lab AI will ask you to justify each choice based on the tool's actual design strengths, not just habit or familiarity.

If you default to "ChatGPT for everything," be ready to defend that or revise it. The lab AI will push back on unjustified choices and validate well-reasoned ones. By the end, you should have a written three-layer stack you could actually show someone and explain.

Start by describing your situation: what you're currently studying or doing, what tasks you use AI for most frequently, and which tools you've been using. Then we'll build your stack together.
Lab Assistant
Stack Builder
Alright, let's make this real. Tell me: what are you actually studying or working on right now, what kinds of tasks do you reach for AI to help with most often, and what have you been using? Don't overthink it. Just describe your actual situation and we'll figure out your stack from there.

Module 6 Test

15 questions: score 80% or higher to pass · Prompting Across Tools
1. All major AI tools (ChatGPT, Claude, Gemini) share the same base transformer architecture. Why do they produce different outputs from the same prompt?
Correct. The architecture is similar; the personality comes from training data, fine-tuning, and RLHF, all of which differ significantly across providers.
Architectural similarity is the starting point; the differences emerge from training data, fine-tuning, and RLHF optimization.
2. What does "hallucination" mean in the context of language models?
Correct. Hallucination is confident factual incorrectness, a structural consequence of predicting plausible tokens, not a malfunction.
Hallucination is confident wrong output, a consequence of predicting fluent-seeming text, not evidence of confusion or refusal.
3. GPT-4o gives equally confident responses whether it's correct or hallucinating. The primary reason for this is:
Exactly. RLHF shaped the model toward confident, fluent output because that's what human raters preferred, regardless of accuracy.
RLHF is the mechanism: human raters preferred confident answers, so the training rewarded them. Confidence is a style artifact, not a reliability signal.
4. You need to write five different variations of a landing page headline for a portfolio project. Which tool is best suited by design?
Correct. Rapid creative variation with specific format constraints is a core ChatGPT strength: it's built to produce multiple plausible outputs quickly.
Rapid creative variation is ChatGPT's domain. Claude tends to give you one considered version; Gemini isn't optimized for creative output; Perplexity is a research tool.
5. Claude's "Constitutional AI" training approach produces a model that is more likely than ChatGPT to:
Correct. Constitutional AI trains self-evaluation, which manifests as qualifications, pushback, and reasoning-heavy responses.
Constitutional AI is a self-evaluation framework: it makes Claude assess its outputs, leading to more qualifications and reasoning, not faster or shorter responses.
6. Which prompt structure is best suited to getting high-quality output from Claude for an analytical task?
Right. Context + task + standard + permission to critique + output format (the full Claude formula) extracts the most useful analytical output.
Claude rewards prompts that give context, invite reasoning, and explicitly permit critique. Brief prompts underuse its analytical capability.
7. When is Claude NOT the right tool for a task?
Correct. Claude's tendency to evaluate and qualify is an asset for analytical tasks but slows you down when you just want pure execution. ChatGPT's compliance is better for "just do it" creative tasks.
Claude's evaluative nature is a liability when you want pure compliance, not analysis. If the task is "write this, don't question it," Claude's training works against you.
8. Gemini's primary advantage over ChatGPT and Claude for research tasks is:
Correct. Gemini's differentiator for research is real-time web access and Google integration, not parameter size or reasoning depth.
Gemini's research advantage comes from real-time web access, not raw capability. It can retrieve current information; the others (without browsing) cannot.
9. Perplexity AI's primary use case advantage is:
Correct. Source-linked claims are Perplexity's signature feature: it's designed specifically so users can verify everything it says.
Perplexity's value proposition is source transparency: every claim links to a real document. That's the specific advantage over general-purpose AI for research tasks.
10. NotebookLM reduces hallucination primarily by:
Correct. Document-grounding is NotebookLM's core design: it answers from what you gave it, removing the general training data that enables hallucination.
NotebookLM's anti-hallucination advantage is grounding: it only uses the documents you provide, not its general training data.
11. The "constraint stack" technique for ChatGPT prompts works because:
Correct. Constraints narrow the prediction space: less latitude means less opportunity for the model to fill gaps with plausible but wrong content.
It's about narrowing the prediction target. Vague prompts give the model a wide space to fill; constraints direct the generation toward something useful.
12. A student uses ChatGPT to generate statistics for a paper on housing affordability. Three of the statistics turn out to be fabricated. Which understanding best explains what happened?
Correct. Specific numbers follow predictable patterns: percentages, decimal points, source names. ChatGPT generates those patterns confidently because they're plausible, not because they're real.
This is the hallucination mechanism: numbers look like numbers, and the model generates what looks like a real statistic, with no way to verify the underlying data exists.
13. The three-layer personal AI stack described in Lesson 4 consists of:
Correct. Daily driver, research layer, specialist: three layers based on task type, not tool tier or cost.
The three layers are organized by task function: general tasks, current-fact tasks, and one specific recurring task. Not by cost or rotation.
14. You're preparing for a job interview at a product company. You want honest evaluation of your answers to common interview questions, not just validation. Which tool should you use?
Right. When you specifically want honest critique over validation, Claude's Constitutional AI orientation is the structural advantage. It's trained to evaluate honestly, not to make you feel good.
The key requirement is honest evaluation. Claude is specifically designed to evaluate rather than validate. ChatGPT's RLHF training makes it more likely to affirm than critique.
15. Which of the following best describes the overall principle of this module?
Correct. That's the module in one sentence: tool-task matching is a skill, the strengths are real and consistent, and the benefit of developing this skill is output quality that a single-tool approach can't reliably match.
The module's core argument is that tool-task matching is a learnable skill with real benefits. Not that any single tool is best, and not that more tools automatically means better output.