GPT vs. Claude vs. Gemini · Module 7 · Lesson 1

The Decision Framework

Before you open a chat window, your choice of model shapes every result that follows.

When Morgan Stanley deployed an internal AI assistant for its 16,000 financial advisors in November 2023, the team didn't simply pick the most popular model. They evaluated document grounding, citation accuracy, and tone consistency across hundreds of proprietary research documents. The answer was GPT-4 with a custom retrieval layer — not because GPT-4 was universally superior, but because it best matched the specific constraints of that deployment.

That choice process — matching task properties to model strengths — is the core skill of this module.

Why Model Selection Matters

GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro are all capable of drafting emails, answering questions, and writing code. At the surface level they seem interchangeable. But capability overlap disguises real differences in default behavior — differences that compound over thousands of prompts in a production environment.

The wrong default model choice creates friction: outputs that require heavy editing, safety refusals that interrupt workflows, context windows that run short on long documents, or latency mismatches on real-time applications. Each issue costs time and erodes trust in the tooling.

Model selection is not a one-time decision made at project kickoff. It is an ongoing judgment call tied to task type, audience, budget, latency requirements, and safety posture. Skilled practitioners build a mental map of where each model excels and where it struggles — and they update that map as models evolve.

The Five-Axis Decision Framework

Every model selection decision can be decomposed into five axes. Rate your task on each before committing to a model.

Task Type — Is this primarily language reasoning, code generation, multimodal analysis, or real-time information retrieval? Each task type tends to have a different model leader.

Context Length — How much input does the task require? Gemini 1.5 Pro's 1M-token window is not a marketing figure; it was publicly demonstrated in April 2024 processing the entire Apollo 11 mission transcript (~240,000 tokens) in a single call.

Safety & Tone Posture — Is this a consumer-facing product, an internal enterprise tool, or a developer API? Claude's Constitutional AI training produces notably cautious defaults; GPT-4o's moderation layer is tunable via system prompt.

Cost & Latency — What is the acceptable per-call budget and response time ceiling? GPT-4o mini launched in July 2024 at $0.15 per 1M input tokens, roughly 33× cheaper than GPT-4o's $5, targeting high-volume classification and summarization tasks.

Ecosystem Integration — Does the task live inside Google Workspace, Microsoft 365, or a standalone API? Gemini's native Workspace integration and GPT-4o's presence inside Microsoft Copilot create friction-free paths that matter in enterprise settings.
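One way to make the "rate your task on each axis" step concrete is a small record that holds one rating per axis and reports which axis dominates. This is a minimal sketch, not part of the framework itself: the 0 to 5 scale, the `TaskProfile` name, and the example ratings are assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskProfile:
    """How strongly each axis constrains a task: 0 (irrelevant) to 5 (hard constraint).

    The scale is an assumption for this sketch; the lesson does not prescribe one.
    """
    task_type: int
    context_length: int
    safety_tone: int
    cost_latency: int
    ecosystem: int

    def dominant_axes(self) -> list[str]:
        """Return the name(s) of the axis or axes with the highest rating."""
        ratings = {f.name: getattr(self, f.name) for f in fields(self)}
        top = max(ratings.values())
        return [name for name, score in ratings.items() if score == top]


# Hypothetical ratings for a long-contract review task.
contract_review = TaskProfile(
    task_type=3, context_length=5, safety_tone=3, cost_latency=1, ecosystem=1
)
print(contract_review.dominant_axes())
```

Writing the ratings down before looking at any model keeps the comparison honest: the dominant axis is chosen from the task, not from a favorite vendor.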

A Reference Overview

The table below summarizes the dominant positioning of each model family as of mid-2024. These are tendencies, not hard limits.

Dimension	GPT-4o	Claude 3.5	Gemini 1.5
Reasoning depth	High	High	High
Context window	128K tokens	200K tokens	1M tokens
Default tone	Neutral / direct	Careful / thorough	Informative / broad
Multimodal input	Text, image, audio, video (via API)	Text, image	Text, image, audio, video, PDF
Code generation	Excellent	Excellent	Very good
Cost tier (mid-2024)	$5/$15 per 1M tokens in/out	$3/$15 per 1M tokens in/out	$3.50/$10.50 per 1M tokens in/out
Live web access	Via tools	No (base)	Via Gemini app / extensions

KEY INSIGHT

There is no universally best model. There is only the best model for a given task, at a given cost, for a given audience, at a given moment. The frameworks in this module give you a repeatable method to find that model quickly — and to defend the choice when stakeholders ask why.

Lesson 1 Quiz

3 questions — free, untracked, retake anytime.

1. Which of the five decision-framework axes was central to Morgan Stanley's model selection for their financial advisor assistant?

✓ Correct. Morgan Stanley's team specifically evaluated document grounding, citation accuracy, and tone consistency across proprietary research documents — task-type and safety-posture axes driving the GPT-4 choice.

✗ Not quite. The lesson details that Morgan Stanley evaluated document grounding, citation accuracy, and tone consistency — matching task-type and safety requirements, not cost or video capability.

2. As publicly demonstrated in April 2024, approximately how large a context did Gemini 1.5 Pro process in a single call using the Apollo 11 transcript?

✓ Correct. The Apollo 11 mission transcript demo (~240,000 tokens) was used to validate Gemini 1.5 Pro's 1M-token context window in a real, publicly reported test.

✗ The lesson states the Apollo 11 transcript demonstration processed approximately 240,000 tokens — a real documented demonstration of the 1M-token context window capability.

3. What was the primary design goal of GPT-4o mini when it launched in July 2024?

✓ Correct. GPT-4o mini launched at roughly 33× cheaper than GPT-4o for input tokens ($0.15 vs. $5 per 1M), targeting high-volume classification and summarization use cases where cost per call matters most.

✗ GPT-4o mini's design goal was cost reduction: roughly 33× cheaper than GPT-4o for input tokens, targeting high-volume summarization and classification tasks.

Lab 1 — Apply the Decision Framework

Describe a real task. The assistant helps you score it across the five axes and recommends a model.

Scenario-Based Model Selection Practice

In this lab you'll describe a task or workflow you actually need AI for. The assistant will walk you through the five-axis decision framework — task type, context length, safety posture, cost/latency, and ecosystem integration — and reason toward a model recommendation with justification.

Push back. Ask why. Try a different scenario to compare outputs. Complete at least 3 exchanges to finish the lab.

Try asking: "I need to summarize 50-page legal contracts daily for a team of 12 paralegals. Which model should I use and why?"


GPT vs. Claude vs. Gemini · Module 7 · Lesson 2

GPT-4o: When to Reach for OpenAI

The model that rewired the developer ecosystem — and the tasks where it still leads.

When Khan Academy built Khanmigo — its AI tutor for millions of students — the team selected GPT-4 as the underlying model. The reasoning was documented in Sal Khan's March 2023 TED Talk: stepwise reasoning quality, the ability to walk through a math problem without jumping to the answer, and the maturity of OpenAI's content-moderation layer for a child-safe environment.

By 2024 Khanmigo had served tens of millions of tutoring sessions. The choice held not because GPT-4 was cheapest or fastest, but because its pedagogical reasoning pattern matched the product's core requirement.

GPT-4o's Structural Strengths

GPT-4o ("o" for omni) launched in May 2024, combining text, image, audio, and video processing in a single model endpoint. Three structural strengths make it the default choice in specific contexts:

1. Instruction-following precision. GPT-4o is reported to score at or near the top on multi-step instruction adherence, both on OpenAI's internal evals and on third-party benchmarks like MT-Bench. When a prompt says "return only JSON with no commentary," GPT-4o is among the models least likely to add a preamble. This matters in agentic pipelines, where output is parsed programmatically and a single stray sentence can break the parser.

2. Tool-use and function-calling maturity. OpenAI introduced structured function calling in June 2023. By mid-2024 this API was deeply integrated into frameworks like LangChain, LlamaIndex, and AutoGen. The ecosystem assumes GPT-4o defaults, which means less glue code and more battle-tested examples when you're building tool-augmented agents.

3. The Assistants API and persistent threads. GPT-4o is the backbone of OpenAI's Assistants API, which manages conversation state, file uploads, code interpreter, and retrieval augmented generation (RAG) in one managed surface. For teams that don't want to build their own memory and retrieval layer from scratch, this matters significantly.
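The first point above is easier to appreciate from the consuming side. The sketch below shows a defensive parser a pipeline might place between any model and its downstream code; `parse_json_reply` is a hypothetical helper written for this lesson, not part of any vendor SDK, and it uses only the Python standard library.

```python
import json


def parse_json_reply(reply: str) -> dict:
    """Parse a model reply that is supposed to be a single JSON object.

    A strict parse is tried first. If the model added a preamble or a sign-off,
    fall back to the text between the first '{' and the last '}'.
    """
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        start, end = reply.find("{"), reply.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in model reply")
        # May still raise json.JSONDecodeError (a ValueError subclass).
        parsed = json.loads(reply[start : end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed


clean = parse_json_reply('{"intent": "billing"}')
chatty = parse_json_reply('Sure! Here is the JSON:\n{"intent": "billing"}')
print(clean == chatty)
```

A model that rarely adds commentary makes the fallback branch rare, but keeping the branch means a change of model or prompt does not silently break the pipeline.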

Where GPT-4o Leads — Specific Use Cases

These are task categories where GPT-4o has documented production advantages over alternatives as of 2024:

Use Case

Agentic Tool Pipelines

Multi-step tool-calling chains where structured JSON outputs must be parsed reliably by downstream code. GPT-4o's function-calling API is the most mature and best-documented option.

Use Case

Code Generation & Debugging

GitHub Copilot's early fine-tuning on GPT models and the wide adoption of GPT-4o in coding assistants reflect its strong performance on HumanEval and SWE-bench coding benchmarks.

Use Case

Customer-Facing Chatbots (Microsoft Stack)

Azure OpenAI Service and Microsoft Copilot embed GPT-4o natively. Teams already in Microsoft 365 face the lowest integration friction using GPT-4o via Azure rather than switching providers.

Use Case

Real-Time Voice Applications

GPT-4o's native audio mode (announced May 2024) responds to speech in as little as 232 ms, and in about 320 ms on average, according to OpenAI. It was demonstrated live at OpenAI's May 2024 launch event, including real-time emotion detection.

GPT-4o's Limitations

Matching a model also means knowing where it underperforms. GPT-4o's documented weaknesses in production settings include:

Context window constraints on very long documents. At 128K tokens, GPT-4o handles most use cases but loses to Gemini 1.5 Pro on tasks requiring full-document analysis of books, legal archives, or codebases that exceed that window.

Verbosity in long-form creative writing. User comparisons on platforms like LMSYS Chatbot Arena (which has collected over 1 million head-to-head evaluations) consistently show Claude 3.5 Sonnet rated higher on prose quality and nuanced creative writing tasks.

Safety-layer friction in sensitive domains. GPT-4o's default content moderation can refuse edge-case security research, certain medical information, or mature creative scenarios — requiring careful system prompt engineering to unlock legitimate professional use cases.

PRACTITIONER NOTE

If your project involves tool-augmented agents, lives inside the Microsoft ecosystem, requires real-time voice, or needs the most battle-tested function-calling API — GPT-4o is the default-safe choice. When those conditions don't apply, the default shifts.

Lesson 2 Quiz

3 questions — free, untracked, retake anytime.

1. What specific capability did Sal Khan identify as the key reason for choosing GPT-4 for Khanmigo in his March 2023 TED Talk?

✓ Correct. Khan specifically cited stepwise reasoning quality — the pedagogically essential pattern of showing work step-by-step rather than jumping to a final answer — as the key differentiator.

✗ The lesson cites Sal Khan's TED Talk where he specifically identified stepwise reasoning quality — not cost, audio, or context length — as the decisive factor for a tutoring application.

2. OpenAI introduced structured function calling — a feature central to GPT-4o's agentic pipeline advantage — in which month and year?

✓ Correct. OpenAI introduced structured function calling in June 2023, giving GPT-4 a head start in the agentic pipeline ecosystem that GPT-4o inherited and extended.

✗ The lesson states OpenAI introduced structured function calling in June 2023 — a date that gave the GPT ecosystem a significant head start in agentic pipeline tooling.

3. Based on LMSYS Chatbot Arena head-to-head evaluations, which model is consistently rated higher than GPT-4o on prose quality and nuanced creative writing?

✓ Correct. The lesson cites LMSYS Chatbot Arena data (over 1 million evaluations) showing Claude 3.5 Sonnet rated higher than GPT-4o on prose quality and nuanced creative writing — a documented limitation of GPT-4o.

✗ LMSYS Chatbot Arena data — over 1 million head-to-head evaluations — consistently shows Claude 3.5 Sonnet rated higher than GPT-4o on prose quality and creative writing tasks.

Lab 2 — GPT-4o Use Case Audit

Test whether your workflow is a genuine GPT-4o fit or a case of brand-name defaulting.

Is GPT-4o Actually the Right Choice?

Many teams default to GPT-4o because it's what they know. In this lab, describe a specific project or task you're considering for GPT-4o. The assistant will probe whether the GPT-4o-specific strengths — function calling maturity, Assistants API, real-time audio, Microsoft ecosystem fit — genuinely apply to your case, or whether a different model might serve better.

Be specific about your task. The more detail you give, the more targeted the analysis. Complete at least 3 exchanges to finish the lab.

Try asking: "We're building a chatbot for customer support inside Microsoft Teams — should we use GPT-4o via Azure or would Gemini work just as well?"


GPT vs. Claude vs. Gemini · Module 7 · Lesson 3

Claude & Gemini: Matching Strengths to Tasks

The cases where Claude's careful reasoning or Gemini's long context decisively outperforms the default choice.

Quora's Poe platform — which hosts multiple AI models in one interface — gives users direct A/B comparison data across Claude, GPT-4, and Gemini on identical prompts. By early 2024, Poe's internal usage patterns showed Claude 3 Opus leading on long-form writing and document analysis requests, while GPT-4 dominated code-assistance sessions. Meanwhile, Notion AI, which integrated Claude into its notes and documents product in 2023, cited Claude's ability to maintain consistent document voice across edits as the deciding factor.

These are not preferences — they reflect measurable differences in output patterns at scale.

Claude 3.5: Where Careful Reasoning Wins

Anthropic's Claude 3.5 Sonnet (released June 2024) is trained with Constitutional AI, a training method in which the model critiques and revises its own draft outputs against a written set of principles, and those revisions are then used to train the final model. This produces three observable behaviors that make Claude the right choice in specific contexts:

Long-form writing consistency. Claude tends to maintain stylistic coherence across 5,000–20,000-word documents better than GPT-4o. This was a documented reason Notion AI chose Claude: when editing a 10-page strategy document, Claude preserves voice and structure across multiple revision passes. GPT-4o, by contrast, sometimes introduces tonal shifts between sections.

Nuanced instruction adherence on sensitive topics. Claude's training includes explicit calibration on harm avoidance that is less binary than GPT-4o's moderation layer. In practice, Claude can often engage with medical edge cases, legal hypotheticals, and security research questions that GPT-4o may refuse by default, while still declining genuinely harmful requests. This makes Claude a better fit for professional domain tools in medicine, law, and security.

200K-token context with strong retrieval accuracy. Claude 3.5 Sonnet's 200K-token window is paired with strong "needle in a haystack" performance: accurately locating a specific sentence buried deep in a long document. In the evaluations Anthropic published with the Claude 3 launch in March 2024, Claude 3 Opus scored above 99% on needle-in-a-haystack benchmarks across the full 200K window.

Gemini 1.5: Where Scale and Integration Win

Google's Gemini 1.5 Pro arrived in February 2024 with two structural advantages that are genuinely category-defining — not marketing claims:

The 1-million-token context window. This isn't just "bigger"; it's a qualitative shift. In its February 2024 announcement, Google demonstrated Gemini 1.5 Pro analyzing the entirety of a 44-minute Buster Keaton film, answering specific scene-level questions from the raw video input. For legal firms processing entire case archives, medical researchers analyzing clinical trial corpora, or developers doing codebase-wide analysis, this window eliminates chunking, summarization, and retrieval steps that introduce errors in shorter-context pipelines.

Native Google Workspace integration. Gemini is embedded into Gmail, Docs, Sheets, and Meet via Google Workspace Labs (now generally available). For organizations on Google Workspace, this means AI that reads your existing Drive files, summarizes your last 30 Gmail threads, and drafts Docs with direct access to your data — without a single API call or custom integration.

Multimodal breadth. Gemini 1.5 Pro accepts interleaved text, image, audio, video, and PDF in a single prompt. This multimodal flexibility is broader than GPT-4o's current API surface and makes Gemini the natural choice for tasks like analyzing a recorded sales call (audio) alongside its CRM notes (text) and the customer's proposal PDF (document).

Side-by-Side Decision Logic

The following table maps task characteristics to the model most likely to produce superior results based on documented capabilities and production usage patterns as of mid-2024.

Task Characteristic	Best Default	Why
Document >200K tokens	Gemini 1.5 Pro	Only model with sufficient native context
Document 50K–200K tokens	Claude 3.5	Strong needle-in-haystack accuracy, lower cost than Gemini 1.5 Pro at this range
Long-form prose editing	Claude 3.5	Voice consistency across long outputs; LMSYS Arena preference data
Google Workspace task	Gemini 1.5	Native Drive/Gmail/Docs integration; zero integration friction
Medical / legal professional tool	Claude 3.5	Less binary safety refusals; Constitutional AI calibration
Multi-modal: audio + video + doc	Gemini 1.5	Broadest native multimodal input support
Tool-calling agent pipeline	GPT-4o	Most mature function-calling API and ecosystem
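The table can also be read as an ordered list of rules. The sketch below encodes each row as a check and returns the table's "Best Default" label; the function name, the parameter names, and especially the order in which the rules are tried are assumptions for illustration, since the table itself does not rank its rows.

```python
from typing import Optional


def default_model(
    doc_tokens: int = 0,
    *,
    mixed_audio_video_docs: bool = False,
    google_workspace: bool = False,
    tool_calling_agent: bool = False,
    long_form_prose: bool = False,
    medical_or_legal_tool: bool = False,
) -> Optional[str]:
    """Return the table's default for a task, or None if no row applies.

    Hard constraints (context size, input modalities) are checked first;
    that precedence is an assumption of this sketch.
    """
    if doc_tokens > 200_000:
        return "Gemini 1.5 Pro"
    if mixed_audio_video_docs or google_workspace:
        return "Gemini 1.5"
    if tool_calling_agent:
        return "GPT-4o"
    if doc_tokens >= 50_000 or long_form_prose or medical_or_legal_tool:
        return "Claude 3.5"
    return None


print(default_model(doc_tokens=250_000))
print(default_model(doc_tokens=120_000))
print(default_model(tool_calling_agent=True))
```

A `None` result is a prompt to go back to the five-axis framework rather than a failure: the table covers common cases, not every task.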

Lesson 3 Quiz

Test your understanding of Lesson 3

What is the central theme of Lesson 3 in this module?

Correct.

Review Lesson 3 for the core concepts.

Why is practical application important alongside theoretical understanding?

Correct. Practice reveals complexities beyond theoretical models.

Theory and practice complement each other — practice reveals real-world constraints.

What distinguishes effective practitioners in this field?

Correct.

Critical thinking matters more than tools or experience alone.

🎯 Advanced · Lesson 3 Lab

Lab: Explore Lesson 3 Concepts

Apply what you learned in Lesson 3 through guided AI conversation

Your Task

Use the AI below to explore Lesson 3 concepts in depth. Challenge assumptions and work through scenarios.

Try asking about a specific concept from Lesson 3 and how it applies in practice.


GPT vs. Claude vs. Gemini · Choosing the Right Model · Lesson 4

Real-World Deployments: Case Studies in Model Selection

Theory becomes judgment when the task is real, the deadline is tomorrow, and the wrong choice ships to thousands of users.

In Lessons 1–3 you built a mental map: a five-axis framework, GPT-4o's ecosystem depth, and the distinct strengths of Claude and Gemini. In practice, model selection happens fast — a team lead asks which model to wire up, and you have five minutes to answer with confidence. This lesson closes the loop with four concrete scenarios: what was chosen, why, and what tradeoffs were accepted.

Case Study 1 — Legal Document Review at Scale

Scenario: A mid-size law firm needs to process full merger-and-acquisition due diligence packages — typically 200–400 pages of contracts, disclosure schedules, and regulatory filings — and produce structured summaries before each partner review meeting.

Model selected: Claude 3.5 Sonnet (200K context window).

The decisive factor was context length. A 300-page document, once converted to text, runs to roughly 150,000–180,000 tokens. GPT-4o's 128K window would require chunking — introducing seam errors where related clauses on page 12 and page 280 never appear in the same context. Claude's 200K window ingests the entire document in one call, allowing the model to catch cross-references, notice contradictions between schedules, and produce a coherent whole-document summary.
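The arithmetic behind this paragraph can be sketched in a few lines. The tokens-per-page range below is simply what the lesson's own figures imply (300 pages at roughly 150,000–180,000 tokens), the window sizes come from the Lesson 1 table, and the 8,000-token reserve for instructions and output is an assumption chosen for illustration.

```python
import math

# Implied by the lesson: 300 pages is roughly 150K-180K tokens.
TOKENS_PER_PAGE = (500, 600)

# Context windows as listed in the Lesson 1 reference table.
CONTEXT_WINDOWS = {
    "GPT-4o": 128_000,
    "Claude 3.5 Sonnet": 200_000,
    "Gemini 1.5 Pro": 1_000_000,
}


def estimate_tokens(pages: int) -> tuple[int, int]:
    """Return a (low, high) token estimate for a document of `pages` pages."""
    low, high = TOKENS_PER_PAGE
    return pages * low, pages * high


def calls_needed(tokens: int, window: int, reserved: int = 8_000) -> int:
    """Number of chunks required when `reserved` tokens are held back
    for the instructions and the model's response (an assumed figure)."""
    return math.ceil(tokens / (window - reserved))


low, high = estimate_tokens(300)
for model, window in CONTEXT_WINDOWS.items():
    print(f"{model}: {calls_needed(high, window)} call(s) for {high:,} tokens")
```

Under the same assumptions, a 400-page package at the high end of the estimate comes to 240,000 tokens and would no longer fit in a single 200K-token call, so the upper end of the firm's page range deserves a check before the pipeline is built.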

Constitutional AI training was a secondary benefit: the firm's ethics committee was comfortable with Claude's cautious defaults on sensitive client data. Cost was a non-issue at this task volume (a few hundred documents per month).

KEY TRADEOFF

Claude's 200K context solved the chunking problem but added per-call cost versus using Claude Haiku on chunked segments. For documents consistently under 80K tokens, a chunk-and-merge strategy with Haiku would be cheaper — but for full due diligence packages, whole-document coherence justified the Sonnet price point.

Case Study 2 — Customer Service Chatbot for a SaaS Platform

Scenario: A B2B SaaS company wants to deploy an AI-powered support bot that handles billing questions, feature explanations, and account troubleshooting. It must integrate with Zendesk, pull from a Confluence knowledge base, and escalate to human agents via webhook.

Model selected: GPT-4o via the OpenAI API.

The ecosystem argument was decisive here. OpenAI's function-calling specification is natively supported by Zendesk's AI partner integrations, and the Confluence connector had already been built by the infrastructure team using an existing OpenAI plugin. Switching to Claude would have meant rebuilding those connectors. GPT-4o's broad tool-use support also made it straightforward to wire up the escalation webhook.

Team familiarity mattered too: the support team was already using ChatGPT internally, so GPT-4o's behavioral defaults felt predictable to the people writing system prompts. Response latency on GPT-4o mini (used for intent classification) was fast enough that users experienced no perceptible delay before the model routed to the right knowledge-base section.

KEY TRADEOFF

The team actually ran GPT-4o mini for intent classification (cheap, fast) and GPT-4o for final response generation (higher quality). This two-tier model routing pattern — a small model decides, a large model generates — is increasingly common in production pipelines and cuts cost by 60–80% on high-volume deployments.
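The routing pattern described above has a simple shape in code. In this sketch the two model calls are passed in as plain functions, so it runs without any API key; `keyword_classifier` and `templated_generator` are hypothetical stand-ins for the small and large model respectively, and the intent labels are invented for illustration.

```python
from typing import Callable


def route_and_answer(
    message: str,
    classify: Callable[[str], str],
    generate: Callable[[str, str], str],
) -> str:
    """Two-tier routing: a cheap step picks the intent, a capable step writes the reply."""
    intent = classify(message)        # small, fast model in production
    return generate(message, intent)  # larger model, given the routed context


def keyword_classifier(message: str) -> str:
    """Stand-in for the small model: a keyword lookup, purely illustrative."""
    text = message.lower()
    if "invoice" in text or "charge" in text:
        return "billing"
    if "error" in text or "login" in text:
        return "troubleshooting"
    return "features"


def templated_generator(message: str, intent: str) -> str:
    """Stand-in for the large model: returns a labelled draft reply."""
    return f"[{intent}] draft reply to: {message}"


print(route_and_answer("Why was I charged twice?", keyword_classifier, templated_generator))
```

Keeping the two steps behind separate function parameters is the design choice that matters: either tier can be swapped for a different model without touching the other.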

Case Study 3 — Scientific Research Assistant at a University Lab

Scenario: A computational biology lab wants an assistant that can answer questions about recent preprints, synthesize findings across 2024 literature, and suggest experimental protocols — all while staying current with work published in the last three months.

Model selected: Gemini 1.5 Pro with search grounding enabled.

Recency was the critical constraint. Claude and standard GPT-4o have training cutoffs and no live web access by default; their knowledge of a preprint posted last Tuesday is zero. Gemini's search grounding capability — routing queries through Google Search before generating a response — meant the assistant could accurately surface findings from arXiv, bioRxiv, and PubMed published days earlier.

The lab also ran experiments through Google Colab and stored datasets in Google Drive. Gemini's Workspace integration let the assistant access Drive folders directly when analyzing existing datasets, eliminating a manual copy-paste step that had been friction in earlier workflows.

KEY TRADEOFF

Search grounding added latency — each query triggered a web search before generation, adding 1–3 seconds per response. For a research assistant used asynchronously (not real-time), this was acceptable. For a latency-sensitive application, that same feature would be a deal-breaker.

Case Study 4 — High-Volume Content Classification Pipeline

Scenario: A media monitoring company needs to classify 500,000 social media posts per day into 12 content categories (news, opinion, satire, misinformation, etc.) with a cost budget of under $50/day and a latency requirement of under 500ms per call.

Model selected: Gemini 1.5 Flash (primary) / Claude Haiku (fallback).

At this volume, cost arithmetic dominates every other consideration. Gemini 1.5 Flash was priced at approximately $0.075 per 1M input tokens at mid-2024 rates. Processing 500,000 posts averaging ~100 tokens each is 50M tokens/day — roughly $3.75/day, well within budget. GPT-4o at $5/1M tokens would cost ~$250/day for the same load. Even GPT-4o mini ($0.15/1M input) would cost $7.50/day — more expensive than Flash with no latency advantage.
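The same arithmetic as a short script, using only the prices and volumes quoted in the paragraph above. Like the paragraph, this sketch counts input tokens only; output tokens, which are short for a classification label, are left out as a simplifying assumption.

```python
# Input prices in USD per 1M tokens, as quoted in this case study.
INPUT_PRICE_PER_1M = {
    "Gemini 1.5 Flash": 0.075,
    "GPT-4o mini": 0.15,
    "GPT-4o": 5.00,
}

POSTS_PER_DAY = 500_000
TOKENS_PER_POST = 100      # average from the scenario
DAILY_BUDGET_USD = 50.0


def daily_input_cost(posts: int, tokens_per_post: int, price_per_1m: float) -> float:
    """Daily input-token cost in USD (output tokens are not counted)."""
    return posts * tokens_per_post / 1_000_000 * price_per_1m


for model, price in INPUT_PRICE_PER_1M.items():
    cost = daily_input_cost(POSTS_PER_DAY, TOKENS_PER_POST, price)
    verdict = "within" if cost <= DAILY_BUDGET_USD else "over"
    print(f"{model}: ${cost:.2f}/day ({verdict} budget)")
```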

Flash's median response time of ~300ms cleared the 500ms SLA. Claude Haiku was configured as a fallback for posts where Flash returned low-confidence classifications, accepting slightly higher cost on the ~5% of ambiguous cases in exchange for better accuracy on edge cases.
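The fallback arrangement can be sketched as a confidence threshold. The lesson does not say how the team obtained a confidence score or what cutoff it used, so the `(label, confidence)` return shape, the 0.8 threshold, and both stub classifiers below are assumptions for illustration.

```python
from typing import Callable, Tuple

# A classifier returns a (label, confidence) pair; this shape is an assumption.
Classifier = Callable[[str], Tuple[str, float]]


def classify_with_fallback(
    post: str,
    primary: Classifier,
    fallback: Classifier,
    min_confidence: float = 0.8,
) -> Tuple[str, str]:
    """Return (label, tier). The fallback is called only for low-confidence posts."""
    label, confidence = primary(post)
    if confidence >= min_confidence:
        return label, "primary"
    label, _ = fallback(post)
    return label, "fallback"


def stub_primary(post: str) -> Tuple[str, float]:
    """Stand-in for the fast model: unsure whenever the post mentions irony."""
    return ("news", 0.95) if "irony" not in post else ("news", 0.40)


def stub_fallback(post: str) -> Tuple[str, float]:
    """Stand-in for the more careful model."""
    return ("satire", 0.90)


print(classify_with_fallback("Council approves new budget", stub_primary, stub_fallback))
print(classify_with_fallback("Peak irony: council bans bans", stub_primary, stub_fallback))
```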

KEY TRADEOFF

Flash's speed-and-cost advantages come with a quality ceiling. On nuanced satire detection or multi-label classification, Flash underperformed Sonnet-class models in internal evaluations. The team accepted lower accuracy on hard cases in exchange for the economics working at scale — a deliberate engineering tradeoff, not an oversight.

Cross-Case Patterns

Looking across all four cases, three patterns emerge consistently:

One axis usually dominates. Context length decided Case 1. Ecosystem integration decided Case 2. Recency, which falls under the task-type axis as real-time information retrieval, decided Case 3. Cost decided Case 4. The five-axis framework helps you find that dominant axis quickly rather than treating every axis as equal weight.

Model routing is a first-class pattern. Two of the four cases used multiple models in the same pipeline — a cheap model for classification or routing, a capable model for generation. This is not a workaround; it is standard production architecture.

Tradeoffs are accepted, not solved. Each choice gave something up: chunking coherence for cost (Case 1 alternative), rebuilding connectors for safety tuning (Case 2 alternative), latency for recency (Case 3), accuracy ceiling for economics (Case 4). The job is to choose which tradeoffs your task can absorb — not to find a model that has no tradeoffs.

Lesson 4 Quiz

3 questions — free, untracked, retake anytime.

1. A legal team processes 300-page merger agreements. After one call, they find the model's summary misses a cross-reference between page 14 and page 290. What is the most likely root cause?

✓ Correct. Context-window chunking is the classic cause of cross-reference failures in long-document tasks. When a 300-page document (~150K tokens) exceeds a model's window and must be split, clauses on distant pages never share the same context, which makes cross-reference detection impossible unless the model's window holds the whole document, as Claude's 200K window does.

✗ Context-window chunking is the issue. When a document is split across calls, content on page 14 and page 290 never appears in the same context, so the model cannot detect the relationship between them. This is the core reason Claude's 200K window matters for full due-diligence review.

2. A product team runs 500,000 classification calls per day on a $50/day budget, latency must be under 500ms. Which model tier is the correct match?

✓ Correct. At 500K calls × ~100 tokens = 50M tokens/day, Flash's ~$0.075/1M pricing costs ~$3.75/day — well inside the $50 budget. GPT-4o at $5/1M would cost ~$250/day; even GPT-4o mini at $0.15/1M costs $7.50/day. Flash and Haiku are the purpose-built high-volume tier; latency of ~300ms clears the 500ms SLA.

✗ Cost arithmetic rules this decision. At 50M tokens/day, GPT-4o costs ~$250/day (5× the budget) and GPT-4o mini costs ~$7.50/day. Gemini Flash at ~$0.075/1M comes to ~$3.75/day and delivers ~300ms latency, so both constraints are met. This is the Flash/Haiku use case.

3. Which of the following best describes the "model routing" pattern seen in Case Studies 2 and 4?

✓ Correct. Model routing uses a fast, cheap model (GPT-4o mini, Flash, or Haiku) to classify the request and determine what's needed, then passes only the relevant subset of work to a more capable (and more expensive) model for generation. This cuts cost by 60–80% on high-volume pipelines while preserving quality on outputs that matter.

✗ Model routing means a small model handles intent classification (cheap, fast) and a larger model handles final generation (higher quality). The CS platform in Case 2 routed to GPT-4o mini for intent, then GPT-4o for responses. Case 4 used Flash for classification with Haiku as fallback on ambiguous posts. It's a deliberate pipeline architecture, not a UI toggle.

Lab 4: Synthesis and Integration

Apply and extend the concepts from this lesson through guided conversation with an AI assistant.

Use this lab to explore how the concepts from Lesson 4 apply to your own questions and interests. The AI assistant is here to help you think through complex scenarios.


Module Test

15 questions covering all lessons — free, untracked, retake anytime.


1. A team needs to analyze creative brand voice across 500 ad campaigns to identify what emotional tone drives conversions. Which decision-framework axis is most decisive here?

✓ Correct. Task type is the primary axis when the work is fundamentally about reasoning quality — here, nuanced interpretation of creative tone. Ecosystem and latency are secondary once you've established what kind of reasoning the task demands.

✗ Task type is the lead axis. Creative/analytical reasoning quality determines which model's defaults suit the work. Latency, ecosystem, and context window are secondary constraints for this scenario.

2. A consumer health app must ensure its AI never suggests a specific medication dosage. Which model property most directly addresses this safety requirement?

✓ Correct. Constitutional AI is Anthropic's training methodology that embeds safety principles directly into Claude's behavior. It produces consistently cautious outputs on medical, legal, and other high-stakes domains — making it the strongest default fit for compliance-sensitive consumer health applications.

✗ Constitutional AI (CAI) is the relevant property. Claude's CAI training produces cautious, harm-aware defaults on safety-sensitive content — the strongest out-of-the-box posture for a health app that cannot risk specific dosage recommendations.

3. At mid-2024 pricing, which ordering correctly ranks these models from cheapest to most expensive per 1M input tokens?

✓ Correct. Gemini 1.5 Flash (~$0.075/1M) is cheapest, followed by GPT-4o mini (~$0.15/1M) and Claude 3 Haiku (~$0.25/1M) in the budget tier, then Claude 3.5 Sonnet (~$3/1M) in the mid tier and GPT-4o (~$5/1M) at the high end. Knowing this ordering is essential for cost-axis decisions.

✗ The correct order is Flash at the bottom, then GPT-4o mini and Haiku, then mid-tier Sonnet-class models, then GPT-4o at the top. This ordering directly drives high-volume deployment decisions where cost arithmetic dominates.

4. A real-time customer-facing chatbot must respond within 400ms at the 95th percentile. Which model tier is most appropriate?

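✓ Correct. Flash and Haiku are designed for latency-sensitive workloads. Their ~300ms median latency leaves headroom under a 400ms ceiling, although a median says nothing about the tail, so the 95th percentile still has to be measured under production load. Opus and full GPT-4o are significantly slower and would frequently miss the 95th-percentile threshold.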

✗ Latency-sensitive applications require the Flash/Haiku tier. These models are optimized for speed (~300ms median) at the cost of some reasoning depth. Opus and standard GPT-4o are much slower and would breach the 400ms SLA at scale.
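Because the requirement in this question is stated at the 95th percentile, it helps to see how far a median and a p95 can diverge. The sketch below uses Python's standard `statistics` module on made-up latency samples; the numbers are invented for illustration and are not measurements of any model.

```python
import statistics


def latency_summary(samples_ms: list[float]) -> tuple[float, float]:
    """Return (median, p95) for a list of latency samples in milliseconds."""
    median = statistics.median(samples_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(samples_ms, n=100)[94]
    return median, p95


# Made-up data: 90 fast responses and 10 slow ones.
samples = [300.0] * 90 + [900.0] * 10
median, p95 = latency_summary(samples)
print(f"median={median:.0f}ms p95={p95:.0f}ms meets 400ms SLA: {p95 <= 400}")
```

In this invented sample the median sits comfortably under 400 ms while the 95th percentile does not, which is why an SLA written at p95 has to be checked against measured tail latency rather than a published median.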

5. Which two models can natively handle book-length inputs (e.g., a full novel of ~120,000 tokens) in a single API call without chunking?

✓ Correct. At mid-2024, Claude's 200K and Gemini 1.5 Pro's 1M context windows are the two options that handle book-length inputs comfortably without chunking. A ~120K-token novel nominally fits inside GPT-4o's 128K window, but the remaining ~8K tokens must also hold the instructions and the response, and longer books exceed the window outright.

✗ Claude (200K) and Gemini 1.5 Pro (1M) are the two models designed for extremely long inputs. GPT-4o's 128K window leaves too little headroom for a ~120K-token novel plus instructions and output, and GPT-4o mini has the same 128K limit.

6. GPT-4o is described as "natively multimodal." What does that mean in practice?

✓ Correct. "Natively multimodal" means GPT-4o processes audio, images, and text within a single unified model — no separate speech-to-text or image-recognition pipeline feeding into a text model. This reduces latency and error surface compared to stitched-together pipelines.

✗ Native multimodality means the audio, image, and text modalities are handled by one unified model. Earlier voice features chained separate models together: a speech-to-text model such as Whisper, then a text model, then text-to-speech. GPT-4o's unified architecture reduces latency and avoids transcription errors.

7. A developer wants to build a workflow that generates Python code, executes it against sample data, and iterates based on output. Which GPT-4o feature most directly enables this?

✓ Correct. Code Interpreter runs code in a sandboxed Python environment and returns results (including errors, plots, and data outputs) directly into the conversation context. This closed-loop generate → execute → iterate pattern is exactly what Code Interpreter was designed for.

✗ Code Interpreter (also known as Advanced Data Analysis in ChatGPT) executes Python in a sandbox and feeds the output back into context, which enables the generate → run → iterate loop. DALL-E is for image generation; context window size affects memory, not execution.
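
The generate → run → iterate pattern can be illustrated locally. This is a rough sketch of the loop's shape only, not how Code Interpreter is implemented: `fake_generate` is a hypothetical stand-in for a model call, and a plain subprocess is used here for brevity even though it is not a security sandbox.

```python
import subprocess
import sys


def run_python(code, timeout_s=10):
    """Run code in a separate interpreter process and capture its output."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.returncode == 0, result.stdout, result.stderr


def iterate(generate, max_rounds=3):
    """Call generate(feedback) until the code it returns runs cleanly."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        code = generate(feedback)
        ok, stdout, stderr = run_python(code)
        if ok:
            return round_no, stdout
        feedback = stderr  # the error text becomes context for the next attempt
    raise RuntimeError("no working code after %d rounds" % max_rounds)


# Hypothetical stand-in for a model: the first attempt has a typo,
# the second attempt (made after seeing the error) fixes it.
def fake_generate(feedback):
    if feedback is None:
        return "data = [1, 2, 3]\nprint(sum(data) / len(dta))"
    return "data = [1, 2, 3]\nprint(sum(data) / len(data))"


rounds, output = iterate(fake_generate)
print(rounds, output.strip())
```

The key design point is that the captured error output is passed back to the generator, so each attempt is informed by the previous failure.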

8. Which of the following is a documented, distinctive strength of Claude relative to GPT-4o and Gemini in enterprise deployments?

✓ Correct. Claude's training emphasizes instruction-following accuracy — maintaining complex constraints, format requirements, and behavioral rules across very long responses. This makes it especially strong for tasks requiring precise adherence to detailed specifications, like contract drafting templates or structured research reports.

✗ Claude's distinctive strength is nuanced instruction following: reliably maintaining complex behavioral constraints across long outputs. Real-time search and Google Workspace integration are Gemini's strengths; DALL-E is an OpenAI product.

9. An enterprise team uses Google Workspace daily — documents live in Drive, meetings are scheduled in Google Calendar, and email runs through Gmail. Which model has the strongest native integration argument?

✓ Correct. Gemini's Workspace extensions are native first-party integrations — not third-party connectors. Gemini in Gmail, Docs, and Drive can read and write directly to the user's Workspace without OAuth headaches or third-party middleware. This is a genuine ecosystem advantage for Google-centric organizations.

✗ Gemini has native first-party Workspace integrations — Gemini in Gmail, Docs, Drive, and Meet are Google-built products. OpenAI's Workspace connectors are third-party plugins; Claude has no direct Workspace integration. For Google-centric teams, Gemini's ecosystem advantage is real and significant.

10. What is "search grounding" as implemented in Gemini, and which scenario most benefits from it?

✓ Correct. Search grounding means Gemini can issue a Google Search query as part of answering your prompt, then incorporate the live results before generating its response. This gives it access to information published after its training cutoff — critical for scientific research, news analysis, or any domain where recency matters.

✗ Search grounding is live Google Search integrated into the generation process. The model queries the web before answering, bringing in post-training information. This is Gemini's key advantage for any task requiring current knowledge, such as recent research, news events, or newly published documentation.

11. All three major model families (OpenAI, Anthropic, Google) now support function calling / tool use. What distinguishes OpenAI's position in this space?

✓ Correct. OpenAI introduced the plugin ecosystem in 2023 and the function-calling API shortly after, giving third-party developers a head start building connectors. By mid-2024, the breadth of pre-built integrations (Zapier, Zendesk, GitHub, etc.) for GPT-4o significantly exceeds what's available for Claude and Gemini.

✗ All three providers support function calling. OpenAI's advantage is ecosystem maturity — they were first, which means more third-party developers have built and published integrations. When a team needs to wire up an existing enterprise tool, the GPT-4o connector often already exists.

12. A compliance team must deploy an AI that will handle HIPAA-adjacent patient intake forms. The primary model-selection criterion should be:

✓ Correct. Safety requirements dominate in compliance-sensitive healthcare deployments. The risk of a single non-compliant output, such as a diagnostic suggestion or a medication recommendation, far outweighs cost or latency concerns. Claude's Constitutional AI training gives it cautious defaults that suit this safety posture.

✗ Safety posture is the dominant axis here. In HIPAA-adjacent applications, a single non-compliant model output carries regulatory and legal risk. Cost and context window are secondary; the primary criterion is whether the model's defaults are cautious enough for healthcare content.

13. A startup processes 2 million short text inputs per day for content moderation. Their budget is $100/day. Which strategy is most cost-effective?

✓ Correct. 2M inputs × ~50 tokens average = 100M tokens/day. Gemini Flash at ~$0.075/1M comes to $7.50/day, with Claude Haiku in the same budget tier. GPT-4o at $5/1M comes to $500/day (5× the budget). GPT-4o mini at $0.15/1M comes to $15/day, which is viable but more expensive than Flash. The Flash/Haiku tier is the right answer when cost is the binding constraint.

✗ Flash/Haiku is the correct tier. At 100M tokens/day, GPT-4o costs ~$500/day (far over budget), GPT-4o mini costs ~$15/day, and Flash/Haiku costs ~$7.50/day. The cheapest OpenAI option is not the cheapest overall — Gemini Flash and Claude Haiku are cheaper.
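
The cost arithmetic above can be reproduced in a few lines. This is a minimal sketch using only the approximate prices quoted in the feedback; Haiku is omitted because the text gives no figure for it, and the dictionary keys are informal labels rather than API model identifiers.

```python
# Approximate prices in USD per 1M tokens, as quoted in the quiz feedback.
PRICE_PER_MILLION = {
    "gemini-flash": 0.075,
    "gpt-4o-mini": 0.15,
    "gpt-4o": 5.00,
}


def daily_cost(inputs_per_day, avg_tokens, price_per_million):
    """Daily spend in USD for a given volume, average length, and price."""
    tokens = inputs_per_day * avg_tokens
    return tokens / 1_000_000 * price_per_million


def within_budget(inputs_per_day, avg_tokens, budget):
    """Models whose daily cost fits the budget, cheapest first."""
    costs = {
        model: daily_cost(inputs_per_day, avg_tokens, price)
        for model, price in PRICE_PER_MILLION.items()
    }
    ranked = sorted(costs.items(), key=lambda item: item[1])
    return {model: cost for model, cost in ranked if cost <= budget}


# The scenario from question 13: 2M inputs/day, ~50 tokens each, $100/day.
for model, cost in within_budget(2_000_000, 50, budget=100).items():
    print(f"{model}: ${cost:.2f}/day")
```

Changing `avg_tokens` is a quick way to see how sensitive the answer is to input length, since the 50-token average drives the whole calculation.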

14. Gemini 1.5 Pro's 1M-token context window was publicly demonstrated in April 2024. What real-world document set was used in that demonstration?

✓ Correct. Google's April 2024 demonstration of Gemini 1.5 Pro's context window used the Apollo 11 mission transcript — approximately 240,000 tokens — processed in a single API call. This was cited in the module as evidence that the 1M-token limit is real, not just a marketing figure.

✗ The Apollo 11 mission transcript (~240,000 tokens) was the document used in Google's April 2024 public demonstration of Gemini 1.5 Pro's long-context capability. This real demonstration is cited in the module to establish that the 1M-token window is functional, not theoretical.

15. When applying the five-axis decision framework to a new deployment, which statement best reflects how the axes should be weighted?

✓ Correct. The module's core lesson is that one axis usually dominates: context length dominated the legal document case, ecosystem dominated the SaaS chatbot case, recency dominated the research assistant case, and cost dominated the classification pipeline. The skill is quickly identifying which axis is the binding constraint for your specific task, then using the others as secondary filters.

✗ The framework's practical value comes from identifying which single axis is the binding constraint for your task. Equal weighting leads to paralysis; always-cost or always-safety rules lead to wrong answers in contexts where another axis dominates. The case studies consistently showed one axis driving the decision while the others served as filters.
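
One way to make the "binding constraint" idea concrete is to rate each axis and surface the highest-rated one. This is an illustrative sketch only: the axis names, the 1 to 5 scale, and the example ratings are assumptions, not a scoring method defined by the module.

```python
def binding_constraint(ratings):
    """Return the highest-rated axis and the remaining axes in descending order."""
    if not ratings:
        raise ValueError("rate at least one axis")
    # sorted() is stable, so axes with equal ratings keep their input order.
    ranked = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
    dominant = ranked[0][0]
    secondary = [axis for axis, _ in ranked[1:]]
    return dominant, secondary


# Hypothetical ratings (1 = barely matters, 5 = hard constraint) for a
# high-volume moderation pipeline like the one in question 13.
moderation_pipeline = {
    "cost": 5,
    "latency": 3,
    "safety": 3,
    "task_type": 2,
    "context_length": 1,
}

dominant, secondary = binding_constraint(moderation_pipeline)
print("binding constraint:", dominant)
print("secondary filters:", secondary)
```

The dominant axis narrows the candidate models first; the secondary axes are then applied as filters within that shortlist.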
