When Morgan Stanley deployed an internal AI assistant for its 16,000 financial advisors in November 2023, the team didn't simply pick the most popular model. They evaluated document grounding, citation accuracy, and tone consistency across hundreds of proprietary research documents. The answer was GPT-4 with a custom retrieval layer β not because GPT-4 was universally superior, but because it best matched the specific constraints of that deployment.
That choice process β matching task properties to model strengths β is the core skill of this module.
GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro are all capable of drafting emails, answering questions, and writing code. At the surface level they seem interchangeable. But capability overlap disguises real differences in default behavior β differences that compound over thousands of prompts in a production environment.
The wrong default model choice creates friction: outputs that require heavy editing, safety refusals that interrupt workflows, context windows that run short on long documents, or latency mismatches on real-time applications. Each issue costs time and erodes trust in the tooling.
Model selection is not a one-time decision made at project kickoff. It is an ongoing judgment call tied to task type, audience, budget, latency requirements, and safety posture. Skilled practitioners build a mental map of where each model excels and where it struggles β and they update that map as models evolve.
Every model selection decision can be decomposed into five axes. Rate your task on each before committing to a model.
The table below summarizes the dominant positioning of each model family as of mid-2024. These are tendencies, not hard limits.
| Dimension | GPT-4o | Claude 3.5 | Gemini 1.5 |
|---|---|---|---|
| Reasoning depth | High | High | High |
| Context window | 128K tokens | 200K tokens | 1M tokens |
| Default tone | Neutral / direct | Careful / thorough | Informative / broad |
| Multimodal input | Text, image, audio, video (via API) | Text, image | Text, image, audio, video, PDF |
| Code generation | Excellent | Excellent | Very good |
| Cost tier (mid-2024) | $5/$15 per 1M tokens in/out | $3/$15 per 1M tokens in/out | $3.50/$10.50 per 1M tokens in/out |
| Live web access | Via tools | No (base) | Via Gemini app / extensions |
KEY INSIGHT
There is no universally best model. There is only the best model for a given task, at a given cost, for a given audience, at a given moment. The frameworks in this module give you a repeatable method to find that model quickly β and to defend the choice when stakeholders ask why.
In this lab you'll describe a task or workflow you actually need AI for. The assistant will walk you through the five-axis decision framework β task type, context length, safety posture, cost/latency, and ecosystem integration β and reason toward a model recommendation with justification.
Push back. Ask why. Try a different scenario to compare outputs. Complete at least 3 exchanges to finish the lab.
When Khan Academy built Khanmigo β its AI tutor for millions of students β the team selected GPT-4 as the underlying model. The reasoning was documented in Sal Khan's March 2023 TED Talk: stepwise reasoning quality, the ability to walk through a math problem without jumping to the answer, and the maturity of OpenAI's content-moderation layer for a child-safe environment.
By 2024 Khanmigo had served tens of millions of tutoring sessions. The choice held not because GPT-4 was cheapest or fastest, but because its pedagogical reasoning pattern matched the product's core requirement.
GPT-4o ("o" for omni) launched in May 2024, combining text, image, audio, and video processing in a single model endpoint. Three structural strengths make it the default choice in specific contexts:
1. Instruction-following precision. On OpenAI's internal evals and on third-party benchmarks like MT-Bench, GPT-4o consistently scores highest on multi-step instruction adherence. When a prompt says "return only JSON with no commentary," GPT-4o is least likely to add a preamble. This matters enormously in agentic pipelines where output is parsed programmatically.
2. Tool-use and function-calling maturity. OpenAI introduced structured function calling in June 2023. By mid-2024 this API was deeply integrated into frameworks like LangChain, LlamaIndex, and AutoGen. The ecosystem assumes GPT-4o defaults, which means less glue code and more battle-tested examples when you're building tool-augmented agents.
3. The Assistants API and persistent threads. GPT-4o is the backbone of OpenAI's Assistants API, which manages conversation state, file uploads, code interpreter, and retrieval augmented generation (RAG) in one managed surface. For teams that don't want to build their own memory and retrieval layer from scratch, this matters significantly.
These are task categories where GPT-4o has documented production advantages over alternatives as of 2024:
Matching a model also means knowing where it underperforms. GPT-4o's documented weaknesses in production settings include:
Context window constraints on very long documents. At 128K tokens, GPT-4o handles most use cases but loses to Gemini 1.5 Pro on tasks requiring full-document analysis of books, legal archives, or codebases exceeding 100K tokens.
Verbosity in long-form creative writing. User comparisons on platforms like LMSYS Chatbot Arena (which has collected over 1 million head-to-head evaluations) consistently show Claude 3.5 Sonnet rated higher on prose quality and nuanced creative writing tasks.
Safety-layer friction in sensitive domains. GPT-4o's default content moderation can refuse edge-case security research, certain medical information, or mature creative scenarios β requiring careful system prompt engineering to unlock legitimate professional use cases.
PRACTITIONER NOTE
If your project involves tool-augmented agents, lives inside the Microsoft ecosystem, requires real-time voice, or needs the most battle-tested function-calling API β GPT-4o is the default-safe choice. When those conditions don't apply, the default shifts.
Many teams default to GPT-4o because it's what they know. In this lab, describe a specific project or task you're considering for GPT-4o. The assistant will probe whether the GPT-4o-specific strengths β function calling maturity, Assistants API, real-time audio, Microsoft ecosystem fit β genuinely apply to your case, or whether a different model might serve better.
Be specific about your task. The more detail you give, the more targeted the analysis. Complete at least 3 exchanges to finish the lab.
Quora's Poe platform β which hosts multiple AI models in one interface β gives users direct A/B comparison data across Claude, GPT-4, and Gemini on identical prompts. By early 2024, Poe's internal usage patterns showed Claude 3 Opus leading on long-form writing and document analysis requests, while GPT-4 dominated code-assistance sessions. Meanwhile, Notion AI, which integrated Claude into its notes and documents product in 2023, cited Claude's ability to maintain consistent document voice across edits as the deciding factor.
These are not preferences β they reflect measurable differences in output patterns at scale.
Anthropic's Claude 3.5 Sonnet (released June 2024) is trained with Constitutional AI β a process where the model critiques its own outputs against a set of principles before finalizing responses. This produces three observable behaviors that make Claude the right choice in specific contexts:
Long-form writing consistency. Claude tends to maintain stylistic coherence across 5,000β20,000 word outputs better than GPT-4o. This was a documented reason Notion AI chose Claude: when editing a 10-page strategy document, Claude preserves voice and structure across multiple revision passes. GPT-4o, by contrast, sometimes introduces tonal shifts between sections.
Nuanced instruction adherence on sensitive topics. Claude's training includes explicit calibration on harm avoidance that is less binary than GPT-4o's moderation layer. In practice, Claude can engage with medical edge cases, legal hypotheticals, and security research questions that GPT-4o refuses β while still declining genuinely harmful requests. This makes Claude a better fit for professional domain tools in medicine, law, and security.
200K-token context with strong retrieval accuracy. Claude 3.5 Sonnet's 200K-token window is paired with strong "needle in a haystack" performance β accurately locating a specific sentence buried deep in a long document. In Anthropic's April 2024 evaluations, Claude 3 Opus scored above 99% on needle-in-a-haystack benchmarks across the full 200K window.
Google's Gemini 1.5 Pro arrived in February 2024 with two structural advantages that are genuinely category-defining β not marketing claims:
The 1-million-token context window. This isn't just "bigger" β it's a qualitative shift. In May 2024, Google demonstrated Gemini 1.5 Pro analyzing the entirety of a 44-minute Buster Keaton film, answering specific scene-level questions from the raw video input. For legal firms processing entire case archives, medical researchers analyzing clinical trial corpora, or developers doing codebase-wide analysis, this window eliminates chunking, summarization, and retrieval steps that introduce errors in shorter-context pipelines.
Native Google Workspace integration. Gemini is embedded into Gmail, Docs, Sheets, and Meet via Google Workspace Labs (now generally available). For organizations on Google Workspace, this means AI that reads your existing Drive files, summarizes your last 30 Gmail threads, and drafts Docs with direct access to your data β without a single API call or custom integration.
Multimodal breadth. Gemini 1.5 Pro accepts interleaved text, image, audio, video, and PDF in a single prompt. This multimodal flexibility is broader than GPT-4o's current API surface and makes Gemini the natural choice for tasks like analyzing a recorded sales call (audio) alongside its CRM notes (text) and the customer's proposal PDF (document).
The following table maps task characteristics to the model most likely to produce superior results based on documented capabilities and production usage patterns as of mid-2024.
| Task Characteristic | Best Default | Why |
|---|---|---|
| Document >200K tokens | Gemini 1.5 Pro | Only model with sufficient native context |
| Document 50Kβ200K tokens | Claude 3.5 | Strong needle-in-haystack accuracy, lower cost than Gemini 1.5 Pro at this range |
| Long-form prose editing | Claude 3.5 | Voice consistency across long outputs; LMSYS Arena preference data |
| Google Workspace task | Gemini 1.5 | Native Drive/Gmail/Docs integration; zero integration friction |
| Medical / legal professional tool | Claude 3.5 | Less binary safety refusals; Constitutional AI calibration |
| Multi-modal: audio + video + doc | Gemini 1.5 | Broadest native multimodal input support |
| Tool-calling agent pipeline | GPT-4o | Most mature function-calling API and ecosystem |
Use the AI below to explore Lesson 3 concepts in depth. Challenge assumptions and work through scenarios.
In Lessons 1β3 you built a mental map: a five-axis framework, GPT-4o's ecosystem depth, and the distinct strengths of Claude and Gemini. In practice, model selection happens fast β a team lead asks which model to wire up, and you have five minutes to answer with confidence. This lesson closes the loop with four concrete scenarios: what was chosen, why, and what tradeoffs were accepted.
Scenario: A mid-size law firm needs to process full merger-and-acquisition due diligence packages β typically 200β400 pages of contracts, disclosure schedules, and regulatory filings β and produce structured summaries before each partner review meeting.
Model selected: Claude 3.5 Sonnet (200K context window).
The decisive factor was context length. A 300-page document, once converted to text, runs to roughly 150,000β180,000 tokens. GPT-4o's 128K window would require chunking β introducing seam errors where related clauses on page 12 and page 280 never appear in the same context. Claude's 200K window ingests the entire document in one call, allowing the model to catch cross-references, notice contradictions between schedules, and produce a coherent whole-document summary.
Constitutional AI training was a secondary benefit: the firm's ethics committee was comfortable with Claude's cautious defaults on sensitive client data. Cost was a non-issue at this task volume (a few hundred documents per month).
KEY TRADEOFF
Claude's 200K context solved the chunking problem but added per-call cost versus using Claude Haiku on chunked segments. For documents consistently under 80K tokens, a chunk-and-merge strategy with Haiku would be cheaper β but for full due diligence packages, whole-document coherence justified the Sonnet price point.
Scenario: A B2B SaaS company wants to deploy an AI-powered support bot that handles billing questions, feature explanations, and account troubleshooting. It must integrate with Zendesk, pull from a Confluence knowledge base, and escalate to human agents via webhook.
Model selected: GPT-4o via the OpenAI API.
The ecosystem argument was decisive here. OpenAI's function-calling specification is natively supported by Zendesk's AI partner integrations, and the Confluence connector had already been built by the infrastructure team using an existing OpenAI plugin. Switching to Claude would have meant rebuilding those connectors. GPT-4o's broad tool-use support also made it straightforward to wire up the escalation webhook.
Consumer familiarity mattered too β the support team was already using ChatGPT internally, so GPT-4o's behavioral defaults felt predictable to the team writing system prompts. Response latency on GPT-4o mini (used for intent classification) was fast enough that users experienced no perceptible delay before the model routed to the right knowledge-base section.
KEY TRADEOFF
The team actually ran GPT-4o mini for intent classification (cheap, fast) and GPT-4o for final response generation (higher quality). This two-tier model routing pattern β a small model decides, a large model generates β is increasingly common in production pipelines and cuts cost by 60β80% on high-volume deployments.
Scenario: A computational biology lab wants an assistant that can answer questions about recent preprints, synthesize findings across 2024 literature, and suggest experimental protocols β all while staying current with work published in the last three months.
Model selected: Gemini 1.5 Pro with search grounding enabled.
Recency was the critical constraint. Claude and standard GPT-4o have training cutoffs and no live web access by default; their knowledge of a preprint posted last Tuesday is zero. Gemini's search grounding capability β routing queries through Google Search before generating a response β meant the assistant could accurately surface findings from arXiv, bioRxiv, and PubMed published days earlier.
The lab also ran experiments through Google Colab and stored datasets in Google Drive. Gemini's Workspace integration let the assistant access Drive folders directly when analyzing existing datasets, eliminating a manual copy-paste step that had been friction in earlier workflows.
KEY TRADEOFF
Search grounding added latency β each query triggered a web search before generation, adding 1β3 seconds per response. For a research assistant used asynchronously (not real-time), this was acceptable. For a latency-sensitive application, that same feature would be a deal-breaker.
Scenario: A media monitoring company needs to classify 500,000 social media posts per day into 12 content categories (news, opinion, satire, misinformation, etc.) with a cost budget of under $50/day and a latency requirement of under 500ms per call.
Model selected: Gemini 1.5 Flash (primary) / Claude Haiku (fallback).
At this volume, cost arithmetic dominates every other consideration. Gemini 1.5 Flash was priced at approximately $0.075 per 1M input tokens at mid-2024 rates. Processing 500,000 posts averaging ~100 tokens each is 50M tokens/day β roughly $3.75/day, well within budget. GPT-4o at $5/1M tokens would cost ~$250/day for the same load. Even GPT-4o mini ($0.15/1M input) would cost $7.50/day β more expensive than Flash with no latency advantage.
Flash's median response time of ~300ms cleared the 500ms SLA. Claude Haiku was configured as a fallback for posts where Flash returned low-confidence classifications, accepting slightly higher cost on the ~5% of ambiguous cases in exchange for better accuracy on edge cases.
KEY TRADEOFF
Flash's speed-and-cost advantages come with a quality ceiling. On nuanced satire detection or multi-label classification, Flash underperformed Sonnet-class models in internal evaluations. The team accepted lower accuracy on hard cases in exchange for the economics working at scale β a deliberate engineering tradeoff, not an oversight.
Looking across all four cases, three patterns emerge consistently:
Apply and extend the concepts from this lesson through guided conversation with an AI assistant.
Use this lab to explore how the concepts from Lesson 4 apply to your own questions and interests. The AI assistant is here to help you think through complex scenarios.
15 questions covering all lessons β free, untracked, retake anytime.