GPT vs. Claude vs. Gemini · Module 8 · Lesson 1

Release Velocity: How Fast the Labs Are Moving

GPT-4 launched in March 2023. By early 2025 its successors had been replaced twice over. The pace is not slowing.

On March 14, 2023, OpenAI announced GPT-4, publishing its technical report and system card and opening a waitlist for API access. Within days, developers were sharing demos ranging from exam solvers to coding assistants. Anthropic announced Claude the same day. Google had unveiled Bard, then still powered by LaMDA, on February 6 and opened early access on March 21, under intense pressure after ChatGPT reached an estimated 100 million users in two months, the fastest consumer-app adoption recorded up to that point. The race was public, the stakes were existential, and the release cadence would only compress from there.

What followed was not a slow technological rollout. It was a sprint where each lab dropped a major model roughly every three to six months, each leapfrogging or at minimum matching the previous leader on benchmarks. Understanding that velocity — its causes, its costs, and its strategic logic — is the foundation for everything that comes next.

The Documented Timeline: 2023–2025

The public release record shows just how compressed the cycles became. OpenAI shipped GPT-4 in March 2023, then GPT-4 Turbo in November 2023 (128k context, lower cost), then GPT-4o in May 2024 (native multimodal, real-time voice), then o1 (initially as o1-preview) in September 2024 (chain-of-thought reasoning), and previewed o3 in December 2024. Anthropic shipped Claude 1 in March 2023, Claude 2 in July 2023 (100k context window), Claude 3 Opus/Sonnet/Haiku in March 2024, and Claude 3.5 Sonnet in June 2024, which many independent evaluations scored above GPT-4o on coding tasks at that moment. Google shipped Gemini 1.0 in December 2023, Gemini 1.5 Pro in February 2024 (1 million token context), and Gemini 1.5 Flash in May 2024.

Each release was not merely a version bump. GPT-4o's native multimodality meant the model could process images, audio, and text simultaneously without separate pipelines. Claude 3.5 Sonnet introduced "computer use" capability in October 2024 — the model could see a screen and click elements. Gemini 1.5 Pro's one-million-token context let users upload entire codebases or feature-length films for analysis. These were category expansions, not polish releases.

Why the Labs Release So Fast

Three forces drive the cadence. First, compute cost deflation: by some estimates, inference costs per token for a given capability level have fallen roughly 10× every twelve months since GPT-4's launch, meaning yesterday's expensive frontier model becomes today's cheap API. Labs must release newer models to maintain premium pricing before their current flagship commoditizes. Second, talent competition: researchers want to publish or ship, and long internal development cycles without external milestones cause defections. Anthropic itself was founded in 2021 by researchers who left OpenAI, reportedly in part over disagreements about direction and safety. Third, enterprise sales cycles: large customers sign multi-year contracts anchored to capability promises, so labs must demonstrate continuous progress to justify renewals.
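To make the first force concrete, the arithmetic below projects a per-token price under the lesson's rough figure of a 10× decline every twelve months. This is a minimal sketch: the starting price of $30 per million tokens and the function name are illustrative assumptions, not published pricing.

```python
def projected_price(start_price: float, months: float, factor_per_year: float = 10.0) -> float:
    """Price after `months`, assuming a constant `factor_per_year` decline."""
    return start_price / (factor_per_year ** (months / 12.0))


# Hypothetical starting price of $30 per million tokens.
for m in (0, 6, 12, 24):
    print(f"month {m:>2}: ${projected_price(30.0, m):.2f} per million tokens")
```

Under that assumption a flagship loses about two thirds of its price position within six months, which is roughly the length of one release cycle.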

The cadence also reflects a genuine scientific reality: scaling laws continued to hold longer than many predicted. Adding more compute and data kept improving performance, so labs with more capital could keep shipping improvements without fundamental algorithmic breakthroughs. Microsoft's investment in OpenAI (extended in January 2023 and widely reported to total about $13 billion) and Google's investment in Anthropic (reported at roughly $300 million in early 2023, with a commitment of up to $2 billion reported later that year) gave those labs the runway to sustain rapid training runs.

What "Frontier" Actually Means in 2025

The term "frontier model" has no single formal definition. In this course it means the most capable publicly accessible model at any given moment, typically evaluated on a standard suite of benchmarks including MMLU, HumanEval, MATH, and GPQA. In practice, the frontier now moves so fast that a model can be "state of the art" for six weeks before being overtaken. The LMSYS Chatbot Arena leaderboard, which uses human preference ratings from blind A/B comparisons rather than fixed benchmarks, has become the most watched real-time indicator, and it showed leadership changing hands among GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro multiple times during 2024.

For practitioners, this creates a practical challenge: the "best" model for your use case may not be the same model next quarter. Building workflows that lock to a single model version is increasingly risky. The strategic lesson is to understand the dimensions on which each lab competes — reasoning depth, context length, multimodal fidelity, latency, cost — rather than simply chasing the current benchmark leader.

Key Fact

From GPT-4's launch in March 2023 to the end of 2024, OpenAI, Anthropic, and Google collectively shipped more than 20 distinct named model variants available via public API or consumer product. The average time between major capability jumps from any one lab was approximately 4.5 months.

Strategic Takeaway

Release velocity is itself a competitive moat: the lab that ships fastest accumulates user feedback fastest, which improves RLHF data, which improves the next model. Speed compounds. This is why all three major labs have maintained or accelerated their cadence despite enormous cost and safety review burdens.

Lesson 1 Quiz

3 questions — free, untracked, retake anytime.

1. ChatGPT reached 100 million users in approximately how long after launch — the fastest consumer-app adoption ever recorded at the time?

✓ Correct. ChatGPT hit 100 million users in roughly two months after its November 2022 launch, setting a record for consumer app growth and triggering Google's emergency "Code Red" response.

✗ Not quite. ChatGPT reached 100 million users in approximately two months — the fastest consumer-app growth ever recorded at that point, which directly accelerated Google's and Anthropic's release timelines.

2. Which capability made Gemini 1.5 Pro's February 2024 release a genuine category expansion rather than an incremental update?

✓ Correct. Gemini 1.5 Pro launched with a one-million-token context window, enabling analysis of entire codebases or feature-length films in a single prompt — a qualitatively different capability class.

✗ Not this one. The defining capability of Gemini 1.5 Pro was its one-million-token context window. Computer use was Claude 3.5 Sonnet's October 2024 addition; native voice was GPT-4o's feature.

3. Why does high release velocity function as a competitive moat — not just a marketing tactic?

✓ Correct. Speed compounds: more user interactions generate more preference data, which improves alignment training, which makes the next model better, which attracts more users. The flywheel is real.

✗ The deeper reason is that shipping fast generates real-world usage data that feeds back into RLHF training, creating a compounding advantage. The lab that ships most frequently learns fastest from actual users.

Lab 1: Mapping the Release Timeline

Use the AI to drill into real release events and their strategic significance.

Practice: Analyzing AI Model Release Cadence

In this lab you'll interrogate the documented timeline of GPT, Claude, and Gemini releases to extract patterns, understand strategic motivations, and think about what each major capability jump meant for practitioners at the time.

Try asking the assistant to compare specific releases, explain why a lab chose a particular launch window, or identify which capability expansion had the largest practical impact on developers.

Suggested start: "Between GPT-4 Turbo in November 2023 and GPT-4o in May 2024, what changed strategically for OpenAI and why did the multimodal-native approach matter?"

GPT vs. Claude vs. Gemini · Module 8 · Lesson 2

Reasoning Models and the "Thinking" Revolution

When OpenAI's o1 spent 20 seconds before answering a math problem, something fundamental had shifted. The model wasn't faster — it was slower on purpose.

On September 12, 2024, OpenAI released o1-preview, the model previously known by the internal codename "Strawberry", with a notable warning in its own documentation: responses might take up to several minutes for difficult problems. This was not a bug. The model had been trained to spend tokens on internal reasoning steps before producing an answer, building on a technique the research community had been discussing since Google's 2022 paper on "chain-of-thought prompting" showed that prompting models to write out intermediate steps significantly improved performance on multi-step reasoning tasks. The o1 models internalized that process. On AIME 2024, a qualifying exam for the mathematics olympiad, OpenAI reported that o1 scored 83%, compared to GPT-4o's 13%. On Codeforces competitive programming contests, o1 ranked in the 89th percentile of human participants.

The implications were immediate and unsettling for the other labs. Gemini and Claude had been competitive on standard benchmarks. On hard reasoning tasks, they were suddenly far behind. Both Anthropic and Google accelerated their own reasoning-model programs in response.

What "Reasoning Models" Actually Do

Standard large language models generate tokens autoregressively — each token is produced in one forward pass based on context. The model does not "reconsider." Reasoning models like o1 introduce a trained behavior where the model generates a scratchpad of intermediate reasoning steps (often invisible to the user) before producing a final answer. This is sometimes called test-time compute scaling: spending more computation at inference time to improve output quality, rather than relying solely on parameters learned during training.
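The idea of trading inference compute for accuracy can be illustrated with a much simpler technique than the one o1 uses: sampling several answers and taking a majority vote (often called self-consistency). The sketch below is a toy simulation, not OpenAI's method. The `noisy_solver` function is a hypothetical stand-in for one sampled model answer, and its 40% accuracy is an assumed figure.

```python
import random
from collections import Counter


def noisy_solver(rng: random.Random, correct: int = 42, p_correct: float = 0.4) -> int:
    """Stand-in for one sampled answer: right 40% of the time, else a scattered wrong value."""
    if rng.random() < p_correct:
        return correct
    return rng.choice([39, 40, 41, 43, 44, 45])


def majority_vote(samples: list[int]) -> int:
    """Return the most common answer among the samples."""
    return Counter(samples).most_common(1)[0][0]


def accuracy(n_samples: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        answers = [noisy_solver(rng) for _ in range(n_samples)]
        if majority_vote(answers) == 42:
            hits += 1
    return hits / trials


for n in (1, 5, 25):
    print(f"{n:>2} samples per question -> accuracy {accuracy(n):.2f}")
```

Because wrong answers scatter while correct ones agree, more samples per question raise accuracy at the price of proportionally more tokens. Reasoning models make a related trade, but inside a single trained chain of thought rather than across independent samples.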

The research basis comes from several documented findings. Wei et al. (2022) at Google Brain demonstrated chain-of-thought prompting on GPT-3-scale models. Kojima et al. (2022) showed that "Let's think step by step" as a zero-shot prompt significantly improved arithmetic and commonsense reasoning. OpenAI's o1 announcement and system card (September 2024) stated that the model was trained using large-scale reinforcement learning to produce longer chains of thought before answering; OpenAI did not publish the details of the reward design. According to OpenAI, the model learned to backtrack, try alternative approaches, and self-correct, behaviors that standard fine-tuning had not reliably produced.

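The trade-off is latency and cost. Inference with o1 is significantly more expensive per query than with GPT-4o because it generates many more tokens internally, and those hidden reasoning tokens are billed as output tokens. OpenAI priced o1-preview at $15 per million input tokens and $60 per million output tokens at launch, compared to $5 and $15 under GPT-4o's May 2024 launch pricing: a 3× premium on input and 4× on output, before counting the extra reasoning tokens.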
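A short calculation shows why the per-query premium can be much larger than the list-price ratio. The sketch assumes launch list prices of $15/$60 per million input/output tokens for o1-preview and $5/$15 for GPT-4o, and assumes hidden reasoning tokens are billed as output tokens. The token counts are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


def query_cost(p: Pricing, input_tokens: int, visible_output_tokens: int,
               hidden_reasoning_tokens: int = 0) -> float:
    """Cost of one query; reasoning tokens are counted as billed output."""
    billed_output = visible_output_tokens + hidden_reasoning_tokens
    return (input_tokens * p.input_per_m + billed_output * p.output_per_m) / 1_000_000


GPT_4O = Pricing(input_per_m=5.0, output_per_m=15.0)
O1_PREVIEW = Pricing(input_per_m=15.0, output_per_m=60.0)

# Hypothetical query: 2,000 input tokens, 500 visible output tokens,
# plus 4,000 hidden reasoning tokens for the reasoning model.
standard = query_cost(GPT_4O, 2_000, 500)
reasoning = query_cost(O1_PREVIEW, 2_000, 500, hidden_reasoning_tokens=4_000)
print(f"standard:  ${standard:.4f}")
print(f"reasoning: ${reasoning:.4f}")
print(f"ratio:     {reasoning / standard:.1f}x")
```

With these assumed token counts the reasoning query costs well over ten times the standard one, even though the input list price differs by only 3×.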

The Competition Responds

Anthropic released Claude 3.7 Sonnet with extended thinking in February 2025, exposing its reasoning chain to users (o1's scratchpad is hidden). This design choice was deliberate: Anthropic argued that visible reasoning improves user trust and allows verification of the model's logic. Google DeepMind released Gemini 2.0 Flash Thinking as an experimental model in December 2024, also with visible thought traces, optimizing for speed rather than maximum depth. The company noted that Flash Thinking performed comparably to o1 on many benchmarks at substantially lower latency.

The emergence of reasoning models also changed the benchmark landscape. Tasks that had previously been considered "solved" by frontier models, like grade-school math, were no longer useful discriminators. Researchers shifted to FrontierMath (research-level mathematics problems), GPQA Diamond (graduate-level science), and LiveCodeBench (competitive programming with contamination controls). On these harder evaluations, the gap between reasoning and non-reasoning models is stark: on GPQA Diamond, o1 scores roughly 25 percentage points higher than GPT-4o (about 78% versus 53%).

Practical Implications for Users

Not every task benefits from a reasoning model. For tasks requiring fast, conversational, or creative responses — drafting emails, brainstorming, summarization — the added latency and cost of o1-class models provide no benefit. The sweet spot is tasks with verifiable correct answers involving multi-step logic: debugging complex code, solving mathematical proofs, analyzing legal argument structure, planning multi-step research. OpenAI's own documentation recommends using GPT-4o for most tasks and escalating to o1 only when reasoning depth is the bottleneck.

The strategic picture for practitioners: you now need to maintain awareness of two tiers of model — fast/cheap generalist models and slow/expensive reasoning models — and route tasks accordingly. This model-routing decision is itself a skill that is becoming part of AI literacy in 2025.
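The routing decision described above can be sketched as a small rule-based function. Everything here is an illustrative assumption: the `Task` fields, the tier names, and the rules themselves are one possible encoding of the lesson's guidance, not a vendor recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    description: str
    multi_step_logic: bool    # needs chained deductions or planning
    verifiable_answer: bool   # output can be checked as right or wrong
    latency_sensitive: bool   # a user is waiting interactively


def choose_tier(task: Task) -> str:
    """Return 'reasoning' only when reasoning depth is the likely bottleneck."""
    if task.latency_sensitive:
        return "generalist"
    if task.multi_step_logic and task.verifiable_answer:
        return "reasoning"
    return "generalist"


tasks = [
    Task("Draft a follow-up email", False, False, True),
    Task("Find the bug in a concurrent queue", True, True, False),
    Task("Brainstorm product names", False, False, False),
]
for t in tasks:
    print(f"{t.description}: {choose_tier(t)}")
```

In practice the inputs to such a router are themselves judgment calls, which is why the lesson treats routing as a skill rather than a lookup table.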

Benchmark Snapshot

On AIME 2024 (the American Invitational Mathematics Examination, a qualifying round for the US mathematical olympiad): GPT-4o scored 13%, Claude 3.5 Sonnet scored 16%, o1 scored 83%. On GPQA Diamond (PhD-level science): GPT-4o scored 53%, o1 scored 78%. These gaps illustrate why reasoning models are not just incremental improvements: they represent a different capability tier for hard problems.

The Meta-Lesson

Reasoning models reveal that "intelligence" in AI systems is not a single axis. A model can be highly capable at language tasks while being poor at multi-step logic — and vice versa. The frontier in 2025 is defined not just by what models know but by how deeply they can reason through problems they have never seen before.

Lesson 2 Quiz

1. What technique does OpenAI's o1 use to dramatically improve performance on hard reasoning tasks compared to GPT-4o?

✓ Correct. O1 was trained with reinforcement learning to generate internal chain-of-thought reasoning steps before answering — called test-time compute scaling — not simply a bigger model.

✗ Not this one. O1's key innovation is test-time compute scaling: it was trained with RL to produce extended internal reasoning chains before giving a final answer, allowing it to backtrack and self-correct on hard problems.

2. On the AIME 2024 mathematics competition, what scores did GPT-4o and o1 receive respectively?

✓ Correct. GPT-4o scored 13% and o1 scored 83% on AIME 2024 — a 70-point gap that demonstrated reasoning models occupy a qualitatively different capability tier on hard math.

✗ The documented scores were GPT-4o at 13% and o1 at 83% — a 70-percentage-point gap that shocked the AI research community and accelerated rival labs' reasoning-model programs.

3. For which task type does OpenAI's own documentation recommend using GPT-4o rather than o1?

✓ Correct. Fast, creative, or conversational tasks get no benefit from o1's added latency and cost — GPT-4o is faster, cheaper, and sufficient. O1 shines only when reasoning depth is the actual bottleneck.

✗ OpenAI recommends reserving o1 for tasks where multi-step reasoning depth is the bottleneck. For fast conversational or creative tasks like email drafting or brainstorming, GPT-4o is faster, cheaper, and equally good.

Lab 2: Reasoning Model Decision-Making

Practice routing tasks between standard and reasoning-tier models.

Practice: When to Use Reasoning Models

The skill of knowing when to escalate to a reasoning model (o1, Claude thinking mode, Gemini Flash Thinking) is increasingly important. In this lab, describe a task to the assistant and get a structured recommendation: which model tier is appropriate, why, and what the cost-benefit trade-off looks like.

You can also ask the assistant to explain benchmark differences, walk through why o1 outperforms GPT-4o on specific problem types, or contrast how Anthropic and Google designed their own reasoning systems differently from OpenAI.

Suggested start: "I need to analyze 200 pages of contract language to find inconsistent indemnification clauses across 12 sections. Should I use a reasoning model or a standard model, and which specific model would you recommend?"

GPT vs. Claude vs. Gemini · Module 8 · Lesson 3

Multimodality, Agents, and the Expanding Action Space

In 2023, these models answered questions. By 2025 they were booking flights, writing code, and clicking buttons on screens they had never seen before.

On October 22, 2024, Anthropic published a research blog post titled "Developing a computer use model." Claude 3.5 Sonnet could now receive a screenshot, identify UI elements, and output actions — move cursor here, click this button, type this text — that a computer would then execute. The demo showed Claude navigating a desktop browser, filling out web forms, and running terminal commands. Anthropic called it a "beta" feature and explicitly warned against giving it access to sensitive accounts or running it without a human in the loop. The capability was real but the safety infrastructure was still forming.

This was not an isolated demo. It followed the broad rollout of GPT-4o's Advanced Voice Mode in September 2024, which let users have fluid spoken conversations with the model maintaining context and responding with natural prosody. And it preceded Google's Project Mariner announcement in December 2024: a Gemini-powered agent that could navigate the web autonomously within the Chrome browser. The action space for these models had expanded from "generate text" to "interact with the world."

The Architecture of Multimodality

Early multimodal systems stitched together separate models: a vision encoder (like CLIP) would process an image and produce an embedding that was then concatenated with text tokens and fed into a language model. GPT-4's vision capability (announced March 2023) is widely assumed to have worked along these lines, although OpenAI did not publish architectural details. The shift came with natively multimodal architectures, where the model is trained from the beginning on interleaved text, image, audio, and video data, so that all modalities share a single token space and attention mechanism.

GPT-4o (May 2024) was OpenAI's first natively multimodal model. Its system card noted that audio was processed directly rather than through a Whisper transcription step, enabling the model to detect emotion, hesitation, and ambient sound. Gemini 1.0 (December 2023) was Google's first model designed as natively multimodal from training, built on Google DeepMind's experience with Flamingo and SoundStorm. Claude 3 launched with image understanding but not audio; Anthropic has been more deliberate about adding modalities, citing safety evaluation time.

The practical difference matters: a stitched multimodal system can only respond to images in text; a natively multimodal system can reason about the relationship between what it hears and what it sees simultaneously — crucial for video understanding, real-time conversation, and agentic tasks where visual feedback informs next actions.

AI Agents: From Chatbots to Actors

An agent in the AI sense is a model that can take sequences of actions to complete a goal, rather than responding to a single prompt. The minimal architecture requires: a model capable of planning, a set of tools the model can call (web search, code execution, file read/write), and a loop that feeds tool outputs back into the model's context. In 2024 this architecture moved from research demo to production product.
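The three parts of that minimal architecture (a planning model, a set of tools, and a loop that feeds tool output back into context) can be sketched in a few lines. This is a toy under stated assumptions: `scripted_model` is a hypothetical stand-in for a real model call, the single `calculator` tool is invented for illustration, and the action format is not any vendor's API.

```python
from typing import Callable

Tool = Callable[[str], str]


def calculator(expression: str) -> str:
    """Toy tool: evaluates 'a+b' for two integers."""
    a, b = expression.split("+")
    return str(int(a) + int(b))


TOOLS: dict[str, Tool] = {"calculator": calculator}


def scripted_model(context: list[str]) -> dict:
    """Stand-in for a planning model: call the calculator once, then finish."""
    tool_results = [line for line in context if line.startswith("tool:")]
    if not tool_results:
        return {"action": "call", "tool": "calculator", "input": "2+3"}
    return {"action": "finish", "answer": tool_results[-1].removeprefix("tool:")}


def run_agent(goal: str, model: Callable[[list[str]], dict],
              tools: dict[str, Tool], max_steps: int = 5) -> str:
    """Loop: ask the model for an action, run the tool, append the result to context."""
    context = [f"goal:{goal}"]
    for _ in range(max_steps):
        decision = model(context)
        if decision["action"] == "finish":
            return decision["answer"]
        output = tools[decision["tool"]](decision["input"])
        context.append(f"tool:{output}")
    raise RuntimeError("step budget exhausted before the agent finished")


print(run_agent("add 2 and 3", scripted_model, TOOLS))
```

The `max_steps` budget is a deliberate design choice: an agent loop with no upper bound can keep calling tools indefinitely when the model never decides it is done.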

OpenAI launched Operator in January 2025: a web-browsing agent built on GPT-4o's vision capabilities that could complete tasks like ordering groceries or booking restaurant reservations without user intervention at each step. Google unveiled Project Astra (a real-time multimodal assistant) at Google I/O in May 2024 and announced Project Mariner (a web navigation agent) in December 2024. Anthropic has published guidance on how Claude should behave when taking actions with real-world consequences, including what this lesson calls a "minimal footprint" principle: request only necessary permissions, prefer reversible actions, and check with humans when uncertain.

The main benchmarks for agentic capability are WebArena and OSWorld, which require navigating real websites or operating systems to achieve specified goals. As of late 2024, the best published OSWorld result was Claude 3.5 Sonnet's 14.9% in the screenshot-only setting (22.0% when allowed more steps), roughly double the next-best system but far below the human baseline of about 72%. Progress is rapid but reliability for high-stakes autonomous tasks remains limited.

The Safety Dimension of Agentic AI

When a model can act — not just answer — the stakes of errors change qualitatively. A wrong answer in a chatbot is annoying; a wrong action by an agent managing email, code deployment, or financial transactions can be irreversible. Each lab has responded differently. Anthropic's published Constitutional AI approach includes explicit constraints for agentic settings. OpenAI's Operator documentation includes task-specific safety rails and human confirmation steps for "sensitive" actions. Google's Project Mariner runs in an isolated browser context with no access to local filesystem or other accounts by default.

Regulators have noticed. The EU AI Act's high-risk category definitions include autonomous systems making consequential decisions. The US AI Safety Institute (AISI), established in late 2023, has begun evaluating agentic models alongside pure language models. The expansion into agents is the primary reason AI governance conversations intensified through 2024.

Documented Capability Milestones

GPT-4o Advanced Voice Mode broad rollout: September 2024. Claude 3.5 computer use: October 2024. Google Project Mariner web agent: December 2024. OpenAI Operator product launch: January 2025. Each of these moved the boundary from "language tool" to "action-taking system", a distinction with profound practical and regulatory implications.

The Practitioner's Question

Before deploying any agentic capability, ask: What is the blast radius of a wrong action? Can it be reversed? Is there a human-in-the-loop checkpoint? The models are capable; the infrastructure for safe deployment is still being built. In 2025, the constraint is not usually capability — it is reliability and reversibility.
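Those three questions can be turned into a simple gate that sits between an agent's proposed action and its execution. The sketch below is one hypothetical encoding: the 1-to-5 blast-radius scale, the threshold, and the `approve` callback (standing in for a human confirmation step) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    name: str
    reversible: bool
    blast_radius: int  # hypothetical scale: 1 (trivial) to 5 (severe)


def gate(action: ProposedAction,
         approve: Callable[[ProposedAction], bool],
         max_auto_radius: int = 2) -> str:
    """Auto-execute only small, reversible actions; otherwise ask a human."""
    if action.reversible and action.blast_radius <= max_auto_radius:
        return "execute"
    return "execute" if approve(action) else "blocked"


def deny_all(action: ProposedAction) -> bool:
    """Stand-in for a human reviewer who rejects everything."""
    return False


print(gate(ProposedAction("archive_email", reversible=True, blast_radius=1), deny_all))
print(gate(ProposedAction("send_payment", reversible=False, blast_radius=5), deny_all))
```

The useful property is that the default path for anything irreversible or large is a human decision, so a model error alone cannot trigger it.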

Lesson 3 Quiz

1. What distinguishes a "natively multimodal" model like GPT-4o from an earlier "stitched" multimodal system like the initial GPT-4 with vision?

✓ Correct. In natively multimodal architectures, text, image, audio, and video are all part of a unified token space trained together — the model genuinely reasons across modalities rather than converting one to text and then processing.

✗ The key distinction is architectural: natively multimodal models train all modalities in a single unified process, allowing the model to reason across audio, image, and text simultaneously rather than converting each to text separately.

2. What is Anthropic's "minimal footprint" principle for agentic Claude behavior?

✓ Correct. The "minimal footprint" principle described in Anthropic's published guidance means: request only necessary permissions, prefer reversible over irreversible actions, and pause to confirm with humans when uncertain. It is a safety-first agentic design philosophy.

✗ The "minimal footprint" principle, described in Anthropic's published guidance for Claude, means: acquire only necessary permissions, prefer reversible actions over irreversible ones, and check with humans when scope is unclear.

3. As of late 2024, approximately what success rate did the best models achieve on OSWorld — a benchmark of real operating-system task completion?

✓ Correct. The best published OSWorld result as of late 2024 was roughly 15–22% (Claude 3.5 Sonnet with computer use), well ahead of earlier systems but still far below both the human baseline of about 72% and the reliability threshold needed for high-stakes autonomous deployment.

✗ As of late 2024, the best published OSWorld result was roughly 15–22% (Claude 3.5 Sonnet with computer use), against a human baseline of about 72%. Impressive progress, but still well short of the reliability needed for unsupervised real-world use.

Lab 3: Agentic AI in Practice

Explore multimodality, computer use, and agent design through guided conversation.

Practice: Agents, Computer Use, and Multimodality

In this lab you'll work through real scenarios involving agentic AI: deciding when to use agents vs. standard models, thinking through the safety implications of computer use, and understanding what "natively multimodal" means in practice.

Try asking the assistant to walk through a specific agent architecture decision, explain why computer use is considered a "beta" capability despite being technically impressive, or describe the difference between GPT-4o's native voice and earlier speech-to-text pipelines.

Suggested start: "If I wanted to build an agent that monitors my email and drafts replies, what are the minimal capabilities it needs — and what are the biggest safety risks I should design around?"

GPT vs. Claude vs. Gemini · The Frontier Is Moving · Lesson 4

Staying Current: How to Track AI Model Development

The frontier moves every few months. Staying informed is now a professional skill, not a hobby.

In early 2024, a law firm completed a six-month evaluation process and selected GPT-4 Turbo as the basis for its document review pipeline. By the time the integration was live, GPT-4o had launched with lower cost and better performance. By October 2024, Claude 3.5 Sonnet had introduced computer use, and several competing firms were already piloting it. The lawyers had done everything right — and still felt behind. The problem wasn't the decision. It was the assumption that a one-time evaluation was enough.

In a field where major capability jumps arrive every three to six months, staying current is not about reading every paper. It is about building a reliable, efficient information system — knowing which sources matter, how to evaluate new releases critically, and how to think strategically about when to switch versus when to stay.

Primary Sources: Where Real Information Comes From

The most reliable information about model capabilities comes from the labs themselves, but it requires critical reading. openai.com/blog, anthropic.com/news, and deepmind.google/research publish system cards, technical reports, and launch announcements that contain actual benchmark numbers, training details, and capability disclosures. These are primary sources — not curated for hype, but not fully neutral either. A lab's system card for a new model will emphasize benchmark wins and may understate limitations.

For independent verification, Hugging Face (huggingface.co) hosts the Open LLM Leaderboard for open-source models and aggregates community evaluations. The LMSYS Chatbot Arena (chat.lmsys.org) publishes Elo ratings based on tens of thousands of human preference votes in blind A/B comparisons — currently the most watched real-time ranking of frontier models because it reflects actual user preference rather than synthetic test sets. Papers With Code tracks benchmark state-of-the-art across standardized evaluations and links to original research.
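For readers unfamiliar with how pairwise votes become a ranking, the textbook Elo update is sketched below. This is the classic formula only; the Arena's production methodology may differ in its details. The model names, the starting rating of 1000, and the K-factor of 32 are illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Shift both ratings by K times the gap between the actual and expected result."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


ratings = {"model_x": 1000.0, "model_y": 1000.0}
# Each vote is (first model, second model, whether the first was preferred).
votes = [("model_x", "model_y", True), ("model_x", "model_y", True), ("model_x", "model_y", False)]
for a, b, a_won in votes:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], a_won)

for name, rating in ratings.items():
    print(f"{name}: {rating:.1f}")
```

Two properties are worth noticing: an upset against a higher-rated model moves ratings more than an expected win does, and the total rating pool stays constant because each vote adds to one model exactly what it takes from the other.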

For synthesized coverage, several newsletters have earned consistent credibility: The Batch (deeplearning.ai), Import AI (Jack Clark, Anthropic co-founder), and The Neuron publish weekly digests with context. For real-time alerts, following the official accounts of OpenAI, Anthropic, and Google DeepMind on X/Twitter and subscribing to their email lists catches announcements within hours of release.

How to Evaluate a New Model Release

When a lab announces a new model, the first instinct is often to look at the benchmark table. Resist trusting it uncritically. Labs design their own benchmark suites, choose which results to highlight, and sometimes evaluate on data their model may have seen during training — a problem called benchmark contamination. The questions to ask are: Who ran these evaluations — the lab or an independent third party? Are these benchmarks standardized across labs or lab-specific? Has the LMSYS Arena Elo been updated since launch?

Wait 72–96 hours after a major release. By then, practitioners on Reddit's r/LocalLLaMA, Hacker News, and AI Twitter will have run their own real-world tests — coding challenges, document extraction, multi-step reasoning — and posted honest assessments. This community evaluation often surfaces limitations that don't appear in launch materials. When Claude 3.5 Sonnet launched in June 2024, user testing within 48 hours confirmed its coding ability was competitive with GPT-4o; when certain reasoning models launched with impressive benchmarks, community testing quickly identified gaps in real-world instruction following.

For professional contexts, build a small personal benchmark: three to five tasks that are representative of your actual work, with outputs you can judge. Run every new frontier model against it. Your benchmark will be narrow but it will be yours — calibrated to what actually matters for your use case, not a lab's marketing priorities.
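A personal benchmark needs very little machinery: a handful of prompts, a pass/fail check for each, and a function that runs any model against them. The sketch below assumes a model can be wrapped as a plain function from prompt to text. The three tasks and the `stub_model` are hypothetical placeholders for your own tasks and a real API call.

```python
from typing import Callable

Model = Callable[[str], str]
Check = Callable[[str], bool]

# Each task pairs a prompt with a check you can judge automatically.
TASKS: dict[str, tuple[str, Check]] = {
    "extract_year": ("Return only the year in: 'Signed on 3 May 2021.'",
                     lambda out: out.strip() == "2021"),
    "arithmetic": ("What is 17 * 6? Answer with the number only.",
                   lambda out: out.strip() == "102"),
    "format": ("Reply with the single word OK.",
               lambda out: out.strip().upper() == "OK"),
}


def run_benchmark(model: Model, tasks: dict[str, tuple[str, Check]]) -> dict[str, bool]:
    """Run every task through the model and record pass or fail."""
    return {name: check(model(prompt)) for name, (prompt, check) in tasks.items()}


def pass_rate(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results)


def stub_model(prompt: str) -> str:
    """Hypothetical model that ignores the 'only the year' instruction."""
    if "17 * 6" in prompt:
        return "102"
    if "OK" in prompt:
        return "ok"
    return "May 2021"


results = run_benchmark(stub_model, TASKS)
print(results)
print(f"pass rate: {pass_rate(results):.2f}")
```

Keeping the checks automatic is what makes the benchmark cheap enough to re-run on every new release.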

The Acceleration Trend: What "Fast" Actually Means Now

In 2020–2021, major model releases happened roughly annually: GPT-3 in May 2020, Codex in August 2021. By 2023–2024, the cadence had compressed to every three to six months. By 2025, sub-model updates — fine-tunes, context expansions, capability additions — were arriving even faster, sometimes within weeks of a base model launch.

The relevant comparison: the gap between GPT-3 and GPT-4 was roughly three years of development. The gap between GPT-4o and o1 was four months. The gap between Claude 3 Opus and Claude 3.5 Sonnet — which outperformed it on most benchmarks at a fraction of the cost — was three months. This compression has practical consequences: evaluation cycles that took six months are now longer than the model generation they're evaluating. Organizations that build on fixed model versions without upgrade plans are operating on an increasingly stale foundation.

The underlying driver is not just compute scaling — it is the maturation of the research pipeline. Labs now have established processes for data curation, RLHF training, and safety evaluation that were artisanal in 2021 and are now industrial. Each iteration refines the process, which speeds the next iteration. There is no strong signal that this compression will reverse in the near term.

Strategic Advice for Professionals

Maintaining AI fluency in a fast-moving field requires a system, not willpower. Dedicate a fixed time slot — 30 minutes weekly — to reading one primary source and one community digest. This is enough to catch major developments without drowning in noise. Follow the Elo leaderboard monthly rather than weekly; Chatbot Arena rankings stabilize over time and chasing daily fluctuations creates more confusion than clarity.

Develop a mental model of each lab's strategic priorities rather than just tracking individual models. OpenAI has consistently prioritized multimodality and consumer reach. Anthropic has emphasized safety, long-context, and coding. Google has prioritized context length, multimodality, and search integration. Understanding these priorities helps you anticipate what each lab's next release will emphasize — and whether it is likely to be relevant to your work.

Finally, accept that some uncertainty is structural. You cannot know today which model will be best in six months. The professional response is to build workflows that are model-agnostic where possible — using abstraction layers, prompt design that doesn't depend on specific model quirks, and evaluation frameworks that can be re-run quickly. The goal is not to always be on the best model. The goal is to never be so locked into a worse one that switching costs are prohibitive.
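One way to read "abstraction layer" is as a thin interface that workflow code depends on instead of any vendor's SDK. The sketch below is a minimal illustration: the `TextModel` interface, the `StubBackend` class, and the vendor labels are all hypothetical, and a real backend would wrap an actual API call inside `complete`.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface workflow code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...


class StubBackend:
    """Hypothetical backend; a real one would call a vendor SDK here."""
    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        # Echo the last prompt line so the example runs without network access.
        return f"[{self.label}] {prompt.splitlines()[-1][:40]}"


BACKENDS: dict[str, TextModel] = {
    "vendor_a": StubBackend("vendor_a"),
    "vendor_b": StubBackend("vendor_b"),
}


def summarize(document: str, model: TextModel) -> str:
    """Workflow logic: knows the task, not the vendor."""
    return model.complete(f"Summarize in one sentence:\n{document}")


ACTIVE = "vendor_a"  # switching vendors is a one-line configuration change
print(summarize("Quarterly revenue rose 4%.", BACKENDS[ACTIVE]))
```

Combined with a re-runnable evaluation set, this keeps the cost of switching models close to the cost of changing one configuration value and re-running the checks.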

Key Sources Summary

Primary: openai.com/blog · anthropic.com/news · deepmind.google/research. Independent rankings: LMSYS Chatbot Arena · Hugging Face Leaderboard · Papers With Code. Community evaluation: Hacker News · r/LocalLLaMA · AI Twitter within 72–96 hours of launch. Newsletters: The Batch · Import AI · The Neuron.

The Core Skill

In 2025, AI fluency is not about knowing which model is best today. It is about having a reliable system for learning which model is best next month — and a workflow flexible enough to act on that information without starting over.

Lesson 4 Quiz

1. Why should you wait 72–96 hours after a major model release before drawing conclusions about its real-world capabilities?

✓ Correct. Within 72–96 hours, practitioners on Hacker News, r/LocalLLaMA, and AI Twitter run their own coding, reasoning, and extraction tests — and publish honest findings that complement (and sometimes contradict) the lab's launch materials.

✗ The key reason is community evaluation: within a few days, practitioners run real-world tests and publish honest assessments that surface limitations the lab's own benchmarks may not emphasize. This is consistently more useful than launch-day coverage.

2. Which ranking system is most useful for comparing frontier models based on actual human preference rather than synthetic benchmark scores?

✓ Correct. LMSYS Chatbot Arena uses blind A/B comparisons voted on by real users, tens of thousands of preference judgments in all, which makes its Elo ratings the most-watched real-time measure of actual user preference among frontier models.

✗ The LMSYS Chatbot Arena uses blind A/B comparisons judged by real users, generating Elo ratings based on actual preference — not synthetic test sets. It is the most-watched real-time ranking for this reason.

3. What does it mean to build "model-agnostic" workflows, and why is it strategically important given the current pace of AI development?

✓ Correct. Model-agnostic design means using abstraction layers and prompt patterns that don't rely on a specific model's idiosyncrasies — so when a better model ships (every 3–6 months), you can upgrade without rebuilding from scratch.

✗ Model-agnostic workflows use prompt designs and evaluation frameworks that don't depend on one model's quirks, keeping switching costs low. Given major releases every 3–6 months, this is the practical insurance against being locked into a rapidly dated system.

Lab 4: Building Your AI Tracking System

Practice evaluating new model releases and designing a personal information workflow.

Practice: Staying Current on the Frontier

In this lab you'll work through the practical challenge of staying informed about AI model development. Ask the assistant to help you design a personal tracking system, evaluate a hypothetical new model announcement, or think through whether a specific model switch would be worth the transition cost for your work.

You can also ask it to walk through what sources to trust for a specific claim, or simulate what community evaluation of a new release might look like 72 hours after launch.

Suggested start: "A new model just launched claiming to beat GPT-4o on every benchmark. Walk me through the questions I should ask before recommending my team switch to it."

Module 8 Test

15 questions covering all four lessons — free, untracked, retake anytime.

1. In what month and year did ChatGPT launch publicly, and how quickly did it reach one million users?

✓ Correct. ChatGPT launched in November 2022 and reached one million users in approximately five days — then 100 million users in roughly two months, the fastest consumer-app adoption ever recorded at that time.

✗ ChatGPT launched in November 2022 and hit one million users in about five days. It then reached 100 million users in approximately two months, setting a record for consumer app growth.

2. GPT-4 was released in March 2023. What was the defining new capability of GPT-4o when it launched in May 2024?

✓ Correct. GPT-4o was OpenAI's first natively multimodal model: audio, image, and text shared a single token space and training process, enabling real-time voice with natural prosody and simultaneous cross-modal reasoning.

✗ GPT-4o's defining capability was native multimodality — all modalities (audio, image, text) trained together in one model rather than stitched via separate pipelines. This enabled real-time voice and genuine cross-modal reasoning.

3. OpenAI's o1 model launched in September 2024. What score did it achieve on the AIME 2024 mathematics competition, compared to GPT-4o's 13%?

✓ Correct. OpenAI's o1 scored 83% on AIME 2024 versus GPT-4o's 13%, a 70-point gap that demonstrated reasoning models occupy a qualitatively different capability tier for multi-step mathematical problems.

✗ OpenAI's o1 scored 83% on AIME 2024 while GPT-4o scored 13%. That 70-point gap shocked the AI research community and immediately accelerated Anthropic's and Google's own reasoning-model programs.

4. The Claude 3 family launched in March 2024. Which Claude release in June 2024 surprised many observers by outperforming GPT-4o on coding benchmarks at lower cost?

✓ Correct. Claude 3.5 Sonnet launched in June 2024 and scored above GPT-4o on several independent coding evaluations — at a lower price point than Claude 3 Opus, which had previously been Anthropic's flagship model.

✗ Claude 3.5 Sonnet, launched June 2024, outperformed GPT-4o on multiple independent coding benchmarks and cost less than Claude 3 Opus. It became Anthropic's de facto flagship within weeks of launch.

5. Gemini 1.5 Pro launched in February 2024 with a capability that represented a genuine category expansion. What was it?

✓ Correct. Gemini 1.5 Pro's one-million-token context window was not an incremental update — it enabled qualitatively new use cases like uploading an entire large codebase or a feature-length film for analysis in a single prompt.

✗ Gemini 1.5 Pro's defining innovation was its one-million-token context window, enabling analysis of entire codebases or feature-length videos in a single prompt — a qualitative capability leap, not just a bigger number.

6. What is the primary trade-off when using a reasoning model like o1 instead of a standard model like GPT-4o?

✓ Correct. At launch, o1 was priced at $15 per million input tokens versus $5 for GPT-4o, and responses can take seconds to minutes. The trade-off is deliberate: more compute at inference time yields dramatically better results on hard reasoning tasks.

✗ The core trade-off is latency and cost versus reasoning depth. At launch, o1 cost 3× GPT-4o's input price and could take minutes on hard problems, but it scores dramatically higher on math, logic, and coding tasks where step-by-step reasoning matters.

7. How does Anthropic's implementation of extended thinking in Claude differ from OpenAI's o1 in a key design choice visible to users?

✓ Correct. Anthropic made a deliberate choice to show Claude's reasoning chain to users, arguing visible reasoning improves trust and allows verification. O1's internal scratchpad is hidden — users see only the final answer.

✗ The key design difference: Anthropic shows Claude's reasoning chain to users, arguing transparency improves trust. OpenAI hides o1's internal scratchpad — you see only the final answer, not the steps that produced it.

8. Claude 3.5 Sonnet gained "computer use" capability in October 2024. What does this mean in practice?

✓ Correct. Computer use means Claude receives a screenshot, identifies UI elements, and outputs specific actions (move cursor here, click this button, type this text) that are then executed on a real computer — not just generating code for a human to run.

✗ Computer use means the model sees a screenshot and outputs actual cursor/keyboard actions that execute on a real computer — it is controlling the machine directly, not generating code for the user to run. Anthropic launched it as a beta with explicit safety warnings.

9. What is an AI agent, and what minimal architecture does it require beyond a base language model?

✓ Correct. An agent requires: a planning-capable model, a set of callable tools, and a loop that feeds tool outputs back into context. This architecture moves the system from single-prompt response to multi-step goal completion.

✗ An agent needs a model capable of planning, a toolset (web search, code execution, file read/write), and a feedback loop that returns tool outputs to the model's context. This enables multi-step task completion rather than one-shot response.

10. How did the typical cadence between major AI model releases change from 2020–2021 compared to 2023–2024?

✓ Correct. GPT-3 to GPT-4 took roughly three years. By 2023–2024, each lab was shipping major capability jumps every three to six months, with sub-model updates arriving even faster. The evaluation cycles that once matched release cadences are now longer than the model generations they evaluate.

✗ The cadence compressed dramatically: from roughly annual releases in 2020–2021 (GPT-3 to Codex to GPT-4 over three years) to major capability jumps every three to six months by 2023–2024, with multiple named model variants from each lab per year.

11. What is Anthropic's "Responsible Scaling Policy" (RSP)?

✓ Correct. Anthropic's RSP commits to evaluating each model for dangerous capabilities (CBRN uplift, autonomous replication, etc.) before deployment. If a model crosses defined thresholds, additional safety measures or deployment restrictions are triggered before release.

✗ The Responsible Scaling Policy is Anthropic's commitment to evaluate each new model for dangerous capabilities before release. Defined thresholds trigger escalating safety requirements — it's a pre-deployment safety gate, not a pricing or access policy.

12. OpenAI has a published safety commitment analogous to Anthropic's RSP. What is it called?

✓ Correct. OpenAI's Preparedness Framework commits to evaluating models for catastrophic risks — CBRN, cybersecurity, persuasion, and autonomous replication — before deployment, with a "scorecard" system that can restrict releases that reach critical risk thresholds.

✗ OpenAI's analogous commitment is the Preparedness Framework, which defines a scorecard for evaluating catastrophic risk categories before model deployment. It mirrors Anthropic's RSP in intent: safety evaluation gates deployment, not just release timing.

13. Which benchmark is most commonly cited as the best real-time indicator of which frontier model users actually prefer, because it uses blind human comparisons rather than fixed test sets?

✓ Correct. LMSYS Chatbot Arena generates Elo ratings from tens of thousands of blind A/B preference votes by real users — making it the most-watched real-time ranking of frontier models, because it reflects actual user preference rather than performance on synthetic benchmarks.

✗ LMSYS Chatbot Arena is the most-watched real-time ranking because it uses blind A/B comparisons judged by real users, producing Elo ratings that reflect actual preference. MMLU, HumanEval, and GPQA are fixed test sets: useful, but gameable, and not measures of user preference.

14. Where should you go first to read an official, factual account of a newly released model's capabilities, training approach, and known limitations?

✓ Correct. Lab blogs and system cards are primary sources — they contain actual benchmark numbers, training methodology, and capability disclosures. They require critical reading (labs emphasize wins), but they are the only place that contains the full technical detail at launch.

✗ The lab's own blog and system card are primary sources: openai.com/blog, anthropic.com/news, deepmind.google/research. They require critical reading since labs emphasize favorable results, but they contain actual benchmark data and technical disclosure that journalism pieces summarize and sometimes misrepresent.

15. What does "benchmark contamination" mean in the context of evaluating a new AI model release?

✓ Correct. Benchmark contamination occurs when training data includes examples from the benchmark, so the model effectively "memorizes" answers rather than demonstrating genuine capability. This is why independent benchmarks with contamination controls — like LiveCodeBench — are increasingly important for trustworthy evaluation.

✗ Benchmark contamination means a model saw the benchmark's questions or answers during training, inflating its score beyond its genuine capability. It's a key reason to look for independent evaluations with contamination controls (like LiveCodeBench) rather than trusting lab-reported scores on widely published test sets.
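To make the contamination idea concrete, here is a rough sketch of one simple screening heuristic: measuring how many word-level n-grams from a benchmark item also appear in a candidate training document. This is an assumption-laden illustration, not how any particular lab or benchmark audits its data; the function names, the whitespace tokenization, and the default n of 5 are all choices made for this example.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the benchmark item's n-grams that also occur in corpus_doc.

    A value near 1.0 suggests the item may appear verbatim in the document.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


item = "write a function that returns the sum of two integers"
leaked_doc = "Forum post: write a function that returns the sum of two integers please"
clean_doc = "The weather in spring is mild and the days grow longer each week"

leaked = overlap_ratio(item, leaked_doc)
clean = overlap_ratio(item, clean_doc)
```

A high ratio only flags an item for closer inspection. Paraphrased or translated leaks would slip past a check this simple, which is part of why benchmarks built from freshly collected problems are useful.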