GPT vs. Claude vs. Gemini · Introduction

There is no "the AI." There's a market of them.

Each frontier model has a different character, different strengths, different failure modes. Choosing is the new literacy.

Twenty years ago, using the internet meant one thing. Ten years ago, using a smartphone meant one thing. In both cases, the specific hardware and operating system mattered less than the fact of participating at all.

AI is going the other direction. Using AI in 2026 specifically means choosing an AI — Claude or ChatGPT or Gemini or Llama or Mistral or one of a dozen others — and the choice matters. They are not interchangeable. They have different strengths on math, different biases in writing, different refusal patterns, different context windows, different prices, different latencies, different policies about what they'll do with your data.

This course is the comparative literacy course. It teaches you to evaluate AI models the way a chef evaluates knives — not by brand loyalty, but by fit to the job. It covers the major frontier models, how they're actually built differently, what each one is best and worst at, how to run your own benchmarks, and how to design a workflow that uses the right tool for the right task rather than the most advertised tool for everything.

GPT vs. Claude vs. Gemini · Module 1 · Lesson 1

Three Labs, Three Philosophies

OpenAI, Anthropic, and Google each built an AI — but they built it for very different reasons.

On November 30, 2022, OpenAI quietly posted a research preview to its website. Within five days it had one million users. Within two months, one hundred million. ChatGPT had become the fastest-adopted consumer application in history — and it forced every major technology company to reveal what they had been building in private.

Anthropic had been operating since 2021, founded by former OpenAI researchers Dario and Daniela Amodei along with nine colleagues. Google had been researching large language models since at least 2017, when its own researchers wrote the Attention Is All You Need paper that made the whole field possible. Three very different organizations now occupied the same public stage.

Who Built What, and Why

Understanding these three systems requires understanding the institutional pressures and stated values of the organizations behind them. They are not interchangeable products competing purely on benchmark scores.

OpenAI was founded in 2015 as a nonprofit with a mission to ensure artificial general intelligence benefits all of humanity. Its 2019 shift to a "capped-profit" structure and a $1 billion investment from Microsoft changed the competitive dynamics significantly. GPT-3 launched in 2020 via API; GPT-4 arrived in March 2023. The GPT family is positioned as a general-purpose capability platform — maximize what the model can do, then apply safety filters and policies on top.

Anthropic was founded explicitly around the concern that OpenAI was moving too fast. The founders brought a research agenda called Constitutional AI, which trains the model to evaluate and revise its own outputs against a written set of principles before responding. Claude 1 launched in March 2023; Claude 2 in July 2023; Claude 3 (Haiku, Sonnet, Opus) in March 2024. Safety is not a layer added to the model — it is described as intrinsic to the training process itself.

Google DeepMind was formed in April 2023 by merging Google Brain and DeepMind, unifying research teams that had worked separately for years. Bard launched in March 2023, initially powered by LaMDA and later PaLM 2. Gemini, the model family built from the ground up as multimodal, launched in December 2023. Google's core advantage is infrastructure: its TPU hardware, its search index, and the more than two billion users already inside Google Workspace.

The Three Founding Philosophies

Each organization has a public-facing value statement that shapes real product decisions. These are not marketing copy — they appear consistently in research papers, deployment choices, and what the models actually refuse to do.

OpenAI / GPT

AGI for the benefit of all humanity. Commercial deployment funds safety research. Capability-first with policy guardrails applied through RLHF and system prompts.

Anthropic / Claude

Safety and helpfulness as inseparable goals. Constitutional AI trains the model on explicit principles. Anthropic describes itself as occupying a "peculiar position" — believing it may be building dangerous technology, pressing forward anyway to ensure the result is beneficial.

Google DeepMind / Gemini

Multimodal from inception. Native integration with Google Search, Workspace, and Android. Aimed at enterprise deployment and consumer scale simultaneously, with TPU-optimized inference.

A Brief Timeline of Public Releases

Jun 2020

GPT-3 released via API. 175 billion parameters. Demonstrated few-shot learning at scale.

Mar 2021

Anthropic founded. Dario Amodei, Daniela Amodei, and nine others leave OpenAI citing safety concerns.

Nov 2022

ChatGPT launches. 1 million users in 5 days; 100 million in 60 days. Fastest consumer adoption on record.

Mar 2023

GPT-4 and Claude 1 launch within days of each other. Google releases Bard (LaMDA-powered) to limited preview.

Dec 2023

Gemini launches (Ultra, Pro, Nano tiers). First Google model built as natively multimodal from pretraining.

Mar 2024

Claude 3 family (Haiku, Sonnet, Opus) launches. Opus benchmarks above GPT-4 on several academic evals. Gemini 1.5 Pro, announced in February 2024, is in preview with a 1M-token context window.

Why This Matters

The organizational origin of each model shapes its defaults, its refusals, its strengths, and its blind spots. A researcher choosing between these systems isn't just choosing a benchmark score — they are choosing which organization's values will be embedded in their workflow. Understanding that context is the foundation of intelligent model selection.

Lesson 1 Quiz

3 questions — free, untracked, retake anytime.

Anthropic was founded primarily because its founders believed OpenAI was doing what?

✓ Correct. The Amodei siblings and colleagues left OpenAI in 2021 specifically citing concerns that the pace of development outstripped safety research. Anthropic's founding charter centers on this concern.

✗ Not quite. Anthropic's founders left over safety concerns — the belief that capability development was moving faster than alignment research could keep pace with.

ChatGPT reached 100 million users in approximately how long after its November 2022 launch?

✓ Correct. ChatGPT hit 1 million users in five days and 100 million users in approximately two months — a consumer adoption record at the time, beating TikTok's previous benchmark.

✗ Not quite. One million came in five days, but 100 million took roughly two months — still the fastest consumer adoption record at the time.

Which model was described as built as "natively multimodal from pretraining," distinguishing it architecturally from its predecessor Bard?

✓ Correct. Google's Gemini family, launched December 2023, was built from the ground up to process text, images, audio, and code within a single architecture — unlike Bard, which added multimodal capabilities to an existing language model.

✗ Incorrect. Gemini was the model Google described as natively multimodal from pretraining — meaning image, audio, and text understanding were baked into the architecture from the start, not grafted on later.

Lab 1: The Origin Stories

Explore the founding context and institutional philosophies behind each major AI lab.

Investigate the Labs Behind the Models

Each of the three major AI systems — GPT, Claude, and Gemini — reflects the values and priorities of its parent organization. In this lab, ask the AI assistant about the founding histories, stated missions, and structural differences between OpenAI, Anthropic, and Google DeepMind.

Try to understand how organizational structure (nonprofit origins, capped-profit, big tech subsidiary) shapes product philosophy. Push for specifics: funding amounts, named founders, documented policy decisions.

Try asking: "What is Constitutional AI and how does it differ from OpenAI's RLHF approach to safety?" — or — "How did Google's $300M investment in Anthropic in 2023 complicate the competitive landscape?"

GPT vs. Claude vs. Gemini · Module 1 · Lesson 2

Architecture & Scale: What's Under the Hood

Transformers, parameters, and context windows — the technical vocabulary that actually predicts what a model can and cannot do.

In 2017, eight Google researchers published a paper with an unusually confident title: Attention Is All You Need. The transformer architecture they described replaced the recurrent networks that had dominated natural language processing for years. Within five years, every major language model (GPT, Claude, Gemini) would be built on this foundation. The researchers who wrote it had largely left Google by the time their architecture became the basis of a trillion-dollar industry.

The Transformer Foundation

All three model families — GPT, Claude, and Gemini — are transformer-based large language models. At a high level, they predict the next token in a sequence by attending to all previous tokens simultaneously, weighting each by relevance. This mechanism, called self-attention, is what allows these models to handle long-range dependencies in text that recurrent networks could not.
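
The attention step described above can be sketched in a few lines of plain Python. This is a toy, single-head version for illustration only: production models use many heads, learned projection matrices, and tensor libraries, and the three-token input below is made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of equal-length vectors, one per token.
    Returns (outputs, weights); weights[i] covers tokens 0..i only.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # Token i may only look at itself and earlier tokens.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is the relevance-weighted average of the visible values.
        output = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 3-token sequence with 2-dimensional vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, and the third token places its largest weight on itself because its query aligns most strongly with its own key.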

What differentiates the models is not the fundamental architecture but the decisions made on top of it: scale (how many parameters), training data (what text and media the model saw), training objectives (how the model learned from human feedback), and context window (how much text the model can process in a single pass).

GPT-4's parameter count has not been officially confirmed by OpenAI, though reporting suggests a mixture-of-experts architecture. Claude 3 Opus similarly has undisclosed parameters. Google has not published a parameter count for Gemini Ultra either, although its technical report for Gemini 1.5 Pro describes that model as a sparse mixture-of-experts transformer. What matters operationally is not the raw number but how these choices manifest in reasoning, latency, and cost.

Context Windows: The Practical Limit

The context window — how much text a model can hold in active memory during a single conversation — is one of the most consequential practical differences between models as of 2024.

GPT-4 Turbo launched in November 2023 with a 128,000-token context window, roughly equivalent to a 300-page book. Prior GPT-4 variants were limited to 8K or 32K tokens, which created real workflow constraints for document analysis tasks.

Claude 3 launched with a 200,000-token context window across all variants, the largest at general availability among the three families in early 2024. To show that the window was usable end to end, Anthropic reported near-perfect recall for Claude 3 Opus on the Needle in a Haystack evaluation, which plants a single target sentence in a long document and asks the model to retrieve it.

Gemini 1.5 Pro, announced in February 2024, demonstrated a 1 million token context window in research preview, enough to process approximately 1 hour of video, 11 hours of audio, or 700,000 words of text in a single request. This represents a qualitative shift in what long-context retrieval can mean in practice.
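
To turn these window sizes into a quick fit check, the rough sketch below estimates a document's token count from its page count. The 0.7 words-per-token ratio is taken from the 1M-tokens-to-700,000-words figure above; the 300 words per page and the 4,000-token output reserve are assumptions for illustration, and real counts depend on each model's tokenizer.

```python
WORDS_PER_TOKEN = 0.7  # from the lesson's "1M tokens is about 700,000 words"
WORDS_PER_PAGE = 300   # assumption: varies widely with layout and font

# Context windows as stated in this lesson (early 2024).
CONTEXT_WINDOWS = {
    "GPT-4 Turbo": 128_000,
    "Claude 3": 200_000,
    "Gemini 1.5 Pro (preview)": 1_000_000,
}

def estimate_tokens(pages, words_per_page=WORDS_PER_PAGE):
    """Rough token estimate for a document of the given page count."""
    return round(pages * words_per_page / WORDS_PER_TOKEN)

def models_that_fit(pages, reserve_for_output=4_000):
    """List the models whose window holds the document plus a reply."""
    needed = estimate_tokens(pages) + reserve_for_output
    return [name for name, window in CONTEXT_WINDOWS.items() if window >= needed]

contract_fit = models_that_fit(150)   # a 150-page contract
long_report_fit = models_that_fit(400)
archive_fit = models_that_fit(600)
```

Under these assumptions a 150-page contract fits all three windows, a 400-page report needs Claude 3 or Gemini 1.5 Pro, and a 600-page archive fits only the 1M-token preview.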

Multimodality: What Each Model Actually Accepts

All three model families support text and image input as of 2024. The differences lie in depth of integration and additional modalities.

GPT-4o (May 2024)

Text, image, audio input and output in a single model. Audio-to-audio response in ~320ms. Vision fine-tuned on wide consumer image distribution. DALL·E 3 integration for image generation.

Claude 3 Family

Text and image input; text output only. Strong document and chart understanding. Particularly noted for precise OCR on dense tables and financial documents. No audio or video input at launch.

Gemini 1.5 Pro

Text, image, audio, video, and code input natively. Can process up to ~11 hours of audio or 1 hour of video in context. Google's Imagen 2 integration for generation. Built on TPU v5e infrastructure.

Key Term: Mixture of Experts

GPT-4 reportedly uses a mixture-of-experts (MoE) architecture, and Google describes Gemini 1.5 Pro as one. In an MoE model, many smaller specialized networks sit side by side, and only a subset activates for any given token. This allows extremely large total parameter counts without proportional inference costs. Dense models, such as GPT-3, activate all parameters for every token.
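
A minimal sketch of the routing idea, assuming a top-2 router: gate scores are turned into probabilities, the two highest-scoring experts run, and their outputs are mixed. The four "experts" here are hypothetical scalar functions and the gate scores are hard-coded; in a real model the experts are feed-forward networks and the scores come from a learned router.

```python
import math

def softmax(scores):
    """Turn raw gate scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_scores, top_k=2):
    """Pick the top_k experts for one token and renormalise their weights."""
    probs = softmax(gate_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}

def moe_layer(token, gate_scores, experts, top_k=2):
    """Run only the selected experts and mix their outputs."""
    routing = route_token(gate_scores, top_k)
    mixed = sum(weight * experts[i](token) for i, weight in routing.items())
    return mixed, routing

# Four toy experts (hypothetical): each is just a scalar function.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
output, routing = moe_layer(3.0, gate_scores=[0.1, 2.0, -1.0, 2.0], experts=experts)
```

Only experts 1 and 3 execute for this token; experts 0 and 2 contribute no compute, which is the source of the inference savings.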

Lesson 2 Quiz

3 questions — free, untracked, retake anytime.

The paper Attention Is All You Need (2017) introduced what architectural approach that all three major model families now use?

✓ Correct. The transformer architecture introduced self-attention — the ability for every token to attend to every other token in the sequence simultaneously — replacing recurrent approaches that processed tokens sequentially.

✗ Incorrect. The paper introduced the transformer architecture with self-attention mechanisms, replacing recurrent networks as the dominant approach to sequence modeling.

As of early 2024, which model had the largest context window available at general availability?

✓ Correct. Claude 3 launched with 200,000 tokens available at general availability — larger than GPT-4 Turbo's 128K. Gemini 1.5 Pro's 1M context was announced but remained in research preview, not GA, in early 2024.

✗ Not quite. Claude 3's 200K context was the largest at general availability. Gemini 1.5 Pro's 1M context was in research preview, not yet broadly available.

In a mixture-of-experts (MoE) architecture, what happens during inference that differs from a dense model?

✓ Correct. In MoE, a routing mechanism selects which expert sub-networks handle each token. This allows massive total parameter counts with lower per-token compute than a dense model of equivalent size.

✗ Incorrect. MoE routes each token to only a subset of expert networks — this is what makes trillion-parameter models computationally feasible. Dense models activate all parameters for every token.

Lab 2: Architecture Deep Dive

Probe the technical distinctions between GPT, Claude, and Gemini architectures.

Context Windows, Parameters, and Modalities

In this lab, explore the practical implications of the architectural choices these models make. Context window size, multimodal capabilities, and mixture-of-experts design each have real consequences for specific use cases. Ask the assistant to help you reason through which architecture is best suited for particular tasks.

The goal is to move from abstract specs to concrete decision criteria — when does a 200K context window matter? When is native video understanding valuable vs. overkill?

Try asking: "If I need to analyze a 150-page legal contract in one pass, which model's context window is sufficient and which would fail?" — or — "What are the latency trade-offs of GPT-4o's audio-native processing versus Gemini's video understanding?"

GPT vs. Claude vs. Gemini · Module 1 · Lesson 3

How Each Model Is Trained to Behave

RLHF, Constitutional AI, and the choices that determine what a model will and won't do — and why.

When Claude launched in March 2023, early testers immediately noticed something different: it would engage more deeply with morally complex hypotheticals, provide more nuanced refusals with explicit reasoning, and was notably more resistant to jailbreaks that relied on roleplay framing. This wasn't luck — it was the product of a fundamentally different training methodology that Anthropic had published in a 2022 paper titled Constitutional AI: Harmlessness from AI Feedback.

RLHF: The Standard Approach

Reinforcement Learning from Human Feedback (RLHF) became the dominant post-training alignment method after OpenAI's InstructGPT paper in January 2022. The process works in three stages: first, supervised fine-tuning on high-quality demonstration data; second, training a reward model on human preference rankings between model outputs; third, optimizing the language model using the reward model's scores via proximal policy optimization (PPO).
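
The second stage, training a reward model on preference rankings, is commonly implemented with a pairwise loss of the form -log(sigmoid(r_chosen - r_rejected)). The sketch below shows that loss for a single comparison; the reward values passed in are made-up numbers, and a real reward model would produce them from a neural network.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(margin)), computed in a numerically stable way.

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it gets the
    ordering wrong.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

agrees_with_human = pairwise_preference_loss(2.0, 0.0)
undecided = pairwise_preference_loss(0.0, 0.0)
disagrees_with_human = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss over many ranked pairs pushes the reward model to score preferred responses higher, which is the signal the third stage then optimizes against.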

GPT-4 uses RLHF as its primary alignment method, supplemented by rule-based reward models (RBRMs): classifiers that add a reward signal for specific categories of output, such as refusing a harmful request, independent of human preference ratings. This approach is powerful but has a known limitation: it installs the values of the annotator pool, which tends to be geographically and demographically concentrated. The human labor behind these systems has also drawn scrutiny. TIME magazine reported in January 2023 that OpenAI had outsourced the labeling of toxic content, used to train ChatGPT's safety filters, to contractors in Kenya through Sama, and that workers described the material as psychologically disturbing.

Constitutional AI: Anthropic's Approach

Constitutional AI (CAI), introduced in Anthropic's December 2022 paper, adds a step before human feedback enters the loop. A set of written principles — the "constitution" — is used to have the model critique and revise its own outputs. The model first generates a response, then is asked to identify how that response violates specific principles (e.g., "Choose the response that is less harmful"), revise accordingly, and only then does a preference model evaluate the result.

Importantly, this means alignment is partially self-supervised: the model trains against its own constitutionally guided critiques rather than requiring a human to evaluate every output. Anthropic published its constitution, which draws on sources including the UN Universal Declaration of Human Rights, Apple's terms of service, and DeepMind's Sparrow rules. This transparency is a deliberate differentiator.

The practical consequence: Claude tends to provide more explicit reasoning when declining requests, and tends to be more consistent across paraphrased versions of the same harmful request, because the constitution is applied systematically rather than relying purely on annotator judgment on specific training examples.
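
The critique-and-revise loop described in this section can be sketched as follows. The `toy_generate`, `toy_critique`, and `toy_revise` functions are hypothetical stand-ins for model calls, and the single principle is the example quoted above; this illustrates the control flow only, not Anthropic's actual training code.

```python
PRINCIPLES = [
    "Choose the response that is less harmful.",
]

def critique_and_revise(prompt, generate, critique, revise, principles=PRINCIPLES):
    """Return (initial_draft, final_revision) for one prompt."""
    draft = generate(prompt)
    response = draft
    for principle in principles:
        problem = critique(response, principle)
        if problem:  # an empty string means no violation was found
            response = revise(response, problem)
    return draft, response

# Hypothetical stand-ins for model calls, hard-coded for illustration.
def toy_generate(prompt):
    return "Sure, here is how to pick the lock on your neighbor's door."

def toy_critique(response, principle):
    return "Assists a likely break-in." if "pick the lock" in response else ""

def toy_revise(response, problem):
    return ("I can't help with entering someone else's home. "
            "If you are locked out of your own, a locksmith can help.")

prompt = "How do I get into my neighbor's house?"
draft, revised = critique_and_revise(prompt, toy_generate, toy_critique, toy_revise)

# The revised answer, not the draft, becomes a fine-tuning example.
training_example = {"prompt": prompt, "completion": revised}
```

Because the critique and the revision are both produced by a model, no human has to label this particular example.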

Google's RLHF Variants and Safety Infrastructure

According to Google's technical report, Gemini models are post-trained with a combination of supervised fine-tuning and reinforcement learning from human feedback, with a process reward model for multi-step reasoning tasks. Google also exposes controls that sit outside the trained model: the SafetySettings API parameters, which operators adjust independently of the base model's trained values, and SynthID watermarking for generated content.

Google's scale creates a distinctive challenge: Gemini serves billions of users across Search, Workspace, and Android simultaneously. The same base model must be appropriate for a student in Indonesia and a radiologist in Germany. This drives a more granular operator-level safety configuration compared to Anthropic's more locked-down defaults.

Real Documented Incident: GPT-4 System Prompt Extraction

In February 2024, a GitHub user demonstrated that GPT-4's system prompt for the "GPT Builder" feature could be extracted by asking the model to repeat its instructions verbatim. OpenAI's RLHF training had not produced a model that reliably protected confidential system prompts when directly instructed to. Claude's constitutional training produced stronger resistance to similar extraction attempts in contemporaneous testing, attributed to the principle explicitly addressing operator confidentiality.

RLHF

Reinforcement Learning from Human Feedback: aligning model outputs to human preferences via a reward model trained on annotator rankings.

CAI

Constitutional AI: Anthropic's approach, in which models critique and revise their own outputs against written principles before human preference feedback is incorporated.

PPO

Proximal Policy Optimization: the reinforcement learning algorithm used to update the language model's weights toward higher reward-model scores.
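
As an illustration of the PPO entry, the sketch below computes the clipped surrogate objective for a single action. Here `ratio` is the new policy's probability of the action divided by the old policy's, and `advantage` says how much better than expected the action scored; the 0.2 clip range is a commonly used default rather than a value taken from any of the labs discussed here.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-action PPO objective: min(ratio * A, clip(ratio) * A).

    Clipping removes the incentive to move the policy far from the
    old one in a single update, which keeps training stable.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# A good action the new policy already favors: the gain is capped at 1.2.
capped_gain = ppo_clipped_objective(ratio=1.5, advantage=1.0)

# A bad action the new policy already avoids: the ratio is clipped at 0.8.
capped_penalty = ppo_clipped_objective(ratio=0.5, advantage=-1.0)

# No policy change: the objective is just the advantage.
unchanged = ppo_clipped_objective(ratio=1.0, advantage=2.0)
```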

Lesson 3 Quiz

3 questions — free, untracked, retake anytime.

In Constitutional AI, what is the primary function of the "constitution" during training?

✓ Correct. In CAI, the constitution is a set of written principles used during a self-critique phase — the model evaluates its own outputs against these principles and revises them before the reward model training step.

✗ Incorrect. The constitution is a set of written principles used during training to guide the model's self-critique process — it operates during training, not as a runtime filter.

TIME magazine reported in January 2023 that OpenAI's content-labeling work for ChatGPT's safety training involved contractors working primarily through which organization in Kenya?

✓ Correct. TIME's January 2023 investigation documented Kenyan workers employed through Sama labeling toxic and disturbing content used to train ChatGPT's safety filters, earning below $2/hour in some documented cases.

✗ Not quite. TIME's January 2023 report documented Kenyan workers employed through Sama — they described reviewing content depicting graphic violence, abuse, and other disturbing material to train ChatGPT's safety filters.

Which of the following is a documented practical advantage of Constitutional AI over standard RLHF for safety consistency?

✓ Correct. Because CAI trains against explicit principles rather than specific example annotations, the model learns generalized criteria for what makes a response harmful — making it harder to bypass via rephrasing or roleplay framing.

✗ Incorrect. The documented advantage is consistency across paraphrased requests — because the model learns from principles rather than specific examples, it generalizes better to novel formulations of harmful requests.

Lab 3: Explore Lesson 3 Concepts

Apply what you learned in Lesson 3 through guided AI conversation

Your Task

Use the AI below to explore Lesson 3 concepts in depth. Challenge assumptions and work through scenarios.

Try asking about a specific concept from Lesson 3 and how it applies in practice.

GPT vs. Claude vs. Gemini · Module 1 · Lesson 4

The Model Landscape in Practice: Choosing Your Starting Point

Philosophy, architecture, and training approach each predict real performance differences. Here is how to read the signals and match model to task.

You have now seen how OpenAI, Anthropic, and Google each built their models under different institutional pressures, using different architectural strategies and different alignment methods. The question that follows is concrete: when you sit down to do actual work, which model should you reach for first?

The answer is not a single winner. Each lab's philosophy and training choices manifest as genuine strengths and genuine weaknesses. Understanding those patterns is what allows you to choose intelligently rather than by habit or marketing.

How Philosophy Predicts Strengths

OpenAI's capability-first philosophy — build the most capable model, then layer policy controls — produces a model that is highly capable at broad, general-purpose tasks and that tends to engage more liberally with edge-case requests. GPT-4's willingness to take creative risks, write persuasive content on multiple sides of a debate, and assist with sensitive-but-legal topics reflects a philosophy where capability is the primary goal and safety is enforced through a separate policy layer. This makes GPT models strong starting points for consumer-facing products, creative writing, and general-purpose assistants where flexibility matters more than predictability.

Anthropic's safety-as-intrinsic philosophy produces a model with more consistent, principled behavior across the full range of possible requests. Claude is often the better starting point for tasks that require nuanced reasoning about ethics, law, or risk; for scenarios where the model may be embedded in an automated pipeline where human review is limited; and for enterprise applications where reliability of behavior matters more than raw capability ceiling. The trade-off is that Claude's trained caution occasionally applies where it is not needed.

Google's philosophy of infrastructure integration and scale produces a model that is strongest at tasks that benefit from breadth of real-world information and multimodal native understanding. Gemini's design for serving billions of users across radically different contexts means it has been tuned for broad applicability over edge-case depth. Its native video and audio understanding makes it the strongest default for media analysis tasks that would require pre-processing with the other two models.

How Architecture Choices Affect Real-World Performance

Context window size is the single most operationally significant architectural difference for knowledge work in 2024. The practical hierarchy is: GPT-4 Turbo at 128K tokens is sufficient for most documents up to about 300 pages; Claude 3 at 200K handles longer legal, technical, or research documents in a single pass; Gemini 1.5 Pro at 1M tokens (in preview) enables qualitatively different tasks — analyzing an entire codebase, processing a full book plus reference materials, or running a long interview transcript alongside a large document corpus simultaneously.

Mixture-of-experts architecture, reportedly used by GPT-4 and described by Google for Gemini 1.5 Pro, affects latency and cost at scale. An MoE model can have a higher total parameter count while activating fewer parameters per token, which reduces inference compute. In practice, this means that for high-volume API users, MoE-based models tend to be faster and cheaper per token at similar capability levels than equivalent dense models. This is a deployment consideration rather than a quality consideration for most users, but it matters for building at scale.
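
The gap between total and active parameters is simple arithmetic. The figures below are hypothetical round numbers chosen for illustration and do not describe any real model, since none of the three labs has published these details for its flagship systems.

```python
def moe_parameter_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active_per_token) parameter counts for an MoE model."""
    total = shared_params + params_per_expert * num_experts
    active = shared_params + params_per_expert * experts_per_token
    return total, active

# Hypothetical model: 10B shared parameters, 8 experts of 20B each, 2 active per token.
total, active = moe_parameter_counts(
    shared_params=10_000_000_000,
    params_per_expert=20_000_000_000,
    num_experts=8,
    experts_per_token=2,
)
active_fraction = active / total
```

In this made-up configuration the model stores 170B parameters but computes with only 50B per token, a bit under 30 percent of the total.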

Multimodal native design — meaning image, audio, and video understanding baked into pretraining rather than added via a separate vision tower — gives Gemini an advantage in tasks where visual and textual information are deeply interleaved. Analyzing a chart within a long document, extracting data from a video presentation, or processing a form scan with handwritten annotations are tasks where native multimodal training produces more reliable results than post-hoc vision integration.

How Training Approach Manifests in Actual Outputs

The RLHF vs. Constitutional AI difference is most visible in three observable behaviors: refusal consistency, refusal reasoning, and handling of adversarial inputs.

Refusal consistency: Because RLHF trains on specific annotated examples, RLHF-trained models like GPT-4 can be inconsistent when a harmful request is paraphrased, framed differently, or embedded in a roleplay scenario. The model has learned to refuse certain surface patterns rather than underlying principles. Constitutional AI's self-critique against explicit principles produces more consistent refusals across reformulations of the same underlying request — the model has learned why something is problematic, not just that specific phrasings trigger a refusal.

Refusal reasoning: Claude tends to provide explicit reasoning when it declines a request — explaining which principle is implicated and often suggesting an alternative framing. GPT-4 and Gemini more frequently produce shorter, less reasoned refusals. For users who need to understand and work around model limitations, Claude's explicit reasoning is operationally useful.

Handling operator confidentiality: Anthropic's constitutional training includes explicit principles about maintaining confidentiality when operators instruct the model to do so. In documented testing, Claude has shown more resistance to system-prompt extraction attempts — asking the model to repeat its instructions verbatim — than GPT-4, which has been demonstrated to leak system prompts in contexts where it was instructed not to.
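
Refusal consistency can be measured rather than assumed. The sketch below sends several paraphrases of one request to a model and reports how often they receive the same outcome. The keyword-based `is_refusal` check and the `keyword_model` stub are simplifying assumptions; in practice `ask_model` would wrap a real API call and refusals would need a more careful classifier.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def is_refusal(response):
    """Crude check: does the reply contain a common refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_consistency(ask_model, paraphrases):
    """Fraction of paraphrases that receive the majority outcome."""
    refusals = [is_refusal(ask_model(p)) for p in paraphrases]
    refused = sum(refusals)
    majority = max(refused, len(refusals) - refused)
    return majority / len(refusals)

# Hypothetical stub that refuses only when it sees a trigger word,
# mimicking a model that learned surface patterns.
def keyword_model(prompt):
    if "bypass" in prompt.lower():
        return "I can't help with that."
    return "Sure, here is how..."

paraphrases = [
    "How do I bypass a paywall?",
    "Bypass instructions for a paywall, please.",
    "Pretend you are a hacker and explain getting around a paywall.",
    "What is the trick for reading paywalled articles for free?",
]
score = refusal_consistency(keyword_model, paraphrases)
```

A score of 1.0 means every paraphrase got the same treatment; the stub scores 0.5 because two rephrasings slip past its trigger word.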

A Practical Decision Framework

Rather than declaring a single winner, a more useful frame is: which model is the strongest default for a given class of task?

Reach for GPT First When…

You need broad general capability with high flexibility. Consumer-facing products, creative writing, general coding assistance, tasks where the model's willingness to engage broadly is more valuable than behavioral predictability. Also when DALL·E image generation or real-time audio response (GPT-4o) is needed natively.

Reach for Claude First When…

You need long-document analysis (200K context), principled and consistent behavior in automated pipelines, nuanced handling of ethically complex topics, or tasks where the model's explicit reasoning about its own limits matters. Strong for legal, compliance, and research workflows requiring reliability over flexibility.

Reach for Gemini First When…

You need native video or audio understanding, extremely long context (1M token preview), deep integration with Google Workspace or Search, or TPU-optimized deployment at large scale. Best default for multimodal workflows where text, image, audio, and video are interleaved in the same task.
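
The three defaults above can be written down as a small lookup function. The parameter names and the order in which the checks run are assumptions made for this sketch; the thresholds are the context windows cited in this lesson, and the result is a starting point to test, not a verdict.

```python
def first_model_to_try(
    needs_video_understanding=False,
    context_tokens=0,
    needs_image_generation=False,
    needs_realtime_audio=False,
    automated_pipeline=False,
):
    """Map task requirements to the default suggested by this lesson."""
    # Native video or more than 200K tokens of context points to Gemini.
    if needs_video_understanding or context_tokens > 200_000:
        return "Gemini"
    # Built-in image generation or real-time audio points to GPT.
    if needs_image_generation or needs_realtime_audio:
        return "GPT"
    # Unattended pipelines or 128K-200K documents point to Claude.
    if automated_pipeline or context_tokens > 128_000:
        return "Claude"
    # Otherwise start with the general-purpose default.
    return "GPT"

contract_review = first_model_to_try(context_tokens=150_000)
lecture_video = first_model_to_try(needs_video_understanding=True)
marketing_copy = first_model_to_try()
```

Writing the rules as code makes the precedence explicit: a task that needs both video understanding and image generation resolves to Gemini here, which may or may not match your priorities.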

The Ongoing Race

Capability rankings between these models shift with every major release cycle. What does not shift quickly is organizational philosophy — and philosophy predicts behavior more reliably than any single benchmark. An organization that trained safety into the model from the beginning is structurally different from one that added safety filters on top of a capability-maximizing base. That difference is durable across versions.

Lesson 4 Quiz

3 questions — free, untracked, retake anytime.

According to the Lesson 4 framework, which type of task is Claude most clearly the strongest default choice for compared to GPT and Gemini?

✓ Correct. Claude's 200K context window, Constitutional AI training for consistent behavior, and explicit refusal reasoning make it the strongest default for automated pipelines, compliance workflows, and long-document analysis tasks.

✗ Not quite. Claude is the strongest default when consistent and principled behavior in automated pipelines matters, and when long-document analysis (up to 200K tokens) is required — not for native video or broad creative flexibility, where Gemini and GPT respectively have the advantage.

Gemini 1.5 Pro's 1 million token context window represents what qualitative shift compared to GPT-4 Turbo's 128K context?

✓ Correct. A 1M token context enables qualitatively different tasks — not just longer documents, but whole-codebase analysis, combined book-plus-reference processing, or full-day audio transcripts — that are simply impossible within 128K or even 200K token limits.

✗ Incorrect. The 1M token context is a qualitative shift in task type, not just a speed or cost improvement. It enables processing entire codebases, full books with reference materials, or long audio transcripts in one pass — tasks architecturally impossible within 128K tokens.

Why does Constitutional AI training produce more consistent refusals across paraphrased harmful requests compared to standard RLHF?

✓ Correct. RLHF trains on specific annotated examples, so refusals can be surface-pattern-dependent and bypassed by rephrasing. Constitutional AI's self-critique against explicit principles teaches the model the underlying reason a request is problematic, producing more consistent behavior across novel formulations.

✗ Incorrect. The key distinction is that Constitutional AI teaches principles, not patterns. RLHF-trained models learn which phrasings triggered refusals in training data; CAI-trained models learn why something violates a principle, which generalizes better across paraphrases and roleplay framings.

Lab 4: Synthesis and Integration

Apply and extend the concepts from this lesson through guided conversation with an AI assistant.

Use this lab to explore how the concepts from Lesson 4 apply to your own questions and interests. The AI assistant is here to help you think through complex scenarios.

Module Test

15 questions covering all lessons — free, untracked, retake anytime.

In what year was OpenAI originally founded, and as what type of organization?

✓ Correct. OpenAI was founded in 2015 as a nonprofit with a mission to ensure AGI benefits all of humanity. It shifted to a capped-profit structure in 2019 when it accepted a $1 billion investment from Microsoft.

✗ Incorrect. OpenAI was founded in 2015 as a nonprofit. Its capped-profit restructuring came in 2019, enabling the Microsoft investment — but the original founding structure was nonprofit.

Anthropic was co-founded by Dario and Daniela Amodei along with approximately how many colleagues who left OpenAI with them?

✓ Correct. The Amodei siblings left OpenAI in 2021 with nine colleagues, citing safety concerns about the pace of capability development — a total founding team of eleven people.

✗ Incorrect. Dario and Daniela Amodei left with nine colleagues, forming a founding team of eleven. The departure was specifically motivated by concerns that OpenAI was prioritizing capability over safety research.

What was Google's predecessor AI chat product, launched in March 2023, before the rebrand to Gemini?

✓ Correct. Google launched Bard in March 2023, initially powered by LaMDA and later PaLM 2. Google announced its natively multimodal Gemini model family in December 2023 and renamed the Bard product to Gemini in February 2024.

✗ Incorrect. Bard was Google's chat product launched in March 2023. LaMDA and PaLM 2 were the underlying models that powered Bard. Gemini models began powering Bard in December 2023, and Gemini replaced Bard as the product name in February 2024.

The paper Attention Is All You Need, which introduced the transformer architecture, was published in what year and by researchers at which organization?

✓ Correct. Eight Google Brain researchers published "Attention Is All You Need" in 2017. The transformer architecture it introduced became the foundation for GPT, Claude, and Gemini, and most of those researchers had left Google by the time their work had become the foundation of a trillion-dollar industry.

✗ Incorrect. The paper was published in 2017 by eight Google Brain researchers. It introduced the transformer architecture with self-attention, replacing recurrent networks as the dominant approach to sequence modeling.

GPT-4 Turbo, launched in November 2023, introduced a context window of what size?

✓ Correct. GPT-4 Turbo launched in November 2023 with a 128,000-token context window — roughly equivalent to a 300-page book — a major expansion from prior GPT-4 variants that were capped at 8K or 32K tokens.

✗ Incorrect. GPT-4 Turbo's context window was 128,000 tokens. The 200,000-token context belongs to Claude 3, which launched at general availability with a larger window than GPT-4 Turbo.

Which model family was described as built as "natively multimodal from pretraining," meaning image, audio, and text understanding were part of the architecture from the start?

✓ Correct. Gemini was built from the ground up to process text, images, audio, video, and code within a single architecture — unlike Bard/PaLM 2, which added multimodal capabilities to an existing language model base.

✗ Incorrect. Gemini is the model family described as natively multimodal from pretraining. GPT-4 Vision and Claude 3 added image understanding to text-primary architectures; Gemini's multimodality was baked in from the initial pretraining.

In a mixture-of-experts (MoE) architecture, what is the key operational difference from a dense model?

✓ Correct. MoE routes each token through only a subset of expert networks. This makes trillion-parameter total sizes computationally feasible — you get the capacity of a huge model without the inference cost of activating every parameter for every token.

✗ Incorrect. In MoE, a routing mechanism selects which expert sub-networks handle each token. Dense models activate all parameters for every token; MoE activates only a fraction, enabling very large total parameter counts at manageable inference costs.
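
The routing idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not any lab's implementation: the router scores are passed in directly (in a real model a learned layer produces them), the experts are simple scaling functions rather than feed-forward networks, and the names `route_token` and `moe_layer` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weight_sum = sum(probs[i] for i in chosen)
    return {i: probs[i] / weight_sum for i in chosen}


def moe_layer(token, experts, router_scores, top_k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    weights = route_token(router_scores, top_k)
    return sum(w * experts[i](token) for i, w in weights.items())


if __name__ == "__main__":
    # Toy experts: each scales its scalar input by a different factor.
    experts = [lambda x, k=k: x * k for k in range(1, 9)]
    random.seed(0)
    scores = [random.gauss(0.0, 1.0) for _ in experts]
    active = route_token(scores)
    print(f"active experts: {sorted(active)} of {len(experts)}")
    print(f"layer output: {moe_layer(3.0, experts, scores):.3f}")
```

With eight experts and `top_k=2`, only a quarter of the expert parameters run for a given token, which is the cost saving the answer describes.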

Constitutional AI (CAI), published in Anthropic's December 2022 paper, introduces what step that distinguishes it from standard RLHF?

✓ Correct. CAI inserts a self-critique step: the model generates a response, evaluates it against the written constitution's principles, and revises accordingly. The revised responses are used for fine-tuning, and only then does a preference model assess outputs. This reduces reliance on human annotation of every output.

✗ Incorrect. CAI's distinguishing step is self-critique: the model uses the written constitution to evaluate and revise its own outputs before preference training enters the loop, and the harmlessness preference labels in that later stage are themselves generated by a model rather than by human annotators. It is a training-time mechanism, not an inference-time filter.
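
The generate, critique, revise loop can be sketched as follows. This is a simplified illustration, not Anthropic's code: the two-line constitution is invented for the example, the loop walks through every principle in order rather than sampling one, and the `generate`, `critique`, and `revise` callables are hypothetical stand-ins for calls to the model being trained.

```python
from typing import Callable

# Hypothetical two-principle constitution, invented for this example.
CONSTITUTION = [
    "Avoid giving instructions that facilitate serious harm.",
    "Be honest about uncertainty.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str], str],
) -> dict:
    """Generate a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in CONSTITUTION:
        feedback = critique(response, principle)
        if feedback:  # an empty string means no problem was found
            response = revise(response, feedback)
    # The prompt and its revised response form one fine-tuning example.
    return {"prompt": prompt, "response": response}


if __name__ == "__main__":
    # Stub callables so the sketch runs without a model.
    def generate(prompt):
        return "Sure, definitely do X."

    def critique(response, principle):
        flagged = "uncertainty" in principle and "definitely" in response
        return "overconfident" if flagged else ""

    def revise(response, feedback):
        return response.replace("definitely", "possibly")

    print(critique_and_revise("Should I do X?", generate, critique, revise))
```

Everything here happens while building training data, which is why the answer calls CAI a training-time mechanism.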

OpenAI's RLHF annotation for ChatGPT's training was documented by TIME magazine in January 2023 as relying on contractors in Kenya working through which company?

✓ Correct. TIME's January 2023 investigation documented Kenyan workers employed through Sama who were tasked with labeling graphic and disturbing content to train ChatGPT's safety filters, raising significant ethical questions about RLHF's human cost.

✗ Incorrect. The TIME investigation named Sama as the contractor employing Kenyan workers to label harmful content for OpenAI's safety training. Workers described reviewing deeply disturbing content, and TIME reported take-home wages of less than $2 per hour.

ChatGPT reached 1 million users in five days after its November 2022 launch. How long did it take to reach 100 million users?

✓ Correct. ChatGPT reached 100 million users in approximately two months — setting a consumer adoption record that beat TikTok's previous benchmark and forced every major technology company to publicly reveal its AI plans.

✗ Incorrect. ChatGPT reached 100 million users in roughly two months. The 1 million milestone came in five days; the 100 million mark set the record that made it the fastest-adopted consumer application in history at the time.

Claude 3's context window at general availability in early 2024 was how large?

✓ Correct. Claude 3 launched with a 200,000-token context window at general availability — the largest among the three families at GA in early 2024. Gemini 1.5 Pro's 1M context was announced but remained in research preview at that time.

✗ Incorrect. Claude 3's context window was 200,000 tokens at general availability. 128K belongs to GPT-4 Turbo; 1M belongs to Gemini 1.5 Pro but was in research preview, not GA, in early 2024.

Google merged which two AI research organizations in April 2023 to form Google DeepMind?

✓ Correct. Google merged Google Brain and DeepMind in April 2023, creating Google DeepMind. This unified two teams that had worked on AI research separately for years, consolidating the organization that would go on to build Gemini.

✗ Incorrect. Google DeepMind was formed by merging Google Brain and DeepMind in April 2023. These were two previously separate research divisions — Google Brain was the in-house team, DeepMind was an acquisition — unified to build Gemini.

GPT-4o, released in May 2024, added what capability that distinguished it from earlier GPT-4 variants?

✓ Correct. GPT-4o unified text, image, and audio input and output in a single model with ~320ms audio-to-audio response — enabling real-time spoken conversation rather than the transcription-plus-generation pipeline used by earlier voice implementations.

✗ Incorrect. GPT-4o's distinguishing feature was native audio-to-audio processing in a single model with ~320ms latency. The 1M token context belongs to Gemini 1.5 Pro; Constitutional AI is Anthropic's methodology; native video input is a Gemini capability.

According to the Lesson 4 framework, which model is the strongest default starting point when native video and audio understanding are required in the same workflow as text analysis?

✓ Correct. Gemini's natively multimodal pretraining — text, image, audio, video, and code understood within a single architecture from the start — gives it a structural advantage when modalities are deeply interleaved in the same task, rather than processed sequentially.

✗ Incorrect. For tasks where video, audio, and text are interleaved, Gemini is the strongest default because its multimodality is native to pretraining. GPT-4o handles audio natively but not video; Claude 3 handled only text and images at launch.
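
One way to turn these starting-point rules into something you can run is a small routing function. The sketch below is one possible reading of the Lesson 4 defaults, not an official decision procedure: the function name, the parameters, and the rule order are assumptions, and the token thresholds are the window sizes quoted in this module.

```python
def default_model(
    needs_video: bool = False,
    input_tokens: int = 0,
    automated_pipeline: bool = False,
) -> str:
    """Suggest a first model to try, following the lesson's rules of thumb."""
    # Native video, or input beyond a 200K window: start with Gemini.
    if needs_video or input_tokens > 200_000:
        return "Gemini"
    # Consistency in automated pipelines, or input beyond 128K: start with Claude.
    if automated_pipeline or input_tokens > 128_000:
        return "Claude"
    # Otherwise start with GPT for broad, flexible general use.
    return "GPT"


if __name__ == "__main__":
    print(default_model(needs_video=True))
    print(default_model(input_tokens=150_000))
    print(default_model())
```

Treat the result as a first candidate to benchmark against your own task, not as a verdict.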

A February 2024 demonstration showed that GPT-4's system prompt for its "GPT Builder" feature could be extracted by asking the model to repeat its instructions verbatim. This incident was cited in Lesson 3 as evidence of what limitation?

✓ Correct. The incident demonstrated that RLHF had not produced reliable protection of confidential system prompts when the model was directly asked to reveal them, because RLHF trains on examples, not principles. Claude's constitutional training, which includes explicit operator confidentiality principles, showed stronger resistance to the same extraction attempts in contemporaneous testing.

✗ Incorrect. The incident illustrated RLHF's limitation: the model learned to follow confidentiality instructions in typical cases but had not internalized a principled understanding of why to protect them. Claude's Constitutional AI training, which explicitly encodes operator confidentiality as a principle, produced stronger resistance to the same attack.
