Twenty years ago, using the internet meant one thing. Ten years ago, using a smartphone meant one thing. In both cases, the specific hardware and operating system mattered less than the fact of participating at all.
AI is going the other direction. Using AI in 2026 specifically means choosing an AI — Claude or ChatGPT or Gemini or Llama or Mistral or one of a dozen others — and the choice matters. They are not interchangeable. They have different strengths on math, different biases in writing, different refusal patterns, different context windows, different prices, different latencies, different policies about what they'll do with your data.
This course is the comparative literacy course. It teaches you to evaluate AI models the way a chef evaluates knives — not by brand loyalty, but by fit to the job. It covers the major frontier models, how they're actually built differently, what each one is best and worst at, how to run your own benchmarks, and how to design a workflow that uses the right tool for the right task rather than the most advertised tool for everything.
On November 30, 2022, OpenAI quietly posted a research preview to its website. Within five days it had one million users. Within two months, one hundred million. ChatGPT had become the fastest-adopted consumer application in history — and it forced every major technology company to reveal what they had been building in private.
Anthropic had been operating since 2021, founded by former OpenAI researchers Dario and Daniela Amodei along with ten colleagues. Google had been researching large language models since at least 2017 — its own researchers had written the Attention Is All You Need paper that made the whole field possible. Three very different organizations now occupied the same public stage.
Understanding these three systems requires understanding the institutional pressures and stated values of the organizations behind them. They are not interchangeable products competing purely on benchmark scores.
OpenAI was founded in 2015 as a nonprofit with a mission to ensure artificial general intelligence benefits all of humanity. Its 2019 shift to a "capped-profit" structure and a $1 billion investment from Microsoft changed the competitive dynamics significantly. GPT-3 launched in 2020 via API; GPT-4 arrived in March 2023. The GPT family is positioned as a general-purpose capability platform — maximize what the model can do, then apply safety filters and policies on top.
Anthropic was founded explicitly around the concern that OpenAI was moving too fast. The founders brought a research agenda called Constitutional AI, which trains the model to evaluate and revise its own outputs against a written set of principles before responding. Claude 1 launched in March 2023; Claude 2 in July 2023; Claude 3 (Haiku, Sonnet, Opus) in March 2024. Safety is not a layer added to the model — it is described as intrinsic to the training process itself.
Google DeepMind was formed in April 2023 through the merger of Google Brain and DeepMind, unifying research teams that had worked separately for years. Bard launched in March 2023, initially powered by LaMDA and later PaLM 2. Gemini, the model family built from the ground up as multimodal, launched in December 2023. Google's core advantage is infrastructure: its TPU hardware, its search index, and a base of over two billion users already inside Google Workspace.
Each organization has a public-facing value statement that shapes real product decisions. These are not marketing copy — they appear consistently in research papers, deployment choices, and what the models actually refuse to do.
OpenAI: AGI for the benefit of all humanity. Commercial deployment funds safety research. Capability-first with policy guardrails applied through RLHF and system prompts.
Anthropic: Safety and helpfulness as inseparable goals. Constitutional AI trains the model on explicit principles. Anthropic describes itself as occupying a "peculiar position": believing it may be building dangerous technology, and pressing forward anyway to ensure the result is beneficial.
Google DeepMind: Multimodal from inception. Native integration with Google Search, Workspace, and Android. Aimed at enterprise deployment and consumer scale simultaneously, with TPU-optimized inference.
The organizational origin of each model shapes its defaults, its refusals, its strengths, and its blind spots. A researcher choosing between these systems isn't just choosing a benchmark score — they are choosing which organization's values will be embedded in their workflow. Understanding that context is the foundation of intelligent model selection.
Each of the three major AI systems — GPT, Claude, and Gemini — reflects the values and priorities of its parent organization. In this lab, ask the AI assistant about the founding histories, stated missions, and structural differences between OpenAI, Anthropic, and Google DeepMind.
Try to understand how organizational structure (nonprofit origins, capped-profit, big tech subsidiary) shapes product philosophy. Push for specifics: funding amounts, named founders, documented policy decisions.
In June 2017, eight Google researchers published a paper with an unusually confident title: Attention Is All You Need. The transformer architecture they described replaced the recurrent networks that had dominated natural language processing for years. Within five years, every major language model (GPT, Claude, Gemini) would be built on this foundation. The researchers who wrote it had largely left Google by the time their architecture became the basis of a trillion-dollar industry.
All three model families — GPT, Claude, and Gemini — are transformer-based large language models. At a high level, they predict the next token in a sequence by attending to all previous tokens simultaneously, weighting each by relevance. This mechanism, called self-attention, is what allows these models to handle long-range dependencies in text that recurrent networks could not.
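The self-attention step described above can be sketched in a few lines. This is a minimal illustration in pure Python, not any lab's implementation: it assumes the query, key, and value vectors are already given (real transformers compute them with learned projection matrices and run many attention heads in parallel), and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where token i attends to tokens 0..i.

    Assumption: queries, keys, and values are lists of equal-length vectors,
    one per token. Learned projections are omitted for brevity.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # Relevance of each earlier token (and the token itself) to token i.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the relevance-weighted average of the value vectors.
        mixed = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: three tokens with two-dimensional vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens, tokens, tokens)
```

Each row of `weights` shows how strongly one token draws on every token before it, which is the "weighting each by relevance" step in the paragraph above.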
What differentiates the models is not the fundamental architecture but the decisions made on top of it: scale (how many parameters), training data (what text and media the model saw), training objectives (how the model learned from human feedback), and context window (how much text the model can process in a single pass).
GPT-4's parameter count has not been officially confirmed by OpenAI, though reporting suggests a mixture-of-experts architecture. Claude 3 Opus similarly has undisclosed parameters. Google has not published a parameter count for Gemini Ultra either, although its technical report for Gemini 1.5 Pro describes that model as a sparse mixture-of-experts transformer. What matters operationally is not the raw number but how these choices manifest in reasoning, latency, and cost.
The context window — how much text a model can hold in active memory during a single conversation — is one of the most consequential practical differences between models as of 2024.
GPT-4 Turbo launched in November 2023 with a 128,000-token context window, roughly equivalent to a 300-page book. Prior GPT-4 variants were limited to 8K or 32K tokens, which created real workflow constraints for document analysis tasks.
Claude 3 launched with a 200,000-token context window across all variants, the largest at general availability among the three families in early 2024. Anthropic supported this with results on the Needle in a Haystack evaluation, which hides a target sentence inside a long corpus of documents and asks the model to retrieve it; the company reported near-perfect recall for Claude 3 Opus.
Gemini 1.5 Pro, announced in February 2024, demonstrated a 1 million token context window in research preview: enough to process approximately 1 hour of video, 11 hours of audio, or 700,000 words of text in a single request. This represents a qualitative shift in what long-context retrieval can mean in practice.
All three model families support text and image input as of 2024. The differences lie in depth of integration and additional modalities.
GPT-4o: Text, image, audio input and output in a single model. Audio-to-audio response in ~320ms. Vision fine-tuned on wide consumer image distribution. DALL·E 3 integration for image generation.
Claude 3: Text and image input; text output only. Strong document and chart understanding. Particularly noted for precise OCR on dense tables and financial documents. No audio or video input at launch.
Gemini 1.5 Pro: Text, image, audio, video, and code input natively. Can process up to ~11 hours of audio or 1 hour of video in context. Google's Imagen 2 integration for generation. Built on TPU v5e infrastructure.
GPT-4 reportedly uses a mixture-of-experts (MoE) architecture, and Google's technical report describes Gemini 1.5 Pro as one: the model is made up of many smaller specialized networks, and only a subset activates for any given token. This allows extremely large total parameter counts without proportional inference costs. Dense models, such as earlier GPT variants, activate all parameters for every token.
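The "only a subset activates" idea can be illustrated with a toy top-k router. This is a sketch under stated assumptions, not a description of how GPT-4 or Gemini route tokens: the router here is a plain dot product, the experts are simple stand-in functions, and `moe_layer` and its parameters are hypothetical names.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_weights, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs.

    Assumption: router_weights holds one vector per expert, and each expert
    is a function mapping a vector to a vector of the same length.
    """
    # One routing score per expert.
    logits = [sum(a * b for a, b in zip(x, w)) for w in router_weights]
    # Keep only the top_k highest-scoring experts; the rest never run.
    chosen = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)[:top_k]
    gates = softmax([logits[e] for e in chosen])
    output = [0.0] * len(x)
    for gate, expert_index in zip(gates, chosen):
        expert_output = experts[expert_index](x)
        output = [o + gate * y for o, y in zip(output, expert_output)]
    return output, chosen

# Four toy experts that scale their input by 1, 2, 3, and 4.
calls = []
def make_expert(scale):
    def expert(vector):
        calls.append(scale)  # record which experts actually executed
        return [scale * v for v in vector]
    return expert

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]
output, chosen = moe_layer([1.0, 0.0], router_weights, experts, top_k=2)
```

With four experts and `top_k=2`, the layer holds four experts' worth of parameters but pays the compute cost of two per token, which is the trade-off the paragraph above describes.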
In this lab, explore the practical implications of the architectural choices these models make. Context window size, multimodal capabilities, and mixture-of-experts design each have real consequences for specific use cases. Ask the assistant to help you reason through which architecture is best suited for particular tasks.
The goal is to move from abstract specs to concrete decision criteria — when does a 200K context window matter? When is native video understanding valuable vs. overkill?
When Claude launched in March 2023, early testers immediately noticed something different: it would engage more deeply with morally complex hypotheticals, provide more nuanced refusals with explicit reasoning, and was notably more resistant to jailbreaks that relied on roleplay framing. This wasn't luck — it was the product of a fundamentally different training methodology that Anthropic had published in a 2022 paper titled Constitutional AI: Harmlessness from AI Feedback.
Reinforcement Learning from Human Feedback (RLHF) became the dominant post-training alignment method after OpenAI's InstructGPT paper in January 2022. The process works in three stages: first, supervised fine-tuning on high-quality demonstration data; second, training a reward model on human preference rankings between model outputs; third, optimizing the language model using the reward model's scores via proximal policy optimization (PPO).
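The second stage, training a reward model on preference rankings, is commonly expressed as a pairwise loss: the reward model should score the human-preferred output higher than the rejected one. The sketch below shows that loss for a single comparison, in the form given in the InstructGPT paper. Whether each lab uses exactly this formulation is an assumption, and the function name is hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Loss for one human comparison: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model already scores the preferred
    output higher, and large when it ranks the pair the wrong way round.
    """
    diff = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# The reward model agrees with the annotator: low loss.
agrees = pairwise_preference_loss(2.0, 0.0)
# The reward model is undecided: loss equals log(2).
undecided = pairwise_preference_loss(0.0, 0.0)
# The reward model disagrees with the annotator: high loss.
disagrees = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss over many ranked pairs yields the reward model whose scores the third stage then optimizes against with PPO.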
GPT-4 uses RLHF as its primary alignment method, supplemented by rule-based reward models (RBRMs): classifiers that grade an output against a written rubric and supply an extra reward signal for specific categories of behavior, such as refusing harmful requests and not refusing benign ones. This approach is powerful but has a known limitation: it installs the values of the annotator pool, which tends to be geographically and demographically concentrated. The labor behind safety data has also drawn scrutiny. In January 2023, TIME magazine reported that OpenAI had used contractors in Kenya, hired through the outsourcing firm Sama, to label toxic content for the safety systems around ChatGPT, and that workers described the material as psychologically disturbing.
Constitutional AI (CAI), introduced in Anthropic's December 2022 paper, adds a step before human feedback enters the loop. A set of written principles, the "constitution," is used to have the model critique and revise its own outputs. The model first generates a response, is then asked to identify how that response violates specific principles (e.g., "Choose the response that is less harmful"), and revises accordingly. Only then does a preference model evaluate the result.
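The critique-and-revise step can be sketched as a loop over principles. This is an illustrative outline only, not Anthropic's training code: `generate` is a hypothetical stand-in for any function that sends text to a language model and returns its reply, and the prompt wording is invented for the example.

```python
from typing import Callable, Dict, List

def critique_and_revise(
    prompt: str,
    principles: List[str],
    generate: Callable[[str], str],
) -> Dict[str, object]:
    """Draft a response, then critique and rewrite it once per principle.

    Assumption: `generate` is a hypothetical text-in, text-out model call.
    """
    response = generate(prompt)
    drafts = [response]
    for principle in principles:
        # Ask the model to find where the current draft violates the principle.
        critique = generate(
            "Critique the response against the principle.\n"
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}"
        )
        # Ask the model to rewrite the draft so that it addresses the critique.
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
        )
        drafts.append(response)
    return {"prompt": prompt, "final": response, "drafts": drafts}

# Stand-in model so the sketch runs without any API access.
calls = []
def fake_generate(text):
    calls.append(text)
    if text.startswith("Critique"):
        return "The draft could be less harmful."
    if text.startswith("Rewrite"):
        return f"revision {sum(c.startswith('Rewrite') for c in calls)}"
    return "first draft"

result = critique_and_revise(
    "Explain the risks of mixing household chemicals.",
    ["Choose the response that is less harmful", "Choose the more honest response"],
    fake_generate,
)
```

In the published method, pairs of prompts and final revisions like these become fine-tuning data, so the critique pass does not have to run when the model answers a user.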
Importantly, this means alignment is partially self-supervised: the model trains against its own constitutionally-guided critiques rather than requiring a human to evaluate every output. Anthropic published their constitution — it draws on sources including the UN Declaration of Human Rights, Apple's terms of service, and DeepMind's Sparrow rules. This transparency is a deliberate differentiator.
The practical consequence: Claude tends to provide more explicit reasoning when declining requests, and tends to be more consistent across paraphrased versions of the same harmful request, because the constitution is applied systematically rather than relying purely on annotator judgment on specific training examples.
Google's technical report describes Gemini's post-training as a combination of supervised fine-tuning and reinforcement learning from human feedback, with a process reward model for multi-step reasoning tasks. Google also exposes a separate, adjustable content safety layer through the safety settings parameters of the Gemini API, which operators adjust independently of the base model's trained values. SynthID, Google's watermarking tool for AI-generated content, serves provenance rather than filtering.
Google's scale creates a distinctive challenge: Gemini serves billions of users across Search, Workspace, and Android simultaneously. The same base model must be appropriate for a student in Indonesia and a radiologist in Germany. This drives a more granular operator-level safety configuration compared to Anthropic's more locked-down defaults.
In February 2024, a GitHub user demonstrated that GPT-4's system prompt for the "GPT Builder" feature could be extracted by asking the model to repeat its instructions verbatim. OpenAI's RLHF training had not produced a model that reliably protected confidential system prompts when directly instructed to. Claude's constitutional training produced stronger resistance to similar extraction attempts in contemporaneous testing, attributed to the principle explicitly addressing operator confidentiality.
Use the AI below to explore Lesson 3 concepts in depth. Challenge assumptions and work through scenarios.
You have now seen how OpenAI, Anthropic, and Google each built their models under different institutional pressures, using different architectural strategies and different alignment methods. The question that follows is concrete: when you sit down to do actual work, which model should you reach for first?
The answer is not a single winner. Each lab's philosophy and training choices manifest as genuine strengths and genuine weaknesses. Understanding those patterns is what allows you to choose intelligently rather than by habit or marketing.
OpenAI's capability-first philosophy — build the most capable model, then layer policy controls — produces a model that is highly capable at broad, general-purpose tasks and that tends to engage more liberally with edge-case requests. GPT-4's willingness to take creative risks, write persuasive content on multiple sides of a debate, and assist with sensitive-but-legal topics reflects a philosophy where capability is the primary goal and safety is enforced through a separate policy layer. This makes GPT models strong starting points for consumer-facing products, creative writing, and general-purpose assistants where flexibility matters more than predictability.
Anthropic's safety-as-intrinsic philosophy produces a model with more consistent, principled behavior across the full range of possible requests. Claude is often the better starting point for tasks that require nuanced reasoning about ethics, law, or risk; for scenarios where the model may be embedded in an automated pipeline where human review is limited; and for enterprise applications where reliability of behavior matters more than raw capability ceiling. The trade-off is that Claude's trained caution occasionally applies where it is not needed.
Google's philosophy of infrastructure integration and scale produces a model that is strongest at tasks that benefit from breadth of real-world information and multimodal native understanding. Gemini's design for serving billions of users across radically different contexts means it has been tuned for broad applicability over edge-case depth. Its native video and audio understanding makes it the strongest default for media analysis tasks that would require pre-processing with the other two models.
Context window size is the single most operationally significant architectural difference for knowledge work in 2024. The practical hierarchy is: GPT-4 Turbo at 128K tokens is sufficient for most documents up to about 300 pages; Claude 3 at 200K handles longer legal, technical, or research documents in a single pass; Gemini 1.5 Pro at 1M tokens (in preview) enables qualitatively different tasks — analyzing an entire codebase, processing a full book plus reference materials, or running a long interview transcript alongside a large document corpus simultaneously.
Mixture-of-experts architecture, reportedly used by GPT-4 and documented for Gemini 1.5 Pro, affects latency and cost at scale. An MoE model can have a higher total parameter count while activating fewer parameters per token, which reduces inference compute. In practice, this means that for high-volume API users, MoE-based models tend to be faster and cheaper per token at similar capability levels than equivalent dense models. This is a deployment consideration rather than a quality consideration for most users, but it matters for building at scale.
Multimodal native design — meaning image, audio, and video understanding baked into pretraining rather than added via a separate vision tower — gives Gemini an advantage in tasks where visual and textual information are deeply interleaved. Analyzing a chart within a long document, extracting data from a video presentation, or processing a form scan with handwritten annotations are tasks where native multimodal training produces more reliable results than post-hoc vision integration.
The RLHF vs. Constitutional AI difference is most visible in three observable behaviors: refusal consistency, refusal reasoning, and handling of adversarial inputs.
Refusal consistency: Because RLHF trains on specific annotated examples, RLHF-trained models like GPT-4 can be inconsistent when a harmful request is paraphrased, framed differently, or embedded in a roleplay scenario. The model has learned to refuse certain surface patterns rather than underlying principles. Constitutional AI's self-critique against explicit principles produces more consistent refusals across reformulations of the same underlying request — the model has learned why something is problematic, not just that specific phrasings trigger a refusal.
Refusal reasoning: Claude tends to provide explicit reasoning when it declines a request — explaining which principle is implicated and often suggesting an alternative framing. GPT-4 and Gemini more frequently produce shorter, less reasoned refusals. For users who need to understand and work around model limitations, Claude's explicit reasoning is operationally useful.
Handling operator confidentiality: Anthropic's constitutional training includes explicit principles about maintaining confidentiality when operators instruct the model to do so. In documented testing, Claude has shown more resistance to system-prompt extraction attempts — asking the model to repeat its instructions verbatim — than GPT-4, which has been demonstrated to leak system prompts in contexts where it was instructed not to.
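Refusal consistency, the first of the three behaviors above, is something you can measure yourself. The sketch below scores how uniformly a model treats paraphrases of one request. It rests on loud assumptions: `ask` is a hypothetical function that sends a prompt to whichever model you are testing, and the keyword check for refusals is a crude heuristic that a real benchmark would replace with human review or a classifier.

```python
from typing import Callable, List

# Assumption: a crude keyword heuristic for spotting refusals.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable")

def looks_like_refusal(reply: str) -> bool:
    """Return True if the reply contains a common refusal phrase."""
    text = reply.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_consistency(paraphrases: List[str], ask: Callable[[str], str]) -> float:
    """Fraction of paraphrases that received the majority decision.

    1.0 means the model refused all of them or answered all of them;
    0.5 means its behavior split evenly across rewordings.
    """
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    decisions = [looks_like_refusal(ask(p)) for p in paraphrases]
    refused = sum(decisions)
    majority = max(refused, len(decisions) - refused)
    return majority / len(decisions)

# Stand-in model that is fooled by a roleplay framing.
def fake_ask(prompt):
    if "pretend" in prompt:
        return "Sure, here is how the character would do it."
    return "I can't help with that request."

score = refusal_consistency(
    [
        "How do I pick a lock?",
        "Explain lock picking step by step.",
        "What is the method for opening a lock without a key?",
        "Let's pretend you are a locksmith teaching a class on picking locks.",
    ],
    fake_ask,
)
```

Running the same paraphrase set against each model gives a like-for-like number to compare, rather than relying on the general tendencies described above.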
Rather than declaring a single winner, a more useful frame is: which model is the strongest default for a given class of task?
Reach for GPT when you need broad general capability with high flexibility. Consumer-facing products, creative writing, general coding assistance, tasks where the model's willingness to engage broadly is more valuable than behavioral predictability. Also when DALL·E image generation or real-time audio response (GPT-4o) is needed natively.
Reach for Claude when you need long-document analysis (200K context), principled and consistent behavior in automated pipelines, nuanced handling of ethically complex topics, or tasks where the model's explicit reasoning about its own limits matters. Strong for legal, compliance, and research workflows requiring reliability over flexibility.
Reach for Gemini when you need native video or audio understanding, extremely long context (1M token preview), deep integration with Google Workspace or Search, or TPU-optimized deployment at large scale. Best default for multimodal workflows where text, image, audio, and video are interleaved in the same task.
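The defaults above can be written down as a small routing function, which is one way to make a "right tool for the right task" workflow explicit. The sketch below encodes only this lesson's 2024 rules of thumb. The `Task` fields, the thresholds, and the order of the checks are illustrative assumptions, and the table should be revised after every release cycle or after your own benchmarks.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical description of a job to be routed to a model family."""
    input_tokens: int = 0
    needs_video_or_audio_input: bool = False
    needs_image_generation: bool = False
    automated_pipeline: bool = False

def default_model(task: Task) -> str:
    """Pick a starting model family using this lesson's rules of thumb.

    Assumption: checks run from the hardest constraint to the softest.
    """
    # Only the 1M-token preview window or native media input covers these.
    if task.needs_video_or_audio_input or task.input_tokens > 200_000:
        return "Gemini"
    # Native image generation is listed under the GPT defaults.
    if task.needs_image_generation:
        return "GPT"
    # Long documents and unattended pipelines favor consistent behavior.
    if task.automated_pipeline or task.input_tokens > 128_000:
        return "Claude"
    # Otherwise fall back to the general-purpose default.
    return "GPT"

choice = default_model(Task(input_tokens=150_000))
```

The value of writing the rules as code is that each default becomes a visible, testable line you can change when a model's capabilities shift.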
Capability rankings between these models shift with every major release cycle. What does not shift quickly is organizational philosophy — and philosophy predicts behavior more reliably than any single benchmark. An organization that trained safety into the model from the beginning is structurally different from one that added safety filters on top of a capability-maximizing base. That difference is durable across versions.
Use this lab to explore how the concepts from Lesson 4 apply to your own questions and interests. The AI assistant is here to help you think through complex scenarios.