GPT vs. Claude vs. Gemini · Module 2 · Lesson 1

Transformers, Tokens, and the Attention Revolution

Before GPT-4o could exist, someone had to invent the engine it runs on.

Eight researchers at Google Brain published a 31-page paper titled "Attention Is All You Need." They discarded the recurrent networks that had dominated NLP for a decade and replaced them with a single idea: instead of reading text left-to-right like a human, let the model attend to every word simultaneously and weigh how much each word matters to every other word. The Transformer architecture was born.

OpenAI read that paper and, within eighteen months, built GPT-1 on top of it. By the time GPT-4o launched in May 2024, the same core idea — scaled by orders of magnitude — was processing images, audio, and text inside a single unified model. The transformer did not just survive; it became the universal substrate of modern AI.

What a Transformer Actually Does

A transformer takes a sequence of tokens — chunks of text, pixels, or audio frames — and applies a mathematical operation called self-attention. Self-attention asks, for every token in the sequence: "How relevant is every other token to understanding this one?" The answer is a set of floating-point weights, computed in parallel across the entire sequence.

This parallelism is what made transformers faster to train than recurrent networks (RNNs), which had to process tokens one at a time. GPT-4 was trained on roughly 13 trillion tokens across thousands of GPUs simultaneously — something impossible with sequential architectures.

The result of self-attention feeds into a feed-forward network that applies learned transformations, and this cycle (attention → feed-forward) is stacked into "layers." GPT-4 is estimated to have around 120 such layers. Each layer refines the model's internal representation of what the text means.

Tokenizer

Splits raw text into sub-word tokens (~100K vocabulary)

Embeddings

Maps each token to a high-dimensional vector (position-aware)

Attention Heads

Parallel circuits that weigh token-to-token relevance

Feed-Forward Layers

Dense networks that transform attended representations

Residual Streams

Skip connections that preserve gradient flow across 100+ layers

Softmax Output

Converts final vector to probability distribution over next token

Tokens: The Currency of GPT

GPT-4o does not read words — it reads tokens. OpenAI uses a byte-pair encoding (BPE) scheme called cl100k_base, which has roughly 100,277 tokens. Common words like "the" are a single token; rarer words get split. "Tokenization" becomes ["Token", "ization"]. Emojis and non-Latin scripts often cost more tokens per character.

This matters practically: GPT-4o's context window of 128,000 tokens (as of its May 2024 release) translates to roughly 96,000 words of English text — about the length of a full novel. Every inference call processes all tokens in the window simultaneously through the attention mechanism, which is why longer contexts cost more compute (attention scales as O(n²) with sequence length).

REAL BENCHMARK

On OpenAI's internal token-efficiency tests published in the GPT-4 Technical Report (March 2023), GPT-4 outperformed GPT-3.5 on the bar exam by 37 percentile points — scoring in the top 10% of human test-takers. The architecture was identical in type; what changed was scale and training data.

From GPT-1 to GPT-4o: The Scaling Story

GPT-1 (2018): 117 million parameters, trained on BookCorpus (~4.5 GB). Proved the pre-train/fine-tune paradigm could work for NLP tasks.

GPT-2 (2019): 1.5 billion parameters. OpenAI famously staged its release over fears of misuse — the first time an AI lab treated a language model as potentially dangerous to release in full.

GPT-3 (2020): 175 billion parameters. Demonstrated few-shot learning: the model could perform tasks it had never been explicitly trained on, just from examples in the prompt.

GPT-4 (2023): Parameters undisclosed, but widely estimated at ~1.8 trillion in a mixture-of-experts (MoE) configuration. Added multimodal vision input. Scored 90th percentile on the bar exam.

GPT-4o (May 2024): "omni" — handles text, image, and audio natively in a single model rather than routing through separate systems. Response latency dropped to ~232ms average, approaching human conversational speed.

KEY INSIGHT

The transformer architecture has not fundamentally changed since 2017. What has changed is scale (parameters, data, compute) and the quality of alignment training applied afterward. GPT-4o's "magic" is less architectural novelty and more the result of extraordinary scaling discipline.

Lesson 1 Quiz

3 questions — free, untracked, retake anytime.

What was the core innovation of the 2017 paper "Attention Is All You Need"?

✓ Correct. The Transformer replaced sequential RNNs with parallel self-attention, making large-scale training feasible.

✗ The paper introduced self-attention as a replacement for recurrent networks, enabling full parallel processing of token sequences.

Approximately how many tokens does GPT-4o's context window hold as of its May 2024 release?

✓ Correct. GPT-4o launched with a 128K token context window, equivalent to roughly 96,000 English words.

✗ GPT-4o's context window is 128,000 tokens — roughly the length of a full novel in English.

What does "GPT-4o" stand for, and what capability did it add over GPT-4?

✓ Correct. The "o" stands for omni, reflecting GPT-4o's native multimodal processing of text, images, and audio in one model.

✗ GPT-4o means "omni" — it unified text, image, and audio processing in a single model rather than routing through separate systems.

Lab 1 — Tokens & Transformers

Explore how GPT tokenizes text and why it matters for prompting.

Investigate GPT-4o's Tokenization and Attention

In this lab you'll probe how GPT-4o tokenizes unusual inputs, what happens at context window edges, and how the transformer's attention mechanism shapes what the model "notices" in a prompt. Ask questions about real tokenization behavior, count tokens in sample text, or explore what self-attention means in plain language.

Try asking: "How would GPT-4o tokenize the word 'tokenization'? How many tokens is it? What about an emoji like 🔥?"

AI Lab Assistant GPT-4o Architecture

GPT vs. Claude vs. Gemini · Module 2 · Lesson 2

Pre-training, RLHF, and the Making of ChatGPT

A transformer trained on raw internet text is powerful but wild. Alignment is what makes it useful.

Sam Altman's team had been quietly testing a chatbot interface for six days before deciding to release it publicly. They expected perhaps a million users in the first year. Within five days of launch, ChatGPT had one million users. Within two months, 100 million. It became the fastest consumer product to 100 million users in history, overtaking TikTok's record of nine months.

What made ChatGPT feel different from the GPT-3 API that had been available for two years? The answer was a training technique called Reinforcement Learning from Human Feedback — RLHF. The same model, made conversational and helpful, sparked a technological inflection point that no one, including OpenAI, fully anticipated.

Phase 1: Pre-training on the Internet

GPT-4's pre-training involved a dataset orders of magnitude larger than GPT-3's. OpenAI has not disclosed the exact corpus, but the GPT-4 Technical Report confirms it included data from the internet through early 2023, filtered using proprietary classifiers. The training objective is simple: predict the next token. Given "The capital of France is," the model should assign high probability to "Paris."

This self-supervised objective, applied to trillions of tokens, causes the model to implicitly learn grammar, facts, reasoning patterns, coding conventions, and much more — because predicting the next word accurately requires understanding almost everything about language and knowledge. The model never receives a label or correct-answer signal; it creates its own training signal from the structure of text.

Phase 2: Supervised Fine-Tuning (SFT)

After pre-training, OpenAI's contractors (via Scale AI) wrote thousands of example conversations demonstrating ideal assistant behavior — helpful, harmless, honest responses. The model was then fine-tuned on these examples using standard supervised learning. This shifted the model from "next-token predictor" to "assistant trying to be helpful."

This step is why ChatGPT responds in a conversational tone and acknowledges its limitations. The base GPT-4 model, before SFT, would just continue text — ask it "What is 2+2?" and it might continue with "... a question that has stumped philosophers."

Pre-train (predict next token, trillions of tokens)

→

SFT (human-written ideal responses)

→

Reward Model (rank outputs)

→

PPO (optimize against reward)

Phase 3: RLHF — Reinforcement Learning from Human Feedback

RLHF, originally described by OpenAI researchers in a 2017 paper and refined in the 2022 "Training language models to follow instructions with human feedback" paper (Ouyang et al.), involves two sub-steps:

Reward Model Training: Human raters compared pairs of model outputs and indicated which was better. These preferences trained a separate neural network — the reward model — to predict human preference scores for any given output.

PPO Optimization: The language model was then fine-tuned using Proximal Policy Optimization (PPO), a reinforcement learning algorithm, to maximize the reward model's score. Crucially, a KL-divergence penalty kept the model from drifting too far from the SFT baseline (preventing "reward hacking" where it learns to game the reward model with nonsense).

InstructGPT — the precursor to ChatGPT — showed in OpenAI's 2022 paper that a 1.3 billion parameter model trained with RLHF was preferred by human raters over a 175 billion parameter raw GPT-3 model. Alignment, not just scale, determined perceived quality.

DOCUMENTED RESULT

OpenAI's InstructGPT paper (Ouyang et al., 2022) found that labelers preferred InstructGPT outputs over GPT-3 outputs 85% of the time, even though InstructGPT was 100× smaller by parameter count. This was the empirical proof that RLHF could substitute for scale in producing useful outputs.

GPT-4o's Training: Adding the "o"

For GPT-4o, OpenAI extended this pipeline to audio and vision. Instead of separate encoders that translate images or audio into text tokens first, GPT-4o was trained end-to-end on interleaved text, image, and audio data. This means its attention layers directly attend to audio spectrograms and image patches alongside text tokens — no translation step, no latency from pipeline handoffs.

The result: GPT-4o's average audio response latency at launch was 232 milliseconds, compared to 2.8 seconds for the previous voice pipeline that routed through Whisper (speech-to-text) → GPT-4 → TTS. That 12× latency reduction came from architectural unification, not hardware.

PRACTICAL TAKEAWAY

When GPT-4o refuses a request or adds caveats, that behavior was shaped during RLHF — human raters scored cautious responses higher in certain categories. Understanding this helps you prompt around it: framing tasks as professional, legitimate, and specific tends to shift the reward-model-trained behavior toward compliance.

Lesson 2 Quiz

3 questions — free, untracked, retake anytime.

What is the training objective during GPT's pre-training phase?

✓ Correct. Pre-training uses next-token prediction as a self-supervised objective across trillions of tokens.

✗ Pre-training is self-supervised next-token prediction — the model learns from the structure of text itself, without external labels.

According to OpenAI's InstructGPT paper (2022), human raters preferred InstructGPT over raw GPT-3 what percentage of the time?

✓ Correct. 85% preference for InstructGPT — a 1.3B parameter model — over GPT-3 at 175B parameters proved alignment outweighs raw scale.

✗ Labelers preferred InstructGPT 85% of the time, even though it was 100× smaller than GPT-3 by parameter count.

What does the KL-divergence penalty do during RLHF's PPO optimization step?

✓ Correct. The KL penalty keeps the RL-optimized model close to the SFT baseline, preventing reward hacking.

✗ The KL-divergence penalty constrains how far the model can drift from the SFT baseline — it's the guard against reward hacking.

Lab 2 — RLHF in Action

See how alignment training shaped GPT-4o's behavior — and how to prompt around it.

Probe GPT-4o's RLHF-Shaped Responses

In this lab you'll explore how RLHF influences GPT-4o's refusals, caveats, and tone. Try different framings of the same request to see how professional or specific context shifts the model's output. Discuss why the reward model was trained to add hedges, and how to legitimately reduce unnecessary hedging.

Try asking: "Why does GPT-4o add disclaimers to medical questions? How would framing a request as 'for a healthcare professional' change its response, and why does RLHF cause that shift?"

AI Lab Assistant RLHF & Alignment

GPT vs. Claude vs. Gemini · Module 2 · Lesson 3

Multimodal Architecture: How GPT-4o Sees and Hears

Merging text, vision, and audio into one model required rethinking what a "token" even means.

OpenAI CTO Mira Murati and researchers demonstrated GPT-4o live on a laptop — no special hardware — solving a linear equation written by hand on paper, coaching a nervous speaker in real time by reacting to vocal tone, and switching languages mid-conversation. The model reacted with vocal affect: laughing, sighing, expressing mock impatience when interrupted. None of this used separate models. It was one model, one forward pass.

The technical press noted that GPT-4o's audio understanding was qualitatively different from the previous pipeline: it could now hear that a speaker was anxious, not just transcribe their words. That emotional register was information encoded in the audio spectrogram — and the model had learned to attend to it.

The Old Pipeline vs. the New Architecture

Before GPT-4o, OpenAI's voice mode worked as a three-stage pipeline:

Whisper (ASR model) transcribed audio to text → GPT-4 processed the text → a TTS model synthesized speech. Each handoff introduced latency and lost information — tone, pacing, emphasis, emotional register were all discarded at the Whisper transcription step.

GPT-4o eliminates those handoffs. It processes audio spectrograms — visual representations of sound frequency over time — as patches, similar to how it processes image regions. These audio patches are tokenized and fed directly into the same attention layers that process text tokens. The model attends to audio and text simultaneously, in the same forward pass.

Image Encoder

Image divided into 14×14 px patches; each patch embedded as a token

Audio Encoder

Mel spectrogram chunks embedded as tokens; preserves prosody & tone

Text Tokenizer

cl100k_base BPE; ~100K vocab; text and modal tokens share sequence

Unified Attention

All modalities attend to each other in same transformer layers

Shared Decoder

One decoder generates text, audio, or both depending on task

Output Head

Modality-specific projection layers for text vs. audio output

Vision: How GPT-4o Reads Images

GPT-4o's vision processing builds on CLIP-style image encoding. An input image is divided into fixed-size patches — typically 14×14 pixel regions — each of which is projected into the same embedding space as text tokens. A 512×512 image becomes roughly 1,365 image tokens.

This is why GPT-4o's context window cost scales with image size: a high-resolution image can consume hundreds or thousands of tokens, competing with text for attention capacity. OpenAI's system resizes images based on detail level requested — "low" detail mode uses 85 tokens; "high" detail mode tiles the image and can use up to 1,700+ tokens per image.

The model's vision capability was evaluated on MMMU (Massive Multidisciplinary Multimodal Understanding), where GPT-4o scored 69.1% at release — above human experts in several subcategories including medical imaging and scientific diagrams.

TECHNICAL DETAIL — DOCUMENTED

GPT-4o's "high detail" image mode tiles a resized copy of the image into 512×512 blocks and processes each block separately, then combines representations. This is why asking GPT-4o to read small text in a high-resolution screenshot is possible in high detail mode but fails in low detail mode — the tiling creates enough token resolution to resolve individual characters.

Practical Limits of Multimodal GPT-4o

Despite its capabilities, GPT-4o has documented multimodal limitations. It cannot generate images (DALL-E is a separate model). Its audio output, while expressive, cannot reproduce specific voices reliably. It struggles with spatial reasoning in images — knowing that object A is "to the left of" object B in complex scenes with many objects.

OpenAI's System Card for GPT-4o (May 2024) also documents that the model was restricted from certain audio behaviors at launch — including generating audio that closely imitates real individuals' voices — pending further safety evaluation. Some emotional vocal expressiveness demonstrated in the demo was disabled in the initial public API release.

PROMPT IMPLICATION

When submitting images to GPT-4o, specify "use high detail mode" for tasks requiring fine text or spatial precision. For broad scene understanding, low detail is faster and cheaper. The model doesn't automatically choose — it defaults to a heuristic based on image size unless instructed.

Lesson 3 Quiz

3 questions — free, untracked, retake anytime.

What critical information was lost when the old GPT-4 voice pipeline routed audio through Whisper?

✓ Correct. Transcription to text stripped paralinguistic information — GPT-4o's end-to-end processing retains it.

✗ Whisper transcribes words, but discards tone, pacing, emotional affect — information encoded in the audio signal that GPT-4o now retains.

How does GPT-4o process image input at the architectural level?

✓ Correct. Image patches are embedded as tokens and attend alongside text in the unified transformer — no captioning intermediary.

✗ Images are split into patches, embedded as tokens, and fed directly into the shared attention layers — no separate caption model is used.

GPT-4o's average audio response latency at launch was approximately 232ms. What was the main reason for this improvement over the previous 2.8-second pipeline?

✓ Correct. Eliminating Whisper → GPT-4 → TTS handoffs by processing everything in one model is what cut latency 12×.

✗ The latency gain came from architectural unification — one model, one forward pass, no handoffs between Whisper, GPT-4, and TTS.

🎯 Advanced · Lesson 3 Lab

Lab: Explore Lesson 3 Concepts

Apply what you learned in Lesson 3 through guided AI conversation

Your Task

Use the AI below to explore Lesson 3 concepts in depth. Challenge assumptions and work through scenarios.

Try asking about a specific concept from Lesson 3 and how it applies in practice.

🤖 AESOP Lab Assistant Lesson 3 Lab

GPT vs. Claude vs. Gemini · How GPT-4o Works · Lesson 4

GPT-4o in the Wild: From Architecture to Application

The transformer, RLHF pipeline, and native multimodality aren't just theory — they shape every interaction you have with GPT-4o.

By the time GPT-4o launched in May 2024, the technology press was focused on its emotional voice and real-time camera demos. But the more significant story was architectural: for the first time, the same weights that processed your typed question also heard the tone in your voice and read the expression in your photograph. What the model "knew" about one modality could inform its reasoning in another.

Understanding why GPT-4o behaves the way it does — why it hedges, how it handles ambiguous images, why long conversations sometimes drift — requires connecting the transformer mechanics from Lesson 1, the RLHF pipeline from Lesson 2, and the multimodal architecture from Lesson 3. This lesson makes those connections explicit.

How the Transformer Foundation Enables Real Use Cases

The self-attention mechanism is not merely an internal detail — it has direct practical consequences. Because attention is computed across all tokens simultaneously, GPT-4o can resolve pronouns, track argument threads, and maintain coherent reasoning across a 128,000-token context. A model built on recurrent networks would degrade over long sequences; the transformer does not.

This enables use cases that would have been impossible five years ago: reading an entire legal contract and answering questions about specific clauses; ingesting a codebase and explaining a function's interaction with the rest of the system; maintaining a coherent medical case history across a multi-hour diagnostic conversation. The 128K context is not a marketing number — it is the attention mechanism scaled to its current practical limit.

Tokenization also has practical implications. GPT-4o's ~100K vocabulary means it handles code, scientific notation, and multiple languages more efficiently than models with smaller vocabularies. But it also means unusual inputs — strings of random characters, highly technical jargon, or rare non-Latin scripts — may tokenize into many small pieces, consuming context faster than expected and sometimes degrading coherence.

Long-context reasoning

Attention over 128K tokens enables full-document analysis, codebase review, multi-session coherence

Cross-modal inference

Audio and image tokens share attention with text — GPT-4o connects tone of voice to word meaning

RLHF-shaped tone

Refusals, caveats, and professional framing all trace to reward model preferences during training

Native audio latency

232ms response vs 2.8s pipeline — architectural unification, not hardware, explains the gap

How RLHF Shapes Responses You Actually See

RLHF is the invisible hand behind GPT-4o's communication style. Human raters during training consistently scored responses higher when they acknowledged uncertainty, provided caveats on sensitive topics, and structured complex answers with headers or bullet points. Those preferences became baked into the reward model — and now emerge automatically in GPT-4o's outputs.

This has practical implications in both directions. GPT-4o's tendency to add disclaimers ("I'm not a doctor, but...") is not a capability limitation — it is a trained preference. Framing requests with professional context ("as a licensed pharmacist, I need...") shifts the reward-model-influenced distribution toward directness, because such framing was associated with appropriate expert queries during annotation.

RLHF also explains GPT-4o's sycophancy risk: human raters tend to prefer responses that agree with them, so RLHF-trained models can subtly lean toward validation over correction. OpenAI has applied additional training to counteract this, but the tension between human approval and factual accuracy is a structural consequence of the training pipeline — not a bug that can be fully eliminated by prompting.

PRACTICAL IMPLICATION

When GPT-4o gives you an answer you doubt, press it explicitly: "What evidence supports this?" or "What would contradict your conclusion?" RLHF-trained models respond strongly to direct intellectual challenge because raters consistently scored self-correcting, evidence-citing responses higher than confident assertions.

Multimodal Capability: What It Unlocks and What It Doesn't

Native multimodality — processing image patches and audio spectrograms in the same attention layers as text — enables genuinely new application categories. A doctor can photograph a skin lesion and describe symptoms in the same message, and GPT-4o reasons over both simultaneously rather than captioning the image and then reasoning about the caption. A language tutor can hear a student's accent, see their written exercise, and provide integrated feedback on pronunciation and grammar in one response.

But it is worth being precise about what GPT-4o's multimodality does not include. It cannot generate images — that requires DALL-E, a separate diffusion model. Its video understanding at launch was limited to still frames from video, not continuous temporal reasoning over motion. And its audio output capabilities were significantly restricted at launch compared to the demo — certain emotional range and voice-matching features were held back pending safety review.

The practical gap between "omni model" and "replaces all specialized tools" is significant. GPT-4o is better understood as a powerful reasoning layer that can accept richer inputs than before — not as a replacement for purpose-built vision, speech, or generation models in production pipelines.

MODULE SYNTHESIS

GPT-4o's capabilities are a product of three compounding factors: the transformer's ability to attend across long, mixed-modality sequences (Lesson 1); RLHF training that shaped its communication style and risk calibration (Lesson 2); and architectural unification that eliminated the pipeline latency and information loss of the old voice and vision systems (Lesson 3). Understanding any one factor in isolation explains only part of the picture.

Lesson 4 Quiz

3 questions — free, untracked, retake anytime.

GPT-4o's tendency to add caveats and disclaimers (e.g., "I'm not a doctor, but…") is best explained by which aspect of its development?

✓ Correct. Disclaimers are a trained preference, not a capability cap — human annotators rated cautious responses higher, so RLHF baked that style into the reward model.

✗ The caveats come from RLHF: human raters preferred hedged responses, training the reward model to favor them. It's a trained preference, not a filter or architecture constraint.

A medical team wants to send GPT-4o both a photo of a skin lesion and a text description of the patient's symptoms in a single message. Which statement best describes what happens architecturally?

✓ Correct. Native multimodality means image patches and text tokens share the same attention computation — the model reasons over both simultaneously, not sequentially.

✗ GPT-4o's unified architecture embeds image patches as tokens and runs them through the same attention layers as text — one forward pass, no captioning step, no separate pipelines.

Which of the following is something GPT-4o cannot do, despite being called an "omni" model?

✓ Correct. Image generation is handled by DALL-E, a separate diffusion model. GPT-4o understands images but cannot create them.

✗ GPT-4o cannot generate images — that requires DALL-E, which is a separate model. GPT-4o can see, hear, and reason across modalities, but image generation is not part of its architecture.

Lab 4: Synthesis and Integration

Apply and extend the concepts from this lesson through guided conversation with an AI assistant.

Use this lab to explore how the concepts from Lesson 4 apply to your own questions and interests. The AI assistant is here to help you think through complex scenarios.

Lab 4 Assistant AI Assistant

Module Test

15 questions covering all lessons — free, untracked, retake anytime.

Score: 0/15

In what year was the paper "Attention Is All You Need" published, and who wrote it?

✓ Correct. "Attention Is All You Need" was published in 2017 by eight Google Brain researchers, introducing the transformer architecture.

✗ The paper was published in 2017 by Google Brain researchers — it introduced the transformer by replacing recurrent networks with self-attention.

What is self-attention's key advantage over recurrent neural networks (RNNs) for training large language models?

✓ Correct. Parallel processing is the core advantage — RNNs process tokens one at a time, making large-scale GPU parallelism impossible.

✗ Self-attention processes all tokens simultaneously, unlike RNNs which process sequentially. This parallelism is what made training on trillions of tokens feasible.

GPT-4o uses a tokenizer called cl100k_base. Approximately how many tokens are in its vocabulary?

✓ Correct. The cl100k_base vocabulary has ~100,277 tokens, using byte-pair encoding (BPE) to represent sub-word units.

✗ GPT-4o's cl100k_base tokenizer has roughly 100,277 tokens — a sub-word vocabulary using byte-pair encoding.

What is the training objective during GPT's pre-training phase?

✓ Correct. Next-token prediction on a massive internet corpus is the self-supervised objective that causes the model to implicitly learn language, facts, and reasoning.

✗ Pre-training uses next-token prediction — a self-supervised objective where the model creates its own training signal from text structure, requiring no external labels.

What is InstructGPT, and why was it significant?

✓ Correct. InstructGPT (2022) proved that RLHF alignment could substitute for raw scale — a 1.3B parameter aligned model outperformed 175B raw GPT-3 in human preference ratings.

✗ InstructGPT was a smaller GPT-3-era model trained with RLHF. Human raters preferred it 85% of the time over the much larger raw GPT-3, proving alignment matters more than scale for perceived quality.

In the RLHF pipeline, what does the reward model do?

✓ Correct. The reward model is a separate neural network trained on human pairwise comparisons — it scores outputs so the language model can be optimized to produce highly-scored responses.

✗ The reward model is trained on human pairwise preference rankings and learns to score model outputs. The language model is then optimized (via PPO) to maximize those scores.

What does Proximal Policy Optimization (PPO) do in the RLHF pipeline?

✓ Correct. PPO is the RL algorithm used to update the LM weights toward higher reward scores, with a KL-divergence penalty to prevent reward hacking.

✗ PPO (Proximal Policy Optimization) fine-tunes the language model against the reward model's scores, with a constraint (KL penalty) keeping it from straying too far from the supervised baseline.

ChatGPT launched on November 30, 2022. How quickly did it reach 1 million users?

✓ Correct. ChatGPT reached 1 million users within 5 days of launch — faster than any consumer product in history to that point.

✗ ChatGPT hit 1 million users in approximately 5 days — and 100 million users within 2 months, making it the fastest consumer product ever to reach that milestone.

GPT-4o was released in May 2024. What does the "o" in its name stand for?

✓ Correct. "Omni" — GPT-4o was OpenAI's first model to natively process text, images, and audio in a single unified model without routing through separate systems.

✗ The "o" stands for omni. GPT-4o was OpenAI's first native omni model, handling text, image, and audio in one set of weights rather than through a pipeline of separate models.

What was the key architectural difference between GPT-4V (GPT-4 with vision) and GPT-4o in how they handle image input?

✓ Correct. GPT-4V used an adapter-based approach that translated images before the LM saw them; GPT-4o processes image patches directly alongside text tokens in native multimodal attention.

✗ The key difference is native vs. adapter-based multimodality. GPT-4V translated images via an adapter; GPT-4o embeds image patches as tokens and attends to them directly alongside text — no translation step.

GPT-4o's context window holds 128,000 tokens. Roughly how many English words does that correspond to?

✓ Correct. English averages roughly 0.75 words per token, so 128,000 tokens ≈ 96,000 words — about the length of a full novel.

✗ At roughly 0.75 words per token, 128K tokens equals approximately 96,000 words — the length of a full novel. This ratio varies by language and content type.

What information is lost when audio is routed through a speech-to-text model like Whisper before being processed by a language model?

✓ Correct. Transcription converts speech to text, discarding paralinguistic information — GPT-4o's native audio processing retains this because it attends to the audio spectrogram directly.

✗ Transcription preserves words but discards tone, pacing, and emotional affect. GPT-4o avoids this loss by attending to audio spectrograms directly rather than transcribing first.

Which statement correctly describes how RLHF's KL-divergence penalty works?

✓ Correct. The KL penalty constrains how much the RL-optimized model can diverge from the SFT baseline — this prevents reward hacking where the model finds degenerate high-scoring outputs.

✗ The KL-divergence penalty prevents the model from straying too far from the supervised baseline. Without it, PPO optimization can cause the model to "hack" the reward model with nonsensical outputs that score well.

What was GPT-4o's average audio response latency at launch, and what primarily drove that improvement over the prior 2.8-second voice pipeline?

✓ Correct. 232ms was the documented average latency at launch. The improvement came from one model doing what three previously did — eliminating handoff latency entirely.

✗ GPT-4o averaged ~232ms. The gain came from architectural unification: instead of Whisper transcribing, GPT-4 reasoning, and TTS synthesizing in sequence, one model handles all three in a single forward pass.

Which of the following capabilities does GPT-4o NOT have, despite being described as an "omni" model?

✓ Correct. Image generation requires DALL-E, a separate diffusion model. GPT-4o can see and reason about images but cannot create them.

✗ GPT-4o cannot generate images — that requires DALL-E, a separate diffusion model. "Omni" refers to its unified text/audio/image input and reasoning, not image generation capability.

Transformers, Tokens, and the Attention Revolution

Lesson 1 Quiz

Lab 1 — Tokens & Transformers

Investigate GPT-4o's Tokenization and Attention

Pre-training, RLHF, and the Making of ChatGPT

Lesson 2 Quiz

Lab 2 — RLHF in Action

Probe GPT-4o's RLHF-Shaped Responses

Multimodal Architecture: How GPT-4o Sees and Hears

Lesson 3 Quiz

Lab: Explore Lesson 3 Concepts

Your Task

GPT-4o in the Wild: From Architecture to Application

Lesson 4 Quiz

Lab 4: Synthesis and Integration

Module Test

Module Test Result