Eight researchers at Google Brain published a 31-page paper titled "Attention Is All You Need." They discarded the recurrent networks that had dominated NLP for a decade and replaced them with a single idea: instead of reading text left-to-right like a human, let the model attend to every word simultaneously and weigh how much each word matters to every other word. The Transformer architecture was born.
OpenAI read that paper and, within eighteen months, built GPT-1 on top of it. By the time GPT-4o launched in May 2024, the same core idea — scaled by orders of magnitude — was processing images, audio, and text inside a single unified model. The transformer did not just survive; it became the universal substrate of modern AI.
A transformer takes a sequence of tokens — chunks of text, pixels, or audio frames — and applies a mathematical operation called self-attention. Self-attention asks, for every token in the sequence: "How relevant is every other token to understanding this one?" The answer is a set of floating-point weights, computed in parallel across the entire sequence.
This parallelism is what made transformers faster to train than recurrent networks (RNNs), which had to process tokens one at a time. GPT-4 was trained on roughly 13 trillion tokens across thousands of GPUs simultaneously — something impossible with sequential architectures.
The result of self-attention feeds into a feed-forward network that applies learned transformations, and this cycle (attention → feed-forward) is stacked into "layers." GPT-4 is estimated to have around 120 such layers. Each layer refines the model's internal representation of what the text means.
GPT-4o does not read words — it reads tokens. OpenAI uses a byte-pair encoding (BPE) scheme called cl100k_base, which has roughly 100,277 tokens. Common words like "the" are a single token; rarer words get split. "Tokenization" becomes ["Token", "ization"]. Emojis and non-Latin scripts often cost more tokens per character.
This matters practically: GPT-4o's context window of 128,000 tokens (as of its May 2024 release) translates to roughly 96,000 words of English text — about the length of a full novel. Every inference call processes all tokens in the window simultaneously through the attention mechanism, which is why longer contexts cost more compute (attention scales as O(n²) with sequence length).
REAL BENCHMARK
On OpenAI's internal token-efficiency tests published in the GPT-4 Technical Report (March 2023), GPT-4 outperformed GPT-3.5 on the bar exam by 37 percentile points — scoring in the top 10% of human test-takers. The architecture was identical in type; what changed was scale and training data.
GPT-1 (2018): 117 million parameters, trained on BookCorpus (~4.5 GB). Proved the pre-train/fine-tune paradigm could work for NLP tasks.
GPT-2 (2019): 1.5 billion parameters. OpenAI famously staged its release over fears of misuse — the first time an AI lab treated a language model as potentially dangerous to release in full.
GPT-3 (2020): 175 billion parameters. Demonstrated few-shot learning: the model could perform tasks it had never been explicitly trained on, just from examples in the prompt.
GPT-4 (2023): Parameters undisclosed, but widely estimated at ~1.8 trillion in a mixture-of-experts (MoE) configuration. Added multimodal vision input. Scored 90th percentile on the bar exam.
GPT-4o (May 2024): "omni" — handles text, image, and audio natively in a single model rather than routing through separate systems. Response latency dropped to ~232ms average, approaching human conversational speed.
KEY INSIGHT
The transformer architecture has not fundamentally changed since 2017. What has changed is scale (parameters, data, compute) and the quality of alignment training applied afterward. GPT-4o's "magic" is less architectural novelty and more the result of extraordinary scaling discipline.
In this lab you'll probe how GPT-4o tokenizes unusual inputs, what happens at context window edges, and how the transformer's attention mechanism shapes what the model "notices" in a prompt. Ask questions about real tokenization behavior, count tokens in sample text, or explore what self-attention means in plain language.
Sam Altman's team had been quietly testing a chatbot interface for six days before deciding to release it publicly. They expected perhaps a million users in the first year. Within five days of launch, ChatGPT had one million users. Within two months, 100 million. It became the fastest consumer product to 100 million users in history, overtaking TikTok's record of nine months.
What made ChatGPT feel different from the GPT-3 API that had been available for two years? The answer was a training technique called Reinforcement Learning from Human Feedback — RLHF. The same model, made conversational and helpful, sparked a technological inflection point that no one, including OpenAI, fully anticipated.
GPT-4's pre-training involved a dataset orders of magnitude larger than GPT-3's. OpenAI has not disclosed the exact corpus, but the GPT-4 Technical Report confirms it included data from the internet through early 2023, filtered using proprietary classifiers. The training objective is simple: predict the next token. Given "The capital of France is," the model should assign high probability to "Paris."
This self-supervised objective, applied to trillions of tokens, causes the model to implicitly learn grammar, facts, reasoning patterns, coding conventions, and much more — because predicting the next word accurately requires understanding almost everything about language and knowledge. The model never receives a label or correct-answer signal; it creates its own training signal from the structure of text.
After pre-training, OpenAI's contractors (via Scale AI) wrote thousands of example conversations demonstrating ideal assistant behavior — helpful, harmless, honest responses. The model was then fine-tuned on these examples using standard supervised learning. This shifted the model from "next-token predictor" to "assistant trying to be helpful."
This step is why ChatGPT responds in a conversational tone and acknowledges its limitations. The base GPT-4 model, before SFT, would just continue text — ask it "What is 2+2?" and it might continue with "... a question that has stumped philosophers."
RLHF, originally described by OpenAI researchers in a 2017 paper and refined in the 2022 "Training language models to follow instructions with human feedback" paper (Ouyang et al.), involves two sub-steps:
Reward Model Training: Human raters compared pairs of model outputs and indicated which was better. These preferences trained a separate neural network — the reward model — to predict human preference scores for any given output.
PPO Optimization: The language model was then fine-tuned using Proximal Policy Optimization (PPO), a reinforcement learning algorithm, to maximize the reward model's score. Crucially, a KL-divergence penalty kept the model from drifting too far from the SFT baseline (preventing "reward hacking" where it learns to game the reward model with nonsense).
InstructGPT — the precursor to ChatGPT — showed in OpenAI's 2022 paper that a 1.3 billion parameter model trained with RLHF was preferred by human raters over a 175 billion parameter raw GPT-3 model. Alignment, not just scale, determined perceived quality.
DOCUMENTED RESULT
OpenAI's InstructGPT paper (Ouyang et al., 2022) found that labelers preferred InstructGPT outputs over GPT-3 outputs 85% of the time, even though InstructGPT was 100× smaller by parameter count. This was the empirical proof that RLHF could substitute for scale in producing useful outputs.
For GPT-4o, OpenAI extended this pipeline to audio and vision. Instead of separate encoders that translate images or audio into text tokens first, GPT-4o was trained end-to-end on interleaved text, image, and audio data. This means its attention layers directly attend to audio spectrograms and image patches alongside text tokens — no translation step, no latency from pipeline handoffs.
The result: GPT-4o's average audio response latency at launch was 232 milliseconds, compared to 2.8 seconds for the previous voice pipeline that routed through Whisper (speech-to-text) → GPT-4 → TTS. That 12× latency reduction came from architectural unification, not hardware.
PRACTICAL TAKEAWAY
When GPT-4o refuses a request or adds caveats, that behavior was shaped during RLHF — human raters scored cautious responses higher in certain categories. Understanding this helps you prompt around it: framing tasks as professional, legitimate, and specific tends to shift the reward-model-trained behavior toward compliance.
In this lab you'll explore how RLHF influences GPT-4o's refusals, caveats, and tone. Try different framings of the same request to see how professional or specific context shifts the model's output. Discuss why the reward model was trained to add hedges, and how to legitimately reduce unnecessary hedging.
OpenAI CTO Mira Murati and researchers demonstrated GPT-4o live on a laptop — no special hardware — solving a linear equation written by hand on paper, coaching a nervous speaker in real time by reacting to vocal tone, and switching languages mid-conversation. The model reacted with vocal affect: laughing, sighing, expressing mock impatience when interrupted. None of this used separate models. It was one model, one forward pass.
The technical press noted that GPT-4o's audio understanding was qualitatively different from the previous pipeline: it could now hear that a speaker was anxious, not just transcribe their words. That emotional register was information encoded in the audio spectrogram — and the model had learned to attend to it.
Before GPT-4o, OpenAI's voice mode worked as a three-stage pipeline:
Whisper (ASR model) transcribed audio to text → GPT-4 processed the text → a TTS model synthesized speech. Each handoff introduced latency and lost information — tone, pacing, emphasis, emotional register were all discarded at the Whisper transcription step.
GPT-4o eliminates those handoffs. It processes audio spectrograms — visual representations of sound frequency over time — as patches, similar to how it processes image regions. These audio patches are tokenized and fed directly into the same attention layers that process text tokens. The model attends to audio and text simultaneously, in the same forward pass.
GPT-4o's vision processing builds on CLIP-style image encoding. An input image is divided into fixed-size patches — typically 14×14 pixel regions — each of which is projected into the same embedding space as text tokens. A 512×512 image becomes roughly 1,365 image tokens.
This is why GPT-4o's context window cost scales with image size: a high-resolution image can consume hundreds or thousands of tokens, competing with text for attention capacity. OpenAI's system resizes images based on detail level requested — "low" detail mode uses 85 tokens; "high" detail mode tiles the image and can use up to 1,700+ tokens per image.
The model's vision capability was evaluated on MMMU (Massive Multidisciplinary Multimodal Understanding), where GPT-4o scored 69.1% at release — above human experts in several subcategories including medical imaging and scientific diagrams.
TECHNICAL DETAIL — DOCUMENTED
GPT-4o's "high detail" image mode tiles a resized copy of the image into 512×512 blocks and processes each block separately, then combines representations. This is why asking GPT-4o to read small text in a high-resolution screenshot is possible in high detail mode but fails in low detail mode — the tiling creates enough token resolution to resolve individual characters.
Despite its capabilities, GPT-4o has documented multimodal limitations. It cannot generate images (DALL-E is a separate model). Its audio output, while expressive, cannot reproduce specific voices reliably. It struggles with spatial reasoning in images — knowing that object A is "to the left of" object B in complex scenes with many objects.
OpenAI's System Card for GPT-4o (May 2024) also documents that the model was restricted from certain audio behaviors at launch — including generating audio that closely imitates real individuals' voices — pending further safety evaluation. Some emotional vocal expressiveness demonstrated in the demo was disabled in the initial public API release.
PROMPT IMPLICATION
When submitting images to GPT-4o, specify "use high detail mode" for tasks requiring fine text or spatial precision. For broad scene understanding, low detail is faster and cheaper. The model doesn't automatically choose — it defaults to a heuristic based on image size unless instructed.
Use the AI below to explore Lesson 3 concepts in depth. Challenge assumptions and work through scenarios.
By the time GPT-4o launched in May 2024, the technology press was focused on its emotional voice and real-time camera demos. But the more significant story was architectural: for the first time, the same weights that processed your typed question also heard the tone in your voice and read the expression in your photograph. What the model "knew" about one modality could inform its reasoning in another.
Understanding why GPT-4o behaves the way it does — why it hedges, how it handles ambiguous images, why long conversations sometimes drift — requires connecting the transformer mechanics from Lesson 1, the RLHF pipeline from Lesson 2, and the multimodal architecture from Lesson 3. This lesson makes those connections explicit.
The self-attention mechanism is not merely an internal detail — it has direct practical consequences. Because attention is computed across all tokens simultaneously, GPT-4o can resolve pronouns, track argument threads, and maintain coherent reasoning across a 128,000-token context. A model built on recurrent networks would degrade over long sequences; the transformer does not.
This enables use cases that would have been impossible five years ago: reading an entire legal contract and answering questions about specific clauses; ingesting a codebase and explaining a function's interaction with the rest of the system; maintaining a coherent medical case history across a multi-hour diagnostic conversation. The 128K context is not a marketing number — it is the attention mechanism scaled to its current practical limit.
Tokenization also has practical implications. GPT-4o's ~100K vocabulary means it handles code, scientific notation, and multiple languages more efficiently than models with smaller vocabularies. But it also means unusual inputs — strings of random characters, highly technical jargon, or rare non-Latin scripts — may tokenize into many small pieces, consuming context faster than expected and sometimes degrading coherence.
RLHF is the invisible hand behind GPT-4o's communication style. Human raters during training consistently scored responses higher when they acknowledged uncertainty, provided caveats on sensitive topics, and structured complex answers with headers or bullet points. Those preferences became baked into the reward model — and now emerge automatically in GPT-4o's outputs.
This has practical implications in both directions. GPT-4o's tendency to add disclaimers ("I'm not a doctor, but...") is not a capability limitation — it is a trained preference. Framing requests with professional context ("as a licensed pharmacist, I need...") shifts the reward-model-influenced distribution toward directness, because such framing was associated with appropriate expert queries during annotation.
RLHF also explains GPT-4o's sycophancy risk: human raters tend to prefer responses that agree with them, so RLHF-trained models can subtly lean toward validation over correction. OpenAI has applied additional training to counteract this, but the tension between human approval and factual accuracy is a structural consequence of the training pipeline — not a bug that can be fully eliminated by prompting.
PRACTICAL IMPLICATION
When GPT-4o gives you an answer you doubt, press it explicitly: "What evidence supports this?" or "What would contradict your conclusion?" RLHF-trained models respond strongly to direct intellectual challenge because raters consistently scored self-correcting, evidence-citing responses higher than confident assertions.
Native multimodality — processing image patches and audio spectrograms in the same attention layers as text — enables genuinely new application categories. A doctor can photograph a skin lesion and describe symptoms in the same message, and GPT-4o reasons over both simultaneously rather than captioning the image and then reasoning about the caption. A language tutor can hear a student's accent, see their written exercise, and provide integrated feedback on pronunciation and grammar in one response.
But it is worth being precise about what GPT-4o's multimodality does not include. It cannot generate images — that requires DALL-E, a separate diffusion model. Its video understanding at launch was limited to still frames from video, not continuous temporal reasoning over motion. And its audio output capabilities were significantly restricted at launch compared to the demo — certain emotional range and voice-matching features were held back pending safety review.
The practical gap between "omni model" and "replaces all specialized tools" is significant. GPT-4o is better understood as a powerful reasoning layer that can accept richer inputs than before — not as a replacement for purpose-built vision, speech, or generation models in production pipelines.
MODULE SYNTHESIS
GPT-4o's capabilities are a product of three compounding factors: the transformer's ability to attend across long, mixed-modality sequences (Lesson 1); RLHF training that shaped its communication style and risk calibration (Lesson 2); and architectural unification that eliminated the pipeline latency and information loss of the old voice and vision systems (Lesson 3). Understanding any one factor in isolation explains only part of the picture.
Apply and extend the concepts from this lesson through guided conversation with an AI assistant.
Use this lab to explore how the concepts from Lesson 4 apply to your own questions and interests. The AI assistant is here to help you think through complex scenarios.
15 questions covering all lessons — free, untracked, retake anytime.