Arthur C. Clarke famously said that any sufficiently advanced technology is indistinguishable from magic. The corollary is less often quoted: to the people who built it, it's never magic. They know where the seams are, where the brittle parts are, where the miracles are, and where the obvious failures that nobody's fixed yet still live.
Large language models are exactly that kind of technology. To a casual user, they're shocking — they write code, explain jokes, summarize books, roleplay characters, debug errors, and sometimes hallucinate with complete confidence. To someone who understands the mechanics, all of it is expected: tokenization, embeddings, transformer attention, training data, sampling, alignment, each explaining a specific part of what the system does and doesn't do.
This course makes the magic legible. You will leave knowing how an LLM represents language, how it is trained, why it hallucinates, what alignment is actually doing, what context windows really are, and which architectural choices make Claude, GPT, and Gemini behave differently. You won't be able to build a frontier model after this, but you'll know enough about what one is to use, evaluate, and reason about these systems with real judgment.
In the summer of 2017, a team at Google Brain and Google Research posted a paper to arXiv titled "Attention Is All You Need." Its premise was radical: throw out recurrence entirely. No LSTMs. No GRUs. Just attention mechanisms stacked together. Within six years, virtually every frontier language model, including GPT-4, Claude, Gemini, and LLaMA, would be built on the architecture they described.
The eight authors — Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin — were not proposing a small improvement. They were proposing that the field had been solving the sequence problem with the wrong fundamental tool for nearly a decade.
Prior to 2017, the dominant approach to sequence modeling was the recurrent neural network (RNN) and its gated variants — LSTMs (introduced by Hochreiter & Schmidhuber in 1997) and GRUs (Cho et al., 2014). These networks processed text one token at a time, left to right, maintaining a hidden state that accumulated context.
The fundamental limitation was sequential computation: to process token 512, you had to first process tokens 1 through 511. This made training on long documents extremely slow and made it nearly impossible to fully exploit modern parallel hardware like GPUs. Information from early tokens also tended to fade — the "vanishing gradient" problem meant that even LSTMs struggled to relate a pronoun at position 400 to its referent at position 12.
Researchers had already begun bolting attention mechanisms onto encoder-decoder RNNs for machine translation — Bahdanau et al.'s 2015 paper "Neural Machine Translation by Jointly Learning to Align and Translate" showed attention could dramatically improve translation quality. But attention was treated as an add-on, not the whole architecture.
On the WMT 2014 English-to-German translation benchmark, the original Transformer model achieved 28.4 BLEU — a new state of the art at the time — while training in 3.5 days on 8 P100 GPUs, compared to weeks for comparable recurrent models. The parallel training advantage was not marginal; it was transformative.
The Transformer's key conceptual move was to ask: what if, instead of passing a hidden state through time, we let every token directly attend to every other token in a single parallel operation? This is self-attention: a mechanism that computes, for each token, a weighted sum of the representations of all tokens in the sequence (itself included), where the weights express relevance.
The word "bank" in "river bank" needs to know about "river" to be interpreted correctly. Self-attention lets "bank" look directly at "river" in a single step, regardless of the distance between them. No recurrence. No vanishing gradient across long distances. And crucially, every token can do this simultaneously — the computation parallelizes across all positions at once.
This single insight, fully implemented, produced a model that was faster to train, scaled better with data, and handled long-range dependencies more reliably than anything before it.
Most of the AI tools you use today, including ChatGPT, Claude, Gemini, and Copilot, run on the Transformer architecture or a direct descendant. Understanding the original design means understanding the foundation beneath most of modern AI.
You have a direct line to an AI tutor specialized in Transformer history and architecture foundations. Use it to deepen your understanding of why the 2017 paper was so significant — and what problems it actually solved.
Before a Transformer can process a single word, that word must be transformed into something mathematics can operate on. The journey from the string "The quick brown fox" to the first computation inside the model involves three distinct transformations — tokenization, embedding, and positional encoding — each solving a specific problem.
Text is not fed character-by-character or word-by-word into modern Transformers. Instead, it is split into tokens — subword units produced by algorithms like Byte Pair Encoding (BPE, used in GPT models) or WordPiece (used in BERT). These algorithms were developed to balance vocabulary size against coverage of rare words.
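To make the merging idea behind BPE concrete, here is a minimal sketch of BPE training on a toy corpus. The corpus, the function names (count_pairs, merge_pair, learn_bpe), and the tie-breaking rule are illustrative assumptions; production tokenizers work on bytes, train on far larger corpora, and differ in many details.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules, starting from single characters."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = learn_bpe("low low low lower lowest", num_merges=3)
print(merges)
print(words)
```

On this toy corpus the frequent string "low" ends up as a single symbol after two merges, while the rarer "lower" and "lowest" remain split into several pieces. That is the vocabulary-size versus rare-word trade-off in miniature.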
OpenAI's GPT-4 uses a tokenizer called cl100k_base, which has a vocabulary of roughly 100,000 tokens. The word "unhappiness" might be tokenized as ["un", "happiness"], which is two tokens. The word "cat" is a single token. An emoji might be 2–3 tokens. On average, one token corresponds to roughly 0.75 English words.
This matters practically: GPT-4 Turbo's context window of 128,000 tokens corresponds to roughly 96,000 words (128,000 × 0.75), about the length of a full novel. Every token, not every word, consumes part of that window.
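The conversion between tokens and words is simple arithmetic, sketched below. The helper names are hypothetical, and the 0.75 ratio is only the rough English-prose average quoted above; code, math, and other languages can deviate substantially.

```python
def estimate_words(num_tokens, words_per_token=0.75):
    """Rough number of English words that fit in a token budget."""
    return round(num_tokens * words_per_token)

def estimate_tokens(num_words, words_per_token=0.75):
    """Rough number of tokens needed for a given word count."""
    return round(num_words / words_per_token)

print(estimate_words(128_000))   # words in a 128,000-token window
print(estimate_tokens(96_000))   # tokens needed for a 96,000-word text
```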
When OpenAI released its tokenizer library tiktoken in 2022, developers could see directly that code is tokenized very differently from prose. Python's print("hello") is 14 characters but splits into several tokens, because punctuation and quotes often become tokens of their own. Dense mathematical notation can tokenize extremely inefficiently, which partly explains why math was historically harder for LLMs than prose.
Each token ID is mapped to a high-dimensional vector through an embedding matrix. In the original Transformer, this vector had 512 dimensions; in the largest GPT-3 model, it has 12,288. These vectors are learned during training: through gradient descent, the model places tokens in embedding space so that geometric relationships reflect meaning. The famous demonstration comes from Word2Vec (Mikolov et al., 2013), where "king − man + woman ≈ queen."
Embeddings encode semantic similarity geometrically. Words used in similar contexts end up with similar vectors. This is not programmed — it emerges from the training objective of predicting the next token accurately.
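A small sketch can show both the lookup and the geometry. The vocabulary, the three-dimensional vectors, and the helper names below are made up for illustration; real embeddings are learned, have hundreds or thousands of dimensions, and their axes do not have clean labels like the ones assumed here.

```python
import math

# Hypothetical vocabulary: token string -> token ID.
VOCAB = {"king": 0, "queen": 1, "man": 2, "woman": 3, "apple": 4}

# Hypothetical embedding matrix: row i is the vector for token ID i.
# Hand-picked axes (an assumption): [royalty, maleness, fruitiness].
EMBEDDING_MATRIX = [
    [0.9, 0.8, 0.0],   # king
    [0.9, -0.8, 0.0],  # queen
    [0.1, 0.8, 0.0],   # man
    [0.1, -0.8, 0.0],  # woman
    [0.0, 0.0, 1.0],   # apple
]

def embed(token):
    """Look up a token's vector by its ID."""
    return EMBEDDING_MATRIX[VOCAB[token]]

def cosine(a, b):
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(vector, exclude=()):
    """Return the vocabulary token whose embedding is most similar to `vector`."""
    candidates = [token for token in VOCAB if token not in exclude]
    return max(candidates, key=lambda token: cosine(vector, embed(token)))

analogy = [k - m + w for k, m, w in zip(embed("king"), embed("man"), embed("woman"))]
result = nearest(analogy, exclude=("king", "man", "woman"))
print(result)
```

With these hand-built vectors the analogy lands on "queen" by construction. In a trained model the same arithmetic works only approximately, and only for some relationships.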
Here is the problem: self-attention, as described, has no notion of sequence. Strictly speaking it is permutation-equivariant: shuffle the input tokens and each token still receives exactly the same output vector, just in the shuffled position. Without extra information, "dog bites man" and "man bites dog" are the same bag of tokens to the raw attention computation. This would be catastrophic for language, where word order carries meaning.
The original Transformer paper solved this with sinusoidal positional encoding: a set of sine and cosine functions at different frequencies, added directly to the embedding vectors before any processing. Each position gets a unique positional signal injected into its representation.
The formula uses sin(pos/10000^(2i/d)) for even dimensions 2i and cos(pos/10000^(2i/d)) for odd dimensions 2i+1, where pos is the token position and d is the embedding dimension. The authors chose sinusoids because, for any fixed offset k, the encoding of position pos+k is a linear function of the encoding of position pos, which they hypothesized would make it easy to attend by relative position. They also suggested sinusoids might let the model extrapolate to sequences longer than those seen during training.
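The sinusoidal formula can be sketched directly. The function names below are illustrative, and the code assumes an even embedding dimension. The second function checks the linearity property: each (sin, cos) pair at position pos + k is a fixed rotation of the pair at position pos.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position (d_model assumed even)."""
    encoding = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle))  # dimension 2i
        encoding.append(math.cos(angle))  # dimension 2i + 1
    return encoding

def shift_encoding(encoding, k, d_model):
    """Derive the encoding of pos + k from the encoding of pos alone.

    Uses the angle-addition identities, so the result is a linear
    function of the input encoding for any fixed offset k.
    """
    shifted = []
    for i in range(d_model // 2):
        s, c = encoding[2 * i], encoding[2 * i + 1]
        delta = k / (10000 ** (2 * i / d_model))
        shifted.append(s * math.cos(delta) + c * math.sin(delta))
        shifted.append(c * math.cos(delta) - s * math.sin(delta))
    return shifted

direct = positional_encoding(8, d_model=8)
derived = shift_encoding(positional_encoding(5, d_model=8), k=3, d_model=8)
print(max(abs(a - b) for a, b in zip(direct, derived)))  # tiny rounding error only
```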
Later models, including GPT-2 and BERT, switched to learned positional embeddings — simply trainable parameters for each position, letting the model discover whatever positional signal works best. More recent architectures like LLaMA use Rotary Position Embeddings (RoPE), which encode relative rather than absolute position within the attention computation itself.
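The relative-position property of RoPE can be checked with a short sketch. It rotates adjacent pairs of dimensions, which is one common convention; some implementations pair dimensions differently. The function names, the base of 10000, and the toy vectors are assumptions made for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to `pos`."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (query two positions after key), different absolute positions.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 12), rope(k, 10))
print(score_near, score_far)
```

The two scores agree up to rounding error, because rotating the query by m and the key by n leaves a dot product that depends only on n − m. That is the sense in which RoPE encodes relative position inside the attention computation.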
Understanding tokenization explains real AI behavior: why models sometimes "count" letters wrong (they never see individual characters), why code and math can confuse models (inefficient tokenization), and why context windows are measured in tokens, not words.
Explore the practical consequences of tokenization and positional encoding with your AI tutor. Many puzzling LLM behaviors — letter-counting errors, math struggles, context length limits — trace back to these input representations.
The self-attention mechanism is arguably the most important algorithm in contemporary AI. Understanding it — not just knowing it exists, but understanding how it actually computes — unlocks the ability to reason about what language models can and cannot do, why they sometimes fail, and how they are being extended by new research.
Each token's embedding is projected into three separate vectors through learned linear transformations: a Query (Q), a Key (K), and a Value (V). These names come from a loose analogy with database retrieval: the Query is what you're looking for, Keys are what's available to match against, and Values are the actual information retrieved.
For each token, attention is computed as follows: take the token's Query vector and compute its dot product with the Key vector of every token in the sequence, including the token itself. This produces a raw score expressing how relevant each token is. Divide by √d_k (the square root of the key dimension); this scaling prevents the dot products from becoming so large that the softmax function produces near-zero gradients. Apply softmax to normalize these scores into weights that are positive and sum to 1. Finally, compute a weighted sum of all Value vectors using these weights.
The result: each token's new representation is a blend of the Value vectors of all tokens in the sequence, weighted by relevance. In a single layer, "bank" in "river bank" can already incorporate strong signal from "river."
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
This is the entire self-attention computation. Every frontier language model runs billions of these operations per forward pass. The elegance is real: it is a differentiable, parallelizable lookup.
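The formula translates almost line for line into code. The sketch below uses plain Python lists so that every step is visible; real implementations use batched matrix operations on accelerators. The toy Q, K, and V values are arbitrary assumptions, and the function names are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Score this query against every key, then scale by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        output = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = attention(Q, K, V)
print(weights)
print(outputs)
```

Each row of weights sums to 1, and each query puts more weight on the key it matches, so each output is pulled toward the corresponding value vector without ignoring the other one.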
A single attention computation can only capture one kind of relationship at a time. The original Transformer used multi-head attention: run the Q, K, V projection and attention computation h times in parallel, each with different learned projection matrices. Concatenate the results and project them back down to the model dimension.
In the original paper, the base model used h=8 heads with d_k=64, so that h × d_k equals the model dimension of 512 and the total cost stays close to that of a single full-width head. Why several heads? Because different heads learn to attend to different types of relationships. Research by Voita et al. (2019) at Yandex analyzed trained Transformer heads in a translation model and found that the most important heads specialized: some tracked syntactic dependencies, others attended to neighboring positions, and others attended to rare tokens.
This is not programmed in — it emerges from training. The multi-head structure gives the model enough capacity to simultaneously represent multiple types of inter-token relationships.
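A minimal sketch of the multi-head pattern follows: project, attend per head, concatenate, project back. The weights are random stand-ins for learned parameters, the sizes (d_model = 8, h = 2) are shrunk for readability, and all names are illustrative assumptions.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned projection matrix."""
    return [[rng.gauss(0, 1 / math.sqrt(rows)) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    # Each head has its own Q, K, V projections and attends independently.
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the head outputs token by token, then project back to d_model.
    concat = [[value for head in head_outputs for value in head[t]] for t in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
d_model, h = 8, 2
d_k = d_model // h
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_o = random_matrix(h * d_k, d_model, rng)
X = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]  # 3 tokens
Y = multi_head_attention(X, heads, W_o)
print(len(Y), len(Y[0]))
```

The output has the same shape as the input (one d_model-sized vector per token), which is what lets attention layers be stacked.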
Researchers at Stanford and Facebook AI Research (Clark et al., 2019, "What Does BERT Look At?") visualized attention patterns in trained BERT models. They found that certain heads consistently attended to specific linguistic structures: delimiter tokens like [SEP], the next/previous token, and words in specific syntactic relationships. One head tracked direct objects of verbs with high accuracy across a range of sentences, and another tracked coreference (which "it" refers to).
This analysis was important because it demonstrated that Transformers were not black boxes in the sense of being completely opaque — the attention weights provide a partial window into what the model is computing, though interpreting attention weights as "what the model uses" remains an active research debate (Jain & Wallace, 2019, argued attention is not explanation).
The scaling factor √d_k in the attention formula is easy to overlook but critical. Without it, dot products grow with d_k and the softmax saturates: nearly all probability mass lands on one token, and gradients become close to zero, which stalls learning. The original paper introduced the scaling for exactly this reason.
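The saturation effect is easy to observe numerically. If the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance d_k, so raw scores are typically on the order of √d_k apart. The sketch below uses random vectors with d_k = 512 and a fixed seed; the specific numbers are arbitrary assumptions, and only the comparison matters.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
d_k = 512
query = [rng.gauss(0, 1) for _ in range(d_k)]
keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(4)]

raw_scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("largest weight without scaling:", max(raw_weights))
print("largest weight with scaling:   ", max(scaled_weights))
```

Dividing every score by the same positive constant greater than 1 can only flatten the softmax, so the largest unscaled weight is always at least as large as the largest scaled weight. With high-dimensional random vectors the unscaled distribution typically collapses onto one key.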
The self-attention formula is deceptively compact. In this lab, push your understanding of how it actually works: what the three projections represent, why multi-head attention matters, and what interpretability research has revealed about what trained attention heads do.
Attention handles routing: it decides which information from which tokens is relevant and mixes it together. But attention alone cannot store or apply knowledge. The feed-forward layers — often overlooked in popular explanations — are where the model's learned factual associations actually live. Research by Geva et al. (2021) at Tel Aviv University demonstrated that feed-forward layers in Transformers function as key-value memories, with each neuron encoding specific input patterns and their associated outputs.
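Every Transformer encoder layer, and every layer of a decoder-only model such as GPT, contains two sublayers: the multi-head self-attention sublayer (which we covered in Lesson 3) and a position-wise feed-forward network (FFN). Decoder layers in the original encoder-decoder design add a third sublayer that attends to the encoder output. The FFN is applied independently and identically to each token position; there is no mixing between positions here.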
The original architecture used a two-layer fully-connected network with a ReLU activation: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. The inner layer dimension was 4× the model dimension — so in the base model with d_model=512, the FFN expanded to 2048, then contracted back. This expansion-contraction pattern persists across virtually all Transformer variants.
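The FFN formula can be sketched for a single token position as follows. The tiny hand-written weights (d_model = 2, inner dimension 4) are assumptions chosen so the arithmetic can be followed by hand; real weights are learned.

```python
def linear(x, W, b):
    """Compute xW + b for a row vector x and a matrix W stored as rows."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, xW1 + b1)W2 + b2, applied to one token position."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # expand, then ReLU
    return linear(hidden, W2, b2)                      # contract back to d_model

# Toy weights: expand 2 -> 4, then contract 4 -> 2.
W1 = [[1.0, 0.0, -1.0, 0.0],
      [0.0, 1.0, 0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 0.0],
      [0.0, 1.0]]
b2 = [0.5, 0.5]

y = ffn([1.0, -2.0], W1, b1, W2, b2)
print(y)
```

For the input [1.0, -2.0] the hidden layer is [1, 0, 0, 2] after the ReLU, which gives the output [1.5, 2.5]. The same function would be applied separately to every token position.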
Many modern models replace this with a gated variant called SwiGLU (Shazeer, 2020), in which one linear projection of the input, passed through a Swish activation, multiplies a second linear projection elementwise before the output projection. PaLM, LLaMA, Mistral, and many other open models use SwiGLU or similar gated FFN variants, which have been found to improve quality at the same parameter count.
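A gated FFN of the SwiGLU kind can be sketched as follows, following the form (Swish(xW) ⊙ xV)W₂ from Shazeer (2020) without bias terms. The parameter names W_gate, W_up, and W_down are a naming assumption, and the toy weights are arbitrary.

```python
import math

def swish(z):
    """Swish activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def vecmat(x, W):
    """Row vector times matrix (matrix stored as a list of rows)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [swish(g) for g in vecmat(x, W_gate)]  # gating branch
    up = vecmat(x, W_up)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise product
    return vecmat(hidden, W_down)                 # project back to d_model

W_gate = [[1.0, 0.0], [0.0, 1.0]]
W_up = [[1.0, 0.0], [0.0, 1.0]]
W_down = [[1.0], [1.0]]

out = swiglu_ffn([2.0, 0.0], W_gate, W_up, W_down)
print(out)
```

Compared with the ReLU version, the gate lets the network scale each hidden unit smoothly instead of only switching it on or off, at the cost of a third weight matrix.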
Geva et al. (2021), "Transformer Feed-Forward Layers Are Key-Value Memories," showed that individual neurons in FFN layers respond to specific input patterns (keys) and promote specific output tokens (values). As an illustration, a neuron that fires strongly for "Paris is the capital of" would tend to amplify the probability of "France" in the output. This is empirical evidence that much of a model's factual knowledge is stored in FFN weights rather than attention weights.
Around each sublayer, the original Transformer applied Layer Normalization (Ba et al., 2016). Layer Norm normalizes the activations across the feature dimension (not the batch dimension, as Batch Norm does), then applies learned scale and shift parameters. This stabilizes training by keeping the scale of activations consistent from layer to layer. The usual motivation is reducing internal covariate shift, the phenomenon where the distribution of activations changes unpredictably during training, making each layer adapt to a moving target.
The original paper used post-layer norm (norm applied after the residual addition). Most modern models, from GPT-2 onwards, switched to pre-layer norm (norm applied before each sublayer). This change, analyzed in depth by Xiong et al. (2020) and Liu et al. (2020), substantially improves training stability and can reduce or remove the need for learning-rate warmup. LLaMA and most 2023+ open models use RMSNorm, a simplified normalization that drops the mean-centering step and is slightly faster.
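The two normalizations differ by only a few operations, as the sketch below shows for a single token's feature vector. The function names and the epsilon value are assumptions; gamma and beta stand in for the learned scale and shift parameters.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Subtract the mean, divide by the standard deviation, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """Divide by the root mean square; no mean-centering and no shift."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4

ln = layer_norm(x, ones, zeros)
rn = rms_norm(x, ones)
print(ln)  # mean close to 0, variance close to 1
print(rn)  # mean square close to 1, mean not removed
```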
Both sublayers (attention and FFN) use residual connections — the input to each sublayer is added back to its output: output = sublayer(x) + x. This pattern, borrowed from ResNets (He et al., 2015), allows gradients to flow directly back to early layers without being multiplied through every transformation in the stack. Without residuals, very deep networks fail to train; with them, 96-layer models (GPT-3 has 96 layers) train stably.
The residual connection also means each layer is learning a residual function — the difference from the identity. If a layer learns nothing, it can simply output zeros and pass the input unchanged. This initialization-friendly property makes deep Transformers surprisingly robust to the specific initialization of individual layers.
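The sketch below shows a residual connection and contrasts the two placements of the normalization step. The helper names are hypothetical, the normalization omits learned parameters for brevity, and zero_sublayer stands in for a layer that has learned nothing.

```python
import math

def normalize(x, eps=1e-5):
    """Layer norm without learned scale and shift (simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def post_ln(x, sublayer):
    """Original Transformer ordering: norm(x + sublayer(x))."""
    return normalize(add(x, sublayer(x)))

def pre_ln(x, sublayer):
    """Pre-norm ordering: x + sublayer(norm(x))."""
    return add(x, sublayer(normalize(x)))

def zero_sublayer(x):
    """A sublayer that contributes nothing."""
    return [0.0] * len(x)

x = [1.0, 2.0, 4.0]
print(pre_ln(x, zero_sublayer))   # input passes through unchanged
print(post_ln(x, zero_sublayer))  # input is still renormalized
```

With pre-layer norm the residual path is a pure identity, so a layer that outputs zeros leaves its input untouched. With post-layer norm every layer rescales the stream even when the sublayer adds nothing.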
A complete Transformer decoder block (as used in GPT models) processes input x through:
1. Layer Norm → Masked Multi-Head Self-Attention → add residual
2. Layer Norm → Feed-Forward Network → add residual
This block is stacked N times (12 in GPT-2 small, 96 in GPT-3). The output of the final block goes through a final Layer Norm, then a linear projection to vocabulary size, then a softmax to produce next-token probabilities. Together with the token and position embeddings at the input, that is the core architecture of GPT-style models.
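The two steps of the block can be put together in a compact sketch. To stay short it makes several simplifying assumptions that real models do not: a single attention head with d_k = d_model, no attention output projection, no biases, no learned norm parameters, and random weights in place of trained ones. All names are illustrative.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(X, W_q, W_k, W_v):
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    out = []
    for t, q in enumerate(Q):
        # Causal mask: position t may only look at positions 0..t.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k) for j in range(t + 1)]
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(t + 1)) for c in range(len(V[0]))])
    return out

def ffn(X, W1, W2):
    hidden = [[max(0.0, h) for h in row] for row in matmul(X, W1)]
    return matmul(hidden, W2)

def decoder_block(X, p):
    # Step 1: Layer Norm -> masked self-attention -> add residual.
    attn = causal_attention([layer_norm(x) for x in X], p["W_q"], p["W_k"], p["W_v"])
    X = [[a + b for a, b in zip(x, h)] for x, h in zip(X, attn)]
    # Step 2: Layer Norm -> feed-forward network -> add residual.
    ff = ffn([layer_norm(x) for x in X], p["W1"], p["W2"])
    return [[a + b for a, b in zip(x, h)] for x, h in zip(X, ff)]

def init_params(d_model, rng):
    def mat(rows, cols):
        return [[rng.gauss(0, 1 / math.sqrt(rows)) for _ in range(cols)] for _ in range(rows)]
    return {"W_q": mat(d_model, d_model), "W_k": mat(d_model, d_model),
            "W_v": mat(d_model, d_model),
            "W1": mat(d_model, 4 * d_model), "W2": mat(4 * d_model, d_model)}

rng = random.Random(0)
layers = [init_params(4, rng) for _ in range(2)]                 # N = 2 blocks
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]      # 3 tokens, d_model = 4

Y = X
for params in layers:
    Y = decoder_block(Y, params)

# Change only the last token and rerun: earlier positions must not change.
X_changed = X[:2] + [[v + 1.0 for v in X[2]]]
Y_changed = X_changed
for params in layers:
    Y_changed = decoder_block(Y_changed, params)
print(Y[0] == Y_changed[0], Y[2] == Y_changed[2])
```

The final check illustrates what the mask buys: because each position only attends to itself and earlier positions, editing the last token leaves the outputs for earlier tokens untouched, which is what makes left-to-right generation consistent.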
If you want to understand where a language model stores a fact (e.g., "Paris is the capital of France"), it is largely in the FFN weights distributed across layers. If you want to understand how it routes and combines information in context, look to attention. Both components are necessary; neither is sufficient alone.
You've now seen all the components of the Transformer: tokenization, embeddings, positional encoding, self-attention, feed-forward layers, layer norm, and residual connections. In this final lab, work with your tutor to connect them into a coherent picture — and begin thinking about how this architecture scales.