How Large Language Models Work · Introduction

Magic you understand is a tool.

LLMs look like magic until you see the mechanics. Then they become something you can use with judgment.

Arthur C. Clarke famously said that any sufficiently advanced technology is indistinguishable from magic. The corollary is less often quoted: to the people who built it, it's never magic. They know where the seams are, where the brittle parts are, where the miracles are, and where the obvious failures that nobody's fixed yet still live.

Large language models are exactly that kind of technology. To a casual user, they're shocking — they write code, explain jokes, summarize books, roleplay characters, debug errors, and sometimes hallucinate with complete confidence. To someone who understands the mechanics, all of it is expected: tokenization, embeddings, transformer attention, training data, sampling, alignment, each explaining a specific part of what the system does and doesn't do.

This course makes the magic legible. You will leave knowing how an LLM actually represents language, how it's trained, why it hallucinates, what alignment is actually doing, what context windows really are, and which architectural choices make Claude, GPT, and Gemini behave differently. You won't be able to build a frontier model after this, but you'll know enough about what one is to use, evaluate, and reason about such models with real judgment.

Module 1 · Lesson 1

The Paper That Changed Everything

From RNNs to "Attention Is All You Need" — the 2017 breakthrough that made modern AI possible

Why did eight Google researchers abandon the dominant sequence model of the era and bet everything on attention?

In the summer of 2017, a team at Google Brain and Google Research posted a paper to arXiv titled "Attention Is All You Need." Its premise was radical: throw out recurrence entirely. No LSTMs. No GRUs. Just attention mechanisms stacked together. Within about six years, virtually every frontier language model (GPT-4, Claude, Gemini, LLaMA) would be built on the architecture they described.

The eight authors — Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin — were not proposing a small improvement. They were proposing that the field had been solving the sequence problem with the wrong fundamental tool for nearly a decade.

Why Recurrent Networks Were Failing

Prior to 2017, the dominant approach to sequence modeling was the recurrent neural network (RNN) and its gated variants — LSTMs (introduced by Hochreiter & Schmidhuber in 1997) and GRUs (Cho et al., 2014). These networks processed text one token at a time, left to right, maintaining a hidden state that accumulated context.

The fundamental limitation was sequential computation: to process token 512, you had to first process tokens 1 through 511. This made training on long documents extremely slow and made it nearly impossible to fully exploit modern parallel hardware like GPUs. Information from early tokens also tended to fade — the "vanishing gradient" problem meant that even LSTMs struggled to relate a pronoun at position 400 to its referent at position 12.
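To make the sequential bottleneck concrete, here is a minimal sketch of a recurrent update in plain Python. The scalar weights `w_h` and `w_x` and the `tanh` update are illustrative assumptions, not the parameters of any real model; the point is only that step t cannot begin until step t-1 has finished.

```python
import math

def rnn_hidden_states(inputs, w_h=0.5, w_x=1.0):
    """Run a toy scalar RNN and return the hidden state after each step.

    w_h and w_x are made-up weights, chosen only for illustration.
    """
    h = 0.0
    states = []
    for x in inputs:                        # strictly one token after another
        h = math.tanh(w_h * h + w_x * x)    # the new state needs the previous state
        states.append(h)
    return states

# A signal at the first position, followed by silence.
states = rnn_hidden_states([1.0, 0.0, 0.0, 0.0])
print(states)
```

In this toy run the trace left by the first input shrinks at every step, which is the forward-pass counterpart of the fading the paragraph above describes. The loop also cannot be split across parallel workers, because each iteration reads the `h` written by the one before it.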

Researchers had already begun bolting attention mechanisms onto encoder-decoder RNNs for machine translation — Bahdanau et al.'s 2015 paper "Neural Machine Translation by Jointly Learning to Align and Translate" showed attention could dramatically improve translation quality. But attention was treated as an add-on, not the whole architecture.

Historical Record

On the WMT 2014 English-to-German translation benchmark, the original Transformer (in its larger "big" configuration) achieved 28.4 BLEU, a new state of the art at the time, while training in 3.5 days on 8 P100 GPUs, compared to weeks for comparable recurrent models. The parallel training advantage was not marginal; it was transformative.

The Core Insight: Attention as the Whole Architecture

The Transformer's key conceptual move was to ask: what if, instead of passing a hidden state through time, we let every token directly attend to every other token in a single parallel operation? This is self-attention: a mechanism that computes, for each token, a weighted sum of the representations of all tokens in the sequence (its own included), where the weights express relevance.

The word "bank" in "river bank" needs to know about "river" to be interpreted correctly. Self-attention lets "bank" look directly at "river" in a single step, regardless of the distance between them. No recurrence. No vanishing gradient across long distances. And crucially, every token can do this simultaneously — the computation parallelizes across all positions at once.

This single insight, fully implemented, produced a model that was faster to train, scaled better with data, and handled long-range dependencies more reliably than anything before it.

Key Terms

Transformer: The neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), built entirely from attention mechanisms and feed-forward layers, without recurrence or convolution.

Self-Attention: A mechanism where each position in a sequence computes relevance scores against all other positions and produces a weighted combination of their representations.

RNN / LSTM: Recurrent neural networks that process sequences step-by-step, maintaining a hidden state; the dominant paradigm before Transformers.

BLEU Score: Bilingual Evaluation Understudy, a standard metric for machine translation quality, measuring n-gram overlap with reference translations.

Why This Matters for You

Every AI tool you use today — ChatGPT, Claude, Gemini, Copilot, Midjourney's text encoder — runs on the Transformer architecture or a direct descendant. Understanding the original design is understanding the foundation beneath all of modern AI.

Lesson 1 Quiz

The Paper That Changed Everything

1. What was the title of the 2017 paper that introduced the Transformer architecture?

Correct. Vaswani et al. (2017) titled their landmark paper "Attention Is All You Need," introducing the Transformer architecture.

Not quite. The correct answer is "Attention Is All You Need" by Vaswani et al. (2017). The Bahdanau et al. paper introduced attention as an add-on to RNNs — not the Transformer itself.

2. What was the primary practical limitation of recurrent neural networks (RNNs/LSTMs) that the Transformer addressed?

Correct. RNNs process tokens one at a time, preventing parallel computation. This made training slow and caused information from distant tokens to fade — the vanishing gradient problem.

The core issue was sequential computation: RNNs process token-by-token, which prevents parallelization and causes long-range information loss via the vanishing gradient problem.

3. On the WMT 2014 English-to-German benchmark, what BLEU score did the original Transformer achieve?

Correct. The original Transformer achieved 28.4 BLEU on WMT 2014 English-German — a new state of the art at the time, achieved in 3.5 days on 8 P100 GPUs.

The correct figure is 28.4 BLEU. This was a new state of the art in 2017, achieved while training far faster than comparable recurrent models.

Lab 1 — The Architecture Origins

Explore why Transformers displaced RNNs. Ask at least 3 questions to complete the lab.

Your Mission

You have a direct line to an AI tutor specialized in Transformer history and architecture foundations. Use it to deepen your understanding of why the 2017 paper was so significant — and what problems it actually solved.

Suggested starting points: "Why couldn't researchers just make LSTMs bigger?" · "What did Bahdanau's attention paper contribute before Transformers?" · "What does it mean for computation to be parallelizable?"

Architecture Origins Tutor

L1 Lab

Welcome. I'm here to help you explore the historical context and motivations behind the Transformer architecture. The 2017 "Attention Is All You Need" paper didn't emerge from nowhere — it was a response to real bottlenecks that had frustrated researchers for years. What would you like to dig into first?

Module 1 · Lesson 2

Tokens, Embeddings, and Positional Encoding

How raw text becomes the mathematical objects a Transformer can process

If attention treats all positions symmetrically, how does the model know that "dog bites man" means something different from "man bites dog"?

Before a Transformer can process a single word, that word must be transformed into something mathematics can operate on. The journey from the string "The quick brown fox" to the first computation inside the model involves three distinct transformations — tokenization, embedding, and positional encoding — each solving a specific problem.

Step 1: Tokenization

Text is not fed character-by-character or word-by-word into modern Transformers. Instead, it is split into tokens — subword units produced by algorithms like Byte Pair Encoding (BPE, used in GPT models) or WordPiece (used in BERT). These algorithms were developed to balance vocabulary size against coverage of rare words.
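The core loop of BPE training can be sketched in a few lines. This is a simplified, character-level illustration on a made-up four-word corpus; real tokenizers such as those used by GPT models work on bytes and learn many thousands of merges, and the function names here are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical corpus: each word as a tuple of characters, mapped to its frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

for step in range(3):                      # learn three merges
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(step, pair)

print(corpus)
```

After a few merges, frequent fragments such as "est" become single symbols while rarer words remain split into smaller pieces, which is the vocabulary-versus-coverage trade-off described above.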

OpenAI's GPT-4 uses a tokenizer called cl100k_base, which has a vocabulary of roughly 100,000 tokens (tiktoken reports a vocabulary size of 100,277). The word "unhappiness" might be tokenized as ["un", "happiness"], which is two tokens. The word "cat" is a single token. An emoji might be 2–3 tokens. On average, one token corresponds to roughly 0.75 English words.

This matters practically: the 128,000-token context window of GPT-4 Turbo corresponds to roughly 96,000 words, about the length of a full novel. Every token, not every word, consumes part of that window.
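The conversion between tokens and words is simple arithmetic. The sketch below assumes the 0.75 words-per-token rule of thumb quoted above, which is only an average for English prose; code, math, and other languages can deviate substantially. The helper names are hypothetical.

```python
WORDS_PER_TOKEN = 0.75  # rule-of-thumb average for English prose, not a guarantee

def estimate_words(num_tokens, words_per_token=WORDS_PER_TOKEN):
    """Rough number of English words that fit in a given token budget."""
    return round(num_tokens * words_per_token)

def estimate_tokens(num_words, words_per_token=WORDS_PER_TOKEN):
    """Rough number of tokens a passage of English prose will consume."""
    return round(num_words / words_per_token)

print(estimate_words(128_000))   # budget of a 128K-token window, in words
print(estimate_tokens(96_000))   # tokens needed for a 96,000-word manuscript
```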

Real Data Point

When OpenAI released its tokenizer library tiktoken in 2022, developers could see directly that code is tokenized very differently from prose. Python's print("hello"), 14 characters long, becomes a handful of tokens (the exact count depends on the tokenizer). Dense mathematical notation can tokenize extremely inefficiently, which partly explains why math was historically harder for LLMs than prose.

Step 2: Embeddings

Each token ID is mapped to a high-dimensional vector through an embedding matrix. In the original Transformer, this vector had 512 dimensions; in the largest GPT-3 model, 12,288. These vectors are learned during training through gradient descent, and the resulting geometry carries meaning. The famous demonstration comes from Word2Vec (Mikolov et al., 2013), an earlier embedding method, whose vectors satisfied "king − man + woman ≈ queen."

Embeddings encode semantic similarity geometrically. Words used in similar contexts end up with similar vectors. This is not programmed — it emerges from the training objective of predicting the next token accurately.
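A toy lookup table makes the geometry tangible. The 4-dimensional vectors below are hand-picked so that the analogy works out; they are assumptions for illustration, not values from any trained model, where the vectors have hundreds or thousands of learned dimensions.

```python
import math

# Hypothetical embedding table; a real model learns these values during training.
embedding_table = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.8, 0.0],
    "man":   [0.1, 0.9, 0.1, 0.0],
    "woman": [0.1, 0.2, 0.8, 0.0],
    "apple": [0.0, 0.1, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(embedding_table[a], embedding_table[b], embedding_table[c])]
    candidates = [w for w in embedding_table if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embedding_table[w]))

print(analogy("king", "man", "woman"))
print(cosine(embedding_table["king"], embedding_table["queen"]))
print(cosine(embedding_table["king"], embedding_table["apple"]))
```

The lookup itself is nothing more than indexing a table by token; all of the interesting structure lives in the values that training puts there.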

Step 3: Positional Encoding — The Critical Addition

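Here is the problem: self-attention, as described, has no notion of sequence. It is often loosely called permutation-invariant; strictly speaking it is permutation-equivariant, meaning that if you shuffle the input tokens, each token's output is unchanged and merely shuffled along with it. Give it the tokens for "dog bites man" or "man bites dog," and each word ends up with exactly the same representation in both sentences. This would be catastrophic for language, where word order is meaning.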

The original Transformer paper solved this with sinusoidal positional encoding: a set of sine and cosine functions at different frequencies, added directly to the embedding vectors before any processing. Each position gets a unique positional signal injected into its representation.

The formula uses sin(pos/10000^(2i/d)) and cos(pos/10000^(2i/d)) for alternating dimensions, where pos is the token position, i indexes the sine/cosine pair, and d is the embedding dimension. The authors chose sinusoids because they hypothesized this would let the model learn to attend by relative position: for any fixed offset k, the encoding of position pos + k is a linear function of the encoding of position pos. They also suggested it might let the model extrapolate to sequences longer than those seen during training.
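The formula translates almost directly into code. This is a minimal sketch that assumes an even embedding dimension; the function names are illustrative rather than taken from any library.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for position `pos`.

    Assumes d_model is even. Dimension 2i gets the sine, dimension 2i + 1
    gets the cosine, both at frequency 1 / 10000^(2i / d_model).
    """
    encoding = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(sinusoidal_encoding(0, 8))

# Two pairs of positions that are both 4 apart.
near = dot(sinusoidal_encoding(3, 16), sinusoidal_encoding(7, 16))
far = dot(sinusoidal_encoding(10, 16), sinusoidal_encoding(14, 16))
print(near, far)
```

The two dot products agree up to floating-point error because sin(a)sin(b) + cos(a)cos(b) = cos(a − b): the similarity between two encodings depends only on the offset between the positions, not on where they sit in the sequence. In the original design this vector is simply added to the token embedding at the same position.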

Later models, including GPT-2 and BERT, switched to learned positional embeddings — simply trainable parameters for each position, letting the model discover whatever positional signal works best. More recent architectures like LLaMA use Rotary Position Embeddings (RoPE), which encode relative rather than absolute position within the attention computation itself.
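The relative-position property of RoPE can be checked numerically. The sketch below is a simplified version that rotates consecutive pairs of dimensions; production implementations differ in how they pair dimensions and batch the computation, and the query and key values are made up.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos / (base ** (2 * i / d))
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 0.5, -0.3, 0.8]   # hypothetical query vector
k = [0.2, -0.7, 0.4, 0.1]   # hypothetical key vector

score_a = dot(rope(q, 5), rope(k, 2))        # positions 5 and 2, offset 3
score_b = dot(rope(q, 105), rope(k, 102))    # positions 105 and 102, offset 3
print(score_a, score_b)
```

Both scores match up to floating-point error: rotating the query and key by their own positions leaves a dot product that depends only on the difference between those positions, which is what "relative rather than absolute" means here.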

Key Terms

Token: The basic unit of text that a language model processes, typically a subword unit produced by BPE or WordPiece tokenization, averaging ~0.75 English words.

Embedding: A learned high-dimensional vector representing a token's meaning, positioned in space so that semantically similar tokens have similar vectors.

Positional Encoding: A signal added to token embeddings to inject information about sequence order, compensating for attention's permutation-invariance.

BPE: Byte Pair Encoding, a tokenization algorithm that iteratively merges the most frequent character pairs, producing a subword vocabulary that handles rare words efficiently.

RoPE: Rotary Position Embeddings, a technique used in LLaMA and other modern models that encodes relative position directly within attention computations.

Practical Implication

Understanding tokenization explains real AI behavior: why models sometimes "count" letters wrong (they never see individual characters), why code and math can confuse models (inefficient tokenization), and why context windows are measured in tokens, not words.

Lesson 2 Quiz

Tokens, Embeddings, and Positional Encoding

1. Why do modern Transformers use subword tokenization (like BPE) rather than character-by-character or word-by-word splits?

Correct. BPE and similar algorithms find a middle ground: a manageable vocabulary size (e.g. ~100K tokens for GPT-4) that still handles rare words by splitting them into known subword pieces.

The key benefit is balance: subword methods avoid a huge word-level vocabulary while still representing rare words (by decomposing them into known subwords). Pure character approaches work but produce very long sequences.

2. Why is positional encoding necessary in a Transformer, but NOT in an RNN?

Correct. Because RNNs process one token at a time in order, position is implicit in the computation. Self-attention attends to all tokens simultaneously and is blind to order without an explicit positional signal.

The key insight: RNNs process tokens sequentially, so order is naturally embedded in the computation. Self-attention treats all positions the same unless you explicitly inject positional information.

3. On average, how many English words does one token correspond to in GPT-style models?

Correct. One token ≈ 0.75 English words, or roughly 4 characters. This means a 128K token context window holds approximately 96,000 words.

The average is roughly 0.75 words (or ~4 characters) per token. Common short words are single tokens; longer or rarer words may be 2–3 tokens.

Lab 2 — Tokenization Explorer

Probe how tokenization and positional encoding shape model behavior. Ask at least 3 questions.

Your Mission

Explore the practical consequences of tokenization and positional encoding with your AI tutor. Many puzzling LLM behaviors — letter-counting errors, math struggles, context length limits — trace back to these input representations.

Suggested starting points: "Why do LLMs sometimes fail to count the letters in 'strawberry'?" · "How does RoPE improve on sinusoidal positional encoding?" · "What happens when you give a model input longer than its context window?"

Tokenization & Embeddings Tutor

L2 Lab

Ready to explore how text becomes numbers. Tokenization and positional encoding explain a surprising number of LLM quirks that confuse practitioners. What would you like to investigate?

Module 1 · Lesson 3

Self-Attention: Queries, Keys, and Values

The mathematical heart of the Transformer — how every token looks at every other token

How does a word decide which other words in the sentence are most relevant to its own meaning?

The self-attention mechanism is arguably the most important algorithm in contemporary AI. Understanding it — not just knowing it exists, but understanding how it actually computes — unlocks the ability to reason about what language models can and cannot do, why they sometimes fail, and how they are being extended by new research.

The Q, K, V Framework

Each token's embedding is projected into three separate vectors through learned linear transformations: a Query (Q), a Key (K), and a Value (V). These names come from a loose analogy with database retrieval: the Query is what you're looking for, Keys are what's available to match against, and Values are the actual information retrieved.

For each token, attention is computed as follows: take the token's Query vector and compute its dot product with the Key vector of every token in the sequence (including its own). This produces a raw score expressing how relevant each token is. Divide by √d_k (the square root of the key dimension); this scaling prevents the dot products from becoming so large that the softmax function produces near-zero gradients. Apply softmax to normalize these scores into a probability distribution. Finally, compute a weighted sum of all Value vectors using these normalized scores.

The result: each token's new representation is a blend of the Value vectors of all tokens in the sequence, weighted by relevance. In a single layer, "bank" in "river bank" can already incorporate strong signal from "river."

The Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

This is the entire self-attention computation. Every frontier language model runs billions of these operations per forward pass. The elegance is real: it is a differentiable, parallelizable lookup.
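The formula can be written out in plain Python for small inputs. This is a minimal sketch for illustration: the Q, K, and V rows are hand-picked numbers rather than learned projections, and real implementations use batched matrix operations on accelerators.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Returns the output vectors and the attention weights for each query.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        output = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors for a three-token sequence.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is the probability distribution one token places over the whole sequence, and each output is the corresponding blend of Value rows. The outer loop over queries has no dependency between iterations, which is why the whole computation collapses into a few matrix multiplications in practice.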

Multi-Head Attention

A single attention computation can only capture one kind of relationship at a time. The original Transformer used multi-head attention: run the Q, K, V projection and attention computation h times in parallel, each with different learned projection matrices. Concatenate the results and project them back down to the model dimension.

In the original paper, the base model used h=8 heads with d_k=64. Why? Because different heads learn to attend to different types of relationships. Research by Voita et al. (2019) at Yandex analyzed trained Transformer heads and found that different heads specialized: some tracked syntactic dependencies, others tracked positional patterns, others tracked coreference (which "it" refers to).

This is not programmed in — it emerges from training. The multi-head structure gives the model enough capacity to simultaneously represent multiple types of inter-token relationships.
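The split-attend-concatenate structure of multi-head attention can be sketched as follows. One loud assumption: each head here simply takes its own slice of the input vector in place of the learned per-head Q, K, and V projection matrices, and the final output projection is omitted, so this shows the data flow rather than a trainable layer. All names and values are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """Scaled dot-product attention; returns one output vector per query."""
    d_k = len(K[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def split_heads(X, h):
    """Slice each token vector into h equal chunks, one per head."""
    d_head = len(X[0]) // h
    return [[row[i * d_head:(i + 1) * d_head] for row in X] for i in range(h)]

def multi_head_self_attention(X, h):
    # Stand-in for learned projections: every head attends over its own slice.
    per_head = [attend(H, H, H) for H in split_heads(X, h)]
    # Concatenate the head outputs back into one vector per token.
    return [[value for head in per_head for value in head[t]] for t in range(len(X))]

X = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9], [1.0, 1.0, 0.3, 0.3]]
Y = multi_head_self_attention(X, h=2)
print(Y)

# The base model's dimensions: 512 split across 8 heads gives 64 per head.
print(len(split_heads([[0.0] * 512], 8)[0][0]))
```

Because each head computes its own attention weights, two heads can distribute their focus across the sequence in completely different ways, which is what gives the model room for the specialization described above.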

The Attention Pattern: What Models Actually Look At

Clark et al. (2019), in "What Does BERT Look At?", a collaboration between Stanford and Facebook AI Research, visualized attention patterns in trained BERT models. They found that certain heads consistently attended to specific linguistic structures: delimiter tokens like [SEP], the next/previous token, and words in specific syntactic relationships. One head, for example, attended to the direct objects of verbs with high accuracy.

This analysis was important because it demonstrated that Transformers were not black boxes in the sense of being completely opaque — the attention weights provide a partial window into what the model is computing, though interpreting attention weights as "what the model uses" remains an active research debate (Jain & Wallace, 2019, argued attention is not explanation).

Key Terms

Query (Q): The projection of a token's embedding used to "ask" which other tokens are relevant; what this token is looking for.

Key (K): The projection used to "advertise" a token's relevance to queries; what this token offers to be found by.

Value (V): The actual information content of a token that gets passed to attending tokens; what is retrieved when the match succeeds.

Multi-Head Attention: Running h parallel attention computations with separate projection matrices, allowing the model to attend to multiple relationship types simultaneously.

Softmax: A function that converts a vector of raw scores into a probability distribution (all values positive, summing to 1), used to normalize attention scores.

Emerging Insight

The scaling factor √d_k in the attention formula is easy to overlook but critical. Without it, with large d_k, dot products grow large and softmax saturates, pushing nearly all probability mass onto one token and producing near-zero gradients that stall learning. The original paper introduced the scaling specifically to counteract this effect.
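The saturation effect is easy to observe with random vectors. This sketch assumes query and key components drawn from a standard normal distribution, a common idealization rather than a description of any trained model; the seed and sizes are arbitrary.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)          # arbitrary seed so the run is repeatable
d_k = 512
query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(8)]

raw_scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

peak_raw = max(softmax(raw_scores))
peak_scaled = max(softmax(scaled_scores))
print(f"largest weight without scaling: {peak_raw:.3f}")
print(f"largest weight with scaling:    {peak_scaled:.3f}")
```

Dividing every score by the same constant greater than 1 always flattens the softmax, so the scaled distribution is never more peaked than the unscaled one. With d_k = 512 the unscaled scores have a standard deviation of about √512 ≈ 22.6, which is what drives the softmax toward a one-hot output.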

Lesson 3 Quiz

Self-Attention: Queries, Keys, and Values

1. In the attention formula Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V, what is the purpose of dividing by √d_k?

Correct. Without the √d_k scaling, large-dimension dot products grow large, pushing softmax into regions where gradients are near zero — making training unstable.

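The √d_k divisor is a stabilization technique: for roughly unit-variance query and key components, the variance of a dot product grows in proportion to d_k (so its typical magnitude grows like √d_k), which saturates softmax. Scaling brings the scores back to a regime where gradients flow properly.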

2. What did Voita et al. (2019) discover about multi-head attention heads in trained Transformers?

Correct. Voita et al. (2019) found that attention heads specialize — some track syntax, some coreference, some positional patterns — emerging purely from training, not explicit programming.

Voita et al. found specialization: different heads learn different types of relationships. This was important evidence that Transformers develop structured internal representations.

3. In the Q, K, V framework, if "river" is the Query and "bank" is a Key, what does the resulting high attention score lead to?

Correct. High attention score → high weight in the softmax distribution → the Value vector of "bank" contributes strongly to the output representation of "river," blending context into each token.

Attention weights are used to compute a weighted sum of Value vectors. A high score between "river" (Q) and "bank" (K) means "bank"'s Value vector heavily influences "river"'s output representation.

Lab 3 — Attention Mechanics Deep Dive

Work through the Q, K, V computation with your AI tutor. Ask at least 3 questions.

Your Mission

The self-attention formula is deceptively compact. In this lab, push your understanding of how it actually works: what the three projections represent, why multi-head attention matters, and what interpretability research has revealed about what trained attention heads do.

Suggested starting points: "Walk me through the attention computation step by step for a simple example" · "What's the difference between self-attention and cross-attention?" · "How do attention patterns differ between early and late layers in a Transformer?"

Attention Mechanics Tutor

L3 Lab

Let's dig into the core algorithm. Self-attention is the engine of the Transformer — understanding it deeply will pay dividends throughout this course and in your practical AI work. Where would you like to start?

Module 1 · Lesson 4

Feed-Forward Layers, Layer Norms, and the Full Stack

The complete Transformer block — and why depth matters more than width

Attention captures relationships between tokens — but where does the model store and apply factual knowledge?

Attention handles routing: it decides which information from which tokens is relevant and mixes it together. But routing is only half the story. The feed-forward layers, often overlooked in popular explanations, are where much of the model's learned factual knowledge appears to live. Research by Geva et al. (2021) at Tel Aviv University showed that feed-forward layers in Transformers function as key-value memories, with each neuron encoding specific input patterns and their associated outputs.

The Feed-Forward Sublayer

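Every encoder layer, and every layer of a decoder-only model such as GPT, contains two sublayers: the multi-head self-attention sublayer (which we covered in Lesson 3) and a position-wise feed-forward network (FFN). (The decoder layers of the original encoder-decoder design add a third sublayer that attends over the encoder output.) The FFN is applied independently and identically to each token position; there is no mixing between positions here.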

The original architecture used a two-layer fully-connected network with a ReLU activation: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. The inner layer dimension was 4× the model dimension — so in the base model with d_model=512, the FFN expanded to 2048, then contracted back. This expansion-contraction pattern persists across virtually all Transformer variants.
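The FFN formula is short enough to write out directly. The sketch below handles a single token vector with plain lists; the toy weights are arbitrary placeholders, and `ffn_param_count` is a hypothetical helper added to show how much of a layer's parameter budget the expansion consumes.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single token vector x.

    W1 has shape d_model x d_ff and W2 has shape d_ff x d_model.
    """
    hidden = [
        max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])   # ReLU
        for j in range(len(b1))
    ]
    return [
        sum(hidden[j] * W2[j][k] for j in range(len(hidden))) + b2[k]
        for k in range(len(b2))
    ]

def ffn_param_count(d_model, d_ff):
    """Weights and biases of both linear layers."""
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

# Toy sizes with the same 4x expansion: d_model = 2, d_ff = 8. Values are arbitrary.
W1 = [[0.1 * (j - 3) for j in range(8)], [0.05 * (4 - j) for j in range(8)]]
b1 = [0.0] * 8
W2 = [[0.1, -0.1] for _ in range(8)]
b2 = [0.0, 0.0]

y = ffn([1.0, -2.0], W1, b1, W2, b2)
print(y)
print(ffn_param_count(512, 2048))   # parameters in one base-model FFN sublayer
```

The function takes one token at a time and never looks at its neighbors, which is the "position-wise" property. In the base configuration a single FFN sublayer holds roughly two million parameters.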

Many modern models replace the ReLU FFN with a gated variant called SwiGLU (Shazeer, 2020), which multiplies a Swish-activated projection of the input elementwise by a second linear projection before projecting back down. LLaMA, Mistral, and most open-source models use SwiGLU or similar gated FFN variants, finding them to improve performance at the same parameter count.

Research Finding

Geva et al. (2021), "Transformer Feed-Forward Layers Are Key-Value Memories," showed that individual neurons in FFN layers respond to specific input patterns (keys) and promote specific output tokens (values). A neuron that fires strongly for "Paris is the capital of" tends to amplify the probability of "France" in the output. This is empirical evidence that much factual knowledge is stored in FFN weights rather than attention weights.

Layer Normalization

Around each sublayer, the original Transformer applied Layer Normalization (Ba et al., 2016). Layer Norm normalizes the activations across the feature dimension (not the batch dimension, as Batch Norm does), then applies learned scale and shift parameters. This stabilizes training by keeping the scale of activations consistent from layer to layer. The motivation is inherited from Batch Norm's goal of reducing internal covariate shift, the phenomenon where the distribution of activations changes unpredictably during training, making each layer adapt to a moving target.

The original paper used post-layer norm (norm applied after the residual addition). Most modern models, from GPT-2 onwards, switched to pre-layer norm (norm applied before each sublayer). This change, analyzed in depth by Xiong et al. (2020), substantially improves training stability and can remove the need for a learning-rate warmup stage. LLaMA and most 2023+ open models use RMSNorm, a simplified normalization that drops the mean-centering step and is slightly faster.
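The two normalizations discussed above differ by only a few operations. This is a minimal sketch for a single token vector; the `eps` value and the all-ones `gamma` are conventional placeholder choices, not settings taken from a specific model.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Center and rescale across the feature dimension, then apply scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """Rescale by the root mean square only: no mean-centering and no shift."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]          # activations for one token
ones = [1.0] * len(x)
zeros = [0.0] * len(x)

ln = layer_norm(x, ones, zeros)
rn = rms_norm(x, ones)
print([round(v, 3) for v in ln])
print([round(v, 3) for v in rn])
```

Layer Norm produces outputs with zero mean and roughly unit variance, while RMSNorm only fixes the overall magnitude and leaves the mean where it was. Skipping the mean computation is where the small speed advantage comes from.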

Residual Connections: The Information Highway

Both sublayers (attention and FFN) use residual connections — the input to each sublayer is added back to its output: output = sublayer(x) + x. This pattern, borrowed from ResNets (He et al., 2015), allows gradients to flow directly back to early layers without being multiplied through every transformation in the stack. Without residuals, very deep networks fail to train; with them, 96-layer models (GPT-3 has 96 layers) train stably.

The residual connection also means each layer is learning a residual function — the difference from the identity. If a layer learns nothing, it can simply output zeros and pass the input unchanged. This initialization-friendly property makes deep Transformers surprisingly robust to the specific initialization of individual layers.

The Full Transformer Block

A complete Transformer decoder block (as used in GPT models) processes input x through:

1. Layer Norm → Masked Multi-Head Self-Attention → add residual
2. Layer Norm → Feed-Forward Network → add residual

This block is stacked N times (12 in GPT-2 small, 96 in GPT-3). The output of the final block goes through a final Layer Norm, then a linear projection to vocabulary size, then a softmax to produce next-token probabilities. That is the core architecture shared by GPT-style decoder-only models.
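The two steps of the block can be sketched structurally, with the sublayers passed in as functions. This is an assumption-laden outline: `attn`, `ffn`, and the norms are stand-in callables supplied by the caller, not real trained layers, and the final norm, vocabulary projection, and softmax are left out.

```python
def add(xs, ys):
    """Elementwise sum of two sequences of token vectors (the residual addition)."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def pre_ln_block(xs, attn, ffn, norm1, norm2):
    """One pre-layer-norm decoder block applied to a list of token vectors."""
    # 1. Layer Norm -> self-attention (mixes across positions) -> add residual
    xs = add(xs, attn([norm1(x) for x in xs]))
    # 2. Layer Norm -> feed-forward (each position independently) -> add residual
    xs = add(xs, [ffn(norm2(x)) for x in xs])
    return xs

def run_stack(xs, blocks):
    """Apply N blocks in sequence; each block is (attn, ffn, norm1, norm2)."""
    for attn, ffn, norm1, norm2 in blocks:
        xs = pre_ln_block(xs, attn, ffn, norm1, norm2)
    return xs

# Stand-in sublayers that contribute nothing, to expose the residual path.
identity = lambda x: x
zero_attn = lambda xs: [[0.0] * len(x) for x in xs]
zero_ffn = lambda x: [0.0] * len(x)

tokens = [[0.5, -1.0], [2.0, 0.25]]
out = run_stack(tokens, [(zero_attn, zero_ffn, identity, identity)] * 12)
print(out == tokens)
```

When every sublayer outputs zeros, twelve stacked blocks hand the input through untouched. That is the residual pathway in action: each layer only has to learn what to add to the running representation, not how to reproduce it.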

Output Probabilities (softmax over vocabulary)

↑

Final Layer Norm + Linear Projection

↑

Feed-Forward Network (×N blocks)

↑

Masked Multi-Head Self-Attention (×N blocks)

↑

Token Embeddings + Positional Encoding

↑

Input Tokens

Simplified GPT-style Transformer Decoder Stack

Key Terms

Feed-Forward Network (FFN): The position-wise fully-connected sublayer in each Transformer block that expands then contracts the representation, functioning as a key-value memory for factual associations.

Layer Normalization: A normalization technique applied across feature dimensions that stabilizes training by ensuring consistent activation distributions through the network.

Residual Connection: Adding the sublayer input directly to its output (output = sublayer(x) + x), enabling stable gradient flow through deep networks.

Pre-Layer Norm: Applying layer normalization before (rather than after) each sublayer; the standard in modern GPT-style models, improving training stability.

SwiGLU: A gated activation function used in modern FFN layers (LLaMA, Mistral, and similar models) that improves performance over the original ReLU-based FFN.

The Practical Upshot

If you want to understand where a language model stores a fact (e.g., "Paris is the capital of France"), it is largely in the FFN weights distributed across layers. If you want to understand how it routes and combines information in context, look to attention. Both components are necessary; neither is sufficient alone.

Lesson 4 Quiz

Feed-Forward Layers, Layer Norms, and the Full Stack

1. According to Geva et al. (2021), what function do feed-forward layers in Transformers primarily serve?

Correct. Geva et al. (2021) showed empirically that FFN neurons respond to specific input patterns and promote specific outputs — functioning as distributed factual memory.

Geva et al. found that FFN layers act as key-value memories. Individual neurons fire for specific patterns (e.g., "capital of France") and boost specific outputs (e.g., "Paris"). Routing between positions is attention's job.

2. What is the key advantage of residual connections in deep Transformer networks?

Correct. Residual connections (output = sublayer(x) + x) create a direct gradient path through all layers, mitigating the vanishing gradient problem in deep networks. The technique is borrowed from ResNets.

Residual connections add the input directly to the sublayer output. This creates a "highway" for gradients to flow back to early layers without being multiplied through all the intermediate transformations.

3. In the inner dimension of the Transformer's feed-forward sublayer, what expansion ratio does the original paper use relative to the model dimension?

Correct. The original Transformer used a 4× expansion in the FFN inner layer (e.g., model dim 512 → inner dim 2048 → back to 512). This 4× ratio is still common in most modern architectures.

The original paper used a 4× expansion. With d_model=512, the FFN expanded to 2048 internally before projecting back. This ratio persists in most Transformer variants.

Lab 4 — The Complete Architecture

Integrate all four components of the Transformer with your AI tutor. Ask at least 3 questions.

Your Mission

You've now seen all the components of the Transformer: tokenization, embeddings, positional encoding, self-attention, feed-forward layers, layer norm, and residual connections. In this final lab, work with your tutor to connect them into a coherent picture — and begin thinking about how this architecture scales.

Suggested starting points: "How does information flow from input tokens to output probabilities in a GPT model?" · "Why does GPT-3 have 96 layers — what does depth buy you?" · "What is the difference between an encoder-only model like BERT and a decoder-only model like GPT?"

Full Architecture Tutor

L4 Lab

You've built up the full picture now — tokens, embeddings, positional encoding, attention, feed-forward layers, normalization, and residuals. Let's put it all together. What aspect of the complete architecture would you like to explore or clarify?

Module 1 Test

The Transformer Architecture — 15 questions · Pass mark: 80%

1. In what year was the Transformer architecture introduced, and in what paper?

Correct. Vaswani et al. published "Attention Is All You Need" in 2017.

The Transformer was introduced in 2017 in "Attention Is All You Need" by Vaswani et al.

2. What is the fundamental reason the Transformer can train faster than RNNs on the same data?

Correct. Parallelization is the key training advantage — all positions are processed simultaneously rather than one at a time.

Parallel processing is the key: self-attention computes all positions at once; RNNs must wait for each token to process the next.

3. What tokenization algorithm is used in GPT-4's cl100k_base tokenizer?

Correct. GPT models use BPE (Byte Pair Encoding). BERT uses WordPiece — a similar but distinct algorithm.

GPT models (including GPT-4) use Byte Pair Encoding. WordPiece is used by BERT. Both are subword tokenization methods.

4. Why is positional encoding necessary in Transformers but not in RNNs?

Correct. RNNs process tokens sequentially, so order is implicit. Self-attention is order-agnostic and requires explicit positional signals.

Without positional encoding, self-attention has no notion of token order and treats every position identically. RNNs encode order inherently through sequential processing.
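
One way to see the order-agnostic behavior is to shuffle a sequence and check that each token still gets the same output. The sketch below is a toy, not a real model: it assumes Q = K = V = the raw embeddings (no learned projections), and the three token vectors are invented for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy self-attention with Q = K = V = tokens (no learned projections)."""
    d_k = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens))
                        for i in range(d_k)])
    return outputs


# Three made-up token embeddings with no positional signal added.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order = [2, 0, 1]  # an arbitrary reordering of the sequence
shuffled = [tokens[i] for i in order]

original_out = self_attention(tokens)
shuffled_out = self_attention(shuffled)

# Each token gets the same output vector wherever it sits in the sequence.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for position, source in enumerate(order)
    for a, b in zip(shuffled_out[position], original_out[source])
)
print(same)  # True
```

Because the outputs only move with their tokens, nothing in the computation distinguishes "dog bites man" from "man bites dog" until a positional signal is added.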

5. In the original Transformer (base model), what is the embedding dimension (d_model)?

Correct. The original base Transformer uses d_model=512. The large model uses 1024. GPT-3's d_model of 12288 came much later.

The original base model uses d_model=512. The large variant uses 1024. Much larger later models such as GPT-3 use d_model=12288.

6. What is the complete self-attention formula?

Correct. softmax(QKᵀ / √d_k) · V is the canonical self-attention formula from Vaswani et al. (2017).

The correct formula is Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V. Queries dot with Keys, scaled, softmaxed, then applied to Values.
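
The formula can be traced step by step in a few lines of plain Python. This is a minimal sketch for a single head with no masking or batching; the Q, K, and V values are invented for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with matrices given as lists of rows."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Dot each query with every key, then scale by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


Q = [[1.0, 0.0]]                 # one query
K = [[1.0, 0.0], [0.0, 1.0]]     # two keys
V = [[10.0, 0.0], [0.0, 10.0]]   # two values
result = attention(Q, K, V)
print(result)
```

The query matches the first key more closely, so the output leans toward the first value vector while still mixing in some of the second.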

7. In the original Transformer base model, how many attention heads (h) are used in multi-head attention?

Correct. The original base Transformer uses h=8 heads with d_k=64 each (512 / 8 = 64).

The base model uses h=8 heads. With d_model=512 and 8 heads, each head has d_k=64 dimensions.

8. What does the research finding from Geva et al. (2021) tell us about where factual knowledge is stored in a Transformer?

Correct. Geva et al. showed FFN layers are key-value memories — individual neurons respond to specific input patterns and boost associated output tokens.

Geva et al. found that FFN layers store factual knowledge as key-value memories. Attention handles routing; FFN handles knowledge retrieval.

9. What is the inner-layer expansion ratio in the original Transformer's feed-forward sublayer?

Correct. The FFN expands 4× (d_model=512 → inner=2048) before projecting back down.

The expansion is 4×. With d_model=512, the FFN inner dimension is 2048.

10. What modern positional encoding technique does LLaMA use, and what is its key advantage?

Correct. LLaMA uses RoPE (Rotary Position Embeddings), which encodes relative rather than absolute position directly within the attention computation.

LLaMA uses RoPE (Rotary Position Embeddings), which encodes relative position within the attention computation and can help generalization to sequence lengths beyond those seen in training.

11. Why did researchers switch from post-layer norm to pre-layer norm in models like GPT-2 and later?

Correct. Liu et al. (2020) showed pre-layer norm markedly improves training stability, especially important as models scaled to hundreds of billions of parameters.

Pre-layer norm stabilizes training — especially critical at scale. Most modern models (GPT-2 onward) use it for this reason.
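
The difference between the two orderings is easiest to see side by side. In this rough sketch, layer_norm omits the learned gain and bias, and toy_sublayer is a hypothetical stand-in for attention or the FFN.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or the FFN: scales each value."""
    return [0.5 * v for v in x]


def post_ln_block(x):
    """Original Transformer ordering: LayerNorm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, toy_sublayer(x))])


def pre_ln_block(x):
    """Pre-LN ordering: x + sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, toy_sublayer(layer_norm(x)))]


x = [1.0, 2.0, 3.0, 4.0]
post = post_ln_block(x)
pre = pre_ln_block(x)
print(post)
print(pre)
```

In the pre-LN block the input x reaches the output without passing through normalization, so the residual path stays an untouched identity from layer to layer.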

12. On average, how many English words does one token represent in GPT-style models?

Correct. One token ≈ 0.75 English words. A 128K token context window holds roughly 96,000 words.

The average is ~0.75 words per token. Common short words are one token; longer/rarer words may be 2–3 tokens.
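
The rule of thumb turns into quick back-of-the-envelope arithmetic. The helper names below are made up for illustration, and 0.75 is only an average for English text; real counts depend on the tokenizer and the content.

```python
WORDS_PER_TOKEN = 0.75  # rule-of-thumb average for English text


def estimated_words(tokens: int) -> int:
    """Approximate word count that fits in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def estimated_tokens(words: int) -> int:
    """Approximate token count needed for a given number of words."""
    return round(words / WORDS_PER_TOKEN)


print(estimated_words(128_000))  # 96000
print(estimated_tokens(1_500))   # 2000
```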

13. What did Voita et al. (2019) discover about different attention heads in trained Transformer models?

Correct. Voita et al. found emergent specialization — different heads track syntax, coreference, positional patterns, etc., without being explicitly programmed to do so.

Voita et al. found emergent specialization: heads diverge to track different linguistic relationships purely through training.

14. What is the purpose of residual connections in Transformer blocks?

Correct. By adding input to output (x + sublayer(x)), residuals create a direct gradient path, making very deep networks trainable.

Residuals (output = x + sublayer(x)) create a direct gradient highway to early layers, mitigating vanishing gradients in deep networks.
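
The gradient argument follows from the chain rule: a plain layer y = f(x) contributes a factor f'(x), while a residual layer y = x + f(x) contributes 1 + f'(x). The sketch below uses scalar layers, and the depth (24) and local derivative (0.1) are assumed values chosen only to make the effect visible.

```python
def gradient_through_stack(local_derivatives, residual):
    """Chain rule through a stack of scalar layers.

    Plain layer:    y = f(x)      -> dy/dx = f'(x)
    Residual layer: y = x + f(x)  -> dy/dx = 1 + f'(x)
    """
    grad = 1.0
    for d in local_derivatives:
        grad *= (1.0 + d) if residual else d
    return grad


# Assumed for illustration: 24 layers, each with a small local derivative.
derivs = [0.1] * 24
plain = gradient_through_stack(derivs, residual=False)
with_residual = gradient_through_stack(derivs, residual=True)
print(plain < 1e-20, with_residual >= 1.0)  # True True
```

Without the identity term the gradient shrinks toward zero as depth grows; with it, the signal reaching the early layers never drops below the contribution of the direct path in this toy setup.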

15. What BLEU score did the original Transformer achieve on WMT 2014 English-German translation, establishing a new state of the art?

Correct. 28.4 BLEU on WMT 2014 English-German — achieved in 3.5 days on 8 P100 GPUs, outperforming all prior models including ensembles.

The correct figure is 28.4 BLEU. This was a new state of the art, achieved with significantly faster training than comparable recurrent models.