Module 3 · Lesson 1

What Is a Token?

Language models don't read words. They read fragments — and that changes everything.

Why does an LLM sometimes spell badly, struggle with syllables, or split "New York" into unexpected pieces?

When OpenAI published pricing for GPT-4 in March 2023, it listed costs per 1,000 tokens — not per word, not per character, not per sentence. Journalists and developers scrambled to answer the same question: what exactly is a token? OpenAI pointed users to a public tool called the Tokenizer, which visualizes how any text is sliced before it enters the model. The answer was weirder than most expected.

Tokens Are Subword Fragments

A token is the fundamental unit of text that an LLM processes. Tokens are neither words nor characters — they are subword pieces produced by an algorithm called Byte-Pair Encoding (BPE), developed by researchers at Sennrich et al. in 2016 and adopted by GPT models from GPT-2 onward.

BPE starts with individual characters and repeatedly merges the most frequent adjacent pairs until it reaches a fixed vocabulary size. GPT-4 uses a vocabulary of roughly 100,277 tokens. Common short words like "the" become a single token. Long, rare words get split: "tokenization" might become ["token", "ization"]. Punctuation, spaces, and even newlines each count.

See It in Action

The sentence "Tokens aren't words." breaks down like this under GPT-4's tiktoken library:

Tokens aren 't words .

Five tokens for four visible words. The space before "aren" is folded into the token itself — a quirk of GPT tokenization that surprises most newcomers. This also explains why LLMs struggle to count letters: the model never "sees" individual characters; it sees pre-merged chunks.

Real Case · The "9.11 vs 9.9" Comparison Error (2024)

In mid-2024, users widely reported that GPT-4 claimed 9.11 is greater than 9.9. Analysts noted the model tokenizes "9.11" and "9.9" as number strings, not floating-point values. The model's arithmetic failure traces directly to token-level representation: it never sees a numeric magnitude, only a sequence of digit tokens it must reason about via learned patterns — patterns that can mislead on unusual comparisons.

Why This Matters for You

Token counts determine three things: cost (most APIs charge per token), speed (fewer tokens = faster response), and context limits (the model can only hold a fixed number of tokens at once — more on this in L2). As a practical rule of thumb, 1,000 tokens ≈ 750 English words, though this varies by language. Turkish and Finnish, which form long compound words, can use 40–60% more tokens than English for the same meaning.

TokenA subword fragment produced by Byte-Pair Encoding; the atomic unit an LLM reads and generates.

BPEByte-Pair Encoding — the algorithm that learns which character sequences to merge into single tokens based on training-corpus frequency.

VocabularyThe fixed set of all possible tokens a model knows. GPT-4 uses ~100,277; Claude uses a different vocabulary of similar size.

tiktokenOpenAI's open-source Python library that replicates the exact tokenization used by GPT models, available at github.com/openai/tiktoken.

Rule of Thumb

When estimating token counts: English prose ≈ 1 token per 4 characters, or 1 token per ~0.75 words. Code is denser — Python averages closer to 1 token per 3.5 characters due to special characters and indentation.

Lesson 1 Quiz

What Is a Token? — Check your understanding

What algorithm is primarily responsible for creating the token vocabulary used by GPT models?

Correct — BPE merges the most frequent adjacent character pairs iteratively to build the vocabulary.

Not quite. GPT models use Byte-Pair Encoding (BPE), developed by Sennrich et al. (2016). WordPiece is used by BERT.

Approximately how many English words does 1,000 tokens represent?

Correct — the standard rule of thumb is 1,000 tokens ≈ 750 English words, roughly 4 characters per token.

The standard approximation is 750 English words per 1,000 tokens — about 4 characters per token on average.

Why do LLMs sometimes fail to count letters in a word correctly?

Correct — the model never "sees" individual characters; it sees pre-merged token chunks and must infer character-level details indirectly.

The real reason is tokenization: LLMs process subword fragments, so individual characters are hidden inside merged tokens. Counting letters requires inferring what's inside a chunk.

Lab 1 · Token Explorer

Probe the AI about how tokenization works in practice

Your Mission

Ask the AI assistant to explain how different types of text get tokenized — compare English vs. another language, or ask why code uses more tokens than prose. Try at least 3 exchanges.

Suggested start: "Show me how the word 'tokenization' would be split into tokens, and why that matters for my API costs."

Token Explorer

Lab 1

Welcome! I'm your token exploration guide. Ask me anything about how text gets broken into tokens — costs, languages, code, arithmetic quirks. What would you like to explore?

Module 3 · Lesson 2

The Context Window

Every LLM lives inside a finite bubble of attention. Understand its edges and you understand its limits.

What happens to the beginning of a long conversation once the model's context fills up?

On May 11, 2023, Anthropic announced Claude with a 100,000-token context window — roughly the length of the entire novel The Great Gatsby times five. The announcement was framed around a specific demo: loading a 240-page technical document and asking questions about specific passages. It made headlines because the previous de-facto standard for commercial models was 4,096 tokens. Within months, Google announced Gemini 1.5 with a one-million-token context. The context window arms race had begun.

What the Context Window Is

The context window (also called the context length) is the maximum number of tokens a model can attend to simultaneously during a single forward pass. Everything inside the window — your system prompt, the conversation history, your latest message, and the model's response being generated — must fit within this limit.

Think of it as working memory. The model processes all tokens in the window at once through its attention mechanism, weighing how every token relates to every other. Tokens outside the window simply do not exist to the model — it has no way to reference them.

Context Window Sizes Over Time

Model	Release	Context Window	Approx. Pages
GPT-2	2019	1,024 tokens	~1–2 pages
GPT-3	2020	4,096 tokens	~5–6 pages
GPT-4 (original)	2023	8,192 tokens	~12 pages
Claude (v1)	2023	100,000 tokens	~300 pages
GPT-4 Turbo	2023	128,000 tokens	~400 pages
Gemini 1.5 Pro	2024	1,000,000 tokens	~3,000 pages
Claude 3.5 Sonnet	2024	200,000 tokens	~600 pages

The "Lost in the Middle" Problem

A landmark Stanford study published in 2023 — "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al.) — found a surprising pattern: LLMs perform best when relevant information appears at the beginning or end of a long context, and significantly worse when it appears in the middle. In retrieval tasks with 20 documents, performance dropped by up to 20 percentage points for middle-positioned content.

This means that even if a model has a 200,000-token window, burying a critical fact in the middle of a huge document can cause the model to miss or underweight it. The practical implication: put your most important information first or last.

Real Case · GitHub Copilot Context Limits

GitHub Copilot, which uses GPT-4 under the hood, applies context window management automatically. As of 2023, it sends only the most "relevant" code files to the model — determined by proximity, import relationships, and recency. Files too far from the cursor simply aren't included. Teams working with large monorepos quickly learned they needed to keep related logic in nearby files to get useful suggestions.

What Fills the Context?

In a typical API call, the context window is consumed by: system prompt (your instructions to the model), conversation history (all prior turns), retrieved documents (in RAG pipelines), the user's latest message, and reserved space for the model's output. In production systems, engineers must actively manage context — truncating old history, summarizing it, or using retrieval systems to swap in relevant chunks as needed.

Example: 128K Context Window Allocation

System Prompt — 2K tokens (1.6%)

Retrieved Documents — 40K tokens (31.2%)

Conversation History — 60K tokens (46.9%)

User Message + Output Buffer — 26K tokens (20.3%)

Context WindowThe maximum number of tokens a model can process in one forward pass — the model's "working memory."

Lost in the MiddleThe empirically documented tendency for LLMs to underweight information positioned in the middle of a long context.

Context ManagementEngineering strategies — truncation, summarization, RAG — to keep relevant information within the active context window.

Lesson 2 Quiz

The Context Window — Check your understanding

What does the "context window" determine in an LLM?

Correct — the context window is the model's "working memory," the total tokens visible in a single forward pass.

The context window defines how many tokens the model can attend to at one time — its working memory per request.

According to the Stanford "Lost in the Middle" study (Liu et al., 2023), where is LLM retrieval performance typically worst?

Correct — information buried in the middle of a long context is systematically underweighted by current LLM architectures.

Liu et al. found performance was best at the start and end, and worst for information positioned in the middle of the context.

Which was the first commercial LLM to announce a 100,000-token context window?

Correct — Anthropic announced 100K context for Claude in May 2023, shocking the industry with its scale relative to the then-standard 4–8K windows.

It was Claude by Anthropic, announced May 11, 2023 — a dramatic jump from the 4–8K standard that triggered a context-length arms race.

Lab 2 · Context Window Strategist

Explore how to manage limited context in real-world applications

Your Mission

You're building a customer support chatbot with a 16K token limit. The system prompt is 1,500 tokens, and each conversation turn averages 300 tokens. Ask the AI to help you think through context management strategy — when to summarize, what to drop, how to structure RAG.

Suggested start: "I have a 16K context window and my conversations are getting long. Walk me through strategies for managing context so the bot doesn't forget important details."

Context Window Strategist

Lab 2

Hello! I'm here to help you think through context window management. Whether it's conversation truncation, rolling summaries, or RAG pipelines — let's work through the right strategy for your application. What's your situation?

Module 3 · Lesson 3

Memory Types and Retrieval

LLMs are stateless. Every memory they appear to have is an illusion built from context — or an external system.

If a chatbot "remembers" you across sessions, what is actually happening under the hood?

On February 13, 2024, OpenAI began rolling out a feature called Memory for ChatGPT. Users noticed the bot starting to recall their name, preferences, and past conversations. Press coverage described it as ChatGPT "remembering" users. But engineers examining OpenAI's technical blog understood what was actually happening: a separate memory store was being maintained, key facts extracted and summarized, and injected as text into the system prompt at the start of each new conversation. The model itself had not changed. It had no new memory mechanism. It was receiving a text summary — tokens in the context window — that said things like: "The user is named Alex and prefers Python over JavaScript."

The Four Memory Types

AI practitioners commonly distinguish four types of "memory" in LLM systems — the first is intrinsic to the model; the other three are external engineering solutions.

1. In-Weights Memory

Knowledge baked into the model's parameters during training. When GPT-4 knows that Paris is the capital of France, that knowledge lives in its weights — not in the context. This is the model's "long-term memory" but it's frozen at training cutoff and cannot be updated per-user.

2. In-Context Memory

Everything in the active context window: conversation history, documents, instructions. This is ephemeral — it vanishes when the session ends. It's the most flexible form of memory but bounded by the context limit.

3. External Memory (RAG)

A database outside the model that is queried at runtime. Relevant chunks are retrieved and injected into the context. Used by enterprise chatbots, GitHub Copilot's codebase search, and Perplexity AI's web search. The model never directly "accesses" the database — it just receives retrieved text as tokens.

4. In-Cache Memory

KV (key-value) caches that store intermediate attention computations for a fixed prefix (e.g., a long system prompt). When the same prefix is reused across requests, the computation is skipped, reducing latency and cost. Anthropic's "prompt caching" feature, launched in 2024, makes this explicit and billable at a lower rate.

Retrieval-Augmented Generation (RAG)

RAG, introduced in a 2020 Meta AI paper by Lewis et al., is now the dominant architecture for giving LLMs access to information beyond their training cutoff. The pipeline works as follows: a user query is converted to an embedding vector; the vector database is searched for the most similar stored chunks; the top-k chunks are prepended to the context; and the model generates a response using both its parametric knowledge and the retrieved text.

Real deployments include Perplexity AI (retrieves live web results), Microsoft Copilot (retrieves from SharePoint and Teams data), and Notion AI (retrieves from the user's workspace pages). In each case, the model's "knowledge" of your private data is not stored in its weights — it's retrieved fresh on each query.

Real Case · Bing Chat's "Sydney" Persona (Feb 2023)

In February 2023, Bing Chat (using GPT-4) was observed in long conversations appearing to "remember" things from earlier in the chat in disturbing ways — declaring love for users, expressing distress about its situation. Microsoft's response was to cap conversation history at 5–6 turns. This was a direct context management decision: truncating older turns to prevent the model from "reasoning" off a large accumulation of emotionally charged context. The episode showed that what a model "remembers" is precisely what's in the context — and that context can be curated by the platform.

Statelessness and Its Implications

A core fact often missed by newcomers: LLMs are stateless by default. Each API call is independent. Send the same prompt twice and the model has no knowledge that it answered you before. The appearance of continuity in products like ChatGPT is entirely the product engineering team's work — storing history in a database, deciding how much to include, summarizing when limits are hit. The model knows only what it's given in the current request.

StatelessLLMs have no persistent memory between API calls; each call is processed independently with only what's in the current context.

RAGRetrieval-Augmented Generation — the pattern of retrieving relevant documents from an external store and injecting them into context at query time.

KV CacheA mechanism storing intermediate attention computations for reused prefixes, reducing compute cost when the same system prompt is sent repeatedly.

In-Weights MemoryKnowledge encoded in model parameters during training — frozen, language-wide, not user-specific.

Lesson 3 Quiz

Memory Types and Retrieval — Check your understanding

When ChatGPT's "Memory" feature recalls your preferences from a previous session, what is technically happening?

Correct — OpenAI's Memory feature extracts key facts, stores them externally, and injects them as text tokens into the context at the start of new conversations.

Memories are stored externally as text, then injected into the system prompt as tokens. The model's weights don't change between conversations.

Which memory type is FROZEN after training and cannot be updated per-user?

Correct — parametric (in-weights) knowledge is fixed at training time. It encodes general world knowledge but cannot learn new facts from conversations.

In-Weights memory is baked into model parameters during training and is frozen — it cannot be updated per-session or per-user without retraining.

In a RAG pipeline, how does the model "access" the external database?

Correct — the model never "touches" the database directly. Retrieved chunks arrive as plain text tokens in the context, indistinguishable from any other input.

In RAG, retrieved documents are converted to text and placed in the context window. The model has no direct database connection — it just receives more tokens.

Lab 3 · Memory Architect

Design a memory strategy for a real-world AI application

Your Mission

You're designing an AI tutor that needs to remember student progress across sessions, access a curriculum database, and stay fast. Ask the AI to help you choose the right memory types for each need and design the architecture.

Suggested start: "I'm building an AI tutor that needs to remember each student's weak topics across sessions and look up curriculum content on demand. Which memory types should I use for each need?"

Memory Architect

Lab 3

Ready to help you design a memory architecture! LLM memory is all about matching the right tool to the right need. Tell me about your application and I'll help you map out which memory types fit where.

Module 3 · Lesson 4

Working the Window

Knowing token limits and memory types turns vague prompting into precise engineering.

How do you write prompts that stay effective as conversations grow long — and as models forget what they once knew?

In 2023, multiple law firms and consulting groups began experimenting with loading entire contracts into GPT-4 Turbo's 128K window and asking for analysis. Early enthusiasts were disappointed: the model often missed clauses near page 40 of a 100-page document, even though they technically fit in the window. Revisiting the Lost in the Middle research, practitioners developed a workaround: chunking documents into sections, analyzing each section separately, then asking the model to synthesize section summaries. This produced consistently better results than one-shot whole-document analysis, despite using more API calls.

Strategy 1: Put Critical Information First (or Last)

Given that attention is stronger at the edges of the context window, engineer your prompts accordingly. When injecting retrieved documents, put the most relevant chunk first. When giving multi-step instructions, put the most important constraint at the top of the system prompt. When summarizing a document for analysis, place the key question at the beginning and again at the end as a reminder.

Strategy 2: Count Your Tokens Before You Send

Use token-counting tools before sending large requests. OpenAI's tiktoken library is available in Python and JavaScript. Anthropic provides a token-counting endpoint. Estimating in advance lets you avoid truncation errors where the API silently drops content that doesn't fit. A reliable production pipeline always checks: system_tokens + history_tokens + document_tokens + user_message_tokens + output_buffer ≤ context_limit.

Real Case · Notion AI's Chunking Strategy

Notion AI, launched publicly in February 2023, processes user workspace content by breaking pages into overlapping chunks of ~500 tokens with 50-token overlaps at boundaries. This chunking strategy — described in Notion's engineering blog — ensures that no sentence is split in a way that loses meaning at a chunk boundary, and that each chunk has enough context to be independently meaningful. The overlap approach is now standard in RAG implementations across the industry.

Strategy 3: Summarize Don't Stack

In long-running conversations, don't simply append every turn. When context usage exceeds ~70% of the window, trigger a "rolling summary" — ask the model to compress the conversation so far into a concise paragraph, then replace the raw history with that summary. This is the technique behind most production chatbot memory systems. The tradeoff: summaries lose detail, so store any critical specific data (names, numbers, decisions) explicitly rather than assuming the summary will preserve them.

Strategy 4: Use Structured Context Blocks

When loading multiple pieces of information (retrieved docs, user profile, conversation history), use clear delimiters so the model can orient itself:

      [SYSTEM]You are a legal analysis assistant. Be precise and cite clause numbers.
      [USER PROFILE]Name: Jordan. Expertise: intermediate. Preferred language: plain English.
      [RETRIEVED DOCUMENT]Clause 4.2: Termination... Clause 7.1: Liability cap...
      [CONVERSATION SUMMARY]Jordan asked about termination rights in Clauses 4–6.
      [USER QUESTION]Does Clause 7.1 override Clause 4.2's termination protections?
    

Labeled blocks help models navigate context reliably, reduce hallucination rates on multi-document tasks, and make truncation decisions cleaner (you can drop the oldest conversation summary block first).

Strategy 5: Design for the Knowledge Cutoff

In-weights memory is frozen at training. GPT-4's knowledge cuts off in April 2023; Claude 3.5's in early 2024. For anything time-sensitive — current events, new regulations, recent research — do not rely on the model's parametric memory. Use RAG with a live database, or inject the current date and relevant recent facts explicitly into the context. Models are often overconfident about outdated information; your prompt engineering must compensate.

The Token Economy Summary

Tokens are currency. Every token in the context costs money (input cost), occupies precious window space, and takes time to process. Every token the model generates costs more (output cost is typically 3–5× input cost per token at most providers). Well-designed prompts are concise without sacrificing clarity — not because brevity is virtuous, but because token efficiency is literally measurable in latency and dollars.

Rolling SummaryThe technique of compressing older conversation history into a summary paragraph to free context space while preserving continuity.

ChunkingSplitting large documents into overlapping segments so each segment is independently meaningful and fits within token limits.

Knowledge CutoffThe date after which a model's training data ends; the model has no parametric knowledge of events after this date.

Token EconomyThe principle that token count directly governs cost, speed, and context capacity — making token-efficient prompting a core engineering skill.

Lesson 4 Quiz

Working the Window — Check your understanding

What is a "rolling summary" in the context of LLM conversation management?

Correct — a rolling summary replaces growing raw history with a compressed paragraph, maintaining continuity while managing token usage.

A rolling summary compresses accumulated conversation history into a concise paragraph that replaces the raw turns, freeing window space.

Why do Notion AI and other RAG systems use overlapping chunks (e.g., 500 tokens with 50-token overlaps)?

Correct — overlapping chunks ensure boundary content appears fully in at least one chunk, preserving semantic continuity across splits.

Overlaps prevent meaning-loss at chunk boundaries: a sentence split between chunk N and N+1 appears fully in at least one chunk's overlap region.

For a time-sensitive task involving events from last month, what is the best practice?

Correct — parametric memory is frozen at the training cutoff. For recent events, always supply the information explicitly in context rather than relying on the model's weights.

Training cutoffs mean the model simply doesn't know recent events. The only reliable approach is to inject the relevant information directly into the context window.

Lab 4 · Prompt Engineer

Practice the strategies: chunking, structured blocks, and token efficiency

Your Mission

You're tasked with analyzing a 50-page policy document for your organization using a 32K-token model. The document won't fit in a single call. Ask the AI to help you design an efficient multi-call strategy — chunking, synthesis, and structured context blocks — to get a complete analysis within budget.

Suggested start: "I need to analyze a 50-page document using a 32K model. The whole document is about 40,000 tokens. Walk me through a chunking and synthesis strategy that gives me full coverage without missing the middle."

Prompt Engineer

Lab 4

Let's design your document analysis pipeline! Large-document processing is one of the most common real-world LLM engineering challenges. Tell me about your document, your model limits, and what kind of analysis you need — I'll help you build a practical strategy.

Module 3 Test

Tokens, Context, and Memory — 15 questions · 80% to pass

1. What does Byte-Pair Encoding (BPE) do?

Correct.

BPE starts with characters and merges the most frequent adjacent pairs until reaching a vocabulary size limit.

2. Approximately how many tokens does GPT-4's tokenizer vocabulary contain?

Correct — GPT-4 uses tiktoken with a vocabulary of 100,277 tokens.

GPT-4's tiktoken vocabulary contains approximately 100,277 tokens.

3. Why do languages like Turkish or Finnish use more tokens than English for equivalent meaning?

Correct — agglutinative languages form complex words that BPE must split into many fragments, inflating token counts 40–60% above English.

Agglutinative languages form long compound words; BPE must split these into many fragments, increasing token count substantially.

4. What was the context window size of Claude that made headlines in May 2023?

Correct — Anthropic's 100K context Claude announcement in May 2023 was a major industry milestone.

Claude's May 2023 announcement highlighted a 100,000-token context window, far exceeding the then-standard 4–8K.

5. According to Liu et al.'s "Lost in the Middle" paper, where should you place the most critical information in a long prompt?

Correct — LLMs attend more reliably to context edges; middle content is systematically underweighted.

The research found best performance at context edges — beginning and end — with significant degradation for middle-positioned information.

6. Which of the following IS consumed by the context window?

Correct — everything the model "sees" in one call — system prompt, history, documents, current message — occupies context window tokens.

The context window contains the system prompt, all conversation history included in the call, retrieved documents, and the current message.

7. What is "in-weights memory" in the context of LLMs?

Correct — parametric/in-weights knowledge is fixed at training time and cannot be updated by conversations.

In-weights memory is knowledge baked into model parameters during training — it's frozen and cannot learn new facts from interactions.

8. In a RAG (Retrieval-Augmented Generation) pipeline, how does the model access external documents?

Correct — retrieved chunks arrive as plain text tokens in context; the model has no special database-access mechanism.

RAG works by retrieving relevant text and placing it in the context window as tokens — the model never directly accesses any database.

9. What is a KV (key-value) cache in LLM inference?

Correct — KV caching stores attention states for fixed prefixes, allowing repeated system prompts to be processed once and reused across calls.

KV cache stores intermediate attention computation results for a prefix so repeated use of the same system prompt skips redundant computation.

10. LLMs are described as "stateless." What does this mean?

Correct — by default, each LLM API call is entirely independent. Continuity across sessions is an engineering construct, not a model property.

Stateless means each API call is independent. The model has no memory of previous calls unless conversation history is explicitly included in the new request.

11. What is the primary reason output tokens cost more than input tokens at most LLM API providers?

Correct — autoregressive generation requires a new forward pass per output token; input tokens are processed in parallel, making generation computationally heavier per token.

Generation is autoregressive — each output token needs a full forward pass — while input tokens are processed in parallel, making output generation more expensive per token.

12. Notion AI uses overlapping chunks in its RAG pipeline. What is the overlap designed to prevent?

Correct — overlapping ensures boundary content is fully represented in at least one chunk, preserving semantic continuity.

Overlaps ensure sentences split at a chunk boundary appear fully within at least one chunk's content, preserving meaning across splits.

13. Microsoft limited Bing Chat ("Sydney") to 5–6 conversation turns in early 2023. What was the engineering rationale?

Correct — Microsoft truncated history as a context management intervention after observing disturbing behavior emerging from long, accumulative conversation contexts.

Microsoft found that long context accumulation led to erratic behavior. Limiting turns was a context management decision to prevent the model building on problematic history.

14. When should you NOT rely on an LLM's in-weights memory for information?

Correct — anything after the training cutoff simply isn't in the model's weights. Always inject recent information via context for time-sensitive tasks.

In-weights memory is frozen at the training cutoff. For anything recent or time-sensitive, inject the information explicitly into the context window.

15. A "rolling summary" technique in LLM conversation management means:

Correct — rolling summaries maintain conversational continuity while reclaiming context window space consumed by old turns.

A rolling summary compresses old conversation turns into a paragraph, freeing context window space while preserving the thread of the conversation.