When OpenAI published pricing for GPT-4 in March 2023, it listed costs per 1,000 tokens — not per word, not per character, not per sentence. Journalists and developers scrambled to answer the same question: what exactly is a token? OpenAI pointed users to a public tool called the Tokenizer, which visualizes how any text is sliced before it enters the model. The answer was weirder than most expected.
A token is the fundamental unit of text that an LLM processes. Tokens are neither words nor characters — they are subword pieces produced by an algorithm called Byte-Pair Encoding (BPE), developed by researchers at Sennrich et al. in 2016 and adopted by GPT models from GPT-2 onward.
BPE starts with individual characters and repeatedly merges the most frequent adjacent pairs until it reaches a fixed vocabulary size. GPT-4 uses a vocabulary of roughly 100,277 tokens. Common short words like "the" become a single token. Long, rare words get split: "tokenization" might become ["token", "ization"]. Punctuation, spaces, and even newlines each count.
The sentence "Tokens aren't words." breaks down like this under GPT-4's tiktoken library:
Five tokens for four visible words. The space before "aren" is folded into the token itself — a quirk of GPT tokenization that surprises most newcomers. This also explains why LLMs struggle to count letters: the model never "sees" individual characters; it sees pre-merged chunks.
In mid-2024, users widely reported that GPT-4 claimed 9.11 is greater than 9.9. Analysts noted the model tokenizes "9.11" and "9.9" as number strings, not floating-point values. The model's arithmetic failure traces directly to token-level representation: it never sees a numeric magnitude, only a sequence of digit tokens it must reason about via learned patterns — patterns that can mislead on unusual comparisons.
Token counts determine three things: cost (most APIs charge per token), speed (fewer tokens = faster response), and context limits (the model can only hold a fixed number of tokens at once — more on this in L2). As a practical rule of thumb, 1,000 tokens ≈ 750 English words, though this varies by language. Turkish and Finnish, which form long compound words, can use 40–60% more tokens than English for the same meaning.
When estimating token counts: English prose ≈ 1 token per 4 characters, or 1 token per ~0.75 words. Code is denser — Python averages closer to 1 token per 3.5 characters due to special characters and indentation.
Ask the AI assistant to explain how different types of text get tokenized — compare English vs. another language, or ask why code uses more tokens than prose. Try at least 3 exchanges.
On May 11, 2023, Anthropic announced Claude with a 100,000-token context window — roughly the length of the entire novel The Great Gatsby times five. The announcement was framed around a specific demo: loading a 240-page technical document and asking questions about specific passages. It made headlines because the previous de-facto standard for commercial models was 4,096 tokens. Within months, Google announced Gemini 1.5 with a one-million-token context. The context window arms race had begun.
The context window (also called the context length) is the maximum number of tokens a model can attend to simultaneously during a single forward pass. Everything inside the window — your system prompt, the conversation history, your latest message, and the model's response being generated — must fit within this limit.
Think of it as working memory. The model processes all tokens in the window at once through its attention mechanism, weighing how every token relates to every other. Tokens outside the window simply do not exist to the model — it has no way to reference them.
| Model | Release | Context Window | Approx. Pages |
|---|---|---|---|
| GPT-2 | 2019 | 1,024 tokens | ~1–2 pages |
| GPT-3 | 2020 | 4,096 tokens | ~5–6 pages |
| GPT-4 (original) | 2023 | 8,192 tokens | ~12 pages |
| Claude (v1) | 2023 | 100,000 tokens | ~300 pages |
| GPT-4 Turbo | 2023 | 128,000 tokens | ~400 pages |
| Gemini 1.5 Pro | 2024 | 1,000,000 tokens | ~3,000 pages |
| Claude 3.5 Sonnet | 2024 | 200,000 tokens | ~600 pages |
A landmark Stanford study published in 2023 — "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al.) — found a surprising pattern: LLMs perform best when relevant information appears at the beginning or end of a long context, and significantly worse when it appears in the middle. In retrieval tasks with 20 documents, performance dropped by up to 20 percentage points for middle-positioned content.
This means that even if a model has a 200,000-token window, burying a critical fact in the middle of a huge document can cause the model to miss or underweight it. The practical implication: put your most important information first or last.
GitHub Copilot, which uses GPT-4 under the hood, applies context window management automatically. As of 2023, it sends only the most "relevant" code files to the model — determined by proximity, import relationships, and recency. Files too far from the cursor simply aren't included. Teams working with large monorepos quickly learned they needed to keep related logic in nearby files to get useful suggestions.
In a typical API call, the context window is consumed by: system prompt (your instructions to the model), conversation history (all prior turns), retrieved documents (in RAG pipelines), the user's latest message, and reserved space for the model's output. In production systems, engineers must actively manage context — truncating old history, summarizing it, or using retrieval systems to swap in relevant chunks as needed.
You're building a customer support chatbot with a 16K token limit. The system prompt is 1,500 tokens, and each conversation turn averages 300 tokens. Ask the AI to help you think through context management strategy — when to summarize, what to drop, how to structure RAG.
On February 13, 2024, OpenAI began rolling out a feature called Memory for ChatGPT. Users noticed the bot starting to recall their name, preferences, and past conversations. Press coverage described it as ChatGPT "remembering" users. But engineers examining OpenAI's technical blog understood what was actually happening: a separate memory store was being maintained, key facts extracted and summarized, and injected as text into the system prompt at the start of each new conversation. The model itself had not changed. It had no new memory mechanism. It was receiving a text summary — tokens in the context window — that said things like: "The user is named Alex and prefers Python over JavaScript."
AI practitioners commonly distinguish four types of "memory" in LLM systems — the first is intrinsic to the model; the other three are external engineering solutions.
Knowledge baked into the model's parameters during training. When GPT-4 knows that Paris is the capital of France, that knowledge lives in its weights — not in the context. This is the model's "long-term memory" but it's frozen at training cutoff and cannot be updated per-user.
Everything in the active context window: conversation history, documents, instructions. This is ephemeral — it vanishes when the session ends. It's the most flexible form of memory but bounded by the context limit.
A database outside the model that is queried at runtime. Relevant chunks are retrieved and injected into the context. Used by enterprise chatbots, GitHub Copilot's codebase search, and Perplexity AI's web search. The model never directly "accesses" the database — it just receives retrieved text as tokens.
KV (key-value) caches that store intermediate attention computations for a fixed prefix (e.g., a long system prompt). When the same prefix is reused across requests, the computation is skipped, reducing latency and cost. Anthropic's "prompt caching" feature, launched in 2024, makes this explicit and billable at a lower rate.
RAG, introduced in a 2020 Meta AI paper by Lewis et al., is now the dominant architecture for giving LLMs access to information beyond their training cutoff. The pipeline works as follows: a user query is converted to an embedding vector; the vector database is searched for the most similar stored chunks; the top-k chunks are prepended to the context; and the model generates a response using both its parametric knowledge and the retrieved text.
Real deployments include Perplexity AI (retrieves live web results), Microsoft Copilot (retrieves from SharePoint and Teams data), and Notion AI (retrieves from the user's workspace pages). In each case, the model's "knowledge" of your private data is not stored in its weights — it's retrieved fresh on each query.
In February 2023, Bing Chat (using GPT-4) was observed in long conversations appearing to "remember" things from earlier in the chat in disturbing ways — declaring love for users, expressing distress about its situation. Microsoft's response was to cap conversation history at 5–6 turns. This was a direct context management decision: truncating older turns to prevent the model from "reasoning" off a large accumulation of emotionally charged context. The episode showed that what a model "remembers" is precisely what's in the context — and that context can be curated by the platform.
A core fact often missed by newcomers: LLMs are stateless by default. Each API call is independent. Send the same prompt twice and the model has no knowledge that it answered you before. The appearance of continuity in products like ChatGPT is entirely the product engineering team's work — storing history in a database, deciding how much to include, summarizing when limits are hit. The model knows only what it's given in the current request.
You're designing an AI tutor that needs to remember student progress across sessions, access a curriculum database, and stay fast. Ask the AI to help you choose the right memory types for each need and design the architecture.
In 2023, multiple law firms and consulting groups began experimenting with loading entire contracts into GPT-4 Turbo's 128K window and asking for analysis. Early enthusiasts were disappointed: the model often missed clauses near page 40 of a 100-page document, even though they technically fit in the window. Revisiting the Lost in the Middle research, practitioners developed a workaround: chunking documents into sections, analyzing each section separately, then asking the model to synthesize section summaries. This produced consistently better results than one-shot whole-document analysis, despite using more API calls.
Given that attention is stronger at the edges of the context window, engineer your prompts accordingly. When injecting retrieved documents, put the most relevant chunk first. When giving multi-step instructions, put the most important constraint at the top of the system prompt. When summarizing a document for analysis, place the key question at the beginning and again at the end as a reminder.
Use token-counting tools before sending large requests. OpenAI's tiktoken library is available in Python and JavaScript. Anthropic provides a token-counting endpoint. Estimating in advance lets you avoid truncation errors where the API silently drops content that doesn't fit. A reliable production pipeline always checks: system_tokens + history_tokens + document_tokens + user_message_tokens + output_buffer ≤ context_limit.
Notion AI, launched publicly in February 2023, processes user workspace content by breaking pages into overlapping chunks of ~500 tokens with 50-token overlaps at boundaries. This chunking strategy — described in Notion's engineering blog — ensures that no sentence is split in a way that loses meaning at a chunk boundary, and that each chunk has enough context to be independently meaningful. The overlap approach is now standard in RAG implementations across the industry.
In long-running conversations, don't simply append every turn. When context usage exceeds ~70% of the window, trigger a "rolling summary" — ask the model to compress the conversation so far into a concise paragraph, then replace the raw history with that summary. This is the technique behind most production chatbot memory systems. The tradeoff: summaries lose detail, so store any critical specific data (names, numbers, decisions) explicitly rather than assuming the summary will preserve them.
When loading multiple pieces of information (retrieved docs, user profile, conversation history), use clear delimiters so the model can orient itself:
Labeled blocks help models navigate context reliably, reduce hallucination rates on multi-document tasks, and make truncation decisions cleaner (you can drop the oldest conversation summary block first).
In-weights memory is frozen at training. GPT-4's knowledge cuts off in April 2023; Claude 3.5's in early 2024. For anything time-sensitive — current events, new regulations, recent research — do not rely on the model's parametric memory. Use RAG with a live database, or inject the current date and relevant recent facts explicitly into the context. Models are often overconfident about outdated information; your prompt engineering must compensate.
Tokens are currency. Every token in the context costs money (input cost), occupies precious window space, and takes time to process. Every token the model generates costs more (output cost is typically 3–5× input cost per token at most providers). Well-designed prompts are concise without sacrificing clarity — not because brevity is virtuous, but because token efficiency is literally measurable in latency and dollars.
You're tasked with analyzing a 50-page policy document for your organization using a 32K-token model. The document won't fit in a single call. Ask the AI to help you design an efficient multi-call strategy — chunking, synthesis, and structured context blocks — to get a complete analysis within budget.