In late 2023, Cursor AI, the AI-powered code editor, faced a widely discussed cost crisis. Early versions sent entire codebases as context on nearly every keystroke-triggered completion. Engineering teams analyzing their Claude and GPT-4 API bills reported per-user monthly costs exceeding $50–$100 at scale, a figure that made the $20/month subscription arithmetically unworkable. The team's public post-mortems named indiscriminate token usage as the core issue: they were paying for tokens the model never needed. Aggressive context windowing and retrieval-augmented generation cut their effective cost per session by more than 60% without measurable quality regression.
This is not unique to Cursor. Every agent system that survives past prototype stage confronts the same arithmetic: tokens are the unit of production cost, and production cost determines whether your product exists.
LLM APIs price input tokens (everything you send: system prompt, conversation history, retrieved documents, tool schemas) separately from output tokens (what the model generates). As of mid-2024, Anthropic's Claude 3.5 Sonnet charges $3 per million input tokens and $15 per million output tokens. OpenAI's GPT-4o charges $5 input / $15 output. These numbers shift, but the asymmetry is consistent: an output token costs 3–5× as much as an input token.
For an agent that makes 10 LLM calls per user task, with an average context of 4,000 tokens per call and 500 output tokens, a single task costs approximately 40,000 input + 5,000 output tokens. At Claude 3.5 Sonnet pricing that is $0.12 + $0.075 = $0.195 per task. At 10,000 daily tasks, that is $1,950/day — $712,000/year — for one agent workflow. Token optimization is therefore a core business function, not an engineering nicety.
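The arithmetic above can be reproduced with a small calculator. This is a minimal sketch: `Pricing` and `task_cost` are illustrative names, and the prices are the mid-2024 figures quoted in the text, which may no longer be current.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Mid-2024 list price quoted in the text; verify against current pricing before reuse.
CLAUDE_35_SONNET = Pricing(input_per_million=3.0, output_per_million=15.0)


def task_cost(calls: int, input_per_call: int, output_per_call: int, pricing: Pricing) -> float:
    """Cost in USD of one task made of `calls` LLM calls of equal size."""
    input_cost = calls * input_per_call * pricing.input_per_million / 1_000_000
    output_cost = calls * output_per_call * pricing.output_per_million / 1_000_000
    return input_cost + output_cost


per_task = task_cost(calls=10, input_per_call=4_000, output_per_call=500, pricing=CLAUDE_35_SONNET)
per_day = per_task * 10_000   # 10,000 daily tasks
per_year = per_day * 365
print(f"per task ${per_task:.3f}, per day ${per_day:,.0f}, per year ${per_year:,.0f}")
```

Changing a single argument (for example halving `input_per_call`) shows how directly context size drives the annual figure.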
Agent systems compound token costs in ways single-call applications do not. Retry logic, tool call roundtrips, chain-of-thought reasoning, and multi-agent orchestration all multiply the base token count. A naive ReAct-style agent solving a 5-step task may issue 8–12 LLM calls, each carrying the full prior context. The effective token spend per "task" can be 10–20× what a naive estimate predicts.
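The compounding effect can be seen with a short simulation. This is a sketch under simplifying assumptions: every call resends the whole accumulated context, and each call adds a fixed number of tokens (model output plus tool result) to the next call's input. The function name and the numbers are hypothetical.

```python
def cumulative_input_tokens(base_context: int, added_per_call: int, num_calls: int) -> int:
    """Total input tokens when each call resends everything accumulated so far."""
    total = 0
    context = base_context
    for _ in range(num_calls):
        total += context            # this call pays for the whole current context
        context += added_per_call   # its output and tool result join the next call's input
    return total


# Hypothetical 10-call run: 2,000-token starting context, 400 tokens added per call.
accumulated = cumulative_input_tokens(base_context=2_000, added_per_call=400, num_calls=10)
single_call_estimate = 2_000 + 400
print(accumulated, round(accumulated / single_call_estimate, 1))
```

With these made-up numbers the run consumes 38,000 input tokens, roughly 16 times the 2,400-token estimate for a single call.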
Context window size has increased dramatically (Claude 3.5 Sonnet supports 200K tokens, GPT-4o supports 128K), but larger windows do not solve cost. They enable new capabilities while simultaneously enabling new waste. A 200K context filled carelessly is simply a more expensive mistake.
Observability teams at companies running large agent deployments have consistently identified the same categories of waste. Understanding these patterns is the prerequisite to eliminating them.
A 2024 analysis by the team at LangSmith (LangChain's observability product) found that across monitored agent deployments, 38% of input tokens were classified as "low-relevance": content that appeared in context but had almost no measurable influence on outputs. This is not theoretical waste; it is measured waste.
Production agent engineers think in token budgets the same way frontend engineers think in kilobytes. Every component of your prompt gets a budget: system prompt ≤ 500 tokens, retrieved context ≤ 2,000 tokens, conversation history ≤ 1,500 tokens, tool schemas ≤ 600 tokens. The sum is your call budget. Anything over budget triggers compression before the call is made.
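A per-component budget check along these lines might look as follows. It is a minimal sketch: the budget numbers are the ones given above, while `estimate_tokens` uses a rough four-characters-per-token heuristic that is an assumption, not a real tokenizer, and `components_over_budget` is an illustrative name.

```python
BUDGETS = {
    "system_prompt": 500,
    "retrieved_context": 2_000,
    "conversation_history": 1_500,
    "tool_schemas": 600,
}
CALL_BUDGET = sum(BUDGETS.values())  # total tokens allowed per call


def estimate_tokens(text: str) -> int:
    # Assumption: about 4 characters per token, rounded up.
    # Swap in the provider's tokenizer for real counts.
    return (len(text) + 3) // 4


def components_over_budget(components: dict[str, str]) -> dict[str, int]:
    """Return {component: tokens over budget} for every component that exceeds its budget."""
    overages = {}
    for name, text in components.items():
        excess = estimate_tokens(text) - BUDGETS[name]
        if excess > 0:
            overages[name] = excess
    return overages
```

Any component returned by `components_over_budget` is a candidate for compression before the call is sent.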
Anthropic's documentation on prompt engineering explicitly recommends auditing your system prompt for redundancy before deployment. The OpenAI Cookbook contains a worked example showing a customer service system prompt reduced from 1,847 to 612 tokens with identical benchmark performance. These are not edge cases — they are the norm when prompts are written without a cost lens.
Below is a realistic (bloated) system prompt used by an early-stage customer support agent. Your job is to work with the AI tutor to:
Ask the tutor to walk through the audit with you, or start by naming one waste pattern you can already see.
In 2024, the Cognition AI team — builders of the Devin software engineering agent — published details on how they managed context for long-running coding sessions. Devin's tasks regularly spanned hours and involved hundreds of tool calls, file reads, terminal outputs, and browser observations. Naively accumulating all of this would have exceeded any model's context window within the first 30 minutes of a complex task.
Their solution was a tiered context architecture: a small "active scratchpad" of recent actions, a compressed "episode summary" of completed subtasks, and a retrieval layer for specific earlier details on demand. The active scratchpad contained the last 3–5 actions verbatim. Older content was compressed into structured summaries before being stored. This allowed sessions of arbitrary length while keeping per-call token counts bounded and predictable.
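The three tiers described above can be sketched as a small class. This is an illustration of the general pattern, not Cognition's implementation: the class name, the keyword-based retrieval, and the injected `summarize` callable (which would wrap an LLM call in practice) are all assumptions.

```python
from collections import deque
from typing import Callable


class TieredContext:
    """Scratchpad of recent actions, summaries of finished subtasks, and an archive for lookup."""

    def __init__(self, summarize: Callable[[list[str]], str], scratchpad_size: int = 5):
        self.scratchpad: deque[str] = deque(maxlen=scratchpad_size)  # last N actions, verbatim
        self.episode_summaries: list[str] = []                      # one summary per finished subtask
        self.archive: list[str] = []                                 # everything, for on-demand retrieval
        self._current_episode: list[str] = []
        self._summarize = summarize

    def record(self, action: str) -> None:
        self.scratchpad.append(action)      # deque drops the oldest entry automatically
        self._current_episode.append(action)
        self.archive.append(action)

    def close_subtask(self) -> None:
        """Compress the actions of the finished subtask into one summary."""
        if self._current_episode:
            self.episode_summaries.append(self._summarize(self._current_episode))
            self._current_episode = []

    def retrieve(self, keyword: str, limit: int = 3) -> list[str]:
        """Naive keyword lookup; a real system would likely use embeddings."""
        hits = [a for a in self.archive if keyword.lower() in a.lower()]
        return hits[-limit:]

    def build_context(self) -> str:
        lines = ["Completed subtasks:", *self.episode_summaries, "Recent actions:", *self.scratchpad]
        return "\n".join(lines)
```

Because the scratchpad is bounded and each subtask contributes one summary, the size of `build_context()` grows with the number of subtasks rather than the number of actions.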
Context compression refers to any technique that reduces the token count of information before it enters the model's context window, while preserving the semantic content needed for the current task. The main approaches each have different trade-offs.
MemGPT (now Letta), published by researchers at UC Berkeley in 2023, implemented a hierarchical memory system for LLM agents inspired by OS virtual memory. Main context holds active working memory; a compressed archival memory stores older summaries; retrieval queries fetch from archival on demand. Their benchmarks showed a 2K-token context window supporting tasks for which naive approaches required 128K+ tokens.
The core tension in context compression is between token reduction and information fidelity. Aggressive compression that discards critical details produces agents that forget important constraints, repeat questions users already answered, or contradict their own earlier statements. The goal is not minimum tokens — it is minimum tokens sufficient for the task.
Several empirical findings from research help calibrate this trade-off. A 2024 paper from Stanford NLP ("LongAgent") found that for multi-hop reasoning tasks, selectively including the 3–5 most relevant prior turns produced answers of equal quality to full-history inclusion 87% of the time, at 23% of the token cost. The remaining 13% of cases, where quality dropped, were tasks requiring integration of widely separated information, a detectable pattern that can trigger a fallback to fuller context retrieval.
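Selecting the most relevant prior turns can be sketched as below. The scoring here is plain word-overlap (Jaccard similarity), chosen only to keep the example dependency-free; it is a stand-in for embedding-based relevance and is not the method used in the paper. The function names are hypothetical.

```python
import re


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_relevant_turns(turns: list[str], query: str, k: int = 3) -> list[str]:
    """Keep the k turns most similar to the query, returned in their original order."""
    query_words = _words(query)
    scored = []
    for index, turn in enumerate(turns):
        turn_words = _words(turn)
        union = query_words | turn_words
        score = len(query_words & turn_words) / len(union) if union else 0.0
        scored.append((score, index))
    # Highest score first; ties go to the more recent turn.
    top = sorted(scored, key=lambda pair: (-pair[0], -pair[1]))[:k]
    return [turns[index] for index in sorted(index for _, index in top)]


history = [
    "The deploy target is AWS us-east-1",
    "I like tea",
    "Budget is 500 dollars per month",
    "Weather is nice",
]
print(select_relevant_turns(history, "What is the deploy target and budget?", k=2))
```

A fallback to fuller context could be triggered when the top scores are all low, which suggests the needed information is spread thinly across the history.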
Design your compression tier to be task-aware, not one-size-fits-all. Simple Q&A tasks tolerate aggressive windowing. Multi-document synthesis tasks require more complete context. Instrument your agent to track task type and apply different compression policies accordingly. This is standard practice at companies like Cohere and AI21 Labs for their production conversational AI deployments.
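One way to express task-aware policies is a lookup table keyed by task type. The sketch below is illustrative only: the task type names, the turn counts, and the `plan_compression` helper are assumptions, not values from any cited deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompressionPolicy:
    keep_recent_turns: int  # turns kept verbatim
    summarize_older: bool   # True: summarize older turns; False: drop them


# Hypothetical policies; tune the numbers against your own quality measurements.
POLICIES = {
    "simple_qa": CompressionPolicy(keep_recent_turns=3, summarize_older=False),
    "multi_document_synthesis": CompressionPolicy(keep_recent_turns=20, summarize_older=True),
}
DEFAULT_POLICY = CompressionPolicy(keep_recent_turns=8, summarize_older=True)


def plan_compression(task_type: str, turns: list[str]) -> tuple[list[str], list[str]]:
    """Return (turns to summarize, turns to keep verbatim) for the given task type."""
    policy = POLICIES.get(task_type, DEFAULT_POLICY)
    cut = max(len(turns) - policy.keep_recent_turns, 0)
    to_summarize = turns[:cut] if policy.summarize_older else []
    return to_summarize, turns[cut:]
```

Unknown task types fall back to a middle-of-the-road default, so a misclassified task degrades gently instead of losing most of its history.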
A critical implementation detail: never compress in a way that destroys tool call results or structured data returned from external systems. These are often the highest-density information in a context window — small token count, high semantic value. System prompts and verbose LLM reasoning are the right compression targets; tool outputs and user statements are not.
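The rule above can be enforced mechanically by tagging each message with a kind and compressing only the kinds that are safe targets. This is a minimal sketch: the `kind` labels, the message shape, and the injected `compress` callable are assumptions made for illustration.

```python
from typing import Callable

# Kinds that are safe to compress; tool results and user statements are deliberately absent.
COMPRESSIBLE_KINDS = {"system", "assistant_reasoning"}


def compress_context(messages: list[dict], compress: Callable[[str], str]) -> list[dict]:
    """Apply `compress` only to compressible messages, leaving order and other messages intact."""
    result = []
    for message in messages:
        if message["kind"] in COMPRESSIBLE_KINDS:
            result.append({**message, "content": compress(message["content"])})
        else:
            result.append(message)  # tool results and user statements pass through unchanged
    return result
```

Using an allowlist of compressible kinds means any new message type is protected by default until someone decides it is safe to shrink.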
You are building a research assistant agent that helps analysts work through complex multi-day research projects. The agent can search the web, read documents, and maintain notes across sessions that can last hours.
Work with the tutor to design a complete context compression architecture for this agent. Address:
The tutor will challenge your design decisions and push you toward concrete token budget numbers.
Anthropic's Claude.ai introduced memory summarization for long conversations in 2024. When a conversation exceeds a threshold, the system generates a structured summary of earlier turns and replaces those turns in the context. The challenge Anthropic's team documented in their system prompt research was that naive summarization consistently lost what they called "soft constraints": user preferences stated casually early in a conversation ("I prefer bullet points" or "don't suggest solutions, just explore the problem") that weren't repeated but were load-bearing for user satisfaction. Their solution was a hybrid approach: a structured slot-filling pass for named entities and explicit facts, followed by a narrative summary that captured tone and implicit preferences.
The same problem appears in any production summarization pipeline. Summarization that preserves facts but loses intent creates agents that know what happened but behave as if they don't understand why.
A summarization pipeline is triggered when accumulated context exceeds a threshold, and it produces a compressed representation to replace the original. The pipeline has three distinct phases, each with its own failure modes.
Phase 1 — Segmentation. Divide the context into summarization units. Naive approaches summarize everything older than N turns uniformly. Better approaches identify natural boundaries: completed subtasks, topic shifts, or time gaps. Summarizing within a coherent episode produces better results than summarizing across episode boundaries.
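Segmentation on natural boundaries can be sketched as follows, using two of the signals mentioned above: time gaps and completed subtasks. The `Turn` shape, the `closes_subtask` flag, and the 30-minute default gap are assumptions for illustration; topic-shift detection is left out.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    timestamp: float             # seconds since session start
    text: str
    closes_subtask: bool = False  # hypothetical marker set when a subtask is finished


def segment(turns: list[Turn], max_gap_seconds: float = 1800.0) -> list[list[Turn]]:
    """Split turns into episodes at long time gaps and at completed subtasks."""
    episodes: list[list[Turn]] = []
    current: list[Turn] = []
    for turn in turns:
        # A long pause before this turn starts a new episode.
        if current and turn.timestamp - current[-1].timestamp > max_gap_seconds:
            episodes.append(current)
            current = []
        current.append(turn)
        # A turn that finishes a subtask ends the episode it belongs to.
        if turn.closes_subtask:
            episodes.append(current)
            current = []
    if current:
        episodes.append(current)
    return episodes
```

Each returned episode can then be summarized on its own, so no summary has to bridge an episode boundary.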
Phase 2 — Multi-pass Extraction. Before the abstractive summary pass, run a structured extraction to capture: named entities (people, systems, files), explicit decisions made, user preferences stated, constraints given, and open questions. This extraction can be done with a much cheaper, smaller model — even a locally-running model — because the task is classification and extraction, not generation.
Use a cheap local model (e.g., Llama 3 8B running via Ollama) for the extraction pass and reserve the expensive frontier model only for the abstractive narrative summary. This hybrid approach cuts summarization cost by 60–80% while maintaining quality on the extraction pass where a weaker model is sufficient.
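The two-model split can be sketched with both models passed in as plain callables. Everything here is an assumption made for illustration: the `Model` type (prompt in, text out), the prompt wording, and the JSON response format. No specific client library is implied; in practice `cheap_model` would wrap the local model and `frontier_model` the hosted one.

```python
import json
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: takes a prompt, returns completion text
FIELDS = ("entities", "decisions", "preferences", "constraints", "open_questions")


def extract_facts(segment: str, cheap_model: Model) -> dict[str, list[str]]:
    """Structured extraction pass, intended for a small, inexpensive model."""
    prompt = (
        "Return a JSON object with the keys " + ", ".join(FIELDS) + ". "
        "Each value is a list of short strings. "
        "Quote preferences and constraints verbatim.\n\n" + segment
    )
    try:
        parsed = json.loads(cheap_model(prompt))
    except json.JSONDecodeError:
        parsed = {}  # malformed output: fall back to empty fields rather than crash
    if not isinstance(parsed, dict):
        parsed = {}
    facts = {}
    for field in FIELDS:
        value = parsed.get(field, [])
        facts[field] = [str(item) for item in value] if isinstance(value, list) else []
    return facts


def summarize_segment(segment: str, cheap_model: Model, frontier_model: Model) -> dict:
    """Extraction on the cheap model, narrative summary on the frontier model."""
    facts = extract_facts(segment, cheap_model)
    narrative = frontier_model(
        "Write a short narrative summary of this conversation segment. "
        "Stay consistent with these extracted facts:\n"
        + json.dumps(facts, indent=2) + "\n\nSegment:\n" + segment
    )
    return {"facts": facts, "narrative": narrative}
```

The defensive parsing matters more with a small model, which is more likely to return malformed or partial JSON.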
Phase 3 — Summary Generation. The abstractive summary synthesizes extracted facts into a coherent narrative. Critical: include a summary header that explicitly lists user preferences, open constraints, and the current state of the task. These are the elements most commonly lost in naive summarization and most consequential when lost.
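A summary header of the kind described above could be rendered like this. The section labels and the `render_summary` function are illustrative choices, not a prescribed format.

```python
def render_summary(narrative: str, preferences: list[str], constraints: list[str], task_state: str) -> str:
    """Put preferences, constraints, and task state ahead of the narrative so they are never buried."""

    def bullets(items: list[str]) -> str:
        return "\n".join(f"- {item}" for item in items) if items else "- (none recorded)"

    return "\n".join([
        "USER PREFERENCES (verbatim):",
        bullets(preferences),
        "OPEN CONSTRAINTS:",
        bullets(constraints),
        f"CURRENT TASK STATE: {task_state}",
        "",
        "NARRATIVE SUMMARY:",
        narrative,
    ])
```

Writing "(none recorded)" explicitly, rather than omitting an empty section, makes it clear to the downstream agent that the field was checked and found empty.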
Production summarization pipelines require quality evaluation — you cannot trust that every summary is faithful without verification. The most practical approach is a lightweight "summary audit" prompt that presents the original context and the summary to a model and asks it to flag any factual discrepancies or omitted constraints. This adds tokens but prevents a class of subtle, hard-to-debug agent failures.
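A summary audit along these lines might be wired up as below. The prompt wording and the reply protocol ("OK" or one issue per line starting with "- ") are assumptions made for this sketch, and `model` is a hypothetical callable that takes a prompt and returns text.

```python
from typing import Callable


def build_audit_prompt(original: str, summary: str) -> str:
    return (
        "Compare the SUMMARY against the ORIGINAL conversation.\n"
        "List every factual discrepancy or omitted constraint, one per line, "
        "each starting with '- '.\n"
        "If there are none, reply with exactly: OK\n\n"
        f"ORIGINAL:\n{original}\n\nSUMMARY:\n{summary}"
    )


def audit_summary(original: str, summary: str, model: Callable[[str], str]) -> list[str]:
    """Return the list of issues the auditing model flags; an empty list means the summary passed."""
    response = model(build_audit_prompt(original, summary)).strip()
    if response == "OK":
        return []
    issues = [line[2:].strip() for line in response.splitlines() if line.startswith("- ")]
    # Fail closed: an unparseable reply is treated as a single issue, not as a pass.
    return issues or [response]
```

A non-empty result can block the replacement of the original turns, or trigger a second summarization attempt with the flagged items included.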
The open-source agent framework AutoGen (Microsoft Research) implements a ConversableAgent class whose max_consecutive_auto_reply parameter caps how many automatic replies an agent sends in a row; it is a termination limit, not a summarization hook. Chat summaries are requested separately, through the summary_method argument of initiate_chat. Teams extending AutoGen for production use, including published case studies from enterprise users at Microsoft, typically add a constraint-preservation pass before the summary to avoid lost negations and user preferences.
Finally: always store the original context segments before summarizing, at least temporarily. Summaries are lossy by design. When an agent starts behaving unexpectedly, being able to inspect the original context is essential for debugging. A cheap object storage bucket (S3, GCS) for raw conversation logs is a standard production pattern at companies running LLM agents at scale.
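The store-before-summarize ordering can be sketched as follows. The `store` argument is any dict-like object and stands in for an object storage bucket; the key scheme and function name are assumptions, and no real S3 or GCS client calls are shown.

```python
import hashlib
import json
from collections.abc import MutableMapping
from typing import Callable


def archive_then_summarize(
    turns: list[str],
    summarize: Callable[[list[str]], str],
    store: MutableMapping[str, str],
) -> dict:
    """Persist the raw turns first, then summarize, and link the summary back to its source."""
    raw = json.dumps(turns)
    key = "raw/" + hashlib.sha256(raw.encode("utf-8")).hexdigest()  # content-addressed key
    store[key] = raw  # write the original BEFORE producing the lossy summary
    return {"summary": summarize(turns), "source_key": key}
```

Keeping `source_key` next to the summary lets a debugging session go from an odd agent behavior straight back to the exact turns that were compressed.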
You need to write the actual summarization prompt that your agent will use when a conversation hits the compression threshold. The conversation below contains several soft constraints that a naive summarizer would lose:
Share your summarization prompt with the tutor. It should: preserve soft constraints verbatim in a dedicated section, handle negations explicitly, and produce a summary an agent could use in a fresh context window without losing critical preferences.
This lesson explores Lesson 4: Reducing LLM Calls, examining the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.
Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to reducing LLM calls.