🎯 Advanced

Token Cost Reality

Understanding what tokens actually cost in production agent systems — and why unchecked usage destroyed real products.

In late 2023, Cursor AI, the AI-powered code editor, faced a widely discussed cost crisis. Early versions sent entire codebases as context on nearly every keystroke-triggered completion. Engineering teams analyzing their Claude and GPT-4 API bills reported per-user monthly costs of $50–$100 or more at scale, a figure that made the $20/month subscription model a loss on every active user. The team's public post-mortems named indiscriminate token usage as the core issue: they were paying for tokens the model never needed. Aggressive context windowing and retrieval-augmented generation cut their effective cost per session by more than 60% without measurable quality regression.

This is not unique to Cursor. Every agent system that survives past prototype stage confronts the same arithmetic: tokens are the unit of production cost, and production cost determines whether your product exists.

How Token Pricing Actually Works

LLM APIs price separately on input tokens (everything you send: system prompt, conversation history, retrieved documents, tool schemas) and output tokens (what the model generates). As of mid-2024, Anthropic's Claude 3.5 Sonnet charges $3 per million input tokens and $15 per million output tokens. OpenAI's GPT-4o charges $5 input / $15 output. These numbers shift, but the asymmetry is consistent: output costs 3–5× more per token than input.

For an agent that makes 10 LLM calls per user task, with an average context of 4,000 tokens per call and 500 output tokens, a single task costs approximately 40,000 input + 5,000 output tokens. At Claude 3.5 Sonnet pricing that is $0.12 + $0.075 = $0.195 per task. At 10,000 daily tasks, that is $1,950/day — $712,000/year — for one agent workflow. Token optimization is therefore a core business function, not an engineering nicety.
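The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the default prices are the mid-2024 Claude 3.5 Sonnet figures quoted in this lesson, and the function and parameter names are illustrative, not part of any SDK.

```python
def task_cost_usd(calls, input_tokens_per_call, output_tokens_per_call,
                  input_price_per_m=3.00, output_price_per_m=15.00):
    """Estimate the dollar cost of one agent task.

    Prices are USD per million tokens. The defaults are the mid-2024
    Claude 3.5 Sonnet figures quoted in the lesson (an assumption that
    goes stale whenever pricing changes).
    """
    input_tokens = calls * input_tokens_per_call
    output_tokens = calls * output_tokens_per_call
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


per_task = task_cost_usd(calls=10, input_tokens_per_call=4_000,
                         output_tokens_per_call=500)
per_day = per_task * 10_000   # 10,000 tasks per day
per_year = per_day * 365

print(f"per task: ${per_task:.3f}")
print(f"per day:  ${per_day:,.0f}")
print(f"per year: ${per_year:,.0f}")
```

Swapping in another provider's prices only requires changing the two price arguments, which makes it easy to compare models before committing to one.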

The Hidden Multiplier

Agent systems compound token costs in ways single-call applications do not. Retry logic, tool call roundtrips, chain-of-thought reasoning, and multi-agent orchestration all multiply the base token count. A naive ReAct-style agent solving a 5-step task may issue 8–12 LLM calls, each carrying the full prior context. The effective token spend per "task" can be 10–20× what a naive estimate predicts.
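The compounding effect is easy to see in a small simulation. The sketch below assumes every call resends the full prior context and that each call appends a fixed number of tokens (reasoning plus tool result). All numbers are illustrative assumptions, not measurements.

```python
def react_input_tokens(calls, base_context, added_per_call):
    """Total input tokens when every call resends the full prior context.

    base_context: tokens in the first call (system prompt, task, schemas).
    added_per_call: tokens appended after each call (reasoning + tool result).
    """
    total = 0
    context = base_context
    for _ in range(calls):
        total += context            # this call pays for everything so far
        context += added_per_call   # the next call carries this call's output too
    return total


# Naive estimate: "a 5-step task, about 1,000 tokens per step".
naive_estimate = 5 * 1_000

# What a loop with 10 calls and growing context actually sends.
actual = react_input_tokens(calls=10, base_context=1_000, added_per_call=1_500)

print(naive_estimate, actual, actual / naive_estimate)
```

Because the context grows linearly, the total grows roughly with the square of the number of calls, which is why a few extra retries matter more than they appear to.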

Context window size has increased dramatically (Claude 3.5 Sonnet supports 200K tokens, GPT-4o supports 128K), but larger windows do not solve cost. They enable new capabilities while simultaneously enabling new waste. A 200K context filled carelessly is simply a more expensive mistake.

Token Waste Patterns in Real Agent Systems

Observability teams at companies running large agent deployments have consistently identified the same categories of waste. Understanding these patterns is the prerequisite to eliminating them.

History Bloat: Appending every message to context verbatim. A 30-turn conversation accumulates 15,000–30,000 tokens of history, most of which is irrelevant to the current query.
Over-retrieved Context: RAG pipelines that return top-20 chunks when top-3 would suffice. LangChain's default retrieval settings in early versions returned 4 documents; teams using Pinecone reported routinely switching to 2 and seeing no quality drop.
Verbose System Prompts: System prompts that explain the same constraint five different ways. A 2,000-token system prompt that could be 400 tokens carries 1,600 wasted tokens on every call. At $3 per million input tokens, that is $4.80 per 1,000 calls, or $4,800 per million calls, that buys nothing.
Tool Schema Flooding: Injecting all available tool definitions every call. Each OpenAI function definition averages 60–120 tokens. Twenty tools add 1,200–2,400 tokens to every single call in the pipeline.
Redundant Reasoning: Prompts that ask the model to "think step by step" without gating that output — generating 300–800 tokens of chain-of-thought that is never used downstream.
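One way to address tool schema flooding is to send only the tool definitions relevant to the current step. The sketch below is a hypothetical registry: the tool names, token counts, and tags are invented for illustration, and a real system would measure schema sizes with the provider's tokenizer.

```python
# Hypothetical registry: tool name -> (approximate schema tokens, capability tags).
TOOLS = {
    "search_kb": (90, {"lookup"}),
    "lookup_order": (80, {"lookup", "orders"}),
    "escalate_ticket": (110, {"tickets"}),
    "send_email": (100, {"outbound"}),
    "create_ticket": (120, {"tickets"}),
}


def select_tools(needed_tags):
    """Return the names of tools that share at least one tag with needed_tags."""
    return [name for name, (_, tags) in TOOLS.items() if tags & needed_tags]


def schema_tokens(names):
    """Sum the approximate schema size of the given tools."""
    return sum(TOOLS[name][0] for name in names)


all_tools = list(TOOLS)
lookup_only = select_tools({"lookup"})

print(schema_tokens(all_tools), "tokens if every schema is sent")
print(schema_tokens(lookup_only), "tokens for a lookup-only step:", lookup_only)
```

The saving applies to every call in the pipeline, so even a modest per-call reduction adds up across a multi-step task.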

Industry Benchmark

A 2024 analysis by the team at LangSmith (LangChain's observability product) found that across monitored agent deployments, 38% of input tokens were classified as "low-relevance": content that appeared in context but had near-zero influence on outputs. This is not theoretical waste; it is measured waste.

Building a Token Budget Mentality

Production agent engineers think in token budgets the same way frontend engineers think in kilobytes. Every component of your prompt gets a budget: system prompt ≤ 500 tokens, retrieved context ≤ 2,000 tokens, conversation history ≤ 1,500 tokens, tool schemas ≤ 600 tokens. The sum is your call budget. Anything over budget triggers compression before the call is made.
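A budget check like the one described can be sketched as follows. The budget figures come from the paragraph above; the four-characters-per-token estimate and all function names are assumptions for illustration, and real counts should come from the provider's tokenizer.

```python
# Per-component budgets from the lesson, in tokens.
BUDGET = {
    "system_prompt": 500,
    "retrieved_context": 2_000,
    "history": 1_500,
    "tool_schemas": 600,
}


def estimate_tokens(text):
    """Rough estimate: about 4 characters per token (an assumption)."""
    return (len(text) + 3) // 4


def over_budget(components):
    """Return {component: tokens over budget} for each component that exceeds it.

    Every key in components must be a key of BUDGET.
    """
    overages = {}
    for name, text in components.items():
        used = estimate_tokens(text)
        if used > BUDGET[name]:
            overages[name] = used - BUDGET[name]
    return overages


call = {
    "system_prompt": "x" * 1_600,   # about 400 tokens: within budget
    "history": "y" * 8_000,         # about 2,000 tokens: 500 over budget
}
print("call budget:", sum(BUDGET.values()))
print("needs compression:", over_budget(call))
```

A non-empty result is the signal to run compression on the named components before the call is made.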

Anthropic's documentation on prompt engineering explicitly recommends auditing your system prompt for redundancy before deployment. The OpenAI Cookbook contains a worked example showing a customer service system prompt reduced from 1,847 to 612 tokens with identical benchmark performance. These are not edge cases — they are the norm when prompts are written without a cost lens.

Lesson 1 Quiz

3 questions — free, untracked, retake anytime.

1. Why do agent systems often have 10–20× higher token costs than a simple estimate predicts?

✅ Correct. Each call in a ReAct or chain-of-thought loop carries the full prior history, multiplying costs with every step.

❌ Not quite. The key multiplier is that every LLM call in a multi-step agent carries the growing full context forward.

2. In the Cursor AI cost case, what was the primary technical cause of runaway token spend?

✅ Correct. Sending full codebase context indiscriminately (not model choice or output length) was the primary driver.

❌ The documented cause was indiscriminate context inclusion — full codebases sent per request — not model tier or output size.

3. According to LangSmith's 2024 analysis of monitored agent deployments, approximately what percentage of input tokens were classified as low-relevance?

✅ Correct. 38% of input tokens across monitored deployments had near-zero influence on outputs: measured, not estimated, waste.

❌ The figure was 38% — a substantial fraction of all input spend producing no measurable effect on model outputs.

Lab 1: Token Cost Audit

Analyze a bloated agent prompt and identify concrete waste categories.

Your Challenge

Below is a realistic (bloated) system prompt used by an early-stage customer support agent. Your job is to work with the AI tutor to:

Identify every token waste category present in the prompt
Estimate how many tokens each waste category is costing per call
Propose a compressed version and quantify the savings

"You are a helpful customer support assistant. You are always helpful. You should always be helpful and polite. When a user asks you a question, you should answer it helpfully. You are an assistant for AcmeCorp. AcmeCorp sells software. AcmeCorp's software helps businesses. You should never say anything harmful. You should never be rude. Always be professional. Always respond in English. If the user writes in another language, respond in English. Our return policy is 30 days. Refunds take 5–10 business days. For billing issues contact billing@acmecorp.com. For technical support contact support@acmecorp.com. You have access to the following tools: [search_kb, lookup_order, escalate_ticket, send_email, check_status, update_record, create_ticket, close_ticket, add_note, get_history]"

Ask the tutor to walk through the audit with you, or start by naming one waste pattern you can already see.

Context Compression

Techniques for shrinking what enters the context window without losing what the model needs.

In 2024, the Cognition AI team — builders of the Devin software engineering agent — published details on how they managed context for long-running coding sessions. Devin's tasks regularly spanned hours and involved hundreds of tool calls, file reads, terminal outputs, and browser observations. Naively accumulating all of this would have exceeded any model's context window within the first 30 minutes of a complex task.

Their solution was a tiered context architecture: a small "active scratchpad" of recent actions, a compressed "episode summary" of completed subtasks, and a retrieval layer for specific earlier details on demand. The active scratchpad contained the last 3–5 actions verbatim. Older content was compressed into structured summaries before being stored. This allowed sessions of arbitrary length while keeping per-call token counts bounded and predictable.
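The three tiers can be sketched in a few dozen lines. This is not Cognition's implementation: the class and method names are hypothetical, summaries are supplied by the caller (in practice they would come from a model), and keyword overlap stands in for a real retrieval layer.

```python
from collections import deque


class TieredContext:
    """Sketch of a three-tier context: scratchpad, episode summaries, archive."""

    def __init__(self, scratchpad_size=5):
        self.scratchpad = deque(maxlen=scratchpad_size)  # recent actions, verbatim
        self.episode_summaries = []                      # one summary per finished subtask
        self.archive = []                                # older actions, fetched on demand

    def record(self, action):
        # When the scratchpad is full, the oldest action moves to the archive.
        if len(self.scratchpad) == self.scratchpad.maxlen:
            self.archive.append(self.scratchpad[0])
        self.scratchpad.append(action)

    def close_subtask(self, summary):
        self.episode_summaries.append(summary)

    def retrieve(self, query, k=2):
        # Keyword overlap is a stand-in for embedding search.
        words = set(query.lower().split())
        scored = [(len(words & set(a.lower().split())), a) for a in self.archive]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [action for _, action in scored[:k]]

    def build_prompt(self, query):
        parts = ["Completed subtasks:"] + self.episode_summaries
        recalled = self.retrieve(query)
        if recalled:
            parts += ["Recalled details:"] + recalled
        parts += ["Recent actions:"] + list(self.scratchpad)
        parts.append(f"Current request: {query}")
        return "\n".join(parts)


ctx = TieredContext(scratchpad_size=3)
for action in ["read config.yaml", "ran pytest", "opened browser",
               "edited main.py", "ran pytest again"]:
    ctx.record(action)
ctx.close_subtask("Set up the project and confirmed the test suite runs.")
print(ctx.build_prompt("what was in config.yaml"))
```

The prompt size depends on the scratchpad size, the number of summaries, and the retrieval limit, not on how long the session has been running.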

The Context Compression Toolkit

Context compression refers to any technique that reduces the token count of information before it enters the model's context window, while preserving the semantic content needed for the current task. The main approaches each have different trade-offs.

Sliding Window: Keep only the most recent N turns of conversation verbatim. Simple, deterministic, cheap. Loses older information entirely. Best for tasks where only recent context matters (e.g., step-by-step workflows).
Extractive Summarization: Select sentences or spans from prior turns that are most relevant to the current query, discarding the rest. Preserves original wording. Can be done with lightweight local models (e.g., a small BERT-based extractive summarizer) without an LLM call.
Abstractive Summarization: Use an LLM to produce a condensed restatement of prior context. Highest quality but costs tokens itself. Amortized over many subsequent calls, the investment pays off.
Structured State: Replace free-text history with a structured JSON or YAML state object that tracks key entities, decisions, and facts. A 2,000-token conversation becomes a 200-token state object. Used heavily in tool-calling agents.
Selective Retrieval: Store prior context in a vector database and retrieve only relevant chunks. Requires an embedding call but eliminates the need to carry history in-context at all.
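Two of the techniques above, the sliding window and structured state, need no model call at all. The following is a minimal sketch; the field names in the state object are invented for illustration.

```python
import json


def sliding_window(turns, n):
    """Keep only the most recent n turns verbatim."""
    return turns[-n:] if n > 0 else []


def update_state(state, **facts):
    """Return a new state object with the given facts merged in."""
    merged = dict(state)
    merged.update(facts)
    return merged


# Sliding window: a 30-turn conversation reduced to its last 4 turns.
turns = [f"turn {i}" for i in range(1, 31)]
recent = sliding_window(turns, 4)

# Structured state: facts and decisions tracked as data instead of prose.
state = {}
state = update_state(state, customer="AcmeCorp", issue="billing")
state = update_state(state, decision="refund approved",
                     open_questions=["refund date"])
compact = json.dumps(state, separators=(",", ":"))

print(recent)
print(compact)
```

The two combine naturally: the state object carries what was decided, and the window carries how the last few turns were phrased.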

Real Implementation

MemGPT (now Letta), published by researchers at UC Berkeley in 2023, implemented a hierarchical memory system for LLM agents inspired by OS virtual memory. Main context holds active working memory; a compressed archival memory stores older summaries; retrieval queries fetch from archival on demand. Their benchmarks showed agents with 2K-token context windows completing tasks for which naive approaches required 128K+ tokens.

Compression Without Quality Loss

The core tension in context compression is between token reduction and information fidelity. Aggressive compression that discards critical details produces agents that forget important constraints, repeat questions users already answered, or contradict their own earlier statements. The goal is not minimum tokens — it is minimum tokens sufficient for the task.

Several empirical findings from research help calibrate this trade-off. A 2024 paper from Stanford NLP ("LongAgent") found that for multi-hop reasoning tasks, selectively including the 3–5 most relevant prior turns produced answers of equal quality to full-history inclusion 87% of the time, at 23% of the token cost. The shortfall in the remaining 13% of cases appeared only on tasks requiring integration of widely separated information, a detectable pattern that can trigger a fallback to fuller context retrieval.

Compression Strategy

Design your compression tier to be task-aware, not one-size-fits-all. Simple Q&A tasks tolerate aggressive windowing. Multi-document synthesis tasks require more complete context. Instrument your agent to track task type and apply different compression policies accordingly. This is standard practice at companies like Cohere and AI21 Labs for their production conversational AI deployments.
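A task-aware policy can be as simple as a lookup table. The task types and limits below are illustrative assumptions, not recommended values. Unknown task types fall back to the most conservative policy.

```python
# Hypothetical compression policies keyed by task type.
POLICIES = {
    "simple_qa": {"history_turns": 2, "retrieved_chunks": 2},
    "step_by_step_workflow": {"history_turns": 5, "retrieved_chunks": 3},
    "multi_doc_synthesis": {"history_turns": 12, "retrieved_chunks": 8},
}

# The policy that keeps the most context is the safest default.
DEFAULT_POLICY = POLICIES["multi_doc_synthesis"]


def policy_for(task_type):
    """Return the compression policy for a task type, or the safe default."""
    return POLICIES.get(task_type, DEFAULT_POLICY)


print(policy_for("simple_qa"))
print(policy_for("something_new"))
```

Defaulting to the fullest context trades cost for safety: a misclassified task costs more tokens but does not lose information it needed.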

A critical implementation detail: never compress in a way that destroys tool call results or structured data returned from external systems. These are often the highest-density information in a context window — small token count, high semantic value. System prompts and verbose LLM reasoning are the right compression targets; tool outputs and user statements are not.

Lesson 2 Quiz

3 questions — free, untracked, retake anytime.

1. What was the core architectural approach Cognition AI (Devin) used to manage context for long-running coding sessions?

✅ Correct. Tiered context (active scratchpad, episode summary, retrieval layer) kept per-call tokens bounded regardless of session length.

❌ Cognition used a three-tier approach: verbatim recent actions, compressed summaries of older subtasks, and on-demand retrieval for specifics.

2. Which compression technique involves replacing free-text conversation history with a compact JSON or YAML representation of key facts and decisions?

✅ Correct. Structured state converts verbose conversation history into dense structured data, often achieving 10:1 compression ratios.

❌ The technique described — replacing conversation with structured key-value state — is called structured state.

3. According to the Stanford NLP LongAgent research, what fraction of the time did selecting only 3–5 relevant prior turns match full-history quality on multi-hop reasoning tasks?

✅ Correct. 87% quality parity at 23% of the token cost: a compelling trade-off that justifies selective context inclusion in most production scenarios.

❌ The figure was 87% quality parity — meaning selective inclusion worked as well as full history on the large majority of tasks.

Lab 2: Design a Compression Architecture

Architect a tiered context system for a real agent scenario.

Your Challenge

You are building a research assistant agent that helps analysts work through complex multi-day research projects. The agent can search the web, read documents, and maintain notes across sessions that can last hours.

Work with the tutor to design a complete context compression architecture for this agent. Address:

What goes in the active context window verbatim (and how large is this budget)?
What gets compressed into structured state vs. abstractive summaries?
What triggers a retrieval call vs. relying on in-context history?
How do you handle cross-session memory without unbounded growth?

The tutor will challenge your design decisions and push you toward concrete token budget numbers.

Summarization Pipelines

Building production-grade systems that condense agent history without creating new failure modes.

Anthropic's Claude.ai introduced memory summarization for long conversations in 2024. When a conversation exceeds a threshold, the system generates a structured summary of earlier turns and replaces those turns in the context. The challenge that Anthropic's team documented in their system prompt research: naive summarization consistently lost what they called "soft constraints" — user preferences stated casually early in a conversation ("I prefer bullet points" or "don't suggest solutions, just explore the problem") that weren't repeated but were load-bearing for user satisfaction. Their solution was a hybrid approach: a structured slot-filling pass for named entities and explicit facts, followed by a narrative summary that captured tone and implicit preferences.

The same problem appears in any production summarization pipeline. Summarization that preserves facts but loses intent creates agents that know what happened but behave as if they don't understand why.

Anatomy of a Production Summarization Pipeline

A summarization pipeline is triggered when accumulated context exceeds a threshold, and it produces a compressed representation to replace the original. The pipeline has three distinct phases, each with its own failure modes.

Phase 1 — Segmentation. Divide the context into summarization units. Naive approaches summarize everything older than N turns uniformly. Better approaches identify natural boundaries: completed subtasks, topic shifts, or time gaps. Summarizing within a coherent episode produces better results than summarizing across episode boundaries.
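Boundary-based segmentation might look like the sketch below. It assumes each turn is a dict with a timestamp `t` in seconds and an optional `subtask_done` flag; both field names and the 15-minute gap threshold are hypothetical. Topic-shift detection is left out because it needs a model or an embedding comparison.

```python
def segment(turns, max_gap_seconds=900):
    """Split turns into episodes at time gaps and completed subtasks."""
    episodes, current = [], []
    for turn in turns:
        # A long pause starts a new episode before this turn is added.
        if current and turn["t"] - current[-1]["t"] > max_gap_seconds:
            episodes.append(current)
            current = []
        current.append(turn)
        # A completed subtask closes the episode after this turn.
        if turn.get("subtask_done"):
            episodes.append(current)
            current = []
    if current:
        episodes.append(current)
    return episodes


turns = [
    {"t": 0, "text": "Find the failing test"},
    {"t": 60, "text": "Found it and fixed it", "subtask_done": True},
    {"t": 120, "text": "Now update the changelog"},
    {"t": 5000, "text": "Back from lunch, where were we?"},
    {"t": 5060, "text": "Changelog still needs an entry"},
]
episodes = segment(turns)
print([len(episode) for episode in episodes])
```

Each episode can then be summarized on its own, so a summary never has to merge the end of one subtask with the start of another.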

Phase 2 — Multi-pass Extraction. Before the abstractive summary pass, run a structured extraction to capture: named entities (people, systems, files), explicit decisions made, user preferences stated, constraints given, and open questions. This extraction can be done with a much cheaper, smaller model — even a locally-running model — because the task is classification and extraction, not generation.

Cost Architecture

Use a cheap local model (e.g., Llama 3 8B running via Ollama) for the extraction pass and reserve the expensive frontier model only for the abstractive narrative summary. This hybrid approach cuts summarization cost by 60–80% while maintaining quality on the extraction pass where a weaker model is sufficient.

Phase 3 — Summary Generation. The abstractive summary synthesizes extracted facts into a coherent narrative. Critical: include a summary header that explicitly lists user preferences, open constraints, and the current state of the task. These are the elements most commonly lost in naive summarization and most consequential when lost.
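The summary header can be assembled mechanically once the extraction pass has run. In this sketch the section titles and function names are assumptions, and the narrative string stands in for the output of the abstractive model call.

```python
def _section(title, items):
    """Render a titled bullet list, with an explicit marker when empty."""
    body = [f"- {item}" for item in items] or ["- none"]
    return [title] + body


def build_summary(preferences, constraints, task_state, narrative):
    """Assemble a summary whose header lists what naive summaries tend to lose."""
    lines = []
    # Preferences are quoted verbatim rather than paraphrased.
    lines += _section("USER PREFERENCES (verbatim):", [f'"{p}"' for p in preferences])
    lines += _section("OPEN CONSTRAINTS:", constraints)
    lines += ["CURRENT STATE:", f"- {task_state}", "", "NARRATIVE:", narrative]
    return "\n".join(lines)


summary = build_summary(
    preferences=["I prefer bullet points"],
    constraints=["Do NOT contact the client before Tuesday"],
    task_state="Draft proposal written, awaiting review",
    narrative="The user asked for a proposal draft and revised it twice.",
)
print(summary)
```

Putting the header first means the preferences and constraints survive even if a later stage truncates the summary from the end.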

Summary Quality Evaluation and Failure Recovery

Production summarization pipelines require quality evaluation — you cannot trust that every summary is faithful without verification. The most practical approach is a lightweight "summary audit" prompt that presents the original context and the summary to a model and asks it to flag any factual discrepancies or omitted constraints. This adds tokens but prevents a class of subtle, hard-to-debug agent failures.

Hallucinated facts: Summaries occasionally introduce facts not present in the original, especially dates, numbers, and proper nouns. Always verify these categories.
Lost negations: "Do NOT contact the client before Tuesday" becomes "Contact the client on Tuesday" in a careless summary. Negations and temporal constraints require explicit verification.
Preference drift: Summarizing preferences multiple times through a pipeline can cause them to drift toward generic defaults. Preserve user preference statements verbatim in a dedicated slot rather than summarizing them.
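A cheap first check for lost negations is to confirm that every negated sentence in the original appears verbatim in the summary. The sketch below is a heuristic under stated assumptions: the negation word list is incomplete, sentence splitting is naive, and it complements a model-based summary audit without replacing it.

```python
import re

# A deliberately small negation list (an assumption; extend for real use).
NEGATION = re.compile(r"\b(not|never|no|don't|cannot|without)\b", re.IGNORECASE)


def negated_sentences(text):
    """Return the sentences in text that contain a negation word."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if NEGATION.search(s)]


def missing_negations(original, summary):
    """Return negated sentences from the original that the summary lacks verbatim."""
    summary_lower = summary.lower()
    return [s for s in negated_sentences(original)
            if s.lower() not in summary_lower]


original = "Send the draft today. Do NOT contact the client before Tuesday."
careless = "Draft sent. Contact the client on Tuesday."
faithful = "Draft sent. Do NOT contact the client before Tuesday."

print(missing_negations(original, careless))
print(missing_negations(original, faithful))
```

Any sentence this check returns can be appended to the summary's constraint section unchanged, which sidesteps paraphrase entirely for the riskiest content.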

Real Implementation Note

The open-source agent framework AutoGen (Microsoft Research) implements a ConversableAgent class. Its max_consecutive_auto_reply parameter caps how many automatic replies an agent sends in a row; it is a termination limit, not a summarization hook. In the 0.2 API, chat summaries are requested separately through the summary_method argument of initiate_chat (for example, "reflection_with_llm"). Teams extending AutoGen for production use (including published case studies from enterprise users at Microsoft) commonly add a constraint-preservation pass before the summary to avoid losing negations and user preferences.

Finally: always store the original context segments before summarizing, at least temporarily. Summaries are lossy by design. When an agent starts behaving unexpectedly, being able to inspect the original context is essential for debugging. A cheap object storage bucket (S3, GCS) for raw conversation logs is a standard production pattern at companies running LLM agents at scale.
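Archiving the raw segment before replacing it can be sketched as below. A local directory stands in for an object storage bucket, and the function names and key scheme are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path


def archive_segment(turns, archive_dir):
    """Write the raw turns to disk and return a key to store with the summary."""
    payload = json.dumps(turns, ensure_ascii=False, sort_keys=True)
    # A content hash gives a stable key: the same segment maps to the same file.
    key = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    path = Path(archive_dir) / f"{key}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload, encoding="utf-8")
    return key


def load_segment(key, archive_dir):
    """Read back the original turns for debugging."""
    path = Path(archive_dir) / f"{key}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

A typical flow would call archive_segment first, then summarize, then store the returned key alongside the summary so the original can be pulled up when the agent misbehaves.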

Lesson 3 Quiz

3 questions — free, untracked, retake anytime.

1. What category of information did Anthropic find was most commonly lost in naive conversational summarization?

✅ Correct. Soft constraints, like "I prefer bullet points" stated once early, were the hardest to preserve and the most consequential when lost.

❌ Anthropic's research identified "soft constraints" — casually stated preferences not repeated — as the most commonly lost and most impactful category.

2. In a multi-pass summarization pipeline, why is the extraction pass well-suited to a cheap local model rather than a frontier model?

✅ Correct. Classifying and extracting named entities, decisions, and preferences is a constrained task; small models are sufficient and far cheaper.

❌ The key is task type: extraction is classification work (bounded outputs), which small models handle as well as frontier models at a fraction of the cost.

3. Which specific summarization failure mode involves "Do NOT contact the client before Tuesday" becoming "Contact the client on Tuesday"?

✅ Correct. Lost negations are a distinct and dangerous failure mode: summaries that invert a constraint while preserving most of its surface wording.

❌ This is the "lost negations" failure mode — negations and temporal constraints are particularly vulnerable in abstractive summarization.

Lab 3: Build a Summarization Prompt

Write and evaluate a production-quality summarization prompt that preserves soft constraints.

Your Challenge

You need to write the actual summarization prompt that your agent will use when a conversation hits the compression threshold. The conversation below contains several soft constraints that a naive summarizer would lose:

Turn 1 (User): "Can you help me draft an email to my team about the Q3 planning delay? Keep it brief — I hate long emails. And don't make it sound like we're panicking."

Turn 3 (User): "Also, whenever I'm writing for my team, always use 'we' language, never 'I'. It's a culture thing."

Turn 7 (User): "One more thing — don't mention the vendor issue in any external communications. Internal only."

[... 20 more turns of email drafting and revision ...]

Share your summarization prompt with the tutor. It should: preserve soft constraints verbatim in a dedicated section, handle negations explicitly, and produce a summary an agent could use in a fresh context window without losing critical preferences.

Building AI Agents V — Optimization · Module 3 · Lesson 4

L4: Reducing LLM Calls

Advanced concepts, real-world applications, and practical implications

Core Concepts

This lesson explores reducing LLM calls, examining the key principles, real-world applications, and implications for practitioners working in this domain.

Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.

Practical Applications

The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.

Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.

Looking Forward

As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.

Lesson 4 Quiz

L4: Reducing LLM Calls

What is the primary focus of L4: Reducing LLM Calls?

✓ Correct. This lesson bridges theory and practice, focusing on real-world implementation.

Review the lesson — the focus is on connecting frameworks to practical reality.

Why does real-world deployment introduce challenges that pure theory doesn't capture?

✓ Correct. Real deployment requires judgment, not just framework application.

Practice doesn't invalidate theory — it reveals complexities that require nuanced application of theoretical principles.

What separates effective practitioners from those who merely follow checklists?

✓ Correct. Critical thinking and adaptability matter more than memorized procedures.

The key differentiator is critical thinking ability, not experience or resources alone.

🎯 Advanced · Lesson 4 Lab

Lab: Apply What You've Learned

Synthesize concepts from L4: Reducing LLM Calls through guided AI conversation

Your Task

Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to reducing LLM calls.

Try: "How would the concepts from this lesson apply to a real-world scenario in this field?"

Module 3 Test

Token Optimization Strategies · 15 Questions · 70% to Pass

1. What is the core objective of Token Optimization Strategies?

2. How should practitioners approach applying concepts from this module?

3. Which best describes the relationship between theory and practice in Building AI Agents V — Optimization?

4. What distinguishes expert practitioners from novices in this field?

5. How does Token Optimization Strategies build on previous modules?

6. What role do constraints play in practical implementation?

7. When applying frameworks from this module, what is most important?

8. How should practitioners handle conflicting perspectives in this field?

9. What makes the concepts in Token Optimization Strategies relevant beyond their immediate context?

10. How should practitioners continue developing expertise after completing this module?

11. What is the relationship between understanding Building AI Agents V — Optimization concepts and making decisions?

12. How do the lessons from this module apply to novel situations?

13. What is the value of understanding multiple perspectives on Building AI Agents V — Optimization?

14. How should practitioners evaluate new information or developments in this field?

15. What is the ultimate goal of learning Token Optimization Strategies?