In November 2023, Klarna deployed an AI assistant built on OpenAI's GPT-4. By early 2024, the company reported it handled the equivalent workload of 700 full-time customer service agents. But the cost structure was not straightforward. Each customer interaction involved multiple API calls: one for intent classification, one for context retrieval, one for response generation, and sometimes a follow-up call for quality checking. A conversation that looked like a single "chat" was actually 3–6 separate model invocations. The per-conversation cost was roughly $0.12–$0.18 at peak GPT-4 pricing — which at scale across millions of interactions required rigorous cost accounting that Klarna's engineering team had not initially anticipated.
Every LLM API call has two cost components: input tokens and output tokens. Input tokens are almost always cheaper — on Claude 3.5 Sonnet, input tokens cost $3 per million while output tokens cost $15 per million as of mid-2024. This asymmetry matters enormously in agent design because output generation is where the expense compounds.
Agents amplify these costs through a mechanism called the context window carry. In a multi-turn agent loop, each new call typically includes the full conversation history plus tool call results plus system prompt. A system prompt of 800 tokens plus a 2,000-token conversation history plus 500 tokens of tool output means your "new" 50-token question actually costs 3,350 input tokens. After ten turns, you may be paying for 8,000+ input tokens per call even though the user's actual instruction was tiny.
Tool calls add a second multiplier. Each tool invocation — searching a database, calling a web API, reading a file — returns results that re-enter the context. A single agent task requiring five tool calls might result in a context that grows from 1,000 tokens to 6,000 tokens by the final synthesis step. The last call in the chain is your most expensive, not your first.
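The growth described above can be sketched as a small simulation. This is a minimal illustration, assuming every call re-sends the whole context and each tool result is appended in full; the function name `context_sizes` and the 1,000-token tool results are hypothetical.

```python
def context_sizes(initial_tokens: int, tool_result_tokens: list[int]) -> list[int]:
    """Return the input-context size seen by each call in a tool chain.

    Assumption: every call re-sends the whole context, and each tool
    result is appended in full before the next call.
    """
    sizes = [initial_tokens]
    for result in tool_result_tokens:
        sizes.append(sizes[-1] + result)
    return sizes


# Five tool calls that each return about 1,000 tokens (illustrative figure).
sizes = context_sizes(1_000, [1_000] * 5)
final_context = sizes[-1]        # context size at the synthesis step
total_billed_input = sum(sizes)  # input tokens billed across the whole chain
```

Under these assumptions the final call sees 6,000 input tokens, but the chain as a whole is billed for 21,000, which is why estimating from the final context size alone undercounts.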
True call cost = (system_prompt_tokens + cumulative_history_tokens + tool_results_tokens + new_input_tokens) × input_rate + (generated_output_tokens) × output_rate. Most cost estimates that ignore cumulative context are off by 3–10×.
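The formula above translates directly into a few lines of Python. This is a sketch rather than a reference implementation: the `Rates` class and `call_cost` function are hypothetical names, the token counts reuse the earlier 3,350-token example, and the 300-token output is an assumed figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Price in USD per one million tokens."""
    input_per_m: float
    output_per_m: float


def call_cost(system_prompt: int, history: int, tool_results: int,
              new_input: int, output: int, rates: Rates) -> float:
    """Cost in USD of one call, counting the full carried context."""
    input_tokens = system_prompt + history + tool_results + new_input
    return (input_tokens * rates.input_per_m
            + output * rates.output_per_m) / 1_000_000


# Mid-2024 Claude 3.5 Sonnet prices quoted earlier in the text.
sonnet = Rates(input_per_m=3.0, output_per_m=15.0)

# Full accounting versus a naive estimate that ignores carried context.
full = call_cost(800, 2_000, 500, 50, output=300, rates=sonnet)
naive = call_cost(0, 0, 0, 50, output=300, rates=sonnet)
```

With these inputs the full cost is about $0.0146 per call against a naive estimate of about $0.0047, a gap of roughly 3×.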
Model selection creates a 10–60× cost spread. GPT-4o at $5/$15 per million (input/output) versus GPT-3.5-turbo at $0.50/$1.50 creates a 10× delta. Claude Haiku at $0.25/$1.25 versus Claude Opus at $15/$75 creates a 60× delta. Matching the model tier to the task is the single largest cost lever available to an agent architect.
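To compare tiers for a concrete workload, a small price table is enough. The sketch below uses the prices quoted above; the dictionary keys are illustrative labels rather than exact API model identifiers, and `task_cost` and `cost_ratio` are hypothetical helper names.

```python
# (input, output) prices in USD per million tokens, as quoted in the text.
PRICES = {
    "gpt-4o": (5.00, 15.00),
    "gpt-3.5-turbo": (0.50, 1.50),
    "claude-opus": (15.00, 75.00),
    "claude-haiku": (0.25, 1.25),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one task on the given model tier."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def cost_ratio(expensive: str, cheap: str,
               input_tokens: int, output_tokens: int) -> float:
    """How many times more the same task costs on the expensive tier."""
    return (task_cost(expensive, input_tokens, output_tokens)
            / task_cost(cheap, input_tokens, output_tokens))


openai_delta = cost_ratio("gpt-4o", "gpt-3.5-turbo", 3_000, 300)
anthropic_delta = cost_ratio("claude-opus", "claude-haiku", 3_000, 300)
```

Because input and output prices scale by the same factor within each pair, the ratio does not depend on the input/output mix for these particular tiers. That will not hold for every pair of models, so it is worth computing the ratio on a realistic token mix.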
Token cost is only part of the picture. Production agent systems carry at least four additional cost categories that are routinely underestimated during design.
Anthropic's published guidance on agent design explicitly warns that "costs can scale rapidly" in agentic contexts and recommends building cost visibility into the agent loop itself — not treating it as an afterthought. This means instrumenting token counts at every step and surfacing them to both the engineering team and, in enterprise contexts, the end customer.
A reasonable rule of thumb from multiple production deployments: budget 2.5× the naive per-call token cost to cover retries, orchestration overhead, embeddings, and context accumulation. Teams that skip this buffer routinely go over budget by the second week of production traffic.
You're going to build a cost model for a specific agent scenario. The AI will walk you through calculating the true token cost of a multi-step agent interaction, including context accumulation and orchestration overhead.
When Notion launched its AI features in February 2023, the team faced a cost estimation challenge with no clean precedent. They were offering AI writing assistance at $10/month per user — a flat rate — but had no reliable data on how frequently users would invoke AI features or how long their documents would be. Within the first month, power users were generating 3–4× more AI calls than the median user, meaning the top 20% of users were consuming roughly 70% of the AI compute budget. Notion had to retroactively model usage percentiles and adjust their pricing — eventually introducing usage caps — because their initial flat-fee model assumed a distribution that didn't match reality.
Accurate cost estimation before deployment requires modeling three interacting variables: call volume, token distribution, and error rate. Getting any one of these wrong by 2× will throw your budget off significantly.
Call volume modeling starts with user count and feature engagement rate, not with raw traffic. If you have 10,000 users and expect 30% to use an AI feature daily, and each use triggers an average of 4 agent calls, you have 12,000 calls/day as your baseline. But you need to model the distribution, not just the mean. P95 users (the heavy users in the 95th percentile) can use the feature 5–8× more than median users, an even heavier tail than the 3–4× gap Notion saw between its power users and its median user. Use a log-normal distribution for user behavior modeling; its long right tail fits most SaaS usage patterns well.
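A minimal sketch of this modeling step, using only the standard library. The baseline reproduces the 12,000 calls/day figure; the log-normal parameters are assumptions chosen for illustration (`sigma=1.1` gives a P95 roughly six times the median), and the function names are hypothetical.

```python
import math
import random
import statistics


def baseline_calls_per_day(users: int, engagement_rate: float,
                           calls_per_use: float) -> int:
    """Mean daily call volume: users x share engaging daily x calls per use."""
    return round(users * engagement_rate * calls_per_use)


def simulate_daily_calls(n_users: int, median_calls: float, sigma: float,
                         seed: int = 0) -> list[float]:
    """Draw per-user daily call counts from a log-normal distribution.

    For a log-normal, the median equals exp(mu), so mu = ln(median).
    sigma controls tail weight; P95/P50 is roughly exp(1.645 * sigma).
    """
    rng = random.Random(seed)
    mu = math.log(median_calls)
    return [rng.lognormvariate(mu, sigma) for _ in range(n_users)]


baseline = baseline_calls_per_day(10_000, 0.30, 4)

# 3,000 daily active users with a median of 4 calls each (assumed values).
calls = simulate_daily_calls(n_users=3_000, median_calls=4, sigma=1.1)
cuts = statistics.quantiles(calls, n=100)  # 99 cut points; cuts[49] is P50
p95_over_p50 = cuts[94] / cuts[49]
```

Once real usage data exists, fitting `mu` and `sigma` to the observed per-user call counts is preferable to guessing them.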
Daily cost = (daily_active_users × feature_engagement_rate × avg_calls_per_session × avg_tokens_per_call × token_price) × 1.25 (tail overhead). Run this at P50, P90, and P99 to get a cost range, not a point estimate.
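The formula above can be run across percentiles in a few lines. This is a sketch under stated assumptions: the tokens-per-call percentiles are hypothetical placeholders, and a single blended price of $3 per million tokens stands in for separate input and output rates.

```python
def daily_cost(dau: int, engagement_rate: float, calls_per_session: float,
               tokens_per_call: float, price_per_token: float,
               tail_overhead: float = 1.25) -> float:
    """Daily cost in USD, following the formula above term by term."""
    return (dau * engagement_rate * calls_per_session
            * tokens_per_call * price_per_token) * tail_overhead


# Hypothetical tokens-per-call percentiles; replace with measured values.
TOKENS_PER_CALL = {"P50": 2_000, "P90": 6_000, "P99": 15_000}
PRICE_PER_TOKEN = 3 / 1_000_000  # assumes a blended $3 per million tokens

cost_range = {
    name: daily_cost(10_000, 0.30, 4, tokens, PRICE_PER_TOKEN)
    for name, tokens in TOKENS_PER_CALL.items()
}
```

With these illustrative inputs the range runs from about $90/day at P50 to about $675/day at P99, which is the kind of spread a single point estimate hides.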
Token distribution modeling requires sampling your actual input data before launch. If you're building a document analysis agent, sample 200 real documents from your corpus and measure their length distribution. A mean of 2,000 tokens tells you very little if the P95 is 15,000 tokens: a document at P95 costs 7.5× as much to read as the average one, and those are exactly the documents where users expect AI to provide the most value.
The most reliable estimation method is a load simulation: run 500–1,000 representative task samples through your actual agent pipeline before launch, record every token count at every step, and build a cost-per-task histogram. This gives you an empirical distribution rather than a theoretical one, catching surprises like unexpectedly verbose tool outputs or prompts that reliably trigger long chain-of-thought responses.
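One way to turn the recorded token counts into a histogram is sketched below. It assumes each simulated task is stored as a list of per-step `(input_tokens, output_tokens)` pairs and that prices are whole dollars per million tokens, so costs can be kept as exact integer micro-dollars; all names and the three sample tasks are hypothetical.

```python
import statistics
from collections import Counter

# One task = list of (input_tokens, output_tokens), one pair per pipeline step.
Task = list[tuple[int, int]]


def task_cost_micro_usd(task: Task, input_rate: int = 3, output_rate: int = 15) -> int:
    """Tokens x (USD per million tokens) gives micro-dollars directly."""
    return sum(i * input_rate + o * output_rate for i, o in task)


def cost_histogram(tasks: list[Task], bucket_micro_usd: int = 10_000) -> dict[int, int]:
    """Count tasks per cost bucket (default bucket width: one cent)."""
    counts = Counter(task_cost_micro_usd(t) // bucket_micro_usd for t in tasks)
    return dict(sorted(counts.items()))


def percentiles(tasks: list[Task]) -> dict[str, float]:
    """P50/P90/P99 cost per task in USD."""
    costs = [task_cost_micro_usd(t) for t in tasks]
    q = statistics.quantiles(costs, n=100, method="inclusive")
    return {"p50": q[49] / 1e6, "p90": q[89] / 1e6, "p99": q[98] / 1e6}


# Three toy tasks; a real run would record 500-1,000 of these.
sample_tasks = [
    [(1_000, 100)],
    [(1_000, 100), (2_000, 200)],
    [(10_000, 1_000)],
]
hist = cost_histogram(sample_tasks)
summary = percentiles(sample_tasks)
```

Keeping the per-step pairs, rather than only a per-task total, makes it possible to see which step produces the verbose outputs when the histogram shows an unexpected tail.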
Cost estimation without scenario planning produces a single number that is almost certainly wrong. Professional cost models include at least three scenarios: base case (expected adoption, expected token usage), bull case (rapid adoption, heavy power users), and stress case (adoption spike plus adversarial inputs).
The 2023 launch of GitHub Copilot Chat illustrates stress case planning. When Copilot Chat reached large codebases, users began submitting entire files as context — sometimes 8,000–10,000 tokens in a single input. Microsoft had modeled average code snippet lengths (400–600 tokens) but not the tail of users who would paste full files. This drove per-user costs significantly above projections during the first quarter.
Before launching any agent feature: (1) Run 1,000 real task samples to build an empirical cost histogram. (2) Model P50, P90, P99 cost scenarios. (3) Set hard budget gates with automated alerts. (4) Define per-user cost thresholds that trigger rate limiting. Skipping any of these is a known path to budget overruns.
You'll build a structured pre-production cost model for a document analysis agent with the AI tutor's help. Work through the scenario below and build out all three planning scenarios.
In March 2024, a well-documented incident at a mid-sized AI startup (reported in detail on the LessWrong engineering blog and subsequently discussed at AI Engineer World's Fair 2024) involved a recursive agent loop that went undetected for approximately 11 hours. The agent, tasked with code review, entered a cycle where each output triggered a new self-review call. The team had monitoring on API error rates and latency but had no real-time cost tracking. The incident generated approximately $38,000 in API charges before an engineer noticed the anomaly in the monthly billing dashboard, which only updated every 24 hours. Post-incident, the team implemented per-minute cost tracking with a hard circuit breaker at $500/hour. At the incident's average burn rate (about $3,450/hour, or roughly $58/minute), that limit would have tripped in under 10 minutes.
Cost tracking for agents is not the same as cost tracking for traditional software. In traditional systems, a resource spike is usually visible as CPU or memory utilization on infrastructure dashboards. In LLM agent systems, compute happens at the model provider, not in your infrastructure — your servers can be completely idle while you're spending $1,000/minute at the API layer. This means you need a purpose-built cost observability layer.
The instrumentation layer has four components. First: token counting at every call. Every LLM API response returns token usage in the response metadata. Capture this as structured data: timestamp, agent_id, task_id, user_id, model, prompt_tokens, completion_tokens, calculated_cost. This creates an immutable cost audit trail.
Second: cost aggregation with short windows. Aggregate your token logs into cost metrics at 1-minute intervals, not 24-hour intervals. The incident above cost $38,000 precisely because the aggregation window was too long. OpenAI's usage dashboard and Anthropic's usage API both support programmatic querying, but you should build your own real-time layer on top of them using the response metadata rather than relying on provider dashboards.
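A minimal in-process sketch of 1-minute aggregation is shown below. It assumes each record is a `(timestamp, cost_usd)` pair built from the response metadata; the function names and the sample figures are hypothetical, and a production system would more likely push these into a metrics store.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate_per_minute(records):
    """Sum cost_usd into 1-minute buckets keyed by the truncated timestamp.

    records: iterable of (timestamp, cost_usd) pairs, where timestamp is a
    datetime taken when the API response (with its usage metadata) arrived.
    """
    buckets = defaultdict(float)
    for ts, cost_usd in records:
        buckets[ts.replace(second=0, microsecond=0)] += cost_usd
    return dict(sorted(buckets.items()))


def minutes_over(buckets, limit_usd_per_minute):
    """Return the minute buckets whose spend exceeds the given limit."""
    return [minute for minute, cost in buckets.items()
            if cost > limit_usd_per_minute]


t0 = datetime(2024, 3, 1, 12, 0, 5, tzinfo=timezone.utc)
records = [
    (t0, 0.5),
    (t0.replace(second=40), 0.25),
    (t0.replace(minute=1), 2.0),
]
buckets = aggregate_per_minute(records)
hot_minutes = minutes_over(buckets, limit_usd_per_minute=1.0)
```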
Minimum required fields per API call log: { timestamp, session_id, agent_name, step_name, model, input_tokens, output_tokens, cost_usd, task_id, user_id, retry_count, tool_calls_count }. This schema enables attribution, anomaly detection, and per-feature cost reporting.
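The schema above maps naturally onto a frozen dataclass. The sketch below is one possible shape, not a prescribed one: `CallLog` and `make_call_log` are hypothetical names, the field values are placeholders, and the token counts are passed in directly because the name and layout of the usage metadata differ between provider SDKs.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CallLog:
    timestamp: str
    session_id: str
    agent_name: str
    step_name: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    task_id: str
    user_id: str
    retry_count: int = 0
    tool_calls_count: int = 0


def make_call_log(*, input_rate_per_m: float, output_rate_per_m: float,
                  input_tokens: int, output_tokens: int, **fields) -> CallLog:
    """Build one log record, deriving cost_usd from the token counts."""
    cost = (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000
    return CallLog(
        timestamp=datetime.now(timezone.utc).isoformat(),
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=round(cost, 6),
        **fields,
    )


log = make_call_log(
    input_rate_per_m=3.0, output_rate_per_m=15.0,
    input_tokens=3_350, output_tokens=300,
    session_id="s-1", agent_name="doc-analyzer", step_name="synthesis",
    model="example-model", task_id="t-1", user_id="u-1",
)
line = json.dumps(asdict(log))  # one JSON object per line, ready to append
```

Note that `frozen=True` only prevents mutation of the in-memory object; an audit trail that is immutable in the storage sense also needs an append-only sink.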
Third: cost attribution by dimension. Raw total cost is nearly useless for decision-making. You need cost broken down by agent type, by feature, by user cohort, and by task category. A code review agent and a document summarization agent running in the same system may have 5× different cost-per-task profiles — knowing this lets you make intelligent model selection and routing decisions. LangSmith, the observability platform from LangChain, and Helicone both offer cost attribution by run type, though both require careful configuration to capture the full context window cost rather than just the new tokens.
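Attribution can start as a simple group-by over the call logs. The sketch below assumes log records are dictionaries carrying `cost_usd`, `task_id`, and the grouping field; `cost_per_task_by` is a hypothetical helper and the sample costs are invented for illustration.

```python
from collections import defaultdict


def cost_per_task_by(records, dimension):
    """Average cost per distinct task, grouped by one log field.

    records: iterable of dicts with at least `cost_usd`, `task_id`,
    and the field named by `dimension` (for example "agent_name").
    """
    totals = defaultdict(float)
    tasks = defaultdict(set)
    for r in records:
        key = r[dimension]
        totals[key] += r["cost_usd"]
        tasks[key].add(r["task_id"])
    return {key: totals[key] / len(tasks[key]) for key in totals}


records = [
    {"agent_name": "code_review", "task_id": "a", "cost_usd": 0.25},
    {"agent_name": "code_review", "task_id": "a", "cost_usd": 0.25},
    {"agent_name": "summarize", "task_id": "b", "cost_usd": 0.10},
]
per_agent = cost_per_task_by(records, "agent_name")
```

Grouping by distinct `task_id` rather than by call matters here: a multi-call task would otherwise look cheaper per unit than it is.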
Instrumentation without action is logging. The production-grade pattern is to couple your cost tracking with automated circuit breakers — hard stops and throttles that fire before costs become catastrophic.
The standard pattern: Warn at 70% of budget threshold → Throttle new requests at 90% → Hard stop at 100%. This triple-layer approach prevents sudden cutoffs (which create terrible user experience) while still enforcing cost controls. The warn state is the most operationally valuable — it gives engineers time to investigate before a forced shutdown.
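The three thresholds can be expressed as a small state function that the request path consults before each call. This is a sketch: `BudgetState` and `budget_state` are hypothetical names, and what "throttle" means in practice (queueing, sampling, downgrading the model) is left to the caller.

```python
from enum import Enum


class BudgetState(Enum):
    OK = "ok"
    WARN = "warn"          # alert engineers, keep serving
    THROTTLE = "throttle"  # slow or reject new requests
    STOP = "stop"          # hard stop


def budget_state(spent_usd: float, budget_usd: float,
                 warn_at: float = 0.70, throttle_at: float = 0.90) -> BudgetState:
    """Map current spend to a breaker state using the 70/90/100% pattern."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    fraction = spent_usd / budget_usd
    if fraction >= 1.0:
        return BudgetState.STOP
    if fraction >= throttle_at:
        return BudgetState.THROTTLE
    if fraction >= warn_at:
        return BudgetState.WARN
    return BudgetState.OK


# Against a $500 hourly budget:
states = [budget_state(spent, 500) for spent in (100, 360, 460, 500)]
```

Checking the thresholds from highest to lowest keeps the function correct even if the warn and throttle fractions are reconfigured.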
The fourth instrumentation component is cost forecasting from real-time data. Given your current burn rate, project the end-of-day and end-of-month cost. If you're 12 hours into the day and have already spent 75% of the daily budget, you are on pace to finish at 150%; surface that projection to your on-call engineer before you hit the limit, not after. This is standard practice in cloud infrastructure cost management (AWS Cost Explorer, GCP Budget Alerts) but is rarely implemented at the API call level for LLM systems.
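A linear burn-rate projection is the simplest version of this forecast. The sketch below assumes spend continues at the average rate observed so far, which ignores daily traffic patterns; the function names and the $1,000 budget are hypothetical.

```python
def project_end_of_day(spent_usd: float, hours_elapsed: float,
                       hours_in_day: float = 24.0) -> float:
    """Project end-of-day spend, assuming the average burn rate so far holds."""
    if not 0 < hours_elapsed <= hours_in_day:
        raise ValueError("hours_elapsed must be in (0, hours_in_day]")
    return spent_usd / hours_elapsed * hours_in_day


def hours_until_limit(spent_usd: float, hours_elapsed: float,
                      budget_usd: float) -> float:
    """Hours remaining before the budget is exhausted at the current rate."""
    if hours_elapsed <= 0 or spent_usd <= 0:
        raise ValueError("need positive elapsed time and spend to estimate a rate")
    burn_rate = spent_usd / hours_elapsed
    return max(budget_usd - spent_usd, 0.0) / burn_rate


# $750 of a $1,000 daily budget spent in the first 12 hours.
projected = project_end_of_day(750, 12)      # projected end-of-day spend in USD
remaining = hours_until_limit(750, 12, 1_000)  # hours until the budget runs out
```

In this example the projection is $1,500 against a $1,000 budget, with about four hours left before the limit, which is the figure worth paging on.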
You'll design an end-to-end cost observability system for a production agent deployment. The AI will help you think through the instrumentation schema, aggregation strategy, and circuit breaker configuration.