It's one thing to build an agent that works in a demo. It's a different thing entirely to run that agent at production scale — a thousand times per minute, across a million customers, inside an SLA, on a budget your CFO approves.
At scale, every design choice compounds. A prompt that's 200 tokens longer than necessary costs you thousands of dollars a month. A tool call that takes two extra seconds is the difference between a usable product and a frustrating one. A model that's 3% better on accuracy but 40% more expensive may or may not be worth it — and the answer depends on which specific task you're running.
This fifth course in the Agents series is about making agents fast, cheap, and reliable once they're in production. It covers latency optimization, cost engineering, caching strategies, batching, model-selection policies, the economics of different inference providers, and the evaluation discipline that lets you keep shipping improvements without regressions.
What observability actually means for agents — spans, traces, and the instrumentation stack that exposes every step of a run.
In late 2023, LangChain's team published a post-mortem on a production agent that was taking an average of 47 seconds per user request. The engineering team assumed the LLM itself was the bottleneck. When they added OpenTelemetry spans across the full execution graph, they discovered that 34 of those 47 seconds were spent in synchronous tool calls to a vector database, calls that were being made sequentially when they could have been parallelized. The LLM inference itself took under 4 seconds total. Without distributed tracing, every optimization guess would have targeted the wrong layer entirely.
A trace is a structured record of every operation that occurred during a single agent run, from the moment the input arrives to the moment the final output is returned. Each discrete operation — an LLM call, a tool invocation, a retrieval step, a parsing pass — becomes a span. Spans have a start timestamp, an end timestamp, parent-child relationships that encode the call hierarchy, and arbitrary key-value metadata attached at instrumentation time.
The parent-child structure is critical. A single user request might produce a root span representing the overall request, with child spans for each reasoning step, and grandchild spans for each tool call triggered within a step. This tree structure, when visualized in tools like LangSmith, Arize Phoenix, or Honeycomb, shows exactly where wall-clock time accumulates and where costs compound across nested calls.
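To make the span tree concrete, here is a minimal sketch of a toy in-process tracer built only on the standard library. The `Span` and `Tracer` classes, the span names, and the `user_id` attribute are all hypothetical stand-ins for illustration; a production system would use a real tracing SDK rather than this.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed operation; children encode the call hierarchy."""
    name: str
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0
    children: list = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


class Tracer:
    """Toy tracer (hypothetical): a stack makes new spans nest under the active one."""

    def __init__(self):
        self.root = None
        self._stack = []

    @contextmanager
    def span(self, name, **attributes):
        current = Span(name=name, attributes=dict(attributes))
        if self._stack:
            self._stack[-1].children.append(current)  # child of the active span
        else:
            self.root = current  # first span opened becomes the root
        self._stack.append(current)
        current.start = time.perf_counter()
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()


def render(span, depth=0):
    """Return the tree as indented lines, one per span."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:.1f} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("agent.request", user_id="u-123"):
    with tracer.span("agent.step.1"):
        with tracer.span("llm.call", model="example-model"):
            time.sleep(0.01)  # stands in for LLM inference
        with tracer.span("tool.vector_search"):
            time.sleep(0.02)  # stands in for a tool call

print("\n".join(render(tracer.root)))
```

The printed tree is a text version of what a trace UI draws: the root request, the reasoning step beneath it, and the LLM and tool calls as siblings under that step.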
Logging tells you what happened. Tracing tells you how long each thing took, what triggered it, and how it relates to everything else in the same request. For agents with non-linear execution paths, only tracing reveals the actual performance shape of a run.
The OpenTelemetry specification (OTel) has become the dominant standard for agent instrumentation. Frameworks including LangChain, LlamaIndex, and CrewAI can all emit OTel-compatible telemetry as of 2024, either natively or through instrumentation libraries. The semantic conventions for GenAI spans, covering attributes like gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens, were added to the OTel semantic conventions during 2024 and were still marked experimental at that point, so attribute names can change between releases. Even so, they give the ecosystem a shared schema for cross-framework comparison.
There are three instrumentation layers an advanced practitioner needs to understand: automatic instrumentation, callback-based instrumentation, and manual span creation.
Automatic instrumentation is what you get when you install an observability SDK and it patches framework internals via monkey-patching or import hooks. OpenLLMetry, for example, automatically wraps the OpenAI client so every API call generates a span with no changes to your call sites beyond initializing the SDK. This captures most of the data with almost no effort but misses business-logic context: you see the token counts but not which user query triggered them.
Callback-based instrumentation is the pattern LangChain uses via its BaseCallbackHandler interface. Callbacks fire at well-defined lifecycle events: on LLM start, on LLM end, on tool start, on tool end, on chain start, on chain end, on agent action, on agent finish. Each callback receives the full context of the event. This lets you attach your own metadata — user ID, session ID, experiment variant — to every span without touching framework internals.
Manual span creation via tracer.start_as_current_span() lets you wrap any arbitrary block of code. This is essential for custom tool logic, pre/post-processing steps, or any operation the framework doesn't automatically instrument. The discipline here is consistent naming: a span hierarchy only stays readable if span names follow a convention like agent.step.{n}, tool.{tool_name}, retriever.{source}.
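A naming convention only holds if something checks it. The sketch below is one possible way to do that, assuming the three name shapes mentioned above (agent.step.{n}, tool.{tool_name}, retriever.{source}) and assuming lowercase snake_case for the variable parts; the function names are hypothetical and the patterns would need adjusting to your own convention.

```python
import re

# Allowed span-name shapes, taken from the convention described above.
# The lowercase snake_case rule for tool and source names is an assumption.
SPAN_NAME_PATTERNS = [
    re.compile(r"^agent\.step\.\d+$"),
    re.compile(r"^tool\.[a-z][a-z0-9_]*$"),
    re.compile(r"^retriever\.[a-z][a-z0-9_]*$"),
]


def is_valid_span_name(name: str) -> bool:
    """True if the name matches one of the allowed shapes."""
    return any(pattern.match(name) for pattern in SPAN_NAME_PATTERNS)


def find_violations(names):
    """Return the span names that break the convention, in input order."""
    return [name for name in names if not is_valid_span_name(name)]


observed = [
    "agent.step.3",
    "tool.web_search",
    "retriever.docs",
    "WebSearchTool",   # class-style name that drifted from the convention
    "agent.step.two",  # step index is not numeric
]
print(find_violations(observed))
```

A check like this could run in a unit test or a CI step over the span names collected from a sample run, so naming drift is caught before it reaches the trace UI.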
The Anthropic engineering team documented in their 2024 cookbook that agents with more than five distinct tool types require explicit span naming conventions enforced at code review time. Without them, trace UIs become unreadable within weeks as tool count grows and naming drifts.
The most actionable traces go beyond timestamps and token counts. At minimum, every LLM span should record: model name and version, temperature and sampling parameters, full prompt character count (not just tokens — tokens vary by model), input token count, output token count, finish reason (stop, length, tool_use, content_filter), and latency broken down into time-to-first-token (TTFT) and total generation time.
Tool spans should record: tool name, input parameters (sanitized of PII), success or failure status, error type if failed, and the downstream latency of any network or disk operation the tool performed. This last point is where most teams underinstrument — the tool span records the total tool execution time but not the internal breakdown of time spent marshaling inputs versus time spent waiting on an external API.
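The internal breakdown described above can be captured with two timestamps inside the tool wrapper. This is a rough sketch under stated assumptions: `timed_tool_call`, the record keys, and the fake CRM functions are hypothetical names for illustration, and `time.sleep` stands in for a real network call.

```python
import json
import time


def timed_tool_call(tool_name, marshal, call_external, raw_input):
    """Run a tool and record marshaling time separately from external wait time."""
    record = {"tool.name": tool_name, "status": "ok", "error.type": None,
              "marshal_ms": 0.0, "external_wait_ms": 0.0}
    result = None
    started = time.perf_counter()
    marshal_done = None
    try:
        payload = marshal(raw_input)
        marshal_done = time.perf_counter()
        result = call_external(payload)
    except Exception as exc:
        record["status"] = "error"
        record["error.type"] = type(exc).__name__
    finished = time.perf_counter()

    if marshal_done is None:
        # The failure happened while marshaling, so all time belongs there.
        record["marshal_ms"] = (finished - started) * 1000
    else:
        record["marshal_ms"] = (marshal_done - started) * 1000
        record["external_wait_ms"] = (finished - marshal_done) * 1000
    record["total_ms"] = (finished - started) * 1000
    return result, record


def fake_crm_lookup(payload):
    """Hypothetical external call; the sleep stands in for network latency."""
    time.sleep(0.05)
    return {"customer": json.loads(payload)["customer_id"], "tier": "gold"}


def failing_lookup(payload):
    """Hypothetical external call that always times out."""
    raise TimeoutError("upstream timed out")


result, ok_record = timed_tool_call(
    "crm_lookup", json.dumps, fake_crm_lookup, {"customer_id": 42})
_, err_record = timed_tool_call(
    "crm_lookup", json.dumps, failing_lookup, {"customer_id": 42})
print(ok_record)
print(err_record)
```

In a real tracer, the record's fields would be attached as span attributes, with input parameters sanitized before they are recorded.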
You're instrumenting a multi-step research agent that uses three tools: a web search API, a vector store retriever, and a code executor. The agent runs up to 8 reasoning steps.
Breaking down exactly where tokens accumulate in an agent run — system prompts, tool schemas, conversation history, and the hidden compounding effect of multi-step loops.
In Q1 2024, the engineering team at Cognition AI — builders of the Devin software agent — described in a public technical discussion how early versions of their agent were consuming 6–8x more tokens than expected on complex coding tasks. The primary culprit was context accumulation: every agent step re-sent the entire conversation history, all tool outputs from prior steps, and the full system prompt (which included extensive coding guidelines). A single 20-step task could consume over 400,000 tokens when the actual reasoning content was under 50,000. The fix required explicit context window management — summarizing completed steps rather than passing full transcripts.
When you look at total token consumption for a multi-step agent, the cost comes from four distinct sources that compound across steps in ways single-call thinking doesn't prepare you for.
Bucket 1: System prompt tokens. Paid on every single LLM call. If your system prompt is 2,000 tokens and your agent takes 15 steps, you've paid 30,000 tokens in repeated system prompt overhead before a single word of reasoning. Many teams don't realize this because they measure prompt length once and don't multiply by step count. OpenAI added prompt caching to its API in 2024, a feature aimed squarely at this kind of repeated-prefix overhead, which is a major cost driver for agentic workloads.
Bucket 2: Tool schema tokens. When you bind tools to an LLM via function calling, every tool's JSON schema is included in the context on every call. A well-described tool with parameter descriptions, type annotations, and examples might be 300–500 tokens. Ten tools means 3,000–5,000 tokens of schema overhead per step. Teams running agents with 20 or more tools routinely discover that tool schemas are their second-largest cost center.
Anthropic's Claude tool use documentation notes that a typical tool definition with a name, description, and three parameters runs approximately 200–400 input tokens. For an agent with 15 tools running 10 steps, that's 15 × 10 × 300 = 45,000 tokens of schema cost — at Claude 3.5 Sonnet pricing (as of mid-2024, $3 per million input tokens), roughly $0.135 per run just in tool schema overhead.
Bucket 3: Conversation history tokens. The compounding problem. Step 1 sends 1,000 tokens of history. Step 2 sends those 1,000 tokens plus the step-1 output. Step 3 sends the 1,000 tokens plus the step-1 and step-2 outputs. By step 10 of a verbose agent, the history alone can exceed 50,000 tokens. This is the bucket Cognition discovered was driving their 6–8x cost overrun.
Bucket 4: Tool output tokens. Often overlooked. When a tool returns data (search results, code execution output, API responses), that data enters the context as input tokens for the next step. A web search returning 10 results at 200 words each generates ~2,500 tokens. Five searches across a task run add 12,500 tokens of tool output, all billed as input tokens to the LLM, and billed again on every later step that keeps them in the history.
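The four buckets can be put into a small model to see how they add up over a run. The sketch below is illustrative only: the function names are hypothetical, and it assumes every step makes one tool call and that all prior LLM and tool outputs are re-sent in full. The example inputs reuse figures from the text where they exist (2,000-token system prompt, 15 steps, ~2,500-token search results) and make up the rest.

```python
def step_input_buckets(step, system_tokens, schema_tokens, user_tokens,
                       llm_output_per_step, tool_output_per_step):
    """Input tokens billed at a 1-indexed step, split into the four buckets."""
    prior_steps = step - 1
    return {
        "system_prompt": system_tokens,   # re-sent on every call
        "tool_schemas": schema_tokens,    # re-sent on every call
        "history": user_tokens + prior_steps * llm_output_per_step,
        "tool_outputs": prior_steps * tool_output_per_step,
    }


def task_input_buckets(steps, **per_step_args):
    """Sum each bucket across all steps of a task."""
    totals = {"system_prompt": 0, "tool_schemas": 0,
              "history": 0, "tool_outputs": 0}
    for step in range(1, steps + 1):
        for bucket, tokens in step_input_buckets(step, **per_step_args).items():
            totals[bucket] += tokens
    return totals


def input_cost_usd(tokens, usd_per_million):
    """Convert input tokens to dollars at a per-million-token price."""
    return tokens * usd_per_million / 1_000_000


totals = task_input_buckets(
    steps=15,
    system_tokens=2_000,
    schema_tokens=10 * 400,       # assumed: 10 tools at 400 tokens each
    user_tokens=1_000,
    llm_output_per_step=500,      # assumed
    tool_output_per_step=2_500,   # one web search per step
)
for bucket, tokens in totals.items():
    print(f"{bucket:>14}: {tokens:>8,} tokens")

# The schema example from the text: 15 tools x 10 steps x 300 tokens at $3/M.
schema_tokens = 15 * 10 * 300
print(schema_tokens, round(input_cost_usd(schema_tokens, 3.00), 3))
```

Under these assumptions the accumulated tool outputs dominate the total, which is why the compounding buckets deserve attention before the fixed ones.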
A profiling discipline that separates production-grade teams from hobbyists: measuring token cost at step granularity, not just task granularity. Task-level cost tells you what you spent. Step-level cost tells you where you spent it and what you can cut.
The standard practice is to accumulate two counters per step, step_input_tokens and step_output_tokens, and to break step_input_tokens into its four buckets using your span metadata. This requires your instrumentation to record the pre-call context length at each step, which most frameworks expose through callback arguments (in LangChain, the prompts or messages passed to the LLM start callbacks) or through token counting utilities.
Once you have step-level data, you can compute marginal cost per step — how much more a step costs than the previous one, purely due to history accumulation. When marginal cost is rising steeply (more than 15–20% per step), you've crossed the threshold where context compression becomes economically necessary.
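Marginal cost per step falls out directly from the per-step input counters. This is a minimal sketch, using input tokens as a proxy for cost; the function names and the sample token counts are made up for illustration, and the 15% default simply mirrors the lower end of the range above.

```python
def marginal_growth(step_input_tokens):
    """Fractional increase in input tokens from each step to the next."""
    growth = []
    for prev, cur in zip(step_input_tokens, step_input_tokens[1:]):
        growth.append((cur - prev) / prev if prev else 0.0)
    return growth


def first_step_over(step_input_tokens, threshold=0.15):
    """1-indexed step whose growth over the previous step first exceeds threshold."""
    for step, rate in enumerate(marginal_growth(step_input_tokens), start=2):
        if rate > threshold:
            return step
    return None


# Hypothetical per-step input token counts pulled from span metadata.
observed = [8_000, 8_800, 9_600, 11_500, 14_000]

for step, rate in enumerate(marginal_growth(observed), start=2):
    print(f"step {step}: {rate:+.1%} vs previous step")
print("compress from step:", first_step_over(observed))
```

A single noisy step can cross the threshold by chance, so in practice it may be worth requiring two or more consecutive steps above it before triggering compression.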
When your per-task token cost exceeds 10x the expected single-call cost for equivalent reasoning, context compounding is almost certainly the cause. This 10x threshold was documented by the LangChain team in their 2024 agent cost optimization guide as the practical alert threshold for production systems.
You're running a customer support agent with: a 3,500-token system prompt, 12 tool schemas averaging 350 tokens each, and conversation turns averaging 800 tokens of user input and 600 tokens of assistant output per step. The agent runs up to 10 steps.
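One way to start on this scenario is to compute the input tokens billed at each step. The sketch below assumes that each step adds one 800-token user turn and one 600-token assistant turn to the history, that nothing is compressed or cached, and that tool outputs are ignored; all names are hypothetical.

```python
SYSTEM_TOKENS = 3_500
SCHEMA_TOKENS = 12 * 350        # 12 tool schemas at 350 tokens each
USER_PER_STEP = 800
ASSISTANT_PER_STEP = 600
STEPS = 10


def input_tokens_at(step):
    """Input tokens billed at a 1-indexed step, with full history re-sent."""
    fixed = SYSTEM_TOKENS + SCHEMA_TOKENS
    # All user turns so far (including the current one) plus prior assistant turns.
    history = step * USER_PER_STEP + (step - 1) * ASSISTANT_PER_STEP
    return fixed + history


per_step = [input_tokens_at(step) for step in range(1, STEPS + 1)]
total_input = sum(per_step)
total_output = STEPS * ASSISTANT_PER_STEP
fixed_share = STEPS * (SYSTEM_TOKENS + SCHEMA_TOKENS) / total_input

print("first step input:", per_step[0])
print("last step input: ", per_step[-1])
print("total input:     ", total_input)
print("total output:    ", total_output)
print(f"fixed overhead share of input: {fixed_share:.0%}")
```

Under these assumptions, roughly half of the input bill is the system prompt and tool schemas repeated on every step, which points at caching or schema pruning before history compression.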
Decomposing end-to-end agent latency into its measurable components — and identifying which ones actually yield to optimization.
When Fixie.ai published their agent latency analysis, they documented that their median tool-augmented agent response took 8.2 seconds end-to-end. Breaking it down by component: LLM inference was 2.1 seconds, tool execution was 4.8 seconds (dominated by a single slow API call to an external CRM), serialization and deserialization overhead was 0.6 seconds, and framework overhead (routing, callback processing, state management) was 0.7 seconds. The 4.8 seconds in tool execution was cut to 1.1 seconds by switching to an async tool execution pattern and caching the CRM responses with a 60-second TTL. The LLM inference time, the component most developers focus on, remained untouched and was never the real bottleneck.
End-to-end latency for an agent request has seven distinct measurable components. Most teams measure only one or two of them and then make optimization decisions based on an incomplete picture.
1. Request serialization. The time to serialize the request payload, encode it, and open the HTTP connection to the LLM API. Usually 5–50ms and rarely worth optimizing, but worth measuring to rule out connection pool exhaustion in high-throughput scenarios.
2. Time-to-first-token (TTFT). From when the API receives your request to when it starts streaming the first token. This includes queuing time on the API provider's infrastructure, prompt processing (KV cache lookup or full attention pass), and sampling setup. TTFT is what users perceive as "how long until it starts responding." It scales with input context length — longer prompts produce higher TTFT, which is why context compression improves perceived latency even when it doesn't reduce total generation time.
3. Time-per-output-token (TPOT). The time to produce each token after the first; its inverse is the decode rate. On modern GPUs the decode rate is fairly stable at 20–80 tokens/second (a TPOT of roughly 12.5–50ms) depending on model size and infrastructure load. TPOT multiplied by output length gives the decode time, and adding TTFT gives total generation time.
A 500-token input processed at an API with 300ms TTFT and a decode rate of 40 tokens/second (25ms TPOT), generating 200 output tokens, produces: 300ms TTFT + (200/40 × 1000ms) = 300ms + 5000ms = 5.3 seconds total generation time. Cutting input to 200 tokens might reduce TTFT to 150ms, a 150ms improvement on a 5.3-second total. Cutting output length to 100 tokens (by changing the prompt) saves 2.5 seconds. Context compression helps perceived latency; output length reduction helps total latency.
4. Tool execution latency. The time from when the LLM finishes generating a tool call to when the tool returns its result. This includes deserialization of the tool call, parameter validation, the actual tool operation (network I/O, disk I/O, compute), and serialization of the result. As the Fixie.ai case shows, this is frequently the dominant latency component and the highest-leverage optimization target.
5. Context injection latency. The time to take the tool result and construct the next prompt. In large context windows with many prior turns, this string construction and tokenization step can take 50–200ms and is often completely invisible in naive profiling.
6. Framework overhead. Callback processing, state serialization, routing logic, and any middleware layers in your agent framework. LangChain's callback system introduces 20–100ms per step depending on the number of registered callbacks and their complexity.
7. Inter-step coordination latency. In multi-agent systems, the time to route between agents, serialize handoff state, and initialize the downstream agent's context. This can range from negligible (in-process coordination) to seconds (cross-service coordination with cold start).
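The generation-time arithmetic from components 2 and 3 (300ms TTFT, 40 tokens/second, 200 output tokens) can be written as a one-line model, which makes it easy to compare optimization options before building them. The function name is hypothetical, and the 150ms TTFT for the shorter prompt is the same illustrative guess used in the worked example.

```python
def generation_time_ms(ttft_ms, output_tokens, tokens_per_second):
    """Total generation time: time to first token plus decode time."""
    decode_ms = output_tokens / tokens_per_second * 1000.0
    return ttft_ms + decode_ms


baseline = generation_time_ms(ttft_ms=300, output_tokens=200, tokens_per_second=40)
shorter_input = generation_time_ms(ttft_ms=150, output_tokens=200, tokens_per_second=40)
shorter_output = generation_time_ms(ttft_ms=300, output_tokens=100, tokens_per_second=40)

print(f"baseline:       {baseline:.0f} ms")
print(f"shorter input:  {shorter_input:.0f} ms (saves {baseline - shorter_input:.0f} ms)")
print(f"shorter output: {shorter_output:.0f} ms (saves {baseline - shorter_output:.0f} ms)")
```

The model ignores the other five components, so it describes a single LLM call rather than a full agent step.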
The standard technique for visualizing agent latency is the waterfall chart: a horizontal bar chart where each span is drawn as a bar starting at its start timestamp and ending at its end timestamp, with child spans nested below parent spans. This is the view you get in Jaeger, Honeycomb, or LangSmith's trace UI.
Reading a waterfall for agents requires knowing what a healthy shape looks like versus a pathological one. A healthy sequential agent shows: one root span, child spans that tile sequentially with minimal gaps, LLM spans that are the widest bars (as expected), and tool spans that complete quickly relative to LLM spans. A pathological trace shows wide gaps between spans (framework overhead accumulating), sequential tool spans that could be parallel, or LLM spans that are surprisingly narrow relative to long tool execution bars.
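When a waterfall shows sequential tool spans that do not depend on each other, the usual fix is to run them concurrently. The sketch below shows the pattern with `asyncio.gather`; the tool names and delays are made up, and `asyncio.sleep` stands in for real network I/O.

```python
import asyncio
import time


async def fake_tool(name, delay_s):
    """Hypothetical I/O-bound tool; the sleep stands in for a network call."""
    await asyncio.sleep(delay_s)
    return f"{name}: done"


async def run_sequential(calls):
    """Await each tool in turn, so latencies add up."""
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_parallel(calls):
    """Start all tools at once; gather returns results in input order."""
    return await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls))


async def timed(coro):
    """Await a coroutine and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


calls = [("search", 0.05), ("retriever", 0.05), ("crm", 0.05)]
seq_results, seq_s = asyncio.run(timed(run_sequential(calls)))
par_results, par_s = asyncio.run(timed(run_parallel(calls)))

print(f"sequential: {seq_s * 1000:.0f} ms")
print(f"parallel:   {par_s * 1000:.0f} ms")
```

This only applies when the calls are independent; a tool whose input depends on another tool's output still has to wait for it.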
Latency profiles vary significantly with API provider load. The Anthropic status page documented API p99 latency fluctuations of up to 3x between off-peak and peak hours in 2024. Always measure latency distributions (p50, p95, p99) across multiple runs and time periods — a single trace is anecdote, not data.
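Percentiles are straightforward to compute from collected span durations. This is a small sketch using the standard library's `statistics.quantiles` with the inclusive method; the function name and the sample latencies are invented for illustration.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 from a list of latency samples in milliseconds."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical end-to-end latencies from repeated runs, with a slow tail.
samples_ms = [810, 790, 850, 920, 805, 780, 2400, 830, 815, 990,
              800, 795, 870, 3100, 825, 840, 810, 905, 820, 860]

stats = latency_percentiles(samples_ms)
print(f"mean: {statistics.mean(samples_ms):.0f} ms")
for name, value in stats.items():
    print(f"{name}: {value:.0f} ms")
```

With only a few dozen samples, p99 is essentially the slowest run, so tail percentiles are only meaningful once enough traces have been collected across different times of day.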
Lesson 4 covers optimization decisions: the key principles, their real-world applications, and the implications for practitioners. It builds on the frameworks from the earlier lessons and connects them to implementation reality, where deployment introduces constraints, trade-offs, and edge cases that call for judgment rather than rigid rule-following.
Effective practitioners learn to reason across several frameworks at once, recognizing which perspective applies and how to resolve conflicts between competing priorities. Specific technologies and implementations will change; the ability to think critically about these trade-offs, rather than memorize current best practices, is what carries over.