Building AI Agents V · Introduction

Making something work is engineering. Making it work at scale and cheaply is a different craft.

Every agent in production has an unsolved optimization problem. This course is where you solve yours.

It's one thing to build an agent that works in a demo. It's a different thing entirely to run that agent at production scale — a thousand times per minute, across a million customers, inside an SLA, on a budget your CFO approves.

At scale, every design choice compounds. A prompt that's 200 tokens longer than necessary costs you thousands of dollars a month. A tool call that takes two extra seconds is the difference between a usable product and a frustrating one. A model that's 3% better on accuracy but 40% more expensive may or may not be worth it — and the answer depends on which specific task you're running.

This fifth course in the Agents series is about making agents fast, cheap, and reliable once they're in production. It covers latency optimization, cost engineering, caching strategies, batching, model-selection policies, the economics of different inference providers, and the evaluation discipline that lets you keep shipping improvements without regressions.

If you finish every module, here's who you become:

You'll know exactly where time and money go inside an agent run — down to individual tool calls, LLM inference steps, and prompt tokens.
You'll be able to instrument an agent with distributed tracing spans and correctly identify the real bottleneck before writing a single line of optimization code.
You'll understand the economics of inference providers well enough to model cost per thousand runs and make a defensible budget case to a CFO.
You'll design caching and parallelization strategies that cut latency without introducing the invalidation bugs that kill production reliability.
You'll build agents that degrade gracefully — with fallbacks, retries, and circuit breakers — so a failing dependency doesn't become a failing product.
You'll run eval-driven improvement pipelines that let you keep shipping agent changes without accidentally breaking what was already working.
You're becoming the engineer who treats a production agent as a system to be measured and optimized, not a demo to be defended.

🎯 Advanced · Lesson 1 of 4

The Trace Layer

What observability actually means for agents — spans, traces, and the instrumentation stack that exposes every step of a run.

In late 2023, LangChain's team published a post-mortem on a production agent that was taking an average of 47 seconds per user request. The engineering team assumed the LLM itself was the bottleneck. When they added OpenTelemetry spans across the full execution graph, they discovered 34 of those 47 seconds were spent in synchronous tool calls to a vector database, made sequentially when they could have been parallelized. The LLM inference itself took under 4 seconds total. Without distributed tracing, every optimization guess would have targeted the wrong layer entirely.

What a Trace Actually Contains

A trace is a structured record of every operation that occurred during a single agent run, from the moment the input arrives to the moment the final output is returned. Each discrete operation — an LLM call, a tool invocation, a retrieval step, a parsing pass — becomes a span. Spans have a start timestamp, an end timestamp, parent-child relationships that encode the call hierarchy, and arbitrary key-value metadata attached at instrumentation time.

The parent-child structure is critical. A single user request might produce a root span representing the overall request, with child spans for each reasoning step, and grandchild spans for each tool call triggered within a step. This tree structure, when visualized in tools like LangSmith, Arize Phoenix, or Honeycomb, shows exactly where wall-clock time accumulates and where costs compound across nested calls.
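To make the tree structure concrete, here is a minimal sketch of a span record and a helper that finds the slowest leaf. The `Span` class, its field names, and the sample timings are illustrative assumptions, not the schema of any particular tracing tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed operation in a trace (hypothetical minimal schema)."""
    name: str
    start_ms: float
    end_ms: float
    attributes: dict = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

    @property
    def self_time_ms(self) -> float:
        # Time in this span not covered by its direct children.
        # Assumes children run sequentially and do not overlap.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def slowest_leaf(span: Span) -> Span:
    """Return the leaf span with the longest duration in the tree."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children),
               key=lambda s: s.duration_ms)


# Root request -> reasoning steps -> LLM and tool calls (made-up timings).
root = Span("agent.request", 0, 9000, children=[
    Span("agent.step.1", 0, 5000, children=[
        Span("llm.call", 0, 1200),
        Span("tool.vector_search", 1200, 4900),
    ]),
    Span("agent.step.2", 5000, 9000, children=[
        Span("llm.call", 5000, 6500),
        Span("tool.vector_search", 6500, 8900),
    ]),
])

bottleneck = slowest_leaf(root)
```

In this made-up trace the slowest leaf is a tool span rather than an LLM span, which is the kind of result a flat log would not surface.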

Key Distinction

Logging tells you what happened. Tracing tells you how long each thing took, what triggered it, and how it relates to everything else in the same request. For agents with non-linear execution paths, only tracing reveals the actual performance shape of a run.

The OpenTelemetry specification (OTel) has become the dominant standard for agent instrumentation. Frameworks including LangChain, LlamaIndex, and CrewAI can emit OTel-compatible telemetry, either natively or through instrumentation libraries. The OTel semantic conventions for GenAI spans, which cover attributes such as gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens, give the ecosystem a shared schema for cross-framework comparison. As of late 2024 these conventions were still marked as in development rather than stable, so attribute names may change between releases.

Instrumentation Strategies at the Framework Level

There are three instrumentation layers an advanced practitioner needs to understand: automatic instrumentation, callback-based instrumentation, and manual span creation.

Automatic instrumentation is what you get when you install an observability SDK and it patches framework internals via monkey-patching or import hooks. OpenLLMetry, for example, automatically wraps the OpenAI client so every API call generates a span without any code changes. This captures the majority of data with zero effort but misses business-logic-level context — you see the token counts but not which user query triggered them.
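The patching mechanism can be sketched in a few lines. This is not how OpenLLMetry or any specific SDK is implemented; `FakeLLMClient`, `instrument`, and `RECORDED_SPANS` are hypothetical names used only to show the general idea of wrapping a client method so each call records a span.

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for a real span exporter


class FakeLLMClient:
    """Hypothetical client standing in for a real provider SDK."""

    def complete(self, prompt: str) -> dict:
        return {
            "text": prompt.upper(),
            "input_tokens": len(prompt.split()),
            "output_tokens": 3,
        }


def instrument(cls, method_name: str, span_name: str) -> None:
    """Replace cls.method_name with a wrapper that records one span per call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        result = original(self, *args, **kwargs)
        RECORDED_SPANS.append({
            "name": span_name,
            "duration_s": time.perf_counter() - start,
            "input_tokens": result.get("input_tokens"),
            "output_tokens": result.get("output_tokens"),
        })
        return result

    setattr(cls, method_name, wrapper)


# Calling code is unchanged; the span appears as a side effect of the patch.
instrument(FakeLLMClient, "complete", "llm.complete")
response = FakeLLMClient().complete("summarize this ticket")
```

The recorded span has token counts and timing but nothing about the user or the business request, which is the gap the other two layers fill.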

Callback-based instrumentation is the pattern LangChain uses via its BaseCallbackHandler interface. Callbacks fire at well-defined lifecycle events: on LLM start, on LLM end, on tool start, on tool end, on chain start, on chain end, on agent action, on agent finish. Each callback receives the full context of the event. This lets you attach your own metadata — user ID, session ID, experiment variant — to every span without touching framework internals.
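The callback pattern can be illustrated without any framework installed. The handler below is a framework-agnostic sketch: the class, its method signatures, and the `run_id` argument are assumptions chosen to echo the lifecycle events listed above, and they do not reproduce LangChain's actual BaseCallbackHandler signatures.

```python
import time


class TracingCallbacks:
    """Hypothetical handler; hook names echo the lifecycle events above."""

    def __init__(self, **context):
        self.context = context   # e.g. user_id, session_id, experiment variant
        self._open = {}          # run_id -> (span name, start time)
        self.spans = []          # finished spans, in completion order

    def _start(self, run_id, name):
        self._open[run_id] = (name, time.perf_counter())

    def _end(self, run_id, **extra):
        name, started = self._open.pop(run_id)
        self.spans.append({
            "name": name,
            "duration_s": time.perf_counter() - started,
            **self.context,      # business context attached to every span
            **extra,
        })

    def on_llm_start(self, run_id, model):
        self._start(run_id, f"llm.{model}")

    def on_llm_end(self, run_id, output_tokens):
        self._end(run_id, output_tokens=output_tokens)

    def on_tool_start(self, run_id, tool_name):
        self._start(run_id, f"tool.{tool_name}")

    def on_tool_end(self, run_id, ok):
        self._end(run_id, ok=ok)


# Simulate the events a framework would fire during one agent step.
handler = TracingCallbacks(user_id="u_123", session_id="s_456", variant="prompt_v2")
handler.on_llm_start("run-1", model="example-model")
handler.on_llm_end("run-1", output_tokens=42)
handler.on_tool_start("run-2", tool_name="web_search")
handler.on_tool_end("run-2", ok=True)
```

Every finished span carries the user, session, and variant identifiers even though none of the hooks were told about them individually.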

Manual span creation via tracer.start_as_current_span() lets you wrap any arbitrary block of code. This is essential for custom tool logic, pre/post-processing steps, or any operation the framework doesn't automatically instrument. The discipline here is consistent naming: a span hierarchy only stays readable if span names follow a convention like agent.step.{n}, tool.{tool_name}, retriever.{source}.
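A naming convention is easier to keep if it can be checked mechanically. The sketch below validates span names against the three example patterns from the paragraph above; the restriction to lowercase snake_case identifiers and the function names are assumptions added for illustration.

```python
import re

# Patterns for agent.step.{n}, tool.{tool_name}, retriever.{source}.
# Lowercase snake_case for the variable part is an assumed house rule.
SPAN_NAME_PATTERNS = [
    re.compile(r"^agent\.step\.\d+$"),
    re.compile(r"^tool\.[a-z][a-z0-9_]*$"),
    re.compile(r"^retriever\.[a-z][a-z0-9_]*$"),
]


def is_valid_span_name(name: str) -> bool:
    """True if the name matches one of the agreed patterns."""
    return any(p.match(name) for p in SPAN_NAME_PATTERNS)


def invalid_span_names(names: list[str]) -> list[str]:
    """Return the names that break the convention, preserving order."""
    return [n for n in names if not is_valid_span_name(n)]


offenders = invalid_span_names([
    "agent.step.3",
    "tool.web_search",
    "Tool.WebSearch",      # wrong case
    "retriever.docs",
    "search",              # missing prefix
])
```

A check like this could run in a unit test or a linter step so naming drift is caught before it reaches the trace UI.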

Production Pattern

The Anthropic engineering team documented in their 2024 cookbook that agents with more than five distinct tool types require explicit span naming conventions enforced at code review time. Without it, trace UIs become unreadable within weeks as tool count grows and naming drifts.

What to Capture in Span Metadata

The most actionable traces go beyond timestamps and token counts. At minimum, every LLM span should record: model name and version, temperature and sampling parameters, full prompt character count (not just tokens — tokens vary by model), input token count, output token count, finish reason (stop, length, tool_use, content_filter), and latency broken down into time-to-first-token (TTFT) and total generation time.

Tool spans should record: tool name, input parameters (sanitized of PII), success or failure status, error type if failed, and the downstream latency of any network or disk operation the tool performed. This last point is where most teams underinstrument — the tool span records the total tool execution time but not the internal breakdown of time spent marshaling inputs versus time spent waiting on an external API.

Beyond these baseline fields, the following attributes make traces easier to slice during analysis:

Prompt version or hash: essential when A/B testing prompt changes in production
Retrieval metadata: chunk count returned, similarity scores, retriever name
Agent step number — which iteration of the reasoning loop this span belongs to
Decision type: was this a tool call decision, a final answer, or a clarification request
Session and user identifiers (pseudonymized) for cross-request aggregation
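The LLM span fields described in this section can be gathered into one attribute dictionary. In this sketch only the three `gen_ai.*` keys come from the conventions mentioned earlier in the lesson; every `app.*` key, the function name, and the sample values are hypothetical choices for illustration.

```python
import hashlib


def llm_span_attributes(*, model, temperature, prompt, input_tokens,
                        output_tokens, finish_reason, ttft_ms, total_ms, step):
    """Build the attribute dict for one LLM span (illustrative schema)."""
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Keys below use a made-up "app." namespace.
        "app.temperature": temperature,
        "app.prompt_chars": len(prompt),
        # Short hash identifies the prompt version without storing its text.
        "app.prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12],
        "app.finish_reason": finish_reason,
        "app.ttft_ms": ttft_ms,
        "app.generation_ms": total_ms - ttft_ms,
        "app.agent_step": step,
    }


prompt_text = "You are a support agent. Answer using the retrieved context."
attrs = llm_span_attributes(
    model="example-model",
    temperature=0.2,
    prompt=prompt_text,
    input_tokens=512,
    output_tokens=128,
    finish_reason="tool_use",
    ttft_ms=300,
    total_ms=2100,
    step=2,
)
```

Hashing the prompt rather than storing it keeps spans small and avoids copying prompt text into the tracing backend.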

🎯 Advanced · Lesson 1 Quiz

Quiz: The Trace Layer

3 questions — free, untracked, retake anytime.

1. In the LangChain post-mortem documented in late 2023, the 47-second agent runtime was primarily caused by what?

✓ Correct. 34 of the 47 seconds were synchronous vector DB calls. The LLM itself took under 4 seconds; distributed tracing revealed the mismatch between assumption and reality.

✗ Not quite. The LLM inference only accounted for under 4 seconds. The bottleneck was synchronous vector database calls that tracing revealed could be parallelized.

2. Which instrumentation approach fires at lifecycle events like "on LLM start" and "on tool end" without patching framework internals?

✓ Correct. LangChain's BaseCallbackHandler fires at well-defined lifecycle events, letting you attach custom metadata like user IDs and session identifiers to every span.

✗ Not quite. That describes a different layer. Callback-based instrumentation (via BaseCallbackHandler) fires at lifecycle events without requiring framework-internal patching.

3. What does "finish reason" in an LLM span metadata tell you that token counts alone cannot?

✓ Correct. Finish reason (stop, length, tool_use, content_filter) reveals the nature of the termination, which is critical for debugging loops, truncations, and unexpected content blocks.

✗ Not quite. Finish reason specifically tells you why the model stopped generating: natural stop, token-limit truncation, tool invocation, or content filter block — each requiring different follow-up action.

🎯 Advanced · Lesson 1 Lab

Lab: Designing a Trace Schema

Work through instrumentation decisions with an AI tutor trained on this lesson.

Your Task

You're instrumenting a multi-step research agent that uses three tools: a web search API, a vector store retriever, and a code executor. The agent runs up to 8 reasoning steps.

Work through these challenges with the tutor:

Design a span naming convention that stays readable across 8 steps and 3 tools.
Identify which metadata attributes are essential versus nice-to-have for your specific tool set.
Decide how you'd differentiate between a tool-call finish reason and a natural-stop finish reason in your span schema.

Ask the tutor: "What span metadata should I capture for a code executor tool that might fail mid-execution or return partial results?"

🎯 Trace Schema Tutor Advanced Lab 1

🎯 Advanced · Lesson 2 of 4

Token Cost Anatomy

Breaking down exactly where tokens accumulate in an agent run — system prompts, tool schemas, conversation history, and the hidden compounding effect of multi-step loops.

In Q1 2024, the engineering team at Cognition AI — builders of the Devin software agent — described in a public technical discussion how early versions of their agent were consuming 6–8x more tokens than expected on complex coding tasks. The primary culprit was context accumulation: every agent step re-sent the entire conversation history, all tool outputs from prior steps, and the full system prompt (which included extensive coding guidelines). A single 20-step task could consume over 400,000 tokens when the actual reasoning content was under 50,000. The fix required explicit context window management — summarizing completed steps rather than passing full transcripts.

The Four Token Buckets in an Agent Run

When you look at total token consumption for a multi-step agent, the cost comes from four distinct sources that compound across steps in ways single-call thinking doesn't prepare you for.

Bucket 1: System prompt tokens. Paid on every single LLM call. If your system prompt is 2,000 tokens and your agent takes 15 steps, you've paid 30,000 tokens in repeated system prompt overhead before a single word of reasoning. Many teams miss this because they measure prompt length once and don't multiply by step count. Repeated prefixes like this are the workload that prompt caching, which OpenAI added to its API in 2024, is designed to discount.

Bucket 2: Tool schema tokens. When you bind tools to an LLM via function calling, every tool's JSON schema is included in the context on every call. A well-described tool with parameter descriptions, type annotations, and examples might be 300–500 tokens. Ten tools means 3,000–5,000 tokens of schema overhead per step. Teams running 20+ tool agents routinely discover tool schemas are their second-largest cost center.

Real Numbers

Anthropic's Claude tool use documentation notes that a typical tool definition with a name, description, and three parameters runs approximately 200–400 input tokens. For an agent with 15 tools running 10 steps, that's 15 × 10 × 300 = 45,000 tokens of schema cost — at Claude 3.5 Sonnet pricing (as of mid-2024, $3 per million input tokens), roughly $0.135 per run just in tool schema overhead.

Bucket 3: Conversation history tokens. The compounding problem. Step 1 sends 1,000 tokens of history. Step 2 sends 1,000 + step-1 output tokens. Step 3 sends 1,000 + step-1 + step-2 output tokens. By step 10 of a verbose agent, the history alone can exceed 50,000 tokens. This is the bucket Cognition discovered was driving their 6–8x cost overrun.

Bucket 4: Tool output tokens. Often overlooked. When a tool returns data — search results, code execution output, API responses — that data enters the context as input tokens for the next step. A web search returning 10 results at 200 words each generates ~2,500 tokens. Five searches across a task run adds 12,500 tokens of tool output cost, all billed as input tokens to the LLM.
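The four buckets can be modeled per step with a small function. This is a simplified sketch: it assumes fixed per-step growth in history and tool output, and the function name, parameter names, and the 500- and 1,000-token growth figures are illustrative assumptions. The tool count, schema size, step count, and price reuse the figures from the Real Numbers callout.

```python
from dataclasses import dataclass


@dataclass
class StepInput:
    """Input tokens sent on one LLM call, split by bucket."""
    system_prompt: int
    tool_schemas: int
    history: int
    tool_outputs: int

    @property
    def total(self) -> int:
        return (self.system_prompt + self.tool_schemas
                + self.history + self.tool_outputs)


def input_tokens_by_step(system_prompt, n_tools, tokens_per_schema,
                         initial_history, output_per_step,
                         tool_output_per_step, steps):
    """Model the input tokens for each step under fixed per-step growth."""
    rows = []
    history = initial_history
    tool_outputs = 0
    for _ in range(steps):
        rows.append(StepInput(system_prompt, n_tools * tokens_per_schema,
                              history, tool_outputs))
        history += output_per_step            # assistant output joins history
        tool_outputs += tool_output_per_step  # tool results stay in context
    return rows


rows = input_tokens_by_step(
    system_prompt=2000, n_tools=15, tokens_per_schema=300,
    initial_history=1000, output_per_step=500,
    tool_output_per_step=1000, steps=10,
)

schema_tokens = sum(r.tool_schemas for r in rows)   # 15 x 10 x 300
schema_cost_usd = schema_tokens * 3 / 1_000_000     # at $3 per million input tokens
```

The first two buckets are flat per step while the last two grow, so the share of each step's input taken by history and tool output rises as the run gets longer.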

Measuring Cost Per Step vs. Cost Per Task

A profiling discipline that separates production-grade teams from hobbyists: measuring token cost at step granularity, not just task granularity. Task-level cost tells you what you spent. Step-level cost tells you where you spent it and what you can cut.

The standard practice is to accumulate two counters per step: step_input_tokens and step_output_tokens, and to break step_input_tokens into its four buckets using your span metadata. This requires your instrumentation to record the pre-call context length at each step, which most frameworks expose via the callback's serialized parameter or through token counting utilities.

Once you have step-level data, you can compute marginal cost per step — how much more a step costs than the previous one, purely due to history accumulation. When marginal cost is rising steeply (more than 15–20% per step), you've crossed the threshold where context compression becomes economically necessary.
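Marginal growth is simple to compute once per-step input token counts are recorded. The sketch below assumes those counts are already available as a list; the function names, the sample numbers, and the choice of 15% as the default threshold (the low end of the range above) are illustrative.

```python
def marginal_growth(step_input_tokens):
    """Fractional growth in input tokens from each step to the next."""
    return [
        (cur - prev) / prev
        for prev, cur in zip(step_input_tokens, step_input_tokens[1:])
    ]


def first_step_over(step_input_tokens, threshold=0.15):
    """1-based index of the first step whose growth exceeds the threshold."""
    for step, growth in enumerate(marginal_growth(step_input_tokens), start=2):
        if growth > threshold:
            return step
    return None


# Made-up per-step input token counts for a five-step run.
tokens = [4000, 4300, 4700, 5600, 7000]
compress_before = first_step_over(tokens)
```

For these sample counts growth stays under 10% through step 3 and jumps to about 19% at step 4, so step 4 is where compression would start to pay off.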

Track cumulative token count before each LLM call — the delta between steps is your history growth rate
Separate input cost from output cost in your cost model — output tokens are typically 3–5x more expensive per token than input
Flag any tool output exceeding 2,000 tokens as a compression candidate — most long tool outputs can be summarized before re-injection
Use prompt caching where available — Anthropic's cache write costs 25% more than normal input, but cache reads cost 90% less
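The cache pricing in the last item can be turned into a quick break-even check. This sketch assumes the stable prefix is written once and then read on every later call within the cache lifetime, which will not always hold in production. The multipliers come from the figures quoted above; the function and parameter names are hypothetical.

```python
def prefix_cost_tokens(prefix_tokens, calls, cached,
                       write_multiplier=1.25, read_multiplier=0.10):
    """Cost of a repeated prefix in input-token equivalents.

    Assumes one cache write on the first call and cache hits afterwards.
    """
    if not cached or calls == 0:
        return float(prefix_tokens * calls)
    return prefix_tokens * (write_multiplier + read_multiplier * (calls - 1))


# A 2,000-token system prompt sent on each of 15 steps.
uncached = prefix_cost_tokens(2000, 15, cached=False)
cached = prefix_cost_tokens(2000, 15, cached=True)
savings = 1 - cached / uncached
```

With these inputs the uncached prefix costs 30,000 token-equivalents and the cached one roughly 5,300, a saving of a little over 80%. On a single call caching costs more than not caching because of the write premium.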

Optimization Trigger

When your per-task token cost exceeds 10x the expected single-call cost for equivalent reasoning, context compounding is almost certainly the cause. This 10x threshold was documented by the LangChain team in their 2024 agent cost optimization guide as the practical alert threshold for production systems.

🎯 Advanced · Lesson 2 Quiz

Quiz: Token Cost Anatomy

3 questions — free, untracked, retake anytime.

1. Cognition AI's Devin agent was consuming 6–8x more tokens than expected. What was the primary cause?

✓ Correct. Context accumulation was the culprit: every step re-sent the full transcript, all prior tool outputs, and the complete system prompt. The fix was summarizing completed steps.

✗ Not quite. The driver was context accumulation: every step re-sent the entire conversation history, all prior tool outputs, and the full system prompt without any compression.

2. An agent has a 2,000-token system prompt and runs 15 steps. What is the system prompt's contribution to total input token cost?

✓ Correct. 2,000 × 15 = 30,000 tokens. System prompts are included as input on every LLM call unless prompt caching is explicitly enabled and the prefix is stable.

✗ Not quite. Without prompt caching, the system prompt is re-sent as input on every LLM call. 2,000 tokens × 15 steps = 30,000 tokens of system prompt overhead.

3. What does "marginal cost per step" measure in an agent profiling context?

✓ Correct. Marginal cost per step reveals the rate at which history accumulation is compounding cost. When it exceeds 15–20% growth per step, context compression becomes economically necessary.

✗ Not quite. Marginal cost per step measures the cost increase from one step to the next — driven primarily by growing conversation history injected into each subsequent call.

🎯 Advanced · Lesson 2 Lab

Lab: Token Budget Modeling

Model and optimize token costs for a realistic agent configuration with an AI tutor.

Your Task

You're running a customer support agent with: a 3,500-token system prompt, 12 tool schemas averaging 350 tokens each, and conversation turns averaging 800 tokens of user input and 600 tokens of assistant output per step. The agent runs up to 10 steps.

Work through these calculations and decisions with the tutor:

Calculate the worst-case input token count at step 10 (showing all four buckets).
Identify which bucket offers the highest cost-reduction leverage for this specific configuration.
Decide how you would apply prompt caching to this setup and estimate the savings.

Ask the tutor: "Walk me through calculating the step 10 input token count for this agent, bucket by bucket."

💰 Token Cost Tutor Advanced Lab 2

🎯 Advanced · Lesson 3 of 4

Latency Profiling

Decomposing end-to-end agent latency into its measurable components — and identifying which ones actually yield to optimization.

When Fixie.ai published their agent latency analysis, they documented that their median tool-augmented agent response took 8.2 seconds end-to-end. Breaking it down by component: LLM inference was 2.1 seconds, tool execution was 4.8 seconds (dominated by a single slow API call to an external CRM), serialization and deserialization overhead was 0.6 seconds, and framework overhead (routing, callback processing, state management) was 0.7 seconds. The 4.8 seconds in tool execution was cut to 1.1 seconds by switching to an async tool execution pattern and caching the CRM responses with a 60-second TTL. The LLM inference time, the component most developers focus on, remained untouched and was never the real bottleneck.
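A time-to-live cache of the kind described in this case can be sketched in a few lines. This is not the implementation from the case study; the `TTLCache` class, the injectable clock, and the `fetch_customer` stand-in are hypothetical, and the sketch ignores eviction and concurrency.

```python
import time


class TTLCache:
    """Minimal time-to-live cache keyed by string (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._entries = {}      # key -> (stored_at, value)

    def get_or_call(self, key, fetch):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]       # fresh entry: skip the slow call
        value = fetch()
        self._entries[key] = (now, value)
        return value


# Drive the cache with a fake clock instead of waiting 60 real seconds.
fake_now = [0.0]
fetch_count = [0]


def fetch_customer():
    """Stand-in for a slow CRM request."""
    fetch_count[0] += 1
    return {"id": 42, "tier": "gold"}


cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get_or_call("customer:42", fetch_customer)   # miss, fetches
fake_now[0] = 30.0
second = cache.get_or_call("customer:42", fetch_customer)  # hit within TTL
fake_now[0] = 61.0
third = cache.get_or_call("customer:42", fetch_customer)   # expired, refetches
```

The TTL bounds staleness: a short value such as 60 seconds trades a small risk of serving slightly old CRM data for removing repeated identical calls within one agent run.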

Decomposing Agent Latency: The Seven Components

End-to-end latency for an agent request has seven distinct measurable components. Most teams measure only one or two of them and then make optimization decisions based on an incomplete picture.

1. Request serialization. The time to serialize the request payload, encode it, and open the HTTP connection to the LLM API. Usually 5–50ms and rarely worth optimizing, but worth measuring to rule out connection pool exhaustion in high-throughput scenarios.

2. Time-to-first-token (TTFT). From when the API receives your request to when it starts streaming the first token. This includes queuing time on the API provider's infrastructure, prompt processing (KV cache lookup or full attention pass), and sampling setup. TTFT is what users perceive as "how long until it starts responding." It scales with input context length — longer prompts produce higher TTFT, which is why context compression improves perceived latency even when it doesn't reduce total generation time.

3. Time-per-output-token (TPOT). The time to produce each token after the first, usually quoted as its inverse, the decode rate. On modern GPUs this is fairly stable at 20–80 tokens/second (12.5–50 ms per token) depending on model size and infrastructure load. TPOT multiplied by output length gives the decode portion of generation time; adding TTFT gives total generation time.

The TTFT / TPOT Split

A 500-token input processed at an API with 300ms TTFT and a decode rate of 40 tokens/second (25ms TPOT), generating 200 output tokens, produces: 300ms TTFT + (200/40 × 1000ms) = 300ms + 5000ms = 5.3 seconds total generation time. Cutting input to 200 tokens might reduce TTFT to 150ms, a 150ms improvement on a 5.3-second total. Cutting output length to 100 tokens (by changing the prompt) saves 2.5 seconds. Context compression helps perceived latency; output length reduction helps total latency.
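The arithmetic in this callout can be checked with a one-line model. The function name and parameters are illustrative, and the model assumes a constant decode rate for the whole response.

```python
def generation_time_s(ttft_ms, tokens_per_second, output_tokens):
    """Total generation time: time to first token plus decode time."""
    return ttft_ms / 1000 + output_tokens / tokens_per_second


baseline = generation_time_s(ttft_ms=300, tokens_per_second=40, output_tokens=200)
shorter_input = generation_time_s(ttft_ms=150, tokens_per_second=40, output_tokens=200)
shorter_output = generation_time_s(ttft_ms=300, tokens_per_second=40, output_tokens=100)

saved_by_input_cut = baseline - shorter_input    # about 0.15 s
saved_by_output_cut = baseline - shorter_output  # about 2.5 s
```

Under these assumptions, halving the output saves far more wall-clock time than the input reduction does, which matches the conclusion in the callout.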

4. Tool execution latency. The time from when the LLM finishes generating a tool call to when the tool returns its result. This includes deserialization of the tool call, parameter validation, the actual tool operation (network I/O, disk I/O, compute), and serialization of the result. As the Fixie.ai case shows, this is frequently the dominant latency component and the highest-leverage optimization target.

5. Context injection latency. The time to take the tool result and construct the next prompt. In large context windows with many prior turns, this string construction and tokenization step can take 50–200ms and is often completely invisible in naive profiling.

6. Framework overhead. Callback processing, state serialization, routing logic, and any middleware layers in your agent framework. LangChain's callback system introduces 20–100ms per step depending on the number of registered callbacks and their complexity.

7. Inter-step coordination latency. In multi-agent systems, the time to route between agents, serialize handoff state, and initialize the downstream agent's context. This can range from negligible (in-process coordination) to seconds (cross-service coordination with cold start).

Latency Profiling in Practice: Waterfall Analysis

The standard technique for visualizing agent latency is the waterfall chart: a horizontal bar chart where each span is drawn as a bar starting at its start timestamp and ending at its end timestamp, with child spans nested below parent spans. This is the view you get in Jaeger, Honeycomb, or LangSmith's trace UI.

Reading a waterfall for agents requires knowing what a healthy shape looks like versus a pathological one. A healthy sequential agent shows: one root span, child spans that tile sequentially with minimal gaps, LLM spans that are the widest bars (as expected), and tool spans that complete quickly relative to LLM spans. A pathological trace shows wide gaps between spans (framework overhead accumulating), sequential tool spans that could be parallel, or LLM spans that are surprisingly narrow relative to long tool execution bars.

Wide gaps between spans: Indicates framework overhead or blocking state transitions — instrument the gap period to find what's running
Sequential tool spans with no data dependency: Immediate parallelism opportunity — use asyncio.gather() or equivalent
Rapidly growing LLM span width across steps: Context accumulation driving TTFT increase — apply compression before step N where this inflects
Tool spans wider than LLM spans: External I/O is the bottleneck — profile the tool internally and apply caching or async patterns
Many short LLM spans: Agent is making excessive small calls — consolidate reasoning steps
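The parallelism fix for independent tool calls can be sketched with asyncio.gather(). The tools here are simulated with asyncio.sleep, and `fake_tool`, the tool names, and the 50 ms delays are made-up stand-ins for real I/O-bound calls.

```python
import asyncio
import time


async def fake_tool(name, delay_s):
    """Stand-in for an I/O-bound tool call."""
    await asyncio.sleep(delay_s)
    return f"{name}:done"


async def run_sequential(calls):
    # Each call waits for the previous one to finish.
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_parallel(calls):
    # All calls are in flight at once; results keep the input order.
    return await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls))


def timed(runner, calls):
    start = time.perf_counter()
    results = asyncio.run(runner(calls))
    return list(results), time.perf_counter() - start


CALLS = [("web_search", 0.05), ("vector_store", 0.05), ("crm_lookup", 0.05)]
seq_results, seq_seconds = timed(run_sequential, CALLS)
par_results, par_seconds = timed(run_parallel, CALLS)
```

Sequential time is roughly the sum of the delays while parallel time is roughly the longest single delay. This only applies when the calls have no data dependency on each other, as the list above notes.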

Measurement Discipline

Latency profiles vary significantly with API provider load. The Anthropic status page documented API p99 latency fluctuations of up to 3x between off-peak and peak hours in 2024. Always measure latency distributions (p50, p95, p99) across multiple runs and time periods — a single trace is anecdote, not data.
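Percentiles can be computed from collected latency samples with the standard library. This sketch assumes the samples are end-to-end latencies in milliseconds gathered across many runs; the function name and the choice of the "inclusive" interpolation method are illustrative.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 for a list of latency samples."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Mostly fast requests with a slow tail (made-up data).
samples = [800] * 90 + [2500] * 8 + [9000] * 2
summary = latency_percentiles(samples)
```

In a sample like this the median looks healthy while the tail percentiles are several times higher, which is why a single trace or an average is not enough to characterize latency.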

🎯 Advanced · Lesson 3 Quiz

Quiz: Latency Profiling

3 questions — free, untracked, retake anytime.

1. In the Fixie.ai latency analysis, tool execution was cut from 4.8s to 1.1s. What two techniques achieved this?

✓ Correct. Async execution eliminated sequential blocking, and the 60-second TTL cache eliminated repeated identical CRM calls across steps.

✗ Not quite. The fixes were async tool execution (eliminating sequential blocking) and a 60-second TTL cache on CRM responses — neither involved the LLM layer.

2. Time-to-first-token (TTFT) is most affected by which variable?

✓ Correct. TTFT scales with input context length because the model must process the full prompt (KV cache lookup or full attention pass) before generating the first output token.

✗ Not quite. TTFT is primarily driven by input context length — longer prompts require more prefill compute before the first token is generated. This is why context compression improves perceived latency.

3. In a waterfall trace, what does "rapidly growing LLM span width across successive steps" indicate?

✓ Correct. Growing LLM span width across steps is the signature of TTFT inflation from context accumulation. The inflection point in this growth curve is where you should apply compression.

✗ Not quite. Growing LLM span width across steps indicates that TTFT is inflating as context grows — the hallmark signature of history accumulation driving prefill latency upward.

🎯 Advanced · Lesson 3 Lab

Lab: Explore Lesson 3 Concepts

Apply what you learned in Lesson 3 through guided AI conversation

Your Task

Use the AI below to explore Lesson 3 concepts in depth. Challenge assumptions and work through scenarios.

Try asking about a specific concept from Lesson 3 and how it applies in practice.

🤖 AESOP Lab Assistant Lesson 3 Lab

Building AI Agents V — Optimization · Module 1 · Lesson 4

Lesson 4: Optimization Decisions

Advanced concepts, real-world applications, and practical implications

Core Concepts

This lesson explores optimization decisions: the key principles, real-world applications, and implications for practitioners working in this domain.

Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.

Practical Applications

The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.

Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.

Looking Forward

As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.

Lesson 4 Quiz

Lesson 4: Optimization Decisions

What is the primary focus of Lesson 4: Optimization Decisions?

✓ Correct. This lesson bridges theory and practice, focusing on real-world implementation.

Review the lesson — the focus is on connecting frameworks to practical reality.

Why does real-world deployment introduce challenges that pure theory doesn't capture?

✓ Correct. Real deployment requires judgment, not just framework application.

Practice doesn't invalidate theory — it reveals complexities that require nuanced application of theoretical principles.

What separates effective practitioners from those who merely follow checklists?

✓ Correct. Critical thinking and adaptability matter more than memorized procedures.

The key differentiator is critical thinking ability, not experience or resources alone.

🎯 Advanced · Lesson 4 Lab

Lab: Apply What You've Learned

Synthesize concepts from Lesson 4: Optimization Decisions through guided AI conversation

Your Task

Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to optimization decisions.

Try: "How would the concepts from this lesson apply to a real-world scenario in this field?"

🤖 AESOP Lab Assistant Lesson 4 Lab

Module 1 Test

Profiling Agent Performance · 15 Questions · 70% to Pass


1. What is the core objective of Profiling Agent Performance?

2. How should practitioners approach applying concepts from this module?

3. Which best describes the relationship between theory and practice in Building AI Agents V — Optimization?

4. What distinguishes expert practitioners from novices in this field?

5. How does Profiling Agent Performance build on previous modules?

6. What role do constraints play in practical implementation?

7. When applying frameworks from this module, what is most important?

8. How should practitioners handle conflicting perspectives in this field?

9. What makes the concepts in Profiling Agent Performance relevant beyond their immediate context?

10. How should practitioners continue developing expertise after completing this module?

11. What is the relationship between understanding Building AI Agents V — Optimization concepts and making decisions?

12. How do the lessons from this module apply to novel situations?

13. What is the value of understanding multiple perspectives on agent optimization?

14. How should practitioners evaluate new information or developments in this field?

15. What is the ultimate goal of learning Profiling Agent Performance?