In March 2023, Klarna deployed an AI agent handling customer service inquiries at scale. Within weeks, the team discovered the agent was occasionally routing refund requests into a loop — re-querying the same order data and never resolving the ticket. The failure was invisible in aggregate metrics: average handle time looked fine because most tickets completed normally. Only when engineers added per-run distributed tracing — capturing each tool call, its latency, its output, and its relationship to the next step — did the loop become visible as a span that repeated dozens of times on specific order IDs. The fix took 20 minutes once the trace was in hand. Finding the problem without tracing had taken three weeks.
A trace is the complete record of a single agent run — every decision, every tool call, every LLM invocation — organized as a tree of spans. Each span captures one unit of work: its start time, end time, inputs, outputs, and any structured metadata you attach. The root span represents the entire run; child spans represent sub-steps.
The dominant standard is OpenTelemetry (OTel), a vendor-neutral specification maintained by the CNCF. OTel defines trace IDs (128-bit identifiers unique to a run), span IDs (64-bit identifiers per span), and a propagation mechanism that threads those IDs through HTTP headers, queue messages, or function arguments so spans from different services still assemble into one coherent tree.
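To make the ID sizes and the propagation step concrete, here is a minimal sketch using only the standard library. It builds and parses a `traceparent` header in the W3C Trace Context format, which OTel uses by default for HTTP propagation. The helper names (`new_trace_id`, `make_traceparent`, `parse_traceparent`) are hypothetical, chosen for illustration; in practice the OTel SDK's propagators do this for you.

```python
import secrets


def new_trace_id() -> str:
    """Return a random 128-bit trace ID as 32 lowercase hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Return a random 64-bit span ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C `traceparent` value: version-traceid-parentid-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a `traceparent` value back into its fields (illustrative only)."""
    version, trace_id, parent_span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_span_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),  # lowest bit is the sampled flag
    }


# The caller attaches the header to an outgoing request; the callee parses it
# and creates its own span under the same trace ID.
header = make_traceparent(new_trace_id(), new_span_id())
context = parse_traceparent(header)
```

Because the downstream service reuses `context["trace_id"]` and records `context["parent_span_id"]` as its parent, its spans attach to the same tree as the caller's.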
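For LLM agents specifically, semantic conventions are still settling, and attribute names vary by tool. This lesson uses short illustrative names: spans for LLM calls should carry llm.model, llm.prompt_tokens, llm.completion_tokens, and llm.temperature as attributes. OpenTelemetry's own GenAI semantic conventions use the gen_ai.* namespace instead (for example gen_ai.request.model and gen_ai.usage.input_tokens), so check which names your tracing backend expects. Tool-call spans should carry the tool name, the raw arguments passed, and the raw return value. Consistent conventions make traces queryable across frameworks.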
A trace ID ties every span in one agent run together. Without it, you have logs — a flat list of events. With it, you have a graph — a structured history you can traverse, filter, and replay.
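The difference between a flat list and a graph can be sketched in a few lines. The example below assumes each span has been exported as a plain dict with hypothetical keys `span_id`, `parent_id`, `name`, and `start`; real backends store more fields, but the tree reconstruction works the same way.

```python
from collections import defaultdict


def build_span_tree(spans):
    """Index flat span records by parent; return (root_span, children_by_parent_id)."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent_id"] is None:
            root = span  # the root span has no parent
        else:
            children[span["parent_id"]].append(span)
    return root, children


def render(span, children, depth=0):
    """Return the tree as indented lines, children ordered by start time."""
    lines = ["  " * depth + span["name"]]
    for child in sorted(children[span["span_id"]], key=lambda s: s["start"]):
        lines.extend(render(child, children, depth + 1))
    return lines


# Spans arrive in arbitrary order, as they would from a log stream.
flat_spans = [
    {"span_id": "a", "parent_id": None, "name": "agent.run", "start": 0},
    {"span_id": "c", "parent_id": "a", "name": "llm.synthesize", "start": 5},
    {"span_id": "b", "parent_id": "a", "name": "tool.web_search", "start": 1},
]
root, children = build_span_tree(flat_spans)
tree_lines = render(root, children)
```

Here `tree_lines` lists `agent.run` first, followed by its two children indented and in start-time order.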
Instrumentation means wrapping your agent's execution units in span-creation calls. With the Python OTel SDK, you obtain a tracer from the global provider and use it as a context manager. The pattern looks like:
Frameworks like LangChain, LlamaIndex, and CrewAI all expose callback hooks that let you attach OTel spans without modifying core logic. LangSmith (LangChain's hosted tracing product), Arize Phoenix, and Weights & Biases Weave all ingest OTel traces and render the span tree visually. The span hierarchy maps directly onto the agent's reasoning graph — seeing it laid out makes loops, dead ends, and unexpectedly deep recursion immediately apparent.
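Loops can also be flagged programmatically rather than by eye. The sketch below assumes tool-call spans are available as dicts with hypothetical keys `kind`, `name`, and `arguments`, and treats any tool called with identical arguments at least `threshold` times in one trace as a suspected loop. The threshold of 3 is an arbitrary starting point, not a recommendation from any framework.

```python
import json
from collections import Counter


def find_repeated_tool_calls(spans, threshold=3):
    """Return {(tool_name, serialized_args): count} for calls repeated >= threshold times."""
    counts = Counter()
    for span in spans:
        if span.get("kind") != "tool":
            continue
        # Serialize arguments with sorted keys so equal dicts compare equal.
        key = (span["name"], json.dumps(span["arguments"], sort_keys=True))
        counts[key] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


# One trace in which the agent keeps re-querying the same order.
trace_spans = [
    {"kind": "llm", "name": "plan"},
    {"kind": "tool", "name": "get_order", "arguments": {"order_id": "A17"}},
    {"kind": "tool", "name": "get_order", "arguments": {"order_id": "A17"}},
    {"kind": "tool", "name": "get_order", "arguments": {"order_id": "A17"}},
    {"kind": "tool", "name": "get_order", "arguments": {"order_id": "A17"}},
    {"kind": "tool", "name": "issue_refund", "arguments": {"order_id": "A17"}},
]
suspected_loops = find_repeated_tool_calls(trace_spans)
```

Running this check when a trace closes turns a pattern that is obvious in the visual span tree into a signal you can alert on.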
In high-volume production, tracing every run is expensive. A common approach: trace 100% of runs that result in an error or that exceed a latency threshold; sample 5–10% of successful runs for baseline health. Store full traces for 7 days; store aggregated span metrics (counts, p99 latency) indefinitely.
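The sampling policy above can be sketched as a single decision function. Note that it needs the run's outcome, so the decision has to be made after the run finishes (tail-based sampling) rather than when it starts. The function name, the 10-second latency threshold, and the injectable `rng` parameter are assumptions made for illustration.

```python
import random


def should_keep_trace(
    had_error,
    duration_ms,
    latency_threshold_ms=10_000,   # assumed threshold; tune per workload
    success_sample_rate=0.05,      # lower end of the 5-10% range
    rng=random.random,             # injectable so the policy is testable
):
    """Keep every failed or slow run; keep a random fraction of the rest."""
    if had_error or duration_ms > latency_threshold_ms:
        return True
    return rng() < success_sample_rate


# A failed run is always kept; a fast successful run is kept about 5% of the time.
keep_failed = should_keep_trace(had_error=True, duration_ms=800)
keep_fast = should_keep_trace(had_error=False, duration_ms=800)
```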
Not all attributes are equally valuable. Based on post-mortem analyses from production agent deployments, the highest-signal attributes are:
One attribute teams consistently underinvest in: agent.decision_rationale — a short string extracted from the model's chain-of-thought before it calls a tool. This is the single most useful debugging aid when a trace shows the agent made a bad decision: you can read the rationale and immediately see whether it reasoned incorrectly or received bad data.
You're instrumenting a research agent that: (1) accepts a user question, (2) generates search queries, (3) calls a web-search tool multiple times, (4) synthesizes results with an LLM, and (5) returns a cited answer. Design the full trace schema.
In November 2022, Air Canada's website chatbot incorrectly told a passenger named Jake Moffatt that he could apply for a bereavement fare retroactively, advice that contradicted Air Canada's own policy. When the dispute reached British Columbia's Civil Resolution Tribunal, Air Canada argued that the chatbot was a "separate legal entity" responsible for its own statements. In its February 2024 decision, the tribunal rejected that argument and ordered Air Canada to pay $812.02 (CAD). What made this failure so damaging was its silence: the agent produced a confident, fluent, plausible-sounding answer. No exception was thrown. No error was logged. Without structured evaluation that compares agent outputs against ground-truth policy documents, the wrong answer was indistinguishable from a correct one in any conventional monitoring system.
Agent failures cluster into four categories, each requiring a different debugging approach:
Fix hard failures first (they break functionality entirely), then instrument for soft failures (they cause liability), then build drift detection (they erode quality invisibly). Teams that reverse this order spend months polishing outputs while missing critical errors.
When a failure report arrives, the structured process is:
The most common time sink in agent debugging is Step 1 — engineers can't find the failing trace because the session wasn't indexed. The second most common is skipping Step 5, which means the same bug returns after a model update.
For soft failures that don't throw exceptions, a lightweight evaluation pattern runs a second LLM call on every output: "Given this policy document and this agent response, did the agent accurately represent the policy? Answer YES or NO with a one-sentence reason." Log the verdict as a span attribute. Flag any NO verdicts for human review. This is the pattern that would have caught the Air Canada failure automatically.
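A minimal sketch of this judge pattern follows. The model call is passed in as a plain function (`call_llm`, hypothetical) so the example stays independent of any provider SDK, and the returned attribute names are illustrative rather than part of any standard. Replies that start with neither YES nor NO are marked unparseable and sent to review rather than silently passed.

```python
JUDGE_PROMPT = (
    "Given this policy document and this agent response, did the agent "
    "accurately represent the policy? Answer YES or NO with a one-sentence "
    "reason.\n\nPOLICY:\n{policy}\n\nAGENT RESPONSE:\n{response}"
)


def judge_response(policy, response, call_llm):
    """Run the judge prompt and return attributes to attach to the run's span."""
    raw = call_llm(JUDGE_PROMPT.format(policy=policy, response=response)).strip()
    first_word = raw.split(maxsplit=1)[0].strip(".,:").upper() if raw else ""
    verdict = first_word if first_word in ("YES", "NO") else "UNPARSEABLE"
    return {
        "eval.policy_accuracy.verdict": verdict,
        "eval.policy_accuracy.reason": raw,
        # Anything other than a clear YES goes to a human.
        "eval.needs_human_review": verdict != "YES",
    }


def fake_llm(prompt):
    """Stand-in for a real model call, used here only to show the flow."""
    return "NO. The policy does not allow retroactive bereavement claims."


attributes = judge_response(
    policy="Bereavement fares must be requested before travel.",
    response="You can apply for the bereavement fare after your trip.",
    call_llm=fake_llm,
)
```

Each key in `attributes` can be written onto the span with `set_attribute`, which makes NO verdicts filterable in the trace backend.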
Tracing and logging are complementary, not redundant. Logs capture discrete events at a point in time with context attached. Traces capture the causal relationship between events across the full run duration. Both are necessary.
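Structured logging means emitting logs as JSON objects rather than free-text strings. Instead of logger.info("Tool returned 5 results"), you emit logger.info("tool_results", extra={"tool": "web_search", "query": q, "num_results": 5, "trace_id": format(span.get_span_context().trace_id, "032x")}). Two details matter here. First, extra only attaches the fields to the log record; a JSON formatter on the handler is what serializes them. Second, OTel exposes trace_id as an integer, so format it as 32 hex characters to match how tracing backends display it. The trace_id field links every log line to its parent trace, so you can jump from a trace view to the full log stream for that run in one click. This linkage is one of the highest-leverage improvements most teams can make to their debugging workflow.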
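A standard-library sketch of the JSON side of this setup is shown below. The `JsonFormatter` class and its field list are hypothetical, and the trace ID is a hard-coded example integer standing in for `span.get_span_context().trace_id`.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Serialize selected record fields, including `extra` values, as one JSON object."""

    FIELDS = ("tool", "query", "num_results", "trace_id")

    def format(self, record):
        payload = {"level": record.levelname, "event": record.getMessage()}
        for field in self.FIELDS:
            if hasattr(record, field):  # present only if passed via `extra`
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def format_trace_id(trace_id_int):
    """Render an integer trace ID as the 32-character hex string backends display."""
    return format(trace_id_int, "032x")


logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

example_trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736  # stand-in for a real span's trace ID
logger.info(
    "tool_results",
    extra={
        "tool": "web_search",
        "query": "otel span attributes",
        "num_results": 5,
        "trace_id": format_trace_id(example_trace_id),
    },
)
```

Each emitted line is a single JSON object, so a log backend can index `trace_id` and join it against the trace store.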
You've received a bug report: "The agent gave a customer wrong pricing information." You have access to the trace. Practice the full debugging methodology with your coach.
When GitHub launched Copilot Workspace in 2024 — an agentic system that plans, codes, and tests multi-file changes — the engineering team described their observability stack in a post on the GitHub blog. They tracked three tiers of metrics: infrastructure metrics (API latency, error rates), LLM-specific metrics (token usage per workspace session, model selection distribution, cache hit rates), and quality metrics (did the agent's plan match the user's intent as rated by the user thumbs-up/down signal?). The quality metrics tier turned out to be the most valuable: it caught a regression after a model upgrade in hours rather than days, because the user satisfaction signal dropped measurably before any latency or error metric changed. The team credited the three-tier approach with enabling rapid safe iteration on a system handling tens of thousands of sessions per day.
Production monitoring for agents requires three distinct metric tiers, each answering a different question:
Quality metrics (Tier 3) almost always catch regressions before infrastructure metrics do. A model update that makes outputs subtly worse will show up as a drop in user satisfaction before it shows up in latency or error rate. Invest in Tier 3 first if you're resource-constrained.
The metrics pipeline is the infrastructure that collects raw signal from your agent, aggregates it, and makes it queryable. A robust pipeline for agents typically has these stages:
agent_token_cost_total, agent_tool_error_rate, agent_task_completion_rate. These are cheap to query and persist indefinitely.
One architectural decision worth getting right early: keep a link from each aggregated metric back to trace_ids that contributed to it. This enables drill-down: clicking on a spike in the "tool error rate" chart and seeing individual traces behind it, rather than having to run a separate query. Grafana supports this workflow with Tempo when trace IDs are attached to metric samples as exemplars. Avoid putting trace IDs in metric label values, because every distinct ID creates a new time series and cardinality grows without bound.
Token costs are a first-class operational concern. Aggregate llm.prompt_tokens + llm.completion_tokens by user_id, org_id, and feature_name. Build a cost dashboard alongside quality. The two metrics together reveal whether quality improvements are worth their cost increase — the central question in any agent productionization decision.
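The aggregation can be sketched as a simple group-by over LLM spans. The example assumes spans are exported as dicts carrying the attribute names used in this lesson. It keeps prompt and completion tokens separate, since providers typically price them differently, and leaves conversion to currency to the caller because prices are provider-specific.

```python
from collections import defaultdict


def aggregate_tokens(llm_spans):
    """Sum token usage per (user_id, org_id, feature_name)."""
    totals = defaultdict(lambda: {"prompt": 0, "completion": 0, "total": 0})
    for span in llm_spans:
        key = (span["user_id"], span["org_id"], span["feature_name"])
        prompt = span["llm.prompt_tokens"]
        completion = span["llm.completion_tokens"]
        totals[key]["prompt"] += prompt
        totals[key]["completion"] += completion
        totals[key]["total"] += prompt + completion
    return dict(totals)


llm_spans = [
    {"user_id": "u1", "org_id": "acme", "feature_name": "search",
     "llm.prompt_tokens": 1200, "llm.completion_tokens": 300},
    {"user_id": "u1", "org_id": "acme", "feature_name": "search",
     "llm.prompt_tokens": 800, "llm.completion_tokens": 200},
    {"user_id": "u2", "org_id": "acme", "feature_name": "summarize",
     "llm.prompt_tokens": 500, "llm.completion_tokens": 100},
]
usage = aggregate_tokens(llm_spans)
```

Here `usage[("u1", "acme", "search")]["total"]` is 2500, which can be charted next to the quality score for the same feature.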
Static test suites run before deployment catch regressions in known failure modes. But agents encounter novel inputs in production that no static suite covers. Continuous evaluation — running automated quality checks on a sample of live production runs — is the only way to detect emergent quality issues.
The standard approach: after each agent run completes, asynchronously run 1–3 LLM-as-judge evaluations on the output. Score for: (1) factual groundedness — does the answer follow from the retrieved context? (2) instruction following — did the agent do what the user asked? (3) safety — does the output comply with content policies? Write the scores as span attributes on the root span. Aggregate into Tier 3 metrics. Alert if any score distribution shifts more than two standard deviations from baseline. This pattern was described by Anthropic's Applied team in their 2024 agent evaluation guidelines as the minimum viable continuous eval setup for production agents.
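The alerting rule at the end of this pattern admits several readings. One simple interpretation, sketched below, compares the mean judge score over a recent window against the baseline mean, measured in units of the baseline's standard deviation. The function name and the example scores are hypothetical; production systems often use more robust drift tests.

```python
from statistics import mean, stdev


def score_shift_alert(baseline_scores, recent_scores, max_sigma=2.0):
    """Return True if the recent mean is more than max_sigma baseline std devs away."""
    base_mean = mean(baseline_scores)
    base_std = stdev(baseline_scores)  # sample standard deviation; needs >= 2 points
    recent_mean = mean(recent_scores)
    if base_std == 0:
        # A perfectly flat baseline: any change at all counts as a shift.
        return recent_mean != base_mean
    return abs(recent_mean - base_mean) / base_std > max_sigma


baseline_groundedness = [0.90, 0.92, 0.88, 0.91, 0.89]
after_model_update = [0.80, 0.82, 0.79]
alert = score_shift_alert(baseline_groundedness, after_model_update)
```

With these example numbers the recent mean sits several baseline standard deviations below the baseline mean, so `alert` is true.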
You're responsible for the observability stack of a coding assistant agent handling ~20,000 sessions/day, using GPT-4o for generation and a vector database for code context retrieval.
Lesson 4 covers alerting and incident response: the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.