🎯 Advanced

Lesson 1: Tracing Agent Runs

Structured traces, span hierarchies, and the observability primitives that let you replay exactly what your agent did — and why.

In March 2023, Klarna deployed an AI agent handling customer service inquiries at scale. Within weeks, the team discovered the agent was occasionally routing refund requests into a loop — re-querying the same order data and never resolving the ticket. The failure was invisible in aggregate metrics: average handle time looked fine because most tickets completed normally. Only when engineers added per-run distributed tracing — capturing each tool call, its latency, its output, and its relationship to the next step — did the loop become visible as a span that repeated dozens of times on specific order IDs. The fix took 20 minutes once the trace was in hand. Finding the problem without tracing had taken three weeks.

What a Trace Actually Is

A trace is the complete record of a single agent run — every decision, every tool call, every LLM invocation — organized as a tree of spans. Each span captures one unit of work: its start time, end time, inputs, outputs, and any structured metadata you attach. The root span represents the entire run; child spans represent sub-steps.

The dominant standard is OpenTelemetry (OTel), a vendor-neutral specification maintained by the CNCF. OTel defines trace IDs (128-bit identifiers unique to a run), span IDs (64-bit identifiers per span), and a propagation mechanism that threads those IDs through HTTP headers, queue messages, or function arguments so spans from different services still assemble into one coherent tree.
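
As a rough illustration of those ID sizes and of header-based propagation, the sketch below builds and parses a W3C Trace Context traceparent header using only the standard library. The helper names new_traceparent and parse_traceparent are hypothetical; a real deployment would normally leave this to the OTel SDK's propagators.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Build a traceparent value with a fresh 128-bit trace ID and 64-bit span ID."""
    trace_id = secrets.token_hex(16)  # 16 bytes = 128 bits = 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes = 64 bits = 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent value into its fields, rejecting malformed input."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid under the spec.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        raise ValueError("all-zero trace or span ID")
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

A downstream service that receives this header would reuse trace_id and record span_id as the parent of its own spans, which is what lets spans from separate processes assemble into one tree.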

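For LLM agents specifically, attribute naming is still settling. This lesson uses llm.* names: spans for LLM calls carry llm.model, llm.prompt_tokens, llm.completion_tokens, and llm.temperature as attributes. OpenTelemetry's own generative-AI semantic conventions cover the same ground under a gen_ai.* namespace (for example gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens), so check which naming your backend expects. Tool-call spans should carry the tool name, the raw arguments passed, and the raw return value. Consistent conventions make traces queryable across frameworks.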

Key Primitive

A trace ID ties every span in one agent run together. Without it, you have logs — a flat list of events. With it, you have a graph — a structured history you can traverse, filter, and replay.

Instrumenting an Agent with OTel

Instrumentation means wrapping your agent's execution units in span-creation calls. With the Python OTel SDK, you obtain a tracer from the global provider with trace.get_tracer(__name__) and open each span with tracer.start_as_current_span(...), which works as a context manager. The pattern looks like:

  • Create a root span for the entire agent run, attaching the user's request as an attribute.
  • For each LLM call, create a child span. Record the model name, token counts from the response, and the finish reason.
  • For each tool call, create a child span under the LLM span that triggered it; multiple tool calls from the same LLM step become siblings of one another. Record tool name, arguments (truncated to a safe length), and the result or exception.
  • Use span events (point-in-time annotations) for notable moments: "retrieved 12 documents", "retrying after rate limit", "tool returned empty result".
  • Set span status to ERROR and record the exception if any step throws, so failure spans are automatically queryable.
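
The span layout in the list above can be sketched without installing the OTel SDK. The MiniTracer class below is a hypothetical in-memory stand-in, not an OpenTelemetry API; its span() context manager only mimics the shape of opening a current span. Attribute names follow this lesson's llm.* and tool.* examples, and the model name, token counts, and tool are invented.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    parent: Optional["Span"]
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status: str = "OK"
    start: float = 0.0
    end: float = 0.0


class MiniTracer:
    """Hypothetical in-memory tracer that records spans as a parent-linked tree."""

    def __init__(self):
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name, attributes=None):
        parent = self._stack[-1] if self._stack else None
        current = Span(name, parent, dict(attributes or {}), start=time.monotonic())
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        except Exception as exc:
            # Mark the failing span so error runs are queryable, then re-raise.
            current.status = "ERROR"
            current.events.append(f"exception: {exc!r}")
            raise
        finally:
            current.end = time.monotonic()
            self._stack.pop()


def run_agent(tracer, question):
    with tracer.span("agent.run", {"user.request": question}):
        with tracer.span("llm.call", {"llm.model": "example-model"}) as llm:
            llm.attributes.update({"llm.prompt_tokens": 42, "llm.completion_tokens": 7})
            # The tool span nests under the LLM span that requested it.
            with tracer.span("tool.call", {"tool.name": "web_search"}) as tool:
                arguments = repr({"query": question})
                tool.attributes["tool.arguments"] = arguments[:200]  # truncate
                tool.events.append("retrieved 12 documents")
        return "answer"
```

With the real SDK the nesting works the same way: a span opened inside another span's with-block becomes its child, and exceptions are recorded on the innermost open span.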

Frameworks like LangChain, LlamaIndex, and CrewAI all expose callback hooks that let you attach OTel spans without modifying core logic. LangSmith (LangChain's hosted tracing product), Arize Phoenix, and Weights & Biases Weave all ingest OTel traces and render the span tree visually. The span hierarchy maps directly onto the agent's reasoning graph — seeing it laid out makes loops, dead ends, and unexpectedly deep recursion immediately apparent.

Sampling Strategy

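In high-volume production, tracing every run is expensive. A common approach: trace 100% of runs that result in an error or that exceed a latency threshold; sample 5–10% of successful runs for baseline health. Because a run's outcome is only known once it finishes, the keep-or-drop decision has to be made after the trace completes (tail-based sampling) rather than at the first span. Store full traces for 7 days; store aggregated span metrics (counts, p99 latency) indefinitely.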
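
One way to express that policy is a small decision function applied once a run has finished. This is a minimal sketch: the function name keep_trace, the 10-second latency threshold, and the 10% success rate are illustrative assumptions, not values from any particular collector.

```python
import hashlib


def keep_trace(trace_id: str, had_error: bool, duration_s: float,
               latency_threshold_s: float = 10.0,
               success_sample_rate: float = 0.10) -> bool:
    """Decide, after a run finishes, whether to keep its full trace."""
    # Always keep failures and slow runs.
    if had_error or duration_s > latency_threshold_s:
        return True
    # Hash the trace ID into a stable bucket in [0, 1) so that every
    # component seeing the same ID reaches the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_sample_rate
```

Hashing the trace ID, rather than calling a random number generator, keeps the decision reproducible when the same trace is evaluated more than once.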

Span Attributes Worth Capturing

Not all attributes are equally valuable. Based on post-mortem analyses from production agent deployments, the highest-signal attributes are:

  • session_id / conversation_id — links traces for multi-turn interactions, essential for debugging state-corruption bugs that only manifest over multiple turns.
  • agent.step_number — the ordinal position of this step in the run; makes loops detectable by querying for runs where max step_number exceeds a threshold.
  • tool.name + tool.error_code — allows you to build a tool-reliability dashboard showing which tools fail most frequently and under what conditions.
  • llm.prompt_tokens + llm.completion_tokens — the primary cost signal; aggregate by user/org to build cost attribution.
  • retrieval.query + retrieval.num_results — for RAG agents, captures whether retrieval is returning meaningful context or consistently coming back empty.

One attribute teams consistently underinvest in: agent.decision_rationale — a short string extracted from the model's chain-of-thought before it calls a tool. This is the single most useful debugging aid when a trace shows the agent made a bad decision: you can read the rationale and immediately see whether it reasoned incorrectly or received bad data.


Lesson 1 Quiz

3 questions — free, untracked, retake anytime.
1. What is the relationship between a trace and its spans?
✓ Correct. A trace is the full tree; spans are the individual nodes in that tree, each representing one discrete unit of work (one LLM call, one tool invocation, etc.).
✗ Not quite. A trace is the complete tree of all spans in one agent run; think of the trace as the container and spans as its nodes.
2. The Klarna agent loop case study illustrates which key insight about aggregate metrics?
✓ Correct. Because the majority of tickets resolved normally, the per-ticket loop was hidden in averages, and per-run tracing was required to surface it.
✗ The case study shows the opposite: aggregate handle time looked fine even while specific runs were looping dozens of times.
3. Which attribute is described as the single most useful aid when an agent makes a bad decision?
✓ Correct. Reading the rationale tells you immediately whether the agent reasoned incorrectly or received bad data, collapsing a long debugging session into seconds.
✗ Those attributes are valuable for cost and reliability, but agent.decision_rationale is flagged as the highest-signal attribute for understanding bad decisions.

Lab 1: Designing a Trace Schema

Work with an AI coach to design a complete OTel span schema for a real agent architecture.

Your Mission

You're instrumenting a research agent that: (1) accepts a user question, (2) generates search queries, (3) calls a web-search tool multiple times, (4) synthesizes results with an LLM, and (5) returns a cited answer. Design the full trace schema.

  1. Describe the agent's steps to the coach and ask what spans you need and what attributes each span should carry.
  2. Ask about sampling strategy for a volume of ~50,000 runs/day.
  3. Ask the coach to critique your proposed schema for any gaps or over-instrumentation.
Start by describing the research agent architecture to the coach and asking: "What spans and attributes should I define for this agent's OTel trace schema?"
🤖 Observability Coach Lab 1

Lesson 2: Debugging Agent Failures

A structured methodology for diagnosing silent failures, wrong outputs, and tool errors — from trace to root cause.

In February 2024, British Columbia's Civil Resolution Tribunal ruled against Air Canada after the airline's website chatbot told a passenger, Jake Moffatt, that he could apply for a bereavement fare retroactively, advice that contradicted Air Canada's own policy. Air Canada argued that the chatbot was a "separate legal entity" responsible for its own statements. The tribunal rejected that argument and held Air Canada liable for $812.02 (CAD). What made this failure so damaging was its silence: the chatbot produced a confident, fluent, plausible-sounding answer. No exception was thrown. No error was logged. Without structured evaluation that compares agent outputs against ground-truth policy documents, the wrong answer was indistinguishable from a correct one in any conventional monitoring system.

The Four Failure Modes of Agents

Agent failures cluster into four categories, each requiring a different debugging approach:

  • Hard failures — exceptions, API errors, tool timeouts. These are the easiest to catch: set span status to ERROR, attach the exception, alert on error rate. The debugger's job is simply to find the offending span and trace the input that triggered it.
  • Soft failures (hallucinations) — the agent produces a confident wrong answer, as in the Air Canada case. No exception is thrown. Detection requires evaluation: LLM-as-judge scoring against a rubric, or comparison against a ground-truth corpus. The trace alone won't reveal the failure; you need an evaluation layer on top of it.
  • Behavioral drift — outputs are individually acceptable but have shifted over weeks. Caused by upstream model updates, changing data distributions, or accumulated prompt changes. Detected by tracking output distributions over time: sentiment, length, tool-call frequency, citation count.
  • Loop / stuck-state failures — the agent cycles between steps without terminating, as in the Klarna case. Detected by querying traces for runs where agent.step_number exceeds a maximum threshold, or where the same tool is called with identical arguments more than N times.

Debugging Priority Order

Fix hard failures first (they break functionality entirely), then instrument for soft failures (they create liability), then build drift detection (drift erodes quality invisibly). Teams that reverse this order spend months polishing outputs while missing critical errors.
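
The two loop checks described for stuck-state failures (a step count over a limit, or the same tool called repeatedly with identical arguments) can be sketched as a query over a run's spans. This assumes each span is available as a plain dict of attributes; the function name find_loop_signals and both default limits are illustrative.

```python
import json
from collections import Counter


def find_loop_signals(spans, max_steps=25, max_repeats=3):
    """Return human-readable reasons a run looks stuck; an empty list means none."""
    reasons = []

    # Check 1: the run took more steps than any healthy run should.
    steps = [s.get("agent.step_number", 0) for s in spans]
    if steps and max(steps) > max_steps:
        reasons.append(f"step_number reached {max(steps)} (limit {max_steps})")

    # Check 2: the same tool was called with identical arguments too often.
    calls = Counter(
        (s["tool.name"], json.dumps(s.get("tool.arguments"), sort_keys=True))
        for s in spans
        if "tool.name" in s
    )
    for (tool, arguments), count in calls.items():
        if count > max_repeats:
            reasons.append(
                f"{tool} called {count} times with identical arguments {arguments}"
            )
    return reasons
```

Serializing arguments with sort_keys=True makes two calls compare equal even when their argument dicts were built in a different key order.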

A Trace-Driven Root Cause Methodology

When a failure report arrives, the structured process is:

  • Step 1: Isolate the trace. Find the trace ID from the failing run. If the failure was reported by a user, you need session_id or conversation_id indexed in your trace store to look it up. This is why those attributes are non-negotiable.
  • Step 2: Identify the first ERROR or anomalous span. Sort spans by start time and find the earliest span that either errored or produced unexpected output. Failures almost always cascade, so fixing the root cause usually eliminates the downstream errors as well.
  • Step 3: Read the inputs, outputs, and rationale of that span. For an LLM span, look at the exact prompt (including retrieved context) and the exact completion. For a tool span, look at the exact arguments and the exact return value. The failure is almost always immediately obvious at this level.
  • Step 4: Reproduce the failure deterministically. Re-run the exact same inputs against the same model with the same tools. If you can't reproduce it, set temperature=0 for the reproduction and replace any non-deterministic tool calls with recorded responses. Reproducibility is a prerequisite for a fix.
  • Step 5: Write a regression test. Encode the failing input/output pair as an eval case before writing the fix. This ensures the bug can't silently return later.
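
Steps 4 and 5 can be combined: tool outputs recorded in the failing trace are replayed instead of calling live tools, and the case is kept as a regression check. The sketch below is one possible shape, assuming the agent is a function that takes the user input and a tools object; ReplayTools, run_eval_case, and the example case are all hypothetical.

```python
import json


class ReplayTools:
    """Serve recorded tool results keyed by tool name and arguments."""

    def __init__(self, recordings):
        self._results = {
            self._key(r["tool"], r["arguments"]): r["result"] for r in recordings
        }

    @staticmethod
    def _key(tool, arguments):
        return tool + ":" + json.dumps(arguments, sort_keys=True)

    def call(self, tool, arguments):
        key = self._key(tool, arguments)
        if key not in self._results:
            # An unrecorded call means the run diverged from the original trace.
            raise KeyError(f"no recorded response for {key}")
        return self._results[key]


def run_eval_case(agent_fn, case):
    """Run agent_fn against recorded tools and check the expected output."""
    tools = ReplayTools(case["recordings"])
    output = agent_fn(case["input"], tools)
    return case["expected_substring"] in output


# Invented example case; in practice the values come from the failing trace.
REFUND_CASE = {
    "input": "Where is my refund for order A1?",
    "recordings": [
        {"tool": "get_order", "arguments": {"order_id": "A1"},
         "result": {"status": "refunded"}},
    ],
    "expected_substring": "refunded",
}
```

Raising on an unrecorded call is deliberate: if the agent under test asks for something the original run never requested, the reproduction has already diverged and the result should not be trusted.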

The most common time sink in agent debugging is Step 1 — engineers can't find the failing trace because the session wasn't indexed. The second most common is skipping Step 5, which means the same bug returns after a model update.

LLM-as-Judge for Soft Failures

For soft failures that don't throw exceptions, a lightweight evaluation pattern runs a second LLM call on every output: "Given this policy document and this agent response, did the agent accurately represent the policy? Answer YES or NO with a one-sentence reason." Log the verdict as a span attribute. Flag any NO verdicts for human review. This is the kind of check that could have flagged the Air Canada failure for review before it reached a tribunal.
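
A minimal sketch of that judge step, assuming the model call is injected as a plain function so that any provider can be plugged in. The names build_judge_prompt, parse_verdict, judge, and call_llm are hypothetical, and treating unparseable replies as needing review is a design choice of this sketch rather than something the lesson prescribes.

```python
import re

JUDGE_TEMPLATE = (
    "Given this policy document and this agent response, did the agent "
    "accurately represent the policy? Answer YES or NO with a one-sentence "
    "reason.\n\nPOLICY:\n{policy}\n\nAGENT RESPONSE:\n{response}\n"
)


def build_judge_prompt(policy: str, response: str) -> str:
    return JUDGE_TEMPLATE.format(policy=policy, response=response)


def parse_verdict(judge_output: str) -> dict:
    """Extract the YES/NO verdict; anything unparseable is flagged for review."""
    match = re.match(
        r"\s*(YES|NO)\b[\s:.,-]*(.*)", judge_output, re.IGNORECASE | re.DOTALL
    )
    if match is None:
        return {"verdict": "UNPARSEABLE", "reason": judge_output.strip(),
                "needs_review": True}
    verdict = match.group(1).upper()
    return {"verdict": verdict, "reason": match.group(2).strip(),
            "needs_review": verdict == "NO"}


def judge(policy: str, response: str, call_llm) -> dict:
    """call_llm is any callable that takes a prompt string and returns text."""
    return parse_verdict(call_llm(build_judge_prompt(policy, response)))
```

The returned dict maps directly onto span attributes: the verdict and reason are logged on the run's span, and needs_review drives the human-review queue.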

Structured Logging vs. Tracing

Tracing and logging are complementary, not redundant. Logs capture discrete events at a point in time with context attached. Traces capture the causal relationship between events across the full run duration. Both are necessary.

Structured logging means emitting logs as JSON objects rather than free text strings. Instead of logger.info("Tool returned 5 results"), you emit logger.info("tool_results", extra={"tool": "web_search", "query": q, "num_results": 5, "trace_id": format(span.get_span_context().trace_id, "032x")}) and configure a JSON formatter so the extra fields are actually serialized. The OTel API exposes the trace ID as an integer; formatting it as 32 hex characters matches how trace backends display it. The trace_id field links every log line to its parent trace, so you can jump from a trace view to the full log stream for that run in one click. This linkage is the single highest-leverage improvement most teams can make to their debugging workflow.
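
The standard library's logging module does not emit JSON by default, so the extra fields need a formatter that serializes them. The sketch below is one stdlib-only way to do it; JsonFormatter and format_trace_id are hypothetical names, and many teams use an existing JSON logging package instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including fields passed via extra."""

    # Attributes every LogRecord carries; anything else came from `extra`.
    _STANDARD = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
        "message", "asctime"
    }

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        payload.update(
            {k: v for k, v in vars(record).items() if k not in self._STANDARD}
        )
        # default=str keeps the formatter from failing on non-JSON values.
        return json.dumps(payload, default=str)


def format_trace_id(trace_id: int) -> str:
    """Format an integer trace ID as the 32-character hex string backends show."""
    return format(trace_id, "032x")
```

To use it, attach the formatter to a handler with handler.setFormatter(JsonFormatter()) and pass trace_id=format_trace_id(...) inside extra on each log call.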


Lesson 2 Quiz

3 questions — free, untracked, retake anytime.
1. What made the Air Canada chatbot failure particularly difficult to detect with conventional monitoring?
✓ Correct. Soft failures like hallucinations are silent: the system behaves normally from a metrics standpoint while producing wrong and potentially harmful outputs.
✗ The failure was a soft failure: the agent returned a confident, fluent, wrong answer with no exception or error signal.
2. In the trace-driven root cause methodology, what is the most common time sink that delays debugging?
✓ Correct. Step 1 (isolating the trace) is blocked when sessions aren't indexed, which is why session_id is described as a non-negotiable attribute.
✗ The lesson identifies Step 1 (isolating the trace) as the most common time sink, specifically because the session wasn't indexed for lookup.
3. What is the key benefit of attaching trace_id to structured log lines?
✓ Correct. Trace-log linkage collapses two separate investigations ("what happened in the trace?" and "what did the logs say?") into a single workflow.
✗ The lesson describes trace_id in logs as enabling a one-click jump from the trace view to the full log stream, the highest-leverage workflow improvement for debugging.

Lab 2: Diagnosing a Failure from Trace Data

Practice the five-step root cause methodology on a realistic trace scenario.

Your Mission

You've received a bug report: "The agent gave a customer wrong pricing information." You have access to the trace. Practice the full debugging methodology with your coach.

  1. Tell the coach the failure type (soft failure, no exception thrown) and ask which step of the methodology to start with and what to look for.
  2. Ask the coach how you'd implement LLM-as-judge evaluation to catch this class of failure automatically.
  3. Ask what regression test you should write before fixing the prompt.
Start by telling the coach: "An agent returned wrong pricing information with no error thrown. I have the trace ID. Walk me through the root cause process."
🤖 Debugging Coach Lab 2

Lesson 3: Production Monitoring

Dashboards, metrics pipelines, and the operational signals that distinguish a healthy agent from one quietly failing at scale.

When GitHub launched Copilot Workspace in 2024 — an agentic system that plans, codes, and tests multi-file changes — the engineering team described their observability stack in a post on the GitHub blog. They tracked three tiers of metrics: infrastructure metrics (API latency, error rates), LLM-specific metrics (token usage per workspace session, model selection distribution, cache hit rates), and quality metrics (did the agent's plan match the user's intent as rated by the user thumbs-up/down signal?). The quality metrics tier turned out to be the most valuable: it caught a regression after a model upgrade in hours rather than days, because the user satisfaction signal dropped measurably before any latency or error metric changed. The team credited the three-tier approach with enabling rapid safe iteration on a system handling tens of thousands of sessions per day.

The Three-Tier Metrics Model

Production monitoring for agents requires three distinct metric tiers, each answering a different question:

  • Tier 1 — Infrastructure metrics: Is the system available and fast? Track API error rate, p50/p95/p99 latency per endpoint, queue depth for async runs, dependency health (model provider, vector DB, external APIs). These are the table-stakes metrics every production system needs.
  • Tier 2 — LLM-specific metrics: Is resource consumption as expected? Track token usage (input + output) per run and per user, model selection distribution (which model is actually being called?), tool call frequency per run, context window utilization (are you hitting the limit?), and cache hit rates if you're using semantic caching.
  • Tier 3 — Quality metrics: Is the agent doing the right thing? Track task completion rate (did the run end in a successful final answer or an abandonment?), explicit user feedback (thumbs up/down, re-queries), implicit signals (did the user immediately ask the same question again?), and automated eval scores from LLM-as-judge pipelines.

Critical Insight

Quality metrics (Tier 3) almost always catch regressions before infrastructure metrics do. A model update that makes outputs subtly worse will show up as a drop in user satisfaction before it shows up in latency or error rate. Invest in Tier 3 first if you're resource-constrained.

Building a Metrics Pipeline

The metrics pipeline is the infrastructure that collects raw signal from your agent, aggregates it, and makes it queryable. A robust pipeline for agents typically has these stages:

  • Emission: Your instrumented agent emits OTel spans and structured logs. Span attributes carry the raw metric values (token counts, tool names, latency, eval scores).
  • Collection: An OTel Collector runs as a sidecar or gateway. It receives spans, applies sampling rules, and fans out to multiple backends — a trace store (Jaeger, Tempo, LangSmith) for full traces, and a metrics backend (Prometheus, Datadog) for aggregated signals.
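  • Aggregation: Spans are first converted into metrics, for example with the OTel Collector's span metrics connector, and exported to Prometheus. Prometheus recording rules then precompute the series you query most often: agent_token_cost_total, agent_tool_error_rate, agent_task_completion_rate. Aggregated series are far cheaper to store and query than full traces, so they can be retained much longer.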
  • Visualization: Grafana dashboards present the three tiers in separate rows. Tier 1 at the top (infrastructure SLIs), Tier 2 in the middle (resource/cost), Tier 3 at the bottom (quality). The bottom row is the one engineers should check first when investigating a complaint.

One architectural decision worth getting right early: keep a link from each aggregated metric back to the traces that contributed to it. This enables drill-down: clicking a spike in the "tool error rate" chart and seeing the individual traces behind it, rather than having to run a separate query. Attach trace IDs as exemplars rather than as ordinary label values, because one label value per trace creates unbounded series cardinality in Prometheus. Grafana can display exemplars on a metric panel and link each one to the matching trace in Tempo.
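
The idea of keeping a few trace IDs next to each aggregate can be shown in plain Python, independent of any metrics backend. In this sketch spans are assumed to be dicts with tool.name, status, and trace_id keys; the function name tool_error_rates and the cap of three examples per tool are illustrative.

```python
from collections import defaultdict


def tool_error_rates(spans, max_examples=3):
    """Aggregate tool spans into per-tool error rates with example trace IDs."""
    stats = defaultdict(
        lambda: {"calls": 0, "errors": 0, "example_trace_ids": []}
    )
    for span in spans:
        entry = stats[span["tool.name"]]
        entry["calls"] += 1
        if span.get("status") == "ERROR":
            entry["errors"] += 1
            # Keep a bounded number of failing traces for drill-down.
            if len(entry["example_trace_ids"]) < max_examples:
                entry["example_trace_ids"].append(span["trace_id"])
    return {
        tool: {**entry, "error_rate": entry["errors"] / entry["calls"]}
        for tool, entry in stats.items()
    }
```

Capping the examples keeps the aggregate small while still giving an engineer a concrete trace to open when a rate spikes.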

Cost Attribution

Token costs are a first-class operational concern. Aggregate llm.prompt_tokens + llm.completion_tokens by user_id, org_id, and feature_name. Build a cost dashboard alongside quality. The two metrics together reveal whether quality improvements are worth their cost increase — the central question in any agent productionization decision.
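
Cost attribution reduces to a weighted sum of token counts grouped by an attribution key. The sketch below assumes spans are dicts carrying this lesson's llm.* attributes plus an org_id; the price table is a made-up placeholder and must be replaced with your provider's actual rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; not real provider rates.
PRICE_PER_MTOK = {"example-model": {"prompt": 3.00, "completion": 15.00}}


def cost_by_key(spans, key="org_id"):
    """Sum estimated LLM cost in USD per value of `key` (org, user, feature)."""
    totals = defaultdict(float)
    for span in spans:
        price = PRICE_PER_MTOK[span["llm.model"]]
        cost = (
            span["llm.prompt_tokens"] * price["prompt"]
            + span["llm.completion_tokens"] * price["completion"]
        ) / 1_000_000
        totals[span[key]] += cost
    return dict(totals)
```

Calling cost_by_key(spans, key="feature_name") gives the per-feature view, which is the one to set beside quality scores when deciding whether an improvement is worth its cost.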

Continuous Evaluation in Production

Static test suites run before deployment catch regressions in known failure modes. But agents encounter novel inputs in production that no static suite covers. Continuous evaluation — running automated quality checks on a sample of live production runs — is the only way to detect emergent quality issues.

The standard approach: after each agent run completes, asynchronously run 1–3 LLM-as-judge evaluations on the output. Score for: (1) factual groundedness — does the answer follow from the retrieved context? (2) instruction following — did the agent do what the user asked? (3) safety — does the output comply with content policies? Write the scores as span attributes on the root span. Aggregate into Tier 3 metrics. Alert if any score distribution shifts more than two standard deviations from baseline. This pattern was described by Anthropic's Applied team in their 2024 agent evaluation guidelines as the minimum viable continuous eval setup for production agents.
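
The lesson does not define how to measure a "shift", so the sketch below adopts one simple interpretation as an assumption: compare the mean of the current window of eval scores with the baseline mean, measured in baseline standard deviations. The function name score_shift_alert is hypothetical.

```python
from statistics import mean, stdev


def score_shift_alert(baseline, current, threshold_sigmas=2.0):
    """Return True when the current window's mean score has moved more than
    threshold_sigmas baseline standard deviations from the baseline mean."""
    if len(baseline) < 2 or not current:
        return False  # not enough data to estimate spread or compare
    sigma = stdev(baseline)
    shift = abs(mean(current) - mean(baseline))
    if sigma == 0:
        # A perfectly flat baseline: any movement at all counts as a shift.
        return shift > 0
    return shift / sigma > threshold_sigmas
```

Run it separately for each scored dimension (groundedness, instruction following, safety) so that a drop in one is not masked by stability in the others.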


Lesson 3 Quiz

3 questions — free, untracked, retake anytime.
1. Which metric tier caught a regression first in the GitHub Copilot Workspace case study?
✓ Correct. The quality metrics tier detected the regression in hours because user satisfaction dropped measurably before latency or error rates changed at all.
✗ The case study specifically credits Tier 3 quality metrics (user thumbs-up/down) for catching the regression before any infrastructure metric changed.
2. Why is storing raw trace_ids alongside aggregated metrics valuable in a Grafana + Tempo setup?
✓ Correct. Trace IDs attached to the aggregated metrics allow a direct link from a metric anomaly to the raw traces that caused it, cutting investigation time significantly.
✗ The lesson describes trace_id storage as enabling drill-down: click a spike in a chart and see the specific traces responsible, with no separate query needed.
3. What are the three dimensions typically scored in a continuous evaluation LLM-as-judge pipeline?
✓ Correct. These three dimensions (groundedness, instruction following, and safety) are described as the minimum viable continuous eval setup for production agents.
✗ The lesson lists factual groundedness, instruction following, and safety as the three evaluation dimensions for continuous production evaluation.

Lab 3: Designing a Three-Tier Dashboard

Design the metrics pipeline and Grafana dashboard layout for a production agent deployment.

Your Mission

You're responsible for the observability stack of a coding assistant agent handling ~20,000 sessions/day, using GPT-4o for generation and a vector database for code context retrieval.

  1. Ask the coach what specific metrics you should track in each of the three tiers for this agent type.
  2. Ask how to design the aggregation pipeline so that metric spikes link back to individual traces.
  3. Ask what a minimum viable continuous evaluation setup looks like for a coding assistant specifically.
Start by asking: "I'm building a three-tier monitoring dashboard for a coding assistant agent with 20,000 sessions/day. What should I track in each tier?"
🤖 Monitoring Coach Lab 3
Building AI Agents IV — OpenClaw · Module 6 · Lesson 4

Lesson 4: Alerting & Incident Response

Advanced concepts, real-world applications, and practical implications
Core Concepts

This lesson explores alerting and incident response, examining the key principles, real-world applications, and implications for practitioners working in this domain.

Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.

Practical Applications

The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.

Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.

Looking Forward

As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.

Lesson 4 Quiz

Lesson 4: Alerting & Incident Response
1. What is the primary focus of Lesson 4: Alerting & Incident Response?
✓ Correct. This lesson bridges theory and practice, focusing on real-world implementation.
✗ Review the lesson: the focus is on connecting frameworks to practical reality.
2. Why does real-world deployment introduce challenges that pure theory doesn't capture?
✓ Correct. Real deployment requires judgment, not just framework application.
✗ Practice doesn't invalidate theory; it reveals complexities that require nuanced application of theoretical principles.
3. What separates effective practitioners from those who merely follow checklists?
✓ Correct. Critical thinking and adaptability matter more than memorized procedures.
✗ The key differentiator is critical thinking ability, not experience or resources alone.
🎯 Advanced · Lesson 4 Lab

Lab: Apply What You've Learned

Synthesize concepts from Lesson 4: Alerting & Incident Response through guided AI conversation

Your Task

Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to alerting and incident response.

Try: "How would the concepts from this lesson apply to a real-world scenario in this field?"
🤖 AESOP Lab Assistant Lesson 4 Lab

Module 6 Test

Observability and Logging · 15 Questions · 70% to Pass
1. What is the core objective of Observability and Logging?
2. How should practitioners approach applying concepts from this module?
3. Which best describes the relationship between theory and practice in Building AI Agents IV — OpenClaw?
4. What distinguishes expert practitioners from novices in this field?
5. How does Observability and Logging build on previous modules?
6. What role do constraints play in practical implementation?
7. When applying frameworks from this module, what is most important?
8. How should practitioners handle conflicting perspectives in this field?
9. What makes the concepts in Observability and Logging relevant beyond their immediate context?
10. How should practitioners continue developing expertise after completing this module?
11. What is the relationship between understanding Building AI Agents IV — OpenClaw concepts and making decisions?
12. How do the lessons from this module apply to novel situations?
13. What is the value of understanding multiple perspectives on Building AI Agents IV — OpenClaw?
14. How should practitioners evaluate new information or developments in this field?
15. What is the ultimate goal of learning Observability and Logging?