Module 3 · Lesson 1

What Is a Trace? Spans, Trees, and Request Lineage

Every LLM call leaves a trail. Learning to read that trail turns invisible failures into actionable data.

How do you reconstruct exactly what happened inside a multi-step LLM pipeline when something goes wrong?

In March 2023, OpenAI experienced a series of ChatGPT service disruptions that affected millions of users. Engineers tracing the incidents discovered that cascading failures in upstream retrieval calls were masking the root cause — the visible symptom was slow completions, but the actual cause was a Redis cache layer timing out 400 ms before the LLM call even started. Without structured traces showing the full request lineage, the team initially chased the wrong component for over an hour. The incident post-mortem explicitly cited the need for end-to-end distributed tracing across all pipeline stages.

The Anatomy of a Trace

A trace is the complete record of a single request as it travels through your system. Think of it as a receipt that captures every operation performed on behalf of that one user interaction — from the moment the HTTP request arrived to the moment the final token was returned.

Every trace is composed of spans. A span represents a single named unit of work: one database query, one prompt construction step, one LLM API call, one post-processing step. Spans have a start time, an end time, and a set of attributes (key-value metadata). The critical structural feature is that spans can be nested — a parent span can have many child spans, forming a tree.

That tree structure is called the span tree or trace tree. The root span represents the entire user-visible operation. Children represent sub-operations. Grandchildren represent sub-sub-operations. The depth of this tree directly reflects the complexity of your pipeline.

TRACE id=a3f9c2 service=chat-api │ ├─ span name="handle_chat_request" duration=1842ms [ROOT] │ │ │ ├─ span name="retrieve_user_context" duration=412ms │ │ └─ span name="redis_get" duration=408ms ← SLOW │ │ │ ├─ span name="build_prompt" duration=23ms │ │ │ └─ span name="llm_completion" duration=1407ms │ ├─ model="gpt-4" tokens_in=812 tokens_out=241 │ └─ latency_to_first_token=389ms

The diagram above shows why tracing is so powerful: at a glance you can see that the Redis span consumed 408 ms before the LLM was even called. Without this tree structure, the only observable signal is the 1,842 ms total latency — which could implicate any component.

Trace IDs and Context Propagation

Every trace needs a globally unique trace ID. This ID is generated once at the entry point and then propagated through every downstream call. If your LLM pipeline calls a vector database, which calls an embedding service, all three operations must carry the same trace ID so they can later be stitched together into one coherent tree.

The mechanism for propagation is context carriers — typically HTTP headers. The W3C TraceContext specification standardized the traceparent header in 2019, giving distributed systems a vendor-neutral way to pass trace IDs across service boundaries. OpenTelemetry (OTel), which became the industry standard for observability instrumentation, implements TraceContext natively.

Each span also carries its own span ID and a parent span ID. The parent-child relationship is reconstructed by the observability backend (Jaeger, Honeycomb, Grafana Tempo, Langfuse, etc.) when it receives the spans after the fact.

Real-World Pattern — W3C traceparent header

Format: version-traceId-parentId-flags. Example: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. The 32-hex trace ID and 16-hex parent span ID are the fields your instrumentation libraries read and write automatically when you use OTel SDKs.

Spans vs. Events vs. Logs

It is worth distinguishing three related concepts that are often conflated:

Span A duration with a start and end time, representing a unit of work. Has attributes, events, and a status. Structured and queryable.

Span Event A timestamped annotation attached to a span. Records a discrete moment within a span — e.g., "token stream started", "function call returned". Zero duration.

Log A timestamped string message. Traditionally unstructured and not linked to a trace unless explicitly correlated via a trace ID in the log record.

Modern observability practice connects all three. A span can carry a log record; a log record can carry a trace ID. The OTel specification defines how all three signals — traces, metrics, and logs — interoperate. For LLM pipelines specifically, span events are extremely useful for marking moments like "first token received" or "tool call dispatched" within a longer completion span.

LLM-Specific Span Attributes

Generic distributed tracing was designed for microservices, not language models. LLM calls have unique properties that require additional span attributes. OpenTelemetry's GenAI semantic conventions, proposed in 2024 and progressively adopted, define standard attribute names for LLM spans:

Attribute	Type	Description
gen_ai.system	string	Provider name: "openai", "anthropic", "google_vertexai"
gen_ai.request.model	string	Requested model ID, e.g. "gpt-4o-mini"
gen_ai.response.model	string	Actual model used (may differ due to routing)
gen_ai.usage.input_tokens	int	Prompt tokens consumed
gen_ai.usage.output_tokens	int	Completion tokens generated
gen_ai.request.temperature	float	Sampling temperature used
gen_ai.response.finish_reasons	string[]	"stop", "length", "tool_calls", "content_filter"

Capturing these attributes consistently across every LLM call in your pipeline is the foundation of everything else in this module. You cannot alert on token overruns, debug finish_reason anomalies, or build cost dashboards without first ensuring they are recorded in your spans.

Key Takeaway

A trace is a tree of spans. Each span is a timed unit of work with structured attributes. The trace ID threads through every layer of your pipeline, making it possible to reconstruct the full causal chain of any request — including exactly which LLM call consumed how many tokens and returned which finish reason.

Lesson 1 Quiz

Traces, spans, and request lineage · 4 questions

A trace is best described as:

Correct. A trace is the full end-to-end record of one request, built from a tree of spans that each represent a unit of work.

Not quite. Individual operations are spans; traces are the complete tree of spans for one request. Logs and metrics are separate observability signals.

Which W3C specification standardized how trace context is propagated across HTTP service boundaries?

Correct. The W3C TraceContext spec, finalized in 2019, defines the traceparent header format that carries trace IDs across service boundaries in a vendor-neutral way.

The W3C TraceContext specification, using the traceparent HTTP header, is the standard. CORS, OpenAPI, and CSP serve entirely different purposes.

What does the OTel semantic convention attribute gen_ai.response.finish_reasons capture?

Correct. finish_reasons tells you why the model stopped — crucial for diagnosing truncated outputs (length), content policy blocks (content_filter), or tool use (tool_calls).

gen_ai.response.finish_reasons records why the model stopped generating tokens. HTTP codes, token counts, and temperature are separate attributes.

In the OpenAI March 2023 outage example, what was the actual root cause masked by the visible symptom of slow completions?

Correct. The Redis cache was timing out 400 ms before the LLM call, but without distributed tracing the team initially chased the LLM layer. End-to-end traces revealed the true upstream culprit.

The root cause was a Redis cache timeout upstream of the LLM call. The lesson is that without trace trees showing the full pipeline, engineers chase the visible symptom rather than the real cause.

Lab 1 — Reading Trace Trees

Practice interpreting span trees and identifying slow or failed spans

Your Task

You are an observability engineer reviewing traces from a production RAG (retrieval-augmented generation) chatbot. The AI assistant will present you with trace data and ask you to identify root causes, slow spans, and missing attributes.

Discuss at least 3 different trace scenarios with the assistant to complete this lab.

Ask: "Show me a trace with a problem and help me diagnose it" — or describe a real tracing challenge you want to explore.

Trace Analysis Lab

Welcome to the trace analysis lab. I'm going to act as your observability console — presenting you with realistic LLM pipeline traces and helping you develop the skill of reading span trees. Ready to look at your first trace? Just say "show me a trace" or ask about any tracing concept from Lesson 1.

Module 3 · Lesson 2

Instrumenting LLM Calls with OpenTelemetry

Automatic and manual instrumentation strategies — getting spans into your observability backend without rewriting your pipeline.

What is the minimum instrumentation code needed to get meaningful LLM traces flowing into a backend like Jaeger or Honeycomb?

When LangChain released its callback system in early 2023, tracing was an afterthought — developers were forced to implement custom callbacks and manually log every chain step. By mid-2023, teams at companies including Replit and Notion reported that debugging multi-step agents consumed a disproportionate share of engineering time because there was no standardized way to correlate a user complaint with a specific chain execution. LangChain subsequently integrated LangSmith in late 2023 and added OpenTelemetry export support in 2024, reducing mean time to diagnose agent failures by a documented factor of roughly 4× in internal benchmarks published at their developer conference.

OpenTelemetry SDK Basics

OpenTelemetry (OTel) is a CNCF project that provides a unified SDK for capturing traces, metrics, and logs. For Python — the most common language for LLM application development — the relevant packages are opentelemetry-sdk, opentelemetry-api, and an exporter such as opentelemetry-exporter-otlp.

The bootstrap sequence is always the same: create a TracerProvider, attach a Span Processor (which controls how spans are batched and exported), configure an Exporter (which specifies where spans go), and then call tracer.start_as_current_span() to instrument your code.

# Minimal OTel bootstrap — Python from opentelemetry import trace from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import BatchSpanProcessor from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter provider = TracerProvider() provider.add_span_processor( BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317")) ) trace.set_tracer_provider(provider) tracer = trace.get_tracer("my-llm-app") # Wrapping an OpenAI call with tracer.start_as_current_span("llm_completion") as span: response = openai_client.chat.completions.create( model="gpt-4o", messages=messages ) span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens) span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens) span.set_attribute("gen_ai.response.finish_reasons", [response.choices[0].finish_reason])

Auto-Instrumentation Libraries

Manually wrapping every LLM call with context managers is tedious and error-prone. The OTel ecosystem addresses this through auto-instrumentation — libraries that monkey-patch the underlying SDK client at import time so every call is automatically wrapped.

As of 2024, the most production-tested options are:

Library	Covers	OTel GenAI Conventions
opentelemetry-instrumentation-openai	openai Python SDK (chat, embeddings, images)	Yes
opentelemetry-instrumentation-anthropic	anthropic Python SDK (messages, streams)	Yes
openinference (Arize)	LangChain, LlamaIndex, OpenAI, Anthropic	Partial
traceloop openllmetry	OpenAI, Anthropic, Cohere, Bedrock, LangChain	Partial
Langfuse SDK (decorator)	Any LLM call via Python decorator or callback	Own schema

The choice between these libraries comes down to: (1) whether you need OTel-standard span format (for routing to generic backends), or (2) whether you want a purpose-built LLM observability backend (Langfuse, Arize Phoenix, Helicone) that has its own richer schema. Many teams run both: OTel for infrastructure-level traces, Langfuse for LLM-specific analytics.

Sampling Strategies

At scale, tracing every single request is expensive. A production system processing 10,000 LLM calls per minute generates significant span data. OTel provides configurable samplers to control which traces are retained:

AlwaysOn Record every trace. Appropriate for development and staging. Not cost-effective at high production volume.

TraceIdRatio Record a fixed percentage of traces by hashing the trace ID. E.g., 0.1 = 10%. Simple, unbiased, but may miss rare errors.

ParentBased Respect the sampling decision made by the upstream caller. Essential for consistent trace trees across microservices.

Tail-Based Sampling Buffer all spans, then decide whether to keep the trace after it completes — allowing you to always keep error traces and high-latency traces regardless of volume. Requires a collector with tail sampling support (e.g., OTel Collector with tail_sampling processor).

Practical Recommendation

For LLM pipelines, tail-based sampling with error and latency rules is the gold standard. Always keep: any trace where a span has status ERROR, any trace where total duration exceeds your P99 threshold, and a configurable percentage of successful traces for baseline statistics. This ensures you never lose a failure trace while controlling storage costs.

The OTel Collector

Rather than exporting spans directly from your application to a backend, the recommended architecture interposes an OTel Collector — a standalone agent or sidecar that receives spans over OTLP, applies processing (batching, filtering, attribute redaction, tail sampling), and fans out to one or more backends.

This architecture decouples your application from your observability backend. You can switch from Jaeger to Grafana Tempo to Honeycomb without changing a line of application code — only the Collector's export configuration changes. For LLM pipelines that handle PII in prompts, the Collector is also the right place to apply attribute redaction — stripping or hashing sensitive fields before they leave your infrastructure boundary.

Key Takeaway

Instrument LLM calls with OTel using either manual span wrapping or auto-instrumentation libraries. Route spans through an OTel Collector for processing and fan-out. Use tail-based sampling to always retain error and high-latency traces. Standard GenAI semantic conventions ensure your spans are queryable in any compatible backend.

Lesson 2 Quiz

OTel instrumentation, sampling, and the Collector · 4 questions

What is the primary advantage of tail-based sampling over head-based (TraceIdRatio) sampling for LLM pipelines?

Correct. Tail-based sampling buffers spans and makes the keep/discard decision only after the trace completes, so it can guarantee retention of any trace that contains an error or exceeds a latency threshold.

Tail-based sampling's key advantage is post-completion inspection — it can always keep error and high-latency traces. Head-based sampling must decide before seeing the outcome, so rare errors in sampled-out traces are permanently lost.

Why is routing spans through an OTel Collector recommended rather than exporting directly from the application to the backend?

Correct. The Collector architecture decouples instrumentation from backend choice, applies span processing like attribute redaction (important for PII in prompts), and can fan out to multiple observability platforms simultaneously.

The Collector's value is decoupling, processing (including PII redaction for prompt data), and multi-backend fan-out. Direct export is supported but inflexible — you can't switch backends or add processing without code changes.

Which auto-instrumentation library was specifically designed to cover LangChain, LlamaIndex, and multiple LLM providers with OpenInference conventions?

Correct. Arize's OpenInference library instruments LangChain, LlamaIndex, OpenAI, and Anthropic using its own OpenInference span conventions, which extend OTel for LLM-specific attributes.

Arize's openinference library covers LangChain, LlamaIndex, and multiple providers. The openai/anthropic-specific OTel libraries cover only their respective SDKs. LangSmith is a separate product by LangChain.

In the OpenTelemetry SDK, what component controls how spans are batched and delivered to an exporter?

Correct. The Span Processor sits between the tracer and the exporter. BatchSpanProcessor queues spans and sends them in configurable batches, reducing overhead compared to SimpleSpanProcessor which exports synchronously on every span end.

The Span Processor (specifically BatchSpanProcessor for production) handles batching and delivery. The TracerProvider is the factory; the Propagator handles context injection/extraction; the Sampler decides what to record.

Lab 2 — Instrumentation Design

Design an instrumentation strategy for a realistic LLM pipeline

Your Task

You are designing observability for a production RAG pipeline that processes 5,000 requests per hour, handles user PII in prompts, and uses both OpenAI GPT-4o and Anthropic Claude as fallbacks. Work with the assistant to design your complete instrumentation strategy.

Cover at least 3 topics: library selection, sampling configuration, and PII handling. Complete 3+ exchanges to finish the lab.

Start by describing your pipeline architecture, or ask: "Help me choose between auto-instrumentation options for my setup."

Instrumentation Design Lab

Let's design your LLM pipeline observability from scratch. I'll help you think through library selection, sampling strategy, Collector configuration, and PII handling for your specific setup. Tell me about your pipeline — what frameworks are you using, what volume are you handling, and what are your biggest observability concerns right now?

Module 3 · Lesson 3

Capturing Prompt, Response, and Cost in Spans

Deciding what to record — and what not to — balancing debugging value against privacy, storage cost, and compliance risk.

Should you store the full prompt and response text in your spans, and if so, under what conditions and with what safeguards?

In 2023, GitHub faced scrutiny after researchers and enterprise customers discovered that Copilot's telemetry pipeline was, under certain configurations, retaining code snippets submitted as context to the model. Microsoft's subsequent privacy audit found that telemetry data paths were capturing more prompt content than intended for debugging purposes. The company updated its enterprise data handling policy in late 2023 to give organizations explicit controls over prompt retention — a direct consequence of observability instrumentation that was too broad. The episode highlighted a tension that every LLM platform team faces: you want prompt data to debug quality issues, but capturing it creates compliance and trust liabilities.

What to Capture: A Decision Framework

When instrumenting an LLM call, you have up to five categories of data you can record as span attributes or span events:

Data Category	Debugging Value	Privacy Risk	Storage Cost	Recommendation
Token counts (in/out)	High	None	Trivial	Always capture
Model ID, finish_reason, latency	High	None	Trivial	Always capture
System prompt (static template)	High	Low	Small	Capture by reference / hash
User-injected prompt content	High	High	Medium	Redact or encrypt; policy-gated
Full completion text	Medium	Medium	High	Sample or truncate; policy-gated

The practical implication: structural metadata (tokens, latency, model, finish reason) should always be captured because it has high debugging value and zero privacy risk. Prompt and response content requires a policy decision that depends on your regulatory environment, user consent model, and data residency requirements.

Prompt Hashing and Deduplication

A common pattern is to store a SHA-256 hash of the full prompt as a span attribute instead of the prompt itself. This gives you three capabilities: detecting when the exact same prompt is being retried (debugging retry storms), correlating incidents across users who hit the same template (quality issues are template-level, not user-level), and linking span data to separately-stored prompt records without embedding PII in your trace backend.

Some teams go further and store the prompt in an encrypted side-channel store (e.g., S3 with customer-managed keys), then record only the storage key as a span attribute. This approach satisfies "right to erasure" GDPR requests — deleting the encryption key renders the stored prompt unreadable without any changes to the trace backend.

Cost Attribution in Spans

Token counts in spans are the raw material for cost tracking. The standard practice is to compute cost at query time — either in your tracing library or in your observability backend — using a pricing lookup table keyed on (provider, model, token_type).

The formula is simple: cost = (input_tokens × input_price_per_token) + (output_tokens × output_price_per_token). What makes this powerful is that cost becomes a span attribute, meaning you can break down cost by any other span attribute: user ID, feature flag, prompt template ID, or time-of-day.

# Cost attribution example — computed in span PRICING = { ("openai", "gpt-4o"): {"input": 2.50/1e6, "output": 10.00/1e6}, ("openai", "gpt-4o-mini"): {"input": 0.15/1e6, "output": 0.60/1e6}, } with tracer.start_as_current_span("llm_completion") as span: resp = client.chat.completions.create(model="gpt-4o", messages=msgs) it = resp.usage.prompt_tokens ot = resp.usage.completion_tokens pricing = PRICING[("openai", "gpt-4o")] cost_usd = it * pricing["input"] + ot * pricing["output"] span.set_attribute("gen_ai.usage.cost_usd", round(cost_usd, 6)) span.set_attribute("gen_ai.usage.input_tokens", it) span.set_attribute("gen_ai.usage.output_tokens", ot) span.set_attribute("app.feature", "document_summarizer") span.set_attribute("app.user_tier", "free")

Streaming Responses and Span Timing

Streaming completions introduce a subtle instrumentation challenge. When the model streams tokens back in chunks, the span cannot be closed until the stream is exhausted — but the first token latency is often the metric users perceive most directly. The correct pattern is to use a span event (zero-duration timestamp within the span) to mark the moment the first token chunk arrives, then close the span only when the stream ends.

This gives you two distinct latency metrics within one span: time to first token (TTFT) captured as a span event timestamp minus span start time, and total generation time captured as the span duration. TTFT is what matters for perceived responsiveness; total generation time matters for cost modeling and capacity planning.

Streaming Span Pattern

Start span → make streaming call → on first chunk: record span event "first_token_received" → consume full stream, accumulate token count → close span. Compute TTFT = first_token_event.timestamp − span.start_time. Output token count may need to be counted from chunks if the provider doesn't send usage in the stream (Anthropic sends usage in the message_stop event; OpenAI sends it in the final chunk if stream_options.include_usage is True).

Key Takeaway

Always capture structural LLM metadata (tokens, latency, model, finish reason). Treat prompt and response content as sensitive data requiring a policy decision. Use span attributes for cost attribution keyed on (provider, model). Use span events to capture time-to-first-token within streaming spans.

Lesson 3 Quiz

Prompt capture, cost attribution, and streaming · 4 questions

What is the recommended approach for storing user-injected prompt content in spans given GDPR "right to erasure" requirements?

Correct. Storing prompts in encrypted side-channel storage with customer-managed keys satisfies right-to-erasure by allowing key deletion to render stored content unreadable, without requiring any changes to the trace backend.

The encrypted side-channel pattern is the correct approach: store prompt content separately with customer-managed encryption keys, record only the storage reference in the span. Key deletion satisfies erasure requests without touching trace data.

You want to detect when the exact same prompt template is causing errors across multiple users. What span attribute strategy enables this?

Correct. A SHA-256 hash of the prompt lets you group spans by identical prompt content without storing the content itself. Errors clustering around one hash point to a template-level issue rather than a user-specific one.

SHA-256 hashing the prompt provides correlation (same hash = same content) without PII exposure. Storing full prompts creates privacy risk; user_id grouping doesn't identify template-level patterns; finish_reason alone can't link errors to a specific template.

When instrumenting a streaming LLM response, what is the correct way to capture time-to-first-token (TTFT)?

Correct. A span event is a zero-duration timestamp within a span. Recording one at first-chunk-arrival lets you compute TTFT = event.timestamp − span.start_time without splitting the span or losing the association with token counts and finish_reason.

The span event pattern is correct. Closing and reopening loses data continuity; a child span adds complexity without benefit; a separate metric loses the association with this specific span's token counts and model attributes.

The formula for computing cost from an LLM span is: cost = (input_tokens × input_price) + (output_tokens × output_price). What additional span attributes make this cost data most useful for analysis?

Correct. Cost becomes actionable when you can break it down by business dimension — which feature is most expensive, whether free-tier users consume disproportionate tokens, which templates generate the longest outputs. Feature name, user tier, and template ID provide those dimensions.

Cost data becomes actionable when cross-referenced with business dimensions. Feature name, user tier, and template ID let you answer questions like "which feature costs most?" and "are free users driving disproportionate costs?" HTTP codes and Collector URLs don't add analytical value.

Lab 3 — Cost Attribution & Privacy Design

Design the span attributes and data handling policy for a real product scenario

Your Task

You're building a B2B SaaS product where enterprise customers submit financial documents for AI-powered analysis. The product must: track LLM costs per customer account, support EU GDPR data residency, and provide quality debugging when the model gives wrong answers.

Work with the assistant to design your complete span attribute schema and data handling policy. Discuss cost attribution, PII handling, and streaming instrumentation. Complete 3+ exchanges.

Start by describing what span attributes you'd capture, or ask: "What are the highest-priority span attributes for my B2B financial AI product?"

Cost & Privacy Design Lab

Let's design your span schema for a GDPR-compliant, cost-tracked B2B financial AI product. This is a genuinely tricky design problem — you need rich debugging data for quality issues, per-customer cost attribution, and EU data residency compliance, which pull in different directions. Tell me your current thinking on what you'd capture in spans, and I'll help you work through the tradeoffs.

Module 3 · Lesson 4

LLM-Specific Observability Tools: Langfuse, Helicone, and Arize Phoenix

Purpose-built platforms for LLM tracing — what they offer beyond generic APM and when each fits your stack.

When does a dedicated LLM observability platform justify its cost over a general-purpose tool like Datadog or Grafana?

Notion's AI team, which processes millions of LLM requests daily across its document generation, meeting summarization, and Q&A features, published a 2024 engineering blog post describing their observability evolution. They started with a Datadog-based approach, recording token counts as custom metrics. Over time, they found Datadog's query model was not well suited to the hierarchical, multi-step nature of their AI pipelines — they needed to correlate a user-reported quality issue with a specific prompt template version, model parameter set, and retrieval result, across a tree of spans. They migrated to a hybrid architecture: Datadog for infrastructure metrics, and a dedicated LLM observability layer (they evaluated both Langfuse and Arize) for prompt management, quality evals, and LLM-specific dashboards. The post noted that the specialized tool reduced the time to identify a prompt regression from "hours of manual log searching" to "minutes with the evaluation dashboard."

What Generic APM Cannot Do

Tools like Datadog, New Relic, and Grafana were designed for microservice infrastructure — latency histograms, error rates, and resource utilization. They are excellent at these tasks but lack LLM-specific primitives. Specifically, generic APM tools do not natively:

1. Render prompt/response diffs. When a prompt template changes, you want to see side-by-side what changed and how outputs changed. APM has no concept of a "prompt template version."

2. Run LLM-as-judge evaluations. You often want to programmatically assess the quality of model outputs (relevance, groundedness, toxicity) and surface quality scores alongside latency and cost. This requires another LLM call — a meta-evaluation — that generic APM doesn't support.

3. Manage prompt versions. Treating prompts as versioned artifacts with A/B testing, rollback, and performance history requires a prompt registry, which is outside the scope of APM.

4. Visualize agent execution graphs. Multi-step agents produce DAG-shaped execution trees that are better visualized as interactive graphs than as flat trace waterfalls.

Langfuse

Langfuse is an open-source LLM observability platform (MIT licensed, self-hostable, also offered as cloud SaaS) that has emerged as one of the most widely adopted purpose-built options. Its key capabilities:

Traces & Generations Langfuse uses "observations" as its span equivalent, with a specialized "generation" type for LLM calls that has first-class fields for prompt, response, token counts, and cost.

Prompt Management Versioned prompt templates stored in Langfuse, fetched at runtime via SDK. Traces link to specific prompt versions, enabling A/B analysis of template changes.

Evaluations (Evals) Attach quality scores to traces — either from human annotation, rule-based classifiers, or LLM-as-judge calls. Scores are queryable alongside cost and latency.

OTel Export As of 2024, Langfuse can export traces in OTel format to external backends, supporting the hybrid architecture where Langfuse handles LLM-specific data and OTel handles infrastructure.

Helicone

Helicone operates as an HTTP proxy rather than an SDK-based instrumentation library. You point your OpenAI (or Anthropic, etc.) API calls at https://oai.hconeai.com/v1 instead of https://api.openai.com/v1, and Helicone intercepts every request transparently. This approach has a notable advantage: zero code changes beyond a base URL swap. It also has a notable limitation: you cannot add business-context span attributes (user tier, feature name) without passing them as custom headers.

Helicone is well-suited for teams that want rapid time-to-value, B2C SaaS products with high-volume single-provider usage, and cost dashboards without engineering effort. It is less suited for complex multi-provider pipelines where agent orchestration spans many services.

Arize Phoenix

Arize Phoenix is an open-source LLM observability tool focused on evaluation and debugging rather than production monitoring. It runs locally or as a cloud service and is primarily used for offline analysis — loading a dataset of traces, running evals, and identifying systematic failure modes.

Phoenix uses the OpenInference span convention, which extends OTel with richer LLM-specific attribute namespaces. It has tight integration with LangChain, LlamaIndex, and DSPy. Its strongest differentiator is the ability to run embedding-based clustering on traces — grouping inputs by semantic similarity to surface clusters where the model consistently underperforms.

Platform	Instrumentation	Best For	Self-Host
Langfuse	SDK / OTel	Production monitoring + prompt management + evals	Yes (MIT)
Helicone	HTTP proxy	Rapid deployment, cost visibility, zero-code setup	Limited
Arize Phoenix	OpenInference / OTel	Offline eval, embedding clustering, agent debugging	Yes (OSS)
Datadog LLM Obs	Auto-instrument	Teams already on Datadog, unified infra + LLM	No

Choosing a Stack

The decision framework is straightforward once you identify your primary use case:

Startup, moving fast: Helicone proxy for immediate cost visibility, Langfuse cloud for quality evals. Migrate to self-hosted Langfuse when data residency becomes a requirement.

Enterprise with existing APM: Add OTel GenAI instrumentation to emit standard spans to your existing Datadog/Grafana stack. Add Langfuse (self-hosted) specifically for prompt management and LLM eval workflows. The OTel Collector fans out to both.

Research / offline analysis: Arize Phoenix locally for embedding analysis and systematic eval. Export production traces from Langfuse to Phoenix for deeper investigation of failure clusters.

Key Takeaway

Generic APM tools handle infrastructure metrics well but lack LLM-specific primitives: prompt versioning, evaluation scoring, and agent graph visualization. Purpose-built tools like Langfuse (production monitoring + prompts + evals), Helicone (proxy-based, zero-code), and Arize Phoenix (offline eval + embedding analysis) fill these gaps. Most mature teams run a hybrid: generic APM for infrastructure, a specialized LLM platform for AI-specific workflows.

Lesson 4 Quiz

LLM-specific observability tools · 4 questions

Helicone operates differently from SDK-based tools like Langfuse. What is its instrumentation mechanism?

Correct. Helicone is an HTTP proxy. You change only the base URL of your API client (e.g., oai.hconeai.com instead of api.openai.com) and it intercepts all traffic without any SDK changes.

Helicone is an HTTP proxy — the key characteristic that enables zero-code setup. The tradeoff is that adding business-context attributes requires custom HTTP headers rather than SDK calls.

Arize Phoenix's strongest differentiator for diagnosing systematic LLM failures is:

Correct. Phoenix's embedding clustering groups semantically similar inputs, revealing failure patterns that would be invisible when looking at individual traces — for example, discovering the model systematically fails on a specific topic category.

Phoenix's key differentiator is embedding-based clustering for offline failure analysis. Real-time alerting and HTTP proxy are Helicone characteristics; prompt versioning is a Langfuse feature.

According to Notion AI's 2024 observability blog post, what was the main limitation that drove them away from a pure Datadog approach?

Correct. Notion found that correlating a quality issue across a hierarchical pipeline to a specific prompt template version was not well-supported by Datadog's flat metric model — the specialized LLM tool reduced diagnosis time from hours to minutes.

Notion's issue was diagnostic friction: correlating user quality complaints with specific prompt template versions and multi-step pipeline states was slow in Datadog. A specialized LLM observability layer solved the hierarchical query problem.

What does Langfuse's prompt management feature enable that raw tracing alone cannot?

Correct. Prompt management in Langfuse means prompts are versioned artifacts. Each trace links to a specific version, so you can query "how did output quality change between prompt v3 and v4?" — something impossible with raw traces that only capture prompt text at call time.

Langfuse prompt management versions prompts and links traces to specific versions, enabling before/after performance comparison when you change a template. This is qualitatively different from just capturing prompt text in a span.

Lab 4 — Tool Selection & Architecture

Choose and justify a complete LLM observability stack for a real scenario

Your Task

You're a senior engineer at a healthcare AI company. Your LLM pipeline handles patient intake summaries, runs on AWS, processes ~50K requests/day, uses both OpenAI and Anthropic, and must meet HIPAA data handling requirements. Your team already uses Datadog for infrastructure.

Work with the assistant to select and justify a complete observability tool stack. Address: which tools to adopt, how they integrate, what data stays in each, and how you handle HIPAA constraints. Complete 3+ exchanges.

Start with your tool selection and rationale, or ask: "Walk me through how to evaluate Langfuse vs Helicone for HIPAA-regulated healthcare AI."

Tool Selection Lab

This is an excellent real-world architecture problem. Healthcare AI with HIPAA constraints, multi-provider setup, and an existing Datadog investment — each of these factors meaningfully shapes which observability tools make sense. Let's work through this systematically. What's your initial instinct on tool selection, and what's your biggest concern: HIPAA compliance, multi-provider coverage, prompt management, or evaluation quality?

Module 3 Test — Tracing LLM Calls

15 questions · Score 80% or higher to pass

1. A span in distributed tracing is best defined as:

Correct. A span is a named, timed unit of work — the atomic building block of a trace.

A span is the atomic timed unit. A trace is the full tree of spans. Logs and metrics are separate signals.

2. What is the purpose of a trace ID in distributed tracing?

Correct. The trace ID is propagated through all services and allows the observability backend to assemble all spans from a single request into one coherent trace tree.

The trace ID is a global correlation key that ties together all spans from one request, across all services.

3. The W3C TraceContext specification defines which HTTP header for trace propagation?

Correct. The traceparent header carries the version, trace ID, parent span ID, and flags in a standardized format.

The W3C TraceContext spec defines the traceparent header as the standard carrier for distributed trace context.

4. Which OTel semantic convention attribute captures the reason an LLM stopped generating tokens?

Correct. gen_ai.response.finish_reasons records values like "stop", "length", "tool_calls", or "content_filter" — essential for diagnosing truncated or blocked outputs.

gen_ai.response.finish_reasons is the correct attribute. It captures stop, length, tool_calls, and content_filter values.

5. In the OpenTelemetry SDK, the BatchSpanProcessor is preferred over SimpleSpanProcessor in production because:

Correct. BatchSpanProcessor is async and batches spans, dramatically reducing the latency impact on the hot path compared to SimpleSpanProcessor's synchronous per-span export.

BatchSpanProcessor's key advantage is async batched export. Simple is synchronous and adds latency to every span end.

6. Tail-based sampling decides whether to keep or discard a trace:

Correct. Tail-based sampling buffers spans until the trace is complete, then applies rules — allowing it to always keep error traces regardless of their position in the sampling rate.

Tail sampling happens after trace completion, enabling condition-based retention. Head-based sampling (TraceIdRatio) decides at trace start; ParentBased respects the upstream decision.

7. Which of the following is the correct location in the OTel architecture to apply PII redaction from LLM prompt attributes?

Correct. The OTel Collector is the right place for attribute redaction — it processes spans before they leave your infrastructure boundary, ensuring PII never reaches external backends.

The OTel Collector intercepts spans before export and is the correct place for PII redaction processors, keeping sensitive data within your infrastructure boundary.

8. The correct formula for computing the cost of an LLM call from span data is:

Correct. LLM providers charge separately for input and output tokens at different rates. The correct formula sums both components.

Input and output tokens are priced separately (and differently), so both must be included in the cost formula.

9. To capture time-to-first-token (TTFT) within a streaming LLM span, you should:

Correct. Span events are zero-duration timestamps within a span, perfect for marking moments like "first token received" without splitting the span or losing the association with total tokens and finish_reason.

Span events are the correct mechanism — they timestamp a moment within a span. This keeps TTFT associated with the same span that carries token counts and model metadata.

10. Storing a SHA-256 hash of a prompt as a span attribute (instead of the prompt text) enables which capability while protecting PII?

Correct. Hashes can be compared for equality without revealing content — identical hashes mean identical prompts, enabling you to detect when errors cluster around one template version.

SHA-256 is one-way — you can't reconstruct the prompt from the hash. But hashes are equal when content is equal, enabling correlation without PII exposure.

11. Which LLM observability tool uses an HTTP proxy as its primary instrumentation mechanism?

Correct. Helicone sits between your application and the LLM provider as an HTTP proxy, requiring only a base URL change with no SDK modifications.

Helicone is the proxy-based tool. Langfuse and Arize Phoenix use SDK instrumentation; OpenInference is a span convention, not a tool.

12. Arize Phoenix's most distinctive capability for diagnosing systematic LLM failures is:

Correct. Phoenix clusters trace inputs by embedding similarity, revealing failure patterns invisible at the individual trace level — for example, a model that consistently fails on a specific topic domain.

Embedding clustering is Phoenix's unique capability. Real-time alerting and proxy capture describe other tools; prompt versioning is Langfuse.

13. What does the OTel Collector's fan-out capability enable that direct SDK-to-backend export does not?

Correct. The Collector can receive spans once and export them to multiple backends in parallel — infrastructure APM, LLM-specific platforms, and long-term storage — all without any application code changes.

Fan-out means one ingestion point, multiple export destinations. This is one of the Collector's key architectural advantages.

14. The ParentBased sampler in OTel is critical for multi-service pipelines because:

Correct. If Service A decides to sample a trace, ParentBased ensures downstream Service B also records its spans for that trace — maintaining trace tree completeness across service boundaries.

ParentBased propagates the upstream sampling decision downstream, keeping trace trees intact. Without it, some services may record spans and others may not for the same trace.

15. According to Notion AI's 2024 case, what was the primary outcome of adding a purpose-built LLM observability layer alongside their existing Datadog setup?

Correct. The hybrid approach — Datadog for infrastructure, specialized LLM tool for AI-specific analysis — dramatically reduced diagnostic time for AI quality issues, which was the specific gap generic APM couldn't fill.

Notion kept Datadog for infrastructure and added a specialized LLM tool. The key outcome was reducing AI quality diagnosis time from hours to minutes.