In March 2023, OpenAI experienced a series of ChatGPT service disruptions that affected millions of users. Engineers tracing the incidents discovered that cascading failures in upstream retrieval calls were masking the root cause: the visible symptom was slow completions, but the actual cause was a Redis cache layer that spent roughly 400 ms timing out before the LLM call even started. Without structured traces showing the full request lineage, the team initially chased the wrong component for over an hour. The incident post-mortem explicitly cited the need for end-to-end distributed tracing across all pipeline stages.
A trace is the complete record of a single request as it travels through your system. Think of it as a receipt that captures every operation performed on behalf of that one user interaction — from the moment the HTTP request arrived to the moment the final token was returned.
Every trace is composed of spans. A span represents a single named unit of work: one database query, one prompt construction step, one LLM API call, one post-processing step. Spans have a start time, an end time, and a set of attributes (key-value metadata). The critical structural feature is that spans can be nested — a parent span can have many child spans, forming a tree.
That tree structure is called the span tree or trace tree. The root span represents the entire user-visible operation. Children represent sub-operations. Grandchildren represent sub-sub-operations. The depth of this tree directly reflects the complexity of your pipeline.
The diagram above shows why tracing is so powerful: at a glance you can see that the Redis span consumed 408 ms before the LLM was even called. Without this tree structure, the only observable signal is the 1,842 ms total latency — which could implicate any component.
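As a rough illustration of the span tree described here, the sketch below models spans as a nested data structure and asks which leaf span consumed the most time before the LLM call started. The `Span` class, the span names, and every duration other than the 1,842 ms total and the 408 ms Redis span are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def walk(span, depth=0):
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


def slowest_before(root, target_name):
    """Return the longest leaf span that finished before `target_name` started."""
    spans = [s for _, s in walk(root)]
    target = next(s for s in spans if s.name == target_name)
    earlier_leaves = [
        s for s in spans if not s.children and s.end_ms <= target.start_ms
    ]
    return max(earlier_leaves, key=lambda s: s.duration_ms)


# Hypothetical request: only the 1,842 ms total and the 408 ms Redis span
# come from the text; all other names and numbers are made up.
root = Span("POST /chat", 0, 1842, children=[
    Span("retrieval", 0, 450, children=[
        Span("redis.get", 5, 413, {"db.system": "redis"}),
        Span("vector_search", 413, 450),
    ]),
    Span("build_prompt", 450, 460),
    Span("llm.chat", 460, 1830),
    Span("postprocess", 1830, 1842),
])

# Print the tree as an indented outline.
for depth, span in walk(root):
    print(f"{'  ' * depth}{span.name}: {span.duration_ms:.0f} ms")

culprit = slowest_before(root, "llm.chat")
print(f"slowest span before the LLM call: {culprit.name}")
```

The total latency alone cannot answer this question; the nesting and the per-span timestamps can.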
Every trace needs a globally unique trace ID. This ID is generated once at the entry point and then propagated through every downstream call. If your LLM pipeline calls a vector database, which calls an embedding service, all three operations must carry the same trace ID so they can later be stitched together into one coherent tree.
The mechanism for propagation is context carriers, typically HTTP headers. The W3C Trace Context specification, which became a W3C Recommendation in 2020, standardized the traceparent header, giving distributed systems a vendor-neutral way to pass trace IDs across service boundaries. OpenTelemetry (OTel), which became the industry standard for observability instrumentation, implements Trace Context natively.
Each span also carries its own span ID and a parent span ID. The parent-child relationship is reconstructed by the observability backend (Jaeger, Honeycomb, Grafana Tempo, Langfuse, etc.) when it receives the spans after the fact.
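To make the reconstruction step concrete, here is a minimal sketch of how a backend might stitch flat span records back into trees using only the trace ID, span ID, and parent span ID. The record layout (plain dicts with `trace_id`, `span_id`, `parent_span_id`) is an assumption for illustration, not the storage format of any particular backend.

```python
from collections import defaultdict


def build_trees(spans):
    """Group flat span records by trace ID and link children to parents.

    Returns {trace_id: [root_span, ...]}. Each returned span is a copy of
    the input record with an added "children" list.
    """
    by_id = {}
    for record in spans:
        node = dict(record, children=[])
        by_id[(record["trace_id"], record["span_id"])] = node

    roots = defaultdict(list)
    for node in by_id.values():
        parent = by_id.get((node["trace_id"], node.get("parent_span_id")))
        if parent is None:
            # Either a true root (no parent ID) or an orphan whose parent
            # span was never received.
            roots[node["trace_id"]].append(node)
        else:
            parent["children"].append(node)
    return dict(roots)


# Spans usually arrive out of order, so the child is listed first here.
flat = [
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "b", "name": "embedding"},
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "name": "POST /chat"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "name": "vector_search"},
]
trees = build_trees(flat)
```

Treating orphans as roots is one possible design choice; it keeps partial traces visible instead of silently dropping them.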
Format: version-traceId-parentId-flags. Example: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. The 32-hex trace ID and 16-hex parent span ID are the fields your instrumentation libraries read and write automatically when you use OTel SDKs.
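OTel SDKs handle this header for you, but parsing it by hand shows what is actually carried across a service boundary. The sketch below handles only version `00` of the format; the function names `parse_traceparent` and `child_traceparent` are hypothetical helpers, not part of any library.

```python
import re
import secrets

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Split a traceparent header value into its four fields."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Build the header for an outgoing call: same trace ID, new span ID."""
    ctx = parse_traceparent(header)
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(incoming)
```

The trace ID stays constant across every hop, while the span ID field is replaced at each hop so the receiver knows which span is its parent.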
It is worth distinguishing three related concepts that are often conflated:
Modern observability practice connects all three. A span can carry a log record; a log record can carry a trace ID. The OTel specification defines how all three signals — traces, metrics, and logs — interoperate. For LLM pipelines specifically, span events are extremely useful for marking moments like "first token received" or "tool call dispatched" within a longer completion span.
Generic distributed tracing was designed for microservices, not language models. LLM calls have unique properties that require additional span attributes. OpenTelemetry's GenAI semantic conventions, proposed in 2024 and progressively adopted, define standard attribute names for LLM spans:
| Attribute | Type | Description |
|---|---|---|
| gen_ai.system | string | Provider name: "openai", "anthropic", "vertex_ai" |
| gen_ai.request.model | string | Requested model ID, e.g. "gpt-4o-mini" |
| gen_ai.response.model | string | Actual model used (may differ due to routing) |
| gen_ai.usage.input_tokens | int | Prompt tokens consumed |
| gen_ai.usage.output_tokens | int | Completion tokens generated |
| gen_ai.request.temperature | float | Sampling temperature used |
| gen_ai.response.finish_reasons | string[] | "stop", "length", "tool_calls", "content_filter" |
Capturing these attributes consistently across every LLM call in your pipeline is the foundation of everything else in this module. You cannot alert on token overruns, debug finish_reason anomalies, or build cost dashboards without first ensuring they are recorded in your spans.
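One way to enforce that consistency is a small audit that flags LLM spans with missing or mistyped attributes. This is a sketch under assumptions: the set of attributes treated as required is a hypothetical team policy (the conventions themselves mark many of these as optional), and spans are represented as plain attribute dicts.

```python
# Hypothetical policy: attributes every LLM span must carry, with expected types.
REQUIRED_GEN_AI_ATTRS = {
    "gen_ai.system": str,
    "gen_ai.request.model": str,
    "gen_ai.response.model": str,
    "gen_ai.usage.input_tokens": int,
    "gen_ai.usage.output_tokens": int,
    "gen_ai.response.finish_reasons": (list, tuple),
}


def audit_llm_span(attributes):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected_type in REQUIRED_GEN_AI_ATTRS.items():
        if key not in attributes:
            problems.append(f"missing {key}")
        elif not isinstance(attributes[key], expected_type):
            problems.append(f"wrong type for {key}")
    return problems


complete_span = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o-mini",
    "gen_ai.response.model": "gpt-4o-mini",
    "gen_ai.usage.input_tokens": 812,
    "gen_ai.usage.output_tokens": 164,
    "gen_ai.request.temperature": 0.2,
    "gen_ai.response.finish_reasons": ["stop"],
}

# Same span, but the instrumentation forgot the output token count and
# recorded the input token count as a string.
broken_span = dict(complete_span)
del broken_span["gen_ai.usage.output_tokens"]
broken_span["gen_ai.usage.input_tokens"] = "812"

print(audit_llm_span(complete_span))
print(audit_llm_span(broken_span))
```

A check like this could run in CI against spans captured from an integration test, so that gaps are caught before a cost dashboard silently under-reports.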
A trace is a tree of spans. Each span is a timed unit of work with structured attributes. The trace ID threads through every layer of your pipeline, making it possible to reconstruct the full causal chain of any request — including exactly which LLM call consumed how many tokens and returned which finish reason.
You are an observability engineer reviewing traces from a production RAG (retrieval-augmented generation) chatbot. The AI assistant will present you with trace data and ask you to identify root causes, slow spans, and missing attributes.
Discuss at least 3 different trace scenarios with the assistant to complete this lab.
When LangChain released its callback system in early 2023, tracing was an afterthought — developers were forced to implement custom callbacks and manually log every chain step. By mid-2023, teams at companies including Replit and Notion reported that debugging multi-step agents consumed a disproportionate share of engineering time because there was no standardized way to correlate a user complaint with a specific chain execution. LangChain subsequently integrated LangSmith in late 2023 and added OpenTelemetry export support in 2024, reducing mean time to diagnose agent failures by a documented factor of roughly 4× in internal benchmarks published at their developer conference.
OpenTelemetry (OTel) is a CNCF project that provides a unified SDK for capturing traces, metrics, and logs. For Python — the most common language for LLM application development — the relevant packages are opentelemetry-sdk, opentelemetry-api, and an exporter such as opentelemetry-exporter-otlp.
The bootstrap sequence is always the same: create a TracerProvider, configure an exporter (which specifies where spans go), attach it to the provider through a span processor (which controls how spans are batched and handed to the exporter), and then obtain a tracer and call tracer.start_as_current_span() to instrument your code.
Manually wrapping every LLM call with context managers is tedious and error-prone. The OTel ecosystem addresses this through auto-instrumentation: libraries that monkey-patch the underlying SDK client when the instrumentor is activated (typically once at application startup), so every subsequent call is automatically wrapped in a span.
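The monkey-patching idea can be shown without any tracing library. The sketch below replaces a method on a class with a wrapper that records a span-like dict per call. `FakeLLMClient`, `instrument`, and `RECORDED_SPANS` are all hypothetical stand-ins; real instrumentation libraries emit proper OTel spans and handle many more edge cases (async clients, streaming, uninstrumenting).

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for a real span exporter


class FakeLLMClient:
    """Hypothetical client standing in for a provider SDK class."""

    def complete(self, model, prompt):
        return {"model": model, "text": prompt.upper()}


def instrument(cls, method_name):
    """Replace cls.<method_name> with a wrapper that records one span per call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        status = "OK"
        try:
            return original(self, *args, **kwargs)
        except Exception:
            status = "ERROR"
            raise
        finally:
            # Runs on both success and failure, so failed calls are traced too.
            RECORDED_SPANS.append({
                "name": f"{cls.__name__}.{method_name}",
                "duration_ms": (time.perf_counter() - start) * 1000,
                "status": status,
            })

    setattr(cls, method_name, wrapper)
    return original  # keep a handle so the patch can be undone later


# Application code below is unchanged; only the class was patched.
instrument(FakeLLMClient, "complete")
client = FakeLLMClient()
result = client.complete("demo-model", "hello world")
```

Because the patch is applied to the class rather than to call sites, existing and future callers are covered without edits, which is both the appeal and the risk of this approach.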
As of 2024, the most production-tested options are:
| Library | Covers | OTel GenAI Conventions |
|---|---|---|
| opentelemetry-instrumentation-openai | openai Python SDK (chat, embeddings, images) | Yes |
| opentelemetry-instrumentation-anthropic | anthropic Python SDK (messages, streams) | Yes |
| openinference (Arize) | LangChain, LlamaIndex, OpenAI, Anthropic | Partial |
| traceloop openllmetry | OpenAI, Anthropic, Cohere, Bedrock, LangChain | Partial |
| Langfuse SDK (decorator) | Any LLM call via Python decorator or callback | Own schema |
The choice between these libraries comes down to: (1) whether you need OTel-standard span format (for routing to generic backends), or (2) whether you want a purpose-built LLM observability backend (Langfuse, Arize Phoenix, Helicone) that has its own richer schema. Many teams run both: OTel for infrastructure-level traces, Langfuse for LLM-specific analytics.
At scale, tracing every single request is expensive. A production system processing 10,000 LLM calls per minute generates significant span data. OTel provides configurable samplers to control which traces are retained:
For LLM pipelines, tail-based sampling with error and latency rules is the gold standard. Always keep: any trace where a span has status ERROR, any trace where total duration exceeds your P99 threshold, and a configurable percentage of successful traces for baseline statistics. This ensures you never lose a failure trace while controlling storage costs.
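The three retention rules can be expressed as a single decision function that runs once a trace is complete. This is a sketch of the logic only, assuming spans are dicts with `trace_id`, `start_ms`, `end_ms`, and `status`; in practice the decision is usually made by a tail-sampling component in the collection pipeline, not by application code.

```python
import zlib


def keep_trace(spans, latency_threshold_ms, baseline_rate):
    """Decide whether to retain a completed trace.

    Returns (keep, reason), where reason is "error", "latency",
    "baseline", or "dropped".
    """
    # Rule 1: always keep traces that contain a failed span.
    if any(s.get("status") == "ERROR" for s in spans):
        return True, "error"

    # Rule 2: always keep traces slower than the latency threshold.
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if total_ms > latency_threshold_ms:
        return True, "latency"

    # Rule 3: keep a fixed fraction of healthy traces. Hashing the trace ID
    # makes the decision deterministic for a given trace.
    bucket = zlib.crc32(spans[0]["trace_id"].encode("utf-8")) % 10_000
    if bucket < baseline_rate * 10_000:
        return True, "baseline"
    return False, "dropped"


failed = [{"trace_id": "a1", "start_ms": 0, "end_ms": 120, "status": "ERROR"}]
slow = [
    {"trace_id": "b2", "start_ms": 0, "end_ms": 9000, "status": "OK"},
    {"trace_id": "b2", "start_ms": 10, "end_ms": 8900, "status": "OK"},
]
healthy = [{"trace_id": "c3", "start_ms": 0, "end_ms": 300, "status": "OK"}]

print(keep_trace(failed, latency_threshold_ms=5000, baseline_rate=0.05))
print(keep_trace(slow, latency_threshold_ms=5000, baseline_rate=0.05))
```

The latency threshold and baseline rate are parameters rather than constants because the P99 value drifts over time and should be refreshed from recent data.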
Rather than exporting spans directly from your application to a backend, the recommended architecture interposes an OTel Collector — a standalone agent or sidecar that receives spans over OTLP, applies processing (batching, filtering, attribute redaction, tail sampling), and fans out to one or more backends.
This architecture decouples your application from your observability backend. You can switch from Jaeger to Grafana Tempo to Honeycomb without changing a line of application code — only the Collector's export configuration changes. For LLM pipelines that handle PII in prompts, the Collector is also the right place to apply attribute redaction — stripping or hashing sensitive fields before they leave your infrastructure boundary.
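The redaction step itself is simple to reason about in isolation. The sketch below shows the kind of transformation a Collector-side processor would apply, written in Python purely to illustrate the logic: sensitive keys are replaced by a hash, and email addresses are scrubbed from other string values. The key list and the attribute names in it are a hypothetical policy, and the email pattern is deliberately simplistic.

```python
import hashlib
import re

# Hypothetical policy: attribute keys whose values must never leave the
# infrastructure boundary in clear text.
SENSITIVE_KEYS = {"prompt.text", "completion.text", "user.email"}

# Simplistic email pattern, good enough for a sketch.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_attributes(attributes):
    """Return a copy of the span attributes that is safe to export."""
    clean = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            clean[key] = f"sha256:{digest}"
        elif isinstance(value, str):
            clean[key] = EMAIL.sub("[REDACTED_EMAIL]", value)
        else:
            clean[key] = value
    return clean


raw = {
    "gen_ai.usage.input_tokens": 812,
    "prompt.text": "Summarize the account history for Jane Doe",
    "error.message": "delivery to bob@example.com failed",
}
safe = redact_attributes(raw)
```

Structural metadata such as token counts passes through untouched, so dashboards and alerts keep working on the redacted spans.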
Instrument LLM calls with OTel using either manual span wrapping or auto-instrumentation libraries. Route spans through an OTel Collector for processing and fan-out. Use tail-based sampling to always retain error and high-latency traces. Standard GenAI semantic conventions ensure your spans are queryable in any compatible backend.
You are designing observability for a production RAG pipeline that processes 5,000 requests per hour, handles user PII in prompts, and uses both OpenAI GPT-4o and Anthropic Claude as fallbacks. Work with the assistant to design your complete instrumentation strategy.
Cover at least 3 topics: library selection, sampling configuration, and PII handling. Complete 3+ exchanges to finish the lab.
In 2023, GitHub faced scrutiny after researchers and enterprise customers discovered that Copilot's telemetry pipeline was, under certain configurations, retaining code snippets submitted as context to the model. Microsoft's subsequent privacy audit found that telemetry data paths were capturing more prompt content than intended for debugging purposes. The company updated its enterprise data handling policy in late 2023 to give organizations explicit controls over prompt retention — a direct consequence of observability instrumentation that was too broad. The episode highlighted a tension that every LLM platform team faces: you want prompt data to debug quality issues, but capturing it creates compliance and trust liabilities.
When instrumenting an LLM call, you have up to five categories of data you can record as span attributes or span events:
| Data Category | Debugging Value | Privacy Risk | Storage Cost | Recommendation |
|---|---|---|---|---|
| Token counts (in/out) | High | None | Trivial | Always capture |
| Model ID, finish_reason, latency | High | None | Trivial | Always capture |
| System prompt (static template) | High | Low | Small | Capture by reference / hash |
| User-injected prompt content | High | High | Medium | Redact or encrypt; policy-gated |
| Full completion text | Medium | Medium | High | Sample or truncate; policy-gated |
The practical implication: structural metadata (tokens, latency, model, finish reason) should always be captured because it has high debugging value and zero privacy risk. Prompt and response content requires a policy decision that depends on your regulatory environment, user consent model, and data residency requirements.
A common pattern is to store SHA-256 hashes as span attributes instead of the prompt itself: one hash of the fully rendered prompt and one of the static template. This gives you three capabilities: detecting when the exact same prompt is being retried (the rendered-prompt hash repeats, which helps when debugging retry storms), correlating incidents across users who hit the same template (the template hash matches even when user content differs, and quality issues are often template-level rather than user-level), and linking span data to separately stored prompt records without embedding PII in your trace backend.
Some teams go further and store the prompt in an encrypted side-channel store (e.g., S3 with customer-managed keys), then record only the storage key as a span attribute. This approach can support GDPR "right to erasure" requests: if each user's or record's prompts are encrypted under their own key, deleting that key renders the stored prompts unreadable without any changes to the trace backend.
Token counts in spans are the raw material for cost tracking. The standard practice is to derive cost from them, either at instrumentation time in your tracing library or at query time in your observability backend, using a pricing lookup table keyed on (provider, model, token_type).
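A minimal sketch of the hashing pattern follows, assuming the template is hashed separately from the rendered prompt so that template-level and request-level correlation are both possible. The attribute names `prompt.template_hash` and `prompt.rendered_hash` are hypothetical.

```python
import hashlib


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def prompt_fingerprints(template, variables):
    """Return span attributes that identify a prompt without storing it."""
    rendered = template.format(**variables)
    return {
        # Same for every user of this template version.
        "prompt.template_hash": sha256_hex(template),
        # Same only when the exact same final prompt is sent again.
        "prompt.rendered_hash": sha256_hex(rendered),
    }


TEMPLATE = "Answer the question using the context.\nQuestion: {question}"

alice = prompt_fingerprints(TEMPLATE, {"question": "What is my balance?"})
alice_retry = prompt_fingerprints(TEMPLATE, {"question": "What is my balance?"})
bob = prompt_fingerprints(TEMPLATE, {"question": "When is rent due?"})
```

One caveat: a plain hash of a short or guessable prompt can be reversed by brute force, so a keyed hash (for example via the standard library's `hmac` module) may be a better fit where that matters.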
The formula is simple: cost = (input_tokens × input_price_per_token) + (output_tokens × output_price_per_token). What makes this powerful is that cost becomes a span attribute, meaning you can break down cost by any other span attribute: user ID, feature flag, prompt template ID, or time-of-day.
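The formula and the group-by idea can be sketched in a few lines. The provider name, model name, and prices below are placeholders rather than real pricing, and `customer.id` is a hypothetical business attribute; `Decimal` is used to avoid floating-point drift when summing many small amounts.

```python
from collections import defaultdict
from decimal import Decimal

# Placeholder prices in USD per 1,000,000 tokens, keyed on
# (provider, model, token_type). Replace with current provider pricing.
PRICING = {
    ("example-provider", "model-small", "input"): Decimal("0.50"),
    ("example-provider", "model-small", "output"): Decimal("1.50"),
}
PER_MILLION = Decimal(1_000_000)


def span_cost(attrs):
    """cost = input_tokens * input_price + output_tokens * output_price."""
    provider = attrs["gen_ai.system"]
    model = attrs["gen_ai.response.model"]
    input_cost = PRICING[(provider, model, "input")] * attrs["gen_ai.usage.input_tokens"]
    output_cost = PRICING[(provider, model, "output")] * attrs["gen_ai.usage.output_tokens"]
    return (input_cost + output_cost) / PER_MILLION


def cost_by(spans, key):
    """Sum span cost grouped by any other span attribute."""
    totals = defaultdict(Decimal)
    for attrs in spans:
        totals[attrs[key]] += span_cost(attrs)
    return dict(totals)


def llm_span(customer, input_tokens, output_tokens):
    return {
        "gen_ai.system": "example-provider",
        "gen_ai.response.model": "model-small",
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "customer.id": customer,
    }


spans = [
    llm_span("acme", 2000, 500),
    llm_span("acme", 1000, 1000),
    llm_span("globex", 4000, 0),
]
totals = cost_by(spans, "customer.id")
```

Pricing is looked up by the response model rather than the requested model here, on the assumption that billing follows the model that actually served the request.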
Streaming completions introduce a subtle instrumentation challenge. When the model streams tokens back in chunks, the span cannot be closed until the stream is exhausted — but the first token latency is often the metric users perceive most directly. The correct pattern is to use a span event (zero-duration timestamp within the span) to mark the moment the first token chunk arrives, then close the span only when the stream ends.
This gives you two distinct latency metrics within one span: time to first token (TTFT) captured as a span event timestamp minus span start time, and total generation time captured as the span duration. TTFT is what matters for perceived responsiveness; total generation time matters for cost modeling and capacity planning.
Start span → make streaming call → on first chunk: record span event "first_token_received" → consume full stream, accumulate token count → close span. Compute TTFT = first_token_event.timestamp − span.start_time. Output token count may need to be counted from chunks if the provider doesn't send usage in the stream (Anthropic reports input tokens in the message_start event and cumulative output tokens in message_delta events; OpenAI sends usage in a final chunk if stream_options.include_usage is True).
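The sequence above can be sketched with a simulated stream. Everything here is a stand-in: `fake_stream` imitates a provider's streaming response, the span is a plain dict rather than an OTel span, and the attribute names are hypothetical. The clock is injectable so the timing logic can be checked deterministically.

```python
import time


def fake_stream(chunks, first_delay_s=0.05, gap_s=0.01):
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(first_delay_s)  # model "thinking" before the first token
    for index, chunk in enumerate(chunks):
        if index:
            time.sleep(gap_s)
        yield chunk


def traced_stream(stream, clock=time.perf_counter):
    """Consume a stream, recording a first-token event and total duration."""
    span = {"name": "llm.chat", "start": clock(), "events": [], "attributes": {}}
    pieces = []
    for chunk in stream:
        if not span["events"]:
            # Zero-duration marker inside the still-open span.
            span["events"].append(
                {"name": "first_token_received", "timestamp": clock()}
            )
        pieces.append(chunk)
    span["end"] = clock()  # close the span only once the stream is exhausted

    if span["events"]:
        ttft_s = span["events"][0]["timestamp"] - span["start"]
        span["attributes"]["ttft_ms"] = ttft_s * 1000
    span["attributes"]["duration_ms"] = (span["end"] - span["start"]) * 1000
    # Chunks are not tokens; prefer provider-reported usage when available.
    span["attributes"]["output_chunks"] = len(pieces)
    return "".join(pieces), span


text, span = traced_stream(fake_stream(["Hel", "lo", "!"]))
print(text, span["attributes"])
```

If the stream yields nothing, no event is recorded and `ttft_ms` is simply absent, which is easier to query for than a misleading zero.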
Always capture structural LLM metadata (tokens, latency, model, finish reason). Treat prompt and response content as sensitive data requiring a policy decision. Use span attributes for cost attribution keyed on (provider, model). Use span events to capture time-to-first-token within streaming spans.
You're building a B2B SaaS product where enterprise customers submit financial documents for AI-powered analysis. The product must: track LLM costs per customer account, support EU GDPR data residency, and provide quality debugging when the model gives wrong answers.
Work with the assistant to design your complete span attribute schema and data handling policy. Discuss cost attribution, PII handling, and streaming instrumentation. Complete 3+ exchanges.
Notion's AI team, which processes millions of LLM requests daily across its document generation, meeting summarization, and Q&A features, published a 2024 engineering blog post describing their observability evolution. They started with a Datadog-based approach, recording token counts as custom metrics. Over time, they found Datadog's query model was not well suited to the hierarchical, multi-step nature of their AI pipelines — they needed to correlate a user-reported quality issue with a specific prompt template version, model parameter set, and retrieval result, across a tree of spans. They migrated to a hybrid architecture: Datadog for infrastructure metrics, and a dedicated LLM observability layer (they evaluated both Langfuse and Arize) for prompt management, quality evals, and LLM-specific dashboards. The post noted that the specialized tool reduced the time to identify a prompt regression from "hours of manual log searching" to "minutes with the evaluation dashboard."
Tools like Datadog, New Relic, and Grafana were designed for microservice infrastructure — latency histograms, error rates, and resource utilization. They are excellent at these tasks but lack LLM-specific primitives. Specifically, generic APM tools do not natively:
1. Render prompt/response diffs. When a prompt template changes, you want to see side-by-side what changed and how outputs changed. APM has no concept of a "prompt template version."
2. Run LLM-as-judge evaluations. You often want to programmatically assess the quality of model outputs (relevance, groundedness, toxicity) and surface quality scores alongside latency and cost. This requires another LLM call — a meta-evaluation — that generic APM doesn't support.
3. Manage prompt versions. Treating prompts as versioned artifacts with A/B testing, rollback, and performance history requires a prompt registry, which is outside the scope of APM.
4. Visualize agent execution graphs. Multi-step agents produce branching execution graphs that are better visualized as interactive graphs than as flat trace waterfalls.
Langfuse is an open-source LLM observability platform (MIT licensed, self-hostable, also offered as cloud SaaS) that has emerged as one of the most widely adopted purpose-built options. Its key capabilities:
Helicone operates as an HTTP proxy rather than an SDK-based instrumentation library. You point your OpenAI (or Anthropic, etc.) API calls at https://oai.hconeai.com/v1 instead of https://api.openai.com/v1, and Helicone intercepts every request transparently. This approach has a notable advantage: almost no code changes beyond a base URL swap and a Helicone authentication header. It also has a notable limitation: you cannot add business-context span attributes (user tier, feature name) without passing them as custom headers.
Helicone is well-suited for teams that want rapid time-to-value, B2C SaaS products with high-volume single-provider usage, and cost dashboards without engineering effort. It is less suited for complex multi-provider pipelines where agent orchestration spans many services.
Arize Phoenix is an open-source LLM observability tool focused on evaluation and debugging rather than production monitoring. It runs locally or as a cloud service and is primarily used for offline analysis — loading a dataset of traces, running evals, and identifying systematic failure modes.
Phoenix uses the OpenInference span convention, which extends OTel with richer LLM-specific attribute namespaces. It has tight integration with LangChain, LlamaIndex, and DSPy. Its strongest differentiator is the ability to run embedding-based clustering on traces — grouping inputs by semantic similarity to surface clusters where the model consistently underperforms.
| Platform | Instrumentation | Best For | Self-Host |
|---|---|---|---|
| Langfuse | SDK / OTel | Production monitoring + prompt management + evals | Yes (MIT) |
| Helicone | HTTP proxy | Rapid deployment, cost visibility, zero-code setup | Limited |
| Arize Phoenix | OpenInference / OTel | Offline eval, embedding clustering, agent debugging | Yes (OSS) |
| Datadog LLM Obs | Auto-instrument | Teams already on Datadog, unified infra + LLM | No |
The decision framework is straightforward once you identify your primary use case:
Startup, moving fast: Helicone proxy for immediate cost visibility, Langfuse cloud for quality evals. Migrate to self-hosted Langfuse when data residency becomes a requirement.
Enterprise with existing APM: Add OTel GenAI instrumentation to emit standard spans to your existing Datadog/Grafana stack. Add Langfuse (self-hosted) specifically for prompt management and LLM eval workflows. The OTel Collector fans out to both.
Research / offline analysis: Arize Phoenix locally for embedding analysis and systematic eval. Export production traces from Langfuse to Phoenix for deeper investigation of failure clusters.
Generic APM tools handle infrastructure metrics well but lack LLM-specific primitives: prompt versioning, evaluation scoring, and agent graph visualization. Purpose-built tools like Langfuse (production monitoring + prompts + evals), Helicone (proxy-based, zero-code), and Arize Phoenix (offline eval + embedding analysis) fill these gaps. Most mature teams run a hybrid: generic APM for infrastructure, a specialized LLM platform for AI-specific workflows.
You're a senior engineer at a healthcare AI company. Your LLM pipeline handles patient intake summaries, runs on AWS, processes ~50K requests/day, uses both OpenAI and Anthropic, and must meet HIPAA data handling requirements. Your team already uses Datadog for infrastructure.
Work with the assistant to select and justify a complete observability tool stack. Address: which tools to adopt, how they integrate, what data stays in each, and how you handle HIPAA constraints. Complete 3+ exchanges.