Module 5 · Lesson 1

What Streaming Is and Why It Matters

Tokens arrive one at a time — and that changes everything about how users experience your application.

Why does receiving the first token in 200ms feel faster than receiving the full response in 2 seconds?

When GitHub Copilot launched in June 2021, its internal benchmarks showed that users perceived the tool as significantly faster than a non-streaming competitor — even when total generation time was nearly identical. The difference was latency to first token: seeing code appear character by character engaged users rather than leaving them staring at a spinner. The psychological effect of progressive rendering is well-documented in UX research; streaming is one of the most direct ways to exploit it in LLM applications.

The Blocking Model vs. Streaming

In a standard (non-streaming) API call, the client sends a request and then waits until the entire response has been generated server-side before receiving a single byte of content. For a 500-token response at typical generation speed, that can mean 3–8 seconds of silence.

Streaming inverts this: the server begins transmitting tokens to the client as they are generated, using the Server-Sent Events (SSE) protocol over HTTP. The client reads an open connection, processing each incremental chunk as it arrives. The total wall-clock time to generate all tokens is the same — but the user sees the first word in under a second.

Blocking Request

Client sends request → server generates all tokens → server sends complete JSON → client renders. Median perceived wait: full generation time (2–8 s for typical responses).

Streaming Request

Client sends request → server sends first token (~200 ms) → subsequent tokens stream in → client renders progressively. Median perceived wait: time-to-first-token only.
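The gap between the two models can be made concrete with a small simulation. The sketch below does not call any real API: `fake_token_stream` is a hypothetical stand-in for a model that emits one token every few milliseconds, and `measure` records when the first token arrives versus when the last one does.

```python
import time
from typing import Iterator, List, Tuple


def fake_token_stream(tokens: List[str], delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one token every `delay_s` seconds."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


def measure(tokens: List[str], delay_s: float) -> Tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds for one simulated stream."""
    start = time.perf_counter()
    ttft = None
    for _ in fake_token_stream(tokens, delay_s):
        if ttft is None:
            # A streaming client can start rendering at this point.
            ttft = time.perf_counter() - start
    # A blocking client sees nothing until this point.
    total = time.perf_counter() - start
    return (ttft if ttft is not None else total), total


ttft, total = measure(["Hello", " ", "world", "!"] * 5, delay_s=0.005)
print(f"first token after {ttft * 1000:.0f} ms, full response after {total * 1000:.0f} ms")
```

The total time is the same in both models; only the moment at which the user first sees output changes.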

Server-Sent Events — The Wire Protocol

Anthropic's streaming implementation uses SSE, a simple text-based mechanism that runs over ordinary HTTP. The connection stays open and the server writes event frames, each consisting of event: and data: lines and terminated by a blank line. Each frame looks like:

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" world"}}

event: message_stop
data: {"type":"message_stop"}

The SDKs abstract all of this away, but understanding the wire format helps when debugging network logs or implementing streaming in environments where you call the HTTP API directly.
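When you do have to read the wire format yourself, the frames above can be decoded with very little code. The following is a minimal sketch, not a complete SSE client: `parse_sse` is a hypothetical helper that handles only the `event:` and `data:` fields and assumes the whole payload is already in memory as a string.

```python
import json
from typing import Dict, Iterator, List, Tuple

# The three frames from the example above, each terminated by a blank line.
RAW = (
    "event: content_block_delta\n"
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}\n'
    "\n"
    "event: content_block_delta\n"
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" world"}}\n'
    "\n"
    "event: message_stop\n"
    'data: {"type":"message_stop"}\n'
    "\n"
)


def _field_value(line: str, field: str) -> str:
    """Return the text after 'field:', dropping one optional leading space."""
    value = line[len(field) + 1:]
    return value[1:] if value.startswith(" ") else value


def parse_sse(raw: str) -> Iterator[Tuple[str, Dict]]:
    """Yield (event_name, decoded_json) for each blank-line-terminated frame."""
    event_name = "message"  # SSE default when no event: line is present
    data_lines: List[str] = []
    for line in raw.splitlines():
        if line == "":
            # A blank line dispatches the frame collected so far.
            if data_lines:
                yield event_name, json.loads("\n".join(data_lines))
            event_name, data_lines = "message", []
        elif line.startswith("event:"):
            event_name = _field_value(line, "event")
        elif line.startswith("data:"):
            data_lines.append(_field_value(line, "data"))
        # Comment lines (starting with ":") and other fields are ignored here.


frames = list(parse_sse(RAW))
text = "".join(
    data["delta"]["text"] for name, data in frames if name == "content_block_delta"
)
print(text)
```

A production client would read the response incrementally rather than from a complete string, but the framing rules are the same.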

When to Use Streaming

Streaming is not always the right choice. Use it when the response will be displayed progressively to a human user — chat interfaces, document editors, code completions. Avoid it when you need the complete text before doing anything useful with it (batch processing pipelines, structured-data extraction that requires the full JSON, automated test harnesses). Streaming adds connection management complexity; only pay that cost when UX demands it.
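The structured-data case is easy to demonstrate. In this sketch the `chunks` list is invented sample data standing in for streamed text deltas; it shows that a JSON document cannot be parsed until its last chunk has arrived, so streaming buys nothing for this kind of consumer.

```python
import json
from typing import List, Optional

# Invented sample: one JSON object split across three streamed chunks.
chunks = ['{"name": "Ada", ', '"languages": ["Py', 'thon", "OCaml"]}']


def try_parse(text: str) -> Optional[dict]:
    """Return the decoded object, or None while the JSON is still incomplete."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


buffer = ""
parse_results: List[Optional[dict]] = []
for chunk in chunks:
    buffer += chunk
    parse_results.append(try_parse(buffer))

print(parse_results)
```

Only the final attempt succeeds, which is exactly what a blocking call would have delivered with less code.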

Real-World Data Point

Vercel's AI SDK team published internal latency data in 2023 showing that streaming reduced perceived response time by 60–70% in user studies, even when actual token generation time was unchanged. The metric that matters to users is time-to-first-meaningful-content, not total response time.

Key Terms

TTFT: Time To First Token. The latency between sending a request and receiving the first streamed token. The primary UX metric for streaming applications.

SSE: Server-Sent Events. An HTTP/1.1 protocol for unidirectional server-to-client event streams, used by Anthropic's streaming API.

Blocking call: A standard API request that returns only after the full response is generated. Lower complexity, higher perceived latency.

Progressive rendering: Displaying content to users as it arrives, rather than waiting for the complete payload. Core UX benefit of streaming.

Module 5 · Lesson 1

Quiz: Streaming Fundamentals

3 questions — select the best answer for each.

What protocol does the Anthropic API use to deliver streaming responses?

Correct. Anthropic uses SSE — an HTTP/1.1 mechanism where the server writes newline-delimited event frames over a persistent connection.

Not quite. Anthropic uses Server-Sent Events (SSE), a simpler HTTP/1.1 protocol for unidirectional server-to-client streams.

Which metric most directly captures the user-perceived benefit of streaming?

Correct. TTFT is the latency between sending the request and receiving the first token. Users perceive this as the start of a response, making it the key UX metric.

Not quite. While throughput matters, the primary perceived-latency benefit of streaming is captured by Time To First Token (TTFT).

In which scenario is a non-streaming (blocking) API call generally preferable?

Correct. When you need the complete response before doing anything useful (e.g., parsing full JSON), a blocking call is simpler and avoids streaming overhead.

Not quite. Batch pipelines that need the complete text before processing it are better served by blocking calls — streaming adds complexity without UX benefit there.

Module 5 · Lab 1

Streaming Concepts Lab

Practice what you've learned — ask at least 3 questions to complete the lab.

Your Lab Assistant

This assistant specialises in streaming fundamentals — TTFT, SSE protocol, when to stream vs. block. Ask it anything from Lesson 1.

Try asking: "Why does streaming feel faster even if total generation time is the same?" or "What does an SSE event frame look like in the Anthropic API?"


Module 5 · Lesson 2

Enabling Streaming in the Python SDK

One parameter change opens the stream — but handling it correctly takes a few more lines.

What does stream=True actually return, and how do you safely consume it?

The stream Parameter

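In Anthropic's Python SDK there are two ways to stream. Passing stream=True to client.messages.create() changes the return value from a Message object to an iterator of raw stream events. Calling client.messages.stream() instead gives you a context manager that yields a MessageStream, which wraps the same events with convenience helpers.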

import anthropic

client = anthropic.Anthropic()

# Non-streaming — returns Message object directly
message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Tell me about streaming."}]
)

# Streaming — use as context manager
with client.messages.stream(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Tell me about streaming."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Notice that client.messages.stream() — not client.messages.create(stream=True) — is the idiomatic approach in the current SDK. The stream() method returns a context manager that ensures the underlying HTTP connection is closed properly when the block exits, even if an exception occurs.

Iterating the Stream

The SDK exposes three main iteration interfaces on a MessageStream:

Property / Method	What You Get	Best Used For
.text_stream	Iterator of raw text strings as they arrive	Streaming text directly to a terminal or UI
for event in stream	Iterator of typed event objects (MessageStartEvent, ContentBlockDeltaEvent, etc.), produced by iterating the stream object itself	Fine-grained control; inspecting metadata mid-stream
.get_final_message()	Complete Message object after stream ends	Accessing stop_reason, usage tokens after streaming

Accessing Usage and Stop Reason

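The input token count arrives at the start of the stream, in message_start. The stop_reason and the cumulative output token count arrive near the end, in the message_delta event. The SDK's .get_final_message() method collects all of these for you once the stream has finished: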

with client.messages.stream(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain recursion."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    
    final = stream.get_final_message()
    print(f"\n\nStop reason: {final.stop_reason}")
    print(f"Input tokens: {final.usage.input_tokens}")
    print(f"Output tokens: {final.usage.output_tokens}")

The Context Manager Pattern — Why It Matters

Always use with blocks (context managers) for streaming. HTTP streaming keeps a TCP connection open. If your code raises an exception mid-stream and you haven't used a context manager, the connection leaks. The MessageStream context manager calls stream.close() in its __exit__, guaranteeing cleanup.
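The cleanup guarantee can be seen without touching the network. The sketch below uses `FakeStream`, a hypothetical toy class that only mimics the shape of a streaming response (it is not the SDK's MessageStream), to show that `__exit__` runs and closes the stream even when the consumer raises mid-iteration.

```python
from typing import Iterator, List


class FakeStream:
    """Toy stand-in for a streaming response that holds an open connection."""

    def __init__(self, chunks: List[str]) -> None:
        self._chunks = chunks
        self.closed = False

    def __enter__(self) -> "FakeStream":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs on normal exit and on exceptions alike.
        self.close()
        return False  # do not swallow the exception

    def close(self) -> None:
        self.closed = True

    def __iter__(self) -> Iterator[str]:
        return iter(self._chunks)


stream = FakeStream(["Hel", "lo"])
received: List[str] = []
try:
    with stream:
        for chunk in stream:
            received.append(chunk)
            raise RuntimeError("consumer failed mid-stream")
except RuntimeError:
    pass

print(received, stream.closed)
```

Without the with block, the RuntimeError would propagate before any cleanup code had a chance to run.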

SDK Version Note

The client.messages.stream() interface was introduced in anthropic-sdk-python 0.18.0 (released March 2024). Earlier versions used create(stream=True), which returns a raw event iterator without the convenience helpers. That call is still supported, but the .stream() context manager is the recommended pattern for most code.

MessageStream: The context manager returned by client.messages.stream(). Manages connection lifecycle and exposes text_stream, typed events (by iterating the stream object itself), and get_final_message().

.text_stream: A generator that yields each text delta as a plain string. Simplest way to consume streamed text output.

.get_final_message(): Blocks until the stream ends, then returns the complete Message object including usage stats and stop_reason.

Module 5 · Lesson 2

Quiz: Python Streaming SDK

3 questions — select the best answer for each.

What does client.messages.stream() return in the Anthropic Python SDK?

Correct. client.messages.stream() returns a MessageStream context manager, which should be used with a with block.

Not quite. It returns a MessageStream context manager — use it with with client.messages.stream(...) as stream:.

When are usage tokens and stop_reason available during a stream?

Correct. Token usage and stop_reason arrive in the message_delta event at the very end of the stream. Use .get_final_message() to access them conveniently.

Not quite. Usage stats and stop_reason only arrive at stream end, in the final message_delta event.

Why should you always use a with block (context manager) when streaming?

Correct. The context manager calls stream.close() in __exit__, preventing connection leaks if your code raises an exception mid-stream.

Not quite. The context manager guarantees connection cleanup via stream.close() — preventing resource leaks on exceptions.

Module 5 · Lab 2

Python Streaming SDK Lab

Practice what you've learned — ask at least 3 questions to complete the lab.

Your Lab Assistant

This assistant specialises in the Anthropic Python SDK's streaming interface — MessageStream, text_stream, get_final_message(), and context managers. Ask it anything from Lesson 2.

Try asking: "What's the difference between .text_stream and iterating the stream's typed events?" or "Show me how to get token counts after a streaming response."


Module 5 · Lesson 3

Streaming Events Deep Dive

Every token arrives inside a typed event — knowing the full event lifecycle lets you build robust stream consumers.

What events fire between the first request and the final token — and what information does each carry?

The Complete Event Sequence

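Anthropic's streaming API emits events in a predictable order. Understanding this lifecycle is essential when you need to react to partial data, track progress, or handle tool use mid-stream. The API may also intersperse ping events, and it can send an error event if something fails mid-stream, so a consumer should tolerate event types it does not handle. The sequence for a simple text response is: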

1. message_start: Emitted once, immediately. Contains the message ID, model name, role ("assistant"), and initial usage (input_tokens count).
2. content_block_start: Signals the start of a new content block (index 0 for the first text block). Carries the block type ("text") and index.
3. content_block_delta: Emitted repeatedly, once per token batch. The delta object has type: "text_delta" and a text field with the new content. This is the hot path; most events in a stream are this type.
4. content_block_stop: Signals the end of the current content block. One per block opened.
5. message_delta: Emitted once, near the end. Contains stop_reason, stop_sequence, and cumulative output_tokens usage.
6. message_stop: Final event. No payload beyond the event type. Safe to close the connection after receiving this.
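The lifecycle above can be walked through offline. In the sketch below, `EVENTS` is a hand-written list of plain dictionaries that mirror the JSON payloads (the ID, token counts, and text are invented), and `summarize` is a hypothetical helper that pulls the interesting fields out of each event type. A ping event is included to show that unknown types are skipped.

```python
from typing import Any, Dict, Iterable, List

# Hand-written events mirroring the wire payloads; all values are invented.
EVENTS: List[Dict[str, Any]] = [
    {"type": "message_start",
     "message": {"id": "msg_example", "role": "assistant", "usage": {"input_tokens": 12}}},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "ping"},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Mercury, "}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Venus, Earth"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta",
     "delta": {"stop_reason": "end_turn", "stop_sequence": None},
     "usage": {"output_tokens": 9}},
    {"type": "message_stop"},
]


def summarize(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Collect text, usage, and stop_reason from one stream of events."""
    summary: Dict[str, Any] = {
        "text": "", "input_tokens": None, "output_tokens": None,
        "stop_reason": None, "complete": False,
    }
    for event in events:
        kind = event["type"]
        if kind == "message_start":
            summary["input_tokens"] = event["message"]["usage"]["input_tokens"]
        elif kind == "content_block_delta" and event["delta"]["type"] == "text_delta":
            summary["text"] += event["delta"]["text"]
        elif kind == "message_delta":
            summary["stop_reason"] = event["delta"]["stop_reason"]
            summary["output_tokens"] = event["usage"]["output_tokens"]
        elif kind == "message_stop":
            summary["complete"] = True
        # Any other type (ping, block start/stop) needs no action here.
    return summary


result = summarize(EVENTS)
print(result)
```

If `complete` is still False when the iterator ends, the stream was cut off before message_stop and the text should be treated as partial.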

Consuming Events by Iterating the Stream

Iterate over the stream object itself when you need access to metadata beyond raw text, for example to track when a new content block starts, or to capture the stop reason without waiting for get_final_message():

with client.messages.stream(
    model="claude-opus-4-5",
    max_tokens=512,
    messages=[{"role": "user", "content": "List three planets."}]
) as stream:
    for event in stream:
        if event.type == "message_start":
            print(f"Stream opened. Input tokens: {event.message.usage.input_tokens}")
        
        elif event.type == "content_block_delta":
            if event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
        
        elif event.type == "message_delta":
            print(f"\nStop reason: {event.delta.stop_reason}")
            print(f"Output tokens: {event.usage.output_tokens}")

Tool Use and Multiple Content Blocks

When Claude uses a tool during a streaming response, the event sequence includes additional content_block_start events with type: "tool_use", and content_block_delta events with type: "input_json_delta" carrying partial JSON for the tool input. A single stream can therefore contain several content blocks, delivered one after another, each with its own start, delta, and stop events. Always use the index field to track which block a delta belongs to.

Event Type	Key Fields	Count per Response
message_start	message.id, message.model, message.usage.input_tokens	1
content_block_start	index, content_block.type	1 per block
content_block_delta	index, delta.type, delta.text / delta.partial_json	Many (most events)
content_block_stop	index	1 per block
message_delta	delta.stop_reason, usage.output_tokens	1
message_stop	(none)	1
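Tracking blocks by index is mostly bookkeeping. The sketch below is one possible approach, not the SDK's own implementation: `assemble_blocks` is a hypothetical helper that buffers deltas per index and decodes the tool input only at content_block_stop, once the JSON is complete. The sample events, the tool name `get_weather`, and the ID are invented.

```python
import json
from typing import Any, Dict, Iterable, List


def assemble_blocks(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Rebuild finished content blocks from start/delta/stop events, keyed by index."""
    blocks: Dict[int, Dict[str, Any]] = {}
    for event in events:
        kind = event["type"]
        if kind == "content_block_start":
            block = dict(event["content_block"])
            block["_buffer"] = ""  # raw text or partial JSON collected so far
            blocks[event["index"]] = block
        elif kind == "content_block_delta":
            block = blocks[event["index"]]
            delta = event["delta"]
            if delta["type"] == "text_delta":
                block["_buffer"] += delta["text"]
            elif delta["type"] == "input_json_delta":
                block["_buffer"] += delta["partial_json"]
        elif kind == "content_block_stop":
            block = blocks[event["index"]]
            buffered = block.pop("_buffer")
            if block["type"] == "text":
                block["text"] = buffered
            elif block["type"] == "tool_use":
                # The JSON is only guaranteed to be complete at this point.
                block["input"] = json.loads(buffered) if buffered else {}
    return [blocks[i] for i in sorted(blocks)]


# Invented sample: a text block followed by a tool_use block.
SAMPLE = [
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Let me check."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "toolu_example", "name": "get_weather", "input": {}}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": '{"city": "Par'}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": 'is"}'}},
    {"type": "content_block_stop", "index": 1},
]

assembled = assemble_blocks(SAMPLE)
print(assembled)
```

Parsing the tool input on every delta would fail for the same reason any partial JSON fails, which is why the sketch waits for the stop event.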

Practical Guidance

For simple text streaming, .text_stream is sufficient and much cleaner. Iterate over the stream's typed events only when you need metadata (stop_reason, usage, block index) or when handling tool use, where the delta types change mid-stream.

content_block_delta: The most frequently emitted event type during streaming. Carries incremental text (text_delta) or partial tool input JSON (input_json_delta).

message_delta: The penultimate event, carrying stop_reason, stop_sequence, and final output token count. The event to watch if you need to know why generation ended.

index: Integer field on content_block events identifying which content block a delta belongs to. Critical when a response contains multiple blocks (e.g., text + tool_use).

Module 5 · Lesson 3

Quiz: Streaming Event Lifecycle

3 questions — select the best answer for each.

Which event type is emitted most frequently during a streaming text response?

Correct. content_block_delta is emitted once per token batch throughout generation — the vast majority of events in any stream are this type.

Not quite. content_block_delta dominates — it fires once per token batch throughout the entire generation.

What field should you use to determine which content block a content_block_delta event belongs to?

Correct. The index field identifies which content block (0, 1, 2…) a delta belongs to — essential when a response includes multiple blocks like text + tool_use.

Not quite. Use event.index — it identifies which content block (0, 1, 2…) the delta is part of.

When does stop_reason become available in the event stream?

Correct. stop_reason is in the message_delta event — the penultimate event before message_stop.

Not quite. stop_reason appears in the message_delta event — emitted once, near the end, just before message_stop.

Module 5 · Lab 3

Streaming Events Lab

Practice what you've learned — ask at least 3 questions to complete the lab.

Your Lab Assistant

This assistant specialises in the streaming event lifecycle — message_start, content_block_delta, message_delta, tool use blocks, and the index field. Ask it anything from Lesson 3.

Try asking: "What's the difference between message_delta and message_stop?" or "How do I handle tool_use content blocks during streaming?"


Module 5 · Lesson 4

Error Handling and Production Patterns

Streams break in ways that batch calls don't — handling errors mid-stream, implementing retries, and forwarding to clients safely.

What happens when a network error occurs after you've already sent 200 tokens to a user — and how do you recover?

In December 2023, several teams building on the OpenAI API reported production incidents where streaming responses silently truncated due to proxy timeouts, sending partial JSON to downstream parsers and causing cascading failures. The same risk exists with any streaming LLM API. The pattern that emerged — tracking accumulated text, catching stream exceptions separately from connection setup exceptions, and using idempotent retry logic — became standard practice across the ecosystem.

Error Categories in Streaming

Streaming introduces two distinct failure windows that don't exist in blocking calls:

Pre-stream Errors

Occur before the first event arrives: authentication failures, invalid parameters, rate limit rejections. These raise APIStatusError subclasses (AuthenticationError, RateLimitError, etc.) exactly like non-streaming calls. Safe to retry or surface to the user immediately.

Mid-stream Errors

Occur after tokens have started flowing: network interruptions, proxy timeouts, server errors mid-generation. These raise exceptions inside the for loop or context manager. Partial content may already have been sent to the user — retrying from scratch means duplicate content.

Wrapping Streams Safely

import anthropic
from anthropic import APIConnectionError, RateLimitError

client = anthropic.Anthropic()

def stream_reply(user_input, yield_to_client, send_error_to_client, log_partial_failure):
    # The three callables are application-specific hooks (UI send, error display, logging)
    accumulated = []  # text already forwarded to the user

    try:
        with client.messages.stream(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": user_input}]
        ) as stream:
            for text in stream.text_stream:
                accumulated.append(text)
                yield_to_client(text)  # send to UI/websocket

    except RateLimitError:
        # Typically a pre-stream error: the request was rejected before any token was sent
        send_error_to_client("Rate limit reached. Please wait and try again.")

    except APIConnectionError as e:
        # Network error: if accumulated is non-empty, partial content was already sent,
        # so do NOT silently retry from scratch (it would duplicate content)
        partial = "".join(accumulated)
        log_partial_failure(partial, e)
        send_error_to_client("Connection lost mid-response. Please retry.")
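The distinction between the two failure windows suggests a simple retry rule: retry automatically only while nothing has reached the user. The sketch below illustrates that rule with stand-ins rather than the real SDK. `StreamInterrupted`, `stream_with_safe_retry`, and the two fake stream openers are all hypothetical names introduced for the example.

```python
from typing import Callable, Dict, Iterator, List


class StreamInterrupted(Exception):
    """Hypothetical stand-in for a network error raised while streaming."""


def stream_with_safe_retry(
    open_stream: Callable[[], Iterator[str]],
    send: Callable[[str], None],
    max_attempts: int = 3,
) -> str:
    """Forward chunks via `send`; retry only while nothing has been forwarded yet."""
    for attempt in range(1, max_attempts + 1):
        sent: List[str] = []
        try:
            for chunk in open_stream():
                send(chunk)
                sent.append(chunk)
            return "".join(sent)
        except StreamInterrupted:
            if sent or attempt == max_attempts:
                # Partial output already reached the user, or retries are exhausted.
                raise
            # Nothing was forwarded, so starting over cannot duplicate content.
    raise ValueError("max_attempts must be at least 1")


calls: Dict[str, int] = {"n": 0}


def flaky_before_first_chunk() -> Iterator[str]:
    """Fails on the first attempt before yielding anything, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise StreamInterrupted("failed before any chunk")
    yield "Hello"
    yield " world"


def fails_mid_stream() -> Iterator[str]:
    """Yields one chunk, then drops the connection."""
    yield "partial"
    raise StreamInterrupted("dropped after first chunk")


delivered: List[str] = []
full_text = stream_with_safe_retry(flaky_before_first_chunk, delivered.append)

delivered_partial: List[str] = []
try:
    stream_with_safe_retry(fails_mid_stream, delivered_partial.append)
    mid_stream_raised = False
except StreamInterrupted:
    mid_stream_raised = True

print(full_text, delivered_partial, mid_stream_raised)
```

In the mid-stream case the caller still has to decide what to show the user; the helper only guarantees that it never replays text on its own.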

Forwarding Streams to Web Clients

When your backend streams from Anthropic and forwards tokens to a browser, the standard pattern is to use SSE from your server to the browser as well. This creates a two-segment stream pipeline: Anthropic → your server → browser. Key considerations:

1. Set Content-Type: text/event-stream on your server response so browsers recognise it as SSE.
2. Disable response buffering in your web server or proxy (Nginx: proxy_buffering off;). Without this, your proxy will batch tokens before forwarding, defeating streaming's UX benefit.
3. Send a [DONE] sentinel (or use your own protocol) when the stream ends, so the browser knows to close the EventSource connection.
4. Handle partial sends: if the Anthropic stream fails mid-way, send an error event to the browser rather than silently closing the connection.
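The considerations above can be sketched as a small framework-independent generator. Everything named here is an assumption made for illustration: the event names `delta`, `done`, and `error` are one possible private protocol, and `format_sse` and `forward_as_sse` are hypothetical helpers. The `X-Accel-Buffering: no` response header is an Nginx feature that disables buffering for a single response, which complements the proxy_buffering directive.

```python
import json
from typing import Dict, Iterable, Iterator

# Headers a server would attach to the streaming response.
SSE_HEADERS: Dict[str, str] = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",  # asks Nginx not to buffer this response
}


def format_sse(event: str, data: dict) -> str:
    """Encode one SSE frame: an event line, a data line, and a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def forward_as_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap upstream text chunks as SSE frames, ending with 'done' or 'error'."""
    try:
        for chunk in chunks:
            yield format_sse("delta", {"text": chunk})
    except Exception:
        # Tell the browser explicitly instead of silently closing the connection.
        yield format_sse("error", {"message": "upstream stream failed"})
        return
    yield format_sse("done", {})


def broken_upstream() -> Iterator[str]:
    """Simulated upstream that fails after one chunk."""
    yield "partial"
    raise ConnectionError("simulated upstream failure")


ok_frames = list(forward_as_sse(["Hel", "lo"]))
failed_frames = list(forward_as_sse(broken_upstream()))
print("".join(ok_frames))
```

On the browser side, a client using named events would register listeners for each event name and close its EventSource on `done` or `error`, since EventSource otherwise reconnects automatically.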

Async Streaming with AsyncAnthropic

In async Python frameworks (FastAPI, Starlette, aiohttp), use AsyncAnthropic and async with / async for:

from anthropic import AsyncAnthropic

aclient = AsyncAnthropic()

async def stream_response(user_input: str):
    async with aclient.messages.stream(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": user_input}]
    ) as stream:
        async for text in stream.text_stream:
            yield text  # async generator for FastAPI StreamingResponse
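The benefit of the async client is that one slow stream does not stall the others. The sketch below demonstrates that property with the standard library only: `fake_text_stream` is a hypothetical async generator standing in for `stream.text_stream`, and two consumers run concurrently on a single event loop.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_text_stream(chunks: List[str], delay_s: float) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async text stream: one chunk per `delay_s`."""
    for chunk in chunks:
        await asyncio.sleep(delay_s)  # yields control to the event loop while waiting
        yield chunk


async def consume(name: str, log: List[str]) -> None:
    """Read one stream to completion, recording each chunk as it arrives."""
    async for chunk in fake_text_stream(["a", "b", "c"], delay_s=0.01):
        log.append(f"{name}:{chunk}")


async def main() -> List[str]:
    log: List[str] = []
    # Both streams make progress on the same event loop.
    await asyncio.gather(consume("first", log), consume("second", log))
    return log


interleaved = asyncio.run(main())
print(interleaved)
```

With a synchronous client inside an async handler, the second consumer would not start until the first had finished, because the blocking read never returns control to the loop.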

Production Checklist

① Always use context managers for connection safety. ② Catch pre-stream and mid-stream errors separately. ③ Accumulate sent tokens for logging/recovery. ④ Disable proxy buffering in Nginx/Apache. ⑤ Use AsyncAnthropic in async frameworks to avoid blocking the event loop. ⑥ Set a reasonable max_tokens — an open-ended stream can hold connections for minutes.

APIConnectionError: Raised when the network connection to Anthropic is interrupted. Can occur mid-stream after tokens have already been forwarded.

AsyncAnthropic: The async variant of the Anthropic client. Uses async with / async for for non-blocking streaming in frameworks like FastAPI.

proxy_buffering: Nginx directive that, when enabled (default), causes proxies to buffer streamed responses before forwarding, defeating streaming's latency benefit. Must be disabled.

Module 5 · Lesson 4

Quiz: Error Handling & Production

3 questions — select the best answer for each.

Why is retrying a request immediately after a mid-stream network error potentially harmful?

Correct. If 200 tokens were already forwarded before the error, a fresh retry restarts from token 1 — sending the user duplicate content. Log the partial response and ask the user to retry instead.

Not quite. The problem is content duplication: tokens already sent to the user cannot be "unsent," so retrying from scratch produces duplicate output.

Which Nginx directive must be set to off to prevent proxy buffering from defeating streaming?

Correct. proxy_buffering off; in Nginx forces it to forward chunks immediately rather than batching the entire response before forwarding.

Not quite. Set proxy_buffering off; in Nginx — without this, the proxy batches streamed tokens and the user won't see them until the buffer fills.

What is the correct Anthropic client class for non-blocking streaming in a FastAPI application?

Correct. AsyncAnthropic with async with and async for integrates natively with FastAPI's async event loop without blocking.

Not quite. Use anthropic.AsyncAnthropic with async for — the sync client would block FastAPI's event loop.

Module 5 · Lab 4

Production Streaming Lab

Practice what you've learned — ask at least 3 questions to complete the lab.

Your Lab Assistant

This assistant specialises in production streaming patterns — error handling, mid-stream failures, async streaming with FastAPI, proxy configuration, and retry safety. Ask it anything from Lesson 4.

Try asking: "How do I forward Anthropic streaming to a browser using FastAPI?" or "What's the safest way to handle an APIConnectionError mid-stream?"


Module 5

Module Test: Streaming Responses

15 questions — score 80% or higher to pass the module.

1. What does TTFT stand for in the context of streaming APIs?

Correct. TTFT — Time To First Token — measures the latency between sending a request and receiving the first streamed token.

TTFT stands for Time To First Token — the key perceived-latency metric for streaming applications.

2. Which HTTP mechanism does the Anthropic API use for streaming?

Correct. Anthropic uses SSE — Server-Sent Events — over HTTP/1.1.

Anthropic uses Server-Sent Events (SSE) over HTTP/1.1.

3. When is streaming generally NOT the right choice?

Correct. When you need the complete response before doing anything with it, streaming adds complexity without UX benefit.

Batch pipelines needing the full response before processing are better served by blocking calls.

4. What does client.messages.stream() return?

Correct. It returns a MessageStream context manager, used with a with block.

It returns a MessageStream context manager — use with client.messages.stream(...) as stream:.

5. Which property of MessageStream yields plain text strings as they arrive?

Correct. .text_stream is a generator that yields each text delta as a plain Python string.

.text_stream is the property that yields plain text strings as tokens arrive.

6. After a stream ends, how do you access the complete Message object including usage stats?

Correct. get_final_message() blocks until the stream ends and returns the complete Message object with stop_reason and usage.

Use stream.get_final_message() — it blocks until stream end and returns the complete Message.

7. What is the primary purpose of using a with block for streaming?

Correct. The context manager calls stream.close() in __exit__, preventing connection leaks if an exception occurs mid-stream.

The with block ensures stream.close() is called in __exit__, preventing connection leaks on exceptions.

8. Which event type is emitted first in every streaming response?

Correct. message_start is always first — it carries the message ID, model, and initial input token count.

message_start is emitted first, carrying the message ID, model name, and input token count.

9. In a streaming tool use response, what delta type carries partial tool input JSON?

Correct. content_block_delta events during tool use carry delta.type = "input_json_delta" with partial JSON strings.

Tool use deltas use delta.type = "input_json_delta" to carry incremental JSON fragments.

10. Which event carries stop_reason and final output token count?

Correct. message_delta carries delta.stop_reason and usage.output_tokens — the penultimate event before message_stop.

message_delta is the penultimate event — it carries stop_reason and output_tokens.

11. When should you iterate the stream's typed events instead of using .text_stream?

Correct. Iterate over the stream's typed event objects when you need metadata, stop_reason, or to handle multiple block types (text + tool_use).

Iterate over the stream's typed events when you need metadata, stop_reason tracking, or tool use handling.

12. Why is immediately retrying after a mid-stream error dangerous?

Correct. Tokens already forwarded to the user cannot be recalled — retrying from scratch sends them duplicate content.

Content already forwarded cannot be unsent — a fresh retry produces duplicate content for the user.

13. Which Nginx directive must be disabled to allow streamed tokens to reach the browser immediately?

Correct. Set proxy_buffering off; in Nginx to prevent it from batching streamed tokens before forwarding.

proxy_buffering off; in Nginx is required — otherwise the proxy batches tokens before forwarding them to the browser.

14. What Anthropic client class should you use for non-blocking streaming in FastAPI?

Correct. AsyncAnthropic integrates natively with Python's async event loop — use async with and async for.

Use anthropic.AsyncAnthropic with async with / async for to avoid blocking FastAPI's event loop.

15. What field on content_block_delta events identifies which content block the delta belongs to?

Correct. event.index is an integer (0, 1, 2…) identifying which content block the delta belongs to — essential for responses with multiple blocks.

Use event.index — an integer identifying which content block (0, 1, 2…) a delta belongs to.