When GitHub Copilot launched in June 2021, its internal benchmarks showed that users perceived the tool as significantly faster than a non-streaming competitor β even when total generation time was nearly identical. The difference was latency to first token: seeing code appear character by character engaged users rather than leaving them staring at a spinner. The psychological effect of progressive rendering is well-documented in UX research; streaming is one of the most direct ways to exploit it in LLM applications.
In a standard (non-streaming) API call, the client sends a request and then waits until the entire response has been generated server-side before receiving a single byte of content. For a 500-token response at typical generation speed, that can mean 3β8 seconds of silence.
Streaming inverts this: the server begins transmitting tokens to the client as they are generated, using the Server-Sent Events (SSE) protocol over HTTP. The client reads an open connection, processing each incremental chunk as it arrives. The total wall-clock time to generate all tokens is the same β but the user sees the first word in under a second.
Client sends request β server generates all tokens β server sends complete JSON β client renders. Median perceived wait: full generation time (2β8 s for typical responses).
Client sends request β server sends first token (~200 ms) β subsequent tokens stream in β client renders progressively. Median perceived wait: time-to-first-token only.
Anthropic's streaming implementation uses SSE, a simple HTTP/1.1 mechanism. The connection stays open; the server writes newline-delimited event frames. Each frame looks like:
The SDKs abstract all of this away, but understanding the wire format helps when debugging network logs or implementing streaming in environments where you call the HTTP API directly.
Streaming is not always the right choice. Use it when the response will be displayed progressively to a human user β chat interfaces, document editors, code completions. Avoid it when you need the complete text before doing anything useful with it (batch processing pipelines, structured-data extraction that requires the full JSON, automated test harnesses). Streaming adds connection management complexity; only pay that cost when UX demands it.
Vercel's AI SDK team published internal latency data in 2023 showing that streaming reduced perceived response time by 60β70% in user studies, even when actual token generation time was unchanged. The metric that matters to users is time-to-first-meaningful-content, not total response time.
This assistant specialises in streaming fundamentals β TTFT, SSE protocol, when to stream vs. block. Ask it anything from Lesson 1.
stream=True actually return, and how do you safely consume it?stream ParameterIn Anthropic's Python SDK, enabling streaming requires passing stream=True to client.messages.create(). This changes the return type from a Message object to a MessageStream context manager.
Notice that client.messages.stream() β not client.messages.create(stream=True) β is the idiomatic approach in the current SDK. The stream() method returns a context manager that ensures the underlying HTTP connection is closed properly when the block exits, even if an exception occurs.
The SDK exposes three main iteration interfaces on a MessageStream:
| Property / Method | What You Get | Best Used For |
|---|---|---|
| .text_stream | Iterator of raw text strings as they arrive | Streaming text directly to a terminal or UI |
| .stream_events() | Iterator of typed event objects (MessageStartEvent, ContentBlockDeltaEvent, etc.) | Fine-grained control; inspecting metadata mid-stream |
| .get_final_message() | Complete Message object after stream ends | Accessing stop_reason, usage tokens after streaming |
Token usage and stop_reason are only available after the stream is complete, in the final message_delta event. The SDK's .get_final_message() method collects these for you:
Always use with blocks (context managers) for streaming. HTTP streaming keeps a TCP connection open. If your code raises an exception mid-stream and you haven't used a context manager, the connection leaks. The MessageStream context manager calls stream.close() in its __exit__, guaranteeing cleanup.
The client.messages.stream() interface was introduced in anthropic-sdk-python 0.18.0 (released March 2024). Earlier versions used create(stream=True) which returned a raw iterator without the convenience helpers. If you see older code using stream=True, the .stream() context manager is the current recommended pattern.
client.messages.stream() return in the Anthropic Python SDK?client.messages.stream() returns a MessageStream context manager, which should be used with a with block.MessageStream context manager β use it with with client.messages.stream(...) as stream:.usage tokens and stop_reason available during a stream?message_delta event at the very end of the stream. Use .get_final_message() to access them conveniently.message_delta event.with block (context manager) when streaming?stream.close() in __exit__, preventing connection leaks if your code raises an exception mid-stream.stream.close() β preventing resource leaks on exceptions.This assistant specialises in the Anthropic Python SDK's streaming interface β MessageStream, text_stream, get_final_message(), and context managers. Ask it anything from Lesson 2.
Anthropic's streaming API emits a deterministic sequence of event types. Understanding this lifecycle is essential when you need to react to partial data, track progress, or handle tool use mid-stream. The sequence for a simple text response is:
type: "text_delta" and a text field with the new content. This is the hot path β most events in a stream are this type.stop_reason, stop_sequence, and cumulative output_tokens usage..stream_events()Use .stream_events() when you need access to metadata beyond raw text β for example, to track when a new content block starts, or to capture the stop reason without waiting for get_final_message():
When Claude uses a tool during a streaming response, the event sequence includes additional content_block_start events with type: "tool_use", and content_block_delta events with type: "input_json_delta" carrying partial JSON for the tool input. This means a single stream can interleave multiple content blocks β always use the index field to track which block a delta belongs to.
| Event Type | Key Fields | Count per Response |
|---|---|---|
| message_start | message.id, message.model, usage.input_tokens | 1 |
| content_block_start | index, content_block.type | 1 per block |
| content_block_delta | index, delta.type, delta.text / delta.partial_json | Many (most events) |
| content_block_stop | index | 1 per block |
| message_delta | delta.stop_reason, usage.output_tokens | 1 |
| message_stop | (none) | 1 |
For simple text streaming, .text_stream is sufficient and much cleaner. Use .stream_events() only when you need metadata (stop_reason, usage, block index) or when handling tool use, where the delta types change mid-stream.
content_block_delta is emitted once per token batch throughout generation β the vast majority of events in any stream are this type.content_block_delta dominates β it fires once per token batch throughout the entire generation.content_block_delta event belongs to?index field identifies which content block (0, 1, 2β¦) a delta belongs to β essential when a response includes multiple blocks like text + tool_use.event.index β it identifies which content block (0, 1, 2β¦) the delta is part of.stop_reason become available in the event stream?stop_reason is in the message_delta event β the penultimate event before message_stop.stop_reason appears in the message_delta event β emitted once, near the end, just before message_stop.This assistant specialises in the streaming event lifecycle β message_start, content_block_delta, message_delta, tool use blocks, and the index field. Ask it anything from Lesson 3.
In December 2023, several teams building on the OpenAI API reported production incidents where streaming responses silently truncated due to proxy timeouts, sending partial JSON to downstream parsers and causing cascading failures. The same risk exists with any streaming LLM API. The pattern that emerged β tracking accumulated text, catching stream exceptions separately from connection setup exceptions, and using idempotent retry logic β became standard practice across the ecosystem.
Streaming introduces two distinct failure windows that don't exist in blocking calls:
Occur before the first event arrives: authentication failures, invalid parameters, rate limit rejections. These raise APIStatusError subclasses (AuthenticationError, RateLimitError, etc.) exactly like non-streaming calls. Safe to retry or surface to the user immediately.
Occur after tokens have started flowing: network interruptions, proxy timeouts, server errors mid-generation. These raise exceptions inside the for loop or context manager. Partial content may already have been sent to the user β retrying from scratch means duplicate content.
When your backend streams from Anthropic and forwards tokens to a browser, the standard pattern is to use SSE from your server to the browser as well. This creates a two-segment stream pipeline: Anthropic β your server β browser. Key considerations:
Content-Type: text/event-stream on your server response so browsers recognise it as SSE.proxy_buffering off;). Without this, your proxy will batch tokens before forwarding, defeating streaming's UX benefit.[DONE] sentinel (or use your own protocol) when the stream ends, so the browser knows to close the EventSource connection.AsyncAnthropicIn async Python frameworks (FastAPI, Starlette, aiohttp), use AsyncAnthropic and async with / async for:
β Always use context managers for connection safety. β‘ Catch pre-stream and mid-stream errors separately. β’ Accumulate sent tokens for logging/recovery. β£ Disable proxy buffering in Nginx/Apache. β€ Use AsyncAnthropic in async frameworks to avoid blocking the event loop. β₯ Set a reasonable max_tokens β an open-ended stream can hold connections for minutes.
off to prevent proxy buffering from defeating streaming?proxy_buffering off; in Nginx forces it to forward chunks immediately rather than batching the entire response before forwarding.proxy_buffering off; in Nginx β without this, the proxy batches streamed tokens and the user won't see them until the buffer fills.AsyncAnthropic with async with and async for integrates natively with FastAPI's async event loop without blocking.anthropic.AsyncAnthropic with async for β the sync client would block FastAPI's event loop.This assistant specialises in production streaming patterns β error handling, mid-stream failures, async streaming with FastAPI, proxy configuration, and retry safety. Ask it anything from Lesson 4.
client.messages.stream() return?with block for streaming?stop_reason and final output token count?.stream_events() instead of .text_stream?content_block_delta events identifies which content block the delta belongs to?