In March 2023, when Anthropic's Claude API entered broad beta access, teams integrating it into customer-facing products discovered the same lesson that AWS, Stripe, and OpenAI customers had learned before them: rate limits are not edge cases. They are the normal operating condition of any successful API deployment. The teams that shipped stable products were those who had built retry logic from day one, not as an afterthought.
When you exceed Anthropic's rate limits, the API returns HTTP 429 Too Many Requests. Two distinct limits can trigger this: requests per minute (RPM) and tokens per minute (TPM). Both count against your tier independently. A large-context request can exhaust your TPM budget long before you hit your RPM ceiling.
The response includes a retry-after header indicating how many seconds to wait. Ignoring this header and retrying immediately is the most common mistake β it deepens the rate-limit hole rather than climbing out of it.
The correct pattern is exponential backoff with jitter. Each successive retry waits roughly twice as long as the previous, with random jitter added to prevent thundering-herd synchronization when many clients retry at the same time. AWS documented this pattern in 2015 after large-scale DynamoDB throttling events; it remains the industry standard.
Reactive retries are necessary but insufficient. The Anthropic API returns rate limit headers on every response, not just on 429s. Reading x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens lets you slow down before you hit the wall, rather than after. This is particularly valuable for batch processing pipelines where you control the request cadence.
The Anthropic Python SDK (v0.18+) includes built-in retry logic via the max_retries parameter on client initialization. For simple use cases, anthropic.Anthropic(max_retries=3) is sufficient. Build custom logic only when you need header-aware throttling or per-request retry policies.
retry-after header tells you exactly how many seconds to wait before the server will accept requests again. Using it prevents unnecessary additional 429s.retry-after header gives you the precise wait time you need.x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens on every response. Monitoring these lets batch pipelines voluntarily slow down before hitting zero, avoiding 429s entirely.You're building a batch document-processing pipeline that will send hundreds of requests to the Anthropic API. Work through the rate-limiting design decisions with your AI coach β ask about header inspection, backoff tuning, and how to handle partial batch failures.
In 2023, several early LLM startups discovered that their unit economics were fatally broken only after reaching scale. Systems built without per-request token budgets, caching, or prompt optimization were spending $0.40β$2.00 per user query on API costs alone β amounts that no subscription tier could support. The companies that survived built token accounting into their architecture from the start, treating each API call as a metered resource, not a free function call.
Anthropic charges separately for input tokens (your prompt) and output tokens (the model's response). As of Claude 3.5 Sonnet, input tokens cost roughly 3Γ less per million than output tokens. This asymmetry has significant architectural implications: verbose system prompts that repeat on every request are expensive; prompts that elicit long model responses are even more so.
Every API response includes a usage object with input_tokens and output_tokens. Logging these on every call is the minimum viable cost-tracking practice.
The max_tokens parameter in the API request is your primary output cost control. Setting it too high for tasks that only need short answers wastes money; setting it too low truncates responses. The right value depends on the task: classification needs 50 tokens, code generation may need 2000, document summarization typically lands between 300 and 600.
A common mistake is setting max_tokens=4096 globally. Instead, set per-task defaults and allow overrides only when justified.
Anthropic introduced prompt caching (Beta) in 2024, allowing you to mark large, static portions of your prompt β such as a lengthy system prompt, a document, or a toolset definition β with a cache_control marker. Subsequent requests that share that prefix pay only a fraction of the normal input cost. In workloads where the same large context is reused across many requests, this can reduce input token costs by 80β90%.
Build a per-session cost accumulator that raises a warning β or hard-stops β when a single conversation exceeds a threshold. Without this, adversarial or poorly-designed prompts can run up costs that dwarf typical usage. Stripe's API team reported that unbounded generation was the most common source of billing surprises in their LLM integrations.
Your company's Claude integration is burning through tokens faster than expected. You're seeing $800/day in API costs against a budget of $200. Work with your coach to identify the likely sources of waste and design a token-budgeting strategy.
When ChatGPT launched in November 2022, the streaming token-by-token display wasn't just a cosmetic choice. OpenAI's internal testing had shown that users rated the same response as higher quality when it arrived as a stream rather than as a single block after a delay β even when total latency was identical. Anthropic adopted the same pattern. Every production Claude integration should use streaming for any user-facing output.
Streaming uses server-sent events (SSE) to deliver tokens as they're generated. The Anthropic SDK provides a high-level streaming interface. The critical difference from non-streaming: your application receives the first token in 1β3 seconds, then continues receiving output until the model stops β rather than waiting for the entire response to generate before receiving anything.
Without timeouts, a hung connection can block a thread or coroutine indefinitely. The Anthropic SDK uses a default timeout of 600 seconds β appropriate for very long completions but catastrophic if applied to a user-facing chatbot. Set timeouts appropriate to your use case:
For concurrent processing β such as enriching records in a pipeline or handling multiple user sessions β the async SDK client allows many requests to run concurrently without multi-threading complexity. The async client is a drop-in counterpart: anthropic.AsyncAnthropic().
When building custom streaming parsers (for non-SDK languages or WebSocket relay), understanding the raw SSE event sequence is essential. The key events are message_start, content_block_start, content_block_delta, content_block_stop, message_delta, and message_stop. Token text arrives in content_block_delta events with delta.type == "text_delta".
Streaming complicates retry logic. If a stream fails mid-response, you cannot resume from the interruption β you must restart from scratch. For long generations, this means either accepting partial loss or buffering the entire response (defeating streaming's purpose for latency). Most production systems accept the tradeoff: stream for UX, retry the full request on failure, surface an error to users only on exhausted retries.
You're building two things: a user-facing chatbot that must feel responsive, and a nightly batch job that processes 1,000 documents. Work with your coach to design the streaming and async patterns for both use cases.
In June 2023, Notion's AI writing feature experienced degraded availability when an upstream model API returned higher-than-normal error rates during a regional incident. Teams that had built graceful degradation β showing users a "AI temporarily unavailable, try again" message while logging the event β retained user trust. Those whose systems surfaced raw API errors or hung indefinitely did not. The operational lesson: every LLM integration should assume the model API will be unavailable, slow, or wrong at some point.
The minimum viable telemetry for a production Claude integration captures: request metadata, response metadata, latency, token usage, error classification, and a correlation ID that threads through your entire stack. This gives you cost accounting, latency dashboards, error-rate alerting, and the ability to reconstruct any problematic interaction.
Not all errors warrant the same response. The Anthropic SDK exposes a typed exception hierarchy that lets you handle each case appropriately:
A production system should never surface a raw API error to an end user. At minimum, define fallback behavior for each error class. Common patterns include: returning a cached previous response, routing to a simpler model, returning a canned "service temporarily unavailable" message, or queuing the request for retry and notifying the user asynchronously.
The Anthropic API may return a stop_reason of "max_tokens" (truncated) or trigger a refusal. Always check stop_reason. A truncated response from a classification task may return an incomplete JSON object that crashes your parser. A content-moderated response may return an empty content array. Production code must handle both cases explicitly rather than assuming a valid text response always arrives.
Before going to production: β All requests log correlation IDs. β Token usage tracked per request. β Error types classified and handled distinctly. β Fallback responses defined for each error class. β Alerts configured on error rate and p95 latency. β stop_reason checked on every non-streaming response. β Streaming failures trigger full retry, not partial display.
Your Claude-powered feature is going to production next week. Your SRE team is asking what dashboards, alerts, and runbooks you'll have in place. Work through the observability design with your coach β what to log, what alerts to set, and how the system behaves when the API goes down.