When Anthropic opened Claude API access broadly after the GPT-4 launch wave, demand spiked faster than anticipated. Several production teams building on the API reported sustained 429 storms — every request rejected for minutes at a time — because they had not accounted for the input-token dimension of rate limits. They optimised for request count but ignored that a single long-context call could consume the entire per-minute token budget. The incident became a reference case in Anthropic's developer documentation updates in early 2024.
Anthropic enforces limits on three independent axes simultaneously. Breaching any one triggers a 429:
Requests Per Minute (RPM) — the raw count of API calls in any rolling 60-second window.
Input Tokens Per Minute (ITPM) — the sum of prompt tokens across all requests in any 60-second window. A single large-context request can exhaust this budget even if RPM is fine.
Output Tokens Per Minute (OTPM) — the sum of completion tokens generated. This is often the binding constraint for streaming pipelines that request long completions.
A fourth constraint, Tokens Per Day (TPD), applies cumulatively and resets at midnight UTC. Exceeding it also returns a 429, but the error message identifies daily exhaustion rather than a per-minute limit, and requests keep failing until the reset.
Anthropic's published Tier 1 limits for claude-3-5-sonnet-20241022: 50 RPM · 40,000 ITPM · 8,000 OTPM · 1,000,000 TPD. Tier 4 (high-spend) raises these to 4,000 RPM · 400,000 ITPM · 80,000 OTPM · 5,000,000 TPD. Limits are per-key, not per-account, so multiple keys on the same account each carry their own quota.
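To check whether a planned workload fits a tier, multiply the request rate by average token counts and compare each axis against its ceiling. The sketch below uses the Tier 1 figures quoted above purely as illustrative values; the `TierLimits` class, the `binding_constraints` helper, and the assumption that TPD counts input plus output tokens are hypothetical, not part of any Anthropic API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimits:
    rpm: int
    itpm: int
    otpm: int
    tpd: int


# Tier 1 figures as quoted in this lesson; treat them as illustrative.
TIER_1 = TierLimits(rpm=50, itpm=40_000, otpm=8_000, tpd=1_000_000)


def binding_constraints(limits, requests_per_min, avg_input_tokens,
                        avg_output_tokens, active_minutes_per_day):
    """Return the axes the planned workload would exceed (empty list = fits)."""
    input_per_min = requests_per_min * avg_input_tokens
    output_per_min = requests_per_min * avg_output_tokens
    # Assumption: the daily budget counts input and output tokens together.
    tokens_per_day = (input_per_min + output_per_min) * active_minutes_per_day
    usage = {
        "RPM": (requests_per_min, limits.rpm),
        "ITPM": (input_per_min, limits.itpm),
        "OTPM": (output_per_min, limits.otpm),
        "TPD": (tokens_per_day, limits.tpd),
    }
    return [axis for axis, (used, cap) in usage.items() if used > cap]


# 20 requests/min with 3,000-token prompts stays under RPM but breaches ITPM and TPD.
print(binding_constraints(TIER_1, 20, 3_000, 300, 60))
```

The example shows the failure mode described at the start of this lesson: the request count looks safe (20 of 50 RPM), yet the input-token axis is the one that binds.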
Anthropic uses a spend-gated tier system. Tier 1 is granted on account creation with a valid payment method. Tier 2 requires $40 in cumulative spend and seven days since first use. Tier 3 requires $200 and 14 days. Tier 4 requires $4,000 and 30 days. There is no application process — the system promotes automatically when both conditions are met.
Usage tier information is visible in the Anthropic Console → Settings → Limits panel and is also returned in response headers (see Lesson 2). Some high-volume enterprise customers negotiate custom limits directly; those are outside the tier system.
Every successful API response carries six rate-limit headers that show the state of your quota after the request was applied:
| Header | Meaning |
|---|---|
| x-ratelimit-limit-requests | Your RPM ceiling |
| x-ratelimit-limit-tokens | Your combined token-per-minute ceiling |
| x-ratelimit-remaining-requests | Requests still available this minute |
| x-ratelimit-remaining-tokens | Tokens still available this minute |
| x-ratelimit-reset-requests | ISO 8601 timestamp when RPM counter resets |
| x-ratelimit-reset-tokens | ISO 8601 timestamp when token counter resets |
These headers are your primary tool for building proactive rate control: slowing down before you hit the ceiling rather than reacting to 429s. Reading them on every response costs nothing beyond a header lookup and can remove much of the retry logic you would otherwise need.
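As a minimal sketch of reading these headers, the helper below converts them into typed values and computes remaining headroom. It assumes the header names listed in the table above, a plain mapping with lowercase keys, and ISO 8601 reset timestamps that include a UTC offset; `parse_rate_limit_headers` and `headroom_fraction` are hypothetical names. Confirm the exact header names against a live response before relying on them.

```python
from datetime import datetime

# Header names are taken from the table above; confirm them against live responses.
_PREFIX = "x-ratelimit-"


def parse_rate_limit_headers(headers):
    """Turn the six rate-limit headers into typed values (lowercase keys assumed)."""
    state = {}
    for axis in ("requests", "tokens"):
        state[f"{axis}_limit"] = int(headers[f"{_PREFIX}limit-{axis}"])
        state[f"{axis}_remaining"] = int(headers[f"{_PREFIX}remaining-{axis}"])
        # Older Python versions reject a trailing "Z" in fromisoformat(), so normalise it.
        reset = headers[f"{_PREFIX}reset-{axis}"].replace("Z", "+00:00")
        state[f"{axis}_reset"] = datetime.fromisoformat(reset)
    return state


def headroom_fraction(state, axis):
    """Fraction of the per-minute budget still available for 'requests' or 'tokens'."""
    return state[f"{axis}_remaining"] / state[f"{axis}_limit"]
```

A caller might slow down once `headroom_fraction(state, "tokens")` drops below a chosen threshold such as 0.1, rather than waiting for the first 429.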
The token counters count input tokens at request submission time using estimated values, then reconcile against actual output tokens at completion. This means a request can be accepted at submission but still exhaust a budget mid-stream on very long outputs. Design your per-request max_tokens defensively — lower values leave more headroom for concurrent calls.
Practice discussing Anthropic's rate limit architecture. Ask the assistant about the differences between RPM, ITPM, OTPM, and TPD — or describe a hypothetical scenario and ask how you'd calculate whether your workload fits within a given tier.
In February 2024, multiple developers reported on the Anthropic Discord and GitHub that their SDKs were retrying 400 Bad Request errors indefinitely, burning their daily token budget on requests that could never succeed. The root cause was generic retry logic that treated all non-200 responses as transient. Anthropic responded by publishing explicit guidance distinguishing retryable from non-retryable errors, and the Python SDK was updated to stop retrying non-retryable 4xx errors by default.
Anthropic errors follow standard HTTP semantics but add a structured JSON body with type, error.type, and error.message fields that carry fine-grained information:
| HTTP | error.type | Retryable? | Common Cause |
|---|---|---|---|
| 400 | invalid_request_error | No | Malformed JSON, missing required field, invalid model ID |
| 401 | authentication_error | No | Invalid or missing API key |
| 403 | permission_error | No | Key lacks access to requested model or feature |
| 404 | not_found_error | No | Endpoint path typo, deprecated model string |
| 413 | request_too_large | No | Prompt exceeds model's context window |
| 429 | rate_limit_error | Yes — with backoff | RPM, ITPM, OTPM, or TPD ceiling hit |
| 500 | api_error | Yes — limited | Anthropic-side error; retry up to 2–3 times |
| 529 | overloaded_error | Yes — with backoff | Anthropic infrastructure under heavy load |
The structured error body gives you an exact diagnosis without guessing from the status code alone.
The message field differentiates a per-minute rate limit from a per-day exhaustion — both arrive as 429 but your handling differs: per-minute errors should back off and retry; per-day exhaustion should halt all requests until midnight UTC and alert the on-call engineer.
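The sketch below illustrates that distinction by parsing an error body with the `type`, `error.type`, and `error.message` fields described above. The sample message strings and the keyword match on "day" or "daily" are assumptions made for illustration; the real wording may differ, so check actual responses before keying logic on message text. `classify_429` is a hypothetical helper.

```python
import json

# Hypothetical bodies: the field layout follows this lesson, the messages are invented.
PER_MINUTE_BODY = json.dumps({
    "type": "error",
    "error": {"type": "rate_limit_error",
              "message": "Input tokens per minute limit exceeded."},
})
PER_DAY_BODY = json.dumps({
    "type": "error",
    "error": {"type": "rate_limit_error",
              "message": "Tokens per day limit exceeded."},
})


def classify_429(body_text):
    """Return the handling strategy for a 429 body: back off, or halt until reset."""
    error = json.loads(body_text)["error"]
    if error["type"] != "rate_limit_error":
        raise ValueError(f"not a rate limit error: {error['type']}")
    message = error["message"].lower()
    # Assumption: daily exhaustion is recognisable from the message wording.
    if "per day" in message or "daily" in message:
        return "halt_until_reset"
    return "backoff_and_retry"


print(classify_429(PER_MINUTE_BODY))
print(classify_429(PER_DAY_BODY))
```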
Never retry: 400, 401, 403, 404, 413. These are deterministic failures. Retrying wastes quota and signals a bug in your code or configuration, not a transient infrastructure issue.
Retry with exponential backoff: 429 (per-minute) and 529. Parse the retry-after header if present; otherwise use the reset timestamp from the rate-limit headers to compute the exact wait time.
Retry conservatively (2–3 attempts max): 500. Anthropic's servers occasionally return transient 500s. Two or three retries with a short delay between them is usually enough; if the error persists, surface it to the caller rather than continuing to retry.
Special case — 429 TPD: Do not retry. Stop all requests for the remainder of the day and notify stakeholders. Add TPD headroom monitoring to prevent recurrence.
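The rules above can be condensed into a single decision function. This is a sketch under stated assumptions: `retry_decision` and its return values are hypothetical, the per-status retry ceilings are picked from the ranges this lesson suggests (the ceiling for 529 is an assumption), and the caller is expected to have already determined whether a 429 signals daily exhaustion.

```python
# Deterministic failures: retrying cannot change the outcome.
NEVER_RETRY = frozenset({400, 401, 403, 404, 413})

# Retry ceilings per status. 429 and 500 follow the ranges in this lesson;
# the value for 529 is an assumption.
MAX_RETRIES = {429: 5, 529: 5, 500: 3}


def retry_decision(status, retries_so_far, daily_limit_hit=False):
    """Return 'retry', 'halt', or 'fail' for a failed response."""
    if status in NEVER_RETRY:
        return "fail"
    if status == 429 and daily_limit_hit:
        # Daily exhaustion: stop all traffic instead of retrying.
        return "halt"
    if retries_so_far < MAX_RETRIES.get(status, 0):
        return "retry"
    # Either the retry budget is spent or the status is unknown.
    return "fail"
```

Unknown status codes fall through to "fail" on purpose, so that new error types are surfaced rather than silently retried.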
In anthropic-sdk-python v0.25 and later, the SDK automatically retries 429 and 529 responses up to two times using exponential backoff, and it respects the retry-after header. It does not retry deterministic client errors such as 400, 401, 403, 404, or 413. You can set max_retries=0 for full manual control, or increase it if your workload tolerates longer waits.
Set max_retries at instantiation, when constructing the Anthropic client: Anthropic(max_retries=0) disables automatic retries entirely for full manual control, and client = Anthropic(api_key=..., max_retries=3) raises the retry count to three.
Describe error scenarios to the assistant and work through the correct diagnosis and handling strategy. Practice distinguishing retryable from non-retryable errors and understanding what each error type signals about your code or the API's state.
The "thundering herd" problem — thousands of clients retrying a failed service simultaneously at fixed intervals — has been documented in production incidents at AWS, Google, and Netflix. When a brief API outage recovers, all waiting clients fire at exactly the same moment, causing an immediate second overload. Jitter — randomising the retry delay — was formalised as a solution in AWS's 2015 blog post "Exponential Backoff and Jitter" by Marc Brooker, which remains the canonical reference. Anthropic's own SDK retry implementation uses full-jitter as described in that post.
Fixed delay: Wait the same duration every time (e.g., 1 second). Simple but creates thundering herds when many clients fail simultaneously. Acceptable for rare, isolated errors. Never use for 429s under load.
Exponential backoff (no jitter): Wait base × 2^attempt. Clients spread out over time but any group of clients that failed at the same moment will still retry at exactly the same times, causing secondary load spikes at each doubling interval.
Exponential backoff with full jitter: Wait random(0, base × 2^attempt). Clients that failed together retry at uniformly random points within the growing window. This is the AWS-recommended "full jitter" approach and what the Anthropic Python SDK uses internally.
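The three strategies can be compared side by side in a few lines. This is a minimal sketch: the function names are hypothetical, and the `cap` parameter (an upper bound on the delay) is an addition not mentioned above, included because uncapped exponential growth quickly becomes impractical.

```python
import random


def fixed_delay(attempt, base=1.0):
    """Same wait every time, regardless of attempt number."""
    return base


def exponential_delay(attempt, base=1.0, cap=60.0):
    """base * 2^attempt, bounded by cap; identical for all clients that failed together."""
    return min(cap, base * 2 ** attempt)


def full_jitter_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Uniformly random wait between 0 and the exponential delay for this attempt."""
    return rng.uniform(0.0, exponential_delay(attempt, base, cap))


for attempt in range(4):
    print(attempt, fixed_delay(attempt), exponential_delay(attempt),
          round(full_jitter_delay(attempt), 2))
```

The `rng` parameter exists so that a seeded `random.Random` instance can be passed in for reproducible tests.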
When Anthropic includes a retry-after header (numeric seconds) or retry-after-ms (milliseconds), your minimum wait must be at least that long. Apply your jitter on top of the server-specified minimum, not instead of it.
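One way to combine the server-specified floor with jitter is sketched below. It assumes the headers arrive as a plain mapping with lowercase keys and that retry-after holds numeric seconds, as described above; `server_floor_seconds` and `wait_seconds` are hypothetical helper names, and the `cap` parameter is an added assumption.

```python
import random


def server_floor_seconds(headers):
    """Return the server-specified minimum wait in seconds, or 0.0 if absent."""
    # Prefer the millisecond header when both are present, since it is more precise.
    if "retry-after-ms" in headers:
        return float(headers["retry-after-ms"]) / 1000.0
    if "retry-after" in headers:
        return float(headers["retry-after"])
    return 0.0


def wait_seconds(headers, attempt, base=1.0, cap=60.0, rng=random):
    """Server floor plus full jitter, so the wait is never shorter than the floor."""
    floor = server_floor_seconds(headers)
    jitter = rng.uniform(0.0, min(cap, base * 2 ** attempt))
    return floor + jitter


print(wait_seconds({"retry-after": "5"}, attempt=2))
```

Because the jitter is added to the floor rather than replacing it, clients still spread out after the server-mandated pause instead of all resuming at the same instant.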
Even well-designed retry logic needs a ceiling. A retry loop without a maximum retry count can turn a brief outage into a request backlog that grows without bound. Standard practice is 3–5 retries for 429s and 2–3 for 500s. Beyond that, surface the error to the caller — they need the opportunity to degrade gracefully, queue the work, or alert an operator.
In high-volume systems, consider a circuit breaker: after N consecutive failures, stop sending requests entirely for a fixed window (e.g., 30 seconds). This prevents your application from hammering a struggling API and leaves headroom for recovery. Retry libraries like tenacity (Python) and p-retry (Node.js) handle backoff and attempt limits; the breaker itself usually comes from a dedicated library such as pybreaker (Python) or opossum (Node.js), or from a small in-house class.
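A circuit breaker of the kind described here can be sketched in a small class. This version is deliberately simplified: it has no half-open probing state, it simply closes again once the cooldown elapses. The `CircuitBreaker` class and its parameters are hypothetical, and the injectable `clock` exists only to make the behaviour testable without sleeping.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures and blocks requests for a fixed cooldown."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the circuit and let traffic through again.
            self._opened_at = None
            self._consecutive_failures = 0
            return True
        return False

    def record_success(self):
        # Any success breaks the run of consecutive failures.
        self._consecutive_failures = 0

    def record_failure(self):
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

In use, a caller checks `allow_request()` before each API call and reports the outcome with `record_success()` or `record_failure()`.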
For 429 per-minute errors, the x-ratelimit-reset-tokens and x-ratelimit-reset-requests headers give you the exact reset timestamp. Computing reset_time - now and sleeping that duration is more efficient than exponential backoff because you know exactly when capacity resumes. Add a small jitter (0–500ms) to avoid a thundering herd of clients resuming simultaneously.
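The reset-based wait can be computed as follows. This sketch assumes the reset header carries an ISO 8601 timestamp with a UTC offset (or a trailing "Z"); `seconds_until_reset` is a hypothetical helper, and the `now` and `rng` parameters are there so the calculation can be tested deterministically.

```python
import random
from datetime import datetime, timezone


def seconds_until_reset(reset_header, now=None, max_jitter=0.5, rng=random):
    """Seconds to sleep until the reset timestamp, plus 0 to max_jitter seconds of jitter."""
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalise it.
    reset_time = datetime.fromisoformat(reset_header.replace("Z", "+00:00"))
    if now is None:
        now = datetime.now(timezone.utc)
    # Clamp at zero in case the reset moment has already passed.
    remaining = max(0.0, (reset_time - now).total_seconds())
    return remaining + rng.uniform(0.0, max_jitter)


fake_now = datetime(2024, 11, 1, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_until_reset("2024-11-01T12:00:20Z", now=fake_now))
```

The result would typically be passed to `time.sleep()` or `asyncio.sleep()` before the next attempt.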
Work through retry strategy design problems with the assistant. Describe your system's requirements — concurrency level, acceptable latency, retry budget — and get feedback on your approach. You can also paste pseudocode or describe an existing retry implementation for review.
Several teams building document-processing pipelines on the Anthropic API reported that their throughput was far below theoretical limits because they were processing requests sequentially. When they moved to a token-budget-aware concurrency pool — tracking remaining tokens from response headers and dynamically adjusting the number of concurrent workers — throughput increased 8–12× while 429 rates fell to near zero. The pattern was discussed in Anthropic's developer forum in mid-2024 and subsequently referenced in third-party integration guides.
Most developers start with reactive rate control: send requests freely, handle 429s with backoff. This works but is inefficient — every 429 represents wasted latency and a request that made it to the server before being rejected.
Proactive rate control reads the rate-limit headers on each successful response and uses that information to govern when the next request fires. You never hit the ceiling because you're continuously tracking remaining headroom.
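The core pattern for high-throughput pipelines is a shared token-budget pool: each worker reserves its estimated tokens before sending a request, and every response's headers update the pool's view of the remaining budget.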
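A minimal sketch of such a token-budget pool, shared across worker threads, might look like this. The `TokenBudget` class and its methods are hypothetical, the header name follows the table earlier in this course, and the token estimate per request is assumed to come from the caller. In-flight reservations are tracked separately because the server's remaining count does not yet reflect requests that are still running.

```python
import threading


class TokenBudget:
    """Tracks per-minute token headroom, combining server counts with in-flight work."""

    def __init__(self, limit):
        self._server_remaining = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_reserve(self, estimated_tokens):
        """Reserve tokens for a request; return False if headroom is insufficient."""
        with self._lock:
            if estimated_tokens > self._server_remaining - self._in_flight:
                return False
            self._in_flight += estimated_tokens
            return True

    def settle(self, estimated_tokens, headers):
        """On completion, release the reservation and adopt the server's count."""
        with self._lock:
            self._in_flight -= estimated_tokens
            self._server_remaining = int(headers["x-ratelimit-remaining-tokens"])

    @property
    def available(self):
        with self._lock:
            return self._server_remaining - self._in_flight
```

A worker calls `try_reserve()` before sending; if it returns False, the worker waits for the reset timestamp instead of sending a request that would be rejected.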
For batch workloads, a request queue decouples task submission from API execution. Workers pull from the queue only when the token-budget pool signals headroom. This pattern smooths bursty workloads into steady API consumption:
1. Accept all tasks into an internal queue immediately (no blocking on API limits).
2. A dispatcher loop checks token headroom before dispatching each item.
3. If headroom is insufficient, the dispatcher sleeps until the reset timestamp.
4. Completed tasks post their header data back to update the budget pool.
This design never produces a 429 under normal operation because the dispatcher enforces limits before the API does.
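The four steps above can be sketched as a single-worker dispatcher loop. Everything here is a simplified assumption: `run_dispatcher` is a hypothetical function, `send` stands in for the API call and returns the remaining token count reported by the response headers, and `wait_for_reset` stands in for sleeping until the reset timestamp and returns the refreshed budget.

```python
from collections import deque


def run_dispatcher(tasks, budget_limit, send, wait_for_reset):
    """Dispatch (task_id, estimated_tokens) pairs without exceeding the token budget."""
    queue = deque(tasks)              # step 1: accept everything up front
    remaining = budget_limit
    completed = []
    while queue:
        task_id, estimated = queue[0]
        if estimated > budget_limit:
            # This task can never fit in one window; fail loudly instead of looping.
            raise ValueError(f"task {task_id!r} exceeds the per-minute budget")
        if estimated > remaining:     # steps 2 and 3: check headroom, wait if short
            remaining = wait_for_reset()
            continue
        queue.popleft()
        remaining = send(task_id)     # step 4: response headers update the budget
        completed.append(task_id)
    return completed
```

A production version would run several workers concurrently and share the budget between them, but the control flow stays the same: check headroom, wait if needed, dispatch, update.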
Rate limit incidents are difficult to debug after the fact without metrics. Emit these as time-series data:
| Metric | Alert Threshold | Why It Matters |
|---|---|---|
| 429 rate (per minute) | >5% of requests | Indicates reactive control is failing; switch to proactive |
| x-ratelimit-remaining-tokens (min per minute) | <10% of limit | Early warning before hitting ceiling |
| Retry count per request (p95) | >2 retries | Signals sustained pressure, not transient spikes |
| Daily token consumption (% of TPD) | >80% by 18:00 UTC | Predicts end-of-day exhaustion before it happens |
| Time to first token (p99) | Spike >2× baseline | Backoff delays propagating to users |
Setting max_tokens conservatively is one of the most underused rate control levers. For a pipeline that generates short classifications, setting max_tokens=150 instead of max_tokens=1000 cuts the worst-case OTPM charge per request by 85% (150 / 1000 = 0.15), which translates directly into more concurrent requests under the same limits. Never leave max_tokens at the model maximum unless you genuinely need it.
The actual output token count is usually lower than max_tokens, for streaming and non-streaming responses alike. Your monitoring should track actual output tokens from the usage.output_tokens field in the response, not the cap you set.
Anthropic launched the Message Batches API in late 2024. Batch requests are processed asynchronously within a 24-hour processing window and carry a 50% price discount. Critically, batch requests consume from a separate quota pool: they do not count against your standard RPM/ITPM limits. For document processing, bulk classification, or any other non-realtime workload, the Batches API is usually the better tool and removes most per-minute rate limit concerns.
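The arithmetic behind the max_tokens lever is worth working through. The sketch below assumes the worst case, in which every request is charged its full max_tokens against the output budget, and uses the Tier 1 OTPM figure quoted earlier as an illustrative value; both helper names are hypothetical.

```python
def worst_case_requests_per_minute(otpm_limit, max_tokens):
    """How many requests fit in the output budget if each one uses its full cap."""
    return otpm_limit // max_tokens


def otpm_reduction(old_max_tokens, new_max_tokens):
    """Fractional drop in worst-case output tokens per request."""
    return 1 - new_max_tokens / old_max_tokens


OTPM_LIMIT = 8_000  # illustrative Tier 1 figure from this course

print(worst_case_requests_per_minute(OTPM_LIMIT, 1_000))  # 8
print(worst_case_requests_per_minute(OTPM_LIMIT, 150))    # 53
print(round(otpm_reduction(1_000, 150), 2))               # 0.85
```

Under these assumptions, lowering the cap from 1,000 to 150 raises the worst-case ceiling from 8 to 53 requests per minute on the output axis alone.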
Design or review production rate management architectures. Describe your system's API usage patterns — request volume, latency requirements, workload type — and get specific recommendations on concurrency models, queueing strategies, observability setup, and when to use the Batches API.