Module 6 · Lesson 1

Understanding Anthropic's Rate Limit Architecture

Tokens per minute, requests per minute, and the daily token ceiling — what they are and why they exist.

What exactly gets throttled when the API returns 429, and how does Anthropic decide your limits?

When Anthropic opened Claude API access broadly after the GPT-4 launch wave, demand spiked faster than anticipated. Several production teams building on the API reported sustained 429 storms — every request rejected for minutes at a time — because they had not accounted for the input-token dimension of rate limits. They optimised for request count but ignored that a single long-context call could consume the entire per-minute token budget. The incident became a reference case in Anthropic's developer documentation updates in early 2024.

Three Independent Rate-Limit Axes

Anthropic enforces limits on three independent axes simultaneously. Breaching any one triggers a 429:

Requests Per Minute (RPM) — the raw count of API calls in any rolling 60-second window.

Input Tokens Per Minute (ITPM) — the sum of prompt tokens across all requests in any 60-second window. A single large-context request can exhaust this budget even if RPM is fine.

Output Tokens Per Minute (OTPM) — the sum of completion tokens generated. This is often the binding constraint for streaming pipelines that request long completions.

A fourth constraint, Tokens Per Day (TPD), applies cumulatively and resets at midnight UTC. Exceeding it also returns a 429, but the error message identifies the daily ceiling rather than a per-minute limit, and no request can succeed until the reset.

Documented Tier Limits (as of late 2024)

Anthropic's published Tier 1 limits for claude-3-5-sonnet-20241022: 50 RPM · 40,000 ITPM · 8,000 OTPM · 1,000,000 TPD. Tier 4 (high-spend) raises these to 4,000 RPM · 400,000 ITPM · 80,000 OTPM · 5,000,000 TPD. Limits apply to the organization as a whole, not to individual API keys, so creating additional keys does not add quota.
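To make the interaction between the axes concrete, here is a minimal sketch that computes per-axis utilisation for a steady workload. The limit values are the Tier 1 figures quoted above. The names `axis_utilisation` and `breached_axes` are illustrative and not part of any Anthropic SDK, and the model assumes the worst case in which every request uses its full max_tokens.

```python
# Tier 1 per-minute limits as quoted in this lesson (illustrative constants).
TIER_1 = {"rpm": 50, "itpm": 40_000, "otpm": 8_000}


def axis_utilisation(requests_per_min, prompt_tokens, max_output_tokens, limits=TIER_1):
    """Return the fraction of each per-minute limit a steady workload would use.

    Worst case: assumes every request produces max_output_tokens of output.
    """
    return {
        "rpm": requests_per_min / limits["rpm"],
        "itpm": requests_per_min * prompt_tokens / limits["itpm"],
        "otpm": requests_per_min * max_output_tokens / limits["otpm"],
    }


def breached_axes(utilisation):
    """Names of axes whose utilisation exceeds 100% of the limit."""
    return sorted(axis for axis, used in utilisation.items() if used > 1.0)


if __name__ == "__main__":
    usage = axis_utilisation(requests_per_min=20, prompt_tokens=2_000, max_output_tokens=500)
    for axis, used in usage.items():
        print(f"{axis}: {used:.0%} of limit")
    print("breached:", breached_axes(usage))
```

For 20 requests per minute with 2,000-token prompts and up to 500 output tokens, RPM sits at 40%, ITPM is exactly at its ceiling, and worst-case OTPM is 25% over it, so the token axes bind long before request count does.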

How Tiers Are Assigned

Anthropic uses a spend-gated tier system. Tier 1 is granted on account creation with a valid payment method. Tier 2 requires $40 in cumulative spend and seven days since first use. Tier 3 requires $200 and 14 days. Tier 4 requires $4,000 and 30 days. There is no application process — the system promotes automatically when both conditions are met.
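The promotion rule (both conditions must hold) can be sketched as a small lookup. The thresholds below are the ones quoted in this lesson and should be checked against the current Console before relying on them; `current_tier` and `TIER_RULES` are hypothetical names used only for illustration.

```python
# (tier, minimum cumulative spend in USD, minimum days since first use),
# highest tier first. Thresholds are the ones quoted in this lesson.
TIER_RULES = [
    (4, 4_000, 30),
    (3, 200, 14),
    (2, 40, 7),
]


def current_tier(cumulative_spend_usd, days_since_first_use):
    """Return the highest tier whose spend AND age thresholds are both met."""
    for tier, min_spend, min_days in TIER_RULES:
        if cumulative_spend_usd >= min_spend and days_since_first_use >= min_days:
            return tier
    return 1  # Tier 1 is the starting tier
```

An account with $250 of spend but only 10 days of history stays at Tier 2 under these rules, because the 14-day condition for Tier 3 is not yet met.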

Usage tier information is visible in the Anthropic Console → Settings → Limits panel, and your current limits are also reported in the rate-limit response headers described below. Some high-volume enterprise customers negotiate custom limits directly; those are outside the tier system.

The Response-Header Clock

Every successful API response carries six rate-limit headers that show the state of your quota after the request was applied:

Header	Meaning
x-ratelimit-limit-requests	Your RPM ceiling
x-ratelimit-limit-tokens	Your combined token-per-minute ceiling
x-ratelimit-remaining-requests	Requests still available this minute
x-ratelimit-remaining-tokens	Tokens still available this minute
x-ratelimit-reset-requests	ISO 8601 timestamp when RPM counter resets
x-ratelimit-reset-tokens	ISO 8601 timestamp when token counter resets

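These headers are your primary tool for building proactive rate control: slowing down before you hit the ceiling rather than reacting to 429s. Reading them on every response adds negligible overhead and removes much of the need for reactive retry logic. Anthropic's API reference lists these headers under an anthropic-ratelimit- prefix (for example, anthropic-ratelimit-tokens-remaining), so confirm the exact names against a live response before hard-coding them.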
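As a rough illustration of reading these headers, the sketch below turns a plain dictionary of header values into a small snapshot object. The header names are the ones listed in the table above and should be treated as assumptions to verify against a real response. `RateLimitSnapshot`, `parse_timestamp`, and `snapshot_from_headers` are hypothetical helpers, not SDK functions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class RateLimitSnapshot:
    limit_tokens: int
    remaining_tokens: int
    reset_tokens: datetime

    @property
    def fraction_remaining(self):
        return self.remaining_tokens / self.limit_tokens

    def seconds_until_reset(self, now):
        # Never return a negative wait if the reset moment has already passed.
        return max(0.0, (self.reset_tokens - now).total_seconds())


def parse_timestamp(value):
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def snapshot_from_headers(headers):
    """Build a snapshot from a mapping of lower-cased header names to strings."""
    return RateLimitSnapshot(
        limit_tokens=int(headers["x-ratelimit-limit-tokens"]),
        remaining_tokens=int(headers["x-ratelimit-remaining-tokens"]),
        reset_tokens=parse_timestamp(headers["x-ratelimit-reset-tokens"]),
    )
```

A caller could slow down whenever `fraction_remaining` drops below a chosen threshold, and use `seconds_until_reset` (with a timezone-aware `now`) to decide how long to pause.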

Key Architecture Insight

The token counters count input tokens at request submission time using estimated values, then reconcile against actual output tokens at completion. This means a request can be accepted at submission but still exhaust a budget mid-stream on very long outputs. Design your per-request max_tokens defensively — lower values leave more headroom for concurrent calls.

RPM: Requests Per Minute. Raw call count enforced in a rolling 60-second window.

ITPM: Input Tokens Per Minute. Prompt token consumption summed across concurrent requests.

TPD: Tokens Per Day. Cumulative daily ceiling, resets at midnight UTC, produces a distinct error when exceeded.

Lesson 1 Quiz

Rate limit architecture — 4 questions

Which single factor can exhaust your per-minute budget even when your request count is very low?

Correct. ITPM is independent of RPM. One enormous prompt can consume the entire token budget for the minute even if you've only made a single request.

Not quite. The key insight from the case described at the start of this lesson is that token budgets (ITPM) and request budgets (RPM) are independent axes: a low request count is no protection against ITPM exhaustion from large contexts.

When does Anthropic automatically promote an account from Tier 2 to Tier 3?

Correct. Tier 3 requires $200 cumulative spend and 14 days account age. Both conditions must be met — promotion is automatic, no application needed.

Review the tier table. Tier 3 specifically requires $200 in spend and 14 days. $40 / 7 days are the Tier 2 thresholds.

What HTTP status code does Anthropic return when a daily token limit (TPD) is exceeded?

Correct. TPD exhaustion produces a 429, but with a specific error_type in the body indicating the daily ceiling was hit rather than a per-minute limit.

Both per-minute and per-day limits use 429. The distinction lies in the error body's type field, not the status code itself.

Which response header tells you the exact moment your per-minute token counter will reset?

Correct. x-ratelimit-reset-tokens is an ISO 8601 timestamp. Your code can parse it and schedule the next request to fire just after that moment.

x-ratelimit-remaining-tokens tells you how many tokens remain, not when the counter resets. The reset timestamp lives in x-ratelimit-reset-tokens.

Lab 1 — Rate Limit Architecture

Chat with your AI lab assistant about Anthropic's rate limit system

Your Task

Practice discussing Anthropic's rate limit architecture. Ask the assistant about the differences between RPM, ITPM, OTPM, and TPD — or describe a hypothetical scenario and ask how you'd calculate whether your workload fits within a given tier.

Starter: "My app makes 20 requests per minute, each with a 2,000-token prompt and requesting up to 500 tokens of output. Which Tier 1 limits am I closest to exhausting?"

Module 6 · Lesson 2

The Anthropic Error Taxonomy

Every error type the API can return, what causes it, and which require retrying versus fixing code.

Which errors should trigger automatic retry and which are permanent failures your code must handle differently?

In February 2024, multiple developers reported on the Anthropic Discord and GitHub that their SDKs were retrying 400 Bad Request errors indefinitely — burning their daily token budget on requests that could never succeed. The root cause was generic retry logic that treated all non-200 responses as transient. Anthropic responded by publishing explicit guidance distinguishing retryable from non-retryable errors, and the Python SDK was updated to stop retrying 400-class errors by default.

The Complete Error Taxonomy

Anthropic errors follow standard HTTP semantics but add a structured JSON body with type, error.type, and error.message fields that carry fine-grained information:

HTTP	error.type	Retryable?	Common Cause
400	invalid_request_error	No	Malformed JSON, missing required field, invalid model ID
401	authentication_error	No	Invalid or missing API key
403	permission_error	No	Key lacks access to requested model or feature
404	not_found_error	No	Endpoint path typo, deprecated model string
413	request_too_large	No	Request body exceeds the maximum allowed size in bytes
429	rate_limit_error	Yes — with backoff	RPM, ITPM, OTPM, or TPD ceiling hit
500	api_error	Yes — limited	Anthropic-side error; retry up to 2–3 times
529	overloaded_error	Yes — with backoff	Anthropic infrastructure under heavy load

Reading the Error Body

The structured error body gives you exact diagnosis without guessing from the status code alone:

// Example 429 response body
{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded: you have exceeded your daily token limit"
  }
}

The message field differentiates a per-minute rate limit from a per-day exhaustion — both arrive as 429 but your handling differs: per-minute errors should back off and retry; per-day exhaustion should halt all requests until midnight UTC and alert the on-call engineer.
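A minimal sketch of that distinction is shown below. It parses a raw JSON error body and reports whether a rate-limit error looks like daily or per-minute exhaustion. Matching on the word "daily" is a heuristic based on the example message above, not a documented contract, and `classify_rate_limit_body` is a hypothetical helper name.

```python
import json


def classify_rate_limit_body(raw_body):
    """Return 'daily', 'per_minute', or None for a JSON error body.

    Matching on the word "daily" in the message is a heuristic taken from
    the example above, not a documented contract.
    """
    try:
        body = json.loads(raw_body)
    except (TypeError, ValueError):
        return None  # not JSON at all
    if not isinstance(body, dict):
        return None
    error = body.get("error") or {}
    if not isinstance(error, dict) or error.get("type") != "rate_limit_error":
        return None  # some other error type; not a rate limit
    message = str(error.get("message", "")).lower()
    return "daily" if "daily" in message else "per_minute"
```

Because the check depends on message wording, it is worth logging any rate-limit message that matches neither expectation so a change in phrasing is noticed quickly.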

The Retry Decision Tree

Never retry: 400, 401, 403, 404, 413. These are deterministic failures. Retrying wastes quota and signals a bug in your code or configuration, not a transient infrastructure issue.

Retry with exponential backoff: 429 (per-minute) and 529. Parse the retry-after header if present; otherwise use the reset timestamp from the rate-limit headers to compute the exact wait time.

Retry conservatively (2–3 attempts max): 500. Anthropic's servers occasionally return transient 500s. Two or three retries with a short delay are usually enough; if the error persists, surface it to the caller rather than looping.

Special case — 429 TPD: Do not retry. Stop all requests for the remainder of the day and notify stakeholders. Add TPD headroom monitoring to prevent recurrence.
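The decision tree above can be written down as a small lookup function. This is only a sketch of the rules in this lesson: `Action` and `decide` are hypothetical names, and the `daily_limit_hit` flag is assumed to come from inspecting the error body.

```python
from enum import Enum


class Action(Enum):
    FAIL = "fail"                    # deterministic error: fix the request
    BACKOFF = "backoff"              # retry with exponential backoff
    RETRY_LIMITED = "retry_limited"  # retry at most 2-3 times
    HALT = "halt"                    # stop all traffic and alert


NEVER_RETRY = {400, 401, 403, 404, 413}


def decide(status_code, daily_limit_hit=False):
    """Map an HTTP status to the handling rule from the decision tree."""
    if status_code in NEVER_RETRY:
        return Action.FAIL
    if status_code == 429:
        return Action.HALT if daily_limit_hit else Action.BACKOFF
    if status_code == 529:
        return Action.BACKOFF
    if status_code == 500:
        return Action.RETRY_LIMITED
    # Statuses the lesson does not cover: fail loudly rather than guess.
    return Action.FAIL
```

Keeping the mapping in one function makes it easy to unit-test and prevents generic "retry anything non-200" logic from creeping back in.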

SDK Default Behaviour

As of anthropic-sdk-python v0.25+, the SDK automatically retries 429 and 529 up to two times using exponential backoff. It does NOT retry errors such as 400, 401, 403, or 404. Pass max_retries=0 when constructing the client for full manual control, or a higher value to allow more attempts. The SDK honours the retry-after header when one is present.

overloaded_error: HTTP 529. Anthropic-specific status indicating infrastructure saturation, not a client error. Back off and retry.

invalid_request_error: HTTP 400. Your request is structurally wrong. Retry will always fail. Fix the code.

retry-after: Optional response header on 429 responses indicating the minimum number of seconds to wait before retrying.

Lesson 2 Quiz

Error taxonomy — 4 questions

A request returns HTTP 400 with error.type "invalid_request_error". What is the correct action?

Correct. 400 errors are deterministic — the server fully understood your request and rejected it. Retrying the same request will always produce the same 400. You must fix the request.

400 is non-retryable. The server understood your request perfectly and rejected it because it's invalid. No amount of waiting or backoff will change the outcome — only fixing the request will.

What distinguishes an HTTP 529 "overloaded_error" from an HTTP 500 "api_error" in terms of handling?

Correct. 529 is Anthropic's custom status for capacity saturation — common during demand spikes and very retryable with backoff. 500s are genuine server faults — retry conservatively (2–3 times max).

Both 529 and 500 are retryable, but they have different expected frequencies and retry strategies. 529 is expected during high-traffic periods; 500 is rarer and should be retried more conservatively.

Your app hits a 429 with the message "you have exceeded your daily token limit." What is the correct response?

Correct. A TPD exhaustion 429 will not recover until midnight UTC. Retrying achieves nothing because the daily budget does not replenish until the reset. Halt, alert, and review daily usage patterns.

TPD exhaustion is not a transient per-minute issue — it won't recover until the daily counter resets at midnight UTC. Backoff and retry logic is the wrong tool here.

Which SDK parameter controls how many automatic retries the official Anthropic Python SDK performs on 429 and 529 errors?

Correct. The Anthropic client accepts max_retries at instantiation — e.g., Anthropic(max_retries=0) disables automatic retries entirely for full manual control.

The correct parameter is max_retries. You pass it when constructing the Anthropic client instance: client = Anthropic(api_key=..., max_retries=3).

Lab 2 — Error Taxonomy

Diagnose error scenarios with your AI assistant

Your Task

Describe error scenarios to the assistant and work through the correct diagnosis and handling strategy. Practice distinguishing retryable from non-retryable errors and understanding what each error type signals about your code or the API's state.

Starter: "My app is getting a 403 permission_error when calling claude-3-opus-20240229 but my API key works fine for claude-3-haiku. What's likely happening and what should I do?"

Module 6 · Lesson 3

Exponential Backoff and Jitter

Building retry logic that recovers fast without making congestion worse.

Why does naive retry logic often make API overload situations worse, and what does proper jitter actually change?

The "thundering herd" problem (thousands of clients retrying a failed service simultaneously at fixed intervals) has been documented in production incidents at AWS, Google, and Netflix. When a brief API outage recovers, all waiting clients fire at exactly the same moment, causing an immediate second overload. Jitter, meaning randomisation of the retry delay, was formalised as a solution in AWS's 2015 blog post "Exponential Backoff and Jitter" by Marc Brooker, which remains the canonical reference. Anthropic's own SDK retry implementation also adds random jitter to its exponential backoff delays.

The Three Retry Strategies

Fixed delay: Wait the same duration every time (e.g., 1 second). Simple but creates thundering herds when many clients fail simultaneously. Acceptable for rare, isolated errors. Never use for 429s under load.

Exponential backoff (no jitter): Wait base × 2^attempt. Clients spread out over time but any group of clients that failed at the same moment will still retry at exactly the same times, causing secondary load spikes at each doubling interval.

Exponential backoff with full jitter: Wait random(0, min(cap, base × 2^attempt)). Clients that failed together retry at uniformly random points within the growing window. This is the "full jitter" approach recommended in the AWS post.
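The three strategies can be compared side by side with a short sketch. The function names and the `rng` parameter are illustrative choices; the formulas follow the descriptions above, with a 60-second cap assumed for the exponential variants.

```python
import random


def fixed_delay(attempt, delay=1.0):
    """Same wait on every attempt."""
    return delay


def exponential_backoff(attempt, base=1.0, cap=60.0):
    """Deterministic doubling, capped at `cap` seconds."""
    return min(cap, base * 2 ** attempt)


def full_jitter(attempt, base=1.0, cap=60.0, rng=random):
    """Uniformly random wait between 0 and the capped exponential ceiling."""
    return rng.uniform(0, exponential_backoff(attempt, base, cap))


if __name__ == "__main__":
    # Two clients that failed at the same instant, each with its own RNG.
    client_a, client_b = random.Random(1), random.Random(2)
    for attempt in range(4):
        print(
            attempt,
            exponential_backoff(attempt),  # identical for both clients
            round(full_jitter(attempt, rng=client_a), 2),
            round(full_jitter(attempt, rng=client_b), 2),
        )
```

Without jitter both clients compute exactly the same schedule, which is what keeps them synchronised. With full jitter their waits diverge on the first retry.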

Reference Implementation (Python)

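import random
import time

import anthropic


def call_with_backoff(client, **kwargs):
    """Call client.messages.create, retrying transient failures."""
    max_retries = 5
    base_delay = 1.0  # seconds
    max_delay = 60.0  # upper bound on any single wait

    for attempt in range(max_retries):
        last_attempt = attempt == max_retries - 1
        try:
            return client.messages.create(**kwargs)
        except anthropic.RateLimitError as e:
            # Heuristic: the daily-limit message contains the word "daily".
            if "daily" in str(e).lower() or last_attempt:
                raise  # TPD is never retried; otherwise retries are exhausted
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))  # full jitter
        except anthropic.APIStatusError as e:
            # RateLimitError is a subclass of APIStatusError, so 429s are
            # handled by the clause above and never reach this one.
            if e.status_code < 500 or last_attempt:
                raise  # 4xx is never retried; otherwise retries are exhausted
            time.sleep(random.uniform(0, 2))  # short randomised delay for 5xx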

Honouring the retry-after Header

When Anthropic includes a retry-after header (numeric seconds) or retry-after-ms (milliseconds), your minimum wait must be at least that long. Your jitter should be applied on top of the server-specified minimum, not instead of it:

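def parse_retry_after(response_headers):
    """Return the server-specified minimum wait in seconds, or None."""
    ra = response_headers.get("retry-after")
    if ra:
        try:
            return float(ra)
        except ValueError:
            pass  # not numeric (e.g. an HTTP date); fall through
    ra_ms = response_headers.get("retry-after-ms")
    if ra_ms:
        try:
            return float(ra_ms) / 1000
        except ValueError:
            pass
    return None

# In the retry loop, inside an except clause where `e` is the caught
# anthropic.APIStatusError and `cap` is the current backoff ceiling:
server_min = parse_retry_after(e.response.headers) or 0
wait = server_min + random.uniform(0, cap)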

Max Retries and Circuit Breakers

Even well-designed retry logic needs a ceiling. A retry loop without a maximum retry count can turn a brief outage into a request backlog that grows without bound. Standard practice is 3–5 retries for 429s and 2–3 for 500s. Beyond that, surface the error to the caller — they need the opportunity to degrade gracefully, queue the work, or alert an operator.

In high-volume systems, consider a circuit breaker: after N consecutive failures, stop sending requests entirely for a fixed window (e.g., 30 seconds). This prevents your application from hammering a struggling API and allows headroom for recovery. Retry libraries such as tenacity (Python) and p-retry (Node.js) handle the backoff side; the breaker itself usually comes from a dedicated library such as pybreaker (Python) or opossum (Node.js), or from a small class of your own.
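For readers who want to see the moving parts, here is a minimal, single-threaded sketch of the pattern. The class and method names are hypothetical, the threshold and cooldown defaults are arbitrary, and a production breaker would also need locking and a stricter half-open state that admits only one trial request.

```python
import time


class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures; stay open for `cooldown` seconds."""

    def __init__(self, failure_threshold=5, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.consecutive_failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True  # breaker is closed
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: let a trial request through (half-open state).
            return True
        return False

    def record_success(self):
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # (Re)start the cooldown window from this failure.
            self.opened_at = self.clock()
```

The caller checks `allow_request()` before each API call and reports the outcome with `record_success()` or `record_failure()`. If a trial request after the cooldown fails, the window restarts.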

Use the Reset Header Instead of Guessing

For 429 per-minute errors, the x-ratelimit-reset-tokens and x-ratelimit-reset-requests headers give you the exact reset timestamp. Computing reset_time - now and sleeping that duration is more efficient than exponential backoff because you know exactly when capacity resumes. Add a small jitter (0–500ms) to avoid a thundering herd of clients resuming simultaneously.
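A small sketch of that calculation follows. It assumes the reset header carries an ISO 8601 timestamp as described in Lesson 1, and that a timestamp without an offset is in UTC. `wait_until_reset` is a hypothetical helper, and `now` must be timezone-aware when supplied.

```python
import random
from datetime import datetime, timezone


def wait_until_reset(reset_header, now=None, max_jitter=0.5, rng=random):
    """Seconds to sleep until the reset timestamp, plus 0 to max_jitter of jitter."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalise a trailing "Z", which fromisoformat() rejects before Python 3.11.
    reset_at = datetime.fromisoformat(reset_header.replace("Z", "+00:00"))
    if reset_at.tzinfo is None:
        # Assumption: a timestamp without an offset is in UTC.
        reset_at = reset_at.replace(tzinfo=timezone.utc)
    # Clamp at zero in case the reset moment has already passed.
    base_wait = max(0.0, (reset_at - now).total_seconds())
    return base_wait + rng.uniform(0, max_jitter)
```

The jitter is added on top of the computed wait, so a client never wakes before the reset and a group of clients does not wake at the same instant.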

Full jitter: random(0, cap). The most effective jitter strategy; uniformly distributes retries across the entire backoff window.

Circuit breaker: Pattern that stops sending requests entirely after N consecutive failures, preventing cascading load during API degradation.

retry-after: Server-specified minimum wait in seconds. Your wait must be at least this; jitter should be added on top.

Lesson 3 Quiz

Exponential backoff and jitter — 4 questions

Why does exponential backoff without jitter still cause thundering herd problems?

Correct. Synchronised clients remain synchronised even with exponential backoff — they all wait 1s, then 2s, then 4s together. Full jitter breaks this synchronisation.

The issue isn't speed — it's synchronisation. Clients that entered the retry loop at the same time will all wake up at the same doubled intervals, creating repeated mini-waves of load.

Your backoff cap is 60 seconds, base delay is 1 second, and this is attempt number 3 (0-indexed). What is the full-jitter wait range?

Correct. 1 × 2³ = 8, which is under the 60-second cap. Full jitter picks uniformly from 0 to 8 seconds. The cap would only apply from attempt 6 onward (1 × 2⁶ = 64 > 60).

Attempt 3 (0-indexed): base × 2^3 = 1 × 8 = 8. That's below the 60-second cap so the range is random(0, 8). The cap only bites when the exponential exceeds 60.

The API returns a 429 with a "retry-after: 12" header. Your calculated backoff cap for this attempt is 8 seconds. What should your wait be?

Correct. The server-specified retry-after is a hard minimum. Your computed backoff is secondary — the final wait must be max(server_min, computed_backoff) plus jitter. Never wait less than the server says.

Your computed backoff is irrelevant when it's shorter than the server-specified minimum. retry-after: 12 means you cannot retry before 12 seconds. Your cap only applies if it's larger than the server's value.

What is the primary benefit of using a circuit breaker on top of retry logic?

Correct. Without a circuit breaker, retry queues grow unbounded during sustained outages. The circuit breaker provides a hard stop — and more importantly, signals to the rest of your system that the dependency is degraded.

Circuit breakers don't replace backoff — they complement it. Their key function is stopping requests entirely during prolonged failures rather than letting queues grow forever.

Lab 3 — Backoff and Retry Design

Design and review retry strategies with your AI assistant

Your Task

Work through retry strategy design problems with the assistant. Describe your system's requirements — concurrency level, acceptable latency, retry budget — and get feedback on your approach. You can also paste pseudocode or describe an existing retry implementation for review.

Starter: "I have 50 concurrent workers all calling the Anthropic API. When I get 429s they all back off and retry at the same time. What specific changes should I make to my retry logic to fix this?"

Module 6 · Lesson 4

Proactive Rate Management and Production Patterns

Token budgeting, request queues, header-driven pacing, and observability for high-volume deployments.

How do production teams stay within limits without sacrificing throughput — and how do they catch limit issues before users do?

Several teams building document-processing pipelines on the Anthropic API reported that their throughput was far below theoretical limits because they were processing requests sequentially. When they moved to a token-budget-aware concurrency pool — tracking remaining tokens from response headers and dynamically adjusting the number of concurrent workers — throughput increased 8–12× while 429 rates fell to near zero. The pattern was discussed in Anthropic's developer forum in mid-2024 and subsequently referenced in third-party integration guides.

Proactive vs. Reactive Rate Control

Most developers start with reactive rate control: send requests freely, handle 429s with backoff. This works but is inefficient — every 429 represents wasted latency and a request that made it to the server before being rejected.

Proactive rate control reads the rate-limit headers on each successful response and uses that information to govern when the next request fires. You never hit the ceiling because you're continuously tracking remaining headroom.

Token-Budget-Aware Concurrency

The core pattern for high-throughput pipelines:

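import threading
from datetime import datetime


class TokenBudgetPool:
    def __init__(self, max_workers=10):
        self.max_workers = max_workers  # upper bound on concurrent workers
        self.remaining_tokens = 40000  # start conservative
        self.reset_at = None
        self.lock = threading.Lock()  # guards state shared across workers

    def update_from_headers(self, headers):
        """Refresh the budget from the headers of a completed response."""
        with self.lock:
            remaining = headers.get("x-ratelimit-remaining-tokens")
            if remaining is not None:
                self.remaining_tokens = int(remaining)
            reset = headers.get("x-ratelimit-reset-tokens")
            if reset:
                # fromisoformat() rejects a trailing "Z" before Python 3.11.
                self.reset_at = datetime.fromisoformat(reset.replace("Z", "+00:00"))

    def can_submit(self, estimated_tokens):
        """True if the estimate plus a 10% safety margin fits the budget."""
        with self.lock:
            return self.remaining_tokens >= estimated_tokens * 1.1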

Request Queueing Pattern

For batch workloads, a request queue decouples task submission from API execution. Workers pull from the queue only when the token-budget pool signals headroom. This pattern smooths bursty workloads into steady API consumption:

1. Accept all tasks into an internal queue immediately (no blocking on API limits).
2. A dispatcher loop checks token headroom before dispatching each item.
3. If headroom is insufficient, the dispatcher sleeps until the reset timestamp.
4. Completed tasks post their header data back to update the budget pool.

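This design should rarely produce a 429 under normal operation because the dispatcher enforces limits before the API does. A 429 can still occur when token estimates run low or when other clients share the same quota, so keep the backoff logic from Lesson 3 as a fallback.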
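The four steps above can be sketched as a single-threaded loop. Everything here is illustrative: `SimpleBudget` stands in for header-driven state, and the `send` and `wait_for_reset` callables are hypothetical hooks the caller supplies (in a real system `send` would call the API and `wait_for_reset` would sleep until the reset timestamp).

```python
from collections import deque


class SimpleBudget:
    """Tracks remaining per-minute tokens; stands in for header-driven state."""

    def __init__(self, limit):
        self.limit = limit
        self.remaining = limit

    def reset(self):
        self.remaining = self.limit


def run_dispatcher(tasks, budget, send, wait_for_reset):
    """Dispatch (task_id, estimated_tokens) pairs without exceeding the budget.

    send(task_id) performs the call and returns the tokens it actually used;
    wait_for_reset() blocks until the per-minute window rolls over.
    """
    queue = deque(tasks)  # step 1: accept everything up front
    completed = []
    while queue:
        task_id, estimated = queue[0]
        if estimated > budget.limit:
            raise ValueError(f"task {task_id} can never fit in one window")
        if estimated > budget.remaining:  # steps 2-3: check headroom, else wait
            wait_for_reset()
            budget.reset()
            continue
        queue.popleft()
        used = send(task_id)
        budget.remaining -= used  # step 4: feed actual usage back
        completed.append(task_id)
    return completed
```

A concurrent version would replace the loop body with worker threads and guard the budget with a lock, but the control flow stays the same: check headroom, dispatch or wait, then reconcile with actual usage.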

Observability: What to Monitor in Production

Rate limit incidents are difficult to debug after the fact without metrics. Emit these as time-series data:

Metric	Alert Threshold	Why It Matters
429 rate (per minute)	>5% of requests	Indicates reactive control is failing; switch to proactive
x-ratelimit-remaining-tokens (min per minute)	<10% of limit	Early warning before hitting ceiling
Retry count per request (p95)	>2 retries	Signals sustained pressure, not transient spikes
Daily token consumption (% of TPD)	>80% by 18:00 UTC	Predicts end-of-day exhaustion before it happens
Time to first token (p99)	Spike >2× baseline	Backoff delays propagating to users
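The daily-consumption threshold in the table can be sanity-checked with a linear projection. The sketch below assumes a constant burn rate since midnight UTC, which real traffic rarely follows exactly; `projected_exhaustion_hour` is a hypothetical helper.

```python
def projected_exhaustion_hour(fraction_used, hour_utc):
    """Linearly extrapolate the UTC hour at which the daily budget runs out.

    Returns None if the current burn rate would not exhaust the budget
    before the midnight reset (hour 24).
    """
    if hour_utc <= 0 or fraction_used <= 0:
        return None
    burn_per_hour = fraction_used / hour_utc
    exhaustion = 1.0 / burn_per_hour
    return exhaustion if exhaustion < 24 else None
```

At 80% consumed by 18:00 UTC, a constant burn rate exhausts the budget at about 22:30 UTC, roughly 90 minutes before the reset, which is why the alert fires at that point.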

max_tokens as a Rate Control Tool

Setting max_tokens conservatively is one of the most underused rate control levers. For a pipeline that generates short classifications, setting max_tokens=150 instead of max_tokens=1000 cuts the worst-case output per request by 85% (from 1,000 tokens to 150), which leaves room for more requests under the same OTPM limit. Never leave max_tokens at the model maximum unless you genuinely need it.

max_tokens is only a cap: the actual output token count is usually lower, for streaming and non-streaming responses alike. Your monitoring should track actual output tokens from the usage.output_tokens field in the response, not the cap you set.
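The arithmetic behind the max_tokens lever is simple enough to write down. This sketch assumes the worst case in which every request produces its full max_tokens of output; both function names are illustrative.

```python
def worst_case_requests_per_minute(otpm_limit, max_tokens):
    """How many requests fit in one minute if each one uses its full max_tokens."""
    return otpm_limit // max_tokens


def reduction(old_max_tokens, new_max_tokens):
    """Fractional drop in worst-case output tokens per request."""
    return 1 - new_max_tokens / old_max_tokens
```

Under the Tier 1 OTPM figure of 8,000 quoted in Lesson 1, the worst case is 8 requests per minute at max_tokens=1000 versus 53 at max_tokens=150.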

Batch API for Non-Realtime Workloads

Anthropic launched the Message Batches API in late 2024. Batch requests are processed asynchronously with a 24-hour SLA and carry a 50% price discount. Critically, batch requests consume from a separate quota pool: they do not count against your standard RPM/ITPM limits. For document processing, bulk classification, or any other non-realtime workload, the Batches API is the correct tool and eliminates rate limit concerns almost entirely.

Token-budget-aware concurrency: A pattern where workers track remaining token headroom from response headers and gate new requests before limits are hit.

Message Batches API: Anthropic's async batch processing API. Separate quota pool, 50% discount, 24-hour SLA. Ideal for non-realtime workloads.

usage.output_tokens: The actual output token count in a response. Always monitor this, not max_tokens, for accurate quota accounting.

Lesson 4 Quiz

Proactive rate management — 4 questions

What is the key difference between proactive and reactive rate control?

Correct. Proactive control prevents 429s from ever occurring by tracking headroom from response headers. Reactive control is simpler but wastes latency on every rejected request.

The distinction is about timing: proactive control acts before hitting limits using header data; reactive control responds after receiving a 429. Both use backoff, but proactive control rarely needs it.

Your pipeline processes 10,000 documents overnight. Which Anthropic API capability is most appropriate and why?

Correct. The Message Batches API was designed exactly for this — non-realtime bulk processing. Its separate quota pool means your overnight job won't compete with your realtime API traffic for the same limits.

For overnight bulk processing with no realtime requirement, the Message Batches API is the right tool: it uses a separate quota pool (no competition with live traffic), costs 50% less, and handles 10,000 documents easily within a 24-hour window.

Your daily TPD monitoring shows 80% consumption by 18:00 UTC. What should this trigger?

Correct. 80% consumed with 6 hours remaining means exhaustion is likely. Alerting at this threshold gives engineers time to act — rate-limit non-critical workloads, defer batch jobs, or temporarily reduce max_tokens — before users are impacted.

At 80% with 6 hours to midnight, the trajectory will hit 100% before reset. The alert threshold exists specifically to give engineers lead time to act before the limit is hit.

Why should you monitor usage.output_tokens from API responses rather than relying on your max_tokens setting for quota accounting?

Correct. Claude regularly produces outputs much shorter than max_tokens. If you forecast quota usage based on max_tokens, you'll systematically over-estimate consumption. usage.output_tokens gives you actual consumption for accurate forecasting.

max_tokens is a ceiling, not a prediction. Actual output — returned in usage.output_tokens — is typically well below the cap. Accurate quota accounting requires actual numbers, not ceiling values.

Lab 4 — Production Rate Management

Design production rate management strategies with your AI assistant

Your Task

Design or review production rate management architectures. Describe your system's API usage patterns — request volume, latency requirements, workload type — and get specific recommendations on concurrency models, queueing strategies, observability setup, and when to use the Batches API.

Starter: "I'm building a legal document review system that processes 500 documents per day with 30-minute SLA requirements and also handles real-time chat queries from lawyers. How should I architect the rate management to keep both workloads healthy?"

Module 6 Test

Rate Limits and Error Handling — 15 questions · Pass at 80%

1. Which Anthropic rate limit axis is most commonly the binding constraint for large-context document pipelines?

Correct. Long prompts consume large amounts of ITPM even with few requests, making ITPM the typical bottleneck for document pipelines.

For document pipelines with large contexts, ITPM is typically the binding constraint — each request carries a large prompt consuming disproportionate token budget relative to request count.

2. Anthropic Tier 3 requires which conditions to be met?

Correct. Tier 3: $200 + 14 days. Promotion is automatic when both thresholds are met.

Review tier thresholds: Tier 2 is $40/7 days, Tier 3 is $200/14 days, Tier 4 is $4,000/30 days. No application required for any tier.

3. Which header gives you the ISO 8601 timestamp for when your request count resets?

Correct. x-ratelimit-reset-requests provides the exact reset time for the RPM counter.

The reset timestamps are in x-ratelimit-reset-requests (for RPM) and x-ratelimit-reset-tokens (for token limits).

4. An HTTP 413 error from the Anthropic API indicates what, and should it be retried?

Correct. 413 request_too_large is non-retryable. The request itself is too large — you must shorten it before retrying.

413 means the request payload exceeds the maximum size the API accepts. Retrying the same request will always fail. You must reduce the request size.

5. Which HTTP status code does Anthropic use for infrastructure saturation (distinct from rate limiting)?

Correct. HTTP 529 (overloaded_error) is Anthropic's custom status for capacity saturation. It's retryable with backoff.

Anthropic uses the non-standard HTTP 529 for infrastructure saturation (overloaded_error). 429 is for rate limits, 500 for server errors.

6. The Anthropic Python SDK (v0.25+) automatically retries which error classes by default?

Correct. The SDK retries 429 and 529 by default (up to 2 times). 400-class errors are never retried automatically.

The SDK correctly identifies 400-class errors as non-retryable. It auto-retries only 429 (rate limit) and 529 (overloaded) — the two transient server-side conditions.

7. "Full jitter" exponential backoff computes the wait as:

Correct. Full jitter selects uniformly from zero to the capped exponential ceiling — this is the AWS-recommended approach that best prevents thundering herds.

Full jitter: random(0, cap) where cap = min(max_delay, base × 2^attempt). The key is the lower bound is zero — distributing clients uniformly across the entire window.

8. A 429 response includes "retry-after: 15". Your computed full-jitter backoff for this attempt is 6 seconds. What should your actual wait be?

Correct. retry-after is a hard minimum floor. Your 6-second backoff is irrelevant when it's below the server's required wait. Wait at least 15 seconds; you may add jitter on top.

The server's retry-after value is a minimum — you must wait at least that long. Your computed backoff only matters if it exceeds the server minimum. Here 15 > 6, so 15 seconds is the floor.

9. What is the primary purpose of a circuit breaker in an API integration?

Correct. Circuit breakers provide a hard stop during sustained outages. Without them, request queues grow unbounded and the API never gets breathing room to recover.

Circuit breakers open (stop requests) after N failures. This is distinct from backoff — it's a complete halt, not a delay. It prevents cascading queue buildup during prolonged API degradation.

10. Which of these workloads is the best candidate for Anthropic's Message Batches API?

Correct. Nightly batch processing with no realtime requirement is ideal for the Batches API: separate quota pool, 50% discount, 24-hour completion window fits perfectly.

The Batches API is for non-realtime workloads with flexible SLAs (up to 24 hours). Real-time, streaming, and interactive use cases require the standard synchronous API.

11. What does "token-budget-aware concurrency" mean in practice?

Correct. Workers update a shared remaining-token counter from response headers and only dispatch the next request when sufficient headroom exists — preventing 429s without sacrificing throughput.

Token-budget-aware concurrency uses the remaining-tokens header to govern dispatch decisions — not billing, not a fixed concurrency cap, but dynamic adjustment based on actual remaining API capacity.

12. At what daily token consumption percentage (by 18:00 UTC) should you trigger a proactive alert?

Correct. 80% by 18:00 UTC gives engineers 6 hours to act before midnight reset — enough time to shift workloads, reduce rate, or arrange temporary capacity, without impacting users.

Alerting at 100% is too late — you're already out of budget. 80% by 18:00 UTC (6 hours before reset) gives meaningful lead time for intervention.

13. Why should max_tokens be set conservatively rather than at the model maximum?

Correct. max_tokens caps output length, which bounds OTPM consumption per request. Tighter caps mean more requests can fit within the same per-minute token budget.

The rate management benefit: lower max_tokens limits OTPM per request, leaving more token budget for concurrent or subsequent requests. This directly increases throughput without changing your limit tier.

14. Which field in the API response gives the actual output token count (not the cap)?

Correct. usage.output_tokens is the precise completion token count. It's always ≤ max_tokens and is the value to use for accurate quota accounting.

usage.output_tokens is the direct field. The usage object reports input_tokens and output_tokens separately, so there is no need to derive the output count from other fields.

15. The Message Batches API does NOT count against which resource?

Correct. Batch requests consume from a separate quota pool — they don't compete with your synchronous API's RPM/ITPM limits. This is a key advantage for mixed workloads.

The Batches API uses a separate quota pool, meaning batch jobs don't reduce the RPM/ITPM headroom available for your real-time synchronous calls. Billing still applies (at 50% discount).