Module 8 · Lesson 1

Rate Limiting & Retry Logic

Every production system eventually hits a 429. The question is whether yours recovers gracefully — or cascades into failure.

How do you design API clients that handle throttling without losing requests or hammering the service?

In March 2023, when Anthropic's Claude API entered broad beta access, teams integrating it into customer-facing products discovered the same lesson that AWS, Stripe, and OpenAI customers had learned before them: rate limits are not edge cases. They are the normal operating condition of any successful API deployment. The teams that shipped stable products were those who had built retry logic from day one, not as an afterthought.

Understanding the HTTP 429

When you exceed Anthropic's rate limits, the API returns HTTP 429 Too Many Requests. Two distinct limits can trigger this: requests per minute (RPM) and tokens per minute (TPM). Both count against your tier independently. A large-context request can exhaust your TPM budget long before you hit your RPM ceiling.
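To see which of the two limits binds first, a rough calculation is often enough. The sketch below assumes every request uses about the same number of tokens; the function name `effective_rpm` and the sample limits are illustrative and not part of any SDK.

```python
def effective_rpm(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> int:
    """Return the sustainable requests per minute under both limits.

    Assumes each request consumes roughly `tokens_per_request` tokens
    (input plus output), which is a simplification.
    """
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    by_tokens = tpm_limit // tokens_per_request  # requests the token budget allows
    return min(rpm_limit, by_tokens)


# Small requests: the request limit binds.
print(effective_rpm(rpm_limit=60, tpm_limit=100_000, tokens_per_request=500))    # 60
# Large-context requests: the token limit binds well before 60 RPM.
print(effective_rpm(rpm_limit=60, tpm_limit=100_000, tokens_per_request=8_000))  # 12
```

With 8,000-token requests the token budget allows only 12 calls per minute, so pacing against RPM alone would still produce 429s.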

The response includes a retry-after header indicating how many seconds to wait. Ignoring this header and retrying immediately is the most common mistake — it deepens the rate-limit hole rather than climbing out of it.

# Anthropic rate limit response headers
HTTP/1.1 429 Too Many Requests
anthropic-ratelimit-requests-limit: 60
anthropic-ratelimit-tokens-limit: 100000
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-tokens-remaining: 0
anthropic-ratelimit-requests-reset: 2024-01-15T14:32:05Z
retry-after: 37

Exponential Backoff with Jitter

The correct pattern is exponential backoff with jitter. Each successive retry waits roughly twice as long as the previous, with random jitter added to prevent thundering-herd synchronization when many clients retry at the same time. AWS documented this pattern in 2015 after large-scale DynamoDB throttling events; it remains the industry standard.

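import random
import time

import anthropic


def call_with_retry(client, **kwargs):
    # Tip: build the client with max_retries=0 so the SDK's own
    # built-in retries do not stack on top of this loop.
    max_retries = 5
    base_delay = 1.0   # seconds
    max_delay = 60.0   # cap on the exponential backoff

    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)

        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff, capped at max_delay
            backoff = min(base_delay * (2 ** attempt), max_delay)
            # Respect retry-after if present: never wait less than the server asks
            header = e.response.headers.get("retry-after")
            wait = max(float(header), backoff) if header else backoff
            # Add up to 25% jitter on top, so clients desynchronize
            # without retrying earlier than retry-after allows
            delay = wait * random.uniform(1.0, 1.25)
            print(f"Rate limited. Retry {attempt + 1} in {delay:.1f}s")
            time.sleep(delay)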
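The delay calculation can also be pulled out into a pure function, which makes it easy to test without calling the API. The sketch below uses the "full jitter" variant (a random delay between zero and the exponential ceiling) rather than the percentage jitter above; `backoff_delay` and `wait_time` are hypothetical helper names, and the defaults are illustrative.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: a uniform random delay in [0, min(cap, base * 2**attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def wait_time(attempt: int, retry_after: Optional[float] = None,
              rng: Optional[random.Random] = None) -> float:
    """Treat the server's retry-after hint as a floor and add jitter on top."""
    floor = retry_after if retry_after is not None else 0.0
    return floor + backoff_delay(attempt, rng=rng)


rng = random.Random(42)  # seeded only to make the demo repeatable
for attempt in range(5):
    print(attempt, round(backoff_delay(attempt, rng=rng), 2))
```

Full jitter spreads retries across the whole interval, which tends to desynchronize clients more aggressively than a narrow percentage band. The cost is that an individual retry can occasionally fire almost immediately when no retry-after floor is given.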

Proactive Rate Limit Management

Reactive retries are necessary but insufficient. The Anthropic API returns rate limit headers on every response, not just on 429s. Reading anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining lets you slow down before you hit the wall, rather than after. This is particularly valuable for batch processing pipelines where you control the request cadence.

import time


class RateLimitAwareClient:
    def __init__(self, client):
        self.client = client
        self.remaining_requests = float('inf')
        self.remaining_tokens   = float('inf')

    def create(self, **kwargs):
        # Slow down when near limit
        if self.remaining_requests < 5:
            time.sleep(2.0)

        # with_raw_response exposes the HTTP headers;
        # parse() then returns the usual Message object
        raw = self.client.messages.with_raw_response.create(**kwargs)
        hdrs = raw.headers
        self.remaining_requests = int(hdrs.get(
            'anthropic-ratelimit-requests-remaining', 999))
        self.remaining_tokens = int(hdrs.get(
            'anthropic-ratelimit-tokens-remaining', 999999))
        return raw.parse()
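A fixed two-second sleep is a blunt instrument. One possible refinement, sketched below, spreads the remaining request budget evenly over the time left until the window resets. It assumes the reset header carries an ISO 8601 timestamp like the one in the sample response earlier; `pace_delay` is a hypothetical helper, not an SDK function.

```python
from datetime import datetime, timezone
from typing import Optional


def pace_delay(remaining: int, reset_at: str, now: Optional[datetime] = None) -> float:
    """Seconds to wait before the next request so the remaining budget
    is spread evenly over the time left in the current window."""
    now = now or datetime.now(timezone.utc)
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+, so normalise it.
    reset = datetime.fromisoformat(reset_at.replace("Z", "+00:00"))
    seconds_left = max((reset - now).total_seconds(), 0.0)
    if remaining <= 0:
        return seconds_left          # budget exhausted: wait for the reset
    return seconds_left / remaining  # otherwise spread requests evenly


now = datetime(2024, 1, 15, 14, 31, 35, tzinfo=timezone.utc)
print(pace_delay(remaining=10, reset_at="2024-01-15T14:32:05Z", now=now))  # 3.0
```

With 10 requests left and 30 seconds until reset, the pipeline would wait 3 seconds between calls instead of bursting and then stalling.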

Key Concepts

RPM: Requests per minute. Each API call consumes one unit regardless of token count.

TPM: Tokens per minute. Counts both input and output tokens against a rolling window.

Exponential backoff: Wait time doubles each retry (1s, 2s, 4s, 8s…), preventing server overload.

Jitter: Random delay variation that prevents synchronized retry storms from multiple clients.

Thundering herd: When many clients simultaneously retry after a shared outage, collectively overwhelming the service.

Production Principle

The Anthropic Python SDK (v0.18+) includes built-in retry logic via the max_retries parameter on client initialization. For simple use cases, anthropic.Anthropic(max_retries=3) is sufficient. Build custom logic only when you need header-aware throttling or per-request retry policies.

Lesson 1 Quiz

Rate Limiting & Retry Logic — 3 questions

Which HTTP header should your retry logic check first when receiving a 429 response from the Anthropic API?

Correct. The retry-after header tells you exactly how many seconds to wait before the server will accept requests again. Using it prevents unnecessary additional 429s.

Not quite. While rate limit headers are useful for proactive throttling, when you've already received a 429 the retry-after header gives you the precise wait time you need.

Why is "jitter" added to exponential backoff delays?

Correct. Without jitter, multiple clients that hit the same rate limit will all wait the same exponential interval and retry at the same moment — creating a thundering herd that immediately re-triggers the limit.

Jitter's purpose is client desynchronization. When many services retry simultaneously after a shared rate-limit event, they collectively overwhelm the server again. Random variance spreads these retries across time.

What is the key advantage of reading rate-limit response headers on successful (non-429) Anthropic API responses?

Correct. The API returns anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining on every response. Monitoring these lets batch pipelines voluntarily slow down before hitting zero, avoiding 429s entirely.

These headers enable proactive throttling — slowing your own request rate before the limit is reached. This is far more efficient for batch workloads than reactive retry loops.

Lab 1 — Designing Retry Strategies

Practice session · discuss rate limiting and retry logic with your AI coach

Your Mission

You're building a batch document-processing pipeline that will send hundreds of requests to the Anthropic API. Work through the rate-limiting design decisions with your AI coach — ask about header inspection, backoff tuning, and how to handle partial batch failures.

Suggested opener: "I'm processing 500 PDFs and need to call the API for each one. How should I structure my retry logic to handle rate limiting without slowing everything down too much?"

Production Patterns Coach

Lab 1

Ready to work through your rate-limiting strategy. Tell me about your batch processing use case — volume, latency requirements, and whether failures are recoverable or need to block.

Module 8 · Lesson 2

Cost Control & Token Budgeting

API costs that look manageable in development can grow nonlinearly in production. Token budgeting prevents surprises that end careers.

How do you build systems that remain economically viable as usage scales from hundreds to millions of requests?

In 2023, several early LLM startups discovered that their unit economics were fatally broken only after reaching scale. Systems built without per-request token budgets, caching, or prompt optimization were spending $0.40–$2.00 per user query on API costs alone — amounts that no subscription tier could support. The companies that survived built token accounting into their architecture from the start, treating each API call as a metered resource, not a free function call.

Understanding Anthropic's Pricing Model

Anthropic charges separately for input tokens (your prompt) and output tokens (the model's response). For Claude 3.5 Sonnet, output tokens cost 5× as much as input tokens ($15 versus $3 per million). This asymmetry has significant architectural implications: verbose system prompts that repeat on every request are expensive, and prompts that elicit long model responses are even more so.

Every API response includes a usage object with input_tokens and output_tokens. Logging these on every call is the minimum viable cost-tracking practice.

from dataclasses import dataclass


@dataclass
class TokenBudget:
    max_input_tokens:  int = 4096   # per-request ceilings, for callers to enforce
    max_output_tokens: int = 1024
    total_input_used:  int = 0      # running totals across all recorded calls
    total_output_used: int = 0

    def record(self, usage):
        # usage is the response.usage object returned with each API response
        self.total_input_used  += usage.input_tokens
        self.total_output_used += usage.output_tokens

    def cost_usd(self,
                 input_price_per_mtok:  float = 3.0,
                 output_price_per_mtok: float = 15.0) -> float:
        # Prices are in USD per million tokens
        return (
            (self.total_input_used  / 1_000_000) * input_price_per_mtok +
            (self.total_output_used / 1_000_000) * output_price_per_mtok
        )

The max_tokens Parameter as a Hard Budget

The max_tokens parameter in the API request is your primary output cost control. Setting it too high for tasks that only need short answers wastes money; setting it too low truncates responses. The right value depends on the task: classification needs 50 tokens, code generation may need 2000, document summarization typically lands between 300 and 600.

A common mistake is setting max_tokens=4096 globally. Instead, set per-task defaults and allow overrides only when justified.

Task → Token Budget

Sentiment classification: 10–50
Entity extraction: 100–300
Single-paragraph summary: 200–400
Full document summary: 400–800
Code generation: 500–2000
Multi-turn conversation: 300–600
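One way to enforce per-task defaults is a small lookup with a guarded override. The sketch below takes the upper bound of each range in the table; the task keys, the `HARD_CEILING` value, and the function name `max_tokens_for` are all illustrative choices, not anything the API prescribes.

```python
from typing import Optional

# Upper bound of each range from the table above.
TASK_MAX_TOKENS = {
    "sentiment_classification": 50,
    "entity_extraction": 300,
    "paragraph_summary": 400,
    "document_summary": 800,
    "code_generation": 2000,
    "conversation_turn": 600,
}

HARD_CEILING = 4096  # assumed absolute cap that no override may exceed


def max_tokens_for(task: str, override: Optional[int] = None) -> int:
    """Return the max_tokens value to send for a given task type."""
    if task not in TASK_MAX_TOKENS:
        raise KeyError(f"no token budget defined for task {task!r}")
    if override is not None:
        return min(override, HARD_CEILING)  # overrides are allowed but clamped
    return TASK_MAX_TOKENS[task]


print(max_tokens_for("sentiment_classification"))           # 50
print(max_tokens_for("code_generation", override=10_000))   # 4096
```

Raising on an unknown task forces every new call site to pick a budget deliberately instead of silently inheriting a large global default.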

Cost Reduction Techniques

Prompt caching (Anthropic beta)
Compress few-shot examples
Use smaller models for routing
Batch similar requests together
Cache identical or near-identical prompts
Strip whitespace from system prompts

Prompt Caching

Anthropic introduced prompt caching (Beta) in 2024, allowing you to mark large, static portions of your prompt — such as a lengthy system prompt, a document, or a toolset definition — with a cache_control marker. Subsequent requests that share that prefix pay only a fraction of the normal input cost. In workloads where the same large context is reused across many requests, this can reduce input token costs by 80–90%.

# Prompt caching — mark static prefix with cache_control
response = client.beta.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_static_context,   # 50k tokens of docs
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=user_messages,
    betas=["prompt-caching-2024-07-31"]
)
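The savings claim can be checked with simple arithmetic. The sketch below assumes cached reads are billed at about 10% of the base input price and that the first request pays a 25% premium to write the cache; both multipliers, the $3 base price, and the assumption that every request arrives while the cache is still warm should be checked against current pricing. `prefix_cost_usd` is a hypothetical helper.

```python
from typing import Tuple


def prefix_cost_usd(prefix_tokens: int, requests: int,
                    input_price_per_mtok: float = 3.0,
                    write_multiplier: float = 1.25,
                    read_multiplier: float = 0.10) -> Tuple[float, float]:
    """Return (uncached_cost, cached_cost) for a shared prompt prefix.

    Assumes the first request writes the cache and every later
    request is a cache hit.
    """
    base = prefix_tokens / 1_000_000 * input_price_per_mtok  # one uncached pass
    uncached = base * requests
    cached = base * write_multiplier + base * read_multiplier * (requests - 1)
    return uncached, cached


uncached, cached = prefix_cost_usd(prefix_tokens=50_000, requests=100)
saving = 1 - cached / uncached
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, saving {saving:.0%}")
# uncached $15.00, cached $1.67, saving 89%
```

The saving shrinks when the prefix is reused only a few times, because the write premium is amortized over fewer reads.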

Monitoring Tip

Build a per-session cost accumulator that raises a warning — or hard-stops — when a single conversation exceeds a threshold. Without this, adversarial or poorly-designed prompts can run up costs that dwarf typical usage. Stripe's API team reported that unbounded generation was the most common source of billing surprises in their LLM integrations.
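A per-session accumulator of this kind might look like the following minimal sketch. The class names, the thresholds, and the default prices ($3 and $15 per million tokens) are assumptions for illustration; in practice you would feed it the input_tokens and output_tokens from each response's usage object.

```python
import logging

logger = logging.getLogger("claude.cost")


class SessionBudgetExceeded(RuntimeError):
    """Raised when a session's accumulated cost reaches the hard limit."""


class SessionCostGuard:
    def __init__(self, warn_usd: float, stop_usd: float,
                 input_price_per_mtok: float = 3.0,
                 output_price_per_mtok: float = 15.0):
        self.warn_usd = warn_usd
        self.stop_usd = stop_usd
        self.input_price = input_price_per_mtok
        self.output_price = output_price_per_mtok
        self.spent_usd = 0.0
        self.warned = False

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one call's usage; warn once at warn_usd, raise at stop_usd."""
        self.spent_usd += (input_tokens / 1_000_000) * self.input_price
        self.spent_usd += (output_tokens / 1_000_000) * self.output_price
        if self.spent_usd >= self.stop_usd:
            raise SessionBudgetExceeded(f"session spent ${self.spent_usd:.4f}")
        if self.spent_usd >= self.warn_usd and not self.warned:
            self.warned = True
            logger.warning("session cost at $%.4f", self.spent_usd)
        return self.spent_usd


guard = SessionCostGuard(warn_usd=0.05, stop_usd=0.10)
print(round(guard.record(input_tokens=8_000, output_tokens=500), 4))  # 0.0315
```

Calling `record` after every response keeps the check close to the spend, so a runaway conversation is stopped within one request of crossing the limit.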

Input tokens: All text in your system prompt, user messages, and tool definitions, billed at a lower rate.

Output tokens: The model's generated response, billed at a higher rate (5× the input price in the Claude 3.5 Sonnet example above).

Prompt caching: Anthropic beta feature that reuses cached prompt prefixes at ~10% of normal input cost.

cache_control: The API field used to mark which content blocks should be cached for subsequent requests.

Lesson 2 Quiz

Cost Control & Token Budgeting — 3 questions

Why do output tokens typically cost more per million than input tokens in Anthropic's pricing?

Correct. Autoregressive generation requires a full forward pass per output token, while processing input requires one pass over the entire sequence. The compute asymmetry justifies the price difference.

The price difference reflects compute cost. Generating each output token requires a full forward pass through the model, whereas reading input context is a single attention computation over all tokens at once.

What is the primary mechanism Anthropic's prompt caching feature uses to reduce costs?

Correct. Prompt caching stores the key-value attention cache for a marked prefix. Subsequent requests that share that prefix pay reduced cost because the model doesn't re-compute attention for those tokens.

Prompt caching works at the model's KV-cache level. Static content is computed once; subsequent requests with the same prefix reuse that computation state, reducing the effective input token cost substantially.

Which approach is best for controlling output token costs across diverse task types?

Correct. A classification task rarely needs more than 50 tokens; a code generation task may need 2000. Task-specific budgets avoid wasted headroom on short tasks and truncation on long ones.

A global max_tokens ceiling is too blunt. It either wastes budget on short tasks (if set high) or truncates legitimate outputs (if set low). Per-task budgets are the professional approach.

Lab 2 — Token Budget Design

Practice session · optimize a cost-heavy API integration with your AI coach

Your Mission

Your company's Claude integration is burning through tokens faster than expected. You're seeing $800/day in API costs against a budget of $200. Work with your coach to identify the likely sources of waste and design a token-budgeting strategy.

Suggested opener: "Our Claude integration is costing 4× what we budgeted. Each request uses an 8,000-token system prompt, we always set max_tokens=4096, and we process each user session turn by turn. Where should I look first?"

Cost Optimization Coach

Lab 2

Let's diagnose your token spend. Walk me through your current architecture — system prompt size, average conversation length, output token usage, and whether any prompt content repeats across requests.

Module 8 · Lesson 3

Streaming, Timeouts & Async Patterns

Latency is a user experience problem. Streaming and async architecture turn slow completions into perceived responsiveness.

How do you build API integrations that feel fast even when model inference takes 15–30 seconds?

When ChatGPT launched in November 2022, the streaming token-by-token display wasn't just a cosmetic choice. OpenAI's internal testing had shown that users rated the same response as higher quality when it arrived as a stream rather than as a single block after a delay — even when total latency was identical. Anthropic adopted the same pattern. Every production Claude integration should use streaming for any user-facing output.

The Streaming API

Streaming uses server-sent events (SSE) to deliver tokens as they're generated. The Anthropic SDK provides a high-level streaming interface. The critical difference from non-streaming: your application receives the first token in 1–3 seconds, then continues receiving output until the model stops — rather than waiting for the entire response to generate before receiving anything.

import anthropic

client = anthropic.Anthropic()

# Context manager streams and cleans up automatically
with client.messages.stream(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": user_input}]
) as stream:
    # text_stream is an iterator attribute, not a method call
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Read the final message (with usage stats) while the stream is still open
    final = stream.get_final_message()

print(f"\nTokens: {final.usage.input_tokens} in, {final.usage.output_tokens} out")

Timeout Strategy

Without timeouts, a hung connection can block a thread or coroutine indefinitely. The Anthropic SDK uses a default timeout of 600 seconds — appropriate for very long completions but catastrophic if applied to a user-facing chatbot. Set timeouts appropriate to your use case:

import anthropic
import httpx

# Per-client timeout (applies to all requests)
client = anthropic.Anthropic(
    timeout=httpx.Timeout(
        connect=5.0,   # connection establishment
        read=60.0,     # max wait for each chunk of response data
        write=10.0,    # sending request body
        pool=5.0       # acquiring connection from pool
    )
)

# Per-request timeout override
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=500,
    messages=[...],
    timeout=30.0   # override for this call only
)

Async Patterns with asyncio

For concurrent processing — such as enriching records in a pipeline or handling multiple user sessions — the async SDK client allows many requests to run concurrently without multi-threading complexity. The async client is a drop-in counterpart: anthropic.AsyncAnthropic().

import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def process_item(item: str) -> str:
    async with client.messages.stream(
        model="claude-haiku-4-5",
        max_tokens=300,
        messages=[{"role": "user", "content": item}]
    ) as stream:
        return await stream.get_final_text()

async def batch_process(items: list, concurrency: int = 5):
    sem = asyncio.Semaphore(concurrency)  # respect rate limits
    async def guarded(item):
        async with sem:
            return await process_item(item)
    return await asyncio.gather(*[guarded(i) for i in items])
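With plain asyncio.gather, a single failed request raises out of the whole batch and the caller never sees the results that did succeed. For batch jobs it is often more useful to collect successes and failures separately. The sketch below does this with return_exceptions=True; `gather_bounded` and `fake_worker` are hypothetical names, and the fake worker stands in for a real API call so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


async def gather_bounded(
    items: List[str],
    worker: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
) -> Tuple[Dict[int, str], Dict[int, Exception]]:
    """Run worker over items with bounded concurrency.

    Returns (results, failures), both keyed by the item's index.
    """
    sem = asyncio.Semaphore(concurrency)

    async def guarded(item: str) -> str:
        async with sem:  # at most `concurrency` workers in flight
            return await worker(item)

    outcomes = await asyncio.gather(
        *(guarded(i) for i in items), return_exceptions=True
    )
    results: Dict[int, str] = {}
    failures: Dict[int, Exception] = {}
    for index, outcome in enumerate(outcomes):
        if isinstance(outcome, Exception):
            failures[index] = outcome
        else:
            results[index] = outcome
    return results, failures


async def fake_worker(item: str) -> str:
    await asyncio.sleep(0)  # stand-in for the API round trip
    if item == "bad":
        raise ValueError("simulated failure")
    return item.upper()


results, failures = asyncio.run(gather_bounded(["a", "bad", "c"], fake_worker))
print(results)         # {0: 'A', 2: 'C'}
print(list(failures))  # [1]
```

The failure indices can then be fed into a second pass, so one bad document does not force the whole batch to be rerun.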

SSE Event Types in Raw Streaming

When building custom streaming parsers (for non-SDK languages or WebSocket relay), understanding the raw SSE event sequence is essential. The key events are message_start, content_block_start, content_block_delta, content_block_stop, message_delta, and message_stop. Token text arrives in content_block_delta events with delta.type == "text_delta".
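As a rough illustration of that event sequence, the sketch below pulls the text out of a list of raw SSE lines. It assumes each event's JSON arrives on a single data: line, and the sample payloads are abbreviated stand-ins rather than a complete capture of a real stream; `iter_text_deltas` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from raw SSE lines, ignoring all other events."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        payload = json.loads(line[len("data:"):].strip())
        if payload.get("type") != "content_block_delta":
            continue
        delta = payload.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta.get("text", "")


sample = [
    'event: message_start',
    'data: {"type": "message_start", "message": {"id": "msg_example"}}',
    '',
    'event: content_block_start',
    'data: {"type": "content_block_start", "index": 0}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hel"}}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "lo"}}',
    '',
    'event: message_stop',
    'data: {"type": "message_stop"}',
]
print("".join(iter_text_deltas(sample)))  # Hello
```

A production parser would also need to handle error events and multi-line data fields, which this sketch deliberately leaves out.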

Streaming + Retry

Streaming complicates retry logic. If a stream fails mid-response, you cannot resume from the interruption — you must restart from scratch. For long generations, this means either accepting partial loss or buffering the entire response (defeating streaming's purpose for latency). Most production systems accept the tradeoff: stream for UX, retry the full request on failure, surface an error to users only on exhausted retries.
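The "retry the full request" tradeoff can be sketched independently of any SDK. In the example below, `open_stream` is a hypothetical callable that starts a fresh stream each time, `on_chunk` pushes text to the UI, and `on_reset` tells the UI to discard whatever partial text it has shown. ConnectionError stands in for whatever transport error your client raises, and backoff between attempts is omitted for brevity.

```python
from typing import Callable, Iterable, List, Optional


def collect_stream_with_retry(
    open_stream: Callable[[], Iterable[str]],
    on_chunk: Callable[[str], None],
    on_reset: Callable[[], None],
    max_attempts: int = 3,
) -> str:
    """Stream chunks to the UI; on failure, reset the UI and start over."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        chunks: List[str] = []
        try:
            for chunk in open_stream():
                chunks.append(chunk)
                on_chunk(chunk)
            return "".join(chunks)  # stream finished cleanly
        except ConnectionError as exc:
            last_error = exc
            on_reset()  # partial output cannot be resumed, so discard it
    raise RuntimeError("stream failed after all retries") from last_error


calls = {"n": 0}


def flaky_stream():
    """Fake stream that drops the connection midway on the first attempt."""
    calls["n"] += 1
    yield "Hel"
    if calls["n"] == 1:
        raise ConnectionError("connection dropped")
    yield "lo"


shown: List[str] = []
text = collect_stream_with_retry(flaky_stream, shown.append, shown.clear)
print(text)   # Hello
print(shown)  # ['Hel', 'lo']
```

The user briefly sees the partial text vanish and restart, which is usually preferable to silently stitching together output from two different generations.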

SSE: Server-Sent Events, a unidirectional HTTP streaming protocol used by the Anthropic API for token-by-token delivery.

TTFT: Time To First Token, the latency metric that streaming optimizes; typically 1–3 seconds for Claude.

Semaphore: An asyncio primitive that limits how many concurrent coroutines can run, used to avoid rate-limit violations in async batches.

AsyncAnthropic: The async variant of the Anthropic SDK client, compatible with Python's asyncio event loop.

Lesson 3 Quiz

Streaming, Timeouts & Async Patterns — 3 questions

What is the primary user experience benefit of streaming API responses?

Correct. Streaming delivers TTFT (time to first token) of 1–3 seconds regardless of total response length. Research showed users rate streamed responses as higher quality even when total latency is identical.

Streaming doesn't reduce total latency or cost — it improves perceived responsiveness by delivering the first visible output quickly, rather than making the user wait for the entire response to generate.

Why is a Semaphore used when batch-processing with AsyncAnthropic and asyncio.gather()?

Correct. Without a semaphore, asyncio.gather() would launch all requests simultaneously. A semaphore limits concurrency to a number that stays within your RPM/TPM quota, preventing an immediate 429 storm.

The semaphore is a rate-limiting tool. Launching 500 concurrent API calls simultaneously would immediately exhaust your rate limit. A semaphore with concurrency=5 ensures at most 5 calls are in flight at any moment.

What happens when a streaming response fails mid-stream?

Correct. SSE streaming is stateless from the server's perspective. A mid-stream failure requires a full restart. Production systems typically accept this tradeoff and retry the entire request, surfacing an error only after all retries are exhausted.

Streaming cannot be resumed. There is no checkpoint mechanism in the Anthropic API's streaming protocol. A failure requires restarting the entire request from scratch — an important consideration for very long generations.

Lab 3 — Streaming & Async Design

Practice session · design a streaming chatbot and async batch pipeline with your coach

Your Mission

You're building two things: a user-facing chatbot that must feel responsive, and a nightly batch job that processes 1,000 documents. Work with your coach to design the streaming and async patterns for both use cases.

Suggested opener: "I need to stream responses in a FastAPI chatbot and also run a 1,000-document batch job each night. What's the architecture for each, and how do I handle rate limits in the async batch?"

Async Architecture Coach

Lab 3

Great — two distinct patterns for two different use cases. Let's start with the chatbot. Tell me your tech stack and what "responsive" means to your users. Are they watching token-by-token output, or waiting for a complete block?

Module 8 · Lesson 4

Observability, Safety & Graceful Degradation

Production systems fail in ways that staging never reveals. Observability turns invisible failures into fixable incidents.

How do you monitor LLM API integrations effectively and handle failures so users are never left without a response?

In June 2023, Notion's AI writing feature experienced degraded availability when an upstream model API returned higher-than-normal error rates during a regional incident. Teams that had built graceful degradation (showing users an "AI temporarily unavailable, try again" message while logging the event) retained user trust. Those whose systems surfaced raw API errors or hung indefinitely did not. The operational lesson: every LLM integration should assume the model API will be unavailable, slow, or wrong at some point.

What to Log on Every Request

The minimum viable telemetry for a production Claude integration captures: request metadata, response metadata, latency, token usage, error classification, and a correlation ID that threads through your entire stack. This gives you cost accounting, latency dashboards, error-rate alerting, and the ability to reconstruct any problematic interaction.

import logging
import time
import uuid

import anthropic

logger = logging.getLogger("claude.api")

def tracked_create(client, correlation_id=None, **kwargs):
    cid = correlation_id or str(uuid.uuid4())  # threads through every log line
    t0  = time.monotonic()                     # monotonic clock for latency

    try:
        response = client.messages.create(**kwargs)
        latency  = time.monotonic() - t0
        logger.info({
            "event":          "claude_request",
            "correlation_id": cid,
            "model":          kwargs.get("model"),
            "latency_ms":     round(latency * 1000),
            "input_tokens":   response.usage.input_tokens,
            "output_tokens":  response.usage.output_tokens,
            "stop_reason":    response.stop_reason,
        })
        return response

    except anthropic.APIError as e:
        logger.error({
            "event":          "claude_error",
            "correlation_id": cid,
            "error_type":     type(e).__name__,
            "status_code":    getattr(e, 'status_code', None),
            "latency_ms":     round((time.monotonic() - t0) * 1000),
        })
        raise

Error Classification

Not all errors warrant the same response. The Anthropic SDK exposes a typed exception hierarchy that lets you handle each case appropriately:

Retriable Errors

RateLimitError (429) — backoff and retry
APIStatusError 529 — overloaded, retry
APIConnectionError — network transient
APITimeoutError — retry with longer timeout

Non-Retriable Errors

AuthenticationError (401) — fix API key
PermissionDeniedError (403) — tier/policy
BadRequestError (400) — fix your request
NotFoundError (404) — wrong model ID
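The two lists above can be condensed into a small decision function keyed on the HTTP status code. In this sketch a status of None represents a connection error or timeout (no HTTP response at all), and treating every other 5xx as retriable is an assumption that goes slightly beyond the lists; `is_retriable` is a hypothetical helper.

```python
from typing import Optional

RETRIABLE_STATUS = {429, 529}                # rate limited, overloaded
NON_RETRIABLE_STATUS = {400, 401, 403, 404}  # fix the request or configuration


def is_retriable(status_code: Optional[int]) -> bool:
    """Decide whether a failed call is worth retrying."""
    if status_code is None:
        return True   # no HTTP response: network error or timeout
    if status_code in NON_RETRIABLE_STATUS:
        return False
    # Assumption: other 5xx responses are transient server-side failures.
    return status_code in RETRIABLE_STATUS or status_code >= 500


for code in (429, 529, None, 401, 400, 404):
    print(code, is_retriable(code))
```

Keeping the policy in one function means the retry loop, the logging layer, and the alerting rules all agree on what counts as transient.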

Graceful Degradation Patterns

A production system should never surface a raw API error to an end user. At minimum, define fallback behavior for each error class. Common patterns include: returning a cached previous response, routing to a simpler model, returning a canned "service temporarily unavailable" message, or queuing the request for retry and notifying the user asynchronously.

async def safe_complete(prompt: str, fallback: str = "") -> str:
    try:
        response = await client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text

    except anthropic.BadRequestError:
        # Client error: log and fail cleanly. Handled first, because a
        # broad APIStatusError clause would catch it before this one runs.
        logger.error("Invalid request", extra={"prompt_len": len(prompt)})
        return fallback

    except (anthropic.RateLimitError, anthropic.InternalServerError):
        # Retriable (429 and 5xx, including 529 overloaded): queue for async retry
        await retry_queue.push(prompt)
        return "Response delayed. We'll notify you when ready."

    except anthropic.APIConnectionError:
        # Network issue: try cached response
        cached = await cache.get_similar(prompt)
        return cached or fallback

Safety: Content Moderation in Production

The Anthropic API may return a stop_reason of "max_tokens" (truncated) or trigger a refusal. Always check stop_reason. A truncated response from a classification task may return an incomplete JSON object that crashes your parser. A content-moderated response may return an empty content array. Production code must handle both cases explicitly rather than assuming a valid text response always arrives.
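A defensive parser for structured output might look like the sketch below. It takes the stop reason and the list of text fragments as plain values so it can run without the SDK; with a real response you would pass response.stop_reason and the text of each text block in response.content. The exception and function names are hypothetical.

```python
import json
from typing import List, Optional


class IncompleteResponse(Exception):
    """The model's reply cannot be safely parsed as complete JSON."""


def parse_json_reply(stop_reason: Optional[str], texts: List[str]) -> dict:
    """Validate a reply before parsing it as JSON."""
    if not texts:
        # Nothing came back, for example after a refusal.
        raise IncompleteResponse("response contained no text content")
    if stop_reason == "max_tokens":
        # Output was cut off, so any JSON is probably incomplete.
        raise IncompleteResponse("response truncated by max_tokens")
    try:
        return json.loads("".join(texts))
    except json.JSONDecodeError as exc:
        raise IncompleteResponse("response was not valid JSON") from exc


print(parse_json_reply("end_turn", ['{"label": "positive"}']))
# {'label': 'positive'}
```

Raising one dedicated exception type lets the caller decide in a single place whether to retry with a larger max_tokens, fall back, or surface an error.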

Operational Checklist

Before going to production:
✓ All requests log correlation IDs.
✓ Token usage tracked per request.
✓ Error types classified and handled distinctly.
✓ Fallback responses defined for each error class.
✓ Alerts configured on error rate and p95 latency.
✓ stop_reason checked on every non-streaming response.
✓ Streaming failures trigger full retry, not partial display.
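For the latency alert, it helps to see how differently the mean and the p95 behave on the same data. The sketch below uses the nearest-rank definition of a percentile, which is one of several common conventions; the sample latencies are made up for illustration.

```python
import math
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least
    pct% of the samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 90 fast requests and 10 very slow ones (seconds).
latencies = [0.5] * 90 + [30.0] * 10
mean = sum(latencies) / len(latencies)
print(f"mean={mean:.2f}s p50={percentile(latencies, 50)}s p95={percentile(latencies, 95)}s")
# mean=3.45s p50=0.5s p95=30.0s
```

The mean suggests a mildly slow service, while the p95 shows that a meaningful share of users wait 30 seconds, which is the number an alert threshold should be built around.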

Correlation ID: A UUID threaded through all log events for a single request, enabling full trace reconstruction.

stop_reason: The field on every Anthropic response indicating why generation stopped, for example "end_turn", "max_tokens", "stop_sequence", or "tool_use".

Graceful degradation: The design pattern where a system reduces functionality under failure rather than crashing or surfacing raw errors.

p95 latency: The 95th percentile response time, a more reliable SLO metric than mean latency for API monitoring.

Lesson 4 Quiz

Observability, Safety & Graceful Degradation — 3 questions

Why should every API request log a correlation ID alongside the request metadata?

Correct. In distributed systems, a single user request may generate dozens of log entries across multiple services. A correlation ID lets you grep every relevant log line for one interaction — essential for debugging production issues.

Correlation IDs are a distributed tracing tool. They let engineers reconstruct the full sequence of events for a specific request by filtering logs to a single UUID, even when those logs span multiple services and machines.

Which Anthropic API errors are non-retriable and require code/configuration fixes rather than retry logic?

Correct. 4xx errors except 429 indicate problems with your request or credentials that will not self-resolve. Retrying a 401 (bad API key) or 400 (malformed request) is pointless and wastes quota.

Non-retriable errors are those caused by client-side problems: wrong API key (401), malformed request (400), wrong model name (404). These require code or configuration fixes. Retrying them will fail the same way every time and only adds load and latency.

Why must production code explicitly check the stop_reason field on every Anthropic API response?

Correct. If you expect JSON and the model is cut off mid-object by max_tokens, you'll receive invalid JSON. Checking stop_reason lets you detect truncation and either request a continuation or handle the incomplete response gracefully.

stop_reason is a safety check. If generation stops because max_tokens was reached (not "end_turn"), your response may be truncated mid-sentence or mid-structure. Assuming a complete response always arrives is a common source of parser crashes in production.

Lab 4 — Production Observability

Practice session · design monitoring, alerting, and fallback strategies with your coach

Your Mission

Your Claude-powered feature is going to production next week. Your SRE team is asking what dashboards, alerts, and runbooks you'll have in place. Work through the observability design with your coach — what to log, what alerts to set, and how the system behaves when the API goes down.

Suggested opener: "My SRE team wants dashboards and runbooks before we go live. What metrics should I track for a Claude integration, what alert thresholds make sense, and what does the on-call runbook look like for an API outage?"

SRE & Observability Coach

Lab 4

Good — shipping with observability beats shipping fast and debugging blind. Tell me about your integration: is it user-facing, async batch, or both? That determines which latency metrics and alert thresholds make sense.

Module 8 — Production Patterns

Module Test · 15 questions · 80% to pass

1. What HTTP status code does the Anthropic API return when you exceed your rate limit?

Correct. 429 is the standard rate-limit response. The API includes a retry-after header indicating the wait time.

Rate limiting returns 429 Too Many Requests, with a retry-after header. 503 is a service availability error; 401 is authentication failure.

2. Which two rate-limit dimensions does Anthropic track independently?

Correct. RPM limits the number of API calls regardless of size; TPM limits total token throughput. Either can trigger a 429 independently.

Anthropic tracks requests per minute (RPM) and tokens per minute (TPM). A single large-context request can exhaust your TPM budget before you hit the RPM limit.

3. What does "exponential backoff" mean in the context of retry logic?

Correct. Delays follow a sequence like 1s, 2s, 4s, 8s, 16s — preventing continuous hammering of an overloaded service.

Exponential backoff means each retry waits roughly 2× longer than the last. This prevents a flood of retries from overwhelming a service that is already struggling.

4. The Anthropic Python SDK includes built-in retry logic via which client parameter?

Correct. anthropic.Anthropic(max_retries=3) enables built-in retry logic with exponential backoff for simple use cases.

The SDK's built-in retry parameter is max_retries, set at client initialization: anthropic.Anthropic(max_retries=3).

5. Why do output tokens cost more per million than input tokens in Anthropic's pricing?

Correct. Each output token requires a full forward pass, while input tokens are processed in a single parallel attention computation.

The compute asymmetry: input processing is one parallel attention pass over all tokens; output generation is N sequential forward passes, one per token — hence higher cost.

6. What API field is used to mark content blocks for Anthropic's prompt caching feature?

Correct. The cache_control field with type "ephemeral" marks a content block for caching. Subsequent requests sharing that prefix pay ~10% of normal input token cost.

The correct field is cache_control: {"type": "ephemeral"} placed on the content block you want cached in subsequent requests.

7. What is TTFT, and why is it the key metric for streaming API responses?

Correct. TTFT determines whether an interface feels responsive. Streaming delivers the first visible tokens in 1–3 seconds regardless of total response length.

TTFT is Time To First Token. It's the primary UX metric for streaming because users see output begin immediately, rather than waiting for the full generation to complete before anything appears.

8. What protocol does the Anthropic API use for streaming responses?

Correct. SSE (Server-Sent Events) is a unidirectional HTTP streaming protocol where the server pushes data to the client over a persistent connection.

Anthropic uses Server-Sent Events (SSE) — a standard HTTP streaming mechanism. The SDK abstracts this, but raw integrations parse SSE event lines directly.

9. An asyncio.Semaphore with concurrency=5 in a batch processing loop ensures what?

Correct. The semaphore allows at most 5 coroutines to enter the critical section simultaneously — here, 5 concurrent in-flight API requests — preventing a rate-limit-triggering burst.

A Semaphore(5) limits concurrency to 5 simultaneous operations. In API batching, this prevents 500 requests from launching at once and immediately triggering rate limits.

10. What happens when a streaming response from the Anthropic API fails mid-stream?

Correct. There is no mid-stream resume mechanism. A failure requires full restart — production code should retry the entire request and surface an error only after retries are exhausted.

Streaming is stateless. A mid-stream connection failure requires restarting the request from the beginning. Production systems typically retry the full request and show an error to users only after all retries fail.

11. What is the purpose of a correlation ID in API request logging?

Correct. A UUID correlation ID threaded through all log entries lets engineers find every log line for a specific request, even when those logs span multiple services.

Correlation IDs are a distributed tracing tool. They allow an engineer to grep all logs for a single UUID and reconstruct the exact sequence of events for one specific request or user session.

12. Which stop_reason value indicates a response was cut off before the model finished generating?

Correct. "max_tokens" means the generation was truncated by the max_tokens limit. "end_turn" indicates the model completed naturally. Truncated JSON or structured output will be invalid.

When max_tokens is reached before the model naturally finishes, stop_reason is "max_tokens". This indicates potentially truncated output that may break downstream parsers expecting complete structured data.

13. Which Anthropic error type should NOT be retried automatically?

Correct. An AuthenticationError (401) means your API key is invalid or missing. No amount of retrying will fix this — the code or configuration must be corrected.

AuthenticationError (401) is non-retriable. Retrying with a bad API key will always fail. Network errors, rate limits, and timeouts are transient and worth retrying; authentication failures are not.

14. What is "graceful degradation" in the context of LLM API integrations?

Correct. Graceful degradation means users receive a useful response (a fallback message, cached result, or queue acknowledgment) even when the API is unavailable, rather than a crash or raw error.

Graceful degradation is the design principle where failures reduce functionality rather than causing complete failure. Users see "service temporarily unavailable" rather than stack traces or hung requests.

15. Why is p95 latency preferred over mean latency as an SLO metric for API monitoring?

Correct. A mean latency of 2s could hide 5% of requests taking 30 seconds. P95 tells you the worst experience that 95% of users will encounter, a far more actionable metric for setting user experience SLOs.

Mean latency hides tail latency — a few very slow requests barely move the mean but massively impact user experience. P95 tells you: 95% of requests are faster than this threshold, making it a better SLO anchor.