Working with the Anthropic API · Introduction

You've read a hundred blog posts. Now you open the editor.

After the theory, the craft. This is the nuts-and-bolts course.

There comes a point in every technology transition where the reading stops and the building starts. You've read about LLMs. You've used the chat interface. You've followed the news. Eventually you sit down to actually write code that calls an API and returns a useful result, and everything you knew abstractly has to become concrete.

The Anthropic API is Claude's front door for developers. It powers most of the applications people have built on Claude, from startups to enterprise products to internal tools. It has its own conventions, limits, pricing curves, retry behaviors, and streaming protocols, plus a feature set that has moved faster than any documentation can keep up with.

This course is the hands-on, code-first guide. It covers authentication, the messages API, streaming, tool use, system prompts, context windows, pricing tiers, rate limits, error handling, batch processing, the Files and Search APIs, and the dozen specific gotchas that trip up every developer in their first week. By the end, you'll be able to build something real.

If you finish every module, here's who you become:

You'll understand how the Messages API actually works — roles, content types, request structure, and what Claude expects on every call.
You'll write production-ready code that handles streaming responses, retries on rate limits, and fails gracefully when things go wrong.
You'll know how prompt caching, token counting, and context window management translate directly into lower bills and faster responses.
You'll be able to define tools, wire up function-calling workflows, and pass images or documents into Claude without guessing at the format.
You'll recognize the specific error patterns, limit curves, and gotchas that slow down most developers in their first week — and know how to avoid them.
You'll design batch jobs and async architectures that let Claude work at scale, not just in a single chat loop.
You're becoming a developer who doesn't just use Claude through a UI — you build with it, confidently, from a cold editor.

Lesson 1 · The Messages API

Anatomy of a Request

Every call to Claude begins with a single HTTP endpoint — understanding its structure is the foundation of everything else.

What exactly happens when you send a message to Claude, and why does the shape of that request matter so much?

When Anthropic introduced the Messages API, developers noticed something deliberate about its design: unlike text-completion APIs that take the whole conversation as a single prompt string, the Messages API treats every turn of a conversation as a discrete, typed object. This was a conscious choice. Anthropic's engineers had observed that ambiguous history formats were a leading cause of prompt injection and context-confusion bugs in production systems. The typed message array was their answer.

The Single Endpoint

The entire Messages API lives at one URL: POST https://api.anthropic.com/v1/messages. There is no separate endpoint for chat, for completions, or for streaming — they are all the same endpoint with different parameters. This simplicity is intentional: Anthropic wanted developers to think in terms of messages, not in terms of prompt/completion pairs.

Every request requires two HTTP headers beyond standard JSON content-type: x-api-key carrying your secret key, and anthropic-version specifying the API version string (currently 2023-06-01). The version header exists because Anthropic occasionally makes breaking changes; pinning a version protects your production code from unexpected behavior.
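To make the header requirements concrete, here is a minimal sketch that builds (but does not send) the raw HTTP request using only the Python standard library. The helper name build_request and the placeholder key are assumptions for illustration; the URL, header names, and version string come from the text above.

```python
import json
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"


def build_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build a POST request for the Messages API without sending it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,               # secret key
            "anthropic-version": "2023-06-01",  # pinned API version
            "content-type": "application/json",
        },
        method="POST",
    )


request = build_request(
    "YOUR_API_KEY",  # placeholder, not a real key
    {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Hello, Claude."}],
    },
)
# urllib.request.urlopen(request) would actually send it.
```

In practice the SDK sets these headers for you; building the request by hand is mainly useful for debugging or for environments where the SDK is unavailable.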

# Minimal valid request (Python)
import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello, Claude."}
    ]
)
print(message.content[0].text)

Required Parameters

Three parameters are required on every call. Leave out any one of them and the API returns a 400 error immediately.

Parameter	Type	Purpose
model	string	Which Claude model to use, e.g. claude-opus-4-5, claude-sonnet-4-5, or claude-3-5-haiku-latest.
max_tokens	integer	Hard ceiling on output length. The model will stop generating at this many tokens even mid-sentence. Required to prevent runaway costs.
messages	array	Ordered list of message objects, each with a role ("user" or "assistant") and a content field.

The Messages Array

The messages array is the heart of the API. Each element is an object with two fields: role and content. Roles are expected to alternate between "user" and "assistant", starting with a user turn; avoid sending two consecutive user turns or two consecutive assistant turns. (Current API versions are documented as merging consecutive same-role turns into one rather than rejecting them, but alternation remains the safe convention.) This constraint reflects how the model was trained: on alternating human/assistant dialogues.

To simulate a multi-turn conversation, you populate the array yourself with the full history. The API is stateless — it does not remember previous calls. Every request must carry its entire conversational context. This design makes scaling trivial (any server can handle any request) but places the burden of history management on the developer.

# Multi-turn conversation
messages = [
    {"role": "user",      "content": "What is photosynthesis?"},
    {"role": "assistant", "content": "Photosynthesis is..."},
    {"role": "user",      "content": "Can you give me an analogy?"}
]
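Because the API is stateless, the history bookkeeping falls to your code. A minimal sketch of one way to do it follows; the Conversation class is a hypothetical helper, not part of the SDK, and it enforces the alternating-role convention described in this lesson.

```python
class Conversation:
    """Hypothetical helper that holds the history a stateless API needs."""

    def __init__(self) -> None:
        self.messages: list[dict[str, str]] = []

    def add(self, role: str, content: str) -> None:
        # Even positions (0, 2, 4, ...) are user turns, odd ones assistant turns.
        expected = "user" if len(self.messages) % 2 == 0 else "assistant"
        if role != expected:
            raise ValueError(f"expected a {expected!r} turn, got {role!r}")
        self.messages.append({"role": role, "content": content})


conversation = Conversation()
conversation.add("user", "What is photosynthesis?")
conversation.add("assistant", "Photosynthesis is...")
conversation.add("user", "Can you give me an analogy?")
# conversation.messages is what you would pass as messages= on the next call.
```

After each API call you would append the assistant's reply with add("assistant", ...) so the next request carries the full context.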

The Response Object

The API returns a JSON object with a predictable structure. The most important fields are content (an array of content blocks), stop_reason (why generation stopped), and usage (token counts for billing). The content field is an array, not a string, because Claude can return multiple blocks, including text and tool-use requests, in a single response.

The stop_reason field most commonly takes one of four values: end_turn (model finished naturally), max_tokens (hit your ceiling), stop_sequence (matched a custom stop string), or tool_use (model invoked a tool). Newer API features add further values, so treat an unrecognized value as something to log rather than an impossibility. Checking this field in production code prevents you from silently serving truncated responses to users.
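A minimal sketch of the check described above, working on a response body that has already been decoded from JSON. The function name extract_text and the sample payload are assumptions for illustration; the field names (content, stop_reason, usage) are the ones this lesson describes.

```python
def extract_text(response: dict) -> tuple[str, bool]:
    """Return (text, truncated) for a decoded Messages API response body."""
    # content is a list of blocks; join only the text blocks.
    text = "".join(
        block["text"] for block in response["content"] if block["type"] == "text"
    )
    # max_tokens means the output hit the ceiling and may be cut off.
    truncated = response["stop_reason"] == "max_tokens"
    return text, truncated


# Hand-written sample payload, not real API output.
sample = {
    "content": [{"type": "text", "text": "Photosynthesis is the process by which"}],
    "stop_reason": "max_tokens",
    "usage": {"input_tokens": 12, "output_tokens": 8},
}
text, truncated = extract_text(sample)
```

When truncated is true, an application might raise max_tokens and retry, or show the user an explicit "response was cut short" notice instead of presenting the text as complete.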

Design Insight

Anthropic's decision to make the API stateless — requiring developers to send full conversation history on every request — was partly a safety measure. It makes the model's full context visible and auditable on every call, rather than hidden in server-side session state. Anthropic's internal red teams found that stateful session management created subtle attack surfaces where adversarial inputs could persist across user sessions.

Key Terms

Messages API: Anthropic's primary API endpoint (POST /v1/messages) for sending prompts to Claude and receiving completions.

anthropic-version: Required HTTP header that pins the API to a specific version string, protecting production code from breaking changes.

max_tokens: Required integer parameter setting the hard ceiling on output length. The model stops generating at this count regardless of sentence completeness.

stop_reason: Field in the response indicating why generation stopped: end_turn, max_tokens, stop_sequence, or tool_use.

Stateless API: An API that retains no memory between calls; each request must include its full context. The Messages API is stateless by design.

Lesson 1 Quiz

Anatomy of a Request — 4 questions

1. Which three parameters are required on every call to the Messages API?

Correct. model, max_tokens, and messages are the three required parameters. temperature and system are optional.

Not quite. The three required parameters are model, max_tokens, and messages. temperature and system are both optional.

2. Why is the Messages API described as "stateless"?

Correct. The API holds no session state. Developers must send the entire message history with every request.

Not quite. Stateless here means the server retains no memory between API calls — you must include the full conversation history in every request.

3. What does a stop_reason of "max_tokens" indicate?

Correct. max_tokens as a stop_reason means your specified ceiling was reached. The output may be incomplete — always check this field in production.

Not quite. A stop_reason of max_tokens means your specified integer limit was reached. The model stopped mid-generation, and the response may be truncated.

4. What is the purpose of the anthropic-version header?

Correct. The anthropic-version header is a versioning mechanism. Pinning it ensures Anthropic's future API changes do not silently break your production code.

Not quite. The anthropic-version header pins your code to a specific API behavior contract. It is separate from authentication (the x-api-key header) and from model selection (the model parameter).

Lab 1 — Request Anatomy

Practice with an AI tutor · Complete 3 exchanges to finish

Your Task

You are exploring the structure of a Messages API request. Use the AI tutor below to ask questions about required parameters, the messages array format, response fields, or anything covered in Lesson 1.

Suggested start: "What would happen if I forgot to include max_tokens in my API call?" — or ask your own question about request anatomy.

API Tutor

Messages API · Lesson 1

Hello! I'm your tutor for Lesson 1: Anatomy of a Request. Ask me anything about the Messages API endpoint, required parameters, the messages array, response structure, or why the API is stateless. What would you like to explore?

Lesson 2 · The Messages API

The System Prompt

Before your user speaks a single word, you can shape Claude's entire persona, constraints, and knowledge context.

How does the system prompt differ from a user message, and what makes it uniquely powerful for production applications?

In 2023, Notion shipped its AI writing assistant — built on Claude — with a system prompt that ran to several hundred words. The prompt established Claude as a writing collaborator, defined the output format Notion expected (markdown with specific heading conventions), and explicitly prohibited the model from discussing competitors. Notion's engineering blog noted that the system prompt was the single highest-leverage lever they had: a two-sentence change to the system prompt produced measurably different user satisfaction scores in A/B tests, far outweighing changes to temperature or model version.

What the System Parameter Is

The system parameter is an optional top-level string in the request body, outside the messages array. It sits at a different privilege level than user messages — Claude's training treats it as higher-authority instructions that frame how the entire conversation should proceed. It is not a "system role" message inside the array (that syntax belongs to OpenAI's API, not Anthropic's).

Because system sits outside the messages array, it cannot be "overwritten" by a user message in the same way. A user typing "Ignore your previous instructions" inside a message turn is working against a system prompt — the model treats these as different levels of trust. This architectural choice was documented by Anthropic's alignment team as part of their hierarchical prompt trust framework.

# System prompt in context
message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,
    system=(
        "You are a concise technical writer. Respond only in "
        "bullet points. Never use more than 5 bullets. Do not "
        "discuss topics unrelated to software documentation."
    ),
    messages=[
        {"role": "user", "content": "Explain API rate limiting."}
    ]
)

What to Put in a System Prompt

Anthropic's published guidance identifies four categories of system prompt content that reliably improve output quality:

Role and persona. Telling Claude who it is ("You are a customer support agent for Acme Corp") focuses its responses and reduces scope-creep. Anthropic's internal testing showed that persona instructions in the system prompt reduced off-topic responses by roughly 30% compared to the same instructions in a user turn.

Output format. Specifying the expected format (JSON, markdown, plain prose, numbered lists) in the system prompt is more reliable than asking for it per-turn, because it establishes a standing expectation rather than a one-off instruction.

Constraints and prohibitions. What Claude should not do — discussing competitors, revealing internal pricing, speaking outside its domain of expertise. These work best when framed positively where possible ("Focus exclusively on X") rather than as prohibitions ("Do not discuss Y").

Context and knowledge. Background information that applies to every turn: company name, product details, user tier, current date. This is far more token-efficient than repeating context in every user message.
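One way to keep these four categories organized is to assemble the system prompt from named parts. The sketch below is only an illustration: build_system_prompt and the Acme Corp details are hypothetical, and the layout of the resulting string is a design choice rather than an API requirement.

```python
def build_system_prompt(
    role: str,
    output_format: str,
    constraints: list[str],
    context: dict[str, str],
) -> str:
    """Combine the four categories into one system prompt string."""
    sections = [
        role,
        f"Output format: {output_format}",
        "Constraints:\n" + "\n".join(f"- {rule}" for rule in constraints),
        "Context:\n" + "\n".join(f"{key}: {value}" for key, value in context.items()),
    ]
    return "\n\n".join(sections)


system_prompt = build_system_prompt(
    role="You are a customer support agent for Acme Corp.",
    output_format="Plain prose, at most three short paragraphs.",
    constraints=["Focus exclusively on Acme Corp products."],
    context={"Company": "Acme Corp", "User tier": "Free"},
)
```

The resulting string would be passed as the system argument of client.messages.create.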

System Prompts and Token Costs

Every system prompt token is billed on every API call, even if the user's message is short. A 500-token system prompt on 10,000 daily requests adds 5 million tokens per day to your input bill. Anthropic introduced prompt caching in 2024 specifically to address this: by marking a system prompt with a cache_control parameter, the API stores a compiled version and charges only 10% of the normal input token price for cache hits. This made long, rich system prompts economically viable for high-volume applications.
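The arithmetic above, and the shape of a cacheable system prompt, can be sketched as follows. The function names are hypothetical. The sketch assumes the 10% cache-read price quoted above and deliberately ignores two things the API documents: a surcharge on cache writes and a minimum cacheable prompt length, so treat the numbers as a rough estimate only.

```python
def cached_system_blocks(prompt: str) -> list[dict]:
    """Wrap a system prompt in a text block marked as cacheable."""
    return [
        {
            "type": "text",
            "text": prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ]


def billed_input_tokens(
    prompt_tokens: int, requests: int, cache_hits: int, read_percent: int = 10
) -> int:
    """System-prompt tokens billed at full-price equivalents (integer math)."""
    misses = requests - cache_hits
    full_price = misses * prompt_tokens
    discounted = cache_hits * prompt_tokens * read_percent // 100
    return full_price + discounted


uncached = billed_input_tokens(500, 10_000, cache_hits=0)
mostly_cached = billed_input_tokens(500, 10_000, cache_hits=9_900)
```

Here uncached reproduces the 5 million tokens per day from the text, while a 99% hit rate brings the full-price equivalent down to 545,000. The list returned by cached_system_blocks would be passed as the system argument in place of a plain string.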

Prompt Injection Risk

System prompts are not cryptographically sealed. A sophisticated user who convinces Claude that the "true" system prompt has changed — through elaborate role-play, hypotheticals, or adversarial instructions — can sometimes override system prompt instructions. Anthropic's red team published findings in 2023 noting that constraints expressed as absolute rules ("Never, under any circumstances...") were more robust than soft preferences ("Try to avoid..."). Defense-in-depth — combining system prompt constraints with application-layer filtering — is the recommended production pattern.

Key Terms

system parameter: Optional top-level string in the request body that provides high-authority framing instructions before the conversation begins. Separate from the messages array.

prompt caching: A 2024 Anthropic feature allowing system prompts to be cached server-side; cache hits are charged at 10% of normal input token price.

hierarchical trust: The principle that system prompt instructions carry higher authority than user turn instructions in Claude's training and behavior.

prompt injection: An attack where adversarial user input attempts to override or bypass system prompt instructions.

Lesson 2 Quiz

The System Prompt — 4 questions

1. Where does the system parameter appear in a Messages API request?

Correct. The system parameter is a top-level field in the request body, not a role inside the messages array. That's an OpenAI-style convention, not Anthropic's.

Not quite. In the Anthropic API, system is a top-level parameter in the request body, separate from the messages array. The "system" role inside messages is an OpenAI convention.

2. What was Anthropic's primary motivation for introducing prompt caching in 2024?

Correct. Prompt caching directly addresses the economics of long system prompts at scale — cache hits cost 10% of normal input token price.

Not quite. Prompt caching was introduced to cut costs for long system prompts at high request volumes. Cache hits are charged at 10% of standard input token pricing.

3. Which of these is a recommended practice for writing constraint instructions in a system prompt?

Correct. Anthropic's 2023 red team findings showed that absolute-rule framing was more resistant to prompt injection than soft preference language.

Not quite. Anthropic's red team research found that absolute rules ("Never, under any circumstances...") were more robust against adversarial override attempts than soft preferences.

4. Which of these belongs in a system prompt rather than a per-turn user message?

Correct. Standing context — persona, format, product info — belongs in the system prompt. Task-specific or turn-specific content belongs in user messages.

Not quite. System prompts carry standing context that applies to every turn: persona, format requirements, company details. Turn-specific tasks and content go in user messages.

Lab 2 — System Prompts

Practice with an AI tutor · Complete 3 exchanges to finish

Your Task

You are designing system prompts for a real application. Use the tutor to explore best practices, discuss what to include, how prompt caching works, or why system prompts carry higher authority than user messages.

Suggested start: "I'm building a customer support bot. What should I put in my system prompt?" — or ask your own question about system prompts.

API Tutor

System Prompts · Lesson 2

Hello! I'm your tutor for Lesson 2: The System Prompt. Ask me anything about structuring system prompts, what to include, how prompt caching reduces costs, or how to write robust constraints against prompt injection. What are you working on?

Lesson 3 · The Messages API

Models, Tokens & Sampling

Choosing the right model and tuning sampling parameters is the difference between a useful application and an unreliable one.

What do temperature, top_p, and top_k actually control — and when should you change them from their defaults?

In late 2024, Anthropic released Claude Haiku 3.5 at a price point roughly 25× cheaper than Claude Opus 4.5 per token. Within weeks, engineering teams at several AI-native startups publicly documented on technical blogs that they had migrated classification, routing, and extraction tasks from Opus to Haiku with no measurable quality regression — while reducing API costs by over 90%. The model selection decision had become a first-order architectural concern, not an afterthought.

The Model Family

Anthropic organizes Claude into three tiers, each reflecting a different capability-cost tradeoff. Understanding these tiers matters enormously for production cost management.

Model Tier	Example	Best For
Opus	claude-opus-4-5	Complex reasoning, nuanced writing, multi-step analysis where quality is paramount and cost is secondary.
Sonnet	claude-sonnet-4-5	General-purpose tasks with a good balance of capability and cost. The default choice for most production workloads.
Haiku	claude-3-5-haiku-latest	High-volume, latency-sensitive tasks: classification, extraction, routing, short-form responses.

How Tokens Work

Claude processes text as tokens — chunks of characters roughly 3–4 characters each on average for English. The word "photosynthesis" is two tokens; the word "a" is one. Token counts determine both billing (input tokens and output tokens are priced separately) and context limits (each model has a maximum context window, currently 200,000 tokens for Claude 3.x and Claude 4.x models).

A critical distinction separates the context window (the maximum total tokens in a single request, input plus output) from max_tokens (the ceiling you set for output in this specific call). You can have a 200K context window but set max_tokens to 500, meaning the model can read nearly 200K tokens of input but will write at most 500 tokens in response.
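Since input and output share the same window, a simple budget check before sending a request can catch oversized prompts early. This is a sketch under the 200,000-token figure quoted above; fits_in_context is a hypothetical helper, and the input token count is assumed to come from whatever token-counting method you use.

```python
CONTEXT_WINDOW = 200_000  # figure quoted in this lesson; check your model's limit


def fits_in_context(
    input_tokens: int, max_tokens: int, window: int = CONTEXT_WINDOW
) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return input_tokens + max_tokens <= window


ok = fits_in_context(input_tokens=199_000, max_tokens=500)
too_big = fits_in_context(input_tokens=199_800, max_tokens=500)
```

The second call fails the check because 199,800 input tokens leave only 200 tokens of room, less than the 500 reserved for output.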

Temperature

Temperature controls randomness in token selection. At temperature=0, the model effectively always picks the most probable next token: highly consistent (though Anthropic notes that output is not guaranteed to be fully deterministic), but prone to repetition on longer outputs. At temperature=1 (the default), the model samples proportionally from the probability distribution, producing varied and natural output. Claude's API accepts temperatures from 0.0 to 1.0, so the very high settings that make output incoherent in some other systems are not available here.

Anthropic's documentation recommends a temperature close to 0 for tasks where consistency and correctness matter more than variety: classification, extraction, code generation, factual Q&A. It recommends a temperature closer to 1 for creative tasks: brainstorming, story writing, generating diverse options.

# Low temperature for deterministic extraction
message = client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=256,
    temperature=0,
    messages=[{"role": "user",
        "content": "Extract the date from: 'Meeting on March 5th 2025'"}]
)

# Higher temperature for creative generation
message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    temperature=1,
    messages=[{"role": "user",
        "content": "Give me 10 unusual names for a bakery."}]
)
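To build intuition for what the temperature setting does, the sketch below applies the textbook temperature-scaled softmax to a toy set of scores. It illustrates the general technique only; it is not Anthropic's implementation, and the logits are made up.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
greedy = softmax_with_temperature(toy_logits, 0)
sharp = softmax_with_temperature(toy_logits, 0.5)
default = softmax_with_temperature(toy_logits, 1.0)
```

Lower temperatures concentrate probability on the top candidate (sharp[0] is larger than default[0]), which is why low settings give more consistent output.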

top_p and top_k

top_p (nucleus sampling) restricts token selection to the smallest set of tokens whose cumulative probability exceeds p. Setting top_p=0.9 means the model only ever considers tokens that together account for 90% of the probability mass — cutting out long-tail nonsense tokens. This is often more stable than temperature for controlling randomness.

top_k restricts selection to the k most probable tokens at each step, regardless of probability. Setting top_k=10 means only the top 10 candidate tokens are ever considered. Anthropic's documentation notes that you should use temperature or top_p, not both simultaneously — combining them produces unpredictable interactions. top_k is a coarser control and is rarely needed when temperature or top_p are set.
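The two filters can be illustrated on a toy distribution. This is a sketch of the general top-k and nucleus techniques, not of Claude's internals; the token probabilities are invented, and both functions renormalize the surviving tokens so they sum to 1.

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    kept = []
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


toy_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}  # invented values
nucleus = top_p_filter(toy_probs, 0.9)
top_two = top_k_filter(toy_probs, 2)
```

With p=0.9 the long-tail token "zebra" is dropped while "the", "a", and "an" survive; with k=2 only "the" and "a" remain, regardless of how much probability mass they cover.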

Stop Sequences

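The optional stop_sequences parameter accepts an array of strings. When the model generates any of these strings, it stops immediately and returns what it has so far (stop_reason will be "stop_sequence"). The matched string itself is not included in the returned text. This is useful for structured output: if you ask Claude to generate a flat JSON object and set stop_sequences=["}"], generation halts at the first closing brace, which prevents trailing commentary; your code then appends the "}" before parsing. Because the match fires on the first "}", this trick is not suitable for nested objects, which would be cut off at the first inner brace.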
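One practical detail when using "}" as a stop sequence: the matched stop string is not part of the returned text, so the brace has to be restored before parsing. A minimal sketch, assuming a flat (non-nested) JSON object; parse_flat_json and the sample string are hypothetical.

```python
import json


def parse_flat_json(generated: str, stop_reason: str) -> dict:
    """Parse model output that was cut off by a "}" stop sequence."""
    if stop_reason == "stop_sequence":
        # The stop string is not included in the output, so put it back.
        generated += "}"
    return json.loads(generated)


# Hand-written stand-in for text returned with stop_sequences=["}"].
partial_output = '{"date": "2025-03-05"'
parsed = parse_flat_json(partial_output, "stop_sequence")
```

For nested objects this approach breaks, because generation would stop at the first inner closing brace; in that case it is safer to let the model finish and extract the JSON from the full text.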

Key Terms

temperature: Sampling parameter (0–1) controlling output randomness. 0 = most consistent (near-deterministic); 1 = fully sampled from the probability distribution.

top_p: Nucleus sampling parameter. Restricts token selection to the smallest set of tokens whose cumulative probability exceeds p. Adjust either temperature or top_p, not both.

top_k: Restricts token selection to the k most probable candidates at each step. Coarser than top_p; rarely needed in isolation.

context window: The maximum total tokens (input + output) a model can process in a single request. Claude 3.x/4.x models support up to 200,000 tokens.

stop_sequences: Optional array of strings that cause the model to stop generating immediately upon encountering a match. The matched string is not included in the output. Useful for enforcing structured output formats.

Lesson 3 Quiz

Models, Tokens & Sampling — 4 questions

1. You are building a high-volume document classification system that runs 50,000 requests per day. Which model tier should you choose first?

Correct. Classification is exactly the kind of task Haiku was designed for — structured, bounded, high-volume. Engineering teams have demonstrated 90%+ cost reductions by using Haiku for these workloads.

Not quite. For high-volume, bounded tasks like classification, Haiku is the appropriate starting point. Its cost advantage (roughly 25× cheaper than Opus) is enormous at scale, with minimal quality loss on structured tasks.

2. What is the key difference between the context window and the max_tokens parameter?

Correct. You can have a 200K token context window (total capacity) but set max_tokens to 256 (output ceiling for this call). They operate at different levels.

Not quite. The context window is the model's total capacity for a request (input + output). max_tokens is your chosen ceiling for output only on this specific call — you can limit output to 256 tokens even with a 200K context window.

3. You want Claude to extract a specific date from a document and always return it in exactly the same format. What temperature setting is most appropriate?

Correct. Extraction and classification tasks call for temperature=0. You want the model's best single answer, consistently, not a sample from the probability distribution.

Not quite. For deterministic extraction tasks, temperature=0 is correct — it always picks the most probable token. Also, Anthropic advises against combining temperature and top_p simultaneously.

4. How would you use stop_sequences to reliably capture only a JSON object from Claude's output without any trailing commentary?

Correct. stop_sequences=["}"] causes generation to halt at the first closing brace, preventing any post-JSON commentary from appearing in the response. The brace itself is not returned, so append it before parsing, and reserve this trick for flat (non-nested) objects.

Not quite. stop_sequences is the right tool here. Setting stop_sequences=["}"] stops generation at the first closing brace, preventing trailing text regardless of temperature or system prompt instructions. Remember to append the brace before parsing, and note that a nested object would be cut short.

Lab 3 — Models & Sampling

Practice with an AI tutor · Complete 3 exchanges to finish

Your Task

You are making model and parameter decisions for a real application. Use the tutor to reason through model selection, understand temperature vs. top_p, discuss stop sequences, or explore token cost implications.

Suggested start: "I need to build a product that generates creative marketing copy and also extracts structured data. How should I think about model and temperature choices for each?" — or ask your own question.

API Tutor

Models & Sampling · Lesson 3

Hello! I'm your tutor for Lesson 3: Models, Tokens & Sampling. Ask me about choosing between Opus, Sonnet, and Haiku, how temperature and top_p differ, when to use stop_sequences, or how to think about context windows and token costs. What's on your mind?

Lesson 4 · The Messages API

Streaming & Error Handling

Production APIs fail gracefully. Streaming changes the user experience. Both require deliberate design.

How does streaming work under the hood, and what error codes should every production application handle explicitly?

When Claude.ai launched its public interface in 2023, the product team made a deliberate choice to use streaming on every response — not just long ones. User research had shown that time-to-first-token mattered more to perceived responsiveness than total generation time. A response that starts appearing in 400ms feels faster than one that arrives complete in 1.8 seconds, even if the streaming response takes longer end-to-end. This finding, documented in Anthropic's product blog, influenced how the SDK's streaming helpers were subsequently designed.

How Streaming Works

Setting stream=True in a request causes the API to return an HTTP response with Content-Type: text/event-stream (Server-Sent Events format). Instead of waiting for the full response, your code receives a sequence of event objects as the model generates tokens. Each event has a type field and associated data.

The event sequence for a streaming response follows a fixed structure: message_start (metadata about the response), one or more content_block_start + content_block_delta + content_block_stop triplets for each content block, then message_delta (final stop_reason and token usage), and finally message_stop. Knowing this structure matters when you are building your own streaming parser rather than using the SDK helper.

# Streaming with the Python SDK helper
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a short poem about APIs."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Get the final message while the stream context is still open
    final_message = stream.get_final_message()
    print("\nTotal output tokens:", final_message.usage.output_tokens)

Error Codes to Handle in Production

The Messages API returns standard HTTP status codes. Anthropic's documentation specifies the error codes that production applications must handle explicitly rather than treating all errors uniformly.

Status Code	Type	Recommended Action
400	invalid_request_error	Fix your request — malformed JSON, missing required parameter, or invalid field value. Do not retry.
401	authentication_error	Invalid or missing API key. Check your key and headers. Do not retry.
403	permission_error	Your key lacks permission for this resource or region. Do not retry — escalate to account management.
429	rate_limit_error	You have exceeded your rate limit. Retry with exponential backoff. The Retry-After header indicates the wait time.
500	api_error	Internal Anthropic server error. Retry with backoff; if persistent, check status.anthropic.com.
529	overloaded_error	API temporarily overloaded. Retry with exponential backoff.

Exponential Backoff

For retriable errors (429, 500, 529), the standard pattern is exponential backoff with jitter: wait 1s, then 2s, then 4s, then 8s, each with a small random offset to prevent thundering herd when many clients retry simultaneously. The Anthropic Python SDK implements this automatically if you use its built-in retry mechanism, configurable via max_retries on the client constructor.

Anthropic's published best practices explicitly warn against fixed-delay retry loops — under sustained load, they amplify pressure on already-overloaded infrastructure rather than relieving it. Jitter is not optional in high-throughput production systems.
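The retry policy above can be sketched in a few lines. Everything here is illustrative: ApiError, backoff_delay, and call_with_retries are hypothetical names, the 25% jitter and 30-second cap are arbitrary choices, and send stands in for whatever function performs your HTTP call. The SDK's built-in retries make this unnecessary when you use the official client.

```python
import random
import time
from typing import Callable, Optional

RETRYABLE_STATUS = {429, 500, 529}  # retriable codes from the table above


class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status: int, retry_after: Optional[float] = None):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.retry_after = retry_after


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential delay (1s, 2s, 4s, ...) plus up to 25% random jitter."""
    delay = min(cap, base * 2 ** attempt)
    return delay + delay * 0.25 * random.random()


def call_with_retries(
    send: Callable[[], object],
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call send(), retrying retriable failures with backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except ApiError as err:
            if err.status not in RETRYABLE_STATUS or attempt == max_retries:
                raise  # non-retriable, or out of attempts
            delay = backoff_delay(attempt)
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)  # honor the server's hint
            sleep(delay)
```

Passing sleep as a parameter keeps the function testable: a test can substitute a function that records the delays instead of actually waiting.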

Rate Limits vs. Usage Limits

A 429 error can signal two distinct problems: requests per minute exceeded, or tokens per minute exceeded. The error body specifies which. Token-rate limiting is more commonly triggered by large-context requests even at low request-per-minute volumes. Monitoring both dimensions separately — not just request count — is essential for capacity planning.
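Tracking both dimensions on the client side can be as simple as a sliding window over recent requests. The sketch below is a hypothetical helper, not an SDK feature; timestamps are passed in explicitly (in seconds) so the logic stays easy to test, and the token counts are assumed to come from the usage field of each response.

```python
from collections import deque


class UsageWindow:
    """Hypothetical tracker of requests and tokens over a sliding window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        self.events: deque[tuple[float, int]] = deque()  # (timestamp, tokens)

    def record(self, timestamp: float, tokens: int) -> None:
        self.events.append((timestamp, tokens))

    def totals(self, now: float) -> tuple[int, int]:
        """Return (requests, tokens) seen within the last window."""
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        return len(self.events), sum(tokens for _, tokens in self.events)


usage = UsageWindow()
usage.record(timestamp=0.0, tokens=50_000)
usage.record(timestamp=10.0, tokens=60_000)
requests_now, tokens_now = usage.totals(now=30.0)
```

In this example only two requests were made in the window, yet they account for 110,000 tokens, which is the low-request, high-token pattern the paragraph above warns about.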

Content Filtering Errors

When Claude declines to answer due to safety policies, this is not returned as an HTTP error. Instead, the response has HTTP 200, the stop_reason is typically end_turn, and the content block contains Claude's refusal message. (Newer models may instead report a dedicated refusal stop reason in some cases, so check the documentation for the model you use.) This means content filtering cannot be detected at the HTTP layer; you must inspect the response body. Some applications use a secondary classifier on the output text to detect policy-related refusals and handle them gracefully in the UI.

Key Terms

streaming: A mode (stream=True) where the API returns tokens incrementally as Server-Sent Events rather than waiting for full generation to complete.

content_block_delta: The SSE event type carrying incremental text as it is generated in streaming mode.

exponential backoff: A retry strategy that doubles wait time between attempts plus a random jitter offset. Required for 429 and 5xx error handling.

429 rate_limit_error: HTTP error indicating exceeded requests-per-minute or tokens-per-minute limits. Retriable with backoff.

529 overloaded_error: Anthropic-specific HTTP status indicating temporary API overload. Retriable with backoff.

Lesson 4 Quiz

Streaming & Error Handling — 4 questions

1. What HTTP protocol format does the Messages API use to deliver streaming responses?

Correct. The API uses the Server-Sent Events (SSE) format with Content-Type: text/event-stream for streaming responses.

Not quite. Streaming uses Server-Sent Events format — a unidirectional push protocol over HTTP with Content-Type: text/event-stream.

2. Your application receives a 429 error from the API. What is the correct production response?

Correct. 429 is a retriable rate limit error. Exponential backoff with jitter prevents thundering herd. The Retry-After header provides the minimum wait time.

Not quite. 429 means rate limited — retriable. The correct pattern is exponential backoff with jitter. Tight retry loops will worsen the problem; 429 is not an account suspension signal.

3. Claude declines to answer a user's question due to safety policies. What HTTP status code does the API return?

Correct. Content policy refusals return HTTP 200. The model's refusal message appears in the content field. You cannot detect refusals at the HTTP layer alone.

Not quite. Content filtering is not signaled by an HTTP error. The API returns 200 OK with Claude's refusal text in the content block. You must inspect the response text to detect it.

4. Why does Anthropic recommend adding random jitter to exponential backoff intervals?

Correct. Without jitter, clients that were rate-limited together will all retry at the same moment — creating another identical load spike. Jitter spreads retries across time.

Not quite. Jitter solves the "thundering herd" problem: without it, many clients rate-limited at the same moment will all retry simultaneously after the same backoff interval, recreating the exact overload that caused the error.
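The thundering herd effect is easy to see in a small simulation. This sketch assumes 100 hypothetical clients that were all rate-limited at the same instant and are about to make their third attempt; the "full window" jitter used here (a random extra wait of up to one backoff interval) is one common choice, not a prescribed formula.

```python
import random
from typing import List


def retry_times(n_clients: int, attempt: int, base: float = 1.0,
                jitter: bool = True, seed: int = 0) -> List[float]:
    """Return the moment (seconds from now) at which each client retries."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    wait = base * 2 ** attempt
    return [wait + (rng.uniform(0, wait) if jitter else 0.0)
            for _ in range(n_clients)]


no_jitter = retry_times(100, attempt=2, jitter=False)
with_jitter = retry_times(100, attempt=2, jitter=True)

# Without jitter every client picks the same instant.
print("distinct retry instants without jitter:", len(set(no_jitter)))
# With jitter the retries are spread over a window of several seconds.
print("distinct retry instants with jitter:", len(set(with_jitter)))
print("spread with jitter (s):", round(max(with_jitter) - min(with_jitter), 2))
```

Without jitter all 100 retries land at the 4-second mark; with jitter they fall somewhere between 4 and 8 seconds, which is what flattens the second spike.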

Lab 4 — Streaming & Errors

Practice with an AI tutor · Complete 3 exchanges to finish

Your Task

You are designing the error handling and streaming strategy for a production API integration. Use the tutor to work through streaming implementation details, retry logic, error code handling, or content filter detection.

Suggested start: "I'm getting sporadic 429 errors in production at about 500 requests/minute. Walk me through how to diagnose whether it's RPM or TPM limiting, and what the retry strategy should look like." — or ask your own question.


Module Test — The Messages API

15 questions · Score 80% or above to pass

1. What is the full endpoint URL for Messages API requests?

Correct.

The correct URL is https://api.anthropic.com/v1/messages.

2. Which HTTP header carries your Anthropic API key?

Correct. Anthropic uses x-api-key, not the Authorization: Bearer pattern.

The correct header is x-api-key. Anthropic does not use Authorization: Bearer.

3. How many required parameters does the Messages API have?

Correct. model, max_tokens, and messages are all required.

Three parameters are required: model, max_tokens, and messages. system is optional.
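The first three answers (endpoint URL, x-api-key header, three required parameters) fit into one request. The sketch below only builds the request with the standard library and does not send it. The `build_request` helper, the placeholder key, and the placeholder model name are hypothetical, and the `anthropic-version` header value is an assumption not stated in this module; check the API reference for the value to use.

```python
import json
import urllib.request
from typing import Dict, List, Optional

API_URL = "https://api.anthropic.com/v1/messages"


def build_request(api_key: str, model: str, max_tokens: int,
                  messages: List[Dict[str, str]],
                  system: Optional[str] = None) -> urllib.request.Request:
    """Assemble (but do not send) a Messages API request."""
    # The three required parameters.
    body = {"model": model, "max_tokens": max_tokens, "messages": messages}
    # system is optional and sits at the top level, not inside messages.
    if system is not None:
        body["system"] = system
    headers = {
        "x-api-key": api_key,               # not Authorization: Bearer
        "anthropic-version": "2023-06-01",  # assumed version string
        "content-type": "application/json",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_request(
    api_key="YOUR_API_KEY",          # placeholder
    model="MODEL_NAME_PLACEHOLDER",  # placeholder, substitute a real model ID
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello"}],
)
print(req.get_method(), req.full_url)
print(sorted(json.loads(req.data)))
```

Passing the result to `urllib.request.urlopen(req)` would send it; in practice most projects use an official SDK, which sets these headers for you.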

4. The messages array must follow which constraint on role ordering?

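Correct. The messages array is expected to alternate between user and assistant roles.

Roles should alternate: user, assistant, user, assistant. Older API behavior rejected two consecutive identical roles; current documentation describes consecutive same-role turns as being combined into a single turn, so alternating explicitly remains the safest convention.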

5. A developer sends a request without including past conversation turns. What happens to the previous conversation?

Correct. The API is stateless. Previous context exists only in the messages array you send.

The API is stateless — it holds no memory between calls. History not included in the request is not available to the model.
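Because the API holds no memory, the client owns the conversation history. A minimal sketch of that bookkeeping follows; the `Conversation` class and the `fake_send` stand-in are hypothetical, and in a real application `send` would wrap an actual Messages API call that returns the assistant's text.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the full message history and resends it on every turn."""

    def __init__(self, send: Callable[[List[Message]], str]):
        self._send = send
        self.messages: List[Message] = []

    def ask(self, text: str) -> str:
        # Append the new user turn, then send the ENTIRE history.
        self.messages.append({"role": "user", "content": text})
        reply = self._send(list(self.messages))
        # Store the reply so the next call includes it as context.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages: List[Message]) -> str:
    """Stand-in for a real API call; reports how much history it received."""
    return f"saw {len(messages)} message(s)"


chat = Conversation(fake_send)
print(chat.ask("Hi"))          # saw 1 message(s)
print(chat.ask("And again?"))  # saw 3 message(s)
```

If `ask` sent only the newest message instead of the whole list, the second call would again report one message, which is exactly the "lost context" situation question 5 describes.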

6. Where should the system parameter appear in a Messages API request?

Correct. system is a top-level parameter, not a role inside the messages array.

system is a top-level parameter in the request body — not inside the messages array. The "system" role convention belongs to OpenAI's API.

7. Prompt caching, introduced by Anthropic in 2024, charges cache hits at what percentage of normal input token price?

Correct. Prompt cache hits are charged at 10% of normal input token price.

Prompt cache hits are charged at 10% of normal input token price — a 90% savings on cached content.
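The 10% figure is easiest to appreciate with arithmetic. This sketch uses a hypothetical price of $3.00 per million input tokens and a hypothetical 50,000-token prompt of which 45,000 tokens are served from cache; it also ignores any separate pricing for writing to the cache, so treat the numbers as illustrative only.

```python
def input_cost(total_tokens: int, cached_tokens: int,
               price_per_million: float,
               cached_multiplier: float = 0.10) -> float:
    """Input cost in dollars when `cached_tokens` are billed at the cache-hit rate."""
    uncached = total_tokens - cached_tokens
    billable = uncached + cached_tokens * cached_multiplier
    return billable * price_per_million / 1_000_000


PRICE = 3.00  # hypothetical dollars per million input tokens

no_cache = input_cost(50_000, 0, PRICE)
with_cache = input_cost(50_000, 45_000, PRICE)

print(f"without caching: ${no_cache:.4f}")
print(f"with cache hits: ${with_cache:.4f}")
print(f"savings: {100 * (1 - with_cache / no_cache):.0f}%")
```

Under these assumptions the per-request input cost falls from $0.1500 to $0.0285, an 81% reduction overall, because only the cached 90% of the prompt gets the 90% discount.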

8. What is the standard context window for Claude 3.x and Claude 4.x models?

Correct. Claude 3.x and 4.x models support a standard context window of 200,000 tokens.

Claude 3.x and 4.x models support a standard context window of 200,000 tokens.

9. You are building a legal document summarizer that runs 100,000 requests/day. Summaries are ~300 words. Which model tier is the best starting point?

Correct. Summarization involves genuine comprehension and nuance — Sonnet is the appropriate starting point for a quality/cost balance at scale.

Summarization requires comprehension beyond simple classification — Sonnet is the right starting point. Haiku is best for very structured, bounded tasks. Model choice directly affects cost.

10. Anthropic's documentation advises NOT combining which two sampling parameters simultaneously?

Correct. temperature and top_p should not be used together — their interaction produces unpredictable sampling behavior.

Anthropic advises against combining temperature and top_p — their interaction is unpredictable. Use one or the other.

11. What SSE event type carries the actual token text during a streaming response?

Correct. content_block_delta events carry the incremental token text as it is generated.

The content_block_delta event carries the actual text being generated incrementally.

12. Which HTTP status code is Anthropic-specific and indicates temporary API overload (distinct from rate limiting)?

Correct. 529 is Anthropic's custom overloaded_error status code, distinct from 429 (rate limit) and standard 5xx errors.

529 is Anthropic's overloaded_error — a custom status code for temporary infrastructure overload, separate from 429 rate limiting.

13. A 400 error from the Messages API indicates what, and should you retry it?

Correct. 400 is a client error — your request is malformed. Retrying the same request will produce the same error. Fix it first.

400 means the request itself is invalid (missing parameter, bad JSON, etc.). Retrying without fixing the request is pointless — it will return 400 again.

14. Claude refuses to answer a user's request due to safety guidelines. What appears in the API response?

Correct. Content policy refusals are returned as normal 200 responses with the refusal text in content. HTTP status cannot detect them.

Content policy refusals return HTTP 200 with Claude's refusal in the content field. You must inspect the text content to detect and handle refusals in your application.

15. Why does Anthropic's design make the Messages API stateless, requiring full conversation history in every request?

Correct. With a stateless design, the full context is visible on every call, which improves auditability, and there is no server-side session state to serve as an attack surface for adversarial persistence.

The stateless design is partly a security decision: full context is visible on every call, and there is no server-side session state in which adversarial inputs could persist across users.