There comes a point in every technology transition where the reading stops and the building starts. You've read about LLMs. You've used the chat interface. You've followed the news. Eventually you sit down to actually write code that calls an API and returns a useful result, and everything you knew abstractly has to become concrete.
The Anthropic API is Claude's front door for developers. It's what powers most of the applications people have built on Claude β from startups to enterprise products to internal tools. It has its own conventions, limits, pricing curves, retry behaviors, streaming protocols, and feature set that's moved faster than any documentation can keep up with.
This course is the hands-on, code-first guide. It covers authentication, the messages API, streaming, tool use, system prompts, context windows, pricing tiers, rate limits, error handling, batch processing, the Files and Search APIs, and the dozen specific gotchas that trip up every developer in their first week. By the end, you'll be able to build something real.
If you finish every module, here's who you become:
When Anthropic publicly launched the Claude API in March 2023, developers immediately noticed something deliberate about its design: unlike chat-completion APIs that bundle history into a single string, the Messages API treats every turn of a conversation as a discrete, typed object. This was a conscious choice. Anthropic's engineers had observed that ambiguous history formats were a leading cause of prompt injection and context-confusion bugs in production systems. The typed message array was their answer.
The entire Messages API lives at one URL: POST https://api.anthropic.com/v1/messages. There is no separate endpoint for chat, for completions, or for streaming β they are all the same endpoint with different parameters. This simplicity is intentional: Anthropic wanted developers to think in terms of messages, not in terms of prompt/completion pairs.
Every request requires two HTTP headers beyond standard JSON content-type: x-api-key carrying your secret key, and anthropic-version specifying the API version string (currently 2023-06-01). The version header exists because Anthropic occasionally makes breaking changes; pinning a version protects your production code from unexpected behavior.
Three parameters are required on every call. Leave out any one of them and the API returns a 400 error immediately.
| Parameter | Type | Purpose |
|---|---|---|
| model | string | Which Claude model to use β e.g. claude-opus-4-5, claude-sonnet-4-5, or claude-haiku-3-5. |
| max_tokens | integer | Hard ceiling on output length. The model will stop generating at this many tokens even mid-sentence. Required to prevent runaway costs. |
| messages | array | Ordered list of message objects, each with a role ("user" or "assistant") and a content field. |
The messages array is the heart of the API. Each element is an object with exactly two fields: role and content. The role must alternate between "user" and "assistant" β you cannot have two consecutive user turns or two consecutive assistant turns. This constraint reflects the underlying model architecture: Claude was trained on alternating human/assistant dialogues.
To simulate a multi-turn conversation, you populate the array yourself with the full history. The API is stateless β it does not remember previous calls. Every request must carry its entire conversational context. This design makes scaling trivial (any server can handle any request) but places the burden of history management on the developer.
The API returns a JSON object with a predictable structure. The most important fields are content (an array of content blocks), stop_reason (why generation stopped), and usage (token counts for billing). The content field is an array β not a string β because Claude can return multiple blocks including text and tool-use results in a single response.
The stop_reason field takes one of four values: end_turn (model finished naturally), max_tokens (hit your ceiling), stop_sequence (matched a custom stop string), or tool_use (model invoked a tool). Checking this field in production code prevents you from silently serving truncated responses to users.
Anthropic's decision to make the API stateless β requiring developers to send full conversation history on every request β was partly a safety measure. It makes the model's full context visible and auditable on every call, rather than hidden in server-side session state. Anthropic's internal red teams found that stateful session management created subtle attack surfaces where adversarial inputs could persist across user sessions.
You are exploring the structure of a Messages API request. Use the AI tutor below to ask questions about required parameters, the messages array format, response fields, or anything covered in Lesson 1.
In 2023, Notion shipped its AI writing assistant β built on Claude β with a system prompt that ran to several hundred words. The prompt established Claude as a writing collaborator, defined the output format Notion expected (markdown with specific heading conventions), and explicitly prohibited the model from discussing competitors. Notion's engineering blog noted that the system prompt was the single highest-leverage lever they had: a two-sentence change to the system prompt produced measurably different user satisfaction scores in A/B tests, far outweighing changes to temperature or model version.
The system parameter is an optional top-level string in the request body, outside the messages array. It sits at a different privilege level than user messages β Claude's training treats it as higher-authority instructions that frame how the entire conversation should proceed. It is not a "system role" message inside the array (that syntax belongs to OpenAI's API, not Anthropic's).
Because system sits outside the messages array, it cannot be "overwritten" by a user message in the same way. A user typing "Ignore your previous instructions" inside a message turn is working against a system prompt β the model treats these as different levels of trust. This architectural choice was documented by Anthropic's alignment team as part of their hierarchical prompt trust framework.
Anthropic's published guidance identifies four categories of system prompt content that reliably improve output quality:
Role and persona. Telling Claude who it is ("You are a customer support agent for Acme Corp") focuses its responses and reduces scope-creep. Anthropic's internal testing showed that persona instructions in the system prompt reduced off-topic responses by roughly 30% compared to the same instructions in a user turn.
Output format. Specifying the expected format (JSON, markdown, plain prose, numbered lists) in the system prompt is more reliable than asking for it per-turn, because it establishes a standing expectation rather than a per-request request.
Constraints and prohibitions. What Claude should not do β discussing competitors, revealing internal pricing, speaking outside its domain of expertise. These work best when framed positively where possible ("Focus exclusively on X") rather than as prohibitions ("Do not discuss Y").
Context and knowledge. Background information that applies to every turn: company name, product details, user tier, current date. This is far more token-efficient than repeating context in every user message.
Every system prompt token is billed on every API call, even if the user's message is short. A 500-token system prompt on 10,000 daily requests adds 5 million tokens per day to your input bill. Anthropic introduced prompt caching in 2024 specifically to address this: by marking a system prompt with a cache_control parameter, the API stores a compiled version and charges only 10% of the normal input token price for cache hits. This made long, rich system prompts economically viable for high-volume applications.
System prompts are not cryptographically sealed. A sophisticated user who convinces Claude that the "true" system prompt has changed β through elaborate role-play, hypotheticals, or adversarial instructions β can sometimes override system prompt instructions. Anthropic's red team published findings in 2023 noting that constraints expressed as absolute rules ("Never, under any circumstances...") were more robust than soft preferences ("Try to avoid..."). Defense-in-depth β combining system prompt constraints with application-layer filtering β is the recommended production pattern.
You are designing system prompts for a real application. Use the tutor to explore best practices, discuss what to include, how prompt caching works, or why system prompts carry higher authority than user messages.
In late 2024, Anthropic released Claude Haiku 3.5 at a price point roughly 25Γ cheaper than Claude Opus 4.5 per token. Within weeks, engineering teams at several AI-native startups publicly documented on technical blogs that they had migrated classification, routing, and extraction tasks from Opus to Haiku with no measurable quality regression β while reducing API costs by over 90%. The model selection decision had become a first-order architectural concern, not an afterthought.
Anthropic organizes Claude into three tiers, each reflecting a different capability-cost tradeoff. Understanding this tiers matters enormously for production cost management.
| Model Tier | Example | Best For |
|---|---|---|
| Opus | claude-opus-4-5 | Complex reasoning, nuanced writing, multi-step analysis where quality is paramount and cost is secondary. |
| Sonnet | claude-sonnet-4-5 | General-purpose tasks with a good balance of capability and cost. The default choice for most production workloads. |
| Haiku | claude-haiku-3-5 | High-volume, latency-sensitive tasks: classification, extraction, routing, short-form responses. |
Claude processes text as tokens β chunks of characters roughly 3β4 characters each on average for English. The word "photosynthesis" is two tokens; the word "a" is one. Token counts determine both billing (input tokens and output tokens are priced separately) and context limits (each model has a maximum context window, currently 200,000 tokens for Claude 3.x and Claude 4.x models).
A critical distinction: context window (the maximum total tokens in a single request β input + output) versus max_tokens (the ceiling you set for output in this specific call). You can have a 200K context window but set max_tokens to 500, meaning the model can read 200K tokens of input but will only write 500 tokens in response.
Temperature controls randomness in token selection. At temperature=0, the model always picks the single most probable next token β deterministic, highly consistent, but prone to repetition on longer outputs. At temperature=1 (the default), the model samples proportionally from the probability distribution, producing varied and natural output. At very high temperatures (above 1.0, up to a maximum of 1.0 in Claude's API) output becomes incoherent.
Anthropic's documentation recommends temperature=0 for tasks where consistency and correctness matter more than variety: classification, extraction, code generation, factual Q&A. Temperature closer to 1 for creative tasks: brainstorming, story writing, generating diverse options.
top_p (nucleus sampling) restricts token selection to the smallest set of tokens whose cumulative probability exceeds p. Setting top_p=0.9 means the model only ever considers tokens that together account for 90% of the probability mass β cutting out long-tail nonsense tokens. This is often more stable than temperature for controlling randomness.
top_k restricts selection to the k most probable tokens at each step, regardless of probability. Setting top_k=10 means only the top 10 candidate tokens are ever considered. Anthropic's documentation notes that you should use temperature or top_p, not both simultaneously β combining them produces unpredictable interactions. top_k is a coarser control and is rarely needed when temperature or top_p are set.
The optional stop_sequences parameter accepts an array of strings. When the model generates any of these strings, it stops immediately and returns what it has so far (stop_reason will be "stop_sequence"). This is invaluable for structured output: if you ask Claude to generate JSON and set stop_sequences=["}"], the model stops as soon as it closes the root object, preventing it from appending commentary after the JSON.
You are making model and parameter decisions for a real application. Use the tutor to reason through model selection, understand temperature vs. top_p, discuss stop sequences, or explore token cost implications.
When Claude.ai launched its public interface in 2023, the product team made a deliberate choice to use streaming on every response β not just long ones. User research had shown that time-to-first-token mattered more to perceived responsiveness than total generation time. A response that starts appearing in 400ms feels faster than one that arrives complete in 1.8 seconds, even if the streaming response takes longer end-to-end. This finding, documented in Anthropic's product blog, influenced how the SDK's streaming helpers were subsequently designed.
Setting stream=True in a request causes the API to return an HTTP response with Content-Type: text/event-stream (Server-Sent Events format). Instead of waiting for the full response, your code receives a sequence of event objects as the model generates tokens. Each event has a type field and associated data.
The event sequence for a streaming response follows a fixed structure: message_start (metadata about the response), one or more content_block_start + content_block_delta + content_block_stop triplets for each content block, then message_delta (final stop_reason and token usage), and finally message_stop. Knowing this structure matters when you are building your own streaming parser rather than using the SDK helper.
The Messages API returns standard HTTP status codes. Anthropic's documentation specifies the error codes that production applications must handle explicitly rather than treating all errors uniformly.
| Status Code | Type | Recommended Action |
|---|---|---|
| 400 | invalid_request_error | Fix your request β malformed JSON, missing required parameter, or invalid field value. Do not retry. |
| 401 | authentication_error | Invalid or missing API key. Check your key and headers. Do not retry. |
| 403 | permission_error | Your key lacks permission for this resource or region. Do not retry β escalate to account management. |
| 429 | rate_limit_error | You have exceeded your rate limit. Retry with exponential backoff. The Retry-After header indicates the wait time. |
| 500 | api_error | Internal Anthropic server error. Retry with backoff; if persistent, check status.anthropic.com. |
| 529 | overloaded_error | API temporarily overloaded. Retry with exponential backoff. |
For retriable errors (429, 500, 529), the standard pattern is exponential backoff with jitter: wait 1s, then 2s, then 4s, then 8s, each with a small random offset to prevent thundering herd when many clients retry simultaneously. The Anthropic Python SDK implements this automatically if you use its built-in retry mechanism, configurable via max_retries on the client constructor.
Anthropic's published best practices explicitly warn against fixed-delay retry loops β under sustained load, they amplify pressure on already-overloaded infrastructure rather than relieving it. Jitter is not optional in high-throughput production systems.
A 429 error can signal two distinct problems: requests per minute exceeded, or tokens per minute exceeded. The error body specifies which. Token-rate limiting is more commonly triggered by large-context requests even at low request-per-minute volumes. Monitoring both dimensions separately β not just request count β is essential for capacity planning.
When Claude declines to answer due to safety policies, this is not returned as an HTTP error. Instead, the response has HTTP 200, but the stop_reason is end_turn and the content block contains Claude's refusal message. This means content filtering cannot be detected at the HTTP layer β you must inspect the response content. Some applications use a secondary classifier on the output text to detect policy-related refusals and handle them gracefully in the UI.
You are designing the error handling and streaming strategy for a production API integration. Use the tutor to work through streaming implementation details, retry logic, error code handling, or content filter detection.