In January 2024, Anthropic published its pricing update for Claude 3 — the first time the company offered three distinct capability tiers at dramatically different price points. The decision forced developers to confront a question they had often ignored: not "can our app do this?" but "can our app afford to do this at scale?"
The gap between Haiku and Opus pricing — roughly 60× per output token at launch — made model selection a genuine engineering decision, not an afterthought.
The Anthropic API bills on tokens consumed, not time, not requests, not users. A token is approximately four characters of English text. Every call produces a bill with two line items: input tokens (everything sent to the model — system prompt, conversation history, user message, documents) and output tokens (every token the model generates in its response).
Output tokens are priced higher than input tokens on every Claude model — typically 5× more expensive per million tokens. This asymmetry matters enormously when optimizing: a response that generates 2,000 tokens costs roughly the same as sending 10,000 tokens of context.
| Model | Input (per M tokens) | Output (per M tokens) | Best For |
|---|---|---|---|
| Claude 3 Haiku | $0.25 | $1.25 | High-volume classification, routing, simple extraction |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Complex reasoning, coding, nuanced analysis |
| Claude 3 Opus | $15.00 | $75.00 | Hardest tasks requiring maximum capability |
At 1 million calls per day with an average 500-token output, switching from Sonnet to Haiku saves approximately $7,375 per day — $2.7M per year — if Haiku achieves acceptable quality for the task.
Most developers underestimate their input token count because they think only of the user's message. In practice, the system prompt is sent on every single request. A 2,000-token system prompt across 1 million daily calls contributes 2 billion input tokens per day — before any user sends a word.
Long conversation histories compound this. In a multi-turn chat where the full history is replayed each turn, a 10-turn conversation with 200-token average messages sends 2,000 cumulative tokens of history on the final turn alone. By turn 20, that's 4,000 tokens — just from history.
Attached documents and retrieved context from RAG pipelines are common culprits. A single 10-page PDF passed verbatim might add 5,000–8,000 tokens to every request that references it.
Output cost is driven by response length, which is driven by instructions. If a system prompt says "be thorough and comprehensive," the model will be — and you will pay for it. Vague prompts tend to produce verbose responses; specific prompts asking for structured, concise output produce shorter ones.
The max_tokens parameter caps output but does not guarantee short responses. It is a ceiling, not a target. Setting it appropriately for the task prevents runaway generation but doesn't substitute for prompting for concision.
Anthropic publishes current pricing at anthropic.com/api. Prices have decreased substantially with each model generation — Claude 3 Haiku is approximately 30× cheaper per token than the original Claude 1 at launch. Budget projections should use current figures and expect continued reductions.
max_tokens parameter controls?max_tokens is a ceiling, not a target. The model stops generating when it naturally completes its response OR when it hits max_tokens, whichever comes first. Setting it too low can truncate responses; it doesn't force brevity — proper prompting does.max_tokens only caps output generation — it's a maximum, not an exact count. The model will generate fewer tokens if the response naturally concludes before reaching the limit.You're working with a cost-estimation assistant. Walk through real cost calculation scenarios — ask it to help you estimate monthly API costs for different application designs, compare model tiers for specific workloads, or explore what happens when you change system prompt length, response length, or request volume.
When Anthropic launched prompt caching in beta in August 2024, the announcement included a reference implementation showing a legal document analysis workflow. A 50-page contract (roughly 25,000 tokens) analyzed with 10 different questions would, without caching, incur 250,000 input tokens. With caching, the document tokens are processed once; only each short question is charged at full rate. Cost reduction: approximately 90% on the document-heavy portion of the bill.
Prompt caching allows you to mark specific portions of your prompt with a cache_control parameter set to {"type": "ephemeral"}. When the API receives a request with a cache-marked prefix, it checks whether that exact token sequence is already stored in its cache. If yes, it reuses the stored key-value computation rather than reprocessing those tokens — and charges you at a dramatically reduced rate.
Cached input tokens cost 10% of the normal input price (a 90% discount). The first request that creates the cache entry is charged at 1.25× the normal input price (a one-time 25% premium). Subsequent requests within the cache window recoup this investment in a fraction of the calls.
Cache entries have a 5-minute TTL (time-to-live) from the last use. Each use of a cached entry resets the timer, so frequently-used caches can persist indefinitely in active applications. Caches are scoped to your API key and are not shared between organizations.
The minimum cacheable prefix length is 1,024 tokens for Claude 3.5 Sonnet and Claude 3 Opus, and 2,048 tokens for Claude 3 Haiku. Short system prompts below these thresholds cannot be cached — a key reason why consolidating instructions into a single longer prompt can actually reduce costs even if it means a slightly larger uncached prompt.
You can define up to four cache breakpoints per request, enabling multi-section caching — for example, caching a static system prompt separately from a large shared document that changes daily.
The highest-value targets for caching are content that is static or slow-changing and large. System prompts are the obvious first target. Beyond that, consider: reference documents (product manuals, legal agreements, codebases), few-shot examples included in every prompt, and conversation history in multi-turn sessions.
Cache creation costs 1.25× normal input price. Cache hits cost 0.10× normal. Break-even occurs after just 1.39 cache hits — meaning if more than 2 requests ever use the same cached prefix, you are net-saving money. At any real traffic volume, caching almost always wins.
When prompt caching is active, the API returns a usage object with additional fields: cache_creation_input_tokens (tokens written to cache this request) and cache_read_input_tokens (tokens read from cache). Logging these fields lets you verify cache hit rates and debug cache misses — for example, if a content change invalidates the cache more often than expected.
Work with a caching strategy assistant to design the optimal caching approach for different application types. Describe an application you're building (or a hypothetical one), and explore which parts of the prompt should be cached, where to place cache breakpoints, and how to estimate savings.
In 2024, multiple engineering teams publicly documented their migration from GPT-4-class models to lighter-weight alternatives for classification and routing tasks. Anthropic's own documentation describes a common pattern: using Claude 3 Haiku as a "triage" model to classify incoming requests, then routing only complex cases to Sonnet or Opus. Teams reported cost reductions of 60–80% on the triage tier while maintaining end-user quality perception, because most requests were simple enough that Haiku handled them perfectly.
The core principle: use the cheapest model that reliably succeeds at the task. This sounds obvious but requires systematic evaluation. The instinct to "just use the best model" is expensive at scale and often unnecessary.
Model selection should be driven by empirical testing, not intuition. Build a representative test set of 50–200 inputs with known correct outputs. Run the same prompt on Haiku, Sonnet, and Opus. Measure accuracy or quality against your acceptance threshold. If Haiku achieves 95% and your bar is 90%, you've found your model.
| Task Type | Recommended Start | Notes |
|---|---|---|
| Binary classification, sentiment | Haiku | Simple pattern matching; Haiku excels here at a fraction of Sonnet cost |
| Entity extraction, tagging | Haiku → Sonnet if needed | Test Haiku first; escalate only for ambiguous domains |
| Summarization (standard) | Haiku or Sonnet | Haiku works for news/simple docs; Sonnet for technical/nuanced material |
| Code generation, debugging | Sonnet | Haiku struggles with complex logic; Sonnet is cost-effective here |
| Complex multi-step reasoning | Sonnet → Opus if needed | Opus for genuinely hard reasoning; Sonnet handles most cases |
| Creative writing, nuanced tone | Sonnet or Opus | Task-dependent; evaluate subjectively with human raters |
Beyond model selection, the text of your prompts directly controls token consumption. Verbose, redundant prompts waste money on every call. The principles of token-efficient prompting:
A powerful pattern is building a classifier-router that uses a small, cheap model to categorize each incoming request, then routes it to the appropriate model tier. For example: Haiku classifies "Is this a simple FAQ, a complex reasoning task, or an edge case?" Simple FAQs go to a pre-written template or Haiku itself. Complex reasoning goes to Sonnet. Edge cases go to Opus.
The classification call itself costs almost nothing (a few input tokens + a single output token). The routing decision it enables can save orders of magnitude in downstream cost.
Never optimize blindly. Use the usage field in every API response to log actual token counts per call. Aggregate by endpoint, user type, and time of day. Real usage data almost always reveals that 20% of call types consume 80% of tokens — making optimization efforts extremely targeted and high-ROI.
Submit any system prompt or complete prompt structure to the prompt auditor. It will identify token waste, suggest more concise alternatives, recommend output format changes, and estimate savings. You can also ask it to help you design a tiered routing system for a specific use case.
In November 2024, Anthropic released the Message Batches API — a feature that allows developers to submit up to 10,000 requests in a single batch, processed asynchronously over up to 24 hours. The trade-off: a 50% cost reduction on every token. For any workload that doesn't require real-time responses — nightly data enrichment, document indexing, report generation, evals — the savings are immediate and require no changes to prompt logic.
The Message Batches API accepts up to 10,000 individual requests submitted as a single batch. Anthropic processes them over up to 24 hours and returns results as they complete. Every token in a batch request — input and output — is charged at 50% of the standard API price.
Each request in the batch is fully independent: different system prompts, different models, different max_tokens values. A single batch submission can mix Haiku requests for simple classification alongside Sonnet requests for complex analysis — all at half price.
Streaming (receiving response tokens as they are generated rather than waiting for the full response) does not change token costs — you pay for the same tokens whether you stream or not. However, streaming affects perceived latency and can affect how you bill downstream users if your application charges per response.
Where streaming matters for cost: it enables you to implement early termination. If your application displays streaming output to a user who quickly navigates away or cancels, you can close the connection and stop generation mid-response, preventing the model from generating (and billing you for) tokens the user will never see. This requires client-side connection management but can meaningfully reduce output token consumption in interactive applications with high abandonment rates.
At production scale, cost surprises are operational incidents. These controls prevent them:
usage.input_tokens, usage.output_tokens, model, and user ID on every API call. Aggregate daily. This data enables anomaly detection and identifies cost-per-user outliers quickly.Not all workloads have the same latency requirements. Classify yours before choosing an API pattern:
| Workload Type | Latency Need | Recommended Pattern | Cost Impact |
|---|---|---|---|
| User-facing chat | Real-time (<3s) | Standard API + streaming | Full price; optimize with caching |
| Background enrichment | Hours acceptable | Message Batches API | 50% discount |
| Nightly report generation | Overnight | Message Batches API | 50% discount |
| Model evaluation / evals | Hours acceptable | Message Batches API | 50% discount |
| Webhook / event processing | Seconds to minutes | Standard API; consider async queue | Full price; optimize with model tier |
The maximum cost reduction stack: use Haiku for appropriate tasks (~95% cheaper than Opus) + prompt caching on system prompts (90% off input tokens) + Message Batches API where latency allows (50% off everything). Applied together on an eligible workload, a task that costs $1,000 on Opus real-time can cost under $3 — a 99.7% reduction.
Work with a production architecture advisor to design a complete cost-optimization strategy. Describe a production application (or use the one provided) and explore: which workloads should use batch vs. real-time, where to apply prompt caching, which model tiers to use, and what monitoring controls to implement.
usage object in the API response includes cache_read_input_tokens (tokens served from cache) and cache_creation_input_tokens (tokens written to cache). Logging these lets you verify hit rates and debug cache misses.usage.cache_read_input_tokens (how many tokens were served from cache) and usage.cache_creation_input_tokens (how many were written to cache this call).