Module 7 · Lesson 1

Understanding Token Economics

Every API call costs money. Learning what drives that cost is the first step toward controlling it.

How does Anthropic price API usage, and where do most costs actually come from?

In March 2024, Anthropic published pricing for the Claude 3 model family, the first time the company offered three distinct capability tiers at dramatically different price points. The decision forced developers to confront a question they had often ignored: not "can our app do this?" but "can our app afford to do this at scale?"

The gap between Haiku and Opus pricing — roughly 60× per output token at launch — made model selection a genuine engineering decision, not an afterthought.

The Token Billing Model

The Anthropic API bills on tokens consumed, not time, not requests, not users. A token is approximately four characters of English text. Every call produces a bill with two line items: input tokens (everything sent to the model — system prompt, conversation history, user message, documents) and output tokens (every token the model generates in its response).

Output tokens are priced higher than input tokens on every Claude model — typically 5× more expensive per million tokens. This asymmetry matters enormously when optimizing: a response that generates 2,000 tokens costs roughly the same as sending 10,000 tokens of context.

Model	Input (per M tokens)	Output (per M tokens)	Best For
Claude 3 Haiku	$0.25	$1.25	High-volume classification, routing, simple extraction
Claude 3.5 Sonnet	$3.00	$15.00	Complex reasoning, coding, nuanced analysis
Claude 3 Opus	$15.00	$75.00	Hardest tasks requiring maximum capability

Key Insight

At 1 million calls per day with an average 500-token output, switching from Sonnet to Haiku saves approximately $6,875 per day in output costs alone (500 million output tokens at $15.00 versus $1.25 per million), or about $2.5M per year, provided Haiku achieves acceptable quality for the task.

What Inflates Input Tokens

Most developers underestimate their input token count because they think only of the user's message. In practice, the system prompt is sent on every single request. A 2,000-token system prompt across 1 million daily calls contributes 2 billion input tokens per day — before any user sends a word.

Long conversation histories compound this. In a multi-turn chat where the full history is replayed each turn, a 10-turn conversation with 200-token average messages sends 2,000 cumulative tokens of history on the final turn alone. By turn 20, that's 4,000 tokens — just from history.
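The growth is easier to see when it is totalled over a whole conversation. The sketch below assumes, as in the simplified figures above, that each turn adds one 200-token message and that the full history is replayed on every turn; the function names are illustrative only.

```python
def history_tokens_at_turn(turn: int, tokens_per_turn: int = 200) -> int:
    """History tokens sent on a given turn when everything so far is replayed."""
    return turn * tokens_per_turn


def total_history_tokens(turns: int, tokens_per_turn: int = 200) -> int:
    """Sum of history tokens billed across every turn of one conversation."""
    return sum(history_tokens_at_turn(t, tokens_per_turn) for t in range(1, turns + 1))


print(history_tokens_at_turn(10))  # 2000
print(history_tokens_at_turn(20))  # 4000
print(total_history_tokens(10))    # 11000
print(total_history_tokens(20))    # 42000
```

Per-turn input grows linearly, so the total billed over a conversation grows roughly quadratically: doubling the conversation from 10 to 20 turns nearly quadruples the history tokens billed.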

Attached documents and retrieved context from RAG pipelines are common culprits. A single 10-page PDF passed verbatim might add 5,000–8,000 tokens to every request that references it.

Output Token Drivers

Output cost is driven by response length, which is driven by instructions. If a system prompt says "be thorough and comprehensive," the model will be — and you will pay for it. Vague prompts tend to produce verbose responses; specific prompts asking for structured, concise output produce shorter ones.

The max_tokens parameter caps output but does not guarantee short responses. It is a ceiling, not a target. Setting it appropriately for the task prevents runaway generation but doesn't substitute for prompting for concision.

Pricing Documentation

Anthropic publishes current pricing at anthropic.com/api. Prices have decreased substantially with each model generation — Claude 3 Haiku is approximately 30× cheaper per token than the original Claude 1 at launch. Budget projections should use current figures and expect continued reductions.

Key Terms

Input tokens: All tokens sent to the API in a single request: system prompt + conversation history + user message + any attached content.

Output tokens: Tokens generated by the model in its response. Priced 5× higher than input tokens on the Claude 3 models listed in this lesson.

Token: Approximately 4 characters of English text, or ¾ of a word. Non-English languages, code, and special characters vary.

Cost per call: (input_tokens × input_price_per_token) + (output_tokens × output_price_per_token).
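The cost-per-call formula can be turned into a small calculator. This is a minimal sketch: the prices are copied from the table in this lesson, the dictionary keys are shorthand labels rather than official API model IDs, and the traffic figures are an assumed example.

```python
# Prices in USD per million tokens, copied from the table in this lesson.
# The keys are shorthand labels for this sketch, not API model identifiers.
PRICES = {
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-opus": {"input": 15.00, "output": 75.00},
}


def cost_per_call(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one call: input and output tokens are billed at separate rates."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def monthly_cost(model: str, input_tokens: int, output_tokens: int,
                 calls_per_day: int, days: int = 30) -> float:
    """Scale the per-call cost to a month of traffic."""
    return cost_per_call(model, input_tokens, output_tokens) * calls_per_day * days


# Assumed workload: 1,800 input tokens (1,500 system + 300 user),
# 400 output tokens, 200,000 calls per day.
for model in ("claude-3-5-sonnet", "claude-3-haiku"):
    print(model, round(monthly_cost(model, 1_800, 400, 200_000), 2))
```

For this assumed workload the monthly bill comes to about $68,400 on Sonnet and about $5,700 on Haiku, a 12× difference that follows directly from the price table.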

Quiz — Token Economics

Three questions. Select the best answer for each.

1. Why are output tokens priced higher than input tokens on the Anthropic API?

✓ Correct — Correct. Input tokens can be processed in a single parallelized forward pass (the "prefill" stage), while output tokens must be generated one at a time autoregressively — each new token depends on all previous ones. This sequential constraint makes generation more GPU-intensive per token.

Not quite. The pricing difference reflects computational cost. Input processing is parallelizable; output generation is sequential and therefore more expensive per token to compute.

2. A system prompt is 3,000 tokens long and is sent with every API request. If an application handles 500,000 requests per day, how many input tokens does the system prompt alone contribute daily?

✓ Correct — Correct. 3,000 × 500,000 = 1,500,000,000 — 1.5 billion tokens per day from the system prompt alone. Without prompt caching, this is billed as input tokens on every request. This is why system prompt length and caching are critical at scale.

Revisit the math: 3,000 tokens × 500,000 requests = 1,500,000,000 tokens (1.5 billion). Without caching, the full system prompt is billed as input on every call.

3. Which of the following MOST accurately describes what the max_tokens parameter controls?

✓ Correct — Correct. max_tokens is a ceiling, not a target. The model stops generating when it naturally completes its response OR when it hits max_tokens, whichever comes first. Setting it too low can truncate responses; it doesn't force brevity — proper prompting does.

Not quite. max_tokens only caps output generation — it's a maximum, not an exact count. The model will generate fewer tokens if the response naturally concludes before reaching the limit.

Lab — Token Cost Calculator

Practice estimating and comparing API costs across scenarios.

Your Task

You're working with a cost-estimation assistant. Walk through real cost calculation scenarios — ask it to help you estimate monthly API costs for different application designs, compare model tiers for specific workloads, or explore what happens when you change system prompt length, response length, or request volume.

Try asking: "My app sends 200,000 requests per day. Each request has a 1,500-token system prompt, a 300-token user message, and gets a 400-token response. What's my monthly cost on Claude 3.5 Sonnet versus Haiku?"

Module 7 · Lesson 2

Prompt Caching

Anthropic's prompt caching feature can eliminate up to 90% of input token costs for repeated context — if you know how to use it.

How does prompt caching work, and when does it deliver maximum savings?

When Anthropic launched prompt caching in beta in August 2024, the announcement included a reference implementation showing a legal document analysis workflow. A 50-page contract (roughly 25,000 tokens) analyzed with 10 different questions would, without caching, incur 250,000 input tokens for the document alone. With caching, the document is written to the cache once (at a 25% premium) and each of the nine later questions reads it at 10% of the normal rate; only the short questions are charged at full rate. Cost reduction on the document portion: roughly 78% across these ten questions, approaching 90% as the number of questions grows, provided the requests arrive within the cache window.

How Prompt Caching Works

Prompt caching allows you to mark specific portions of your prompt with a cache_control parameter set to {"type": "ephemeral"}. When the API receives a request with a cache-marked prefix, it checks whether that exact token sequence is already stored in its cache. If yes, it reuses the stored key-value computation rather than reprocessing those tokens — and charges you at a dramatically reduced rate.

Cached input tokens cost 10% of the normal input price (a 90% discount). The first request that creates the cache entry is charged at 1.25× the normal input price (a one-time 25% premium). Subsequent requests within the cache window recoup this investment in a fraction of the calls.

# Prompt caching example — system prompt with cache_control
{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are an expert legal analyst... [2000 tokens of instructions]",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Summarize section 4.2"}
  ]
}
    

Cache Duration and Limits

Cache entries have a 5-minute TTL (time-to-live) from the last use. Each use of a cached entry resets the timer, so frequently used caches can persist indefinitely in active applications. Caches are isolated per organization and are never shared between organizations.

The minimum cacheable prefix length is 1,024 tokens for Claude 3.5 Sonnet and Claude 3 Opus, and 2,048 tokens for Claude 3 Haiku. System prompts shorter than these thresholds cannot be cached. This is a key reason why consolidating instructions into a single longer prompt can reduce costs: a prompt that crosses the threshold becomes cacheable, and its discounted cache reads can outweigh the cost of a slightly larger first request.

You can define up to four cache breakpoints per request, enabling multi-section caching — for example, caching a static system prompt separately from a large shared document that changes daily.
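A request with two breakpoints might be assembled as follows. This is a sketch under assumptions: `build_cached_request` is a hypothetical helper, the placeholder strings stand in for real content that would need to meet the minimum cacheable length, and the resulting dictionary is meant to be passed to the Messages API as keyword arguments.

```python
def build_cached_request(system_prompt: str, document: str, question: str) -> dict:
    """Build keyword arguments for a Messages API call with two cache breakpoints."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "system": [
            # Breakpoint 1: static instructions, identical on every request.
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            # Breakpoint 2: the shared document, which changes daily.
            {"type": "text", "text": document,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Only the question varies per request, so it stays outside the cache.
        "messages": [{"role": "user", "content": question}],
    }


request = build_cached_request(
    "You are an expert legal analyst...",
    "[full contract text]",
    "Summarize section 4.2",
)
# With the official Python SDK this would be sent as:
#   client.messages.create(**request)
print(len(request["system"]), "cache breakpoints")
```

Placing the static instructions before the document means that when the document changes, only the second segment has to be rewritten to the cache; the prefix ending at the first breakpoint can still be read from it.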

What to Cache

The highest-value targets for caching are content that is static or slow-changing and large. System prompts are the obvious first target. Beyond that, consider: reference documents (product manuals, legal agreements, codebases), few-shot examples included in every prompt, and conversation history in multi-turn sessions.

Best Target

System Prompts

Sent on every request. Even a 2,000-token system prompt cached across 100k daily calls moves 200 million input tokens per day to the 90% discount, equivalent to saving about 180 million full-price input tokens.

Up to 90% off input

High Value

Reference Documents

PDFs, codebases, manuals passed to every call. Cache the document content; only the per-question tokens are billed at full rate.

Up to 90% off input

Good Target

Few-Shot Examples

Large banks of worked examples included to improve output quality. Cache the examples block; only the new input changes each call.

Up to 90% off input

Multi-Turn

Conversation History

In long conversations, mark the growing history as cacheable. Each new turn pays full price only for the latest message, not the entire history.

Up to 90% off history

Break-Even Calculation

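Cache creation costs 1.25× normal input price. Cache hits cost 0.10× normal. The write premium is therefore 0.25× and each hit saves 0.90×, so break-even occurs after 0.25 / 0.90 ≈ 0.28 cache hits: the first hit already more than repays the premium. If at least two requests use the same cached prefix within the cache window, you are net-saving money. At any real traffic volume, caching almost always wins.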
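One way to check the economics for your own traffic is to compare the cost of sending the same prefix repeatedly with and without caching. The sketch below uses the multipliers stated above and assumes every request after the first arrives within the cache window; costs are expressed in units of one full-price send of the prefix.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # the first request writes the cache
CACHE_READ_MULTIPLIER = 0.10   # later requests read from it


def relative_cost(n_requests: int, cached: bool) -> float:
    """Cost of sending one prefix n times, in units of one full-price send."""
    if n_requests == 0:
        return 0.0
    if not cached:
        return float(n_requests)
    hits = n_requests - 1
    return CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * hits


for n in (1, 2, 10):
    uncached = relative_cost(n, cached=False)
    with_cache = round(relative_cost(n, cached=True), 2)
    print(n, uncached, with_cache)
```

A prefix that is only ever sent once costs 25% more with caching enabled; at ten sends the cached total is 2.15 units against 10 uncached.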

API Response Fields

When prompt caching is active, the API returns a usage object with additional fields: cache_creation_input_tokens (tokens written to cache this request) and cache_read_input_tokens (tokens read from cache). Logging these fields lets you verify cache hit rates and debug cache misses — for example, if a content change invalidates the cache more often than expected.
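A simple hit-rate metric can be computed from logged usage data. The sketch below assumes each logged record is a plain dictionary holding the usage fields named above plus input_tokens; the sample records are hypothetical.

```python
def cache_hit_ratio(usage_records: list[dict]) -> float:
    """Share of all prompt tokens that were served from the cache."""
    # "or 0" tolerates records where a cache field is missing or None.
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usage_records)
    written = sum(u.get("cache_creation_input_tokens") or 0 for u in usage_records)
    uncached = sum(u.get("input_tokens") or 0 for u in usage_records)
    total = read + written + uncached
    return read / total if total else 0.0


# Hypothetical logged usage: one cache write followed by two cache hits.
logged = [
    {"input_tokens": 40, "cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0},
    {"input_tokens": 35, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
    {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
]
print(round(cache_hit_ratio(logged), 3))
```

A ratio that drops unexpectedly is a signal that something in the cached prefix is changing between requests, for example a timestamp embedded in the system prompt.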

Quiz — Prompt Caching

Three questions on caching mechanics and best use cases.

1. A cached prompt prefix costs how much compared to normal input token pricing?

✓ Correct — Correct. Cache hits are billed at 10% of normal input token price — a 90% discount. The first call that writes the cache pays a one-time 1.25× premium, but this is recovered after fewer than two subsequent cache hits.

Not quite. Cache hits cost 10% of normal input pricing (90% discount). Cache creation costs 1.25× normal — a one-time premium that breaks even very quickly.

2. What is the minimum prompt length required to be eligible for caching on Claude 3.5 Sonnet?

✓ Correct — Correct. Claude 3.5 Sonnet and Claude 3 Opus require at least 1,024 tokens to cache a prefix. Claude 3 Haiku requires 2,048 tokens. Prompts shorter than these thresholds are not eligible — an important constraint when designing small system prompts.

The minimum for Claude 3.5 Sonnet is 1,024 tokens. Claude 3 Haiku has a higher minimum of 2,048 tokens. Prompts shorter than the threshold cannot be cached regardless of how cache_control is set.

3. A cache entry expires after 5 minutes of inactivity. Which application pattern would MOST benefit from prompt caching?

✓ Correct — Correct. High-traffic applications with shared, stable context (same system prompt + same docs across many users) get maximum benefit from caching. The cache is hit constantly, the 5-minute TTL is continuously refreshed, and the savings compound with every request.

Caching benefits most when the same prefix is used frequently within a 5-minute window. A high-traffic app with shared system prompts hits the cache constantly. Single-run scripts, nightly batches with unique content, or highly personalized prompts each create the cache but rarely get enough hits to maximize savings.

Lab — Prompt Caching Strategy

Design caching strategies for real application architectures.

Your Task

Work with a caching strategy assistant to design the optimal caching approach for different application types. Describe an application you're building (or a hypothetical one), and explore which parts of the prompt should be cached, where to place cache breakpoints, and how to estimate savings.

Try asking: "I'm building a RAG-powered legal research tool. Each query searches a corpus of 200 case documents, then sends the top 3 relevant ones (about 15,000 tokens total) plus a 3,000-token system prompt to Claude. How should I structure caching for this?"

Module 7 · Lesson 3

Model Selection & Prompt Optimization

Choosing the right model and writing leaner prompts are the two highest-leverage decisions in API cost management.

How do you match model capability to task requirements — and how do you write prompts that don't waste tokens?

In 2024, multiple engineering teams publicly documented their migration from GPT-4-class models to lighter-weight alternatives for classification and routing tasks. Anthropic's own documentation describes a common pattern: using Claude 3 Haiku as a "triage" model to classify incoming requests, then routing only complex cases to Sonnet or Opus. Teams reported cost reductions of 60–80% on the triage tier while maintaining end-user quality perception, because most requests were simple enough that Haiku handled them perfectly.

The Model-Task Fit Framework

The core principle: use the cheapest model that reliably succeeds at the task. This sounds obvious but requires systematic evaluation. The instinct to "just use the best model" is expensive at scale and often unnecessary.

Model selection should be driven by empirical testing, not intuition. Build a representative test set of 50–200 inputs with known correct outputs. Run the same prompt on Haiku, Sonnet, and Opus. Measure accuracy or quality against your acceptance threshold. If Haiku achieves 95% and your bar is 90%, you've found your model.
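The evaluation loop described above can be sketched in a few lines. Everything here is illustrative: the model labels, the tiny test set, and the stub predictors, which stand in for real API calls so that the example runs offline.

```python
from typing import Callable, Optional

MODELS_CHEAPEST_FIRST = ["haiku", "sonnet", "opus"]


def accuracy(predict: Callable[[str], str], test_set: list[tuple[str, str]]) -> float:
    """Fraction of test inputs where the model's answer matches the expected label."""
    correct = sum(1 for text, expected in test_set if predict(text).strip() == expected)
    return correct / len(test_set)


def cheapest_passing_model(predictors: dict[str, Callable[[str], str]],
                           test_set: list[tuple[str, str]],
                           threshold: float) -> Optional[str]:
    """Return the cheapest model whose accuracy meets the acceptance threshold."""
    for name in MODELS_CHEAPEST_FIRST:
        if accuracy(predictors[name], test_set) >= threshold:
            return name
    return None


# A real test set would hold 50-200 labeled inputs; four keep the sketch short.
test_set = [
    ("great product", "positive"),
    ("terrible service", "negative"),
    ("not bad at all", "positive"),
    ("would not recommend", "negative"),
]


def weak_stub(text: str) -> str:
    return "positive" if "great" in text else "negative"


def strong_stub(text: str) -> str:
    return "negative" if ("terrible" in text or "recommend" in text) else "positive"


stubs = {"haiku": weak_stub, "sonnet": strong_stub, "opus": strong_stub}
print(cheapest_passing_model(stubs, test_set, threshold=0.90))  # sonnet
print(cheapest_passing_model(stubs, test_set, threshold=0.70))  # haiku
```

In practice each predictor would wrap an API call with the same prompt, and exact string matching would be replaced by whatever quality measure fits the task.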

Task Type	Recommended Start	Notes
Binary classification, sentiment	Haiku	Simple pattern matching; Haiku excels here at a fraction of Sonnet cost
Entity extraction, tagging	Haiku → Sonnet if needed	Test Haiku first; escalate only for ambiguous domains
Summarization (standard)	Haiku or Sonnet	Haiku works for news/simple docs; Sonnet for technical/nuanced material
Code generation, debugging	Sonnet	Haiku struggles with complex logic; Sonnet is cost-effective here
Complex multi-step reasoning	Sonnet → Opus if needed	Opus for genuinely hard reasoning; Sonnet handles most cases
Creative writing, nuanced tone	Sonnet or Opus	Task-dependent; evaluate subjectively with human raters

Prompt Optimization for Token Efficiency

Beyond model selection, the text of your prompts directly controls token consumption. Verbose, redundant prompts waste money on every call. The principles of token-efficient prompting:

Technique 1

Eliminate Preamble

Remove phrases like "You are an incredibly helpful, thoughtful, and intelligent AI assistant..." These consume tokens without improving output. State your constraints directly.

50–200 tokens saved

Technique 2

Specify Output Format

Ask for JSON, bullet points, or single-sentence answers rather than open-ended prose. Structured outputs are typically shorter and more parseable.

30–60% output reduction

Technique 3

Remove Redundancy

Audit system prompts for repeated instructions. "Do not hallucinate" stated four times costs 4× the tokens of stating it once. Consolidate.

Variable

Technique 4

Truncate Context

In RAG pipelines, don't pass entire documents when a relevant excerpt suffices. Semantic chunking + top-k retrieval can reduce context by 80% with minimal quality impact.

Up to 80% input reduction

Technique 5

Shorten Conversation History

In multi-turn chat, summarize older turns rather than replaying the full raw history. A 100-token summary of 10 old turns saves 1,500+ tokens per subsequent request.

Large savings at scale

Technique 6

Use stop_sequences

Define stop sequences to halt generation exactly when needed. For structured tasks, stop at the closing bracket rather than letting the model add a concluding paragraph.

Prevents overrun
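Technique 5 can be sketched as a small helper that keeps the most recent turns verbatim and condenses everything older. The `compact_history` function and the stub summarizer are hypothetical; a real summarizer would typically call an inexpensive model.

```python
from typing import Callable


def compact_history(messages: list[dict], keep_last: int,
                    summarize: Callable[[list[dict]], str]) -> tuple[str, list[dict]]:
    """Split history into a summary of older turns plus the recent turns verbatim."""
    split = max(len(messages) - keep_last, 0)
    older, recent = messages[:split], messages[split:]
    summary = summarize(older) if older else ""
    return summary, recent


# Stub summarizer so the sketch runs offline.
def stub_summarize(older: list[dict]) -> str:
    return f"{len(older)} earlier messages discussed a billing question."


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(12)
]
summary, recent = compact_history(history, keep_last=4, summarize=stub_summarize)
print(summary)
print([m["content"] for m in recent])
```

The summary can be appended to the system prompt while `recent` is sent as the message list. Choosing an even `keep_last` on an alternating history keeps the retained turns starting with a user message.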

Tiered Routing Architecture

A powerful pattern is building a classifier-router that uses a small, cheap model to categorize each incoming request, then routes it to the appropriate model tier. For example: Haiku classifies "Is this a simple FAQ, a complex reasoning task, or an edge case?" Simple FAQs go to a pre-written template or Haiku itself. Complex reasoning goes to Sonnet. Edge cases go to Opus.

The classification call itself costs almost nothing (a few input tokens + a single output token). The routing decision it enables can save orders of magnitude in downstream cost.

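# Tiered routing sketch using the anthropic Python SDK
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
HAIKU = "claude-3-haiku-20240307"
SONNET = "claude-3-5-sonnet-20241022"
OPUS = "claude-3-opus-20240229"

def call(model, system, message, max_tokens=1024):
    # Send one user message and return the text of the model's reply
    response = client.messages.create(
        model=model, max_tokens=max_tokens, system=system,
        messages=[{"role": "user", "content": message}],
    )
    return response.content[0].text

def route_request(user_message, system_prompt):
    # A cheap Haiku call labels the request; the label picks the model tier
    label = call(HAIKU, "Classify as: SIMPLE, COMPLEX, or EDGE_CASE. One word only.",
                 user_message, max_tokens=5).strip().upper()
    if label == "SIMPLE":
        return call(HAIKU, system_prompt, user_message)
    if label == "COMPLEX":
        return call(SONNET, system_prompt, user_message)
    return call(OPUS, system_prompt, user_message)  # EDGE_CASE or unrecognized label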
    

Measurement First

Never optimize blindly. Use the usage field in every API response to log actual token counts per call. Aggregate by endpoint, user type, and time of day. Real usage data almost always reveals that 20% of call types consume 80% of tokens — making optimization efforts extremely targeted and high-ROI.

Quiz — Model Selection & Prompt Optimization

Three questions on choosing models and trimming token waste.

1. A team is building a system that must classify customer support tickets into 12 categories. The task requires no reasoning, just pattern recognition. Which approach is most cost-effective?

✓ Correct — Correct. Simple pattern-recognition classification is exactly the task where Haiku excels. Starting with the cheapest model and escalating only if quality tests fail is the correct methodology. Defaulting to Sonnet without testing is unnecessarily expensive; Opus for a 12-category classifier is almost certainly overkill.

For simple classification, start with the cheapest capable model and test empirically. Haiku handles pattern-based categorization extremely well. Escalating to Sonnet or Opus without testing first means paying 12–60× more without evidence of need.

2. Which prompt optimization technique has the highest impact on OUTPUT token costs specifically?

✓ Correct — Correct. Output format specification directly controls response verbosity. When you ask for JSON with specific fields, or a bullet list with a defined maximum, or a single-sentence answer, the model produces exactly that — often 30–60% fewer output tokens than open-ended prose that invites elaboration. Output tokens are the most expensive per-token, so this is the highest-leverage output-side optimization.

The other options reduce input tokens. Output token costs are best controlled by specifying structured, constrained output formats. Open-ended prose prompts invite the model to elaborate; specific format instructions produce leaner, more predictable output lengths.

3. In a tiered routing architecture, what does the initial "router" call typically use as its model, and why?

✓ Correct — Correct. The router call is made on 100% of requests, so model cost multiplies by total volume. Classification into a few categories is well within Haiku's capability. Using a cheap model for routing means the overhead of the routing step is negligible compared to the savings from sending only complex requests to expensive models.

The router runs on every request, so its cost is critical. Classification is a simple task suited for Haiku. Using Opus or even Sonnet for routing would add significant cost on every call — defeating the purpose of the tiered architecture.

Lab — Prompt Audit Workshop

Submit a prompt for token-efficiency analysis and optimization recommendations.

Your Task

Submit any system prompt or complete prompt structure to the prompt auditor. It will identify token waste, suggest more concise alternatives, recommend output format changes, and estimate savings. You can also ask it to help you design a tiered routing system for a specific use case.

Try asking: "Here's my current system prompt: 'You are an incredibly helpful, knowledgeable, and always available AI assistant that is eager to help users with any questions they might have. You should always be polite, professional, and thorough in your responses, making sure to address all aspects of the user's question in detail...' How can I trim this down?"

Module 7 · Lesson 4

Batching, Streaming & Production Cost Controls

Message Batches API, streaming configuration, and the operational controls that prevent budget overruns in production.

What tools does Anthropic provide for bulk processing and cost governance at production scale?

In October 2024, Anthropic released the Message Batches API in public beta, a feature that allows developers to submit up to 10,000 requests in a single batch, processed asynchronously within 24 hours. In exchange for giving up real-time responses, every token is billed at a 50% discount. For any workload that doesn't require real-time responses (nightly data enrichment, document indexing, report generation, evals), the savings are immediate and require no changes to prompt logic.

Message Batches API

The Message Batches API accepts up to 10,000 individual requests submitted as a single batch. Anthropic processes them over up to 24 hours and returns results as they complete. Every token in a batch request — input and output — is charged at 50% of the standard API price.

Each request in the batch is fully independent: different system prompts, different models, different max_tokens values. A single batch submission can mix Haiku requests for simple classification alongside Sonnet requests for complex analysis — all at half price.

# Submitting a message batch
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
batch = client.beta.messages.batches.create(
    requests=[
        {
            "custom_id": "req-001",
            "params": {
                "model": "claude-3-haiku-20240307",
                "max_tokens": 100,
                "messages": [{"role": "user", "content": "Classify sentiment: 'Great product!'"}]
            }
        },
        # ... up to 9,999 more requests
    ]
)
# Poll for completion, pausing between checks instead of busy-looping
while batch.processing_status != "ended":
    time.sleep(60)
    batch = client.beta.messages.batches.retrieve(batch.id)
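
Workloads larger than one batch need to be split before submission. The helper below is a sketch: `build_batches` is a hypothetical function, the 10,000 limit is the one stated in this lesson, and each returned list is intended to be passed as the requests argument of the create call above.

```python
MAX_REQUESTS_PER_BATCH = 10_000  # per-batch limit stated in this lesson


def build_batches(texts: list[str],
                  model: str = "claude-3-haiku-20240307") -> list[list[dict]]:
    """Turn raw texts into request lists, each within the per-batch limit."""
    requests = [
        {
            # A unique ID lets each result be matched back to its input.
            "custom_id": f"req-{i:06d}",
            "params": {
                "model": model,
                "max_tokens": 100,
                "messages": [
                    {"role": "user", "content": f"Classify sentiment: {text!r}"}
                ],
            },
        }
        for i, text in enumerate(texts)
    ]
    return [
        requests[start:start + MAX_REQUESTS_PER_BATCH]
        for start in range(0, len(requests), MAX_REQUESTS_PER_BATCH)
    ]


batches = build_batches([f"feedback entry {n}" for n in range(25_000)])
print([len(b) for b in batches])  # [10000, 10000, 5000]
```

Unique custom_id values matter because results should be matched to inputs by ID rather than by position.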
    

Streaming and Cost

Streaming (receiving response tokens as they are generated rather than waiting for the full response) does not change token costs — you pay for the same tokens whether you stream or not. However, streaming affects perceived latency and can affect how you bill downstream users if your application charges per response.

Where streaming matters for cost: it enables you to implement early termination. If your application displays streaming output to a user who quickly navigates away or cancels, you can close the connection and stop generation mid-response, preventing the model from generating (and billing you for) tokens the user will never see. This requires client-side connection management but can meaningfully reduce output token consumption in interactive applications with high abandonment rates.

Production Cost Controls

At production scale, cost surprises are operational incidents. These controls prevent them:

Control 1

Spending Limits

Anthropic's Console allows setting monthly spending limits with email alerts at custom thresholds (e.g., alert at 80%, hard stop at 100%). Configure these before deploying any production workload.

Control 2

Per-User Rate Limiting

Implement rate limits in your application layer: max requests per user per minute/hour/day. A single abusive user or malfunctioning bot loop can exhaust monthly budgets in hours without rate limiting.

Control 3

Token Logging

Log usage.input_tokens, usage.output_tokens, model, and user ID on every API call. Aggregate daily. This data enables anomaly detection and identifies cost-per-user outliers quickly.

Control 4

Input Validation

Validate and truncate user-supplied content before it reaches the API. A user who pastes a 500-page document into a chat field should be shown an error — not silently billed for 300,000 input tokens.

Control 5

Caching at App Layer

Cache API responses for identical or near-identical inputs at the application layer (Redis, memcached). If 30% of your queries are repeated FAQ lookups, serving them from cache costs zero API tokens.

Control 6

Cost Alerts on Anomalies

Track rolling hourly token spend. If spend exceeds 3× the hourly average, trigger an alert. Bugs like infinite retry loops or recursive prompting can be caught within minutes rather than at month-end billing.
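
Control 6 can be prototyped with a rolling window of hourly spend. This is a minimal sketch: the `SpendAnomalyDetector` class, the 24-hour window, and the sample figures are assumptions, and a production version would read spend from your token logs and page someone instead of printing.

```python
from collections import deque


class SpendAnomalyDetector:
    """Flags an hour whose spend exceeds a multiple of the recent hourly average."""

    def __init__(self, window_hours: int = 24, multiplier: float = 3.0):
        self.history = deque(maxlen=window_hours)  # USD spend per completed hour
        self.multiplier = multiplier

    def record_hour(self, spend_usd: float) -> bool:
        """Record one hour of spend; return True if it should trigger an alert."""
        baseline = sum(self.history) / len(self.history) if self.history else None
        is_anomaly = (
            baseline is not None
            and baseline > 0
            and spend_usd > self.multiplier * baseline
        )
        # Keep anomalous hours out of the baseline so a long incident keeps alerting.
        if not is_anomaly:
            self.history.append(spend_usd)
        return is_anomaly


detector = SpendAnomalyDetector()
for hourly_spend in [10.0, 12.0, 11.0, 9.0]:
    detector.record_hour(hourly_spend)

print(detector.record_hour(45.0))  # True: 45 exceeds 3 x the 10.5 average
print(detector.record_hour(13.0))  # False: within the normal range
```

Excluding flagged hours from the baseline is a design choice: otherwise a sustained incident would raise the average and silence its own alert.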

Workload Classification for Cost

Not all workloads have the same latency requirements. Classify yours before choosing an API pattern:

Workload Type	Latency Need	Recommended Pattern	Cost Impact
User-facing chat	Real-time (<3s)	Standard API + streaming	Full price; optimize with caching
Background enrichment	Hours acceptable	Message Batches API	50% discount
Nightly report generation	Overnight	Message Batches API	50% discount
Model evaluation / evals	Hours acceptable	Message Batches API	50% discount
Webhook / event processing	Seconds to minutes	Standard API; consider async queue	Full price; optimize with model tier

Combined Strategy

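The maximum cost reduction stack: use Haiku for appropriate tasks (about 98% cheaper than Opus at the prices in Lesson 1) + prompt caching on system prompts (90% off cached input tokens) + Message Batches API where latency allows (50% off everything). Applied together on an eligible workload, a task that costs $1,000 on Opus in real time drops to about $17 on Haiku and about $8 with batching; if roughly three-quarters of the remaining spend is cacheable input, caching brings it under $3, a 99.7% reduction.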
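
How the three discounts compound depends on how much of the bill is cacheable input. The sketch below is a simplified model: `stacked_cost` is a hypothetical helper, the 1/60 ratio is the Haiku-to-Opus price ratio from Lesson 1, the 75% cacheable share is an assumption, and the one-time cache-write premium is ignored.

```python
def stacked_cost(baseline_usd: float, model_price_ratio: float,
                 cacheable_share: float, use_batch: bool) -> float:
    """Apply model, caching, and batch discounts in sequence to a baseline cost."""
    cost = baseline_usd * model_price_ratio   # switch to the cheaper model tier
    cost *= 1 - 0.90 * cacheable_share        # 90% off the cached-input share
    if use_batch:
        cost *= 0.50                          # Message Batches discount
    return cost


# $1,000 on Opus, moved to Haiku, with 75% of spend assumed to be cacheable input.
print(round(stacked_cost(1000.0, 1 / 60, cacheable_share=0.75, use_batch=True), 2))
# The same workload with nothing cacheable.
print(round(stacked_cost(1000.0, 1 / 60, cacheable_share=0.0, use_batch=True), 2))
```

Under these assumptions the result ranges from about $2.71 when most spend is cacheable input to about $8.33 when none is, so output-heavy workloads benefit far less from the caching layer of the stack.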

Quiz — Batching, Streaming & Production Controls

Three questions on production-scale cost management tools.

1. What pricing discount does the Message Batches API provide compared to the standard synchronous API?

✓ Correct — Correct. The Message Batches API charges 50% of standard pricing on all tokens — both input and output. The trade-off is asynchronous processing (up to 24 hours). For any non-real-time workload, this is one of the most straightforward cost reductions available.

The Message Batches API offers a 50% discount on all tokens. It's asynchronous (up to 24 hours), but for background processing, nightly jobs, or evaluations, this is a direct halving of API costs with no prompt engineering required.

2. Does enabling streaming (Server-Sent Events) change the number of tokens billed for a response?

✓ Correct — Correct. Streaming does not change token billing. You pay for all generated tokens regardless of delivery method. The only cost-related advantage of streaming is the ability to terminate the connection early — stopping generation before completion — which prevents billing for unseen tokens in high-abandonment interactive apps.

Streaming itself doesn't change token costs — you pay for generated tokens the same way either way. However, streaming enables early termination (closing the connection mid-generation), which prevents the model from completing (and billing for) tokens the user never sees.

3. An engineering team discovers their API spend tripled overnight without any new feature releases. Which production control would MOST likely have caught this earliest?

✓ Correct — Correct. A monthly cap would catch the problem only at billing time (potentially weeks away). An hourly anomaly alert would fire within the first hour — likely within minutes of the bug appearing — enabling rapid response. Real-time anomaly detection on rolling spend is the control that catches sudden-onset problems before they compound.

Monthly limits would only catch this at the end of the month. Prompt caching and input validation address different problems. A rolling hourly anomaly alert — triggering when spend suddenly spikes above a multiple of the baseline — is the control designed to catch unexpected cost surges within minutes of occurrence.

Lab — Production Cost Architecture

Design a complete cost optimization stack for a production API deployment.

Your Task

Work with a production architecture advisor to design a complete cost-optimization strategy. Describe a production application (or use the one provided) and explore: which workloads should use batch vs. real-time, where to apply prompt caching, which model tiers to use, and what monitoring controls to implement.

Try asking: "I'm running a SaaS platform with two workloads: (1) a real-time customer chat that handles 50,000 requests/day with a 2,000-token system prompt, and (2) a nightly pipeline that classifies 200,000 customer feedback entries. What's the optimal cost architecture?"

Module 7 — Cost Optimization

15 questions · Score 80% or above to pass · All lessons covered

1. The Anthropic API bills for which of the following?

✓ Correct — Correct. Billing is per input token + per output token, at separate rates.

The API bills per token — input tokens at one rate, output tokens at a higher rate. Neither time nor request count drives the bill.

2. Why are output tokens priced higher than input tokens on all Claude models?

✓ Correct — Correct. Autoregressive generation requires a sequential forward pass per token; input processing is parallelized — making generation computationally more expensive.

The difference is computational: input can be processed in parallel, while each output token requires a sequential forward pass through the model.

3. A system prompt is 4,000 tokens. An application makes 1,000,000 API calls per day without caching. How many input tokens does the system prompt contribute daily?

✓ Correct. 4,000 × 1,000,000 = 4,000,000,000 (4 billion) tokens per day from the system prompt alone.

Without caching, the full system prompt is sent and billed on every call: 4,000 × 1,000,000 = 4 billion tokens per day.

4. What discount do cache HIT tokens receive compared to standard input token pricing?

✓ Correct. Cache hits are billed at 10% of the normal input token price, a 90% discount. Cache creation costs 1.25× normal, and that 0.25× premium is recovered by the very first subsequent hit, which saves 0.9× on its own.

Cache hits are 10% of normal input pricing (90% off). The first call that creates the cache pays 1.25× as a one-time creation fee.
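
The break-even arithmetic can be checked with a short sketch. It uses only the two multipliers quoted above (1.25× to write, 0.10× to read) and measures cost in units of "one normal, uncached send of the prefix". The function names are illustrative, not part of any SDK.

```python
# Multipliers quoted in this lesson; treat them as assumptions for your model.
CACHE_WRITE_MULTIPLIER = 1.25   # first call that creates the cache entry
CACHE_READ_MULTIPLIER = 0.10    # every later call that hits the cache


def relative_cost_with_cache(hits: int) -> float:
    """Cost of one cache write plus `hits` cache reads, in units of one normal send."""
    return CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * hits


def relative_cost_without_cache(hits: int) -> float:
    """Cost of sending the same prefix uncached on the first call and on every later call."""
    return 1.0 + hits


def break_even_hits() -> float:
    """Number of hits at which the caching premium has been recovered."""
    premium = CACHE_WRITE_MULTIPLIER - 1.0
    saving_per_hit = 1.0 - CACHE_READ_MULTIPLIER
    return premium / saving_per_hit


if __name__ == "__main__":
    for hits in (0, 1, 10):
        print(hits, relative_cost_with_cache(hits), relative_cost_without_cache(hits))
    print("break-even after", round(break_even_hits(), 2), "hits")
```

Because the break-even point is below one hit, caching is cheaper as soon as the prefix is reused once within the cache lifetime.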

5. What is the minimum cacheable prefix length for Claude 3.5 Sonnet?

✓ Correct. Claude 3.5 Sonnet and Claude 3 Opus require at least 1,024 tokens to cache. Claude 3 Haiku requires 2,048 tokens.

The minimum is 1,024 tokens for Sonnet and Opus; 2,048 tokens for Haiku. Prompts shorter than this threshold cannot be cached.

6. How long does a prompt cache entry remain valid without being accessed?

✓ Correct. Cache entries expire after 5 minutes of inactivity. Each cache hit resets the 5-minute timer, so active applications effectively maintain persistent caches.

Cache entries expire after 5 minutes of inactivity. Frequent access keeps the cache alive indefinitely — each use resets the 5-minute countdown.
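
To see how traffic spacing interacts with the 5-minute timer, here is a minimal simulation. It is a sketch, not a model of the real service: it assumes a single cached prefix, that a request arriving exactly at the 300-second mark still counts as a hit, and that every miss rewrites the cache. `simulate_cache` is a hypothetical helper name.

```python
from typing import Iterable, Tuple


def simulate_cache(request_times_s: Iterable[float], ttl_s: float = 300) -> Tuple[int, int]:
    """Return (cache_writes, cache_hits) for requests at the given times in seconds.

    Each access, hit or write, restarts the inactivity timer.
    """
    writes = 0
    hits = 0
    last_access = None
    for t in sorted(request_times_s):
        if last_access is not None and t - last_access <= ttl_s:
            hits += 1      # entry still alive: billed at the cache-read rate
        else:
            writes += 1    # entry missing or expired: billed at the cache-write rate
        last_access = t
    return writes, hits


if __name__ == "__main__":
    # One request every 4 minutes for an hour: the entry never expires.
    print(simulate_cache(range(0, 3600, 240)))
    # Bursty traffic with long gaps: most requests pay the write price.
    print(simulate_cache([0, 60, 400, 1000]))
```

Steady traffic with gaps shorter than the timeout pays for one write and then only reads; sparse traffic can end up paying the write premium repeatedly.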

7. Which content type provides the MOST cost benefit from prompt caching?

✓ Correct. Caching pays off most when the cached content is large (more tokens saved per hit) and reused frequently (more hits to amortize the creation cost). Static system prompts and reference documents are ideal.

Caching delivers maximum ROI on large, stable content used in many requests — system prompts, product documentation, reference manuals. Personalized or dynamic content changes too often to cache effectively.

8. What is the PRIMARY method to reduce output token costs?

✓ Correct. Specifying structured, concise output formats directly controls response length. Open-ended prompts invite elaboration; precise format instructions produce leaner, more predictable outputs, often 30–60% shorter.

Lowering max_tokens only truncates responses, sometimes mid-sentence; it does not make the model write more concisely. Prompt caching doesn't affect output tokens. Specifying structured output formats is the primary lever for output token reduction.

9. A tiered routing architecture uses a cheap model to classify requests before routing them. What is the PRIMARY purpose of this pattern?

✓ Correct. Tiered routing matches task complexity to model capability. Simple tasks go to cheap models; only genuinely complex requests reach expensive models. The classification call costs nearly nothing, but the routing decision saves significant money at scale.

Tiered routing is a cost optimization pattern. It ensures expensive models (Sonnet, Opus) only handle tasks that actually require their capability, while cheaper models (Haiku) handle the majority of simpler requests.
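
A rough way to judge whether routing is worth it is to compute the expected cost per request, including the classification call. The sketch below uses the Haiku and Sonnet list prices from this module's pricing table. The classifier token counts (300 in, 5 out) and the 70% "simple" share are made-up illustration values, and the function names are hypothetical.

```python
# USD per million tokens, from the pricing table in Lesson 1.
HAIKU = {"input": 0.25, "output": 1.25}
SONNET = {"input": 3.00, "output": 15.00}


def call_cost(prices: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single call at the given per-million-token prices."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000


def routed_cost_per_request(
    simple_share: float,
    input_tokens: int,
    output_tokens: int,
    classifier_input_tokens: int = 300,   # assumed size of the routing prompt
    classifier_output_tokens: int = 5,    # assumed: a one-word label
) -> float:
    """Expected cost when Haiku classifies every request and answers the simple ones."""
    classify = call_cost(HAIKU, classifier_input_tokens, classifier_output_tokens)
    answer = (
        simple_share * call_cost(HAIKU, input_tokens, output_tokens)
        + (1 - simple_share) * call_cost(SONNET, input_tokens, output_tokens)
    )
    return classify + answer


if __name__ == "__main__":
    baseline = call_cost(SONNET, 1_000, 500)            # everything goes to Sonnet
    routed = routed_cost_per_request(0.7, 1_000, 500)   # 70% handled by Haiku
    print(f"baseline ${baseline:.6f}  routed ${routed:.6f}")
```

The comparison only holds if Haiku's answers are acceptable for the requests routed to it, so the simple share should come from measured quality, not from a guess.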

10. How does the Message Batches API differ from the standard synchronous API?

✓ Correct. The Batches API is asynchronous: you submit up to 10,000 requests and poll for results, which arrive within 24 hours. In exchange, all tokens (input and output) are billed at 50% of standard pricing.

The Batches API is asynchronous (results in up to 24 hours) and costs 50% of standard pricing. It's not faster — it's cheaper in exchange for latency tolerance.
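
The trade-off can be written down as a simple planning rule: a workload qualifies for batch pricing only if it can wait the full 24 hours. The sketch below encodes that rule and the 50% multiplier from the explanation above. The `Workload` type, the `plan` function, and the dollar figures in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

BATCH_PRICE_MULTIPLIER = 0.5   # batch tokens billed at 50% of standard pricing
BATCH_MAX_WAIT_HOURS = 24      # results may take up to 24 hours


@dataclass
class Workload:
    name: str
    standard_cost_usd: float   # daily cost at standard (synchronous) pricing
    max_wait_hours: float      # how long the caller can wait for a result


def plan(workload: Workload) -> Tuple[str, float]:
    """Pick batch or real-time for a workload and return (mode, daily cost in USD)."""
    if workload.max_wait_hours >= BATCH_MAX_WAIT_HOURS:
        return "batch", workload.standard_cost_usd * BATCH_PRICE_MULTIPLIER
    return "real-time", workload.standard_cost_usd


if __name__ == "__main__":
    for w in (
        Workload("customer chat", standard_cost_usd=100.0, max_wait_hours=0.001),
        Workload("nightly feedback classification", standard_cost_usd=20.0, max_wait_hours=24),
    ):
        print(w.name, plan(w))
```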

11. Does enabling streaming change the number of tokens billed?

✓ Correct. Streaming doesn't change billing mechanics. However, closing the stream connection early stops generation, and you're only billed for tokens generated up to that point. This can reduce costs in applications where users frequently abandon responses.

Streaming mode doesn't affect token billing rates. The only cost benefit is the ability to terminate early — stopping generation mid-stream means the model stops producing (and billing for) tokens at that point.

12. Which production control most effectively prevents a billing surprise caused by a runaway retry loop in application code?

✓ Correct. A runaway loop can exhaust a budget in hours or even minutes. A monthly cap only stops spending once the entire month's budget is gone. Real-time anomaly alerts on rolling hourly spend detect sudden spikes within minutes, enabling a response before significant damage is done.

A monthly cap only triggers after the full monthly budget has been consumed. Input validation and caching address different problems. Real-time hourly anomaly detection catches sudden spikes, such as a retry loop, almost immediately.
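
One way such an alert could work is a rolling one-hour window over per-call costs, compared against a multiple of the normal hourly spend. The sketch below is a minimal in-process version under stated assumptions: `HourlySpendMonitor`, the 3× spike factor, and the baseline figure are all illustrative, and a real deployment would feed this from billing or logging data and page someone instead of returning a boolean.

```python
from collections import deque


class HourlySpendMonitor:
    """Flags when spend in the trailing window exceeds spike_factor × baseline."""

    def __init__(self, baseline_hourly_usd: float, spike_factor: float = 3.0,
                 window_s: float = 3600) -> None:
        self.limit_usd = baseline_hourly_usd * spike_factor
        self.window_s = window_s
        self.events = deque()  # (timestamp_s, cost_usd), oldest first

    def record(self, timestamp_s: float, cost_usd: float) -> bool:
        """Record one API call's cost; return True if the window is over the limit."""
        self.events.append((timestamp_s, cost_usd))
        # Drop calls that have fallen out of the trailing window.
        while self.events and self.events[0][0] <= timestamp_s - self.window_s:
            self.events.popleft()
        window_spend = sum(cost for _, cost in self.events)
        return window_spend > self.limit_usd


if __name__ == "__main__":
    monitor = HourlySpendMonitor(baseline_hourly_usd=10.0)  # alert above $30/hour
    print(monitor.record(0, 5.0))      # normal traffic
    print(monitor.record(10, 5.0))     # still normal
    print(monitor.record(20, 25.0))    # sudden burst pushes the hour to $35
```

Summing the window on every call keeps the sketch simple; at high request rates a running total would be cheaper.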

13. In a RAG pipeline, which optimization reduces input tokens most dramatically?

✓ Correct. Passing entire documents can add tens of thousands of tokens per request. Retrieving only the top-k relevant chunks via semantic search reduces context by 80–95% with minimal quality loss, the single largest input token reduction available in RAG architectures.

In RAG, the retrieved document context is typically the largest token cost. Passing full documents instead of relevant excerpts can waste 90%+ of context tokens. Semantic chunking + top-k retrieval is the primary optimization.
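
The size of the saving is easy to estimate once chunks have relevance scores. The sketch below assumes the scores were already produced by some semantic search step (not shown) and that each chunk carries its token count; the tuple layout, the function names, and the 20,000-token document are illustrative.

```python
from typing import List, Tuple

Chunk = Tuple[float, int, str]  # (relevance score, token count, text)


def select_top_k(chunks: List[Chunk], k: int) -> List[Chunk]:
    """Keep the k highest-scoring chunks."""
    return sorted(chunks, key=lambda chunk: chunk[0], reverse=True)[:k]


def context_reduction(full_document_tokens: int, selected: List[Chunk]) -> float:
    """Fraction of input tokens saved versus sending the whole document."""
    used = sum(tokens for _, tokens, _ in selected)
    return 1 - used / full_document_tokens


if __name__ == "__main__":
    chunks = [
        (0.9, 500, "refund policy"),
        (0.2, 500, "company history"),
        (0.8, 500, "refund exceptions"),
        (0.1, 500, "press releases"),
    ]
    top = select_top_k(chunks, k=2)
    print([text for _, _, text in top])
    print(f"{context_reduction(20_000, top):.0%} fewer context tokens")
```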

14. What field in the API response confirms that prompt caching is working correctly?

✓ Correct. The usage object in the API response includes cache_read_input_tokens (tokens served from cache) and cache_creation_input_tokens (tokens written to cache). Logging these lets you verify hit rates and debug cache misses.

The correct fields are usage.cache_read_input_tokens (how many tokens were served from cache) and usage.cache_creation_input_tokens (how many were written to cache this call).
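
A hit rate can be derived from logged usage data. The sketch below makes no API calls: it assumes each response's usage object has been logged as a plain dictionary, and that a third field, input_tokens, holds the input tokens that were neither read from nor written to the cache. The `cache_hit_rate` helper and the sample numbers are illustrative.

```python
from typing import Dict, Iterable


def cache_hit_rate(usage_records: Iterable[Dict[str, int]]) -> float:
    """Share of all input tokens that were served from cache."""
    read = created = uncached = 0
    for usage in usage_records:
        read += usage.get("cache_read_input_tokens", 0)
        created += usage.get("cache_creation_input_tokens", 0)
        uncached += usage.get("input_tokens", 0)  # assumed: uncached input only
    total = read + created + uncached
    return read / total if total else 0.0


if __name__ == "__main__":
    logged = [
        # First call writes a 2,000-token prefix to the cache.
        {"input_tokens": 50, "cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0},
        # The next three calls read it back.
        {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
        {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
        {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
    ]
    print(f"hit rate: {cache_hit_rate(logged):.1%}")
```

A hit rate that stays near zero while cache_creation_input_tokens keeps growing usually means the prefix changes between calls or requests arrive too far apart.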

15. An application currently uses Claude 3 Opus for all requests, has a 3,000-token system prompt sent on every call, and processes 500,000 requests/day. They can switch most requests to Haiku and add prompt caching. Approximately what combined cost reduction is achievable?

✓ Correct. The effects compound. At the list prices in this module, moving from Opus to Haiku cuts per-token cost by about 98% ($15.00 versus $0.25 per million input tokens, a 60× gap). Prompt caching then removes 90% of the remaining system prompt input cost, which is significant at 3,000 tokens × 500,000 calls. On the requests that can move to Haiku, the combined reduction can exceed 98%.

The combination is extremely powerful: Haiku costs about 98% less than Opus per token, and prompt caching then removes 90% of the already-reduced system prompt cost. Together, on an appropriate workload, these optimizations can cut cost by 98–99%.
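
The system prompt portion of this scenario can be worked through directly. The sketch below uses the input prices from the Lesson 1 table and the 10% cache-read rate from the caching lesson. It is deliberately simplified: it assumes every call is a cache hit, ignores the occasional cache write, and leaves user messages and output tokens out entirely.

```python
OPUS_INPUT_USD_PER_M = 15.00
HAIKU_INPUT_USD_PER_M = 0.25
CACHE_READ_MULTIPLIER = 0.10   # cache hits billed at 10% of the input price


def system_prompt_cost_per_day(
    prompt_tokens: int,
    calls_per_day: int,
    price_per_million_usd: float,
    price_multiplier: float = 1.0,
) -> float:
    """Daily USD cost of sending the system prompt on every call."""
    millions_of_tokens = prompt_tokens * calls_per_day / 1_000_000
    return millions_of_tokens * price_per_million_usd * price_multiplier


if __name__ == "__main__":
    before = system_prompt_cost_per_day(3_000, 500_000, OPUS_INPUT_USD_PER_M)
    after = system_prompt_cost_per_day(
        3_000, 500_000, HAIKU_INPUT_USD_PER_M, CACHE_READ_MULTIPLIER
    )
    print(f"Opus, uncached:  ${before:,.2f}/day")
    print(f"Haiku, cached:   ${after:,.2f}/day")
    print(f"reduction:       {1 - after / before:.2%}")
```

The overall bill falls by less than this figure, since output tokens and uncached user input are not discounted by caching and some requests may still need a larger model.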