Working with the Anthropic API

1. Anthropic's documentation advises NOT combining which two sampling parameters simultaneously?

Correct. temperature and top_p should not be used together — their interaction produces unpredictable sampling behavior.

Anthropic advises against combining temperature and top_p — their interaction is unpredictable. Use one or the other.

2. Which of these workloads is the best candidate for Anthropic's Message Batches API?

Correct. Nightly batch processing with no realtime requirement is ideal for the Batches API: separate quota pool, 50% discount, 24-hour completion window fits perfectly.

The Batches API is for non-realtime workloads with flexible SLAs (up to 24 hours). Real-time, streaming, and interactive use cases require the standard synchronous API.

3. How long does a prompt cache entry remain valid without being accessed?

✓ Correct — Correct. Cache entries expire after 5 minutes of inactivity. Each cache hit resets the 5-minute timer, so active applications effectively maintain persistent caches.

Cache entries expire after 5 minutes of inactivity. Frequent access keeps the cache alive indefinitely — each use resets the 5-minute countdown.

4. What happens when a streaming response from the Anthropic API fails mid-stream?

Correct. There is no mid-stream resume mechanism. A failure requires full restart — production code should retry the entire request and surface an error only after retries are exhausted.

Streaming is stateless. A mid-stream connection failure requires restarting from request zero. Production systems typically retry the full request and show an error to users only after all retries fail.

5. A system prompt is 4,000 tokens. An application makes 1,000,000 API calls per day without caching. How many input tokens does the system prompt contribute daily?

✓ Correct — Correct. 4,000 × 1,000,000 = 4,000,000,000 (4 billion) tokens per day from the system prompt alone.

Without caching, the full system prompt is sent and billed on every call: 4,000 × 1,000,000 = 4 billion tokens per day.

6. When Claude outputs a tool_use content block, what must happen next before Claude can produce a final answer?

Correct. Your code runs the function, then sends back the assistant's tool_use turn plus a user tool_result message.

Claude cannot self-execute tools. Your code must execute the function and return the result in the proper message format.

7. What discount do cache HIT tokens receive compared to standard input token pricing?

✓ Correct — Correct. Cache hits are billed at 10% of the normal input token price — a 90% discount. Cache creation costs 1.25× normal, but breaks even after ~1.4 subsequent hits.

Cache hits are 10% of normal input pricing (90% off). The first call that creates the cache pays 1.25× as a one-time creation fee.

8. Why should max_tokens be set conservatively rather than at the model maximum?

Correct. max_tokens caps output length, which bounds OTPM consumption per request. Tighter caps mean more requests can fit within the same per-minute token budget.

The rate management benefit: lower max_tokens limits OTPM per request, leaving more token budget for concurrent or subsequent requests. This directly increases throughput without changing your limit tier.

9. Which content type provides the MOST cost benefit from prompt caching?

✓ Correct — Correct. Caching benefits maximize when the cached content is large (more tokens saved per hit) and reused frequently (more hits to amortize creation cost). Static system prompts and reference documents are ideal.

Caching delivers maximum ROI on large, stable content used in many requests — system prompts, product documentation, reference manuals. Personalized or dynamic content changes too often to cache effectively.

10. Why do output tokens cost more per million than input tokens in Anthropic's pricing?

Correct. Each output token requires a full forward pass, while input tokens are processed in a single parallel attention computation.

The compute asymmetry: input processing is one parallel attention pass over all tokens; output generation is N sequential forward passes, one per token — hence higher cost.

11. What happens when you send a GIF file to the Anthropic API?

Correct. GIF animation is not processed — only the first frame is used. Plan accordingly when working with animated content.

For GIFs, only the first frame is processed. Animation frames are ignored. The API accepts the format but treats it as a static image.

12. Which HTTP status code does Anthropic use for infrastructure saturation (distinct from rate limiting)?

Correct. HTTP 529 (overloaded_error) is Anthropic's custom status for capacity saturation. It's retryable with backoff.

Anthropic uses the non-standard HTTP 529 for infrastructure saturation (overloaded_error). 429 is for rate limits, 500 for server errors.

13. Which field in the API response gives the actual output token count (not the cap)?

Correct. usage.output_tokens is the precise completion token count. It's always ≤ max_tokens and is the value to use for accurate quota accounting.

usage.output_tokens is the direct field. While usage.total_tokens - usage.input_tokens would give the same number mathematically, usage.output_tokens is the explicit and correct field to read.

14. What does stop_reason: "tool_use" tell you?

Correct. "tool_use" stop reason signals that Claude has produced tool_use content blocks and is waiting for results before it can continue.

"tool_use" means Claude is pausing to request tool execution. "end_turn" is the stop reason when Claude has produced a final answer.

15. Which of the following best describes the Anthropic API's memory behavior between separate API calls?

Correct. The Anthropic API is stateless — each API call is independent, and the developer must include all relevant history in the messages array.

Incorrect. The API is entirely stateless. There is no server-side session, cache, or timeout-based memory. All history must be passed explicitly.

16. When is streaming generally NOT the right choice?

Correct. When you need the complete response before doing anything with it, streaming adds complexity without UX benefit.

Batch pipelines needing the full response before processing are better served by blocking calls.

17. A 429 response includes "retry-after: 15". Your computed full-jitter backoff for this attempt is 6 seconds. What should your actual wait be?

Correct. retry-after is a hard minimum floor. Your 6-second backoff is irrelevant when it's below the server's required wait. Wait at least 15 seconds; you may add jitter on top.

The server's retry-after value is a minimum — you must wait at least that long. Your computed backoff only matters if it exceeds the server minimum. Here 15 > 6, so 15 seconds is the floor.

18. What is the PRIMARY method to reduce output token costs?

✓ Correct — Correct. Specifying structured, concise output formats directly controls response length. Open-ended prompts invite elaboration; precise format instructions produce leaner, more predictable outputs — often 30–60% shorter.

Low max_tokens can truncate responses mid-sentence without reducing verbosity in shorter outputs. Prompt caching doesn't affect output tokens. Specifying structured output formats is the primary lever for output token reduction.

19. An 80-page contract needs to be processed. Given the 20-image-per-request limit, what is the minimum number of API calls required?

Correct. 80 pages ÷ 20 images/call = 4 calls exactly. With overlapping windows, you would need slightly more.

80 ÷ 20 = 4 calls. With overlapping windows for context continuity, you'd need slightly more, but the minimum is 4.

20. In a streaming tool use response, what delta type carries partial tool input JSON?

Correct. content_block_delta events during tool use carry delta.type = "input_json_delta" with partial JSON strings.

Tool use deltas use delta.type = "input_json_delta" to carry incremental JSON fragments.

Final Exam