Module 4 · Lesson 1

How Claude Sees Images

Vision capabilities, supported formats, and the mechanics of sending image data through the API
What exactly happens when you pass an image to Claude — and what constraints govern what it can and cannot perceive?

In March 2024, Anthropic released Claude 3 with native vision capabilities — the first Claude models able to process images alongside text. Shortly after launch, developers building document-processing pipelines discovered that the models could read handwritten lab reports, parse complex financial tables, and describe architectural drawings with accuracy that surprised even internal researchers. The capability was not bolted on — it was trained into the model weights from scratch, meaning vision and language share the same representational space.

Supported Image Formats

The Anthropic API accepts four image formats: JPEG, PNG, GIF, and WebP. GIF support covers only the first frame — the API does not process animation sequences. Each image may be up to 5 MB in size before base64 encoding. After encoding, the payload grows by approximately one-third, so a 3.75 MB raw PNG becomes roughly 5 MB of base64 text.
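The one-third overhead follows from how base64 works: every 3 raw bytes become 4 output characters, with padding on the final group. A minimal sketch of that arithmetic follows; the function name `base64_encoded_size` is illustrative and not part of any SDK.

```python
import base64
import math


def base64_encoded_size(raw_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


MB = 1024 * 1024
raw = int(3.75 * MB)
print(base64_encoded_size(raw) / MB)  # 5.0

# Cross-check the formula against the standard library on a small input
assert base64_encoded_size(10) == len(base64.standard_b64encode(b"x" * 10))
```

Running the same function on your own file sizes shows how large the request body will be before you send it.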

Images can be delivered in two ways: as a base64-encoded data string embedded directly in the request body, or as a public URL that Claude fetches at inference time. URL-based delivery is convenient for large images already hosted on the web, but base64 is preferable when you need deterministic, latency-controlled requests without external dependencies.
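The two delivery modes differ only in the `source` object of the image block. The helper below is a sketch of one way to build either form; `image_block` and `SUPPORTED_MEDIA_TYPES` are hypothetical names, and guessing the media type from the file extension is a shortcut that assumes files are named correctly.

```python
import base64
import mimetypes
from pathlib import Path

SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def image_block(source: str) -> dict:
    """Build an image content block from an HTTPS URL or a local file path."""
    if source.startswith("https://"):
        # URL delivery: the image is fetched when the request is processed
        return {"type": "image", "source": {"type": "url", "url": source}}

    # Base64 delivery: embed the file contents directly in the request body
    media_type, _ = mimetypes.guess_type(source)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image type for {source!r}: {media_type}")

    data = base64.standard_b64encode(Path(source).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The returned dictionary can be placed directly in a message's `content` array alongside text blocks.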

Formats: JPEG · PNG · GIF (frame 1) · WebP
Max Size: 5 MB per image (pre-encoding)
Max Images / Request: Up to 20 images per API call
Delivery Modes: Base64-encoded data · Public HTTPS URL
The Message Structure

Vision requests use the same messages array as text requests, but individual message content becomes an array of content blocks rather than a plain string. Each block has a type field — either "text" or "image" — and blocks can be freely interleaved. This means you can ask Claude to compare two images by inserting text between them, or to annotate a chart by placing your question after it.

# Base64 image in a message
import base64
import pathlib

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Read the image file and encode it as base64 text
image_data = base64.standard_b64encode(
    pathlib.Path("chart.png").read_bytes()
).decode("utf-8")

message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # Image block first, then the question about it
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Describe the trends shown in this chart.",
            },
        ],
    }],
)

print(message.content[0].text)
Token Cost of Images

Images consume tokens from your context window, and the cost scales with image dimensions. This lesson estimates that cost with a tiling model: the image is divided into 512×512 pixel tiles, each tile costs approximately 1,600 tokens, and a base cost of 85 tokens is added once per image. A standard 1,000×1,000 PNG spans 2×2 tiles and consumes roughly 6,485 tokens, far more than a full page of text. This matters for cost estimation and for multi-image workflows where context limits become a concern.

The practical implication: resize images to the smallest dimensions that preserve the information you need before sending them. For a receipt OCR task, a 600-pixel-wide scan is often as effective as a 4,000-pixel photograph and costs a fraction of the tokens.
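Before resizing, you need the target dimensions. The sketch below computes them while preserving aspect ratio; `fit_within` is a hypothetical helper, and the actual pixel resampling would be done by an imaging library such as Pillow.

```python
def fit_within(width: int, height: int, max_width: int) -> tuple[int, int]:
    """Scale (width, height) down so width <= max_width, keeping aspect ratio."""
    if width <= max_width:
        return width, height  # never upscale a small image
    scale = max_width / width
    return max_width, max(1, round(height * scale))


print(fit_within(4000, 3000, 600))  # (600, 450)
```

Choosing the limit per task (for example, narrower for receipts than for drawings) keeps the decision in one place in the pipeline.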

Token Calculation Formula

tokens = 85 + (1600 × ceil(width/512) × ceil(height/512)). A 1024×1024 image uses 85 + (1600 × 2 × 2) = 6,485 tokens. Always resize to task-appropriate dimensions to control cost.
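The formula translates directly into a small estimator. This sketch implements the formula exactly as stated in this lesson; the constant and function names are illustrative. Treat the result as a planning estimate and compare it with the input token count the API reports in each response's usage data.

```python
import math

TILE = 512              # tile edge in pixels, as stated in this lesson
TOKENS_PER_TILE = 1600  # per-tile cost, as stated in this lesson
BASE_TOKENS = 85        # added once per image


def estimate_image_tokens(width: int, height: int) -> int:
    """Estimate image tokens with the lesson's tiling formula."""
    tiles = math.ceil(width / TILE) * math.ceil(height / TILE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


print(estimate_image_tokens(1024, 1024))  # 6485
print(estimate_image_tokens(2400, 3000))  # 48085
```

The second call shows why resizing matters: a 2400×3000 scan spans 5×6 tiles, more than seven times the estimate for a 1024×1024 image.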

What Claude Can and Cannot Perceive

Claude's vision performs well on printed and handwritten text (OCR-level accuracy in most Latin scripts), charts and graphs, diagrams and technical drawings, photographs of real-world scenes, and screenshots of UIs and documents. It can reason spatially about object positions, compare multiple images for differences, and interpret visual metaphors in infographics.

Current limitations include reduced accuracy on very small text, heavy watermarks obscuring content, extremely low-contrast images, and non-Latin scripts with complex diacritics. The model also does not identify real people by face — it will describe physical appearance but will not name private individuals from photographs. This is a deliberate policy constraint, not a technical limitation.

Privacy Constraint

Claude will not identify or name private individuals from photographs, even when presented with high-quality images. This applies to faces in crowd scenes, employee photos, and similar contexts. The restriction is enforced at the model level and cannot be overridden via prompt.

Lesson 1 Quiz

How Claude Sees Images — check your understanding
Which image formats does the Anthropic API accept? (Choose the complete list.)
Correct. JPEG, PNG, GIF (first frame only), and WebP are the four supported formats. SVG, TIFF, and BMP are not accepted.
Not quite. The supported formats are JPEG, PNG, GIF (first frame only), and WebP — not SVG, TIFF, or BMP.
What is the maximum size of a single image before base64 encoding?
Correct. The limit is 5 MB per image before base64 encoding. Base64 adds roughly one-third overhead, so the encoded payload can reach about 6.7 MB.
The correct limit is 5 MB per image before encoding. After base64 encoding it grows by about one-third.
Why does resizing images before sending them matter for API usage?
Correct. Token cost uses a tiling formula (85 + 1600 × tiles). Reducing dimensions reduces tile count, directly lowering both cost and context usage.
The key reason is token cost. Image tokens scale with dimensions via a tiling formula, so unnecessarily large images waste tokens without improving task accuracy.
Can Claude identify named private individuals from photographs?
Correct. Claude will not identify private individuals by face regardless of image quality or prompt instructions. This is a policy constraint baked into the model.
No. Claude does not identify private individuals from photos under any prompt instruction. The restriction is a model-level policy, not a capability gap.

Lab 1 — Image Input Mechanics

Practice structuring vision requests and understanding format constraints

Your Mission

You are building a document-processing pipeline that needs to send images to Claude. Discuss image format selection, base64 vs URL delivery, token cost estimation, and how to structure multi-image requests.

Try asking: "I have a 2400×3000 PNG receipt scan — how many tokens will it use?" or "When should I use URL delivery instead of base64?" or "How do I send two images in one message and ask Claude to compare them?"
Module 4 · Lesson 2

Document Understanding at Scale

PDF processing, multi-page documents, and extracting structured data from complex layouts
How do you handle real-world documents — invoices, contracts, research papers — that span multiple pages and mix text with tables and charts?

In late 2024, Harvey AI — a legal-tech company — integrated Claude to process multi-hundred-page court filings and merger agreements. Their engineers reported that Claude could accurately extract defined terms from dense contract definitions sections, identify cross-references between clauses, and summarize indemnification provisions with accuracy comparable to junior associates reviewing documents for the first time. The key engineering challenge was not capability but throughput: splitting documents into overlapping chunks that preserved context across page boundaries.

The PDF Problem

The Anthropic API does not accept PDF files directly. PDFs must be converted to images before being sent — one image per page. This is not a limitation unique to Claude; most vision models operate this way. The conversion process is a standard part of document-processing pipelines, typically handled by libraries such as pdf2image (Python), pdftoppm (command line), or cloud document AI services that return page images.

When converting PDFs to images, 150–200 DPI is usually sufficient for text legibility. Higher DPI produces larger images that cost more tokens without improving comprehension for standard documents. For documents with very small print — footnotes, fine print in contracts — 300 DPI may be warranted.
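DPI determines the pixel dimensions of each rendered page, which in turn drive token cost. The sketch below works through the arithmetic for a US Letter page (8.5 × 11 inches); `page_pixels` is a hypothetical helper. Libraries such as pdf2image accept a `dpi` argument when rendering, so the same numbers apply there.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given physical size rendered at dpi."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (150, 200, 300):
    print(dpi, page_pixels(8.5, 11, dpi))
# 150 (1275, 1650)
# 200 (1700, 2200)
# 300 (2550, 3300)
```

Doubling the DPI quadruples the pixel count, so the step from 150 to 300 DPI is a large cost increase for the same page.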

Note — As of early 2025

Anthropic's API documentation notes that direct PDF support may be introduced. Always check the current docs at docs.anthropic.com for the latest supported input types. This module reflects the state of the API at the time of writing.

Chunking Multi-Page Documents

A single API request supports up to 20 images. A 100-page contract therefore requires at least five requests. The challenge is that information often spans page boundaries — a table may start on page 14 and end on page 15, or a defined term on page 3 may be referenced on page 47. Two strategies handle this:

  • Overlapping windows: Include the last page of the previous chunk as the first page of the next chunk. This ensures Claude sees context across boundaries. A 20-page window with 1-page overlap requires roughly 5% more requests but preserves continuity.
  • Semantic pre-splitting: Use a lighter text extraction pass (e.g., pdfplumber) to identify natural boundaries like section headings, then split on those boundaries rather than fixed page counts. This often produces cleaner results for structured documents.
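The overlapping-window strategy can be sketched as a function that returns the page indices for each request. The name `overlapping_chunks` and its defaults (a 20-page window with a 1-page overlap, matching the figures above) are illustrative.

```python
def overlapping_chunks(num_pages: int, window: int = 20, overlap: int = 1) -> list[list[int]]:
    """Split page indices 0..num_pages-1 into windows that share `overlap` pages."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")

    chunks = []
    start = 0
    step = window - overlap  # how far each new window advances
    while start < num_pages:
        end = min(start + window, num_pages)
        chunks.append(list(range(start, end)))
        if end == num_pages:
            break  # the last page is covered; stop
        start += step
    return chunks


chunks = overlapping_chunks(100)
print(len(chunks))                  # 6
print(chunks[1][0], chunks[1][-1])  # 19 38
```

For a 100-page document the overlap pushes the count from five requests to six, because each window after the first contributes only 19 new pages.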
Extracting Structured Data

For tables, invoices, and forms, prompt Claude to return structured output rather than narrative prose. Asking for JSON or CSV output directly from a visual scan is reliable when the layout is consistent. The pattern below is used in production document-processing pipelines at companies including Vanta (compliance automation) and Thoughtful AI (revenue cycle management).

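# Invoice extraction: structured output
import base64
import pathlib

import anthropic

client = anthropic.Anthropic()

# "invoice.png" is a placeholder path for one rendered invoice page
invoice_b64 = base64.standard_b64encode(
    pathlib.Path("invoice.png").read_bytes()
).decode("utf-8")

system = """You are a document extraction AI. When given invoice images,
extract data as JSON. Return ONLY valid JSON, no prose."""

user_content = [
    {"type": "image", "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": invoice_b64,
    }},
    # The schema doubles as the instruction: Claude fills in the empty values
    {"type": "text", "text": """Extract:
{
  "vendor": "",
  "invoice_number": "",
  "date": "",
  "line_items": [{"description": "", "qty": 0, "unit_price": 0}],
  "subtotal": 0,
  "tax": 0,
  "total": 0
}"""},
]

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=2048,
    system=system,
    messages=[{"role": "user", "content": user_content}],
)

print(response.content[0].text)  # the extracted JSON as text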
Handling Tables and Mixed Layouts

Complex table layouts — merged cells, multi-level headers, rotated text — can confuse extraction. When accuracy is critical, use a two-pass approach: first ask Claude to describe the table structure ("How many columns are there? What are the headers?"), then ask it to extract the data. The structural awareness gained in the first pass reduces errors in the second.

For financial documents with regulatory importance, always implement a validation step: parse the extracted JSON, recompute totals, and flag discrepancies for human review. Claude is highly accurate but not infallible on complex layouts, and production pipelines should treat its output as a high-quality first draft requiring light verification.
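The validation step described above can be sketched as follows. The field names assume the invoice schema used in this lesson's extraction example; `validate_invoice` and the one-cent tolerance are illustrative choices, not a prescribed implementation.

```python
import json


def validate_invoice(raw_json: str, tolerance: float = 0.01) -> list[str]:
    """Return discrepancy messages; an empty list means the totals agree."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    problems = []

    # Recompute the subtotal from the line items
    line_sum = sum(item["qty"] * item["unit_price"] for item in data["line_items"])
    if abs(line_sum - data["subtotal"]) > tolerance:
        problems.append(
            f"line items sum to {line_sum:.2f}, subtotal says {data['subtotal']:.2f}"
        )

    # Recompute the total from subtotal and tax
    expected_total = data["subtotal"] + data["tax"]
    if abs(expected_total - data["total"]) > tolerance:
        problems.append(
            f"subtotal + tax is {expected_total:.2f}, total says {data['total']:.2f}"
        )
    return problems
```

Any non-empty result would send the document to human review instead of downstream systems.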

Production Pattern

For high-stakes extraction (legal, financial, medical), implement a confidence-flagging step: ask Claude to rate its own certainty for each field on a 1–3 scale. Fields rated 1 or 2 are routed to human review. This hybrid approach typically achieves 95%+ accuracy with less than 10% human review burden.

Lesson 2 Quiz

Document Understanding at Scale — check your understanding
How must PDF documents be prepared before sending to the Anthropic API?
Correct. The API does not accept PDF files directly. Each page must be rendered as an image (JPEG or PNG) and sent as an image content block.
The API does not accept PDFs directly. Each page must be converted to an image (JPEG or PNG) and sent as an image content block.
What is the maximum number of images per API request?
Correct. A single API request supports up to 20 images. Longer documents require multiple sequential requests.
The limit is 20 images per request. Documents longer than 20 pages must be processed in multiple requests.
What DPI is generally sufficient for converting standard PDF documents to images?
Correct. 150–200 DPI is sufficient for standard text. Higher DPI produces larger images, more tokens, and higher cost without meaningful accuracy improvement for typical documents.
150–200 DPI is the practical sweet spot. Higher DPI costs more tokens without improving comprehension for standard documents. Reserve 300 DPI for very small print.
Why is a two-pass approach recommended for complex table extraction?
Correct. First asking Claude to describe the table structure (columns, headers, layout) gives it the structural awareness needed to extract data more accurately in the second pass.
The reason is accuracy: understanding structure before extracting data reduces errors on complex layouts with merged cells, rotated text, or multi-level headers.

Lab 2 — Document Extraction Design

Design chunking strategies and extraction prompts for real document workflows

Your Mission

You're building a document AI pipeline for a legal firm that processes 50–200 page contracts. Design your chunking strategy, DPI settings, and structured extraction prompts. Discuss the trade-offs and best practices for production accuracy.

Try asking: "How should I chunk a 150-page contract for Claude?" or "Design a JSON extraction schema for vendor invoices" or "How do I handle tables that span multiple pages?" or "What's the overlapping window strategy in practice?"
Module 4 · Lesson 3

Prompting Strategies for Visual Reasoning

How to write prompts that elicit accurate, detailed, and structured responses from visual inputs
What prompt techniques produce reliably accurate visual analysis — and how do you guide Claude through complex multi-image comparisons?

In 2024, researchers at Stanford Medicine published a study evaluating Claude 3 Opus on radiology report generation from chest X-rays. They found that zero-shot performance — asking the model to describe what it sees without any example — was less accurate than prompting the model to first identify anatomical landmarks, then assess each region systematically. Structured chain-of-thought prompting improved diagnostic accuracy by roughly 18 percentage points compared to open-ended image description prompts. The insight generalized: visual reasoning improves when Claude is asked to reason step by step through a visual, not just describe it holistically.

The Placement Principle

Where you position images relative to your text prompt matters. The Anthropic documentation recommends placing images before the text instructions for most tasks. This mirrors how humans process visual context — we look at an image, then read a question about it. Placing an image after a long prompt can cause the model to anchor on the text framing before fully processing the visual.

For comparison tasks (e.g., "Which of these two designs is more accessible?"), place images first, then ask your comparative question. For annotation-style tasks (e.g., "List the defects in this PCB image and describe where each one is"), place the image first and describe what to look for in the text that follows.

Single Image: Image → then question. Works for description, extraction, OCR, classification.
Multi-Image Comparison: Image A → Image B → then comparative question. Label each image in your text.
Annotated Sequence: Image → instruction → next image → instruction chain. Useful for step-by-step walkthroughs.
Visual + Text Context: Text context → image → question. Use when text context is needed to interpret the visual.
Chain-of-Thought for Visual Tasks

Asking Claude to reason step by step before concluding dramatically improves accuracy on ambiguous or complex visual tasks. The pattern works because each reasoning step constrains the next, reducing the chance of hallucination in the final answer.

# Chain-of-thought visual analysis
prompt = """Analyze this circuit board image systematically:

Step 1: List all visible components (capacitors, ICs, connectors, etc.)
Step 2: Note any visible damage, burns, or physical anomalies
Step 3: Check solder joint quality: look for cold joints or bridges
Step 4: Based on steps 1-3, give your assessment: pass / fail / needs-rework

Format your answer with each step clearly labeled."""
Labeling Images in Multi-Image Prompts

When sending multiple images, always reference them by position in your text: "In the first image…", "Comparing image 1 and image 2…", "The third image shows…". Claude will correctly map these references to the corresponding image blocks. Without explicit labeling, comparisons can become ambiguous in Claude's response.

For workflows where image order may change dynamically, you can embed label text as a content block between images:

# Label images explicitly in the content array
# before_url and after_url hold publicly reachable HTTPS image URLs
content = [
    {"type": "text", "text": "Image A (before treatment):"},
    {"type": "image", "source": {"type": "url", "url": before_url}},
    {"type": "text", "text": "Image B (after treatment):"},
    {"type": "image", "source": {"type": "url", "url": after_url}},
    {"type": "text", "text": "Compare image A and image B. What changed? "
                             "Rate the improvement on a scale of 0-10."},
]
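When the number or order of images varies at runtime, the labels can be generated rather than hand-written. The helper below is a sketch; `labeled_image_content` is a hypothetical name, and it assumes URL delivery and letter labels as in the example above.

```python
from string import ascii_uppercase


def labeled_image_content(images: list[tuple[str, str]], question: str) -> list[dict]:
    """Build a content array from (description, url) pairs plus a final question."""
    if len(images) > len(ascii_uppercase):
        raise ValueError("too many images to label with single letters")

    content = []
    for letter, (description, url) in zip(ascii_uppercase, images):
        # A text label immediately before each image keeps references unambiguous
        content.append({"type": "text", "text": f"Image {letter} ({description}):"})
        content.append({"type": "image", "source": {"type": "url", "url": url}})
    content.append({"type": "text", "text": question})
    return content
```

Because labels are assigned in list order, reordering the input list reorders the labels with it, so the question text can always refer to "Image A" and "Image B".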
Handling Uncertainty and Low-Confidence Regions

For tasks where accuracy is critical, explicitly prompt Claude to flag uncertainty. Phrases like "If any part of the image is unclear or ambiguous, say so explicitly rather than guessing" significantly reduce hallucinated details. The model is capable of epistemic humility when instructed — it will not always volunteer uncertainty unprompted.

You can also ask Claude to describe what it cannot see: "What information in this form is illegible or cut off?" This is particularly useful for scanned documents with shadows, folds, or damage.

Prompt Pattern — Explicit Uncertainty

Add to any high-stakes visual prompt: "If any text is illegible, any region is obscured, or you are less than 80% confident in any extracted value, mark that field with [UNCERTAIN] rather than guessing." This pattern catches problematic extractions before they reach downstream systems.
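On the receiving side, the marker has to be detected before data moves on. The sketch below walks a parsed JSON structure and returns the path of every value containing the marker; `uncertain_fields` is a hypothetical helper.

```python
MARKER = "[UNCERTAIN]"


def uncertain_fields(value, path: str = "") -> list[str]:
    """Return the paths of all string values that contain the marker."""
    found = []
    if isinstance(value, dict):
        for key, child in value.items():
            found += uncertain_fields(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found += uncertain_fields(child, f"{path}[{index}]")
    elif isinstance(value, str) and MARKER in value:
        found.append(path)
    return found


record = {
    "vendor": "Acme",
    "date": "[UNCERTAIN]",
    "line_items": [{"description": "[UNCERTAIN]", "qty": 1}],
}
print(uncertain_fields(record))  # ['date', 'line_items[0].description']
```

An empty result means the record can proceed; anything else identifies exactly which fields a reviewer needs to check.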

Output Format Control

Visual tasks benefit from strict output format instructions because the content is inherently unstructured. For classification: "Respond with exactly one word: pass, fail, or review." For extraction: "Return only valid JSON, no prose." For description: "Use bullet points, one observation per bullet, maximum 15 bullets." Tight output specs prevent the model from adding contextual commentary that complicates downstream parsing.
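A strict format instruction is only useful if the pipeline also enforces it. This sketch validates a one-word classification response; `parse_verdict` is a hypothetical helper, and tolerating case, surrounding whitespace, and a trailing period is an assumption about what counts as acceptable.

```python
ALLOWED_VERDICTS = {"pass", "fail", "review"}


def parse_verdict(response_text: str) -> str:
    """Normalize a one-word classification, rejecting anything else."""
    verdict = response_text.strip().lower().rstrip(".")
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected classification output: {response_text!r}")
    return verdict


print(parse_verdict("Pass\n"))  # pass
```

Raising on unexpected output, rather than guessing, lets the caller retry or route the item for review.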

Lesson 3 Quiz

Prompting Strategies for Visual Reasoning — check your understanding
For a single-image analysis task, where should the image be placed relative to your text prompt?
Correct. Anthropic recommends placing images before text instructions. This mirrors natural human visual processing — view the image, then read the question.
Anthropic recommends placing images before text instructions for most tasks. This lets the model process the visual fully before reading the question framing.
What is the primary benefit of chain-of-thought prompting for visual analysis tasks?
Correct. Structured step-by-step reasoning constrains each subsequent step, making the final conclusion more accurate and less prone to hallucination on ambiguous visuals.
Chain-of-thought works because each intermediate reasoning step constrains the next. The accumulated structured reasoning produces more accurate final conclusions than holistic description.
You're sending two images for comparison and want Claude to clearly distinguish between them in its response. What should you do?
Correct. Inserting text content blocks with explicit labels (Image A, Image B) between image blocks is the recommended pattern for unambiguous multi-image comparison.
The best approach is inserting labeled text blocks between images in the content array. This creates unambiguous references Claude can use in its response.
To reduce hallucinated values in high-stakes extraction, what prompt instruction is most effective?
Correct. Explicitly instructing the model to flag uncertainty rather than guess is the most direct way to prevent hallucinated values in extraction tasks.
The most effective approach is instructing Claude to mark any uncertain or illegible field as [UNCERTAIN] rather than guessing. This surfaces problems before they reach downstream systems.

Lab 3 — Visual Prompting Workshop

Craft and refine prompts for visual analysis, extraction, and comparison tasks

Your Mission

Practice writing and critiquing prompts for visual AI tasks. You're a prompt engineer improving the reliability of a quality-control vision system that inspects manufactured components. Design prompts that produce consistent, structured, uncertainty-aware output.

Try asking: "Write a chain-of-thought prompt for inspecting PCB solder joints" or "How should I structure a prompt comparing before/after product images?" or "What's the best way to get Claude to output a JSON classification from an image?" or "How do I instruct Claude to flag when an image is too blurry to assess?"
Module 4 · Lesson 4

Building Vision Pipelines in Production

Error handling, cost management, caching, and architectural patterns for vision-enabled applications
How do you move from a working prototype to a scalable, cost-controlled, fault-tolerant vision pipeline that handles real-world document volume?

Klarna, the Swedish fintech company, announced in early 2024 that it had deployed AI systems handling the equivalent of 700 full-time customer service agents. A subset of that workload involved processing images — screenshots of transactions, photos of damaged goods for return disputes, and scanned receipts for expense reimbursement claims. Their engineering team published retrospective notes describing the key challenges: rate limit management during claim spikes, cost control as image volume scaled, and maintaining accuracy across varying image quality from smartphone cameras in poor lighting conditions. These are the canonical problems of production vision pipelines.

Cost Architecture

Image tokens are expensive relative to text tokens. A well-designed pipeline optimizes costs at three levels:

  • Pre-processing gate: Before sending any image to Claude, check whether it passes minimum quality thresholds (minimum resolution, not entirely black/white, file not corrupted). Reject unprocessable images early rather than consuming API tokens on them.
  • Resolution optimization: Resize images to task-appropriate dimensions before encoding. For receipt OCR, 800px wide is usually sufficient. For architectural drawings, 1600px may be needed. Profile your task accuracy at different resolutions to find the minimum effective size.
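  • Prompt caching: If you're sending the same system prompt with many different images, use Anthropic's prompt caching feature. Cache the system prompt and any static context so that repeated calls read that prefix at a reduced rate; the variable image content is still billed at the normal input rate. For the cached portion this can reduce input costs by 80–90%, so the benefit is largest in high-volume workflows where the static prefix is long relative to the image tokens.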
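The pre-processing gate can be partly implemented without decoding the image at all. The sketch below checks file size and identifies the format from the file's leading bytes; the function names are illustrative, and resolution or blank-image checks would additionally need an imaging library.

```python
from typing import Optional

MAX_IMAGE_BYTES = 5 * 1024 * 1024  # 5 MB limit from Lesson 1


def detect_media_type(data: bytes) -> Optional[str]:
    """Identify a supported image format from its signature bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def gate(data: bytes) -> tuple[bool, str]:
    """Return (accepted, media type or rejection reason)."""
    if not data:
        return False, "empty file"
    if len(data) > MAX_IMAGE_BYTES:
        return False, "larger than 5 MB"
    media_type = detect_media_type(data)
    if media_type is None:
        return False, "unsupported or corrupted format"
    return True, media_type
```

Checking signature bytes instead of the file extension also catches mislabeled uploads, such as a TIFF saved with a .png name.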
Prompt Caching with Vision

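Anthropic's prompt caching (available on Claude 3.5 and later) caches the prefix of a request and reuses it across calls. A prefix must meet a model-dependent minimum token length before it is cached, so a very short system prompt is simply processed as normal input. For vision pipelines with consistent system prompts, mark the system prompt with a cache control breakpoint: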

# Prompt caching for high-volume vision pipeline
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an invoice processing AI. Extract fields as JSON...",
            "cache_control": {"type": "ephemeral"},  # Cache this prefix
        }
    ],
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {  # Not cached: varies per request
                "type": "base64",
                "media_type": "image/jpeg",
                "data": invoice_b64,
            }},
            {"type": "text", "text": "Extract all fields."},
        ],
    }],
)
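How much caching saves depends on how large the cached prefix is relative to the per-request image tokens. The sketch below compares the input cost of a cache-hit request with the same request uncached. The default read multiplier of 0.1 (cached tokens billed at one tenth of the normal input rate) is an assumption to verify against current pricing, and `relative_input_cost` is a hypothetical helper.

```python
def relative_input_cost(
    prefix_tokens: int,
    image_tokens: int,
    cache_read_multiplier: float = 0.1,  # assumed rate for cached tokens
) -> float:
    """Input cost of a cache-hit request as a fraction of the uncached cost."""
    uncached = prefix_tokens + image_tokens
    cached = prefix_tokens * cache_read_multiplier + image_tokens
    return cached / uncached


# Short prefix, one 1024x1024 image (6,485 tokens by this module's estimate)
print(round(relative_input_cost(2_000, 6_485), 2))   # 0.79

# Long static prefix (instructions plus examples), one small image
print(round(relative_input_cost(20_000, 1_685), 2))  # 0.17
```

With a short prefix the image dominates and the request still costs about 79% of the uncached price; the large savings appear only when the static prefix is the bulk of the input.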
Rate Limit Management

Vision requests are heavier than text requests and therefore consume rate limits faster. Anthropic rate limits are expressed in both requests-per-minute (RPM) and tokens-per-minute (TPM). Image-heavy workloads hit TPM limits before RPM limits. Strategies:

  • Exponential backoff: On 429 (rate limit) responses, wait 1s, then 2s, then 4s, etc. before retrying. The Anthropic Python SDK has this built in; set max_retries when instantiating the client.
  • Request queuing: For batch processing (e.g., end-of-day invoice processing), use an async queue with concurrency control. Process 3–5 images in parallel rather than serializing or flooding the API.
  • Tier planning: Monitor token-per-minute usage. If you regularly approach limits, apply for a higher usage tier through Anthropic's developer console.
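For code paths that do not go through the SDK's built-in retries, the backoff schedule above can be sketched generically. `retry_with_backoff` is a hypothetical helper: the caller supplies the request function and a predicate that recognizes rate-limit errors (with the Anthropic Python SDK that would typically be a check for its rate-limit exception class). The 10% jitter is an illustrative choice.

```python
import random
import time


def retry_with_backoff(call, is_rate_limited, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying rate-limit failures after 1s, 2s, 4s, ... plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or the final failure
            if not is_rate_limited(exc) or attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)
            # Jitter spreads out retries from workers that failed together
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter makes the schedule testable without real waiting; production code would leave the default.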
Error Handling Patterns

Vision pipelines encounter several failure modes not present in text-only workflows. Each requires a specific handling strategy:

Image Too Large: Compress or resize before retry. Log original dimensions for audit.
Unsupported Format: Convert to PNG or JPEG using Pillow or ImageMagick before sending.
Extraction Mismatch: If JSON parse fails, retry with explicit schema re-statement. Route to human after 2 failures.
Rate Limit (429): Exponential backoff with jitter. Queue excess requests rather than dropping them.
Low Confidence Fields: Route [UNCERTAIN] fields to human review queue. Track uncertainty rate by document type.
URL Fetch Failure: Fall back to base64 encoding if URL delivery returns a 4xx error.
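The extraction-mismatch handling (retry with the schema restated, then route to a human after two failures) can be sketched as a small control loop. `extract_json` and the `ask_model` callback are hypothetical; `ask_model` stands in for whatever function sends the request and returns the response text.

```python
import json


def extract_json(ask_model, max_attempts: int = 2) -> dict:
    """Try to get parseable JSON; give up to human review after max_attempts."""
    for attempt in range(max_attempts):
        # From the second attempt on, ask the caller to restate the schema
        text = ask_model(restate_schema=attempt > 0)
        try:
            return {"status": "ok", "data": json.loads(text)}
        except json.JSONDecodeError:
            continue
    return {"status": "human_review", "data": None}
```

Returning a status instead of raising keeps the routing decision explicit for the code that enqueues documents for review.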
Monitoring and Quality Control

Production vision pipelines require active monitoring. Log per-request: model used, image dimensions, token counts (input/output), latency, extracted fields, and whether any fields were marked [UNCERTAIN]. Aggregate these logs to track:

  • Uncertainty rate by document type: if invoice uncertainty rises above 5%, the document quality may have changed (e.g., a new vendor using a different invoice format).
  • Cost per document: alert if average token cost per document exceeds your budget model.
  • Extraction accuracy: spot-check 1–2% of documents against ground truth to detect model drift or prompt degradation over time.
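The first of these metrics can be computed directly from per-request logs. This sketch assumes each log entry is a dictionary with `doc_type` and `uncertain_fields` keys; that log shape and both function names are illustrative.

```python
from collections import defaultdict


def uncertainty_rate_by_type(logs: list[dict]) -> dict[str, float]:
    """Fraction of documents per type with at least one [UNCERTAIN] field."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for entry in logs:
        totals[entry["doc_type"]] += 1
        if entry["uncertain_fields"]:
            flagged[entry["doc_type"]] += 1
    return {doc_type: flagged[doc_type] / totals[doc_type] for doc_type in totals}


def types_over_threshold(rates: dict[str, float], threshold: float = 0.05) -> list[str]:
    """Document types whose uncertainty rate exceeds the alert threshold."""
    return sorted(t for t, rate in rates.items() if rate > threshold)
```

Running this over a rolling window (for example, the last day's logs) makes a sudden rise for one document type visible soon after it starts.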

Architecture Recommendation

For volumes above ~500 documents/day, consider an async pipeline: an ingestion queue (SQS, Pub/Sub) → image pre-processing workers → Claude API workers with rate limiting → output validation → human review queue. This decouples ingestion from processing speed and provides natural retry infrastructure.

Lesson 4 Quiz

Building Vision Pipelines in Production — check your understanding
What Anthropic API feature can reduce costs by 80–90% in high-volume vision pipelines with consistent system prompts?
Correct. Prompt caching caches the prefix (including the system prompt) and reuses it across calls. For pipelines sending many images with the same system prompt, this sharply reduces the cost of the repeated prefix; the image tokens themselves are still billed on every request.
Prompt caching is the key feature. By marking the system prompt with cache_control, you pay full input token cost only on the first call; subsequent calls reuse the cached prefix at reduced cost.
Why do vision workloads typically hit token-per-minute (TPM) rate limits before requests-per-minute (RPM) limits?
Correct. A single 1024×1024 image uses ~6,500 tokens. At that rate, a pipeline sending just a few images per minute can exhaust its TPM allowance while barely touching its RPM limit.
The reason is token volume: each image consumes thousands of tokens via the tiling formula. TPM budgets exhaust quickly even at moderate request rates, making TPM the binding constraint.
What should a production pipeline do when Claude returns JSON with [UNCERTAIN] fields?
Correct. [UNCERTAIN] fields should trigger human review. Tracking the uncertainty rate by document type also helps detect when document quality changes or a new format appears.
Uncertain fields should go to human review — that's the entire point of the [UNCERTAIN] pattern. Automatic retry or silent replacement risks propagating bad data into downstream systems.
What is the recommended approach when an image exceeds the 5 MB size limit?
Correct. The standard handling is to compress or resize to meet the limit, then retry. Logging original dimensions is important for audit trails in regulated industries.
The correct approach is to compress or resize the image, then retry. GIF has the same limits, URL delivery doesn't bypass size limits, and manual tiling would break spatial context.

Lab 4 — Production Vision Pipeline Design

Architect a scalable, cost-controlled, fault-tolerant vision pipeline

Your Mission

You're the lead engineer at a fintech company processing 2,000 expense receipts per day through Claude Vision. Design the full pipeline architecture including pre-processing, caching, rate limiting, error handling, and monitoring. Discuss cost projections and scaling trade-offs.

Try asking: "Estimate the daily API cost for 2,000 receipts at 800×1200px" or "How should I implement exponential backoff for vision requests?" or "Design the error handling flow for my extraction pipeline" or "When should I use async queuing vs synchronous processing?"

Module 4 — Vision and Document Input

15 questions · Score 80% or higher to pass
1. Which four image formats does the Anthropic API accept?
Correct. JPEG, PNG, GIF (first frame only), and WebP are the four accepted formats.
The four accepted formats are JPEG, PNG, GIF (first frame only), and WebP. SVG, TIFF, HEIC, and BMP are not supported.
2. An image is 1536×1536 pixels. How many 512×512 tiles does it divide into, and approximately how many tokens does it consume?
Correct. ceil(1536/512) = 3 per side, so 3×3 = 9 tiles. Tokens = 85 + (1600 × 9) = 14,485.
1536/512 = 3 per side, so 3×3 = 9 tiles. Tokens = 85 + (1600 × 9) = 14,485. Remember the formula: 85 + (1600 × ceil(w/512) × ceil(h/512)).
3. What happens when you send a GIF file to the Anthropic API?
Correct. GIF animation is not processed — only the first frame is used. Plan accordingly when working with animated content.
For GIFs, only the first frame is processed. Animation frames are ignored. The API accepts the format but treats it as a static image.
4. Which delivery method is preferable when you need deterministic, latency-controlled requests without external network dependencies?
Correct. Base64 embeds the image directly in the request body, eliminating network fetch latency and external dependency failures.
Base64 is preferable for deterministic requests. URL delivery requires Claude to fetch the image at runtime, introducing latency and potential failure points.
5. How must PDF documents be prepared before sending to the Anthropic API?
Correct. The API does not accept PDFs. Each page must be rendered as a JPEG or PNG and sent as an image content block.
PDFs must be converted to images — one per page — and sent as image content blocks. There is no PDF file type or document storage API.
6. What DPI range is recommended for converting standard PDF documents to images for Claude processing?
Correct. 150–200 DPI balances text legibility with token cost. Higher DPI increases cost without meaningful accuracy improvement for standard documents.
150–200 DPI is the recommended range. Higher DPI produces larger images with more tokens and no accuracy benefit for standard text documents.
7. An 80-page contract needs to be processed. Given the 20-image-per-request limit, what is the minimum number of API calls required?
Correct. 80 pages ÷ 20 images/call = 4 calls exactly. With overlapping windows, you would need slightly more.
80 ÷ 20 = 4 calls. With overlapping windows for context continuity, you'd need slightly more, but the minimum is 4.
8. What is the purpose of the overlapping window strategy when chunking long documents?
Correct. Including the last page(s) of one chunk as the first page(s) of the next ensures Claude sees content that spans chunk boundaries.
Overlapping windows ensure context continuity — information spanning a page boundary is visible in both chunks. It slightly increases request count but prevents missed cross-boundary content.
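The chunking in questions 7 and 8 can be sketched as follows. This is one possible implementation, assuming zero-based page indices and this module's 20-image limit; `chunk_pages` and its parameters are illustrative names.

```python
def chunk_pages(num_pages: int, max_per_call: int = 20, overlap: int = 1) -> list:
    """Split page indices into windows of at most max_per_call pages.

    Each window after the first repeats the last `overlap` pages of the
    previous window so content spanning a boundary appears in both.
    """
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    if not 0 <= overlap < max_per_call:
        raise ValueError("overlap must be in [0, max_per_call)")
    step = max_per_call - overlap
    chunks = []
    start = 0
    while start < num_pages:
        end = min(start + max_per_call, num_pages)
        chunks.append(range(start, end))
        if end == num_pages:
            break  # the final window reached the last page
        start += step
    return chunks


print(len(chunk_pages(80, overlap=0)))  # 4 calls, the minimum
print(len(chunk_pages(80, overlap=1)))  # 5 calls with a one-page overlap
```

With a one-page overlap each window advances by 19 pages, so the 80-page contract needs a fifth call to cover the last four pages.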
9. For a complex table with merged cells, what is the recommended extraction approach?
Correct. A structural description pass gives Claude the layout awareness needed to extract data accurately in the second pass, reducing errors from complex layouts.
Two-pass extraction — structure first, data second — is the recommended approach for complex tables. The structural awareness from the first pass reduces errors in extraction.
10. Where should images be placed relative to text instructions for most single-image analysis tasks?
Correct. Placing images before text mirrors natural visual processing — view first, then read the question — and is the Anthropic-recommended pattern for most tasks.
Images should come before text instructions. This mirrors how humans process visual information and is the recommended pattern in Anthropic's documentation.
11. What prompt instruction most effectively reduces hallucinated values in high-stakes visual extraction?
Correct. Explicit uncertainty flagging is the most direct way to prevent hallucinated values reaching downstream systems.
Instructing Claude to mark uncertain fields as [UNCERTAIN] is the most effective mitigation. The model flags low-confidence readings instead of guessing when the prompt explicitly asks it to.
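On the receiving side, flagged fields need to be kept away from downstream systems. The sketch below assumes the extraction result has already been parsed into a dictionary of field names to string values; that shape and the function name are assumptions for illustration.

```python
UNCERTAIN_MARKER = "[UNCERTAIN]"


def split_by_confidence(fields: dict) -> tuple:
    """Separate confident values from fields flagged for human review."""
    confident = {}
    needs_review = []
    for name, value in fields.items():
        if UNCERTAIN_MARKER in value:
            needs_review.append(name)
        else:
            confident[name] = value
    return confident, needs_review


confident, needs_review = split_by_confidence(
    {"total": "42.10", "date": "[UNCERTAIN]"}
)
print(confident)     # {'total': '42.10'}
print(needs_review)  # ['date']
```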
12. What prompt caching feature reduces costs in high-volume vision pipelines?
Correct. Caching the system prompt (the fixed prefix) means you pay full input token cost only on the first call. Subsequent calls with different images reuse the cached prefix at reduced cost.
You cache the system prompt — the fixed prefix — not individual images. This eliminates repeated system prompt token costs across the many calls in a high-volume pipeline.
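A request body that caches the fixed prefix might look like the sketch below. It marks the system prompt block with `cache_control` and places the image before the question; the helper name and the `max_tokens` value are illustrative, and the model identifier is left to the caller.

```python
def build_cached_request(system_prompt: str, image_block: dict,
                         question: str, model: str) -> dict:
    """Build a Messages API payload whose system prompt is marked cacheable."""
    return {
        "model": model,
        "max_tokens": 1024,  # illustrative value
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the end of the prefix to cache across calls.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                # Image first, then the instruction text.
                "content": [image_block, {"type": "text", "text": question}],
            }
        ],
    }
```

Only the image and question change between calls, so every request after the first shares the same cacheable prefix.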
13. Why do vision workloads exhaust token-per-minute limits faster than requests-per-minute limits?
Correct. A single 1024×1024 image uses about 6,500 tokens (85 + 1600 × 4 = 6,485 under the tiling formula). At that rate, even a modest request volume quickly exhausts TPM budgets.
Image tokens are large — thousands per image via the tiling formula. TPM budgets exhaust quickly even at moderate request rates, making TPM the binding constraint for vision workloads.
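To see which limit binds first, compare the request rate each limit allows. The numbers in the example are hypothetical placeholders, not real account limits; substitute the values for your own tier.

```python
def sustainable_requests_per_minute(tpm_limit: int, rpm_limit: int,
                                    tokens_per_request: int) -> int:
    """Return the request rate allowed by whichever limit is tighter."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    allowed_by_tokens = tpm_limit // tokens_per_request
    return min(rpm_limit, allowed_by_tokens)


# Hypothetical limits: 400,000 TPM and 1,000 RPM.
# One 1024x1024 image (6,485 tokens) plus about 500 tokens of text.
print(sustainable_requests_per_minute(400_000, 1_000, 6_985))  # 57
```

In this example the token budget caps throughput at 57 requests per minute, far below the 1,000 the request limit would allow.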
14. What should a production pipeline do when it receives an image file in an unsupported format (e.g., TIFF)?
Correct. Unsupported formats must be converted to JPEG, PNG, GIF, or WebP before sending. Pillow (Python) and ImageMagick are standard tools for this conversion.
Unsupported formats must be converted before sending. The API will reject TIFF, BMP, SVG, etc. Use Pillow or ImageMagick to convert to a supported format first.
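Before converting, a pipeline first has to decide whether a file needs conversion at all. The sketch below checks the leading magic bytes with the standard library only; the conversion step itself (Pillow or ImageMagick) is not shown, and the function name is illustrative.

```python
from typing import Optional


def detect_supported_format(data: bytes) -> Optional[str]:
    """Return the media type if the bytes look like a supported format, else None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


print(detect_supported_format(b"II*\x00"))  # None: TIFF header, needs conversion
```

Checking content rather than the file extension catches misnamed uploads, such as a TIFF saved with a .jpg suffix.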
15. At what volume does Anthropic recommend moving to an async queue architecture for vision pipelines?
Correct. Above ~500 documents/day, an async queue (ingestion → pre-processing workers → API workers → validation) provides rate limit management, retry infrastructure, and decoupled scaling.
Above ~500 documents/day, async queue architecture is recommended. It decouples ingestion speed from processing capacity and provides natural retry and rate-limit management infrastructure.