L1
·
Quiz
·
Lab
L2
·
Quiz
·
Lab
L3
·
Quiz
·
Lab
L4
·
Quiz
·
Lab
Module Test
GPT vs. Claude vs. Gemini · Module 6 · Lesson 1

Token Pricing and the Real Economics of API Access

Every word you send costs money. Understanding how models charge per token changes every deployment decision you make.

When OpenAI released GPT-4 on March 14, 2023, developers immediately confronted a new reality: the input price was $0.03 per 1,000 tokens — roughly 60× more expensive than GPT-3.5-turbo's $0.0005. Startups that had built MVPs on GPT-3.5 suddenly had to re-architect entire products. Stripe's developer community forums filled with threads calculating that a single customer-support session averaging 4,000 tokens could cost $0.12 per conversation at GPT-4 rates — unsustainable at scale.

The lesson was sharp and immediate: model capability and model cost are two entirely separate axes, and confusing them was an expensive mistake.

How Token Pricing Works

All three major providers — OpenAI, Anthropic, and Google — bill API usage by the token. A token is approximately four characters of English text, or roughly ¾ of a word. A 1,000-word document is about 1,333 tokens. Pricing is expressed per million tokens (MTok) as of 2024, having previously been per 1,000 tokens.

Input tokens (what you send to the model) and output tokens (what the model generates back) are almost always priced differently. Output tokens typically cost 3–5× more than input tokens because generation is computationally more expensive than encoding. This asymmetry matters enormously for applications that produce long outputs.

As of mid-2024, representative pricing looks like this:

ModelInput (per MTok)Output (per MTok)Context Window
GPT-4o$5.00$15.00128K tokens
GPT-4o mini$0.15$0.60128K tokens
Claude 3.5 Sonnet$3.00$15.00200K tokens
Claude 3 Haiku$0.25$1.25200K tokens
Gemini 1.5 Pro$3.50 (≤128K)$10.50 (≤128K)1M tokens
Gemini 1.5 Flash$0.075$0.301M tokens
The Tiered Model Strategy

Every major provider now offers a flagship model and a smaller, cheaper companion. OpenAI has GPT-4o and GPT-4o mini. Anthropic has Claude 3.5 Sonnet and Claude 3 Haiku. Google has Gemini 1.5 Pro and Gemini 1.5 Flash. This tiering is deliberate: providers want to capture both high-value enterprise use cases and high-volume consumer applications.

The practical implication for developers is a routing architecture: send simple classification, extraction, or formatting tasks to the cheap model, and route complex reasoning, nuanced generation, or high-stakes decisions to the flagship. Companies like Notion and Intercom have publicly described using exactly this pattern to keep costs 60–80% below what all-flagship routing would cost.

Google's Gemini 1.5 Flash was specifically engineered for this role when it launched in May 2024 — Google described it as a "distilled" version of 1.5 Pro, optimized for latency and cost in agentic pipelines where millions of calls are made per day.

Context Window Pricing Traps

Context window size and cost interact in non-obvious ways. Gemini 1.5 Pro's pricing doubles when context exceeds 128K tokens: inputs over that threshold cost $7.00/MTok instead of $3.50. This means a developer who naively stuffs 300K tokens of documentation into every call pays twice the base rate for the majority of their input.

Anthropic introduced prompt caching for Claude in August 2024 — a mechanism that stores frequently-reused prompt prefixes on Anthropic's servers. Cached tokens cost only $0.30/MTok to read (versus $3.00 normal), but writing the cache costs $3.75/MTok. For applications that send the same large system prompt thousands of times per day, the breakeven point is reached after approximately 5 re-uses of the cached content per write.

OpenAI introduced a similar Cached Input Tokens feature for GPT-4o in October 2024, automatically caching prompt prefixes longer than 1,024 tokens at a 50% discount. Unlike Anthropic's explicit API call, OpenAI's caching is automatic and transparent to the developer.

Real-World Cost Calculation

A customer-support bot handling 10,000 conversations per day, each averaging 800 input tokens and 400 output tokens, costs approximately: GPT-4o — $70/day ($5×8M + $15×4M tokens). Claude 3 Haiku — $7/day ($0.25×8M + $1.25×4M tokens). Same task, 10× cost difference. The routing question is: which conversations actually need GPT-4o?

Batch APIs and Asynchronous Discounts

OpenAI introduced the Batch API in April 2024, offering 50% discounts on all models for requests that don't need real-time responses. Jobs complete within 24 hours. Anthropic followed with a similar batch processing tier in mid-2024. For data enrichment, document classification, or evaluation pipelines — tasks with no user waiting — batch APIs are the rational default, cutting costs by half with no quality trade-off.

Google offers similar asynchronous processing through its Vertex AI platform for Gemini models, with enterprise customers able to negotiate committed use discounts (CUDs) that further reduce per-token costs for predictable workloads.

Core Principle

Token cost is a function of model tier, context length, synchronicity, and caching strategy — not just the base rate. A developer who understands all four levers can often achieve 80–90% cost reduction on the same underlying task without degrading output quality on the tasks where quality actually matters.

Lesson 1 Quiz

3 questions — free, untracked, retake anytime.
Why are output tokens typically priced higher than input tokens by AI providers?
✓ Correct. Autoregressive decoding — generating one token at a time while attending to all previous tokens — demands significantly more GPU computation per token than the parallel encoding of the input sequence.
✗ The key reason is computational: generating tokens one-by-one is far more GPU-intensive than encoding input, so providers price output tokens at a 3–5× premium over input tokens.
Anthropic's prompt caching feature (introduced August 2024) charges a higher rate to write the cache than to read it. When does caching become cost-effective?
✓ Correct. At $3.75/MTok to write versus $0.30/MTok to read (vs. $3.00 normal), you save $2.70 per MTok on each subsequent read but paid $3.75 upfront. The breakeven is roughly after 5 reads of the same cached block per write cycle.
✗ The write cost is $3.75/MTok vs. a normal read of $3.00/MTok. Each cached read saves $2.70/MTok, so you need roughly 5 re-uses before the lower read cost outweighs the write premium.
OpenAI's Batch API, introduced April 2024, offers what primary benefit compared to synchronous API calls?
✓ Correct. The Batch API trades real-time response for a 50% price reduction, completing submitted jobs within 24 hours — ideal for classification, enrichment, or evaluation workloads where no user is waiting in real time.
✗ The Batch API's main benefit is a 50% cost discount. The trade-off is that requests are processed asynchronously and complete within 24 hours rather than immediately, making it ideal for offline workloads.

Lab 1 — Token Cost Calculator

Practice estimating real API costs across models and scenarios.

Hands-On: Token Pricing Scenarios

In this lab you'll work through real-world cost estimation problems across GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash, and their cheaper companions. Ask the AI to help you calculate costs, compare models, or evaluate whether caching or batch processing makes sense for a given scenario.

The AI knows current mid-2024 pricing and can help you build intuition about when to route tasks to cheaper models versus flagships.

Try asking: "I have a pipeline processing 50,000 documents per day. Each document is 2,000 tokens of input and I need 500 tokens of output. Walk me through the daily cost on GPT-4o, GPT-4o mini, Claude 3 Haiku, and Gemini 1.5 Flash. Which model makes most sense and when would I route to a flagship?"
AI Lab Assistant Token Pricing
GPT vs. Claude vs. Gemini · Module 6 · Lesson 2

Latency, Throughput, and the Speed Gap Between Models

Milliseconds are a product feature. Speed differences between frontier models are large, consistent, and consequential for real applications.

When the developer tool Cursor evaluated which model to use for its AI code-completion feature, latency was the decisive variable. GPT-4's response times averaged over three seconds for typical completions — too slow to feel native inside an IDE where developers expect suggestions in under a second. Cursor's team switched to Claude 3.5 Sonnet and, for inline suggestions, to lighter models entirely. Their internal benchmarks, discussed publicly on developer forums in June 2024, showed Sonnet delivering comparable code quality at latencies 40–60% lower than GPT-4 for their workload profile.

The episode illustrated a recurring pattern: for interactive, synchronous applications, latency is not a secondary concern — it is often the primary one.

Time to First Token vs. Tokens Per Second

Latency for LLM APIs is measured by two distinct metrics that serve different purposes. Time to First Token (TTFT) is the delay between submitting a request and receiving the first character of the response. It determines how "responsive" an application feels. Tokens Per Second (TPS) is the generation speed once streaming begins. It determines how quickly long responses complete.

For a chatbot where users watch text stream in, TTFT matters enormously — a 2-second delay feels like a frozen app. For a batch document summarizer where results are delivered asynchronously, TPS determines throughput and therefore daily capacity, while TTFT is irrelevant.

Across independent benchmarking services including ArtificialAnalysis.ai (which began publishing systematic latency benchmarks in 2024), the broad patterns as of mid-2024 are: Gemini 1.5 Flash and GPT-4o mini are consistently fastest, often achieving TTFT under 500ms and TPS above 100. Flagship models like GPT-4o and Claude 3.5 Sonnet average TTFT of 1–3 seconds under normal load, with TPS of 40–80.

Why Gemini 1.5 Flash Leads on Raw Speed

Google's infrastructure advantage is real and measurable. Gemini models run on Google's custom TPU v5 hardware, which Google designs, manufactures, and deploys at scale inside its own data centers. The tight vertical integration between model architecture and hardware allows optimizations — particularly for the attention mechanism at long context lengths — that third-party GPU clusters cannot easily replicate.

Gemini 1.5 Flash was specifically architected for low latency: Google's technical report describes a "distillation" process that preserves the multi-modal reasoning capabilities of 1.5 Pro while reducing the parameter count and activation paths that contribute to generation latency. In ArtificialAnalysis.ai's July 2024 benchmarks, Flash achieved median output speeds of approximately 150 tokens/second, compared to GPT-4o's ~65 tokens/second and Claude 3.5 Sonnet's ~70 tokens/second under equivalent load conditions.

However, Google's API infrastructure also shows more latency variance than Anthropic's. Tail latency — the 95th or 99th percentile response times — tends to be less predictable on Gemini, particularly during peak usage windows. For applications with strict SLA requirements, this variance matters as much as median speed.

Rate Limits and Throughput Ceilings

Raw speed per call is only one dimension. Rate limits — the maximum requests per minute (RPM) and tokens per minute (TPM) allowed at a given tier — determine how fast you can process large volumes of work in parallel. These limits differ significantly across providers and pricing tiers.

OpenAI's rate limits scale with monthly spend: a developer spending $1,000/month on GPT-4o gets dramatically higher RPM than a free-tier user. Anthropic uses a similar model, scaling limits by usage tier. Google's Gemini API offers notably high free-tier limits — 60 RPM for Gemini 1.5 Flash versus OpenAI's 3 RPM for GPT-4o free tier — which made it attractive for high-volume prototyping without an enterprise contract.

For production workloads requiring guaranteed throughput, Anthropic's enterprise agreements offer provisioned throughput — reserved capacity that isn't subject to shared infrastructure contention, with contractually guaranteed latency SLAs. OpenAI offers a similar "Dedicated" capacity tier through Azure OpenAI Service. Google offers committed throughput on Vertex AI.

Streaming vs. Waiting

Streaming API responses (receiving tokens as they're generated rather than waiting for the full reply) dramatically improves perceived responsiveness even when total generation time is unchanged. All three providers support streaming. For any user-facing application, streaming should be the default — it converts a 4-second wait into a visually immediate response with progressive rendering. The full latency penalty is still paid, but the user experience is transformed.

Latency-Cost Trade-offs in Production Architectures

Real production systems rarely use a single model for all calls. The architectural pattern that emerges from both cost and latency considerations is a cascade: route to a fast, cheap model first; escalate to a slower, more capable model only when the fast model's confidence score or output quality falls below a threshold.

Companies including Salesforce (which discussed this architecture at Dreamforce 2024) and LinkedIn have implemented LLM routing layers that dynamically select models based on task complexity, urgency, and current provider latency telemetry. LinkedIn's engineering blog noted in 2024 that their routing layer reduced average per-query cost by 65% with no detectable change in A/B test user satisfaction scores.

Core Principle

For interactive applications, TTFT under 1 second is a design constraint, not a nice-to-have. For batch applications, TPS and rate limits determine daily throughput capacity. The same task optimized for latency and optimized for cost often points to different model choices — and frequently to different models on the same provider's tiered menu.

Lesson 2 Quiz

3 questions — free, untracked, retake anytime.
A developer is building an AI code-completion feature inside an IDE where suggestions must appear in under one second. Which latency metric is most critical to optimize for this use case?
✓ Correct. For interactive applications where users perceive delay as unresponsiveness, TTFT is the critical metric — it's the gap between the user's action and any visible response. This is precisely why Cursor's team found TTFT to be the decisive variable when selecting models for inline completions.
✗ For interactive completions where users see results appear, TTFT is the key metric — it's the delay before any text streams in. TPS matters for long generations, but a fast first token makes the experience feel immediate even if total generation takes several seconds.
According to ArtificialAnalysis.ai benchmarks circa mid-2024, which model achieved the highest median output speed in tokens per second?
✓ Correct. Gemini 1.5 Flash achieved approximately 150 tokens/second in mid-2024 benchmarks, well ahead of GPT-4o (~65 TPS) and Claude 3.5 Sonnet (~70 TPS). Google's custom TPU v5 hardware and Flash's distilled architecture are the primary reasons.
✗ Gemini 1.5 Flash led on raw speed (~150 TPS) due to Google's custom TPU v5 hardware and the model's distilled architecture. GPT-4o and Claude 3.5 Sonnet both ran at roughly 65–70 TPS under comparable load conditions.
LinkedIn's engineering team reported in 2024 that implementing an LLM routing layer — dynamically selecting models by task complexity — achieved what outcome?
✓ Correct. LinkedIn's routing layer achieved a 65% cost reduction by sending simpler queries to cheaper models, while A/B testing showed no measurable impact on user satisfaction — validating that not all queries need flagship model capability.
✗ LinkedIn's routing layer achieved a 65% reduction in per-query cost. Critically, A/B testing showed no change in user satisfaction scores, demonstrating that intelligent routing can dramatically cut costs without degrading perceived quality.

Lab 2 — Latency & Routing Architecture

Design model routing strategies for real application speed and throughput requirements.

Hands-On: Speed-Aware Model Selection

In this lab you'll explore how to choose and architect model routing decisions based on latency requirements. Describe your application's latency needs — TTFT requirements, batch vs. interactive, throughput volumes — and the AI will help you design a routing strategy with specific model recommendations.

You can also ask about streaming implementation, rate limit planning, or how to measure and monitor latency in production.

Try asking: "I'm building a customer service chatbot that needs to respond in under 1.5 seconds (TTFT) and handles 5,000 concurrent users at peak. Some queries are simple FAQ lookups; others need complex reasoning. Design a routing architecture for me and tell me which models to use at each tier, with expected latency and cost trade-offs."
AI Lab Assistant Latency & Routing
GPT vs. Claude vs. Gemini · Module 6 · Lesson 3

Context Windows: What They Are and Why Size Is Not Everything

Gemini's million-token context is genuinely remarkable — and genuinely complicated to use well. Context length shapes capability, cost, and coherence in ways that aren't obvious from the spec sheet.

When Google announced Gemini 1.5 Pro in February 2024, the headline number was 1,000,000 tokens of context — roughly 700,000 words, or the equivalent of about 10 full-length novels. The demo Google provided was striking: the model was given the entire 402-page Apollo 11 mission transcript and asked to find a specific moment when an engineer made a particular remark. It found it correctly.

Developer reaction was initially euphoric — and then quickly more nuanced. Forum discussions on Hacker News and the r/MachineLearning subreddit in February–March 2024 began noting that performance degraded when relevant information was buried in the middle of a very long context. Researchers at UC Berkeley had already documented this as the "Lost in the Middle" problem in a 2023 paper. Large context windows were real and useful — but not magic.

Context Window Sizes Across Models (2024)

Context window sizes have expanded dramatically. As of mid-2024: Gemini 1.5 Pro and Flash both support 1 million tokens (with experimental 2M available on request). Claude 3.5 Sonnet and Haiku support 200,000 tokens. GPT-4o and GPT-4o mini support 128,000 tokens. For reference, GPT-3.5's original context window was 4,096 tokens.

The practical differences between 128K and 200K matter for specific use cases: a Claude 200K context can hold roughly a 150,000-word book, while GPT-4o's 128K holds about 95,000 words. For processing full novels, academic papers with all their citations, or large codebase files, these differences are meaningful. Gemini's 1M token ceiling is in a different category entirely — enabling use cases like analyzing an entire company's email archive or a complete software repository.

The "Lost in the Middle" Problem

The UC Berkeley paper "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., 2023) demonstrated empirically that all transformer-based models show degraded recall performance for information positioned in the middle of a long context window, even when the full context technically fits. Information at the beginning (primacy effect) and end (recency effect) of the context is recalled more reliably than information buried in the center.

This has direct implications for how to structure prompts when using large context windows. If you're feeding a model a 50-page document and a question, the question should be repeated at both the beginning and end of the context, not just prepended once. Critical information should be placed near the start or end of the document content, not buried on page 25.

Google's technical reports for Gemini 1.5 acknowledge this and note improvements over earlier models, but independent evaluations suggest the phenomenon persists to some degree even at frontier scale. Anthropic's Claude 3 technical report includes "needle in a haystack" benchmark results showing near-perfect recall across the full 200K window — a genuine architectural improvement — but these tests use single well-defined retrieval targets, not complex multi-hop reasoning over buried content.

Effective vs. Nominal Context

The distinction between nominal context (the maximum tokens supported) and effective context (the range over which the model reliably uses information) is critical for production applications. A model with a 1M token context window may handle a retrieval task perfectly at 100K tokens but show degraded reasoning coherence at 800K tokens — not because the window is full, but because maintaining coherent attention across that span is genuinely difficult.

For most production RAG (Retrieval-Augmented Generation) pipelines, practitioners find that chunking and retrieving the top-K relevant segments — rather than dumping an entire document into context — produces better results even when the document would technically fit. This is counterintuitive: why retrieve when you can fit everything? The answer is that focused, relevant context enables tighter attention than diffuse, exhaustive context.

The exception is tasks requiring global understanding of a document — analyzing themes across a whole novel, checking for internal consistency in a long contract, or summarizing an entire codebase. For those tasks, large context windows provide genuine capability that retrieval cannot replicate.

Context and Cost Interaction

Because you pay for every token in the context window on every API call, large contexts have compounding cost implications. Sending a 100K-token document as context on every call in a multi-turn conversation means paying for 100K input tokens per turn — even if only 2K tokens of the document are relevant to that specific question. Selective retrieval often dominates both on cost and on quality simultaneously.

Multi-Modal Context: Gemini's Differentiated Capability

Context windows in Gemini 1.5 are not limited to text. The 1M token window can hold combinations of text, images, audio, and video — with tokens allocated proportionally to each modality. A minute of video encodes to approximately 1M tokens by itself, which means video analysis is feasible but rapidly consumes the full window.

This multi-modal context capability is genuinely differentiated. As of mid-2024, GPT-4o's context window handles images and text, and Claude 3.5's handles images and text. But Gemini 1.5's native audio and video context — without requiring separate transcription steps — enables workflows that simply aren't possible with the other models. Google's demo of analyzing a 44-minute Buster Keaton silent film and answering specific questions about plot and cinematography technique demonstrated a capability class that GPT-4o and Claude 3.5 cannot match natively.

Core Principle

Context window size is a ceiling, not a guarantee of performance. Use large windows when global document understanding is genuinely required — consistency checking, whole-document summarization, video analysis. Use retrieval for most question-answering tasks. Always account for the per-token cost of large contexts, which scales linearly with every call that uses them.

Lesson 3 Quiz

Test your understanding of Lesson 3
The UC Berkeley "Lost in the Middle" paper (2023) found that language models recall information most reliably from which parts of a long context?
✓ Correct. The paper demonstrated that transformer-based models exhibit primacy (beginning) and recency (end) effects when recalling information from long contexts. Content buried in the middle of a large context window is retrieved less reliably — even when it technically fits within the window.
✗ The "Lost in the Middle" paper found that models attend most reliably to the beginning and end of long contexts (primacy and recency effects). Information buried in the middle — even when well within the context limit — is recalled less accurately. This has direct implications for how to structure prompts using large context windows.
For most production question-answering applications, why does retrieval-augmented generation (RAG) often outperform simply stuffing an entire document into a large context window — even when the document fits?
✓ Correct. Counterintuitively, retrieving only the top-K relevant chunks often produces better answers than including everything — because focused context allows the model to attend tightly to what matters. It also costs less, since you pay for every token in context on every call.
✗ RAG frequently outperforms whole-document context even when the document fits, because focused relevant chunks enable better attention than a diffuse full document. The "lost in the middle" effect means critical details buried deep in a long context may be missed. RAG is also more cost-efficient since you only pay for the retrieved chunks rather than the full document on every API call.
Which of the following use cases genuinely benefits from Gemini 1.5 Pro's 1M-token context window in a way that retrieval cannot replicate?
✓ Correct. Consistency checking across a long document requires the model to simultaneously attend to distant sections. Retrieval cannot replicate this because you don't know in advance which section 2 clause to retrieve alongside which section 28 clause — the model must see the whole document to detect the contradiction.
✗ The use case that uniquely requires large context is global document understanding — tasks like consistency checking, where you must compare content across widely separated sections. Retrieval works well when you know what to look for, but can't catch unknown contradictions between distant clauses. The other options (FAQ lookup, recent paragraph summary, named entity extraction) are all well-served by retrieval or small contexts.
🎯 Advanced · Lesson 3 Lab

Lab: Explore Lesson 3 Concepts

Apply what you learned in Lesson 3 through guided AI conversation

Your Task

Use the AI below to explore Lesson 3 concepts in depth. Challenge assumptions and work through scenarios.

Try asking about a specific concept from Lesson 3 and how it applies in practice.
🤖 AESOP Lab Assistant Lesson 3 Lab
GPT vs. Claude vs. Gemini · Cost, Speed, and Context · Lesson 4

Optimizing for Your Budget: Total Cost of Ownership for AI Applications

Knowing the per-token price is the starting point, not the ending point. The real cost of an AI application is shaped by routing decisions, caching architecture, context management, and batching strategy — and the difference between naïve and optimized deployments is often 80% or more.

A mid-size e-commerce company built a product-description generator on GPT-4o in early 2024. In development, costs looked manageable — a few dollars a day. But when they scaled to 50,000 product listings, each with a 3,000-token system prompt, 800-token product spec input, and 600-token output, the bill hit $4,200 per day. The system prompt alone — identical for every single request — was costing nearly $3,000 of that daily total.

After implementing prompt caching, switching simple reformatting tasks to GPT-4o mini, and moving non-urgent batch jobs to the Batch API, their daily cost dropped to $480 — an 89% reduction. The output quality for the tasks that remained on GPT-4o was unchanged. The lesson: total cost of ownership is an engineering problem, not just a pricing problem.

Calculating Real Monthly API Costs

The formula for API cost is straightforward but easy to misapply: Cost = (Input tokens × Input rate) + (Output tokens × Output rate). The traps are in what counts as input tokens. Your system prompt, the full conversation history in multi-turn chats, retrieved RAG chunks, and the user's message are all input tokens on every single request. In a 10-turn conversation with a 2,000-token system prompt, you pay for that system prompt 10 times.

For a realistic production estimate, you need to know: (1) daily active calls, (2) average total input tokens per call (including system prompt), (3) average output tokens per call, and (4) the model tier. A customer-support bot at 20,000 calls/day, 1,500 input tokens, 400 output tokens on GPT-4o costs: 20,000 × (1,500 × $0.000005 + 400 × $0.000015) = $270/day, $8,100/month. The same workload on Claude 3 Haiku: 20,000 × (1,500 × $0.00000025 + 400 × $0.00000125) = $17.50/day, $525/month. A 15× cost difference — for tasks where Haiku's quality may be entirely sufficient.

Model Routing: The Architecture That Changes Everything

Model routing is the practice of directing different requests to different models based on their complexity, urgency, and quality requirements. It is the single highest-leverage cost optimization available for most production AI applications. The core insight is that most application workloads contain a mix of tasks with very different capability requirements, and using the same flagship model for all of them is wasteful.

A practical routing taxonomy: Classification and routing tasks (e.g., "Is this question about billing or technical support?") need virtually no reasoning — Gemini 1.5 Flash or Claude 3 Haiku are more than sufficient at 1/20th the cost of GPT-4o. Structured extraction tasks (e.g., "Extract the key entities from this support ticket") need precision but not deep reasoning — mid-tier models suffice. Complex generation tasks (e.g., "Draft a detailed response to this technical escalation") warrant flagship model quality. Only the last category should touch GPT-4o or Claude 3.5 Sonnet.

The implementation pattern is a lightweight router model that first classifies incoming requests into tiers, then dispatches to the appropriate model. The router itself runs on a cheap, fast model (Flash, Haiku, or GPT-4o mini). LinkedIn's 2024 engineering blog reported their routing layer cut per-query costs by 65% with no measurable quality degradation in A/B testing — a result consistent across multiple large deployments.

Caching Strategies: Stop Paying for the Same Tokens Twice

The most underutilized cost lever in production AI applications is prompt caching. Most applications send the same system prompt — potentially thousands of tokens long — with every single request. Without caching, you pay full input price for those tokens every time. With caching, you pay a fraction.

Anthropic's explicit caching requires marking cache breakpoints in your prompt via the API. The write cost is $3.75/MTok (higher than normal), but the read cost is $0.30/MTok (10× cheaper than normal). For an application sending a 4,000-token system prompt 10,000 times per day: without caching, 40M input tokens × $3.00 = $120/day. With caching (writing once per session, reading 10,000 times): write cost negligible + 40M tokens × $0.30 = $12/day. A 90% reduction on system prompt costs alone.

OpenAI's caching works differently — it's automatic for prompt prefixes exceeding 1,024 tokens, applying a 50% discount with no explicit API call required. The 50% discount is less dramatic than Anthropic's approach but requires zero implementation effort. For applications already using long system prompts on GPT-4o, this automatic discount applies without any changes to code.

Context Window Size and Per-Request Cost

Context window size and per-request cost are directly linked. Every token in your context window is billed as an input token on every call. A 200K-token context costs 200,000 × input rate per request — $1.00 per call on GPT-4o at standard rates. If you're making 100,000 calls per day with a large context, this is $100,000/day in input costs alone. The discipline of keeping context windows lean — using retrieval to fetch only relevant chunks rather than dumping full documents — is not just about quality (the "lost in the middle" problem), it's about economics.

Batching: The 50% Discount That's Frequently Overlooked

OpenAI's Batch API, launched April 2024, offers a 50% discount on all models for jobs that complete within 24 hours. Anthropic's batch processing tier offers similar discounts. For any workload that doesn't require real-time responses — document classification, data enrichment, evaluation runs, content generation pipelines, nightly report generation — batch processing is the rational default. The quality is identical; only the response time changes.

The practical calculation: if your daily document processing pipeline costs $500/day on synchronous GPT-4o, moving it to the Batch API cuts that to $250/day with zero quality trade-off. Over a year, that's $91,250 saved. The implementation requires wrapping requests in a JSONL batch format and polling for completion, which is modest engineering effort relative to the savings.

Core Principle

Total cost of ownership for AI applications has four levers: model tier selection, caching strategy, context window discipline, and batch vs. synchronous processing. A team that consciously engineers all four can typically achieve 80–90% cost reduction compared to a naïve all-flagship synchronous deployment — while maintaining or improving quality on the tasks that actually require flagship capability. Start with routing architecture; it delivers the highest leverage fastest.

Lesson 4 Quiz

3 questions — free, untracked, retake anytime.
An application sends the same 5,000-token system prompt on every API call, making 8,000 calls per day. Which cost optimization strategy delivers the largest single reduction in input token costs?
✓ Correct. When the same large system prompt is repeated thousands of times daily, prompt caching reduces those input tokens to roughly 1/10th their normal cost (e.g., $0.30/MTok cached vs. $3.00/MTok normal on Claude). This is the highest-leverage intervention for this specific pattern.
✗ The dominant cost driver here is the repeated 5,000-token system prompt billed at full input rates 8,000 times per day. Prompt caching addresses this directly, cutting the cost of those tokens to roughly 1/10th the normal rate — far larger savings than any output-side optimization.
A company wants to build an AI customer-service system handling 50,000 queries per day. Analysis shows 70% are simple FAQ lookups, 20% are order-status questions requiring structured retrieval, and 10% are complex complaints needing nuanced responses. What is the correct model routing architecture?
✓ Correct. This is the canonical routing architecture: send low-complexity, high-volume tasks to fast cheap models and reserve expensive flagship capacity for the minority of tasks that genuinely require it. LinkedIn's implementation of this pattern cut per-query costs 65% with no change in measured user satisfaction.
✗ The correct approach is tiered routing: cheap fast models handle the 90% of simpler queries (FAQ, structured retrieval), while the flagship model handles only the 10% requiring complex reasoning. This delivers near-flagship average quality at a fraction of the all-flagship cost.
A data team runs a nightly pipeline classifying 200,000 support tickets, with each request averaging 600 input tokens and 150 output tokens. The pipeline has no real-time requirement — results are needed by 8 AM the next morning. Which approach minimizes cost with no quality trade-off?
✓ Correct. OpenAI's Batch API provides a 50% cost discount for asynchronous jobs completing within 24 hours, with no quality difference versus synchronous calls. For a pipeline with an overnight window and no real-time requirement, this is a straightforward 50% cost reduction with zero quality trade-off.
✗ The Batch API is purpose-built for this scenario: asynchronous workloads with flexible completion windows receive a 50% cost discount with identical model quality. Streaming doesn't affect billing; off-peak hours don't reduce costs on standard APIs. Reducing output limits would degrade classification quality.

Lab 4: Synthesis and Integration

Apply and extend the concepts from this lesson through guided conversation with an AI assistant.

Use this lab to explore how the concepts from Lesson 4 apply to your own questions and interests. The AI assistant is here to help you think through complex scenarios.

Lab 4 Assistant AI Assistant

Module Test

15 questions covering all lessons — free, untracked, retake anytime.

Score: 0/15
As of mid-2024, what is the approximate input token price for GPT-4o?
✓ Correct. GPT-4o input pricing is approximately $5.00 per million tokens, with output priced at $15.00 per million tokens — a 3× output premium over input.
✗ GPT-4o input costs approximately $5.00 per million tokens. $0.075/MTok is Gemini 1.5 Flash's input price; $0.25/MTok is Claude 3 Haiku. $30.00/MTok would be far above any current flagship model.
Why are output tokens priced higher than input tokens by all three major AI providers?
✓ Correct. Generating tokens one at a time (autoregressive decoding) requires far more GPU computation per token than parallel input encoding — hence the 3–5× output price premium across providers.
✗ The reason is computational: autoregressive decoding (generating one token at a time while attending to all prior tokens) is much more GPU-intensive than encoding the input sequence in parallel. This cost is passed through in pricing.
What does TTFT stand for, and why does it matter for chat applications?
✓ Correct. TTFT is the gap between the user's action and any visible response. For interactive chat applications, a TTFT above ~1–2 seconds feels like a frozen interface, making it the critical latency metric for synchronous user-facing applications.
✗ TTFT = Time to First Token. It measures the delay before any response text appears. For chat and interactive applications, this is the metric that determines whether the experience feels responsive — not total generation time or throughput.
According to ArtificialAnalysis.ai benchmarks from mid-2024, which model achieved the highest median output speed in tokens per second?
✓ Correct. Gemini 1.5 Flash led on raw generation speed (~150 TPS) in mid-2024 benchmarks, well ahead of GPT-4o (~65 TPS) and Claude 3.5 Sonnet (~70 TPS). Google's custom TPU v5 hardware and Flash's distillation-based architecture are the primary reasons.
✗ Gemini 1.5 Flash was the speed leader at ~150 tokens/second, roughly 2× faster than GPT-4o and Claude 3.5 Sonnet at ~65–70 TPS. Google's custom TPU v5 hardware and the model's distilled architecture give it a consistent edge on raw generation throughput.
What is the context window size of Claude 3 and Claude 3.5 models as of mid-2024?
✓ Correct. Claude 3 and 3.5 models (Haiku, Sonnet, Opus) all support 200,000-token context windows — larger than GPT-4o's 128K but smaller than Gemini 1.5 Pro/Flash's 1M token window.
✗ Claude 3 and 3.5 models support 200,000 tokens. GPT-4o and GPT-4o mini support 128K; Gemini 1.5 Pro and Flash support 1M tokens. 32K was an earlier GPT-4 context limit.
The "Lost in the Middle" problem, documented by UC Berkeley researchers in 2023, describes what phenomenon in large language models?
✓ Correct. The "Lost in the Middle" paper showed that transformer models exhibit primacy and recency effects — information at the start and end of long contexts is recalled more reliably than information buried in the middle, even when it technically fits within the context window.
✗ "Lost in the Middle" refers to the attention pattern where models recall beginning and end information more reliably than middle content. This is the primacy/recency effect applied to long contexts — relevant information buried in the middle of a long document is retrieved less reliably than information near the edges.
What context window size does Gemini 1.5 Pro support, making it uniquely suited for analyzing entire codebases or long-form documents?
✓ Correct. Gemini 1.5 Pro's 1 million token context window is what enables use cases like analyzing an entire software repository, processing a company's full email archive, or working through hours of audio and video content natively.
✗ Gemini 1.5 Pro supports 1,000,000 tokens — a context window far larger than GPT-4o (128K) or Claude (200K). This enables whole-codebase analysis, full-book processing, and long-form video understanding that the other models cannot match at that scale.
Anthropic's prompt caching feature charges a higher rate to write the cache than to read it. Approximately how many re-uses of the same cached content are needed before caching becomes cost-effective?
✓ Correct. Writing the cache costs $3.75/MTok versus normal input at $3.00/MTok — a $0.75 premium. Each cached read saves $2.70/MTok ($3.00 − $0.30). Breakeven occurs at roughly 5 reads per write. Applications making thousands of calls per day with the same system prompt reach breakeven within minutes.
✗ The math: cache write costs $3.75/MTok ($0.75 more than normal). Each cached read costs $0.30/MTok instead of $3.00 — saving $2.70/MTok. So you need ~0.75/2.70 ≈ 0.28 reads just to break even on the write premium, meaning after ~1 full re-use. The commonly cited figure is ~5 re-uses to be clearly in-the-money with margin for overhead.
Which of the following is the correct speed hierarchy from fastest to slowest among mid-2024 flagship and speed-optimized models?
✓ Correct. Speed-optimized models (Flash, Haiku, GPT-4o mini) consistently outperform flagship models on both TTFT and TPS. Among flagships, Claude 3 Opus is the slowest — it was the most capable but also most compute-intensive Claude 3 model. Gemini 1.5 Flash and Claude 3 Haiku are the fastest options in their respective provider families.
✗ The correct hierarchy puts speed-optimized models first: Flash/Haiku are fastest, then mid-size models, then flagships. Claude 3 Opus is among the slowest options — it was the highest-capability but most compute-intensive Claude 3 model. GPT-4o is not the fastest; Gemini 1.5 Flash holds that position across providers.
OpenAI's Batch API, launched April 2024, offers what primary benefit for asynchronous workloads?
✓ Correct. The Batch API provides a 50% price reduction on all models for requests submitted as asynchronous batch jobs completing within 24 hours. Quality is identical to synchronous API calls — only the response time changes, making it ideal for classification, enrichment, and generation pipelines with no real-time requirement.
✗ The Batch API's primary benefit is a 50% cost discount. It trades real-time response for this discount — jobs complete within 24 hours rather than immediately. Context windows and model quality are unchanged; it's purely a cost optimization for workloads that don't require instant responses.
In a multi-turn conversation where the same 2,000-token system prompt is included in every request, how is the system prompt billed without caching?
✓ Correct. Without caching, every API call is stateless — the full context, including the system prompt, must be sent and billed anew each turn. In a 10-turn conversation with a 2,000-token system prompt, you pay for those 2,000 tokens 10 times.
✗ Without caching, API calls are stateless. The full input — including the system prompt — is billed on every request. A 2,000-token system prompt in a 10-turn conversation means paying for those tokens 10 times over. This is why prompt caching is so impactful for high-volume applications with long system prompts.
Which Claude model was the most expensive in the Claude 3 family, positioned as the highest-capability option?
✓ Correct. Claude 3 Opus was Anthropic's top-tier model when the Claude 3 family launched in March 2024 — the most capable and most expensive option, positioned against GPT-4 at the frontier. Claude 3 Haiku was the fastest and cheapest; Sonnet sat in between.
✗ Claude 3 Opus was the premium, highest-cost model in the Claude 3 family. Haiku was the cheapest and fastest. Sonnet was mid-tier. Claude 3.5 Sonnet, released in June 2024, later surpassed Opus on many benchmarks at a lower price — one of the notable capability-cost shifts of 2024.
Gemini 1.5 Pro's pricing structure changes when context length exceeds 128K tokens. What happens to the input price above that threshold?
✓ Correct. Gemini 1.5 Pro's tiered pricing charges $3.50/MTok for inputs up to 128K tokens but doubles to $7.00/MTok for the portion of inputs exceeding 128K. This makes naïve use of very long contexts significantly more expensive than it appears from the base rate alone.
✗ Gemini 1.5 Pro uses tiered pricing: $3.50/MTok for inputs within 128K tokens, doubling to $7.00/MTok above that threshold. Developers who send 300K-token contexts pay the doubled rate for the majority of their input tokens — a significant cost trap if not accounted for in architecture planning.
What is the primary use case that makes Gemini 1.5 Pro's 1M-token context window genuinely valuable in ways retrieval-augmented generation (RAG) cannot replicate?
✓ Correct. RAG retrieves relevant chunks but cannot understand relationships across an entire document. Tasks like consistency checking, whole-document theme analysis, or full-codebase reasoning require seeing everything simultaneously — exactly what a 1M-token window enables and retrieval cannot replicate.
✗ Large context windows are most valuable when global document understanding is required — tasks where the relationships between distant parts of a document matter. RAG retrieves relevant chunks well but cannot detect inconsistencies across a 400-page contract or understand architectural patterns across an entire codebase. Those tasks genuinely benefit from 1M-token context.
Which of the following best describes the correct model routing strategy for a production AI application handling a mix of simple classification tasks and complex generation tasks?
✓ Correct. This is the canonical cost-optimization architecture: a lightweight router classifies incoming requests by complexity, dispatching simple tasks to cheap fast models and routing only genuinely complex tasks to expensive flagships. LinkedIn's 2024 implementation of this pattern cut per-query costs 65% with no measured quality regression in A/B testing.
✗ The optimal strategy is tiered routing: cheap fast models (Flash, Haiku, GPT-4o mini) handle simple classification and extraction, while flagship models handle the subset of requests that genuinely require deep reasoning or nuanced generation. This delivers the output quality distribution of a flagship at a fraction of the all-flagship cost.