When OpenAI released GPT-4 on March 14, 2023, developers immediately confronted a new reality: the input price was $0.03 per 1,000 tokens — roughly 60× more expensive than GPT-3.5-turbo's $0.0005. Startups that had built MVPs on GPT-3.5 suddenly had to re-architect entire products. Stripe's developer community forums filled with threads calculating that a single customer-support session averaging 4,000 tokens could cost $0.12 per conversation at GPT-4 rates — unsustainable at scale.
The lesson was sharp and immediate: model capability and model cost are two entirely separate axes, and confusing them was an expensive mistake.
All three major providers — OpenAI, Anthropic, and Google — bill API usage by the token. A token is approximately four characters of English text, or roughly ¾ of a word. A 1,000-word document is about 1,333 tokens. Pricing is expressed per million tokens (MTok) as of 2024, having previously been per 1,000 tokens.
Input tokens (what you send to the model) and output tokens (what the model generates back) are almost always priced differently. Output tokens typically cost 3–5× more than input tokens because generation is computationally more expensive than encoding. This asymmetry matters enormously for applications that produce long outputs.
As of mid-2024, representative pricing looks like this:
| Model | Input (per MTok) | Output (per MTok) | Context Window |
|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | 128K tokens |
| GPT-4o mini | $0.15 | $0.60 | 128K tokens |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K tokens |
| Claude 3 Haiku | $0.25 | $1.25 | 200K tokens |
| Gemini 1.5 Pro | $3.50 (≤128K) | $10.50 (≤128K) | 1M tokens |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1M tokens |
Every major provider now offers a flagship model and a smaller, cheaper companion. OpenAI has GPT-4o and GPT-4o mini. Anthropic has Claude 3.5 Sonnet and Claude 3 Haiku. Google has Gemini 1.5 Pro and Gemini 1.5 Flash. This tiering is deliberate: providers want to capture both high-value enterprise use cases and high-volume consumer applications.
The practical implication for developers is a routing architecture: send simple classification, extraction, or formatting tasks to the cheap model, and route complex reasoning, nuanced generation, or high-stakes decisions to the flagship. Companies like Notion and Intercom have publicly described using exactly this pattern to keep costs 60–80% below what all-flagship routing would cost.
Google's Gemini 1.5 Flash was specifically engineered for this role when it launched in May 2024 — Google described it as a "distilled" version of 1.5 Pro, optimized for latency and cost in agentic pipelines where millions of calls are made per day.
Context window size and cost interact in non-obvious ways. Gemini 1.5 Pro's pricing doubles when context exceeds 128K tokens: inputs over that threshold cost $7.00/MTok instead of $3.50. This means a developer who naively stuffs 300K tokens of documentation into every call pays twice the base rate for the majority of their input.
Anthropic introduced prompt caching for Claude in August 2024 — a mechanism that stores frequently-reused prompt prefixes on Anthropic's servers. Cached tokens cost only $0.30/MTok to read (versus $3.00 normal), but writing the cache costs $3.75/MTok. For applications that send the same large system prompt thousands of times per day, the breakeven point is reached after approximately 5 re-uses of the cached content per write.
OpenAI introduced a similar Cached Input Tokens feature for GPT-4o in October 2024, automatically caching prompt prefixes longer than 1,024 tokens at a 50% discount. Unlike Anthropic's explicit API call, OpenAI's caching is automatic and transparent to the developer.
Real-World Cost Calculation
A customer-support bot handling 10,000 conversations per day, each averaging 800 input tokens and 400 output tokens, costs approximately: GPT-4o — $70/day ($5×8M + $15×4M tokens). Claude 3 Haiku — $7/day ($0.25×8M + $1.25×4M tokens). Same task, 10× cost difference. The routing question is: which conversations actually need GPT-4o?
OpenAI introduced the Batch API in April 2024, offering 50% discounts on all models for requests that don't need real-time responses. Jobs complete within 24 hours. Anthropic followed with a similar batch processing tier in mid-2024. For data enrichment, document classification, or evaluation pipelines — tasks with no user waiting — batch APIs are the rational default, cutting costs by half with no quality trade-off.
Google offers similar asynchronous processing through its Vertex AI platform for Gemini models, with enterprise customers able to negotiate committed use discounts (CUDs) that further reduce per-token costs for predictable workloads.
Core Principle
Token cost is a function of model tier, context length, synchronicity, and caching strategy — not just the base rate. A developer who understands all four levers can often achieve 80–90% cost reduction on the same underlying task without degrading output quality on the tasks where quality actually matters.
In this lab you'll work through real-world cost estimation problems across GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash, and their cheaper companions. Ask the AI to help you calculate costs, compare models, or evaluate whether caching or batch processing makes sense for a given scenario.
The AI knows current mid-2024 pricing and can help you build intuition about when to route tasks to cheaper models versus flagships.
When the developer tool Cursor evaluated which model to use for its AI code-completion feature, latency was the decisive variable. GPT-4's response times averaged over three seconds for typical completions — too slow to feel native inside an IDE where developers expect suggestions in under a second. Cursor's team switched to Claude 3.5 Sonnet and, for inline suggestions, to lighter models entirely. Their internal benchmarks, discussed publicly on developer forums in June 2024, showed Sonnet delivering comparable code quality at latencies 40–60% lower than GPT-4 for their workload profile.
The episode illustrated a recurring pattern: for interactive, synchronous applications, latency is not a secondary concern — it is often the primary one.
Latency for LLM APIs is measured by two distinct metrics that serve different purposes. Time to First Token (TTFT) is the delay between submitting a request and receiving the first character of the response. It determines how "responsive" an application feels. Tokens Per Second (TPS) is the generation speed once streaming begins. It determines how quickly long responses complete.
For a chatbot where users watch text stream in, TTFT matters enormously — a 2-second delay feels like a frozen app. For a batch document summarizer where results are delivered asynchronously, TPS determines throughput and therefore daily capacity, while TTFT is irrelevant.
Across independent benchmarking services including ArtificialAnalysis.ai (which began publishing systematic latency benchmarks in 2024), the broad patterns as of mid-2024 are: Gemini 1.5 Flash and GPT-4o mini are consistently fastest, often achieving TTFT under 500ms and TPS above 100. Flagship models like GPT-4o and Claude 3.5 Sonnet average TTFT of 1–3 seconds under normal load, with TPS of 40–80.
Google's infrastructure advantage is real and measurable. Gemini models run on Google's custom TPU v5 hardware, which Google designs, manufactures, and deploys at scale inside its own data centers. The tight vertical integration between model architecture and hardware allows optimizations — particularly for the attention mechanism at long context lengths — that third-party GPU clusters cannot easily replicate.
Gemini 1.5 Flash was specifically architected for low latency: Google's technical report describes a "distillation" process that preserves the multi-modal reasoning capabilities of 1.5 Pro while reducing the parameter count and activation paths that contribute to generation latency. In ArtificialAnalysis.ai's July 2024 benchmarks, Flash achieved median output speeds of approximately 150 tokens/second, compared to GPT-4o's ~65 tokens/second and Claude 3.5 Sonnet's ~70 tokens/second under equivalent load conditions.
However, Google's API infrastructure also shows more latency variance than Anthropic's. Tail latency — the 95th or 99th percentile response times — tends to be less predictable on Gemini, particularly during peak usage windows. For applications with strict SLA requirements, this variance matters as much as median speed.
Raw speed per call is only one dimension. Rate limits — the maximum requests per minute (RPM) and tokens per minute (TPM) allowed at a given tier — determine how fast you can process large volumes of work in parallel. These limits differ significantly across providers and pricing tiers.
OpenAI's rate limits scale with monthly spend: a developer spending $1,000/month on GPT-4o gets dramatically higher RPM than a free-tier user. Anthropic uses a similar model, scaling limits by usage tier. Google's Gemini API offers notably high free-tier limits — 60 RPM for Gemini 1.5 Flash versus OpenAI's 3 RPM for GPT-4o free tier — which made it attractive for high-volume prototyping without an enterprise contract.
For production workloads requiring guaranteed throughput, Anthropic's enterprise agreements offer provisioned throughput — reserved capacity that isn't subject to shared infrastructure contention, with contractually guaranteed latency SLAs. OpenAI offers a similar "Dedicated" capacity tier through Azure OpenAI Service. Google offers committed throughput on Vertex AI.
Streaming vs. Waiting
Streaming API responses (receiving tokens as they're generated rather than waiting for the full reply) dramatically improves perceived responsiveness even when total generation time is unchanged. All three providers support streaming. For any user-facing application, streaming should be the default — it converts a 4-second wait into a visually immediate response with progressive rendering. The full latency penalty is still paid, but the user experience is transformed.
Real production systems rarely use a single model for all calls. The architectural pattern that emerges from both cost and latency considerations is a cascade: route to a fast, cheap model first; escalate to a slower, more capable model only when the fast model's confidence score or output quality falls below a threshold.
Companies including Salesforce (which discussed this architecture at Dreamforce 2024) and LinkedIn have implemented LLM routing layers that dynamically select models based on task complexity, urgency, and current provider latency telemetry. LinkedIn's engineering blog noted in 2024 that their routing layer reduced average per-query cost by 65% with no detectable change in A/B test user satisfaction scores.
Core Principle
For interactive applications, TTFT under 1 second is a design constraint, not a nice-to-have. For batch applications, TPS and rate limits determine daily throughput capacity. The same task optimized for latency and optimized for cost often points to different model choices — and frequently to different models on the same provider's tiered menu.
In this lab you'll explore how to choose and architect model routing decisions based on latency requirements. Describe your application's latency needs — TTFT requirements, batch vs. interactive, throughput volumes — and the AI will help you design a routing strategy with specific model recommendations.
You can also ask about streaming implementation, rate limit planning, or how to measure and monitor latency in production.
When Google announced Gemini 1.5 Pro in February 2024, the headline number was 1,000,000 tokens of context — roughly 700,000 words, or the equivalent of about 10 full-length novels. The demo Google provided was striking: the model was given the entire 402-page Apollo 11 mission transcript and asked to find a specific moment when an engineer made a particular remark. It found it correctly.
Developer reaction was initially euphoric — and then quickly more nuanced. Forum discussions on Hacker News and the r/MachineLearning subreddit in February–March 2024 began noting that performance degraded when relevant information was buried in the middle of a very long context. Researchers at UC Berkeley had already documented this as the "Lost in the Middle" problem in a 2023 paper. Large context windows were real and useful — but not magic.
Context window sizes have expanded dramatically. As of mid-2024: Gemini 1.5 Pro and Flash both support 1 million tokens (with experimental 2M available on request). Claude 3.5 Sonnet and Haiku support 200,000 tokens. GPT-4o and GPT-4o mini support 128,000 tokens. For reference, GPT-3.5's original context window was 4,096 tokens.
The practical differences between 128K and 200K matter for specific use cases: a Claude 200K context can hold roughly a 150,000-word book, while GPT-4o's 128K holds about 95,000 words. For processing full novels, academic papers with all their citations, or large codebase files, these differences are meaningful. Gemini's 1M token ceiling is in a different category entirely — enabling use cases like analyzing an entire company's email archive or a complete software repository.
The UC Berkeley paper "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., 2023) demonstrated empirically that all transformer-based models show degraded recall performance for information positioned in the middle of a long context window, even when the full context technically fits. Information at the beginning (primacy effect) and end (recency effect) of the context is recalled more reliably than information buried in the center.
This has direct implications for how to structure prompts when using large context windows. If you're feeding a model a 50-page document and a question, the question should be repeated at both the beginning and end of the context, not just prepended once. Critical information should be placed near the start or end of the document content, not buried on page 25.
Google's technical reports for Gemini 1.5 acknowledge this and note improvements over earlier models, but independent evaluations suggest the phenomenon persists to some degree even at frontier scale. Anthropic's Claude 3 technical report includes "needle in a haystack" benchmark results showing near-perfect recall across the full 200K window — a genuine architectural improvement — but these tests use single well-defined retrieval targets, not complex multi-hop reasoning over buried content.
The distinction between nominal context (the maximum tokens supported) and effective context (the range over which the model reliably uses information) is critical for production applications. A model with a 1M token context window may handle a retrieval task perfectly at 100K tokens but show degraded reasoning coherence at 800K tokens — not because the window is full, but because maintaining coherent attention across that span is genuinely difficult.
For most production RAG (Retrieval-Augmented Generation) pipelines, practitioners find that chunking and retrieving the top-K relevant segments — rather than dumping an entire document into context — produces better results even when the document would technically fit. This is counterintuitive: why retrieve when you can fit everything? The answer is that focused, relevant context enables tighter attention than diffuse, exhaustive context.
The exception is tasks requiring global understanding of a document — analyzing themes across a whole novel, checking for internal consistency in a long contract, or summarizing an entire codebase. For those tasks, large context windows provide genuine capability that retrieval cannot replicate.
Context and Cost Interaction
Because you pay for every token in the context window on every API call, large contexts have compounding cost implications. Sending a 100K-token document as context on every call in a multi-turn conversation means paying for 100K input tokens per turn — even if only 2K tokens of the document are relevant to that specific question. Selective retrieval often dominates both on cost and on quality simultaneously.
Context windows in Gemini 1.5 are not limited to text. The 1M token window can hold combinations of text, images, audio, and video — with tokens allocated proportionally to each modality. A minute of video encodes to approximately 1M tokens by itself, which means video analysis is feasible but rapidly consumes the full window.
This multi-modal context capability is genuinely differentiated. As of mid-2024, GPT-4o's context window handles images and text, and Claude 3.5's handles images and text. But Gemini 1.5's native audio and video context — without requiring separate transcription steps — enables workflows that simply aren't possible with the other models. Google's demo of analyzing a 44-minute Buster Keaton silent film and answering specific questions about plot and cinematography technique demonstrated a capability class that GPT-4o and Claude 3.5 cannot match natively.
Core Principle
Context window size is a ceiling, not a guarantee of performance. Use large windows when global document understanding is genuinely required — consistency checking, whole-document summarization, video analysis. Use retrieval for most question-answering tasks. Always account for the per-token cost of large contexts, which scales linearly with every call that uses them.
Use the AI below to explore Lesson 3 concepts in depth. Challenge assumptions and work through scenarios.
A mid-size e-commerce company built a product-description generator on GPT-4o in early 2024. In development, costs looked manageable — a few dollars a day. But when they scaled to 50,000 product listings, each with a 3,000-token system prompt, 800-token product spec input, and 600-token output, the bill hit $4,200 per day. The system prompt alone — identical for every single request — was costing nearly $3,000 of that daily total.
After implementing prompt caching, switching simple reformatting tasks to GPT-4o mini, and moving non-urgent batch jobs to the Batch API, their daily cost dropped to $480 — an 89% reduction. The output quality for the tasks that remained on GPT-4o was unchanged. The lesson: total cost of ownership is an engineering problem, not just a pricing problem.
The formula for API cost is straightforward but easy to misapply: Cost = (Input tokens × Input rate) + (Output tokens × Output rate). The traps are in what counts as input tokens. Your system prompt, the full conversation history in multi-turn chats, retrieved RAG chunks, and the user's message are all input tokens on every single request. In a 10-turn conversation with a 2,000-token system prompt, you pay for that system prompt 10 times.
For a realistic production estimate, you need to know: (1) daily active calls, (2) average total input tokens per call (including system prompt), (3) average output tokens per call, and (4) the model tier. A customer-support bot at 20,000 calls/day, 1,500 input tokens, 400 output tokens on GPT-4o costs: 20,000 × (1,500 × $0.000005 + 400 × $0.000015) = $270/day, $8,100/month. The same workload on Claude 3 Haiku: 20,000 × (1,500 × $0.00000025 + 400 × $0.00000125) = $17.50/day, $525/month. A 15× cost difference — for tasks where Haiku's quality may be entirely sufficient.
Model routing is the practice of directing different requests to different models based on their complexity, urgency, and quality requirements. It is the single highest-leverage cost optimization available for most production AI applications. The core insight is that most application workloads contain a mix of tasks with very different capability requirements, and using the same flagship model for all of them is wasteful.
A practical routing taxonomy: Classification and routing tasks (e.g., "Is this question about billing or technical support?") need virtually no reasoning — Gemini 1.5 Flash or Claude 3 Haiku are more than sufficient at 1/20th the cost of GPT-4o. Structured extraction tasks (e.g., "Extract the key entities from this support ticket") need precision but not deep reasoning — mid-tier models suffice. Complex generation tasks (e.g., "Draft a detailed response to this technical escalation") warrant flagship model quality. Only the last category should touch GPT-4o or Claude 3.5 Sonnet.
The implementation pattern is a lightweight router model that first classifies incoming requests into tiers, then dispatches to the appropriate model. The router itself runs on a cheap, fast model (Flash, Haiku, or GPT-4o mini). LinkedIn's 2024 engineering blog reported their routing layer cut per-query costs by 65% with no measurable quality degradation in A/B testing — a result consistent across multiple large deployments.
The most underutilized cost lever in production AI applications is prompt caching. Most applications send the same system prompt — potentially thousands of tokens long — with every single request. Without caching, you pay full input price for those tokens every time. With caching, you pay a fraction.
Anthropic's explicit caching requires marking cache breakpoints in your prompt via the API. The write cost is $3.75/MTok (higher than normal), but the read cost is $0.30/MTok (10× cheaper than normal). For an application sending a 4,000-token system prompt 10,000 times per day: without caching, 40M input tokens × $3.00 = $120/day. With caching (writing once per session, reading 10,000 times): write cost negligible + 40M tokens × $0.30 = $12/day. A 90% reduction on system prompt costs alone.
OpenAI's caching works differently — it's automatic for prompt prefixes exceeding 1,024 tokens, applying a 50% discount with no explicit API call required. The 50% discount is less dramatic than Anthropic's approach but requires zero implementation effort. For applications already using long system prompts on GPT-4o, this automatic discount applies without any changes to code.
Context Window Size and Per-Request Cost
Context window size and per-request cost are directly linked. Every token in your context window is billed as an input token on every call. A 200K-token context costs 200,000 × input rate per request — $1.00 per call on GPT-4o at standard rates. If you're making 100,000 calls per day with a large context, this is $100,000/day in input costs alone. The discipline of keeping context windows lean — using retrieval to fetch only relevant chunks rather than dumping full documents — is not just about quality (the "lost in the middle" problem), it's about economics.
OpenAI's Batch API, launched April 2024, offers a 50% discount on all models for jobs that complete within 24 hours. Anthropic's batch processing tier offers similar discounts. For any workload that doesn't require real-time responses — document classification, data enrichment, evaluation runs, content generation pipelines, nightly report generation — batch processing is the rational default. The quality is identical; only the response time changes.
The practical calculation: if your daily document processing pipeline costs $500/day on synchronous GPT-4o, moving it to the Batch API cuts that to $250/day with zero quality trade-off. Over a year, that's $91,250 saved. The implementation requires wrapping requests in a JSONL batch format and polling for completion, which is modest engineering effort relative to the savings.
Core Principle
Total cost of ownership for AI applications has four levers: model tier selection, caching strategy, context window discipline, and batch vs. synchronous processing. A team that consciously engineers all four can typically achieve 80–90% cost reduction compared to a naïve all-flagship synchronous deployment — while maintaining or improving quality on the tasks that actually require flagship capability. Start with routing architecture; it delivers the highest leverage fastest.
Apply and extend the concepts from this lesson through guided conversation with an AI assistant.
Use this lab to explore how the concepts from Lesson 4 apply to your own questions and interests. The AI assistant is here to help you think through complex scenarios.
15 questions covering all lessons — free, untracked, retake anytime.