Module 6 · Lesson 1

Token Economics and API Pricing Models

How AI vendors measure consumption — and why the unit of cost is not what you think.

If you cannot measure what you spend, how do you know what you can afford to build?

In early 2023, Stability AI was burning through compute at a reported rate of roughly $8 million per month while simultaneously offering free API access to developers. CEO Emad Mostaque later acknowledged that the company had no sustainable unit-economics model. By mid-2024 Mostaque had resigned, the company underwent drastic restructuring, and investors were publicly questioning whether the business had ever understood its own cost structure. The failure was not technological — it was financial instrumentation.

What Is a Token?

All major large-language-model APIs — OpenAI, Anthropic, Google Gemini, Cohere, Mistral — price on tokens, not characters, words, or requests. A token is a chunk of text roughly equivalent to 0.75 English words, though the exact mapping depends on the tokenizer each vendor uses. The phrase "cost monitoring" tokenizes to approximately 3 tokens in OpenAI's cl100k_base tokenizer. "监控成本" (the same phrase in Mandarin) tokenizes to roughly 6 tokens, because non-Latin scripts are often less efficiently encoded.

Vendors distinguish between input tokens (everything sent to the model: system prompt, conversation history, user message) and output tokens (the model's response). Output tokens are almost always priced higher — typically 3–5× more — because generating text is computationally more intensive than ingesting it. This asymmetry has significant architectural implications.

Model	Token Type	Price per Million Tokens (May 2025)
GPT-4o	Input	$2.50
GPT-4o	Output	$10.00
Claude 3.5 Sonnet	Input	$3.00
Claude 3.5 Sonnet	Output	$15.00

Pricing Tiers and Volume Discounts

Most vendors offer tiered pricing, where higher usage or looser latency requirements unlock lower per-token rates. OpenAI's Batch API, for example, offers a 50% discount for requests that do not need real-time responses (asynchronous jobs that complete within 24 hours). In March 2024, Google launched Gemini 1.5 Pro with a free tier limited to 2 requests per minute, then a paid tier beginning at $7 per million input tokens for prompts above 128k context. The free tier was explicitly designed as a loss-leader to drive adoption before monetizing at scale.

Enterprise agreements often include committed use discounts (CUDs), where committing to a minimum monthly spend — typically $50,000–$250,000 — yields 20–40% reductions. These commitments introduce their own budget risk: unused capacity still costs money, while overruns trigger overage pricing that may be higher than the standard rate.

Cost Driver: Context Window

Long-context models bill for every token you actually send, not for the size of the context window, so what you put in the prompt is what drives cost. Sending a 100,000-token system prompt on every request multiplies input costs dramatically. A team at Replit discovered in 2023 that their AI code assistant was re-sending the entire codebase as context on each message, generating input token costs 18× higher than projected.

Calculating Cost per Request

The fundamental unit to track is cost per API request, which compounds into cost per user, cost per session, and cost per feature. The formula is straightforward:

Token Cost Formula

Cost per request = (input_tokens × input_price_per_token) + (output_tokens × output_price_per_token)

Example — GPT-4o, 500 input + 300 output:
= (500 × $0.0000025) + (300 × $0.00001)
= $0.00125 + $0.003
= $0.00425 per request
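The formula above can be sketched as a small function. This is a minimal illustration, assuming prices are expressed in USD per million tokens as in the pricing figures earlier in this lesson; the function and constant names are hypothetical, not part of any vendor SDK.

```python
# Assumption: prices are the May 2025 GPT-4o figures quoted in this lesson,
# expressed in USD per million tokens.
GPT4O_INPUT_PER_M = 2.50
GPT4O_OUTPUT_PER_M = 10.00


def cost_per_request(input_tokens, output_tokens, input_per_m, output_per_m):
    """Return the USD cost of one request from token counts and per-million prices."""
    input_cost = input_tokens * input_per_m / 1_000_000
    output_cost = output_tokens * output_per_m / 1_000_000
    return input_cost + output_cost


# The worked example: 500 input tokens and 300 output tokens on GPT-4o.
example = cost_per_request(500, 300, GPT4O_INPUT_PER_M, GPT4O_OUTPUT_PER_M)
print(f"${example:.5f} per request")
print(f"${example * 1_000_000:,.0f} per million requests")
```

Keeping prices in per-million units mirrors how vendors publish them, which makes it easier to update the constants when a price sheet changes.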

At scale, $0.00425 per request becomes $4,250 for one million requests. If your product serves 100,000 active users making 20 requests per day, that is 2 million requests per day, or about 60 million per month: roughly $255,000 monthly in model costs alone, before infrastructure, storage, or personnel. This is why understanding the token economics before scaling is foundational to sustainable deployment.

Input tokens: All text sent to the model in a single API call: system prompt, history, user message. Priced at a lower rate than output.

Output tokens: Text generated by the model in response. Priced 3–5× higher than input tokens at most vendors.

Batch API: An asynchronous request mode (offered by OpenAI and Anthropic) that processes jobs over hours at a 50% discount relative to the real-time API.

Committed Use Discount: A contractual minimum spend in exchange for reduced per-token rates. Common in enterprise AI contracts.

Key Insight

Because output tokens cost 3–5× more than input tokens, the single most impactful cost-reduction lever available to most teams is constraining max_tokens in the API call — setting a ceiling on how long the model's response can be. This alone can cut costs by 40–60% without changing model or prompt structure.

Module 6 · Lesson 1 Quiz

Token Economics and API Pricing

Three questions.

1. Why are output tokens priced higher than input tokens by most AI API vendors?

Correct. Autoregressive generation — producing one token at a time while attending to all previous tokens — is far more GPU-intensive than a single forward pass over input text. This compute asymmetry drives the pricing asymmetry.

Not quite. The pricing difference reflects underlying compute cost. Generating tokens requires iterative forward passes through the model; ingesting input requires only one.

2. What was the primary cost-structure problem that contributed to Stability AI's financial difficulties in 2023?

Correct. Stability AI offered free API access while burning approximately $8 million per month without a clear path to recovering those costs from revenue. This was fundamentally a financial instrumentation failure — not understanding cost per unit served.

The core issue was lacking a sustainable unit-economics model — offering free access while spending ~$8M/month on compute with no clear path to cost recovery.

3. A request to GPT-4o uses 800 input tokens and 400 output tokens. Using May 2025 pricing ($2.50/M input, $10.00/M output), what is the cost?

Correct. (800 × $0.0000025) + (400 × $0.00001) = $0.002 + $0.004 = $0.006 per request.

Re-apply the formula: (input_tokens × $0.0000025) + (output_tokens × $0.00001). With 800 input and 400 output: $0.002 + $0.004 = $0.006.

Module 6 · Lab 1

Token Cost Calculator

Practice estimating API costs across different models and usage patterns.

Lab Scenario

You are a cost analyst evaluating AI API options for a customer-support chatbot expected to handle 500,000 sessions per month, with an average of 8 messages per session. Each message has approximately 350 input tokens (including system prompt and history) and 200 output tokens.

Ask the assistant to help you calculate monthly costs across different models, evaluate which model tiers are appropriate, and identify what levers would most reduce spend.

Start by asking: "Calculate the monthly token cost for this chatbot at GPT-4o pricing, then compare it to Claude 3.5 Sonnet."


Module 6 · Lesson 2

Budget Alerting and Spend Controls

Real-time guardrails that prevent a spike from becoming a catastrophe.

What is the difference between a budget and a budget with teeth?

In April 2024, a developer reported on the Google Cloud community forum that a misconfigured load-testing script had sent continuous generation requests to the Imagen API for approximately six hours overnight. By morning the bill was $47,000. Google had sent no real-time alert — only the next-day billing notification. After significant community pressure, Google expanded its budget alert options and introduced per-project daily spend limits for Vertex AI APIs. The incident illustrated that alerts without enforcement are notifications, not controls.

The Alert-vs-Enforce Distinction

Most cloud AI platforms offer two fundamentally different mechanisms: budget alerts (notifications sent when spend crosses a threshold) and spend caps or hard limits (enforcement actions that stop or throttle API calls). The gap between these two is where most costly incidents occur.

OpenAI's platform, as of 2024, offers both a soft notification alert and a hard usage limit. The hard limit, once reached, returns HTTP 429 errors to callers — meaning your application must handle this gracefully or it will surface errors to end users. Anthropic's Console similarly allows setting monthly usage caps at the API key or workspace level. GCP's Vertex AI allows per-resource quotas that can be set to hard ceilings.
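Handling a hard limit gracefully might look like the sketch below. It assumes a hypothetical `UsageLimitReached` exception that your own client wrapper raises when the vendor returns HTTP 429; the exception name, the fallback text, and the `call_model` callable are illustrative and not part of any vendor SDK.

```python
class UsageLimitReached(Exception):
    """Hypothetical error raised by a client wrapper when the vendor returns HTTP 429."""


FALLBACK_MESSAGE = "The assistant is temporarily unavailable. Please try again later."


def answer_with_fallback(call_model, prompt):
    """Call the model; if the usage limit is hit, degrade to a static message."""
    try:
        return {"ok": True, "text": call_model(prompt)}
    except UsageLimitReached:
        # Do not retry in a tight loop: a hard spend limit will not clear on its own.
        return {"ok": False, "text": FALLBACK_MESSAGE}


def fake_model_over_limit(prompt):
    # Stand-in for a real client call that has hit the hard limit.
    raise UsageLimitReached("429: usage limit reached")


print(answer_with_fallback(fake_model_over_limit, "Summarize this contract."))
print(answer_with_fallback(lambda p: "Summary: ...", "Summarize this contract."))
```

Returning a structured result rather than raising lets the calling code decide whether to show the fallback, queue the request, or alert an operator.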

Real Incident — Developer on Reddit, 2023

A developer building a GPT-4 powered writing assistant left a test environment running with no usage limit. A recursive prompt loop caused the assistant to repeatedly invoke itself. Within 90 minutes, 2.1 million tokens were consumed at GPT-4-32k pricing ($0.12/1k output tokens at the time), producing a $252 bill. The developer had configured an alert at $50 — but had not checked email. The fix: hard limit set to $30/month on test API keys, separate from production.

Layered Budget Architecture

Production AI cost control requires multiple layers — not a single alert. A layered architecture typically looks like this:

Layer	Mechanism	Action	Latency
1 — Application	Per-user token budget enforced in code	Reject request before API call	Instant
2 — API Key	Vendor hard limit per key	429 error returned by vendor	Real-time
3 — Workspace/Project	Vendor monthly cap	All keys under workspace blocked	Real-time
4 — Cloud Billing	Budget alert → PagerDuty/Slack	On-call engineer notification	Minutes
5 — Finance	Monthly invoice review	Retrospective correction	30 days

Layer 1 — the application layer — is the most powerful because it acts before any money is spent. A common pattern is maintaining a per-user token budget in Redis or a similar fast store: each request decrements the user's remaining allowance; once exhausted, the request is rejected with a friendly message rather than forwarded to the model API. This eliminates runaway spend caused by individual users without impacting other users.
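The per-user budget pattern described above can be sketched as follows. This is a minimal in-memory illustration: a plain dict stands in for Redis or a similar store, and the class, method names, and rejection message are hypothetical.

```python
class TokenBudget:
    """Per-user token allowance. Assumption: a dict stands in for Redis or another fast store."""

    def __init__(self, allowance_per_user):
        self.allowance_per_user = allowance_per_user
        self.remaining = {}  # user_id -> tokens left in the current period

    def try_spend(self, user_id, estimated_tokens):
        """Reserve tokens for a request; return False if the budget cannot cover it."""
        left = self.remaining.get(user_id, self.allowance_per_user)
        if estimated_tokens > left:
            return False  # reject before any API call is made
        self.remaining[user_id] = left - estimated_tokens
        return True


def handle_request(budget, user_id, estimated_tokens, call_model):
    """Forward to the model only when the user's budget covers the request."""
    if not budget.try_spend(user_id, estimated_tokens):
        return "You have reached your usage limit for this period."
    return call_model()


budget = TokenBudget(allowance_per_user=1_000)
print(handle_request(budget, "alice", 600, lambda: "model reply"))
print(handle_request(budget, "alice", 600, lambda: "model reply"))  # only 400 left
print(handle_request(budget, "bob", 600, lambda: "model reply"))    # separate allowance
```

In a shared store the check and the decrement would need to happen atomically (for example with a single decrement command or a script), otherwise concurrent requests from one user could overspend the allowance.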

Configuring Vendor Alerts

On OpenAI's platform: Settings → Billing → Usage limits. Set a soft limit (email notification) and a hard limit (API calls blocked). Best practice: set the hard limit at 120% of expected monthly spend, with the soft limit at 80%. This gives a 20% buffer for legitimate traffic spikes while ensuring you are alerted before hitting the ceiling.
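The 80% / 120% rule of thumb is simple arithmetic, sketched here so the thresholds can be recomputed whenever the forecast changes. The function name and the $10,000 forecast are hypothetical.

```python
def usage_limits(expected_monthly_spend, soft_ratio=0.80, hard_ratio=1.20):
    """Return (soft_limit, hard_limit) in the same currency as the input."""
    soft = round(expected_monthly_spend * soft_ratio, 2)
    hard = round(expected_monthly_spend * hard_ratio, 2)
    return soft, hard


# Assumption: a forecast of $10,000 per month, chosen only for illustration.
soft, hard = usage_limits(10_000)
print(f"soft limit: ${soft:,.2f}, hard limit: ${hard:,.2f}")
```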

On AWS Bedrock: AWS Budgets alerts can be configured against the Bedrock service, with SNS topics triggering Lambda functions that revoke IAM permissions. This effectively implements a hard limit through automation even where Bedrock itself does not offer native caps.

On Azure OpenAI: Token-per-minute (TPM) quotas are set per deployment. Setting a low TPM on development deployments prevents runaway costs while keeping production deployments at full quota.

Implementation Pattern: Separate API Keys by Environment

Always use distinct API keys for development, staging, and production — each with its own hard limit calibrated to that environment's expected usage. A development key should never have access to a $10,000 monthly limit. A $50/month hard limit on dev keys means the worst-case incident from a runaway test script costs less than a dinner.

Soft Limit: A notification threshold. An email or webhook fires, but API calls continue; it is a warning without enforcement.

Hard Limit: An enforcement ceiling. Once reached, the vendor returns errors and no further charges accrue; calls stop.

Token Budget (App Layer): A per-user or per-session allowance maintained in application code that rejects requests before they reach the API.

TPM Quota: Tokens-per-minute quota, Azure's mechanism for rate-limiting per deployment, which implicitly caps cost velocity.

Module 6 · Lesson 2 Quiz

Budget Alerting and Spend Controls

Three questions.

1. What is the critical difference between a "soft limit" and a "hard limit" in AI API cost management?

Correct. A soft limit is a notification mechanism — you get an email or webhook, but charges continue accruing. A hard limit is an enforcement mechanism — the vendor returns 429 errors and billing stops at that ceiling.

The distinction is about enforcement. Soft limits notify; hard limits block. This is the difference that determined whether the Google Imagen incident ($47K) could have been prevented automatically.

2. In a layered budget architecture, which layer acts BEFORE any money is spent?

Correct. The application layer checks token budgets before forwarding the request to the API. If the budget is exhausted, the request never reaches the vendor — so no cost is incurred. This is the most cost-efficient enforcement point.

Only the application layer can reject a request before it reaches the vendor API. All vendor-side and billing mechanisms act after the API call has already been initiated — meaning cost has already been incurred or is in progress.

3. What best practice is recommended for managing API keys across development and production environments?

Correct. Separate keys with environment-appropriate hard limits contain blast radius. A $50/month hard limit on a dev key means the worst runaway script costs at most $50 — not $47,000 as in the Imagen incident.

Sharing keys or using unlimited dev keys is how the costly incidents happen. Separate keys with small hard limits on dev environments is the standard best practice.

Module 6 · Lab 2

Designing a Budget Alert Architecture

Map out a layered spend control system for a production AI application.

Lab Scenario

You are building spend controls for an AI-powered legal document drafting SaaS. The application serves 2,000 law firm users, has a $15,000/month AI budget, and must never surface billing-caused errors to paying customers during business hours.

Work with the assistant to design a complete alert and enforcement architecture: what layers you need, how to configure each, where to place hard vs. soft limits, and how to handle graceful degradation when limits approach.

Start by asking: "What layers should I implement for a legal SaaS with a $15,000/month AI budget and 2,000 users?"


Module 6 · Lesson 3

Cost Attribution and Chargeback Models

Knowing total spend is the beginning. Knowing who spent it is where accountability starts.

If every team sees "AI costs" as someone else's line item, who is responsible for optimizing it?

Dropbox disclosed in its 2023 annual report that AI infrastructure costs had grown substantially, but internal teams struggled to attribute those costs to specific product lines. Drew Houston acknowledged in a Q3 2023 earnings call that the company was investing in "better tooling to understand which features are consuming which AI resources." Without attribution, Dropbox could not determine whether the cost of its AI-summarization feature was justified by the retention it generated — a fundamental gap between investment and accountability that persists at many companies deploying generative AI.

Why Attribution Matters

AI API costs on a consolidated bill tell you what you spent. Cost attribution tells you why you spent it and which business activity drove the spend. Without attribution, a $200,000 monthly AI bill is an opaque number. With attribution, it becomes: $82,000 to the document-summarization feature (which drives 34% of upgrades), $71,000 to the chatbot (which serves 60% of users), and $47,000 to experimental features used by fewer than 200 people.

This granularity enables ROI measurement at the feature level — a capability that most enterprise software teams did not need before AI, because compute costs were rarely variable enough to track this way.

Attribution Dimensions

Teams typically attribute AI costs along several dimensions simultaneously:

Dimension	Example Tags	Use Case
Product Feature	feature=summarization, feature=chat	Feature ROI analysis
Customer Tier	tier=enterprise, tier=free	Margin by customer segment
Team/Department	team=search, team=onboarding	Internal chargeback
Model Used	model=gpt4o, model=claude-haiku	Model selection optimization
Environment	env=prod, env=staging	Non-prod cost containment
User Plan	plan=pro, plan=basic	Pricing model validation

Implementation: Tagging at the Proxy Layer

The most common implementation pattern is to route all model API calls through an internal proxy or gateway that appends metadata to each request log before forwarding. This proxy records: timestamp, user_id, session_id, feature_tag, model_name, input_token_count, output_token_count, and latency. The token counts multiplied by current pricing yield a cost per request that can be aggregated by any dimension.
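Once the proxy has logged these fields, aggregation by any dimension is a short grouping step. The sketch below assumes log records are plain dicts using the field names listed above; the price table, helper names, and sample records are hypothetical.

```python
from collections import defaultdict

# Assumption: USD per million tokens as (input, output), using the GPT-4o
# figures quoted in Lesson 1. Extend with other models as needed.
PRICES = {"gpt-4o": (2.50, 10.00)}


def request_cost(record):
    """Cost of one logged request from its token counts and the model's prices."""
    input_per_m, output_per_m = PRICES[record["model_name"]]
    return (record["input_token_count"] * input_per_m
            + record["output_token_count"] * output_per_m) / 1_000_000


def cost_by(records, dimension):
    """Sum request cost grouped by any logged field, e.g. feature_tag or user_id."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += request_cost(record)
    return dict(totals)


# Hypothetical log records as a proxy might write them.
log = [
    {"user_id": "u1", "feature_tag": "summarization", "model_name": "gpt-4o",
     "input_token_count": 4_000, "output_token_count": 500},
    {"user_id": "u2", "feature_tag": "chat", "model_name": "gpt-4o",
     "input_token_count": 800, "output_token_count": 200},
    {"user_id": "u1", "feature_tag": "chat", "model_name": "gpt-4o",
     "input_token_count": 1_000, "output_token_count": 300},
]

print(cost_by(log, "feature_tag"))
print(cost_by(log, "user_id"))
```

Because cost is derived from token counts at query time, the same log can be re-priced if vendor rates change, as long as the price table records which rates applied when.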

Tools that support this pattern include LiteLLM (an open-source unified proxy supporting 100+ LLM APIs with built-in cost tracking), Helicone (an open-source logging proxy with a dashboard UI), LangSmith (LangChain's hosted observability platform with cost tracking per run), and Arize Phoenix (an open-source observability tool). Commercial options also include Datadog LLM Observability.

Real Pattern: Helicone at Pika Labs

Pika Labs, the AI video generation startup, publicly disclosed using Helicone for LLM cost tracking in 2023. Routing all model calls through Helicone gave them per-request cost data, latency tracking, and the ability to tag requests by user tier — essential for understanding whether their free tier was economically viable as they scaled to millions of users.

Chargeback vs. Showback

Showback means reporting AI costs back to teams for awareness — they can see what they spend, but it does not affect their budget. Chargeback means those costs are actually debited from each team's budget. Most organizations begin with showback (lower organizational friction) and graduate to chargeback once teams trust the attribution data and have built tooling to optimize their own usage.

Microsoft's internal FinOps practices (described in their 2023 Azure cost-optimization documentation) recommend a phased approach: Month 1–3: showback only. Month 4–6: soft chargeback (teams see allocated costs, no financial consequence). Month 7+: hard chargeback (costs transferred to business-unit P&L).

Cost Attribution: The practice of linking AI API costs to specific features, teams, customers, or business activities using metadata tags.

Proxy/Gateway Layer: An internal service that sits between your application and the model API, logging metadata on every request for cost attribution.

Showback: Reporting costs to teams for visibility without financial consequence (awareness without accountability).

Chargeback: Allocating AI costs directly to teams' budgets so they bear financial responsibility for their usage.

Key Insight: The Free-Tier Profitability Problem

Without per-user cost attribution, you cannot determine whether your free tier is a viable acquisition channel or a loss-generating liability. Many AI companies discovered in 2023–2024 that their free tiers were being consumed disproportionately by power users who never converted to paid plans — a discovery only possible with user-level cost attribution data.

Module 6 · Lesson 3 Quiz

Cost Attribution and Chargeback Models

Three questions.

1. What is the primary business benefit of feature-level AI cost attribution?

Correct. When you can see that your summarization feature costs $82K/month and drives 34% of upgrades, you can make a data-driven judgment about whether that cost is justified. Without attribution, all AI spend is undifferentiated.

The primary benefit is ROI visibility at the feature level. Attribution answers "which features are worth their AI costs" — without it, you cannot make rational product investment decisions.

2. What is the role of a proxy or gateway layer in AI cost attribution?

Correct. The proxy layer is the attribution instrumentation point — it intercepts every API call, appends metadata, logs the request and response details, and forwards to the vendor. Tools like LiteLLM, Helicone, and LangSmith implement this pattern.

The proxy layer's role in attribution is logging metadata. It sits between your app and the model API, recording feature tags, user IDs, token counts, and latency on every request so costs can be attributed later.

3. What distinguishes "showback" from "chargeback" in internal AI cost management?

Correct. Showback = visibility without accountability. Chargeback = financial accountability. Most organizations start with showback to establish trust in attribution data before moving to chargeback where teams' budgets are directly affected.

The distinction is financial consequence. Showback informs; chargeback allocates. Teams see the same cost data in both, but chargeback means those costs actually hit their P&L.

Module 6 · Lab 3

Building a Cost Attribution Schema

Design the metadata tagging system for a multi-feature AI product.

Lab Scenario

You are the platform engineer at an AI-powered HR software company. Your product has five AI features: resume screening, interview question generation, offer letter drafting, employee sentiment analysis, and a general HR chatbot. You serve enterprise customers (charged per seat) and a free tier. Your AI bill last month was $94,000 — and nobody knows which features drove it.

Work with the assistant to design a complete tagging schema, choose a proxy tool, and plan the transition from showback to chargeback for your internal teams.

Start by asking: "Help me design a tagging schema for cost attribution across our five HR AI features and two customer tiers."


Module 6 · Lesson 4

Cost Optimization Strategies

Systematic techniques to reduce AI spend without reducing capability.

What is the most expensive token you send? The one you didn't need to send at all.

Brex, the corporate card and expense management company, publicly shared in a 2024 engineering blog post that they reduced their LLM costs by over 80% through a combination of strategies: routing simpler classification tasks to smaller, cheaper models (Claude Haiku instead of Sonnet), implementing semantic caching for repeated queries, and aggressively trimming system prompt size. The expense categorization feature, which had initially used Claude Opus for every classification, was refactored to use Haiku for 90% of requests — with Opus reserved only for ambiguous edge cases. The per-request cost dropped from $0.018 to $0.0031.

Strategy 1: Model Routing and Cascading

Not every task requires your most capable model. The practice of model routing — directing requests to different models based on task complexity — is one of the highest-leverage cost levers available. A simple customer lookup query does not need GPT-4o; Claude 3 Haiku or GPT-4o-mini can handle it at 10–20× lower cost with equivalent accuracy on simple tasks.

Cascading extends this: start with a cheap model and escalate to a more powerful one only when the first model signals low confidence or the response fails a quality gate. LLM-routing libraries like LiteLLM's router and RouteLLM (an open-source tool from LMSYS released in 2024) implement this pattern with configurable thresholds.
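A cascade can be sketched in a few lines. This is an illustration of the idea only and does not use the LiteLLM or RouteLLM APIs: each model is assumed to be a callable returning an answer and a confidence score, and the toy models, labels, and threshold are hypothetical.

```python
def cascade(prompt, cheap_model, strong_model, min_confidence=0.8):
    """Try the cheap model first; escalate when its confidence is below the threshold.

    Assumption: each model is a callable returning (answer, confidence in [0, 1]).
    """
    answer, confidence = cheap_model(prompt)
    if confidence >= min_confidence:
        return answer, "cheap"
    answer, _ = strong_model(prompt)
    return answer, "strong"


# Toy stand-ins: the cheap model is only confident on short inputs.
def toy_cheap(prompt):
    return "meals", (0.95 if len(prompt) < 40 else 0.40)


def toy_strong(prompt):
    return "meals (reviewed)", 0.99


print(cascade("Lunch receipt, $14", toy_cheap, toy_strong))
print(cascade("Team offsite: catering, venue deposit and shuttle hire", toy_cheap, toy_strong))
```

Note that an escalated request pays for both calls, so the threshold should be tuned so that escalations stay a small share of traffic.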

Strategy 2: Prompt Optimization

System prompts are paid for on every single request. A 2,000-token system prompt sent with each of 10 million daily requests consumes 20 billion input tokens per day; at GPT-4o pricing, that is $50,000 per day from the system prompt alone. Two techniques reduce this kind of repeated spend:

Prompt compression: tools like LLMLingua (from Microsoft Research, 2023) use a smaller model to compress verbose prompts by removing less-informative tokens while preserving meaning. In benchmarks, LLMLingua achieves 4–20× compression with less than 5% performance degradation on many tasks.

Semantic caching: store embeddings of past queries and their responses. When a new query is semantically similar to a cached one (cosine similarity above a threshold), return the cached response without an API call. GPTCache (open-source, 2023) reports cache hit rates of 30–40% in customer-support applications where users ask similar questions repeatedly.
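The lookup logic behind semantic caching can be sketched without any external library. This is not the GPTCache API: the class, the linear scan, the 0.9 threshold, and the keyword-count `toy_embed` function are all assumptions made for illustration, and a real system would use an embedding model and a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: text -> list of numbers
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries = []           # list of (embedding, response)

    def lookup(self, query):
        """Return the cached response for the most similar past query, or None."""
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        for cached_vector, response in self.entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query, response):
        self.entries.append((self.embed(query), response))


def toy_embed(text):
    # Toy embedding (assumption): counts of a few keywords, not a real model.
    words = text.lower().split()
    return [words.count(k) for k in ("refund", "order", "password", "reset")]


cache = SemanticCache(toy_embed)
cache.store("how do I reset my password", "Use the reset link on the login page.")
print(cache.lookup("password reset please"))  # similar query: served from cache
print(cache.lookup("where is my order"))      # dissimilar query: None, call the model
```

The threshold is the main tuning knob: set too low, users receive answers to questions they did not ask; set too high, the hit rate collapses.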

Technique	Reported Figure	Context
Model Routing Savings	40–80%	Routing 70–90% of tasks to cheaper models
Semantic Cache Hit Rate	30–40%	In customer-support applications
Prompt Compression	4–20×	LLMLingua compression ratio
max_tokens Reduction	40–60%	Cost saved by constraining output length

Strategy 3: Batch Processing

Many AI workloads do not require real-time responses. Document processing, report generation, data classification, and embedding generation are all candidates for batch API processing. OpenAI's Batch API and Anthropic's equivalent offer 50% discounts for jobs submitted asynchronously. A company processing 100,000 documents per day for classification could halve its model costs simply by shifting from real-time to batch — with no change to model, prompt, or output quality.
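The effect of the batch discount on a daily workload can be estimated as below. The 100,000 documents per day and the 50% discount come from the text; the per-document token counts and the use of GPT-4o prices are assumptions chosen only to make the arithmetic concrete.

```python
def daily_cost(documents, input_tokens, output_tokens,
               input_per_m, output_per_m, batch_discount=0.0):
    """Daily model cost in USD; batch_discount is a fraction such as 0.50."""
    per_doc = (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000
    return documents * per_doc * (1 - batch_discount)


# Assumption: 1,500 input and 50 output tokens per classified document,
# priced at the GPT-4o rates quoted in Lesson 1.
realtime = daily_cost(100_000, 1_500, 50, 2.50, 10.00)
batch = daily_cost(100_000, 1_500, 50, 2.50, 10.00, batch_discount=0.50)
print(f"real-time: ${realtime:,.2f} per day")
print(f"batch:     ${batch:,.2f} per day")
```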

Real Pattern: Shopify Embedding Optimization

Shopify disclosed in engineering posts that they optimized their product search embeddings by generating embeddings only when product data changes (event-driven) rather than re-embedding everything on a schedule. Combined with a local vector cache, this reduced their embedding API calls by approximately 73%. Embeddings are cheap per token but aggregate to significant cost at Shopify's scale of millions of active product listings.

Strategy 4: Fine-Tuning for Cost Reduction

Counter-intuitively, fine-tuning a smaller model on your specific task can be cheaper than using a larger frontier model with few-shot examples. A fine-tuned GPT-4o-mini may outperform GPT-4o with an 8-shot prompt on a narrow task — at 10× lower inference cost. The fine-tuning cost is a one-time investment; the savings accrue on every inference call thereafter.

OpenAI's fine-tuning pricing as of 2024: $25 per million training tokens plus higher inference pricing per token than the base model. The break-even analysis depends on request volume: high-volume, narrow tasks (e.g., product categorization, intent classification) typically reach break-even within weeks and generate substantial long-term savings.
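A rough break-even estimate divides the one-time training cost by the per-request saving. In the sketch below, the $25 per million training tokens is the figure quoted above, while the training-set size and both per-request costs are hypothetical inputs you would replace with your own measurements.

```python
import math


def break_even_requests(training_tokens, training_price_per_m,
                        base_cost_per_request, tuned_cost_per_request):
    """Requests needed before per-request savings repay the one-time training cost."""
    savings = base_cost_per_request - tuned_cost_per_request
    if savings <= 0:
        return None  # fine-tuning never pays back on cost alone
    training_cost = training_tokens * training_price_per_m / 1_000_000
    return math.ceil(training_cost / savings)


# Assumptions: 2 million training tokens; $0.004 per request on the larger
# model versus $0.0004 per request on the fine-tuned smaller model.
print(break_even_requests(2_000_000, 25.00, 0.004, 0.0004))
```

This ignores the engineering time spent preparing training data and evaluating the tuned model, which for many teams is the larger part of the upfront investment.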

Strategy 5: Output Constraints

The most underutilized cost lever is max_tokens. Set it explicitly on every API call. If your application displays summaries in a 200-word UI element, there is no reason to allow 1,000 output tokens, yet many teams leave max_tokens unset, allowing the model to generate up to its maximum output length. Structured output formats (JSON mode, function calling) also constrain verbosity: a structured JSON response is typically 40–60% fewer tokens than an equivalent prose response.
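One way to reason about max_tokens is as a ceiling on worst-case cost per request. The sketch below compares two caps; the function name, the 500-token input, and the use of GPT-4o prices from Lesson 1 are assumptions for illustration.

```python
def worst_case_cost(input_tokens, max_tokens, input_per_m, output_per_m):
    """Upper bound on one request's cost when output is capped at max_tokens."""
    return (input_tokens * input_per_m + max_tokens * output_per_m) / 1_000_000


# Assumption: 500 input tokens per request at GPT-4o prices (USD per million).
capped = worst_case_cost(500, 300, 2.50, 10.00)
uncapped_example = worst_case_cost(500, 1_000, 2.50, 10.00)
print(f"cap at 300 tokens:   at most ${capped:.5f} per request")
print(f"cap at 1,000 tokens: at most ${uncapped_example:.5f} per request")
```

A cap bounds the bill but truncates rather than shortens the response, so it works best paired with a prompt instruction asking for output that fits within the limit.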

Model Routing: Directing requests to different model tiers based on task complexity, sending simple tasks to cheap models and complex tasks to powerful ones.

Semantic Caching: Storing past query–response pairs by embedding and returning cached responses for semantically similar new queries.

Prompt Compression: Using a smaller model (e.g., LLMLingua) to compress verbose system prompts by removing low-information tokens.

Cascading: Starting with a cheap model and escalating to a powerful one only when the first model's output fails a quality check.

The Optimization Hierarchy

Apply in this order for fastest ROI: (1) Set max_tokens on every call — immediate, zero-risk savings. (2) Route simple tasks to cheaper models — moderate effort, large savings. (3) Implement semantic caching — moderate effort, high savings for repetitive queries. (4) Compress system prompts — higher effort, high savings at scale. (5) Fine-tune for narrow tasks — high upfront investment, large long-term savings at volume.

Module 6 · Lesson 4 Quiz

Cost Optimization Strategies

Three questions.

1. How did Brex reduce its LLM costs by over 80% for expense categorization?

Correct. Brex's primary lever was model routing — using Haiku for straightforward categorizations (the vast majority) and escalating to a more capable model only for edge cases. The per-request cost dropped from $0.018 to $0.0031.

Brex's main technique was model routing — routing 90% of expense categorization requests to the cheaper Claude Haiku, reserving the powerful model for genuinely ambiguous cases. This alone drove most of the 80%+ reduction.

2. What does "semantic caching" accomplish in AI cost optimization?

Correct. Semantic caching embeds queries and compares new queries against cached embeddings. When similarity is above a threshold, the cached response is returned without an API call. In customer-support applications, hit rates of 30–40% have been reported.

Semantic caching stores past query–response pairs by embedding. New queries are embedded and compared to the cache; if sufficiently similar, the cached answer is returned without calling the model API.

3. Which of these cost optimization strategies should typically be applied FIRST because it is immediate and zero-risk?

Correct. Setting max_tokens is a one-line change with immediate effect, zero risk, and no architecture changes required. Every other optimization involves more complexity. If your application shows summaries in a 200-word UI, capping output at 300 tokens costs nothing to implement and immediately reduces output token spend.

Per the optimization hierarchy: start with max_tokens (immediate, zero-risk, one-line change), then model routing, then caching, then prompt compression, then fine-tuning. Fine-tuning has the highest upfront investment and takes longest to show ROI.

Module 6 · Lab 4

Cost Optimization Planning

Apply the optimization hierarchy to a real product scenario and build an action plan.

Lab Scenario

You are the engineering lead at an e-commerce AI startup. Your product uses GPT-4o for three features: product description generation ($41K/month), customer service chatbot ($38K/month), and personalized recommendation explanations ($29K/month). Total: $108K/month. Your CFO has asked for a plan to reach $65K/month within 90 days without degrading user experience.

Work with the assistant to apply the optimization hierarchy to each feature, calculate projected savings, and build a 90-day implementation roadmap.

Start by asking: "Apply the optimization hierarchy to our three AI features and estimate which combination gets us to $65K/month."


Module 6

Module Test — Cost Monitoring and Budgeting

15 questions · Score 80% or above to pass this module.

1. What is a "token" as used in LLM API pricing?

Correct. A token is approximately 0.75 English words, though the exact mapping varies by tokenizer and language. Non-Latin scripts often tokenize less efficiently.

A token is roughly 0.75 English words — a subword unit defined by the vendor's tokenizer. Non-Latin scripts may tokenize less efficiently (more tokens per word).
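For quick budgeting, the 0.75 words-per-token rule of thumb can be turned into a rough estimator. This is only a sketch for English text: `estimate_tokens` is a hypothetical helper, not a tokenizer, and real counts should come from the vendor's own tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English text, from the lesson


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count; not a real tokenizer."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


print(estimate_tokens("cost monitoring"))
```

The estimate will undercount for non-Latin scripts and for text heavy in code, numbers, or rare words, which is why it is suited to planning rather than billing.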

2. Why do most LLM APIs price output tokens at 3–5× the rate of input tokens?

Correct. Autoregressive generation — attending to all previous tokens to produce each new one — requires repeated model forward passes. This compute asymmetry directly drives the pricing asymmetry.

The pricing difference is compute-driven. Autoregressive generation requires one forward pass per output token; input processing is a single pass. This makes output generation fundamentally more expensive.

3. OpenAI's Batch API offers what discount compared to the standard real-time API?

Correct. The Batch API (and Anthropic's equivalent) offers a 50% discount for asynchronous jobs that complete within 24 hours. It is ideal for document processing, classification, and embedding generation.

The Batch API offers 50% off standard pricing, making it ideal for non-real-time workloads like document classification, report generation, and bulk embedding tasks.
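The pricing rules from questions 2 and 3 can be combined into one small cost function. The per-million-token prices are the May 2025 figures quoted in Lesson 1; the `request_cost` helper and the model keys are illustrative names, and the flat 50% batch multiplier is a simplification of how a vendor would actually bill.

```python
# USD per million tokens, as quoted in Lesson 1 (May 2025).
PRICES_PER_MILLION = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
}
BATCH_DISCOUNT = 0.50  # 50% off for asynchronous batch jobs


def request_cost(model: str, input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Return the USD cost of one request, pricing input and output separately."""
    price = PRICES_PER_MILLION[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return cost * (1 - BATCH_DISCOUNT) if batch else cost


realtime = request_cost("gpt-4o", input_tokens=1_000, output_tokens=1_000)
batched = request_cost("gpt-4o", input_tokens=1_000, output_tokens=1_000, batch=True)
print(f"real-time: ${realtime:.5f}  batch: ${batched:.5f}")
```

With equal input and output counts, the output side accounts for most of the bill, which is why trimming response length usually pays off faster than trimming prompts.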

4. What was the primary lesson from the Google Cloud Imagen billing incident of 2024?

Correct. Google sent no real-time alert during the $47,000 incident — only a next-day notification. The lesson is that alerts (notifications) are not controls (enforcement). Hard limits that block API calls are necessary alongside alerts.

The Imagen incident demonstrated that alerts alone are insufficient. A $47,000 overnight bill accumulated with no real-time intervention because there was no hard enforcement limit — only delayed notification.

5. In a layered budget architecture, what does Layer 1 (application layer) do that vendor-side controls cannot?

Correct. Only the application layer can prevent cost before it is incurred — by rejecting a request without ever calling the API. Vendor controls only activate after the API call has been initiated.

The application layer is unique in acting before the API call. If a per-user token budget is exhausted, the request never reaches the vendor — so zero cost is incurred. Vendor controls can only block or throttle after they receive the call.
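A minimal sketch of an application-layer guard follows, assuming an in-memory counter and a caller-supplied `call_model` function that returns the response text and the tokens it consumed. All names here (`UserTokenBudget`, `guarded_call`, `BudgetExceeded`) are hypothetical; a production version would persist usage and reset it per billing period.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push a user past their token budget."""


class UserTokenBudget:
    def __init__(self, limit_per_user: int):
        self.limit = limit_per_user
        self.used = {}  # user_id -> tokens consumed so far

    def remaining(self, user_id: str) -> int:
        return self.limit - self.used.get(user_id, 0)

    def check(self, user_id: str, estimated_tokens: int) -> None:
        if estimated_tokens > self.remaining(user_id):
            raise BudgetExceeded(
                f"user {user_id} has {self.remaining(user_id)} tokens left"
            )

    def record(self, user_id: str, actual_tokens: int) -> None:
        self.used[user_id] = self.used.get(user_id, 0) + actual_tokens


def guarded_call(budget, user_id, prompt, call_model, estimated_tokens):
    """Reject over-budget requests before the vendor API is ever called."""
    budget.check(user_id, estimated_tokens)  # raises here, so no cost is incurred
    text, tokens_used = call_model(prompt)   # only reached when within budget
    budget.record(user_id, tokens_used)
    return text
```

The important design point is the ordering: the check runs before `call_model`, so a rejected request never reaches the vendor and costs nothing.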

6. What best practice should be applied to API keys across development, staging, and production environments?

Correct. Separate keys with appropriate hard limits per environment contain blast radius. A $50/month hard limit on dev keys means a runaway test script costs at most $50.

Each environment should have its own key with a calibrated hard limit. Dev keys should have small hard limits — a runaway script in dev should cost $50, not $47,000.
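One way to picture per-environment keys is as a small policy table that an internal gateway consults before forwarding a request. Only the $50 dev limit comes from the text above; the staging and production limits, the environment variable names, and the `allow_request` helper are hypothetical placeholders. Vendor-side hard limits should still be configured in the vendor's own console.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvKeyPolicy:
    key_env_var: str               # name of the env var holding this environment's key
    monthly_hard_limit_usd: float  # spend ceiling enforced for this key


POLICIES = {
    "dev": EnvKeyPolicy("LLM_API_KEY_DEV", 50.0),                  # from the lesson
    "staging": EnvKeyPolicy("LLM_API_KEY_STAGING", 500.0),         # hypothetical
    "production": EnvKeyPolicy("LLM_API_KEY_PRODUCTION", 120_000.0),  # hypothetical
}


def allow_request(env: str, spent_usd: float, estimated_cost_usd: float) -> bool:
    """Return False when a request would push this environment past its hard limit."""
    policy = POLICIES[env]
    return spent_usd + estimated_cost_usd <= policy.monthly_hard_limit_usd


print(allow_request("dev", spent_usd=49.0, estimated_cost_usd=0.5))
```

Keeping one key per environment means a leaked or runaway dev key can be revoked or capped without touching production traffic.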

7. What is the purpose of cost attribution in AI deployment?

Correct. Attribution transforms an opaque total bill into actionable data — enabling ROI analysis at the feature level, identifying waste, and establishing team accountability for AI spend.

Attribution answers "who spent what and why." Without it, a large AI bill is undifferentiated — you can't determine which features are worth their cost or where to optimize.
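In practice, attribution usually means tagging every request with metadata and rolling the costs up afterwards. The sketch below assumes each usage record already carries `team`, `feature`, and `cost_usd` fields; those field names, the sample records, and the `attribute_costs` helper are all made up for illustration.

```python
from collections import defaultdict


def attribute_costs(records: list) -> dict:
    """Sum cost per (team, feature) pair from tagged usage records."""
    totals = defaultdict(float)
    for record in records:
        totals[(record["team"], record["feature"])] += record["cost_usd"]
    return dict(totals)


# Made-up sample records, for illustration only.
usage_log = [
    {"team": "support", "feature": "chatbot", "cost_usd": 1.50},
    {"team": "support", "feature": "chatbot", "cost_usd": 2.50},
    {"team": "catalog", "feature": "descriptions", "cost_usd": 0.25},
]

for (team, feature), cost in sorted(attribute_costs(usage_log).items()):
    print(f"{team}/{feature}: ${cost:.2f}")
```

The same rollup supports both showback and chargeback; the difference is only in what the organization does with the resulting numbers.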

8. Which open-source tool provides a unified proxy for 100+ LLM APIs with built-in cost tracking?

Correct. LiteLLM is an open-source proxy that provides a unified interface to 100+ LLM APIs with built-in cost tracking, model routing, and logging. Helicone is a logging proxy; LangSmith is LangChain's observability platform; GPTCache is a semantic caching library.

LiteLLM is the unified proxy for 100+ APIs. Helicone is specifically a logging/cost-tracking proxy. LangSmith is LangChain's observability tool. GPTCache is for semantic caching.

9. What distinguishes "showback" from "chargeback" in internal AI cost management?

Correct. Same data, different consequence. Showback informs; chargeback allocates. Teams typically move from showback to chargeback as trust in attribution data grows.

Showback = visibility without financial consequence. Chargeback = costs actually hit the team's P&L. The distinction is whether awareness translates to financial accountability.

10. What is "model cascading" in cost optimization?

Correct. Cascading means trying the cheap model first; if its response passes a quality gate, return it. If not, escalate to a more powerful model. This achieves near-frontier quality at near-budget-model cost for most requests.

Cascading: cheap model first → quality check → escalate to powerful model only on failure. Tools like RouteLLM take a related approach, using a router with a configurable threshold to decide which queries go to the stronger model.
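The cascading pattern can be sketched in a few lines, assuming the two models and the quality gate are supplied as plain callables. The `cascade` function and its parameter names are hypothetical, and the quality gate is left abstract because a good one is application-specific (a length check, a schema validation, a classifier score, and so on).

```python
from typing import Callable, Tuple


def cascade(
    prompt: str,
    cheap_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    passes_quality_gate: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Try the cheap model first; escalate only if its answer fails the gate.

    Returns the response and the tier ("cheap" or "strong") that produced it.
    """
    draft = cheap_model(prompt)
    if passes_quality_gate(prompt, draft):
        return draft, "cheap"
    return strong_model(prompt), "strong"
```

Note that an escalated request pays for both calls, so cascading only saves money when the cheap model passes the gate often enough to offset the double charge on failures.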

11. Microsoft's recommended phased approach to internal AI chargeback begins with:

Correct. Microsoft's Azure FinOps documentation recommends: Months 1–3 showback only, Months 4–6 soft chargeback (teams see allocated costs, no financial consequence), Month 7+ hard chargeback. This builds trust before applying financial pressure.

The phased approach starts with showback only. Jumping to hard chargeback before teams trust the attribution data creates organizational friction and resistance.

12. What compression ratio can LLMLingua achieve on verbose prompts?

Correct. LLMLingua (Microsoft Research, 2023) achieves 4–20× compression by using a smaller model to remove low-information tokens from prompts, with less than 5% performance degradation on many benchmarks.

LLMLingua benchmarks show 4–20× compression with less than 5% performance impact. The technique uses a smaller model to identify and remove tokens that contribute little to meaning.
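To see what a 4–20× ratio means for the bill, the arithmetic below converts a compression ratio into input-token savings. This is plain arithmetic rather than LLMLingua's API; the function names and the 2,000-token prompt are illustrative assumptions.

```python
def compressed_tokens(original_tokens: int, compression_ratio: float) -> float:
    """Prompt size after compression at the given ratio (for example 4.0 for 4x)."""
    if compression_ratio < 1:
        raise ValueError("compression_ratio must be at least 1")
    return original_tokens / compression_ratio


def fraction_saved(compression_ratio: float) -> float:
    """Share of input tokens removed at the given compression ratio."""
    return 1 - 1 / compression_ratio


for ratio in (4, 20):
    remaining = compressed_tokens(2_000, ratio)
    print(f"{ratio}x: {remaining:.0f} tokens remain, {fraction_saved(ratio):.0%} saved")
```

Compression only shrinks the input side of the bill, so the overall saving depends on how much of a request's cost comes from the prompt rather than the response.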

13. What cache hit rate does GPTCache report in customer-support applications using semantic caching?

Correct. GPTCache reports 30–40% cache hit rates in customer-support applications where users ask semantically similar questions repeatedly. Each hit eliminates one API call entirely.

GPTCache reports 30–40% semantic cache hit rates in customer-support use cases — meaning 30–40% of queries can be answered without any API call, purely from cached responses to similar past queries.
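The idea behind a semantic cache can be shown with a toy version. This sketch is not GPTCache's API: it swaps in a bag-of-words vector for a real embedding model, and the `SemanticCache` class, the `embed` function, and the 0.8 similarity threshold are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, cached_response)
        self.hits = 0
        self.misses = 0

    def get(self, query: str):
        """Return the cached response for the most similar past query, or None."""
        query_vec = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            self.hits += 1
            return best_response
        self.misses += 1
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))
```

On a miss, the application calls the API and stores the result with `put`; the ratio of `hits` to total lookups is the cache hit rate the question refers to. The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask.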

14. What is the first optimization that should be applied in the cost optimization hierarchy?

Correct. max_tokens is a one-line change that is immediate, zero-risk, and requires no architectural changes. It is the highest-ROI-per-effort optimization available and should always be applied first.

Start with max_tokens: it's a one-line change with immediate impact and zero risk. The hierarchy continues: model routing → semantic caching → prompt compression → fine-tuning. Fine-tuning has the highest upfront cost and takes longest to show ROI.

15. Shopify reduced its embedding API calls by approximately 73% through which technique?

Correct. Shopify switched from scheduled re-embedding to event-driven embedding (only when product data changes), combined with a local vector cache. This eliminated redundant embedding calls — the main source of wasted spend.

Shopify's approach was event-driven + local cache. Rather than re-embedding everything on a schedule, they only generate embeddings when the underlying product data actually changes — eliminating the massive redundant re-computation cost.
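The event-driven pattern can be sketched with a content hash and a local store. This is an illustration of the general technique, not Shopify's actual implementation: the `EmbeddingCache` class, the `on_product_changed` handler, and the caller-supplied `embed_fn` are hypothetical names.

```python
import hashlib


class EmbeddingCache:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # function that calls the embedding API
        self._store = {}          # product_id -> (content_hash, vector)
        self.api_calls = 0

    def on_product_changed(self, product_id: str, text: str):
        """Handle a change event; re-embed only if the text actually differs."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = self._store.get(product_id)
        if cached is not None and cached[0] == digest:
            return cached[1]  # content unchanged: reuse the stored vector
        self.api_calls += 1
        vector = self.embed_fn(text)
        self._store[product_id] = (digest, vector)
        return vector
```

Hashing the content guards against events that fire without a meaningful change, such as a save that leaves the description untouched, so the embedding API is only called when the text is genuinely new.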