Module 6 · Lesson 1

Serving Models at Scale

From Jupyter notebook to production endpoint — the gap most teams underestimate.

What does it actually take to serve an LLM to 100,000 users without it falling over?

When ChatGPT crossed one million users in five days after its November 2022 launch, OpenAI's infrastructure team faced continuous capacity emergencies. CTO Mira Murati publicly acknowledged that the service was "incredibly expensive" and capacity-constrained for months. The API — launched separately in March 2023 — had to use rate-limiting and waitlists precisely because GPU memory is finite and serving transformer models at scale is orders of magnitude harder than serving a REST API that returns JSON from a database.

Why Model Serving Is Different

A typical web API is stateless and cheap to scale: spin up more containers and add load balancing. LLM serving is fundamentally different. A 7B-parameter model needs roughly 14 GB of GPU VRAM just to hold its weights in BFloat16. On top of that, the KV cache grows with every token processed or generated. A single 4,096-token sequence on a 7B model without grouped-query attention needs about 2 GB of KV cache, so a handful of concurrent requests can consume another 4–8 GB of VRAM.

This means batching strategy, GPU memory management, and tokenizer throughput are first-class engineering concerns, not afterthoughts.
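The memory arithmetic above can be sketched in a few lines. The layer count, KV-head count, and head dimension are assumptions that match a Llama-2-7B-style architecture without grouped-query attention; substitute the values from your own model's config.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold model weights (BF16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(seq_len: int, batch_size: int = 1, n_layers: int = 32,
                n_kv_heads: int = 32, head_dim: int = 128,
                bytes_per_elem: int = 2) -> float:
    """KV cache size: one key and one value vector per layer, head, and token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * seq_len * batch_size / 1e9


print(f"7B weights in BF16: {weight_memory_gb(7e9):.1f} GB")
print(f"KV cache, one 4,096-token sequence: {kv_cache_gb(4096):.2f} GB")
print(f"KV cache, four such sequences: {kv_cache_gb(4096, batch_size=4):.2f} GB")
```

Under these assumptions the cache scales linearly with both sequence length and the number of concurrent sequences, which is why batch size is bounded by VRAM rather than by compute alone.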

Concept

Continuous Batching

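Rather than waiting for all requests in a batch to finish before starting new ones, a continuous-batching scheduler (as in vLLM or TGI) admits new requests into the running batch as soon as other sequences complete, cutting GPU idle time dramatically. In vLLM, PagedAttention supplies the KV cache memory management that makes this practical.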
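A toy simulation makes the scheduling difference concrete. It counts decode steps only and assumes every request is already queued; the function names are illustrative and not part of any serving framework.

```python
from collections import deque
from typing import List


def static_batch_steps(output_lengths: List[int], slots: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    return sum(max(output_lengths[i:i + slots])
               for i in range(0, len(output_lengths), slots))


def continuous_batch_steps(output_lengths: List[int], slots: int) -> int:
    """A freed slot is refilled from the queue before the next decode step."""
    queue = deque(output_lengths)
    active: List[int] = []  # remaining tokens per running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1  # one decode step generates one token per active sequence
        active = [left - 1 for left in active if left - 1 > 0]
    return steps


lengths = [10, 2, 2, 2, 10, 2]
print("static:", static_batch_steps(lengths, slots=2))
print("continuous:", continuous_batch_steps(lengths, slots=2))
```

With mixed short and long outputs, the static scheduler leaves a slot idle while the long sequence finishes; the continuous scheduler keeps both slots busy for most of the run.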

Concept

Quantization

INT8 or INT4 quantization reduces VRAM by 2–4×. Llama 3 8B at INT4 via llama.cpp fits on a single consumer GPU with ~5 GB VRAM, enabling local and edge deployment that BF16 cannot.

Concept

KV Cache

The key-value attention cache grows linearly with sequence length. Systems like vLLM use paged memory (like virtual memory in OSes) to prevent fragmentation and OOM errors under variable-length batches.
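To see why paging helps, compare how many token slots each approach reserves. This is a simplified sketch: the 16-token block size is an assumed value, and real allocators also track block tables and sharing.

```python
import math
from typing import List


def reserved_contiguous(seq_lens: List[int], max_len: int) -> int:
    """Pre-allocating the full context window for every sequence."""
    return len(seq_lens) * max_len


def reserved_paged(seq_lens: List[int], block_size: int = 16) -> int:
    """Allocating fixed-size blocks on demand; waste is under one block per sequence."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


lens = [100, 900, 30]
print("contiguous slots:", reserved_contiguous(lens, max_len=4096))
print("paged slots:", reserved_paged(lens))
```

The gap between the two numbers is memory that a paged allocator can hand to additional requests.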

Concept

Tensor Parallelism

Large models (70B+) split weight matrices across multiple GPUs. Mistral-7B can run on one A100; Llama 3 70B in BF16 has about 140 GB of weights, so it requires at least two A100 80GB cards in tensor-parallel mode.
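The core idea of column-wise tensor parallelism fits in a small pure-Python sketch: each "device" holds a slice of the weight matrix's columns, computes its partial output, and the slices are concatenated. Real systems do this with GPU collectives; the helper names here are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Give each device an equal slice of the weight matrix's columns."""
    n_cols = len(w[0])
    assert n_cols % parts == 0, "columns must divide evenly across devices"
    step = n_cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    # Each partial product would run on its own GPU in a real deployment.
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenate the per-device output columns (an all-gather in practice).
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(tensor_parallel_matmul(x, w, parts=2) == matmul(x, w))
```

Each device stores only its share of the columns, which is how a model too large for one card fits across several.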

Serving Frameworks in 2024–2025

Three frameworks dominate production LLM serving today:

vLLM — Open source, supports OpenAI-compatible API, continuous batching, PagedAttention. Default choice for self-hosted open-weight models like Llama, Mistral, and Qwen.
TGI (Text Generation Inference) — Hugging Face's production server. Strong integration with HF Hub model cards. Flash Attention 2 built in. Used by Hugging Face Inference Endpoints and many European enterprises.
LiteLLM + OpenAI SDK — A proxy layer that normalises multiple backend APIs (OpenAI, Anthropic, Bedrock, Vertex) behind a single OpenAI-compatible interface. Useful for multi-provider routing and fallbacks.

Production Pattern

A common 2024 stack: Nginx → LiteLLM proxy → vLLM backend (self-hosted) with Anthropic Claude API as overflow fallback. The proxy handles rate limiting, logging, and cost allocation; vLLM handles on-prem inference; Anthropic handles burst traffic that exceeds GPU capacity.
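The overflow-fallback behaviour of such a stack can be sketched as an ordered list of backends tried in turn. The backend callables are hypothetical stand-ins for real client calls; a proxy such as LiteLLM provides this behaviour through its own configuration, which is not shown here.

```python
from typing import Callable, List, Sequence, Tuple

Backend = Tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in priority order; return (backend_name, completion)."""
    failures: List[str] = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # e.g. timeout or HTTP 429/503 from a saturated backend
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(failures))


def saturated_self_hosted(prompt: str) -> str:
    raise TimeoutError("GPU queue full")


def hosted_api(prompt: str) -> str:
    return f"summary of: {prompt}"


print(complete_with_fallback("note text", [("self-hosted", saturated_self_hosted),
                                           ("hosted-api", hosted_api)]))
```

Returning the backend name alongside the completion lets the proxy log which provider served each request, which is what cost allocation depends on.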

Deployment Targets

Where your model runs determines everything about latency, cost, and compliance:

Cloud GPU (A100/H100): Highest performance, ~$2–4/hr per GPU. Modal, RunPod, Together.ai, AWS p4d instances. Good for high-throughput batch jobs and large models.

Managed Endpoints: AWS Bedrock, Azure AI Foundry, Google Vertex AI. No infrastructure to manage; pay per token. Vendor lock-in risk but fastest path to production.

Edge / On-Prem: Ollama, llama.cpp on MacBook M-series or NVIDIA workstations. Latency can be low, cost near-zero, data never leaves premises. Maximum model size limited by local VRAM.

Serverless Inference: Replicate, Modal, Hugging Face Serverless. Cold starts (5–30s) are the tradeoff for zero idle cost. Acceptable for async pipelines; bad for real-time chat.

Core Serving Metrics to Track

TTFT: Time to First Token

TPS: Tokens per Second

P99 latency: 99th percentile

Throughput: Requests/sec

GPU utilization: %

OOM rate: Out-of-memory errors

TTFT is often more important than total latency for conversational UX — users tolerate a 3-second wait for the first token far better than a 3-second wait between every few words. Streaming output (server-sent events or WebSockets) is therefore essential in any chat interface.
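TTFT and decode speed can be measured directly from any streaming response. The sketch below works on a generic iterator of tokens; the `clock` parameter is an illustrative addition that lets timing be replaced in tests.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Dict[str, Optional[float]]:
    """Consume a token stream and report TTFT and decode tokens per second."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock()  # the first token arrived
        count += 1
    end = clock()
    if first_token_at is None:
        return {"ttft_s": None, "tokens": 0, "tokens_per_s": None}
    decode_time = end - first_token_at
    # Decode rate excludes the first token, whose latency is the TTFT itself.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"ttft_s": first_token_at - start, "tokens": count, "tokens_per_s": rate}


print(measure_stream(iter(["Hello", ",", " world"])))
```

Reporting the two numbers separately matters because they have different causes: TTFT reflects queueing and prompt processing, while tokens per second reflects decode throughput.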

Key Insight

The gap between "model works in a notebook" and "model serves 1,000 concurrent users" is not an API call — it's an infrastructure engineering problem. Choosing the right serving framework, batching strategy, and deployment target before you write application code prevents painful rewrites at scale.

Lesson 1 Quiz — Model Serving

Four questions · Select the best answer · Instant feedback

1. What does "continuous batching" solve in LLM serving?

Correct. Continuous batching is a scheduling technique used by servers such as vLLM: it avoids waiting for the slowest sequence before starting the next batch.

Not quite. Continuous batching is about how requests are scheduled, not about compression, parallelism, or tokenizer caching.

2. You need a model that never sends data to external servers due to HIPAA requirements. Which deployment target is most appropriate?

Correct. On-premises deployment ensures data never leaves your infrastructure — the most reliable path to HIPAA compliance for inference workloads.

Cloud APIs, even with encryption or VPC endpoints, process data on external infrastructure. HIPAA itself permits cloud processing under a business associate agreement, but a requirement that data never reaches external servers leaves on-premises deployment as the only fit.

3. Which metric is most critical for a real-time conversational AI experience?

Correct. Users perceive TTFT as responsiveness. Streaming the first token quickly (even if subsequent tokens trickle) makes a system feel fast even when total generation is slow.

Throughput, GPU utilization, and parameter count matter for capacity planning, but TTFT directly drives user-perceived latency in chat applications.

4. What is the primary advantage of using LiteLLM as a proxy layer?

Correct. LiteLLM lets you switch between OpenAI, Anthropic, Bedrock, and self-hosted models without changing application code — and adds cost tracking and fallback routing.

LiteLLM is a proxy/routing layer. Quantization is handled by frameworks like llama.cpp; GPU memory paging is vLLM's job.

Lab 1 — Design a Production Serving Stack

Hands-on · AI-assisted · Complete 3+ exchanges to finish

Your Mission

You are the lead AI engineer at a healthcare analytics company. You need to deploy an open-weight LLM (Llama 3 8B) for an internal clinical note summarisation tool. Requirements: data must not leave your data center, latency must be under 2 seconds TTFT for 95% of requests, and up to 50 concurrent users during peak hours.

Work with the AI advisor below to design your serving stack. Explore trade-offs in framework choice, hardware selection, batching configuration, and fallback strategy. Push for specific, deployable recommendations.

Start by describing your current infrastructure constraints (e.g., "We have two NVIDIA A100 40GB GPUs on-prem and a 10Gbps internal network") and ask the advisor to recommend a serving configuration.

AI Serving Stack Advisor

Welcome. I'm your production serving stack advisor for this lab. Tell me about your current hardware — GPUs, CPU cores, RAM, and network topology — and I'll help you design a vLLM or TGI configuration that meets your clinical note summarisation requirements. What does your data center look like right now?

Module 6 · Lesson 2

Observability & Monitoring

You cannot improve what you cannot measure — and in production AI, the signals that matter are not the ones you're used to.

How did Meta detect that their ranking models had silently degraded months before any user complaint?

Meta's integrity team described a class of failures they called "silent model rot" — cases where a deployed model continued to produce outputs that looked correct to automated quality checks, but whose distribution had quietly shifted due to upstream feature drift. The team only detected the degradation when they implemented output distribution monitoring: tracking the statistical shape of model outputs (not just accuracy) over rolling 7-day windows. By the time the feature drift was traced, the model had been silently miscalibrated for eleven weeks. Meta now mandates output distribution alerts for all ranking and recommendation models.

The Three Layers of AI Observability

Monitoring an LLM in production requires thinking at three distinct levels simultaneously:

Infrastructure layer — GPU utilization, VRAM usage, latency percentiles (P50/P95/P99), request queue depth, OOM errors, container restarts. These are standard DevOps metrics but critical thresholds differ for ML workloads.
Model layer — token generation speed, output length distribution, refusal rate, confidence scores (if exposed), prompt token counts hitting context window limits.
Application layer — user satisfaction signals (thumbs up/down, session length, re-query rate), downstream task success rates, cost per successful completion.

What "Drift" Means for LLMs

Traditional ML drift monitoring compares input feature distributions to training data. LLMs are more complex. The relevant drift signals are:

Input drift: User prompts shifting to topics, languages, or complexity levels the model handles poorly. Detectable via embedding similarity to known failure cases.

Output drift: Distribution of response lengths, vocabulary, or topic coverage shifting, often indicating upstream context changes or system prompt drift.

Behavioral drift: Model refusal rate, hallucination rate, or toxicity rate changing. Critical for safety-critical applications. Requires sampling-based evaluation pipelines.

Model version drift: The underlying model weights changed (e.g., in 2023 OpenAI's unpinned gpt-4 alias moved from the gpt-4-0314 snapshot to gpt-4-0613) and your application behavior changed without you noticing.
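Output drift of the kind described above can be quantified without reading individual responses. One option, shown here as a sketch, is the Population Stability Index (PSI) over a histogram of response lengths. The bucket edges are assumptions, and the 0.25 alert level is a common rule of thumb, not something taken from the cases in this lesson.

```python
import math
from typing import List, Sequence


def bucket_fractions(values: Sequence[float], edges: Sequence[float],
                     eps: float = 1e-6) -> List[float]:
    """Share of values in each bucket; eps avoids log(0) for empty buckets."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(1 for e in edges if v >= e)] += 1
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(baseline: Sequence[float], current: Sequence[float],
                               edges: Sequence[float]) -> float:
    b = bucket_fractions(baseline, edges)
    c = bucket_fractions(current, edges)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


# Response lengths in tokens: last week's window vs. today's.
baseline_lengths = [50] * 50 + [150] * 50
current_lengths = [50] * 10 + [150] * 90
psi = population_stability_index(baseline_lengths, current_lengths, edges=[100])
print(f"PSI = {psi:.3f}", "ALERT" if psi > 0.25 else "ok")
```

The same function applies to any numeric output signal, such as refusal flags or judge scores, computed over rolling windows.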

Historical Case — OpenAI GPT-4 Version Change, 2023

In mid-2023, researchers at Stanford and UC Berkeley published a study showing that GPT-4's behavior had measurably changed between the March and June 2023 model snapshots. On the study's code generation benchmark, the share of GPT-4 outputs that were directly executable fell from 52% in March to 10% in June. Teams relying on the unpinned model endpoint had no visibility into this change. The lesson: always pin model versions in production, and run regression benchmarks on version bumps.

Tooling: What to Actually Deploy

The observability stack for production LLMs in 2024–2025 has largely converged around these tools:

Tracing

LangSmith / LangFuse

Traces every prompt, retrieval step, tool call, and response in LangChain or custom pipelines. LangFuse is open-source self-hosted; LangSmith is LangChain's managed product. Both export to OTEL.

Metrics

Prometheus + Grafana

vLLM and TGI expose Prometheus metrics natively. Build Grafana dashboards for TTFT, TPS, GPU memory, queue depth, and error rates. Alert on P99 latency breaches.

Evaluation

Continuous Evals

Run a subset of your golden test set against production endpoints daily. Tools: Braintrust, Patronus AI, or a custom eval loop with a separate, stronger LLM as judge. Alert on >5% regression.

Logs

Structured Logging

Log every request with prompt hash, user ID, model version, latency, token counts, and output hash. Ship to Elasticsearch or Loki. Never log raw PII — log anonymized hashes for traceability.
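A structured log record along these lines might look as follows. The field names and the salted-hash scheme are assumptions for illustration; note that hashing pseudonymises identifiers but does not make them anonymous, so the salt should be kept secret.

```python
import hashlib
import json
import time


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_log_record(prompt: str, output: str, user_id: str, model_version: str,
                     latency_s: float, prompt_tokens: int, completion_tokens: int,
                     salt: str = "") -> dict:
    """Build a log entry holding hashes and counts, never raw prompt or output text."""
    return {
        "timestamp": time.time(),
        "user_hash": sha256_hex(salt + user_id),
        "model_version": model_version,
        "prompt_hash": sha256_hex(prompt),
        "output_hash": sha256_hex(output),
        "latency_s": latency_s,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }


record = build_log_record("Summarise the note for patient ...", "Summary ...",
                          user_id="clinician-17", model_version="llama-3-8b",
                          latency_s=0.84, prompt_tokens=412, completion_tokens=187,
                          salt="rotate-me")
print(json.dumps(record))  # one JSON object per line, ready for Loki or Elasticsearch
```

Identical prompts hash to the same value, so repeated requests and cache candidates remain traceable without storing the text itself.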

Alerting Thresholds — Practical Starting Points

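# Prometheus alert rules for LLM serving
# Metric names are illustrative; map them to the names your server exports.
groups:
  - name: llm_serving
    rules:
      - alert: HighTTFT
        # Aggregate histogram buckets across instances before taking the quantile
        expr: histogram_quantile(0.99, sum by (le) (rate(ttft_seconds_bucket[5m]))) > 3.0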
        for: 2m
        annotations:
          summary: "P99 TTFT exceeded 3s — check GPU saturation"

      - alert: HighRefusalRate
        expr: sum(rate(refusal_total[10m])) / sum(rate(requests_total[10m])) > 0.1
        for: 5m
        annotations:
          summary: "Refusal rate above 10% — possible prompt injection or system prompt drift"

      - alert: GPUMemoryPressure
        expr: gpu_memory_used_bytes / gpu_memory_total_bytes > 0.92
        for: 1m
        annotations:
          summary: "GPU memory at 92%+ — OOM risk imminent"

Key Insight

The most dangerous production failures are not crashes — they are silent degradations. Your monitoring stack must be able to detect when your model's behavior shifts even if infrastructure metrics look healthy. Build output distribution tracking from day one, not as an afterthought.

Lesson 2 Quiz — Observability

Four questions · Select the best answer · Instant feedback

1. Meta's "silent model rot" was detected via which monitoring approach?

Correct. Meta specifically called out output distribution monitoring — tracking how the shape and range of outputs changes — not just binary correctness.

User complaints lagged by months. The key was tracking output distributions proactively.

2. You notice your GPT-4 application's code generation success rate dropped from 95% to 87% with no code changes on your side. What is the most likely cause?

Correct. This is the exact pattern Stanford researchers documented in 2023 — GPT-4 behavior changed between the March and June snapshots while the endpoint name remained the same.

This pattern — stable API endpoint, changed model behavior — is model version drift. The solution is to pin to dated model versions (e.g., gpt-4-0314) and run regression benchmarks on upgrades.

3. Which tool is best suited for tracing every step of a multi-step LLM pipeline (retrieval, reranking, generation) with full prompt/response visibility?

Correct. LangFuse and LangSmith are purpose-built for tracing LLM pipeline steps — they capture prompts, responses, latencies, token counts, and costs at each step.

Prometheus and Grafana handle infrastructure metrics; PagerDuty handles alerting. Pipeline tracing with prompt/response capture requires LLM-specific tools like LangFuse.

4. A refusal rate alert fires after 5 minutes above 10%. What are the two most likely root causes?

Correct. A sudden spike in refusals typically means either malicious users are probing the system (prompt injection) or the system prompt was accidentally changed, altering guard rails.

Network issues and batching configs affect latency and throughput, not refusal rate. Refusal rate is a behavioral signal pointing to prompt-level issues.

Lab 2 — Build a Monitoring & Alerting Plan

Hands-on · AI-assisted · Complete 3+ exchanges to finish

Your Mission

Your clinical note summarisation tool is in production. Now you need to build a monitoring and alerting stack. You have a Prometheus + Grafana setup already running for infrastructure monitoring. You need to design LLM-specific monitoring that catches silent model degradation, prompt injection attempts, and latency regressions.

Work with the monitoring advisor to define: which metrics to track, specific alert thresholds, which tool to use for trace-level observability, and how to detect output drift without reading every response.

Start by asking for a recommended Grafana dashboard layout for LLM serving, or describe a specific monitoring gap you want to close.

AI Monitoring Advisor

I'm your LLM monitoring advisor. Your clinical note summarisation tool is live — let's make sure you'll catch problems before users do. What's your biggest monitoring blind spot right now? Or tell me what infrastructure you already have and I'll help you design the LLM-specific layer on top.

Module 6 · Lesson 3

Continuous Evaluation Pipelines

Shipping model updates safely requires automated evaluation that runs before, during, and after every deployment.

How do you know if your new prompt template is actually better — before you push it to all users?

Google's Gemini team described their evaluation pipeline at a 2024 systems talk: every candidate model version runs against a tiered eval suite before any traffic reaches it. Tier 1 runs in minutes (unit-test-style golden examples), Tier 2 runs in hours (sampling-based evals with LLM-as-judge), Tier 3 runs in days (human preference comparisons on a stratified sample). A version must pass Tier 1 and Tier 2 before it ever receives shadow traffic. This prevents regressions from reaching users — even when regressions only appear on edge cases that weren't in the original test set.

The Evaluation Hierarchy

Think of evaluations in three tiers, each with different speed/depth tradeoffs:

Golden set evals (minutes) — A curated set of 100–500 input/output pairs with exact or near-exact expected outputs. Run on every commit. These catch obvious regressions and confirm core capabilities. Tools: pytest with custom assert functions, or Braintrust's unit test mode.
LLM-as-judge evals (hours) — Send a random 5% sample of production prompts through both old and new versions; use a stronger model (GPT-4o or Claude Sonnet) to score both responses on relevance, accuracy, and safety. Alert if the new version's head-to-head win rate drops below 45%, that is, more than 5 percentage points under parity.
Human evals (days/weeks) — For major model upgrades, route a small percentage of live traffic to the new version and collect real user signals (thumbs up/down, session continuation rate). This is the ground truth but has the highest cost and slowest signal.
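The first tier can be sketched as a small harness that a test runner such as pytest could call. The `generate` callable stands in for your model client, and the 0.9 similarity cutoff for "near-exact" matching is an assumed value to tune per task.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


def is_match(expected: str, actual: str, min_ratio: float = 0.9) -> bool:
    """Near-exact comparison that ignores case and surrounding whitespace."""
    a, b = expected.strip().lower(), actual.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


def run_golden_set(cases: List[Dict[str, str]], generate: Callable[[str], str],
                   min_ratio: float = 0.9) -> Tuple[float, List[Dict[str, str]]]:
    """Return (pass_rate, failing_cases) for a list of {'input', 'expected'} dicts."""
    failures = [c for c in cases
                if not is_match(c["expected"], generate(c["input"]), min_ratio)]
    return 1 - len(failures) / len(cases), failures


golden = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "2 + 2 =", "expected": "4"},
]
fake_model = {"Capital of France?": " paris ", "2 + 2 =": "5"}.get
pass_rate, failed = run_golden_set(golden, fake_model)
print(pass_rate, [c["input"] for c in failed])
```

In CI, asserting that the pass rate stays above a chosen floor makes a golden-set regression block the merge like any failing unit test.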

Building an LLM-as-Judge Pipeline

LLM-as-judge has become standard practice. Here is how to implement it reliably:

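# Minimal LLM-as-judge eval loop (Python; needs the anthropic package and ANTHROPIC_API_KEY set)
import json

import anthropic

client = anthropic.Anthropic()
JUDGE_MODEL = "claude-opus-4-5"  # pin the judge model so scores stay comparable across runs

JUDGE_PROMPT = """You are an impartial evaluator.
Given a user query and two AI responses (A and B),
score each on a scale of 1-5 for:
1. Factual accuracy  2. Completeness  3. Clarity

Return JSON: {{"score_a": X, "score_b": X, "winner": "A"|"B"|"tie"}}
Query: {query}
Response A: {response_a}
Response B: {response_b}"""

def judge_pair(query, resp_a, resp_b):
    """Compare the current version (A) with the candidate (B); return the parsed JSON verdict."""
    result = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=256,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, response_a=resp_a, response_b=resp_b)}])
    return json.loads(result.content[0].text)

def evaluate_sample(sample, rollback_deployment):
    """sample: (query, old_response, new_response) tuples drawn from ~5% of production traffic.
    rollback_deployment: a callable supplied by your deployment tooling."""
    results = [judge_pair(query, old, new) for query, old, new in sample]
    wins_new = sum(1 for r in results if r["winner"] == "B")
    if wins_new / len(results) < 0.45:
        rollback_deployment()  # new version wins under 45% of comparisons: revert
    return results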

Canary Deployments for AI

A canary deployment sends a small percentage of real traffic (typically 1–5%) to the new model version while the old version handles the rest. This is the standard pattern in software engineering, but AI canaries need additional signal collection:

Shadow mode: New model receives all requests but its responses are discarded. Used for performance testing without user impact. Run for 24–48 hours to detect OOM, latency, and output anomalies.

1% canary: Real users see new model. Monitor user satisfaction signals (re-query rate, session length) vs. control group. Roll back if >10% degradation in any metric.

Progressive rollout: 1% → 5% → 25% → 50% → 100% with automated gates at each stage. If evals pass, traffic increases automatically. If not, it rolls back without human intervention.
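The automated gate between stages can be expressed as a small pure function. This sketch assumes every metric is higher-is-better (for example thumbs-up rate); a metric such as re-query rate would need to be inverted first. The metric names are illustrative.

```python
from typing import Dict

STAGES = [1, 5, 25, 50, 100]  # percent of traffic on the new version


def next_traffic_percent(current: int, canary: Dict[str, float],
                         control: Dict[str, float],
                         max_degradation: float = 0.10) -> int:
    """Return the next traffic percentage, or 0 to signal an automatic rollback."""
    for name, control_value in control.items():
        if canary[name] < control_value * (1 - max_degradation):
            return 0  # more than 10% worse than control on this metric
    idx = STAGES.index(current)
    return STAGES[min(idx + 1, len(STAGES) - 1)]


control = {"thumbs_up_rate": 0.82, "session_continuation": 0.60}
print(next_traffic_percent(1, {"thumbs_up_rate": 0.80, "session_continuation": 0.61}, control))
print(next_traffic_percent(25, {"thumbs_up_rate": 0.70, "session_continuation": 0.61}, control))
```

Keeping the gate free of side effects makes it easy to unit-test and to call from whichever rollout controller applies the traffic change.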

Tooling for Canary Deployments

Kubernetes with Argo Rollouts or Flagger handles progressive traffic shifting. For LLM-specific routing, add a thin application-layer router that logs which model version handled each request — essential for correlating model version with outcome metrics.
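A thin application-layer router of this kind can be sketched with deterministic hashing, so that the same request ID always maps to the same version. The version labels and the list-based log are placeholders for your own identifiers and logging sink.

```python
import hashlib
from typing import Dict, List


def pick_model_version(request_id: str, canary_percent: int,
                       stable: str = "model-v1", canary: str = "model-v2") -> str:
    """Map a request ID to a bucket in [0, 100) and compare with the canary share."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return canary if bucket < canary_percent else stable


def route_and_log(request_id: str, canary_percent: int,
                  log: List[Dict[str, str]]) -> str:
    version = pick_model_version(request_id, canary_percent)
    # Recording the version per request is what lets outcomes be joined to versions later.
    log.append({"request_id": request_id, "model_version": version})
    return version


audit_log: List[Dict[str, str]] = []
for i in range(5):
    route_and_log(f"req-{i}", canary_percent=5, log=audit_log)
print(audit_log)
```

Hashing a user or session ID instead of a request ID keeps each user on one version for the whole session, which makes satisfaction signals easier to attribute.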

Regression Testing Golden Sets

Your golden set must be curated, version-controlled, and continuously expanded. Best practices from teams at Anthropic, Google, and OpenAI (as described in published engineering posts):

Capture real failures — Every production incident where the model produced a bad output should generate a new golden test case. This prevents the same regression twice.
Stratify by capability — Don't test only easy cases. Include adversarial inputs, edge cases, multilingual queries, and long-context prompts proportional to your real traffic distribution.
Version control your evals — Store golden sets in Git alongside your code. Eval failures block deployment just like failing unit tests.
Rotate a holdout set — Keep 20% of golden cases hidden from developers to prevent overfitting to the test set during fine-tuning.

Key Insight

The best teams treat eval as a product, not a chore. Investing one week in a robust continuous eval pipeline saves months of debugging silent regressions. An LLM-as-judge loop that runs in hours is almost always worth the API cost — typically under $10 per eval run on a 500-sample test set.

Lesson 3 Quiz — Continuous Evaluation

Four questions · Select the best answer · Instant feedback

1. In Google's tiered eval system, what must happen before a model version receives shadow traffic?

Correct. Shadow traffic (where the model receives real requests but responses are discarded) only begins after Tier 1 and Tier 2 evals pass — preventing bad models from reaching any users.

Tier 3 human evals come after shadow traffic, not before. The tiers are sequential gates, not parallel processes.

2. When should you add a new case to your golden eval set?

Correct. Every production failure is a test case waiting to be written. Capturing failures in the golden set is how teams prevent the same bug from shipping twice.

A static or schedule-driven golden set quickly falls out of sync with real-world failure modes. The golden set should grow with every production incident.

3. You run an LLM-as-judge eval and find the new model version wins on only 42% of head-to-head comparisons. What should you do?

Correct. A 42% win rate is below the 45% gate, so the new version fails to win 58% of head-to-head comparisons: a significant regression. The pipeline should block this deployment automatically before any users are affected.

A 42% win rate indicates the new version is meaningfully worse. Progressive rollout only makes sense when baseline evals pass — not when there's a known regression.

4. Why should 20% of your golden eval cases be kept as a hidden holdout set during fine-tuning?

Correct. This is the same principle as train/test split in ML — if developers see all eval cases, they may (consciously or not) tune their model to pass those specific cases rather than generalize.

The holdout is about measurement integrity, not cost or compliance. Visible eval cases can be "gamed" — the holdout set ensures you're measuring true generalization.

Lab 3 — Design Your Eval Pipeline

Hands-on · AI-assisted · Complete 3+ exchanges to finish

Your Mission

Your clinical note summarisation tool has been live for 3 months. You're about to update the system prompt and switch from Llama 3 8B to a fine-tuned version of Llama 3 8B that your team trained on 10,000 clinical note examples. Before you deploy, you need a continuous eval pipeline that ensures the new version is actually better.

Work with the eval advisor to: design your golden set (what cases to include, how many), write a concrete LLM-as-judge prompt for clinical summarisation quality, and define the automated deployment gates for your canary rollout.

Start by describing what "good" looks like for clinical note summarisation in your context, then ask the advisor to help you operationalise that into specific eval criteria and a judge prompt.

AI Eval Pipeline Advisor

Let's build your eval pipeline for the clinical note summarisation fine-tuned model upgrade. Before I can help you write the judge prompt and golden set spec, I need to understand your quality bar. What does an excellent clinical note summary look like? What are the failure modes that would be unacceptable in a clinical context?

Module 6 · Lesson 4

Cost Management & Optimization

At scale, inference cost is not a line item — it is your product's margin. Every token matters.

How did one startup cut its LLM API bill by 70% without degrading user experience?

Brex, the corporate card and spend management company, publicly documented their LLM cost optimization journey in their engineering blog. After launching an internal AI assistant, their monthly OpenAI bill was scaling faster than their user growth. Their engineering team implemented a tiered model routing strategy: simple queries (account balance, spending category lookup) were routed to GPT-3.5-turbo at $0.002/1K tokens; complex queries (financial analysis, policy explanations) went to GPT-4 at $0.06/1K tokens. Combined with prompt compression and semantic caching, they reduced inference costs by approximately 70% while maintaining user satisfaction scores.

The Cost Landscape (2024–2025)

Token prices have dropped dramatically since 2023, but inference costs still dominate ML budgets at scale. A rough cost reference for current frontier models:

GPT-4o: ~$5/M input tokens

Claude Sonnet: ~$3/M input tokens

Gemini Flash: ~$0.075/M input tokens

Llama 3 8B self-hosted: ~$0.05/M input tokens (GPU cost)

GPT-4o mini: ~$0.15/M input tokens

At 10 million input tokens per day (a moderate production workload), the difference between using GPT-4o and Llama 3 8B self-hosted is about $49.50 per day, or roughly $18,000 per year. At 300 million tokens per day the gap grows to about $1,485 per day, more than $540,000 per year. This makes model selection a core business decision, not just a technical one.
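The comparison is easy to reproduce for your own volumes. This sketch uses the approximate input-token prices listed above and ignores output tokens, so treat the results as a lower bound.

```python
def daily_cost_usd(tokens_per_day: float, price_per_million_usd: float) -> float:
    return tokens_per_day / 1_000_000 * price_per_million_usd


def annual_gap_usd(tokens_per_day: float, price_a: float, price_b: float) -> float:
    """Yearly cost difference between two per-million-token prices."""
    return 365 * (daily_cost_usd(tokens_per_day, price_a)
                  - daily_cost_usd(tokens_per_day, price_b))


GPT_4O, LLAMA_3_8B = 5.00, 0.05  # approximate USD per million input tokens

for volume in (10_000_000, 300_000_000):
    gap_per_day = daily_cost_usd(volume, GPT_4O) - daily_cost_usd(volume, LLAMA_3_8B)
    print(f"{volume:>11,} tokens/day: ${gap_per_day:,.2f}/day, "
          f"${annual_gap_usd(volume, GPT_4O, LLAMA_3_8B):,.0f}/year")
```

Because cost is linear in volume, the per-token price gap matters little for small workloads and dominates the budget at hundreds of millions of tokens per day.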

Optimization Strategy 1: Model Routing

Route different request types to appropriately-sized models. This requires a lightweight classifier that runs before the main model call:

# Routing logic (Python sketch)
def route_request(prompt: str, context: dict) -> str:
    # context is unused by this heuristic; it is kept so callers can pass request metadata
    complexity = classify_complexity(prompt)  # fast heuristic, runs before any model call

    if complexity == "simple":
        return "llama-3-8b"          # $0.05/M tokens
    elif complexity == "moderate":
        return "gpt-4o-mini"         # $0.15/M tokens
    else:
        return "claude-sonnet"       # $3/M tokens

def classify_complexity(prompt: str) -> str:
    # Heuristics: domain keywords first (case-insensitive), then word count.
    # Keywords are checked first so a short "Analyze ..." prompt is not routed to the small model.
    text = prompt.lower()
    if any(kw in text for kw in ("analyze", "compare", "explain why")):
        return "complex"
    if len(text.split()) < 30:
        return "simple"
    return "moderate"

Optimization Strategy 2: Semantic Caching

Many production workloads have highly repetitive prompts. A user asking "What is your return policy?" is functionally identical to "How do returns work?" — semantic caching detects this and returns a cached response without an API call.

How it works: Embed each incoming prompt with a cheap embedding model (e.g., text-embedding-3-small at $0.02/M tokens). Compare cosine similarity against a cache of recent embeddings. If similarity > 0.97, return cached response.

Cache hit rate: Highly workload-dependent. FAQ bots: 60–80% hit rate. Creative writing assistants: 5–10%. Clinical note summarisation: typically 15–25% (similar patient demographics produce similar queries).

Tools: GPTCache (open source), Redis with vector search, Qdrant or Chroma as vector stores. Most add <10ms latency on a cache hit vs. 500ms–3s for a full inference call.
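The lookup described above can be sketched as a small in-memory cache. The `embed` callable is a stand-in for a real embedding model, the toy two-dimensional vectors exist only to make the example runnable, and a production cache would use a vector index instead of a linear scan.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.97):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Sequence[float], str]] = []

    def get(self, prompt: str) -> Optional[str]:
        """Return the cached response of the most similar prompt, if above threshold."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))


toy_vectors: Dict[str, List[float]] = {
    "What is your return policy?": [1.0, 0.0],
    "How do returns work?": [0.99, 0.05],
    "Where is my order?": [0.0, 1.0],
}
cache = SemanticCache(embed=toy_vectors.__getitem__)
cache.put("What is your return policy?", "Returns are accepted within 30 days.")
print(cache.get("How do returns work?"))
print(cache.get("Where is my order?"))
```

The threshold is the main tuning knob: set it too low and users receive answers to questions they did not ask, set it too high and the hit rate collapses.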

Optimization Strategy 3: Prompt Compression

Long system prompts and context windows cost money. LLMLingua (from Microsoft Research, published 2023) demonstrated that prompts can be compressed by removing tokens that contribute least to the output distribution — achieving 2–5× compression with under 5% quality degradation in many tasks. For RAG systems, this means compressing retrieved document chunks before passing them to the LLM.

Strategy

Output Caching

Cache exact match outputs for deterministic queries (temperature=0). Identical prompts return immediately from cache. Effective for template-heavy use cases.

Strategy

Batching Async Jobs

OpenAI's Batch API and Anthropic's Message Batches API offer 50% cost reduction for async workloads. Process overnight; latency doesn't matter.

Strategy

Max Token Limits

Set explicit max_tokens on every call. A misconfigured call that generates 4,000 tokens when 200 suffice costs 20× more. Add per-user and per-request token budgets.

Strategy

Prompt Prefix Caching

Anthropic and Google cache common prompt prefixes (system prompts, shared context). Prefix cache hits are 80–90% cheaper. Structure prompts with stable content first.

Building a Cost Attribution System

You cannot optimize what you cannot measure. Every LLM call should be tagged with:

# Required tags on every LLM API call
{
  "user_id": "hashed_id",
  "feature_id": "clinical_note_summary",
  "model": "llama-3-8b",
  "prompt_tokens": 412,
  "completion_tokens": 187,
  "cost_usd": 0.000030,
  "cache_hit": false,
  "session_id": "abc123",
  "timestamp": "2025-01-15T14:23:01Z"
}

Aggregate these in a data warehouse (BigQuery, Snowflake, or even Postgres at small scale) and build weekly cost-per-feature dashboards. This reveals which product features are profitable and which are burning money — essential information for both engineering and product decisions.
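Before a warehouse is in place, the same aggregation can be prototyped in a few lines. This sketch assumes a single blended price per model for input and output tokens and treats cache hits as free; both are simplifications, and the second feature name is made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable

PRICE_PER_M_TOKENS = {"llama-3-8b": 0.05, "claude-sonnet": 3.00}  # illustrative USD prices


def call_cost_usd(record: dict, prices: Dict[str, float] = PRICE_PER_M_TOKENS) -> float:
    if record.get("cache_hit"):
        return 0.0  # served from cache, no inference cost
    tokens = record["prompt_tokens"] + record["completion_tokens"]
    return tokens / 1_000_000 * prices[record["model"]]


def cost_by_feature(records: Iterable[dict],
                    prices: Dict[str, float] = PRICE_PER_M_TOKENS) -> Dict[str, float]:
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["feature_id"]] += call_cost_usd(record, prices)
    return dict(totals)


calls = [
    {"feature_id": "clinical_note_summary", "model": "llama-3-8b",
     "prompt_tokens": 412, "completion_tokens": 187, "cache_hit": False},
    {"feature_id": "clinical_note_summary", "model": "llama-3-8b",
     "prompt_tokens": 412, "completion_tokens": 187, "cache_hit": True},
    {"feature_id": "discharge_letter", "model": "claude-sonnet",
     "prompt_tokens": 1000, "completion_tokens": 500, "cache_hit": False},
]
print(cost_by_feature(calls))
```

The equivalent warehouse query is a GROUP BY on feature_id; the point of the tags is that this grouping is possible at all.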

Key Insight

Cost optimization is not a one-time project — it is a continuous practice. As models get cheaper and your usage patterns evolve, re-evaluate your routing thresholds quarterly. A model routing decision that was correct in January may be suboptimal by July as new models launch and prices drop.

Lesson 4 Quiz — Cost Management

Four questions · Select the best answer · Instant feedback

1. What is the primary mechanism of semantic caching for LLM cost reduction?

Correct. Semantic caching uses embedding similarity (typically cosine similarity > 0.95–0.97) to detect functionally equivalent prompts and serve cached responses without inference.

Semantic caching is about detecting equivalent queries using embeddings — not model compression, routing, or batching.

2. Brex reduced their LLM costs by ~70% primarily through which strategy?

Correct. The combination of routing (right-sizing model to task complexity), semantic caching (avoiding redundant calls), and prompt compression drove Brex's 70% cost reduction.

The Brex case study specifically attributes gains to tiered routing plus caching plus compression — no single lever achieved the result.

3. Which API feature allows a 50% cost reduction on non-real-time batch processing workloads?

Correct. Both OpenAI and Anthropic offer roughly 50% price reductions for async batch processing — where results can be delivered within 24 hours rather than in real time.

Streaming affects UX latency; prefix caching helps with shared context; batch APIs specifically target the async use case with a significant price break.

4. Why should every LLM API call be tagged with user_id, feature_id, and cost_usd in your logging system?

Correct. Without attribution, you cannot know which features are driving cost growth. Tagging enables cost-per-feature dashboards that drive both engineering optimization and product prioritization decisions.

The purpose of per-call tagging is cost attribution — understanding which product features generate which costs so you can make informed optimization and product decisions.

Lab 4 — Build a Cost Optimization Strategy

Hands-on · AI-assisted · Complete 3+ exchanges to finish

Your Mission

Your clinical note summarisation tool is processing 500,000 requests per month. Your current setup routes all requests to Claude Sonnet at $3/M input tokens. Monthly bill: approximately $1,800/month just for input tokens (about 1,200 input tokens per request, 600M tokens in total), plus output. Your CTO wants to cut inference costs by at least 50% without degrading the quality of critical summaries.

Work with the cost advisor to: design a tiered routing strategy specific to clinical note summarisation, estimate the cost impact of each optimization, decide where semantic caching makes sense in this use case, and identify which requests absolutely must use the most capable model.

Start by breaking down your request types — what kinds of clinical notes does your system process? What are the simple vs. complex cases? Ask the advisor to help you design a routing matrix.

AI Cost Optimization Advisor


Let's cut that inference bill. 500K requests/month all hitting Claude Sonnet is almost certainly over-provisioned for many of your request types. Before I build a routing matrix, help me understand your traffic: what are the three or four most common types of clinical notes you're summarising? Are they all similar in length and complexity, or highly variable?

Module 6 — Deployment & Monitoring Test

15 questions · 80% to pass · Covers all four lessons

1. What does PagedAttention in vLLM solve?

Correct. PagedAttention applies OS-style virtual memory paging to the KV cache, eliminating fragmentation and allowing vLLM to pack more requests into GPU memory efficiently.

PagedAttention is specifically about KV cache memory management — not quantization, tensor parallelism, or tokenization.
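The paging idea can be illustrated with a toy allocator. This is a sketch of the concept only, not vLLM's implementation: the `PagedKVCache` class and its methods are hypothetical, and it tracks block ownership without storing any actual key or value tensors.

```python
class PagedKVCache:
    """Toy allocator: fixed-size blocks handed out on demand, per sequence."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token; allocate a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence only ever holds whole blocks, and at most one of them is partially filled, memory wasted per sequence is bounded by one block instead of by the maximum context length.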

2. A 7B parameter model in BFloat16 requires approximately how much VRAM just to hold the weights?

Correct. BFloat16 uses 2 bytes per parameter. 7 billion × 2 bytes = 14 GB, before accounting for the KV cache during inference.

BFloat16 = 2 bytes per parameter. 7B × 2 = 14 GB. INT4 quantization would bring it to ~3.5 GB; INT8 to ~7 GB.
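The same arithmetic as a small helper, counting weights only and taking 1 GB as 10^9 bytes. The function name and the dtype keys are illustrative.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(num_params, dtype):
    """VRAM in GB (10^9 bytes) needed to hold the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9
```

For example, `weight_vram_gb(7e9, "bf16")` gives 14.0. The KV cache and activations come on top of this figure.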

3. Which deployment scenario is most appropriate for a zero-idle-cost async document processing pipeline that can tolerate 20-second cold starts?

Correct. Serverless inference scales to zero when idle (zero cost) and scales up on demand. Cold starts are the tradeoff — acceptable for async workloads, problematic for real-time chat.

Dedicated GPUs cost money even when idle. The async + cold-start-tolerant use case is exactly what serverless inference is designed for.

4. Meta's "silent model rot" case demonstrates which monitoring gap?

Correct. Binary accuracy checks said the model was "correct enough" while the output distribution was quietly shifting — only distribution-level monitoring caught the 11-week degradation.

The Meta case is specifically about the failure to monitor output distributions — behavioral drift that binary accuracy checks cannot detect.
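One simple way to watch an output distribution is to compare the label mix in a recent window against a baseline window. The sketch below uses total variation distance as the drift score. This is one reasonable choice, not necessarily the metric used in the case study, and the labels and alert threshold are hypothetical.

```python
from collections import Counter


def total_variation_distance(baseline, current):
    """Half the L1 distance between two empirical label distributions.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    base_counts, cur_counts = Counter(baseline), Counter(current)
    labels = set(base_counts) | set(cur_counts)
    return 0.5 * sum(
        abs(base_counts[label] / len(baseline) - cur_counts[label] / len(current))
        for label in labels
    )


# Hypothetical windows of model outputs, reduced to a category per response.
baseline_window = ["approve"] * 80 + ["reject"] * 20
current_window = ["approve"] * 50 + ["reject"] * 50

DRIFT_THRESHOLD = 0.1  # hypothetical alert level
drifted = total_variation_distance(baseline_window, current_window) > DRIFT_THRESHOLD
```

Every individual output in `current_window` could still pass a per-response correctness check. The shift only shows up when the windows are compared as distributions.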

5. You should set a high-refusal-rate alert because a sudden spike likely indicates:

Correct. Refusal rate is a behavioral signal. Sudden spikes indicate either adversarial users probing guardrails (prompt injection) or a system prompt change that altered how the model interprets what it should refuse.

Refusal rate is a model behavior metric, not an infrastructure metric. It points to prompt-level causes, not GPU or network issues.
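A refusal-rate alert can be as simple as comparing a recent window against a known baseline. The function below is a sketch. The `ratio` and `min_requests` defaults are hypothetical and would need tuning, and it assumes refusals are already being detected and counted upstream.

```python
def refusal_spike(baseline_rate, recent_refusals, recent_total,
                  ratio=3.0, min_requests=100):
    """True when the recent refusal rate exceeds `ratio` times the baseline.

    Windows with fewer than `min_requests` calls are ignored to avoid
    alerting on noise.
    """
    if recent_total < min_requests:
        return False
    return recent_refusals / recent_total > ratio * baseline_rate
```

With a 2% baseline, 30 refusals in 200 recent calls (15%) would fire the alert, while 5 in 200 (2.5%) would not.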

6. What does LangFuse offer that Prometheus + Grafana does NOT?

Correct. Prometheus tracks numeric metrics (latency, throughput, error rates). LangFuse captures the content of each pipeline step — what was the prompt, what was retrieved, what did the model say — enabling root cause analysis of bad outputs.

Prometheus/Grafana are for infrastructure metrics. LangFuse is for content-level tracing of what happened inside each LLM pipeline execution.

7. The Stanford research on GPT-4's performance change between March and June 2023 is a case study in:

Correct. The endpoint name "gpt-4" was stable, but the underlying model changed, and the share of directly executable code generations dropped from 52% to 10%. This is model version drift, and the mitigation is pinning to dated model versions.

The Stanford study is specifically about undocumented model checkpoint changes causing behavioral shifts — the definition of model version drift.

8. In Google's tiered eval system, what is the purpose of shadow traffic (before any real users see the new model)?

Correct. Shadow mode is about infrastructure and behavioral validation at production scale, with zero risk to users since responses are never shown.

Shadow mode discards all responses — users never see them. It's a load test with behavioral monitoring, not user preference collection or further training.
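The shadow pattern can be sketched as a request handler that always answers from the current model and only logs what the candidate did. The names `handle_request`, `primary`, `candidate`, and `shadow_log` are hypothetical. For brevity the candidate is called inline here, whereas a production system would mirror the request asynchronously so it adds no user-facing latency.

```python
import time


def handle_request(prompt, primary, candidate, shadow_log):
    """Serve the primary model's answer; run the candidate in shadow and log it."""
    response = primary(prompt)

    start = time.perf_counter()
    try:
        shadow_output = candidate(prompt)
        error = None
    except Exception as exc:  # a shadow failure must never affect the user
        shadow_output, error = None, repr(exc)

    shadow_log.append({
        "prompt": prompt,
        "latency_s": time.perf_counter() - start,
        "matches_primary": shadow_output == response,
        "error": error,
    })
    return response  # the candidate's output is never returned
```

The log gives latency, error, and agreement data at production traffic levels, while users only ever see the primary model's output.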

9. You have a holdout set of 100 golden eval cases that developers never see during development. What does passing the holdout set prove that passing the visible golden set does NOT?

Correct. This is the train/test split principle applied to evals. Developers can unconsciously overfit to visible test cases; a hidden holdout provides an unbiased generalization estimate.

The holdout set specifically guards against evaluation overfitting — it proves generalization, not latency, multilingual capability, or cost performance.

10. LLMLingua (Microsoft Research, 2023) demonstrated what capability for cost optimization?

Correct. LLMLingua uses a small language model to score how informative each token is and removes the least informative ones, compressing prompts significantly while preserving most of the semantic content needed for good responses.

LLMLingua is specifically about prompt token compression — not routing, quantization, or semantic caching.

11. What is the main tradeoff of using Anthropic or OpenAI's Batch API compared to real-time inference?

Correct. The Batch API tradeoff is latency (up to 24 hours) for cost (50% reduction). Same models, same quality — just delayed delivery suitable for non-real-time workflows.

The Batch API uses the same models and supports the same system prompts; the only tradeoff is latency in exchange for cost savings.

12. Why must cost-attribution tags include feature_id alongside user_id?

Correct. Without feature_id, you know total cost but not which features drive it. With feature_id, you can build dashboards showing "clinical note summarisation costs $X/day, chatbot costs $Y/day" — enabling targeted optimization.

feature_id attribution is about cost visibility and optimization prioritization, not compliance, personalization, or token reduction.

13. In a progressive canary rollout (1%→5%→25%→50%→100%), what should trigger an automatic rollback at any stage?

Correct. Rollback gates must be automated and threshold-based — not triggered by individual complaints (too noisy) or any difference at all (too sensitive). Define specific thresholds per metric before deploying.

Rollbacks should be automated on pre-defined metric thresholds, not triggered by individual complaints (too noisy), GPU metrics (wrong layer), or any distribution difference at all (too sensitive).
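A threshold-based rollback gate might look like the sketch below. The stage percentages come from the question. The metric names, the threshold values, and the `metrics_for_stage` callback are hypothetical stand-ins for whatever your monitoring stack exposes.

```python
STAGES = [1, 5, 25, 50, 100]  # percent of traffic, from the question

# Hypothetical thresholds, fixed before the rollout starts.
THRESHOLDS = {"error_rate": 0.02, "p95_latency_s": 3.0, "refusal_rate": 0.05}


def breached(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics that exceed their pre-defined threshold."""
    return [name for name, limit in thresholds.items() if metrics[name] > limit]


def run_rollout(metrics_for_stage):
    """Advance through STAGES; roll back at the first stage with a breach."""
    for percent in STAGES:
        failures = breached(metrics_for_stage(percent))
        if failures:
            return ("rolled_back", percent, failures)
    return ("completed", 100, [])
```

Because the thresholds are written down before deployment, the rollback decision is mechanical and does not depend on anyone noticing a problem in time.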

14. Which framework provides an OpenAI-compatible API endpoint for self-hosted open-weight models with continuous batching built in?

Correct. vLLM serves open-weight models (Llama, Mistral, Qwen, etc.) behind an OpenAI-compatible REST API with continuous batching and PagedAttention built in.

LangChain is an orchestration framework; LiteLLM is a proxy/routing layer; LangFuse is for observability. vLLM is the inference server.
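Because the endpoint is OpenAI-compatible, a client only needs to send a standard chat completion request to the server's `/v1/chat/completions` route. The sketch below uses only the Python standard library. The base URL, model name, and `max_tokens` value are assumptions about a local deployment, and `send` will only work with a vLLM server actually running at that address.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 128,  # assumed limit for this example
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the first choice's text (needs a live server)."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Since the request shape matches OpenAI's, existing client code can usually be pointed at a self-hosted server by changing only the base URL and model name.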

15. A clinical note summarisation system processes 500K requests/month, all routed to Claude Sonnet. What is the FIRST optimization to implement for cost reduction?

Correct. Blind cost-cutting (switching everything to cheap models) risks quality. The correct first step is traffic analysis to understand complexity distribution, then build evidence-based routing — the Brex approach.

Blind switches or arbitrary token limits risk quality degradation. The disciplined approach is traffic analysis → routing design → staged rollout with quality monitoring.