When ChatGPT crossed one million users in five days after its November 2022 launch, OpenAI's infrastructure team faced continuous capacity emergencies. CTO Mira Murati publicly acknowledged that the service was "incredibly expensive" and capacity-constrained for months. The API — launched separately in March 2023 — had to use rate-limiting and waitlists precisely because GPU memory is finite and serving transformer models at scale is orders of magnitude harder than serving a REST API that returns JSON from a database.
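A typical web API is stateless and cheap to scale: spin up more containers and add load balancing. LLM serving is fundamentally different. A 7B-parameter model needs roughly 14 GB of GPU VRAM just to hold its weights in BFloat16 (7 billion parameters × 2 bytes). On top of that, the KV cache, which stores the attention keys and values for every token processed so far, grows with every token generated. For a Llama-2-7B-style architecture (32 layers, hidden size 4,096, 16-bit cache), a single 4,096-token sequence adds about 2 GB, so just two to four concurrent full-length sequences consume another 4–8 GB of VRAM.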
This means batching strategy, GPU memory management, and tokenizer throughput are first-class engineering concerns, not afterthoughts.
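The memory figures above follow from simple arithmetic, sketched here. The layer count, head count, and head dimension are assumptions matching a Llama-2-7B-style model with full multi-head attention; models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights (2 bytes per parameter for BF16/FP16)."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed shape: 32 layers, 32 KV heads, head dimension 128 (hidden size 4,096).
one_sequence = kv_cache_bytes(32, 32, 128, seq_len=4096)
four_sequences = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=4)

print(f"weights:        {weight_bytes(7e9) / 1e9:.0f} GB")
print(f"KV, 1 sequence: {one_sequence / GIB:.0f} GiB")
print(f"KV, 4 sequences: {four_sequences / GIB:.0f} GiB")
```

Because the cache scales linearly with both sequence length and batch size, the number of concurrent requests a GPU can hold is usually limited by KV cache memory rather than by the weights.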
Three frameworks dominate production LLM serving today:
A common 2024 stack: Nginx → LiteLLM proxy → vLLM backend (self-hosted) with Anthropic Claude API as overflow fallback. The proxy handles rate limiting, logging, and cost allocation; vLLM handles on-prem inference; Anthropic handles burst traffic that exceeds GPU capacity.
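The overflow behaviour in this stack can be illustrated with a small router. This is a sketch and not how LiteLLM is implemented: `OverflowRouter`, `max_in_flight`, and the two backend callables are hypothetical stand-ins for the self-hosted backend and the hosted fallback.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OverflowRouter:
    primary: Callable[[str], str]    # stand-in for the self-hosted backend
    fallback: Callable[[str], str]   # stand-in for the hosted fallback API
    max_in_flight: int               # assumed capacity of the primary backend
    in_flight: int = 0
    log: List[str] = field(default_factory=list)

    def handle(self, prompt: str) -> str:
        # Primary is saturated: send the request to the fallback instead.
        if self.in_flight >= self.max_in_flight:
            self.log.append("fallback")
            return self.fallback(prompt)
        self.in_flight += 1
        try:
            self.log.append("primary")
            return self.primary(prompt)
        except Exception:
            # Primary failed mid-request: record and retry on the fallback.
            self.log[-1] = "fallback"
            return self.fallback(prompt)
        finally:
            self.in_flight -= 1
```

A real proxy would track in-flight requests across concurrent workers (with a lock or an async semaphore) and would also enforce per-user rate limits; the single counter here only shows the routing decision.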
Where your model runs determines everything about latency, cost, and compliance:
Time to first token (TTFT) is often more important than total latency for conversational UX: users tolerate a 3-second wait for the first token far better than a 3-second pause between every few words. Streaming output (server-sent events or WebSockets) is therefore essential in any chat interface.
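Measuring the first-token delay separately from total latency takes only a few lines once the response is consumed as a stream. The sketch below assumes the client exposes the response as a plain Python iterable of text chunks; `measure_stream` and its return fields are illustrative names.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, object]:
    """Consume a token stream, recording time to first token and total time."""
    start = clock()
    ttft = None
    pieces = []
    for tok in tokens:
        if ttft is None:
            ttft = clock() - start   # delay until the first chunk arrived
        pieces.append(tok)
    total = clock() - start
    return {
        "ttft_s": ttft,              # None if the stream produced nothing
        "total_s": total,
        "n_chunks": len(pieces),
        "text": "".join(pieces),
    }
```

Passing the clock as a parameter keeps the function testable with a fake clock, and the same two numbers can be exported as separate latency histograms.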
The gap between "model works in a notebook" and "model serves 1,000 concurrent users" is not an API call — it's an infrastructure engineering problem. Choosing the right serving framework, batching strategy, and deployment target before you write application code prevents painful rewrites at scale.
You are the lead AI engineer at a healthcare analytics company. You need to deploy an open-weight LLM (Llama 3 8B) for an internal clinical note summarisation tool. Requirements: data must not leave your data center, p95 time to first token (TTFT) must be under 2 seconds, and the system must support up to 50 concurrent users during peak hours.
Work with the AI advisor below to design your serving stack. Explore trade-offs in framework choice, hardware selection, batching configuration, and fallback strategy. Push for specific, deployable recommendations.
Meta's integrity team described a class of failures they called "silent model rot" — cases where a deployed model continued to produce outputs that looked correct to automated quality checks, but whose distribution had quietly shifted due to upstream feature drift. The team only detected the degradation when they implemented output distribution monitoring: tracking the statistical shape of model outputs (not just accuracy) over rolling 7-day windows. By the time the feature drift was traced, the model had been silently miscalibrated for eleven weeks. Meta now mandates output distribution alerts for all ranking and recommendation models.
Monitoring an LLM in production requires thinking at three distinct levels simultaneously:
Traditional ML drift monitoring compares input feature distributions to training data. LLMs are more complex. The relevant drift signals are:
In mid-2023, researchers at Stanford and UC Berkeley published a study showing that GPT-4's behavior had measurably changed between its March and June 2023 snapshots. On the study's code-generation benchmark, the share of GPT-4 outputs that were directly executable fell from 52% in March to 10% in June. Teams relying on the model endpoint without version-pinning had no visibility into this change. The lesson: always pin model versions in production, and run regression benchmarks on every version bump.
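Both halves of that lesson can be sketched briefly. The model identifier below is a dated OpenAI snapshot name, used only to show the difference between a pinned snapshot and a floating alias; `pass_rate`, `regression_check`, and the 2-point tolerance are illustrative assumptions.

```python
from typing import Dict, Sequence

# A dated snapshot stays fixed; a bare alias such as "gpt-4" can be repointed
# to a newer snapshot by the provider.
PINNED_MODEL = "gpt-4-0613"


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of benchmark cases that passed."""
    if not results:
        raise ValueError("benchmark produced no results")
    return sum(1 for r in results if r) / len(results)


def regression_check(baseline: Sequence[bool], candidate: Sequence[bool],
                     max_drop: float = 0.02) -> Dict[str, object]:
    """Compare a candidate version against the pinned baseline on the same cases."""
    base, cand = pass_rate(baseline), pass_rate(candidate)
    return {"baseline": base, "candidate": cand, "passed": (base - cand) <= max_drop}
```

Running this check whenever the pinned identifier changes turns a silent behavior shift into a failed build.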
The observability stack for production LLMs in 2024–2025 has largely converged around these tools:
The most dangerous production failures are not crashes — they are silent degradations. Your monitoring stack must be able to detect when your model's behavior shifts even if infrastructure metrics look healthy. Build output distribution tracking from day one, not as an afterthought.
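One way to track output distributions without reading every response is to compare a cheap summary statistic, such as response length, between a baseline window and the current window. The sketch below uses the population stability index (PSI) for that comparison; the choice of statistic, the bin edges, and the alert threshold are all assumptions to tune for your workload.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values in each bin; edges must be sorted ascending."""
    if not values:
        raise ValueError("window is empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def psi(baseline: Sequence[float], current: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population stability index between two windows (0.0 means identical shares)."""
    p = histogram(baseline, edges)
    q = histogram(current, edges)
    # eps keeps the logarithm finite when a bin is empty in one window.
    return sum((qi - pi) * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))


# Example: response lengths in tokens, bucketed at 150 tokens.
last_week = [100] * 50 + [200] * 50
this_week = [100] * 10 + [200] * 90
print(f"PSI = {psi(last_week, this_week, edges=[150]):.2f}")
```

A PSI above roughly 0.25 is a common rule of thumb for a significant shift. The same function works for other per-response statistics, such as refusal rate buckets or judge scores.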
Your clinical note summarisation tool is in production. Now you need to build a monitoring and alerting stack. You have a Prometheus + Grafana setup already running for infrastructure monitoring. You need to design LLM-specific monitoring that catches silent model degradation, prompt injection attempts, and latency regressions.
Work with the monitoring advisor to define: which metrics to track, specific alert thresholds, which tool to use for trace-level observability, and how to detect output drift without reading every response.
Google's Gemini team described their evaluation pipeline at a 2024 systems talk: every candidate model version runs against a tiered eval suite before any traffic reaches it. Tier 1 runs in minutes (unit-test-style golden examples), Tier 2 runs in hours (sampling-based evals with LLM-as-judge), Tier 3 runs in days (human preference comparisons on a stratified sample). A version must pass Tier 1 and Tier 2 before it ever receives shadow traffic. This prevents regressions from reaching users — even when regressions only appear on edge cases that weren't in the original test set.
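The gating logic of such a tiered pipeline can be sketched as an ordered list of evals that stops at the first failure, so the slow tiers only run for candidates that passed the fast ones. `Tier`, `gate`, and the thresholds are hypothetical names and values; this illustrates the pattern and does not reproduce any team's actual pipeline.

```python
from typing import Callable, List, NamedTuple, Tuple


class Tier(NamedTuple):
    name: str
    run: Callable[[str], float]   # takes a candidate version id, returns a score in [0, 1]
    threshold: float              # minimum score required to pass this tier


def gate(candidate: str, tiers: List[Tier]) -> Tuple[bool, List[str]]:
    """Run tiers in order; stop at the first failure.

    Returns (all_passed, names_of_tiers_passed).
    """
    passed: List[str] = []
    for tier in tiers:
        score = tier.run(candidate)
        if score < tier.threshold:
            return False, passed   # later, more expensive tiers are never run
        passed.append(tier.name)
    return True, passed
```

Shadow traffic would be enabled only when `gate` returns `True` for the list containing the first two tiers.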
Think of evaluations in three tiers, each with different speed/depth tradeoffs:
LLM-as-judge has become standard practice. Here is how to implement it reliably:
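As one possible shape for a judge loop, the sketch below compares two summaries pairwise and asks the judge twice with the order swapped, accepting a verdict only when both orderings agree. This guards against position bias, where a judge favours whichever answer appears first. The prompt wording, `pairwise_judge`, and the `judge` callable (any function that sends a prompt to a model and returns its text reply) are assumptions for illustration.

```python
import re
from typing import Callable, Optional

JUDGE_TEMPLATE = """You are grading two summaries of the same source text.

Source:
{source}

Summary A:
{a}

Summary B:
{b}

Reply with exactly one line: VERDICT: A, VERDICT: B, or VERDICT: TIE."""


def parse_verdict(reply: str) -> Optional[str]:
    """Extract A, B, or TIE from the judge's reply; None if it did not comply."""
    match = re.search(r"VERDICT:\s*(A|B|TIE)\b", reply.upper())
    return match.group(1) if match else None


def pairwise_judge(judge: Callable[[str], str], source: str, old: str, new: str) -> str:
    """Return 'new', 'old', or 'inconclusive' after judging both orderings."""
    first = parse_verdict(judge(JUDGE_TEMPLATE.format(source=source, a=old, b=new)))
    second = parse_verdict(judge(JUDGE_TEMPLATE.format(source=source, a=new, b=old)))
    if first == "B" and second == "A":
        return "new"    # the new summary won in both positions
    if first == "A" and second == "B":
        return "old"    # the old summary won in both positions
    return "inconclusive"  # ties, disagreement, or unparseable replies
```

Forcing a fixed reply format and treating anything unparseable as inconclusive keeps a misbehaving judge from silently inflating scores.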
A canary deployment sends a small percentage of real traffic (typically 1–5%) to the new model version while the old version handles the rest. This is the standard pattern in software engineering, but AI canaries need additional signal collection:
Kubernetes with Argo Rollouts or Flagger handles progressive traffic shifting. For LLM-specific routing, add a thin application-layer router that logs which model version handled each request — essential for correlating model version with outcome metrics.
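The application-layer router can be very small. The sketch below assigns each request to a version by hashing its request ID, so the same ID always maps to the same version, and records the version alongside the request. The function names, version labels, and log format are illustrative assumptions.

```python
import hashlib
from typing import Dict, List


def pick_version(request_id: str, canary_percent: float,
                 stable: str = "stable", canary: str = "canary") -> str:
    """Deterministically map a request ID to a model version."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999
    # canary_percent=5 sends buckets 0..499 (5% of the space) to the canary.
    return canary if bucket < canary_percent * 100 else stable


def route(request_id: str, canary_percent: float, log: List[Dict[str, str]]) -> str:
    """Pick a version and record which one served the request."""
    version = pick_version(request_id, canary_percent)
    log.append({"request_id": request_id, "model_version": version})
    return version
```

Hashing a user or session ID instead of the request ID keeps each user on a single version for the duration of the rollout, which makes outcome metrics easier to attribute.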
Your golden set must be curated, version-controlled, and continuously expanded. Best practices from teams at Anthropic, Google, and OpenAI (as described in published engineering posts):
The best teams treat eval as a product, not a chore. Investing one week in a robust continuous eval pipeline saves months of debugging silent regressions. An LLM-as-judge loop that runs in hours is almost always worth the API cost — typically under $10 per eval run on a 500-sample test set.
Your clinical note summarisation tool has been live for 3 months. You're about to update the system prompt and switch from Llama 3 8B to a fine-tuned version of Llama 3 8B that your team trained on 10,000 clinical note examples. Before you deploy, you need a continuous eval pipeline that ensures the new version is actually better.
Work with the eval advisor to: design your golden set (what cases to include, how many), write a concrete LLM-as-judge prompt for clinical summarisation quality, and define the automated deployment gates for your canary rollout.
Brex, the corporate card and spend management company, publicly documented their LLM cost optimization journey in their engineering blog. After launching an internal AI assistant, their monthly OpenAI bill was scaling faster than their user growth. Their engineering team implemented a tiered model routing strategy: simple queries (account balance, spending category lookup) were routed to GPT-3.5-turbo at $0.002/1K tokens; complex queries (financial analysis, policy explanations) went to GPT-4 at $0.06/1K tokens. Combined with prompt compression and semantic caching, they reduced inference costs by approximately 70% while maintaining user satisfaction scores.
Token prices have dropped dramatically since 2023, but inference costs still dominate ML budgets at scale. A rough cost reference for current frontier models:
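At 10 million tokens per day (a moderate production workload), GPT-4o at its May 2024 list prices of $5 per million input tokens and $15 per million output tokens costs roughly $50–150 per day, or about $18,000–55,000 per year, while a self-hosted Llama 3 8B costs a largely fixed amount for the GPU regardless of volume. At 100 million tokens per day the API bill reaches $500–1,500 per day, up to roughly $550,000 per year. This makes model selection a core business decision, not just a technical one.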
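Writing the arithmetic down makes this kind of comparison easy to rerun when prices or volumes change. In the sketch below, the API prices are GPT-4o's May 2024 list prices, while the 80/20 input/output split and the GPU hourly rate are placeholder assumptions to replace with your own numbers.

```python
def api_daily_cost(input_tokens: float, output_tokens: float,
                   in_price_per_m: float, out_price_per_m: float) -> float:
    """Daily API cost in USD given token volumes and per-million-token prices."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m


def self_hosted_daily_cost(gpu_hourly_usd: float, n_gpus: int = 1) -> float:
    """Daily cost of keeping GPUs running around the clock, independent of volume."""
    return gpu_hourly_usd * 24 * n_gpus


# 10M tokens/day, assumed to split 8M input / 2M output.
api = api_daily_cost(8e6, 2e6, in_price_per_m=5.0, out_price_per_m=15.0)
# Placeholder GPU rate; substitute your actual amortised or rental cost.
hosted = self_hosted_daily_cost(gpu_hourly_usd=1.50)

print(f"API:         ${api:.2f}/day, ${api * 365:,.0f}/year")
print(f"Self-hosted: ${hosted:.2f}/day, ${hosted * 365:,.0f}/year")
```

The self-hosted figure excludes engineering time, redundancy, and idle capacity, so the break-even volume in practice is higher than the raw GPU rate suggests.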
Route different request types to appropriately-sized models. This requires a lightweight classifier that runs before the main model call:
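A minimal sketch of such a pre-call classifier is shown below, using keyword and length heuristics. The keywords, the word limit, and the model labels are hypothetical; production routers often replace the heuristic with a small trained classifier or an embedding-based one.

```python
from typing import Dict, Tuple

# Hypothetical keywords that mark a query as a simple lookup.
SIMPLE_KEYWORDS: Tuple[str, ...] = ("balance", "category", "status")

# Hypothetical model labels; map these to real model identifiers in your stack.
ROUTES: Dict[str, str] = {"simple": "small-model", "complex": "large-model"}


def classify(query: str, max_simple_words: int = 12) -> str:
    """Label a query 'simple' only when it is short AND matches a lookup keyword."""
    text = query.lower()
    is_short = len(text.split()) <= max_simple_words
    is_lookup = any(keyword in text for keyword in SIMPLE_KEYWORDS)
    return "simple" if is_short and is_lookup else "complex"


def route(query: str) -> str:
    """Return the model label that should handle this query."""
    return ROUTES[classify(query)]
```

The classifier defaults to the larger model whenever it is unsure, which trades some cost for safety: a misrouted simple query wastes money, while a misrouted complex query degrades quality.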
Many production workloads have highly repetitive prompts. A user asking "What is your return policy?" is functionally identical to "How do returns work?" — semantic caching detects this and returns a cached response without an API call.
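The lookup logic of a semantic cache can be sketched as follows. The embedding function is injected, because the quality of the cache depends entirely on it; `SemanticCache`, the 0.9 threshold, and the linear scan are illustrative assumptions, and a production system would use a real embedding model and a vector index.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.9):
        self.embed = embed            # maps text to a vector; supplied by the caller
        self.threshold = threshold    # minimum similarity to count as a hit
        self.entries: List[Tuple[Sequence[float], str]] = []

    def get(self, prompt: str) -> Optional[str]:
        """Return the cached response for the most similar prompt, if close enough."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        """Store a response under the embedding of its prompt."""
        self.entries.append((self.embed(prompt), response))
```

The threshold is the key tuning knob: set too low, users receive answers to questions they did not ask; set too high, the cache rarely hits.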
Long system prompts and context windows cost money. LLMLingua (from Microsoft Research, published 2023) demonstrated that prompts can be compressed by removing tokens that contribute least to the output distribution — achieving 2–5× compression with under 5% quality degradation in many tasks. For RAG systems, this means compressing retrieved document chunks before passing them to the LLM.
You cannot optimize what you cannot measure. Every LLM call should be tagged with:
Aggregate these in a data warehouse (BigQuery, Snowflake, or even Postgres at small scale) and build weekly cost-per-feature dashboards. This reveals which product features are profitable and which are burning money — essential information for both engineering and product decisions.
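Before a warehouse is in place, the same aggregation can be prototyped in a few lines. The record fields, model labels, and prices below are hypothetical placeholders; the point is that each call carries enough tags to be attributed to a feature.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CallRecord:
    feature: str         # product feature that triggered the call
    model: str           # model label used for the call
    input_tokens: int
    output_tokens: int


# Hypothetical (input, output) prices in USD per million tokens.
PRICES_PER_M: Dict[str, Tuple[float, float]] = {
    "small-model": (0.5, 1.5),
    "large-model": (5.0, 15.0),
}


def cost_per_feature(records: Iterable[CallRecord],
                     prices: Dict[str, Tuple[float, float]] = PRICES_PER_M) -> Dict[str, float]:
    """Sum the USD cost of all calls, grouped by feature."""
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        in_price, out_price = prices[r.model]
        totals[r.feature] += (r.input_tokens / 1e6) * in_price + (r.output_tokens / 1e6) * out_price
    return dict(totals)
```

Grouping by `(feature, model)` instead of `feature` alone shows which features would benefit most from being routed to a cheaper model.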
Cost optimization is not a one-time project — it is a continuous practice. As models get cheaper and your usage patterns evolve, re-evaluate your routing thresholds quarterly. A model routing decision that was correct in January may be suboptimal by July as new models launch and prices drop.
Your clinical note summarisation tool is processing 500,000 requests per month. Your current setup routes all requests to Claude Sonnet at $3 per million input tokens. At roughly 1,200 input tokens per request, that is about 600 million input tokens and approximately $1,800 per month for input alone, before output tokens. Your CTO wants to cut inference costs by at least 50% without degrading the quality of critical summaries.
Work with the cost advisor to: design a tiered routing strategy specific to clinical note summarisation, estimate the cost impact of each optimization, decide where semantic caching makes sense in this use case, and identify which requests absolutely must use the most capable model.