When Zillow launched its iBuying arm, Zillow Offers, engineers faced a deceptively simple specification: "predict home prices accurately." The business went on to lose a reported $881 million in 2021, and Zillow announced its shutdown in November 2021. Post-mortems pointed to a requirements failure as the root cause: the team had not adequately constrained the model's operational envelope, had ignored distributional shift in volatile markets, and had never defined acceptable error tolerances in dollar terms before deployment.
AI system design begins with a requirements and constraints document, not with a model choice, a cloud provider, or a dataset. The document forces three honest conversations: what success looks like in measurable terms, what the system must never do, and what resources are actually available.
Google's Site Reliability Engineering practice, documented publicly since 2016, introduced Service Level Objectives (SLOs) as the translation layer between business intent and engineering decisions. AI systems need an extended version: SLOs for accuracy, latency, fairness, and degradation behavior under distribution shift.
| Dimension | What to specify | Example metric |
|---|---|---|
| Performance | Accuracy floor, latency ceiling, throughput | P99 inference <120ms; F1 ≥ 0.88 on held-out test set |
| Operational | Uptime, graceful degradation, rollback triggers | 99.9% uptime; fallback to rule-based system if model error rate >5% |
| Data | Volume, velocity, freshness, legal basis | Retrain on data ≤ 30 days old; no PII in training features |
| Resource | Budget, compute, team skills, timeline | Inference cost ≤ $0.002 per request; single ML engineer on-call |
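One way to keep a table like this honest is to store each row as data and check measurements against it automatically. The sketch below is a minimal illustration, not a prescribed format: the `Slo` class, the metric names, and `violated_slos` are hypothetical, and the thresholds are copied from the example column above (the latency ceiling is treated as inclusive for simplicity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """One measurable requirement: a named metric and the bound it must respect."""
    name: str
    threshold: float
    higher_is_better: bool

    def is_met(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


# Thresholds taken from the example column of the table; names are illustrative.
SLOS = [
    Slo("f1_held_out", 0.88, higher_is_better=True),
    Slo("p99_latency_ms", 120.0, higher_is_better=False),
    Slo("cost_per_request_usd", 0.002, higher_is_better=False),
    Slo("uptime_fraction", 0.999, higher_is_better=True),
]


def violated_slos(observed: dict) -> list:
    """Return the names of SLOs that are missed or have no measurement at all."""
    violations = []
    for slo in SLOS:
        value = observed.get(slo.name)
        if value is None or not slo.is_met(value):
            violations.append(slo.name)
    return violations
```

Treating a missing measurement as a violation is a deliberate choice in this sketch: an SLO that nobody measures cannot be claimed as met.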
Every AI system serves multiple stakeholders with conflicting incentives. The product team wants features shipped fast. The legal team wants audit trails. The operations team wants stability. The end user wants accuracy and speed. A requirements process that interviews only one group produces a system optimized for one constituency at the expense of others.
The EU AI Act (2024) now legally mandates that high-risk AI systems document affected stakeholder groups and their risk exposure before deployment, which makes stakeholder mapping a compliance requirement and not just good practice.
Write your requirements document as if the ML model does not exist yet. Every requirement must be verifiable without knowing which algorithm you will use. If a requirement references a specific model architecture, it is a design decision, not a requirement.
Consider the transformation of a real product brief from a healthcare AI startup (2022, published in an IEEE case study): the original ask was "detect abnormal lab results." The engineering team ran a structured requirements workshop and produced: sensitivity ≥ 0.94 for critical abnormals (K+ >6.0 mEq/L), specificity ≥ 0.88, false negative alert latency ≤ 2 minutes from result availability, zero PHI persisted beyond the inference window, and automatic escalation to a human reviewer for edge cases flagged by a calibrated uncertainty threshold.
Each of those five sub-requirements maps to a testable acceptance criterion. The vague original goal maps to nothing testable at all.
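To make "testable acceptance criterion" concrete, the first two sub-requirements can be written as a check that runs against any labeled evaluation set, regardless of which model produced the predictions. This is a minimal sketch: the function names are hypothetical, labels are assumed to be 1 for a critical abnormal and 0 otherwise, and only the two thresholds quoted above come from the text.

```python
def sensitivity_and_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) from binary labels and predictions."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 0:
            tn += 1
        else:
            fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("evaluation set needs both positive and negative cases")
    return tp / (tp + fn), tn / (tn + fp)


def meets_acceptance(y_true, y_pred, min_sensitivity=0.94, min_specificity=0.88):
    """True only if both thresholds from the requirements document are satisfied."""
    sensitivity, specificity = sensitivity_and_specificity(y_true, y_pred)
    return sensitivity >= min_sensitivity and specificity >= min_specificity
```

Nothing in the check refers to an algorithm or architecture, which is the property the requirements document is meant to preserve.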
In the Lab for this lesson you will conduct a live requirements elicitation session with an AI mentor. You will be given a real-world scenario and must identify stakeholders, define measurable success criteria, specify hard constraints, and produce a structured requirements document, all through guided dialogue.
You are a lead engineer at a logistics company. Your VP just said: "We need AI to reduce delivery failures." That is your entire brief. Through dialogue with your AI design mentor, you must turn this vague directive into a structured requirements document covering: measurable success criteria, hard constraints, stakeholder impacts, and rollback conditions.
The mentor will challenge your assumptions, push for specificity, and flag anything that is not testable. Ask questions, propose metrics, and defend your choices.
Netflix's recommendation system has cycled through three distinct architectural generations since 2016, each documented in their engineering blog. The first was a batch-offline architecture: models trained weekly, recommendations served from a precomputed cache. Fast, cheap, but stale. The second generation added an online layer: lightweight models ran at request time to personalize against the weekly batch candidates. The third, post-2020, introduced a two-tower neural architecture with near-real-time feature computation. Each transition was forced by a measurable failure of the prior architecture to meet new user expectations around content freshness.
| Pattern | Latency | Freshness | Cost | Complexity |
|---|---|---|---|---|
| Batch Offline | ~1ms (cache hit) | Hours–days | Very low | Low |
| Lambda | 10–200ms | Seconds–minutes | Medium | High (two code paths) |
| Kappa | 10–500ms | Seconds | Medium-high | Medium |
| RAG | 500ms–3s | Near-real-time (index) | High (LLM tokens) | Medium |
| Agentic | 1s–60s+ | Real-time (tool calls) | Very high | Very high |
A critical design decision often missed by teams new to production AI: is your ML model one component inside a larger software system, or is the model the system itself? Uber's surge pricing, documented in their 2018 engineering post, uses ML as a module inside a larger pricing engine: the ML output is clipped, overridden by business rules, and A/B tested against heuristics. This is model-in-the-loop.
By contrast, GitHub Copilot (launched 2021) is model-as-the-system: the model output goes directly to the user with minimal post-processing. This shifts the entire reliability, safety, and quality burden onto the model itself, which has fundamentally different architectural implications for monitoring, fallback, and versioning.
Lambda architecture (maintaining parallel batch and streaming code paths) is often chosen prematurely. Jay Kreps, who coined the term "Kappa architecture" at LinkedIn in 2014, argued that keeping two code paths in sync is a burden that compounds over time, and that many teams would be better served by starting with a single streaming path.
Map your requirements document from Lesson 1 directly to pattern constraints: if your latency SLO is under 50ms, RAG and agentic patterns are eliminated, since their latency floors in the table above start at 500ms and 1s. If your cost budget is under $0.001 per request at scale, RAG with large models is eliminated. If your data has a legal freshness requirement (e.g., financial regulations requiring intraday recalculation), batch-offline is eliminated. The requirements document should make the architecture choice almost automatic.
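The elimination logic can be sketched as a simple filter over the pattern table. Everything numeric here is an assumption made for illustration: `best_latency_ms` reads the lower bound of each latency range in the table, `best_freshness_s` is a rough numeric reading of the freshness column, and applying the $0.001 cost rule to the agentic pattern as well as RAG is an extrapolation from its "Very high" cost rating. The names `Pattern` and `feasible_patterns` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    best_latency_ms: float   # lower bound of the latency column (assumed reading)
    best_freshness_s: float  # assumed numeric reading of the freshness column


PATTERNS = [
    Pattern("batch_offline", 1, 3600),  # "hours to days": best case about an hour
    Pattern("lambda", 10, 1),
    Pattern("kappa", 10, 1),
    Pattern("rag", 500, 60),            # index refresh interval is an assumption
    Pattern("agentic", 1000, 0),
]

# Assumption: patterns that pay per-request LLM costs cannot meet very small budgets.
LLM_BACKED = {"rag", "agentic"}


def feasible_patterns(latency_slo_ms, freshness_slo_s, cost_budget_usd):
    """Return the patterns that survive every hard constraint."""
    survivors = []
    for pattern in PATTERNS:
        if pattern.best_latency_ms > latency_slo_ms:
            continue  # cannot meet the latency SLO even in the best case
        if pattern.best_freshness_s > freshness_slo_s:
            continue  # data would be staler than the requirement allows
        if pattern.name in LLM_BACKED and cost_budget_usd < 0.001:
            continue  # per-request token cost exceeds the budget
        survivors.append(pattern.name)
    return survivors
```

For example, a 50ms latency SLO with a five-minute freshness requirement leaves only the two streaming-capable patterns; the filter narrows the choice, and the remaining decision between them rests on team and maintenance constraints.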
In Lab 2 you will receive a set of requirements (latency, cost, freshness, team constraints) and work with the AI mentor to select and justify an architecture pattern. You will then sketch the key components and identify the top three failure modes of your chosen design.
You are designing an AI system for a retail bank's fraud detection pipeline. The requirements are: P99 latency ≤ 80ms, data freshness ≤ 5 minutes, inference cost ≤ $0.0005 per transaction, 99.95% uptime, team of 3 ML engineers.
Work with the architect mentor to: select the right architecture pattern for these constraints, sketch the key components, and identify the top 3 failure modes of your design. You must justify every choice with reference to the requirements.
In 2017, Amazon scrapped an internal AI recruiting tool after discovering it had been systematically downgrading résumés containing the word "women's" (as in "women's chess club"). The root cause was not the model architecture. It was the feature engineering pipeline: résumés were represented in ways that encoded historical hiring biases present in the training labels. The data pipeline had no bias audit stage, no feature documentation, and no mechanism to detect that protected attributes were leaking into learned representations. The case, reported by Reuters in 2018, became a canonical case study in pipeline design responsibility.
Reliable ML pipelines share a common structure regardless of domain. Skipping any stage is a documented source of production failures.
One of the most common production ML bugs is training-serving skew: the features computed during training are subtly different from the features computed at inference time. Google's ML engineers estimated in their 2021 Practitioners Guide to MLOps whitepaper that training-serving skew accounts for the majority of unexplained production performance degradation in ML systems.
This happens because training pipelines and serving pipelines are often written by different people, in different languages, at different times. The canonical fix is a feature store: a shared system that defines each feature once and serves the same computation to both training and inference. Uber's Michelangelo platform (documented in their 2017 engineering blog) was one of the first production feature stores, and Feast (open-sourced by Gojek in 2019) became a widely adopted open-source option.
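The core idea behind a feature store can be shown without any infrastructure: one feature definition, imported by both the training job and the serving path. The sketch below is a toy illustration of that principle, not how Michelangelo or Feast are implemented; the field names (`amount`, `account_opened_at`, `event_time`) and all three functions are hypothetical.

```python
import math
from datetime import datetime, timezone


def compute_features(raw: dict, as_of: datetime) -> dict:
    """Single feature definition shared by the training and serving code paths."""
    return {
        "log_amount": math.log1p(max(raw["amount"], 0.0)),
        "account_age_days": (as_of - raw["account_opened_at"]).days,
        "is_weekend": as_of.weekday() >= 5,
    }


def build_training_row(historical_event: dict) -> dict:
    # Training replays history, so "now" is the time the event actually happened.
    return compute_features(historical_event, as_of=historical_event["event_time"])


def build_serving_row(request: dict, now: datetime) -> dict:
    # Serving uses the wall clock, but runs the exact same transformation.
    return compute_features(request, as_of=now)
```

Because both paths call `compute_features`, a change to a transformation (for example, switching `log1p` to a different scaling) cannot reach one path without reaching the other.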
Great Expectations (open-sourced 2019, now maintained by a dedicated company) helped popularize the idea of data contracts: machine-readable specifications of what valid data looks like, which run as automated pipeline gates. A data contract specifies column existence, type, range, null rate, uniqueness, and distributional properties. When incoming data violates the contract, the pipeline raises an alert before the bad data reaches the model.
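A contract of this kind can be sketched in plain Python to show what the gate checks. This deliberately does not use the Great Expectations API; the `CONTRACT` layout, the column names, and `validate_batch` are hypothetical, and only existence, type, range, and null rate are covered.

```python
# Hypothetical contract: one rule set per column.
CONTRACT = {
    "age": {"type": int, "min": 18, "max": 120, "max_null_rate": 0.0},
    "income": {"type": float, "min": 0.0, "max": 1e7, "max_null_rate": 0.05},
}


def validate_batch(rows, contract):
    """Return a list of human-readable violations; an empty list means the batch passes."""
    if not rows:
        return ["batch is empty"]
    violations = []
    for column, rule in contract.items():
        if any(column not in row for row in rows):
            violations.append(f"{column}: column missing")
            continue
        values = [row[column] for row in rows]
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > rule["max_null_rate"]:
            violations.append(f"{column}: null rate {null_rate:.2f} above limit")
        present = [v for v in values if v is not None]
        if any(not isinstance(v, rule["type"]) for v in present):
            violations.append(f"{column}: unexpected type")
            continue
        if any(v < rule["min"] or v > rule["max"] for v in present):
            violations.append(f"{column}: value out of range")
    return violations
```

In a pipeline, a non-empty result would stop the run or page the owner of the upstream source before any training job reads the batch.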
Spotify's ML engineering team published a 2021 post describing how they use data contracts across 200+ data pipelines to catch upstream schema changes before they corrupt model training jobs. The contract approach turned a reactive debugging task into a proactive quality gate.
Data leakage (when information from the future or from the label leaks into training features) is the most seductive pipeline bug. The model trains to perfect accuracy. The pipeline looks clean. Only in production does performance collapse. Leakage gates must be architectural, not just practices: temporally-indexed train/test splits and feature provenance logging should be built into the pipeline infrastructure, not left to individual engineer discipline.
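Both gates named above fit in a few lines once every row carries timestamps. The sketch assumes a hypothetical row layout in which `event_time` is when the prediction would have been made and `feature_times` records when each feature value was observed; the function names are illustrative.

```python
from datetime import datetime


def temporal_split(rows, cutoff: datetime):
    """Train strictly on events before the cutoff, test on events at or after it."""
    train = [row for row in rows if row["event_time"] < cutoff]
    test = [row for row in rows if row["event_time"] >= cutoff]
    return train, test


def leaking_features(row):
    """Names of features observed after the moment the prediction would be made."""
    return sorted(
        name
        for name, observed_at in row["feature_times"].items()
        if observed_at > row["event_time"]
    )


def leakage_gate(rows):
    """Raise if any row contains a future-dated feature; intended to run in the pipeline."""
    offenders = {row["id"]: leaking_features(row) for row in rows}
    offenders = {row_id: names for row_id, names in offenders.items() if names}
    if offenders:
        raise ValueError(f"future-dated features found: {offenders}")
```

Running `leakage_gate` as a pipeline stage, instead of relying on review, is what makes the check architectural: a leaky feature fails the build no matter who wrote it.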
| Dimension | Option A | Option B |
|---|---|---|
| Freshness | Batch-computed (hourly/daily) | Streaming-computed (seconds) |
| Serving | Offline store (Parquet/S3) | Online store (Redis/DynamoDB) |
| Point-in-time | Manual temporal joins | Built-in time-travel queries |
| Governance | Shared schema registry | Per-team feature namespaces |
In Lab 3 you will audit a realistic "broken" data pipeline. The mentor will describe a pipeline with embedded flaws: training-serving skew, a leakage risk, and a missing validation gate. Your job is to identify each flaw, explain the failure mode it creates, and redesign the pipeline to fix it.
A colleague's credit scoring pipeline just went to production. Six weeks later, the model's Gini coefficient has fallen from 0.71 to 0.48 and the team can't explain why. You've been brought in to audit the pipeline. Here's what you know:
When COVID-19 lockdowns began in March 2020, virtually every ML system trained on pre-pandemic data experienced sudden, severe distributional shift. Demand forecasting models at retailers like Target and Walmart, documented in supply chain industry reports, began generating absurd reorder recommendations for toilet paper and hand sanitizer because training data contained no event with that demand signature. The systems kept producing outputs with no errors, no alerts, and no monitoring flags. They simply produced confidently wrong predictions. Teams discovered the failures through business outcomes (empty shelves, excess inventory) rather than through technical alerts. The core failure was the absence of concept drift detection in the monitoring layer.
Production ML systems require monitoring at three distinct levels, each catching different failure classes. Many teams implement only the first layer and call it done.
| Layer | What it monitors | Alert trigger examples |
|---|---|---|
| Infrastructure | CPU, memory, disk, network, container health | GPU utilization >95%, inference pod restarts >3/hour |
| Data / Feature | Input distribution, null rates, schema drift | Feature X mean shifted >2σ from training baseline; null rate >5% |
| Model / Concept | Prediction distribution, confidence calibration, downstream outcomes | Output entropy increased 30%; business KPI decoupled from model confidence |
One of the most widely deployed drift detection methods in production systems is the Population Stability Index (PSI), which originated in credit scoring and appears in risk-modeling literature from the 1990s. PSI measures the shift in a feature or output distribution relative to a reference baseline. By convention, PSI < 0.1 is considered stable, 0.1–0.25 warrants investigation, and > 0.25 triggers retraining or rollback.
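PSI is computed by binning both distributions and summing (current share − reference share) × ln(current share / reference share) over the bins. The sketch below is a minimal standard-library version under stated assumptions: equal-width bins spanning the reference range, out-of-range values clamped into the edge bins, and a small `eps` floor so empty bins do not produce a division by zero. The function names and the bin count are illustrative choices, and production tools typically offer quantile binning as well.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` relative to `reference`."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference feature

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)  # clamp out-of-range values
            counts[index] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(count / len(values), eps) for count in counts]

    ref_shares = proportions(reference)
    cur_shares = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


def drift_status(psi_value):
    """Map a PSI value onto the three conventional bands quoted in the text."""
    if psi_value < 0.1:
        return "stable"
    if psi_value <= 0.25:
        return "investigate"
    return "retrain_or_rollback"
```

The reference sample is usually the training data or a known-good production window; the result is sensitive to the binning scheme, so thresholds should be validated against the same binning that will run in production.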
Evidently AI (launched 2021), Whylogs (WhyLabs, 2020), and Arize AI (2020) all built commercial products around this core statistical concept, adding dashboards and alerting infrastructure. Meta's internal ML monitoring system, described in their 2022 engineering blog, monitors over 200 features per model with automated PSI gates that pause production inference when drift exceeds thresholds.
A system that fails gracefully is explicitly designed for its failure modes before they occur. The pattern has three components:
Many production models produce high-confidence scores even when they are wrong. This is particularly true of neural networks, which are known to be poorly calibrated out of the box. Guo et al.'s seminal 2017 paper "On Calibration of Modern Neural Networks" demonstrated that modern deep networks are significantly more overconfident than older, shallower networks, and Geifman & El-Yaniv's 2017 research on selective classification examined when a network's confidence is reliable enough to act on.
Temperature scaling (Guo et al., 2017) became the standard post-hoc calibration technique: a single scalar parameter T is fit on a held-out validation set to soften the model's output distribution. Without calibration, confidence thresholds in graceful degradation systems are meaningless, because a model outputting 0.95 confidence when it is wrong provides no safety benefit.
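Temperature scaling divides the logits by T before the softmax and picks the T that minimizes negative log-likelihood on held-out data. The sketch below is a simplified illustration: a coarse grid search over T stands in for a proper optimizer, the grid range of 0.5 to 5.0 is an arbitrary assumption, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(logit_rows, labels, temperature):
    """Mean NLL of the true labels under temperature-scaled probabilities."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[label])
    return total / len(labels)


def fit_temperature(logit_rows, labels):
    """Pick T from a coarse grid (0.5 to 5.0) that minimizes validation NLL."""
    candidates = [0.5 + 0.05 * step for step in range(91)]
    return min(candidates, key=lambda t: negative_log_likelihood(logit_rows, labels, t))
```

A fitted T above 1 indicates the model was overconfident; dividing by it lowers the top probability without changing which class is predicted, so accuracy is unaffected while the confidence scores become usable as thresholds.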
Before retiring an old model, run the new model in shadow mode: both models process every request, but only the old model's output is served. Log both outputs and compare. Booking.com's ML engineering team documented this pattern in 2019 as their standard release process for high-traffic ML systems, with shadow mode running for at least one business cycle before a new model takes production traffic.
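The mechanics of shadow mode can be sketched as a thin wrapper around the two models. This is an illustrative outline, not Booking.com's implementation: the function names, the in-memory log, and the 0.05 disagreement tolerance are assumptions, and models are assumed to be callables that return a numeric score.

```python
def serve_with_shadow(features, live_model, shadow_model, comparison_log):
    """Serve the live model's output; run the shadow model only for logging."""
    live_output = live_model(features)
    try:
        shadow_output = shadow_model(features)
    except Exception:
        # A crashing candidate must never affect the response that is served.
        shadow_output = None
    comparison_log.append(
        {"features": features, "live": live_output, "shadow": shadow_output}
    )
    return live_output


def disagreement_rate(comparison_log, tolerance=0.05):
    """Fraction of requests where the shadow failed or differed beyond tolerance."""
    if not comparison_log:
        return 0.0
    disagreements = sum(
        1
        for entry in comparison_log
        if entry["shadow"] is None or abs(entry["live"] - entry["shadow"]) > tolerance
    )
    return disagreements / len(comparison_log)
```

In a real system the shadow call would usually run asynchronously so it cannot add latency to the served path; the synchronous version here keeps the control flow easy to read.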
The requirements document from Lesson 1 defined your SLOs. The monitoring system from this lesson measures them. The connection is: every SLO must have a corresponding alert, and every alert must have a documented response procedure. This documentation, often called a runbook, specifies exactly what an on-call engineer does when an alert fires: check these dashboards, run this query, trigger this rollback, page this escalation path.
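The SLO-to-alert-to-runbook chain is easy to verify mechanically if each link is recorded as data. The sketch below assumes a hypothetical layout in which each alert names the SLO it protects and runbooks are keyed by alert name; `coverage_gaps` is an illustrative helper, not part of any monitoring product.

```python
def coverage_gaps(slo_names, alerts, runbooks):
    """Report SLOs that no alert protects and alerts that have no runbook."""
    covered_slos = {alert["slo"] for alert in alerts}
    return {
        "slos_without_alert": sorted(set(slo_names) - covered_slos),
        "alerts_without_runbook": sorted(
            alert["name"] for alert in alerts if alert["name"] not in runbooks
        ),
    }
```

A check like this can run in continuous integration against the monitoring configuration, so a new SLO or alert cannot be merged without the rest of the chain.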
Stripe's on-call engineering culture, described in their public engineering blog posts from 2020β2023, treats every alert without a runbook as a system design defect. An alert that requires human judgment to interpret is an incomplete alert.
In Lab 4 you will design a complete monitoring and graceful degradation plan for a production AI system. The mentor will give you a system description and a list of observed failure events, and you must: specify the three monitoring layers, define PSI-based alert thresholds, design the fallback chain, and write a runbook entry for the most critical alert. This is the capstone lab, so expect the mentor to push hard on completeness.
You maintain a real-time loan approval AI system at a fintech. It processes 50,000 applications per day. Last Tuesday at 14:32 UTC, the model's approval rate dropped from a stable 68% to 41% over 90 minutes, then recovered, with no alerts firing and no infrastructure issues. The business only noticed because a VP asked why application-to-funding conversion had dropped.
The incident is over, but you need to prevent the next one. Design a complete monitoring and degradation system by working through this with the mentor: