Module 7 · Lesson 1

Requirements & Constraints Analysis

Before writing a line of code, the system's boundaries decide everything.
How do you translate a vague business need into a precise AI system specification?

When Zillow launched its iBuying arm, Zillow Offers, engineers faced a deceptively simple specification: "predict home prices accurately." The business went on to lose roughly $881 million in 2021, and Zillow announced its shutdown in November of that year. Post-mortem analyses point to a requirements failure: the team had not adequately constrained the model's operational envelope, had not accounted for distributional shift in volatile markets, and had never defined acceptable error tolerances in dollar terms before deployment.

Why Requirements Come Before Architecture

AI system design begins with a requirements and constraints document, not with a model choice, a cloud provider, or a dataset. The document forces three honest conversations: what does success look like in measurable terms, what can the system never do, and what resources are actually available.

Google's Site Reliability Engineering practice, documented publicly since 2016, introduced Service Level Objectives (SLOs) as the translation layer between business intent and engineering decisions. AI systems need an extended version: SLOs for accuracy, latency, fairness, and degradation behavior under distribution shift.

The Four Constraint Dimensions
Dimension | What to specify | Example metric
--- | --- | ---
Performance | Accuracy floor, latency ceiling, throughput | P99 inference < 120 ms; F1 ≥ 0.88 on held-out test set
Operational | Uptime, graceful degradation, rollback triggers | 99.9% uptime; fallback to rule-based system if model error rate > 5%
Data | Volume, velocity, freshness, legal basis | Retrain on data ≤ 30 days old; no PII in training features
Resource | Budget, compute, team skills, timeline | Inference cost ≤ $0.002 per request; single ML engineer on-call
Stakeholder Mapping

Every AI system serves multiple stakeholders with conflicting incentives. The product team wants features shipped fast. The legal team wants audit trails. The operations team wants stability. The end user wants accuracy and speed. A requirements process that interviews only one group produces a system optimized for one constituency at the expense of others.

The EU AI Act (2024) now legally mandates that high-risk AI systems document affected stakeholder groups and their risk exposure before deployment, which makes stakeholder mapping a compliance requirement, not just good practice.

Design Principle

Write your requirements document as if the ML model does not exist yet. Every requirement must be verifiable without knowing which algorithm you will use. If a requirement references a specific model architecture, it is a design decision, not a requirement.

From Vague Goal to Testable Specification

Consider the transformation of a real product brief from a healthcare AI startup (2022, published in an IEEE case study): the original ask was "detect abnormal lab results." The engineering team ran a structured requirements workshop and produced five requirements: sensitivity ≥ 0.94 for critical abnormals (K+ > 6.0 mEq/L), specificity ≥ 0.88, alert latency ≤ 2 minutes from result availability, zero PHI persisted beyond the inference window, and automatic escalation to a human reviewer for edge cases flagged by a calibrated uncertainty threshold.

Each of those five sub-requirements maps to a testable acceptance criterion. The vague original goal maps to nothing testable at all.
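To make "testable acceptance criterion" concrete, here is a minimal Python sketch that checks three of the five requirements against evaluation results. The thresholds come from the case above; the class and function names and the evaluation counts are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from evaluating a classifier on a labelled held-out set."""
    true_pos: int
    false_neg: int
    true_neg: int
    false_pos: int


def sensitivity(c: ConfusionCounts) -> float:
    # Fraction of truly abnormal results that were flagged.
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ConfusionCounts) -> float:
    # Fraction of normal results that were correctly left unflagged.
    return c.true_neg / (c.true_neg + c.false_pos)


def check_acceptance(c: ConfusionCounts, alert_latency_seconds: float) -> Dict[str, bool]:
    """Return pass/fail per criterion; thresholds are taken from the lesson text."""
    return {
        "sensitivity >= 0.94": sensitivity(c) >= 0.94,
        "specificity >= 0.88": specificity(c) >= 0.88,
        "alert latency <= 2 min": alert_latency_seconds <= 120.0,
    }


# Hypothetical evaluation numbers, for illustration only.
counts = ConfusionCounts(true_pos=95, false_neg=5, true_neg=900, false_pos=100)
report = check_acceptance(counts, alert_latency_seconds=90.0)
print(report)
```

Nothing in the check depends on which model produced the predictions, which is the property the design principle above asks for.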

SLO: Service Level Objective. A target value for a service reliability metric, forming the contract between engineering and business stakeholders.
Distributional Shift: When the statistical properties of production data diverge from training data, degrading model performance without triggering obvious errors.
Operational Envelope: The bounded set of conditions under which a system is designed and tested to perform reliably.
Lab Preview

In the Lab for this lesson you will conduct a live requirements elicitation session with an AI mentor. You will be given a real-world scenario and must identify stakeholders, define measurable success criteria, specify hard constraints, and produce a structured requirements document, all through guided dialogue.

Lesson 1 Quiz

Requirements & Constraints

Three questions. Choose the best answer.
1. Zillow's Offers program failure in 2021 is primarily attributed to which systems design failure?
Correct. Post-mortems identified the root cause as a requirements failure: the team never defined acceptable dollar-error tolerances or constrained the model's operational envelope for volatile markets.
Not quite. The fundamental issue was upstream: the requirements never specified what "accurate" meant in dollar terms or defined market conditions that would invalidate the model.
2. Which of the following is a properly formed AI system requirement?
Correct. This is testable, measurable, and architecture-agnostic: the hallmarks of a valid system requirement.
Not quite. Valid requirements must be measurable and verifiable without referencing a specific implementation. "Accurate and fast" and "whenever performance drops" are not testable.
3. The EU AI Act (2024) makes stakeholder mapping a legal requirement for which category of AI systems?
Correct. The EU AI Act applies its strictest documentation and stakeholder impact requirements specifically to high-risk AI systems enumerated in its annexes.
Not quite. The Act uses a tiered risk classification: only high-risk systems face mandatory stakeholder impact documentation before deployment.
Lesson 1 Lab: Hands-On

Requirements Elicitation Workshop

Live AI-guided session · Minimum 3 exchanges to complete

Your Mission

You are a lead engineer at a logistics company. Your VP just said: "We need AI to reduce delivery failures." That is your entire brief. Through dialogue with your AI design mentor, you must turn this vague directive into a structured requirements document covering: measurable success criteria, hard constraints, stakeholder impacts, and rollback conditions.

The mentor will challenge your assumptions, push for specificity, and flag anything that is not testable. Ask questions, propose metrics, and defend your choices.

Start by telling the mentor what you know about your system context, or ask the mentor where to begin. Aim to produce at least 3 concrete, testable requirements before the session ends.
AI Design Mentor
Requirements Workshop
Welcome to your requirements elicitation session. I'm your AI design mentor for this workshop.

Your VP has given you the brief: "We need AI to reduce delivery failures."

Before we touch architecture or models, we need to understand the problem space. Let's start with the most important question: what does a "delivery failure" actually mean in your system? Is it a missed time window, a returned package, a damaged item, a wrong address? Each has very different data signatures and success metrics.

Describe your company's delivery operation and how you currently track failures.
Module 7 · Lesson 2

Architecture Patterns for AI Systems

The shape of your system determines what failures are possible.
Which architectural pattern fits your latency, accuracy, and cost requirements, and what does each pattern trade away?

Netflix's recommendation system has cycled through three distinct architectural generations since 2016, each documented in their engineering blog. The first was a batch-offline architecture: models trained weekly, recommendations served from a precomputed cache. Fast, cheap, but stale. The second generation added an online layer: lightweight models ran at request time to personalize against the weekly batch candidates. The third, post-2020, introduced a two-tower neural architecture with near-real-time feature computation. Each transition was forced by a measurable failure of the prior architecture to meet new user expectations around content freshness.

The Five Core AI Architecture Patterns
BATCH OFFLINE
Train Job (scheduled) · Model Artifact · Precomputed Cache · Serving Layer (lookup)
↓ add request-time scoring ↓
LAMBDA
Batch Layer · Speed Layer (online model) · Serving Layer (merge)
↓ collapse batch into streaming ↓
KAPPA
Stream Processor · Online Model · Serving Layer
↓ add LLM reasoning layer ↓
RAG
Vector Store · Retrieval · LLM Generator · Response
↓ add autonomous action ↓
AGENTIC
Planner LLM · Tool Registry · Memory Store · Executor · Output
Pattern Trade-offs
Pattern | Latency | Freshness | Cost | Complexity
--- | --- | --- | --- | ---
Batch Offline | ~1ms (cache hit) | Hours–days | Very low | Low
Lambda | 10–200ms | Seconds–minutes | Medium | High (two code paths)
Kappa | 10–500ms | Seconds | Medium-high | Medium
RAG | 500ms–3s | Near-real-time (index) | High (LLM tokens) | Medium
Agentic | 1s–60s+ | Real-time (tool calls) | Very high | Very high
The Model-in-the-Loop vs. Model-as-the-System Distinction

A critical design decision often missed by teams new to production AI: is your ML model one component inside a larger software system, or is the model the system itself? Uber's surge pricing, documented in their 2018 engineering post, uses ML as a module inside a larger pricing engine: the ML output is clipped, overridden by business rules, and A/B tested against heuristics. This is model-in-the-loop.

By contrast, GitHub Copilot (launched 2021) is model-as-the-system: the model output goes directly to the user with minimal post-processing. This shifts the entire reliability, safety, and quality burden onto the model itself, which has fundamentally different architectural implications for monitoring, fallback, and versioning.

Anti-Pattern Alert

Lambda architecture (maintaining parallel batch and streaming code paths) is often chosen prematurely. Jay Kreps, who coined the term "Kappa architecture" at LinkedIn in 2014, noted that most teams that built Lambda architectures later wished they had started with Kappa, because the dual-path maintenance burden compounds over time.

Choosing Your Pattern from Requirements

Map your requirements document from Lesson 1 directly to pattern constraints. If your latency SLO is under 50ms, the RAG and agentic patterns are eliminated, since their best-case latencies in the trade-off table start at 500ms and 1s. If your cost budget is under $0.001 per request at scale, RAG with large models is eliminated. If your data has a legal freshness requirement (e.g., financial regulations requiring intraday recalculation), batch-offline is eliminated. The requirements document should make the architecture choice almost automatic.
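The elimination step can be sketched as a small filter. This is an illustrative sketch, not a selection tool: the best-case latencies are copied from the trade-off table in this lesson, while the function name, its parameters, and the rule that only Batch Offline fails an intraday freshness requirement are simplifying assumptions.

```python
from typing import List

# Best-case latency in milliseconds per pattern, from the trade-off table above.
PATTERN_MIN_LATENCY_MS = {
    "Batch Offline": 1,
    "Lambda": 10,
    "Kappa": 10,
    "RAG": 500,
    "Agentic": 1000,
}


def feasible_patterns(latency_slo_ms: float,
                      needs_intraday_freshness: bool = False) -> List[str]:
    """Return the patterns NOT eliminated by the given requirements.

    Surviving this filter is necessary, not sufficient: a pattern whose
    best case fits the SLO can still miss it under real load.
    """
    survivors = []
    for name, best_case_ms in PATTERN_MIN_LATENCY_MS.items():
        if best_case_ms > latency_slo_ms:
            continue  # cannot meet the latency SLO even in the best case
        if needs_intraday_freshness and name == "Batch Offline":
            continue  # hours-to-days staleness violates the freshness rule
        survivors.append(name)
    return survivors


print(feasible_patterns(50))
print(feasible_patterns(80, needs_intraday_freshness=True))
```

With a 50ms SLO the filter keeps Batch Offline, Lambda, and Kappa; adding an intraday freshness requirement leaves only Lambda and Kappa to be compared on cost and team capacity.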

Lambda Architecture: A data processing design with parallel batch (slow, complete) and speed (fast, approximate) layers whose outputs are merged at serving time.
Kappa Architecture: A simplified alternative to Lambda that processes all data through a single streaming pipeline, eliminating batch-layer complexity.
RAG: Retrieval-Augmented Generation. An architecture that grounds LLM outputs in dynamically retrieved external knowledge, reducing hallucination and enabling knowledge updates without retraining.
Lab Preview

In Lab 2 you will receive a set of requirements (latency, cost, freshness, team constraints) and work with the AI mentor to select and justify an architecture pattern. You will then sketch the key components and identify the top three failure modes of your chosen design.

Lesson 2 Quiz

Architecture Patterns

Three questions. Choose the best answer.
1. Netflix migrated from batch-offline to a two-tower near-real-time architecture primarily because:
Correct. Netflix's engineering blog documented that each architectural upgrade was driven by measurable freshness failures: stale recommendations in a rapidly updating content library.
Not quite. The documented driver was content freshness: weekly-trained models couldn't reflect rapidly changing catalogues and user tastes quickly enough.
2. A system requires P99 inference latency under 30ms at 10,000 requests/second. Which architecture patterns are definitively eliminated by this constraint?
Correct. RAG (500ms–3s for LLM calls) and Agentic (1s–60s+) cannot meet a 30ms P99 requirement. Batch offline (cache lookups at ~1ms) and optimized Kappa/Lambda pipelines can.
Not quite. Check the latency ranges: RAG requires 500ms–3s for LLM inference, and Agentic systems require 1–60+ seconds. Neither can meet a 30ms P99 SLO.
3. What is the key distinction between "model-in-the-loop" and "model-as-the-system" architectures?
Correct. The architectural implication is profound: model-as-the-system shifts the entire reliability and safety burden onto the model, whereas model-in-the-loop allows business logic to catch and override bad outputs.
Not quite. The key distinction is where the model output goes: whether it is filtered through business rules before reaching users, or presented directly.
Lesson 2 Lab: Hands-On

Architecture Selection & Failure Mode Analysis

Live AI-guided design session · Minimum 3 exchanges to complete

Your Mission

You are designing an AI system for a retail bank's fraud detection pipeline. The requirements are: P99 latency ≤ 80ms, data freshness ≤ 5 minutes, inference cost ≤ $0.0005 per transaction, 99.95% uptime, team of 3 ML engineers.

Work with the architect mentor to: select the right architecture pattern for these constraints, sketch the key components, and identify the top 3 failure modes of your design. You must justify every choice with reference to the requirements.

Begin by stating which architecture pattern you think fits these requirements and why. The mentor will probe your reasoning and help you identify what you might have missed.
AI Architect Mentor
Architecture Workshop
Let's design a fraud detection architecture for the retail bank. Your constraints are tight: 80ms P99 latency, 5-minute data freshness, $0.0005 per transaction cost, 99.95% uptime, and a 3-engineer team.

Before I share any opinions, I want to hear yours: which architecture pattern from Lesson 2 do you think fits this constraint set, and what's your primary reason for choosing it?

Don't worry about being wrong. Picking a pattern and defending it is the core skill here. I'll help you stress-test the choice.
Module 7 · Lesson 3

Data Pipeline Design & Feature Engineering

Your model is only as good as the data that reaches it, and most data arrives broken.
How do you design a data pipeline that makes your model's inputs reliable, reproducible, and auditable?

In 2017, Amazon scrapped an internal AI recruiting tool after discovering it had been systematically downgrading résumés containing the word "women's" (as in "women's chess club"). The root cause was not the model architecture. It was the feature engineering pipeline: résumés were represented in ways that encoded historical hiring biases present in the training labels. The data pipeline had no bias audit stage, no feature documentation, and no mechanism to detect that protected attributes were leaking into learned representations. Amazon's Reuters-reported disclosure in 2018 became a canonical case study in pipeline design responsibility.

The Four Stages Every Production Pipeline Needs

Reliable ML pipelines share a common structure regardless of domain. Skipping any stage is a documented source of production failures.

INGEST
Source connectors · Schema validation · Dead-letter queue
↓
VALIDATE
Distribution checks · Completeness gates · Bias audit hooks
↓
TRANSFORM
Feature computation · Encoding · Normalization · Join logic
↓
SERVE
Feature store · Training snapshot · Online serving
The Training-Serving Skew Problem

The single most common production ML bug is training-serving skew: the features computed during training are subtly different from the features computed at inference time. Google's ML engineers estimated in their Practitioners Guide to MLOps whitepaper (published in 2021) that training-serving skew accounts for the majority of unexplained production performance degradation in ML systems.

This happens because training pipelines and serving pipelines are written by different people, in different languages, at different times. The canonical fix is a feature store: a shared system that defines each feature computation once and serves the same computation to both training and inference. Uber's Michelangelo platform (documented in their 2017 engineering blog) was one of the first production feature stores, and Feast (open-sourced by Gojek in 2019) became a widely adopted open-source option.
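The core idea behind the fix can be shown without any feature-store product: define each feature once and call that single definition from both code paths. The sketch below is a deliberately small illustration; the function names and record fields are hypothetical and do not reflect Michelangelo's or Feast's APIs.

```python
from datetime import date
from typing import Dict


def days_since_last_payment(last_payment: date, as_of: date) -> int:
    """Single definition of the feature, shared by training and serving."""
    return (as_of - last_payment).days


def build_training_row(record: Dict) -> Dict:
    # Training computes the feature as of the historical snapshot date,
    # i.e. the date on which the prediction would have been made.
    return {
        "days_since_last_payment": days_since_last_payment(
            record["last_payment"], record["snapshot_date"]
        )
    }


def build_serving_row(record: Dict, today: date) -> Dict:
    # Serving calls the same function, so the logic cannot drift apart.
    return {
        "days_since_last_payment": days_since_last_payment(
            record["last_payment"], today
        )
    }


history = {"last_payment": date(2024, 1, 10), "snapshot_date": date(2024, 2, 1)}
live = {"last_payment": date(2024, 1, 10)}
print(build_training_row(history))
print(build_serving_row(live, today=date(2024, 2, 1)))
```

A feature store adds storage, point-in-time retrieval, and low-latency serving on top of this, but the consistency guarantee comes from the same principle of one definition and two consumers.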

Data Validation with Great Expectations

Great Expectations (open-sourced 2019, now maintained by a dedicated company) introduced the concept of data contracts: machine-readable specifications of what valid data looks like that run as automated pipeline gates. A data contract specifies column existence, type, range, null rate, uniqueness, and distributional properties. When incoming data violates the contract, the pipeline raises an alert before the bad data reaches the model.

Spotify's ML engineering team published a 2021 post describing how they use data contracts across 200+ data pipelines to catch upstream schema changes before they corrupt model training jobs. The contract approach turned a reactive debugging task into a proactive quality gate.
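As an illustration of what such a gate checks, here is a minimal hand-rolled contract in plain Python. It intentionally does not use the Great Expectations API; `ColumnContract`, `violations`, and the example columns are hypothetical names, and only type, range, and null-rate checks are covered.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ColumnContract:
    name: str
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    max_null_rate: float = 0.0


def violations(rows: List[Dict[str, Any]], contract: List[ColumnContract]) -> List[str]:
    """Return human-readable contract violations; an empty list means pass."""
    problems = []
    for col in contract:
        # A missing column reads as None, so it is counted as null.
        values = [row.get(col.name) for row in rows]
        nulls = sum(v is None for v in values)
        if rows and nulls / len(rows) > col.max_null_rate:
            problems.append(
                f"{col.name}: null rate {nulls / len(rows):.2f} "
                f"exceeds {col.max_null_rate:.2f}"
            )
        for v in values:
            if v is None:
                continue
            # Report only the first type or range violation per column.
            if not isinstance(v, col.dtype):
                problems.append(
                    f"{col.name}: expected {col.dtype.__name__}, got {type(v).__name__}"
                )
                break
            if col.min_value is not None and v < col.min_value:
                problems.append(f"{col.name}: value {v} below {col.min_value}")
                break
            if col.max_value is not None and v > col.max_value:
                problems.append(f"{col.name}: value {v} above {col.max_value}")
                break
    return problems


contract = [
    ColumnContract("income", float, min_value=0.0),
    ColumnContract("utilization", float, min_value=0.0, max_value=1.0, max_null_rate=0.05),
]
good = [{"income": 52000.0, "utilization": 0.31}, {"income": 48000.0, "utilization": 0.80}]
bad = [{"income": -10.0, "utilization": 0.31}, {"income": 48000.0, "utilization": None}]

print(violations(good, contract))
for problem in violations(bad, contract):
    print(problem)
```

In a pipeline, a non-empty result would stop the run or route the batch to a dead-letter queue instead of passing it on to feature computation.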

The Leakage Warning

Data leakage, in which information from the future or from the label leaks into training features, is the most seductive pipeline bug. The model trains to perfect accuracy. The pipeline looks clean. Only in production does performance collapse. Leakage gates must be architectural, not just practices: temporally-indexed train/test splits and feature provenance logging should be built into the pipeline infrastructure, not left to individual engineer discipline.
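A temporally-indexed split is simple to express in code. The following sketch assumes each row carries a timestamp field, here called `event_date` (a hypothetical name), and splits on a cutoff date rather than at random so that no test-period information reaches training.

```python
from datetime import date
from typing import Dict, List, Tuple


def temporal_split(rows: List[Dict], cutoff: date,
                   time_key: str = "event_date") -> Tuple[List[Dict], List[Dict]]:
    """Rows strictly before the cutoff train the model; the rest evaluate it."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test


# Illustrative rows only.
rows = [
    {"event_date": date(2024, 1, 5), "label": 0},
    {"event_date": date(2024, 2, 9), "label": 1},
    {"event_date": date(2024, 3, 2), "label": 0},
    {"event_date": date(2024, 3, 20), "label": 1},
]
train, test = temporal_split(rows, cutoff=date(2024, 3, 1))
print(len(train), len(test))
```

A random 80/20 split over the same rows could place March records in training and January records in test, which is exactly the situation the cutoff rules out.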

Feature Store Design Decisions
Dimension | Option A | Option B
--- | --- | ---
Freshness | Batch-computed (hourly/daily) | Streaming-computed (seconds)
Serving | Offline store (Parquet/S3) | Online store (Redis/DynamoDB)
Point-in-time | Manual temporal joins | Built-in time-travel queries
Governance | Shared schema registry | Per-team feature namespaces
Training-Serving Skew: Divergence between feature computations used during model training and those used at inference time, causing unexplained production accuracy drops.
Feature Store: A centralized system that computes, stores, and serves ML features consistently for both training and real-time inference.
Data Contract: A machine-readable specification of expected data properties that runs as an automated pipeline gate before data reaches models.
Lab Preview

In Lab 3 you will audit a realistic "broken" data pipeline. The mentor will describe a pipeline with embedded flaws: training-serving skew, a leakage risk, and a missing validation gate. Your job is to identify each flaw, explain the failure mode it creates, and redesign the pipeline to fix it.

Lesson 3 Quiz

Data Pipeline Design

Three questions. Choose the best answer.
1. Amazon's 2017 recruiting AI failure was primarily caused by:
Correct. The pipeline had no bias audit stage, no feature documentation, and allowed protected attributes to leak into learned representations through biased historical labels.
Not quite. The root cause was upstream: the data pipeline encoded historical bias from training labels without any bias audit mechanism. The model learned what the biased labels taught it.
2. What is the primary purpose of a feature store in an ML system?
Correct. Feature stores solve the training-serving skew problem by providing a single source of truth for feature computation used by both the training pipeline and the online serving path.
Not quite. A feature store's core function is ensuring that the features seen during training are identical to those computed at inference time, which addresses training-serving skew.
3. Data leakage is most dangerous because:
Correct. Leakage is seductive precisely because it makes everything look healthy during development. The model trains to near-perfect accuracy, and the failure only materializes in production when future information is unavailable.
Not quite. Leakage is dangerous because it is invisible during development: the model performs exceptionally well in training/validation but fails in production when the leaked information is unavailable.
Lesson 3 Lab: Hands-On

Pipeline Audit & Redesign

Find the flaws. Fix the pipeline. · Minimum 3 exchanges to complete

Your Mission

A colleague's credit scoring pipeline just went to production. Six weeks later, the model's Gini coefficient has fallen from 0.71 to 0.48 and the team can't explain why. You've been brought in to audit the pipeline. Here's what you know:

  • Training features include: income, employment length, credit utilization, and "days since last payment" computed at prediction time
  • Training data is loaded as a flat CSV, split randomly 80/20, then features are normalized using training-set statistics
  • The normalization scaler object is not saved; serving code recomputes normalization from the current day's batch
  • No data validation runs between source and the feature computation step
  • The label ("defaulted within 90 days") was joined to the feature row after the prediction date in the source SQL query
Tell the mentor which pipeline flaws you can spot and what failure mode each creates. Then propose a redesigned pipeline that fixes them. The mentor will probe your analysis and help you find anything you missed.
AI Pipeline Auditor
Pipeline Audit Lab
I've reviewed the pipeline description with you. This is a realistic scenario: pipelines like this go to production more often than teams admit.

Before I share my analysis, I want yours: how many distinct flaws can you identify in the pipeline description above, and for each one, what specific failure mode does it create?

Don't just name the flaw. Explain the mechanism. For example: "The X design means Y happens, which causes Z in production." Work through each component systematically.
Module 7 · Lesson 4

Monitoring, Observability & Graceful Degradation

A model that fails silently is more dangerous than one that fails loudly.
How do you know when your deployed AI system is failing, before your users tell you?

When COVID-19 lockdowns began in March 2020, virtually every ML system trained on pre-pandemic data experienced sudden, severe distributional shift. Demand forecasting models at retailers like Target and Walmart, documented in supply chain industry reports, began generating absurd reorder recommendations for toilet paper and hand sanitizer because training data contained no event with that demand signature. The systems kept producing outputs with no errors, no alerts, and no monitoring flags; the predictions were simply confidently wrong. Teams discovered the failures through business outcomes (empty shelves, excess inventory) rather than through technical alerts. The core failure was the absence of concept drift detection in the monitoring layer.

The Three Monitoring Layers

Production ML systems require monitoring at three distinct levels, each catching different failure classes. Many teams implement only the first layer and call it done.

Layer | What it monitors | Alert trigger examples
--- | --- | ---
Infrastructure | CPU, memory, disk, network, container health | GPU utilization > 95%; inference pod restarts > 3/hour
Data / Feature | Input distribution, null rates, schema drift | Feature X mean shifted > 2σ from training baseline; null rate > 5%
Model / Concept | Prediction distribution, confidence calibration, downstream outcomes | Output entropy increased 30%; business KPI decoupled from model confidence
Drift Detection in Practice

The most widely deployed drift detection method in production systems is Population Stability Index (PSI), originally from credit scoring (published in financial literature by 1990s risk modelers). PSI measures the shift in a feature or output distribution relative to a reference baseline. PSI < 0.1 is considered stable; 0.1–0.25 warrants investigation; > 0.25 triggers retraining or rollback.
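For reference, PSI over a set of bins is the sum of (actual − expected) × ln(actual / expected), where "expected" and "actual" are the fractions of the baseline and current samples that fall in each bin. The sketch below implements that formula in plain Python; the bin edges, the small epsilon that guards against empty bins, and the function names are illustrative choices, and the action thresholds are the conventional ones quoted above.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; `edges` are ascending interior cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(v >= e for e in edges)  # number of cut points at or below v
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` relative to the `expected` baseline."""
    total = 0.0
    for e_i, a_i in zip(bin_fractions(expected, edges), bin_fractions(actual, edges)):
        e_i, a_i = max(e_i, eps), max(a_i, eps)  # avoid log(0) on empty bins
        total += (a_i - e_i) * math.log(a_i / e_i)
    return total


def psi_action(value: float) -> str:
    if value < 0.1:
        return "stable"
    if value <= 0.25:
        return "investigate"
    return "retrain or roll back"


# Synthetic data: a uniform baseline and a sample shifted into the upper half.
baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
edges = [0.25, 0.5, 0.75]

print(psi_action(psi(baseline, baseline, edges)))
print(psi_action(psi(baseline, shifted, edges)))
```

The value depends on the binning, so the reference bins should be fixed when the baseline is captured and reused for every later comparison.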

Evidently AI (launched 2021), Whylogs (WhyLabs, 2020), and Arize AI (2020) all built commercial products around this core statistical concept, adding dashboards and alerting infrastructure. Meta's internal ML monitoring system, described in their 2022 engineering blog, monitors over 200 features per model with automated PSI gates that pause production inference when drift exceeds thresholds.

Graceful Degradation Design

A system that fails gracefully is explicitly designed for its failure modes before they occur. The pattern has three components:

PRIMARY
ML Model (full capability) · Health check · Confidence gate
↓ if health check fails OR confidence < threshold ↓
FALLBACK
Rule-based system · Prior model version · Population average
↓ if fallback also unavailable ↓
CIRCUIT BREAK
Human review queue · Static safe default · Explicit "I don't know"
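The three tiers above can be sketched as a single dispatch function. This is a minimal illustration under stated assumptions: the primary model is any callable returning a (score, confidence) pair, an exception stands in for a failed health check, and the 0.8 threshold and all names are hypothetical.

```python
from typing import Any, Callable, Optional, Tuple


def predict_with_degradation(
    features: Any,
    primary: Callable[[Any], Tuple[float, float]],
    fallback: Callable[[Any], float],
    confidence_threshold: float = 0.8,
    safe_default: Optional[float] = None,
) -> Tuple[Optional[float], str]:
    """Return (prediction, tier) where tier names the path that answered."""
    # Tier 1: primary model, accepted only if it is confident enough.
    try:
        score, confidence = primary(features)
        if confidence >= confidence_threshold:
            return score, "primary"
    except Exception:
        pass  # treated like a failed health check

    # Tier 2: rule-based system, prior model version, or population average.
    try:
        return fallback(features), "fallback"
    except Exception:
        pass

    # Tier 3: circuit break; the caller routes this to human review.
    return safe_default, "circuit_break"


def broken(_features: Any) -> Any:
    raise RuntimeError("service unavailable")


print(predict_with_degradation({}, lambda f: (0.9, 0.95), lambda f: 0.5))
print(predict_with_degradation({}, lambda f: (0.9, 0.40), lambda f: 0.5))
print(predict_with_degradation({}, broken, broken))
```

Returning the tier alongside the prediction lets the monitoring layer count how often each path answers, which is itself a useful degradation signal.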
The Confidence Calibration Problem

Many production models produce high-confidence scores even when they are wrong. Neural networks in particular are known to be poorly calibrated out of the box. Geifman & El-Yaniv's 2017 research and Guo et al.'s seminal 2017 paper "On Calibration of Modern Neural Networks" address this problem; Guo et al. showed that modern deep networks are significantly more overconfident than the older, shallower networks that preceded them.

Temperature scaling (Guo et al., 2017) became the standard post-hoc calibration technique: a single scalar parameter T is fit on a held-out validation set to soften the model's output distribution. Without calibration, confidence thresholds in graceful degradation systems are meaningless: a model outputting 0.95 confidence when it is wrong provides no safety benefit.
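A small Python sketch of the mechanism follows. Logits are divided by T before the softmax, and T is chosen to minimise negative log-likelihood on held-out data. The grid search over candidate values is a simplification for illustration (in practice T is usually fit with a numerical optimizer), and the validation logits and labels are made up.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows: Sequence[Sequence[float]], labels: Sequence[int],
             temperature: float) -> float:
    """Average negative log-likelihood of the true labels at this temperature."""
    losses = [
        -math.log(softmax_with_temperature(row, temperature)[label])
        for row, label in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows: Sequence[Sequence[float]], labels: Sequence[int],
                    candidates: Sequence[float]) -> float:
    # Grid search stands in for a proper optimizer in this sketch.
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Made-up validation set: the model is very confident in class 0 every time,
# but it is right only three times out of four.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 1, 0]

best_t = fit_temperature(val_logits, val_labels, [0.5, 1.0, 2.0, 3.0, 4.0, 5.0])
print(best_t)
print(max(softmax_with_temperature([4.0, 0.0], 1.0)))
print(max(softmax_with_temperature([4.0, 0.0], best_t)))
```

Scaling never changes which class has the highest probability, only how confident the model claims to be, which is why it can be applied after training without affecting accuracy.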

Shadow Mode Deployment

Before retiring an old model, run the new model in shadow mode: both models process every request, but only the old model's output is served. Log both outputs and compare. Booking.com's ML engineering team documented this pattern in 2019 as their standard release process for high-traffic ML systems, with shadow mode running for at least one business cycle before a new model takes production traffic.
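A request handler in shadow mode might look like the sketch below. It is illustrative only: the function and field names are hypothetical, the in-memory list stands in for a real logging sink, and a production version would typically run the shadow call asynchronously so it cannot add latency.

```python
from typing import Any, Callable, Dict, List


def handle_request(features: Any,
                   live_model: Callable[[Any], float],
                   shadow_model: Callable[[Any], float],
                   log: List[Dict]) -> float:
    """Serve the live model's output; record the shadow output for comparison."""
    served = live_model(features)
    entry = {"features": features, "live": served}
    try:
        entry["shadow"] = shadow_model(features)
    except Exception as exc:
        # A crashing shadow model must never affect the served response.
        entry["shadow_error"] = repr(exc)
    log.append(entry)
    return served


def crashing_model(_features: Any) -> float:
    raise ValueError("bad input")


comparison_log: List[Dict] = []
print(handle_request({"amount": 120}, lambda f: 0.10, lambda f: 0.35, comparison_log))
print(handle_request({"amount": 999}, lambda f: 0.80, crashing_model, comparison_log))
print(comparison_log)
```

The logged pairs are what the comparison step works from: disagreement rate, score deltas, and shadow error counts over the observation period.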

SLO Violations as Rollback Triggers

The requirements document from Lesson 1 defined your SLOs. The monitoring system from this lesson measures them. The connection is: every SLO must have a corresponding alert, and every alert must have a documented response procedure. This documentation, often called a runbook, specifies exactly what an on-call engineer does when an alert fires: check these dashboards, run this query, trigger this rollback, page this escalation path.
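The SLO-to-alert-to-runbook rule is easy to check mechanically. The sketch below is one hypothetical way to do it; the `Alert` structure, the SLO names, and the runbook paths are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Alert:
    name: str
    slo: str                       # the SLO this alert protects
    runbook: Optional[str] = None  # location of the response procedure


def design_defects(slos: List[str], alerts: List[Alert]) -> List[str]:
    """List every SLO with no alert and every alert with no runbook."""
    defects = []
    covered = {alert.slo for alert in alerts}
    for slo in slos:
        if slo not in covered:
            defects.append(f"SLO without alert: {slo}")
    for alert in alerts:
        if not alert.runbook:
            defects.append(f"Alert without runbook: {alert.name}")
    return defects


# Hypothetical SLOs and alerts.
slos = ["p99_latency", "uptime", "model_error_rate"]
alerts = [
    Alert("latency_p99_breach", slo="p99_latency", runbook="runbooks/latency.md"),
    Alert("error_rate_above_5pct", slo="model_error_rate"),
]

for defect in design_defects(slos, alerts):
    print(defect)
```

Run as part of a review or CI step, a check like this turns a missing runbook into a visible defect before the alert ever fires.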

Stripe's on-call engineering culture, described in their public engineering blog posts from 2020–2023, treats every alert without a runbook as a system design defect. An alert that requires human judgment to interpret is an incomplete alert.

Concept Drift: When the statistical relationship between input features and the target variable changes over time, degrading model performance even when input distributions appear stable.
Population Stability Index (PSI): A statistical measure of distribution shift between a reference dataset and current data; PSI > 0.25 conventionally triggers retraining or rollback review.
Temperature Scaling: A post-hoc calibration technique that divides neural network logits by a learned scalar T before the softmax, producing better-calibrated probability estimates.
Lab Preview

In Lab 4 you will design a complete monitoring and graceful degradation plan for a production AI system. The mentor will give you a system description and a list of observed failure events, and you must: specify the three monitoring layers, define PSI-based alert thresholds, design the fallback chain, and write a runbook entry for the most critical alert. This is the capstone lab, so expect the mentor to push hard on completeness.

Lesson 4 Quiz

Monitoring & Graceful Degradation

Three questions. Choose the best answer.
1. Retail ML demand forecasting failures in March 2020 (COVID-19 lockdowns) are best described as a failure of which monitoring layer?
Correct. The systems produced no errors: infrastructure was fine and features computed correctly. The failure was that the statistical relationship between inputs and demand had fundamentally changed, and no concept drift detection was present to catch it.
Not quite. Infrastructure and feature pipelines were functioning normally, so the failure was invisible at those layers. Concept drift detection (model/outcome monitoring) was the missing layer.
2. A Population Stability Index (PSI) value of 0.31 for a key model input feature indicates:
Correct. PSI > 0.25 is the conventional threshold for significant distribution shift requiring immediate action: retraining evaluation or rollback review.
Not quite. PSI thresholds: <0.1 stable, 0.1–0.25 investigate, >0.25 significant shift requiring immediate action. 0.31 is clearly in the "act now" zone.
3. Why is temperature scaling important for graceful degradation systems that use confidence thresholds?
Correct. Guo et al. (2017) demonstrated that modern neural networks are significantly overconfident. Without calibration, a model outputting 0.95 confidence when it is wrong provides no safety benefit in a graceful degradation system.
Not quite. Temperature scaling addresses calibration: neural networks tend to output high confidence even when wrong. A miscalibrated confidence score is useless as a degradation trigger.
Lesson 4 Lab: Capstone Hands-On

Monitoring Plan & Runbook Design

Complete monitoring architecture · Minimum 3 exchanges to complete

Your Mission: Capstone Lab

You maintain a real-time loan approval AI system at a fintech. It processes 50,000 applications per day. Last Tuesday at 14:32 UTC, the model's approval rate dropped from a stable 68% to 41% over 90 minutes, then recovered, with no alerts firing and no infrastructure issues. The business only noticed because a VP asked why application-to-funding conversion had dropped.

The incident is over, but you need to prevent the next one. Design a complete monitoring and degradation system by working through this with the mentor:

  • Three monitoring layers with specific metrics and alert thresholds for this system
  • PSI alert configuration: which features to monitor and at what PSI threshold to act
  • Fallback chain: what happens when each layer of the primary system fails
  • Runbook entry for the "approval rate shift >15% in 60 minutes" alert
Start by telling the mentor what you think caused the Tuesday incident and what monitoring would have caught it. The mentor will probe your diagnosis and help you build the complete plan.
AI Monitoring Architect
Capstone Lab
This is a real scenario: variants of it happen at fintech companies regularly. A 27-point approval rate swing in 90 minutes with no alerts is a serious design gap.

Before we build the monitoring plan, I need you to hypothesize: what are the top 3 most likely root causes of the Tuesday incident? For each one, tell me what monitoring signal would have caught it and at what point in the 90-minute window.

Be specific about mechanisms. "Distributional shift" is not specific enough: tell me which inputs likely shifted, in which direction, and why that would cause approval rate to drop. Then we'll design the system to catch all three.
Module 7

Module Test: AI System Design

15 questions · 80% required to pass · All four lessons covered
1. A properly formed AI system requirement must be:
Correct. Requirements must be testable and architecture-agnostic: they define what success looks like, not how to achieve it.
Requirements must be architecture-agnostic and testable before any implementation begins.
2. Zillow's $881M Offers loss in 2021 demonstrates which specific requirements failure?
Correct. The team never specified what "accurate" meant in dollar terms or defined which market conditions would invalidate the model.
The failure was upstream: requirements never defined acceptable error tolerances or market condition boundaries.
3. Which architecture pattern provides the lowest inference latency?
Correct. Cache lookups from precomputed batch predictions can achieve sub-millisecond latency, far lower than any online inference pattern.
Batch offline with precomputed results provides the lowest latency: serving is a cache lookup, not inference.
4. Jay Kreps introduced Kappa architecture at LinkedIn as a solution to what problem?
Correct. Kappa eliminates the Lambda architecture's second code path, reducing the dual-maintenance burden that compounds over time.
Kreps specifically cited the Lambda architecture's dual code path maintenance as the problem Kappa solves.
5. Training-serving skew most commonly occurs when:
Correct. Training-serving skew arises from divergent implementations of the same logical feature computation; the canonical fix is a shared feature store.
Training-serving skew is a code-level problem: the same feature is computed differently in training vs. serving code paths.
6. Amazon's 2017 recruiting AI case study demonstrates that bias in ML pipelines most often enters through:
Correct. The pipeline had no bias audit stage, so historical hiring bias in the labels trained the model to replicate that bias through its feature representations.
The bias entered through biased training labels: the pipeline encoded historical decisions without any audit for protected attribute leakage.
7. A feature store's primary architectural benefit over separate training and serving pipelines is:
Correct. Feature stores solve training-serving skew by ensuring both training jobs and online inference consume identical feature computations.
The core benefit is consistency: one feature computation used by both training and serving, eliminating divergence.
8. In the three-layer monitoring model, which layer would alert on a sudden increase in prediction output entropy?
Correct. Output entropy is a model-level signal: it reflects the distribution of predictions, which is monitored at the model/concept layer.
Output distribution metrics belong to the model/concept monitoring layer, the layer that watches what the model produces, not what it receives.
9. PSI = 0.18 for a key input feature in production indicates:
Correct. PSI 0.1–0.25 indicates moderate drift requiring investigation: not stable, not yet critical, but warranting attention and scheduled review.
PSI thresholds: <0.1 stable, 0.1–0.25 investigate (this case), >0.25 critical action required.
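The thresholds above can be made concrete with a short calculation. The following is a minimal sketch, assuming fixed bin edges chosen from the training baseline and the common PSI formula (sum over bins of (actual − expected) × ln(actual / expected)); the function names `psi` and `classify_psi` and the `eps` floor for empty bins are illustrative choices, not part of the lesson.

```python
import bisect
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between a baseline sample and a production sample.

    `edges` are sorted interior bin boundaries (assumed to be fixed from the baseline).
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def classify_psi(value):
    """Map a PSI value onto the lesson's three bands."""
    if value < 0.1:
        return "stable"
    if value <= 0.25:
        return "investigate"
    return "critical"
```

With these bands, `classify_psi(0.18)` returns `"investigate"`, matching the answer to question 9.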
10. Temperature scaling (Guo et al., 2017) is used in production AI systems primarily to:
Correct. Temperature scaling produces calibrated probabilities: a model saying 0.85 confidence should be right about 85% of the time, enabling meaningful confidence thresholds.
Temperature scaling is a calibration technique: it corrects for neural network overconfidence so confidence thresholds become actionable.
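To illustrate the idea, temperature scaling divides the logits by a single scalar T before the softmax and picks T to minimize negative log-likelihood on a held-out validation set. The sketch below is a simplification under stated assumptions: it uses a plain grid search over candidate temperatures instead of the gradient-based optimization typically used, and the function names and candidate range are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens, T < 1 sharpens)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[y])
    return total / len(labels)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate temperature with the lowest validation NLL (grid search)."""
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(91)]  # 0.5 to 5.0, assumed range
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))
```

For a model that outputs about 98% confidence but is right only 70% of the time on validation data, the fitted temperature comes out well above 1, pulling the reported confidence down toward the observed accuracy. The predicted class never changes, because dividing all logits by the same positive constant preserves their order.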
11. Shadow mode deployment, as used by Booking.com, is best described as:
Correct. Shadow mode provides zero user exposure risk (users always see the old model's output) while generating a complete production comparison of both models.
Shadow mode = both models run on all requests, only old model outputs are served, new model outputs are logged for offline comparison.
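The definition above maps onto a small request wrapper. This is a minimal sketch, assuming both models are plain callables and the comparison log is an in-memory list; `serve_with_shadow` and its parameters are hypothetical names, and a production system would typically run the shadow call asynchronously so it cannot add latency.

```python
def serve_with_shadow(request, primary_model, shadow_model, comparison_log):
    """Serve the primary model's output; run the shadow model only for logging."""
    served = primary_model(request)
    record = {"request": request, "primary": served}
    try:
        # The shadow output is logged for offline comparison and never returned.
        record["shadow"] = shadow_model(request)
    except Exception as exc:
        # A failing shadow model must not affect what the user receives.
        record["shadow_error"] = repr(exc)
    comparison_log.append(record)
    return served
```

The design choice that matters is that the return value depends only on `primary_model`: whatever the new model does, including crashing, the user sees the old model's output.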
12. The EU AI Act (2024) requires stakeholder mapping and impact documentation specifically for:
Correct. The Act uses a tiered risk approach: high-risk systems (defined in annexes) face the most stringent documentation requirements.
The EU AI Act is tiered: high-risk systems (per annexes) face mandatory stakeholder documentation; other tiers have lighter obligations.
13. Data leakage in an ML pipeline is most reliably prevented by:
Correct. Leakage prevention must be architectural, built into the pipeline structure, not dependent on individual engineer discipline, which the lesson explicitly noted fails at scale.
Code review helps but is insufficient. Leakage prevention must be architectural: temporally correct splits and feature provenance enforced by the pipeline itself.
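One way to picture "enforced by the pipeline itself" is a split step and a provenance check that every training run must pass through. The sketch below is illustrative only: it assumes each row is a dict carrying an `event_time` (when the label event happened) and a `feature_computed_at` timestamp, and both field names and function names are hypothetical.

```python
from datetime import datetime


def temporal_split(rows, cutoff: datetime, time_key="event_time"):
    """Split rows so every training row strictly precedes every test row in time."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test


def check_feature_provenance(rows, time_key="event_time",
                             feature_time_key="feature_computed_at"):
    """Raise if any feature was computed after the event it is used to predict."""
    for i, r in enumerate(rows):
        if r[feature_time_key] > r[time_key]:
            raise ValueError(
                f"row {i}: feature computed at {r[feature_time_key]} "
                f"is later than event time {r[time_key]}"
            )
```

Because the check raises instead of warning, a leaky dataset stops the pipeline rather than relying on a reviewer to notice it.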
14. An SLO (Service Level Objective) without a corresponding runbook represents:
Correct. Stripe's documented on-call engineering culture treats every alert without a runbook as an incomplete system design: the response must be specified, not improvised.
Per Stripe's engineering culture: an alert without a runbook is a design defect. The response procedure must be documented in advance, not improvised under pressure.
15. When a model's confidence calibration is poor (overconfident), graceful degradation thresholds based on confidence scores will:
Correct. An overconfident model outputs high confidence even when wrong, so the degradation threshold is never triggered on incorrect predictions, defeating its entire purpose.
Overconfident models say "I'm certain" even when they're wrong, so confidence-based fallback thresholds are never triggered when they should be.
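The failure mode can be checked numerically by asking what fraction of the model's wrong predictions actually fall below the fallback threshold. The following is a small sketch with made-up illustrative numbers; the 0.8 threshold and the function names `route` and `fallback_rate_on_errors` are assumptions for the example, not values from the lesson.

```python
def route(confidence, threshold=0.8):
    """Confidence-based graceful degradation: low confidence goes to the fallback."""
    return "model" if confidence >= threshold else "fallback"


def fallback_rate_on_errors(confidences, correct, threshold=0.8):
    """Fraction of wrong predictions that the threshold routes to the fallback.

    Returns None when there are no wrong predictions to evaluate.
    """
    errors = [c for c, ok in zip(confidences, correct) if not ok]
    if not errors:
        return None
    caught = sum(1 for c in errors if route(c, threshold) == "fallback")
    return caught / len(errors)
```

On a labeled evaluation set, a value near 0 means the threshold almost never fires on the predictions that are wrong, which is the situation described in question 15; calibrating the model first (for example with temperature scaling) is what makes the threshold meaningful.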