Lesson 1 · Module 4

Data Drift and the Silent Degradation Problem

Why models that worked perfectly at launch quietly stop working — and how to catch it before it costs you.
How do you know when the world has changed enough to break your model?

In early 2020, credit-scoring, demand-forecasting, and fraud-detection models trained on pre-pandemic data began returning confidently wrong answers across the industry. Consumer spending collapsed within weeks. Travel booking patterns inverted. Supply chains broke. The models had not changed, but the world they were trained to describe had shifted further and faster than anything in their training history. The lesson was brutal and industry-wide: a model's accuracy at launch says almost nothing about its accuracy six months later.

What Is Data Drift?

Data drift occurs when the data a deployed model sees no longer matches the data it was trained on. In its most common form, covariate shift, the statistical distribution of input features changes after deployment. The model's learned weights stay fixed, but the data it sees in production increasingly resembles something it was never trained on. Predictions degrade, often silently, because no error is thrown and no warning fires.

There are four distinct flavors worth keeping separate in your mind:

Covariate Drift

The distribution of input features X shifts. A fraud model trained when most transactions were under $200 suddenly sees a new cohort of high-value buyers — legitimate customers who look like fraudsters to the model.

Concept Drift

The relationship between X and Y changes. What "spam" looks like changes as spammers adapt tactics. A model trained in 2021 may never have seen LLM-generated phishing text, which is now a common spam tactic.

Upstream Data Drift

A pipeline dependency changes without notice — a vendor changes a field format, a sensor firmware update shifts baseline readings, a schema migration changes a column's semantics. The model receives subtly different input without anyone realizing it.

Prior Probability Shift

The base rates of target classes change. A churn model trained when the churn rate was 4%, then deployed into a market where churn hits 18%, will produce systematically overconfident "no churn" predictions.
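
To make prior probability shift concrete, the sketch below rescales a predicted churn probability when the base rate moves from 4% to 18%, as in the example above. It assumes only the class prior changed (the feature distribution within each class is treated as fixed), and the function name `adjust_for_new_prior` is hypothetical.

```python
def adjust_for_new_prior(p: float, train_rate: float, prod_rate: float) -> float:
    """Rescale a predicted positive-class probability for a new base rate.

    Assumes only the class prior changed; p(x | y) is treated as fixed.
    """
    # Reweight the positive and negative class by the ratio of new to old priors.
    pos = p * (prod_rate / train_rate)
    neg = (1.0 - p) * ((1.0 - prod_rate) / (1.0 - train_rate))
    return pos / (pos + neg)


# A customer the 4%-base-rate model scores at 10% churn risk.
adjusted = adjust_for_new_prior(0.10, train_rate=0.04, prod_rate=0.18)
print(f"original=0.10 adjusted={adjusted:.3f}")
```

Under these assumptions the unadjusted score understates churn risk: a score of 0.10 corresponds to roughly 0.37 once the higher base rate is taken into account.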

Why Drift Is Hard to Notice

The core difficulty is that ground truth labels arrive late or never. A credit model predicts default risk today, but the actual default or repayment may not be known for 12–18 months. A recommendation model predicts click probability, but user satisfaction (the real target) is never directly measured. Without labels, you cannot compute accuracy — so you need proxy signals.

This is the fundamental asymmetry of production monitoring: you have inputs immediately, but the outputs you care most about are delayed or unobservable. This is why statistical drift detection on inputs and model outputs exists as its own discipline, separate from accuracy tracking.

Real Case — Amazon Rekognition, 2018–2019

MIT Media Lab researchers reported that Amazon Rekognition's gender classification error rate was 0% for lighter-skinned men but about 31% for darker-skinned women. A training distribution skewed toward lighter faces is the usual explanation for gaps of this kind. When such a system is deployed to policing and hiring use cases, the operational population can shift toward exactly the demographics where accuracy is worst: a textbook case of deployment-time covariate shift with real-world harm.

Statistical Tests for Drift Detection

Practitioners use several statistical tools to detect whether a production window's distribution has shifted from the training baseline:

PSI: Population Stability Index. Measures how much a feature's distribution has shifted. PSI < 0.1 means stable; 0.1–0.25 means moderate change; > 0.25 means significant drift requiring investigation. Originally developed for credit scoring, now widely used in ML monitoring.

KS Test: Kolmogorov-Smirnov test. A non-parametric test comparing two continuous distributions: your training distribution vs. a production window. Returns a p-value; a very low p-value indicates the distributions differ significantly. Good for continuous features.

Chi-Squared: Used for categorical features. Compares observed category frequencies in production against expected frequencies from training. Flags when categories appear or disappear at unexpected rates.

MMD: Maximum Mean Discrepancy. A kernel-based measure comparing two probability distributions in high-dimensional space. Used for complex feature sets, embeddings, and image data where PSI/KS are insufficient.

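
As an illustration of the PSI bands listed above, here is a minimal standard-library sketch. The binning scheme (ten bins cut at baseline quantiles), the `eps` floor for empty bins, and the synthetic Gaussian data are assumptions of this example, not part of any standard; the function names are hypothetical.

```python
import math
import random


def psi(baseline, production, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges come from baseline quantiles; eps avoids log(0) for empty
    bins. Both choices are assumptions of this sketch.
    """
    ordered = sorted(baseline)
    # Interior cut points at the 1/bins, 2/bins, ... quantiles of the baseline.
    edges = [ordered[int(len(ordered) * i / bins)] for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for value in sample:
            # Bin index = number of cut points at or below the value.
            idx = sum(1 for edge in edges if value >= edge)
            counts[idx] += 1
        return [max(c / len(sample), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(production)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def psi_band(value):
    """Map a PSI value onto the three bands described above."""
    if value < 0.1:
        return "stable"
    if value <= 0.25:
        return "moderate"
    return "significant"


rng = random.Random(0)
train = [rng.gauss(100, 20) for _ in range(5000)]
same = [rng.gauss(100, 20) for _ in range(5000)]
shifted = [rng.gauss(130, 20) for _ in range(5000)]

print(psi_band(psi(train, same)), psi_band(psi(train, shifted)))
```

PSI depends on how the bins are chosen, so the same bin edges should be reused for every production window that is compared against a given baseline.
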
Key Principle

Drift detection is not a one-time check at deployment. It is a continuous process that must be woven into production operations — running on every scoring window, every day, surfacing signals before accuracy degrades enough to cause visible downstream harm.
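
A per-window check of the kind described above could look like the following sketch, which computes the two-sample KS statistic by hand and compares it to the usual large-sample critical value. The window names, sample sizes, and synthetic data are hypothetical; a production system would more likely call a statistics library than reimplement the test.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every observation equal to x in both samples.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_drifted(a, b, c_alpha=1.358):
    """Compare D against the large-sample critical value.

    c_alpha=1.358 corresponds to a significance level of about 0.05.
    """
    n, m = len(a), len(b)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) > critical


rng = random.Random(1)
baseline = [rng.gauss(0, 1) for _ in range(2000)]
windows = {
    "day_1": [rng.gauss(0, 1) for _ in range(2000)],    # same distribution
    "day_2": [rng.gauss(0.5, 1) for _ in range(2000)],  # mean shifted by 0.5
}
for name, window in windows.items():
    print(name, "drift" if ks_drifted(baseline, window) else "ok")
```

At this significance level a window drawn from the unchanged distribution will still be flagged about 5% of the time, which is one reason a single flagged window is usually treated as a prompt to investigate rather than proof of drift.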

Lesson 1 Quiz

Data Drift and the Silent Degradation Problem · 3 questions
Which type of drift describes a change in the statistical relationship between input features and the target label?
Correct. Concept drift means the mapping from X to Y has changed: what used to indicate spam no longer does, what used to predict churn no longer correlates. The inputs may shift too (covariate drift), but concept drift is specifically about the X→Y relationship.
Not quite. Covariate drift means the input distribution changed. Concept drift is when the relationship between inputs and labels changes — the same inputs now mean something different.
A PSI value of 0.31 for a feature in your production scoring window indicates:
Correct. PSI above 0.25 signals significant drift, meaning the production distribution looks materially different from training. This warrants investigation — root cause analysis before deciding whether to retrain.
Review the PSI thresholds: below 0.1 is stable, 0.1–0.25 is moderate, above 0.25 is significant. A PSI of 0.31 crosses the significant threshold and requires active investigation.
Why is drift particularly hard to detect in credit risk models compared to, say, an image classifier?
Correct. Label latency is the core problem. When you score a loan application today, you won't know if it defaults for 12–18 months. That gap means you can't compute live accuracy — making input-side drift detection essential as a proxy signal.
The central issue is label latency, not feature count or encryption. Defaults take months to materialize, so you can't measure accuracy in real time. That's why statistical drift tests on inputs are so important for credit models.

Lab 1 · Drift Detective

Practice identifying, classifying, and prioritizing data drift scenarios

Your Mission

You are a junior MLOps engineer. Your monitoring dashboard has fired three drift alerts. Work through each one with the AI advisor: classify the drift type, assess severity using PSI/KS logic, and recommend a response. Complete at least 3 exchanges to finish this lab.

Start by describing one of these scenarios: (A) Your fraud model's "transaction_amount" feature shows PSI = 0.38 this week. (B) Your churn model's predicted churn probability distribution has shifted sharply toward lower values. (C) A categorical feature "device_type" now shows 40% of values as "unknown" — up from 2% in training.
Drift Advisor: Hello, I'm your drift investigation advisor. Pick one of the three scenarios from the prompt above and walk me through what you see. We'll classify the drift type, judge its severity, and decide what to do next.
Lesson 2 · Module 4

Performance Metrics and Alert Thresholds

What to measure, when to fire an alert, and how to avoid the twin disasters of alert fatigue and missed degradation.
How do you set thresholds that catch real problems without drowning your team in false alarms?

In November 2021, Zillow announced it was shutting down Zillow Offers, writing off approximately $304 million in inventory losses, and laying off 25% of its workforce. A central cause: its iBuying algorithm's home price predictions had degraded substantially in a rapidly shifting post-pandemic housing market. The model had been trained largely on pre-pandemic data, and its monitoring had not adequately flagged the growing divergence between predicted and actual transaction prices. The question is not whether monitoring could have helped; it is why the thresholds and feedback loops were not sensitive enough to trigger action earlier.

The Monitoring Metric Stack

Production model monitoring typically involves multiple layers of metrics, each answering a different question about model health:

Data Quality: Null rate, schema violations, range breaches

Drift: PSI, KS stat, Chi-squared by feature

Output Distribution: Score distribution shift, prediction entropy

Business KPIs: Conversion rate, click rate, churn rate

Label Accuracy: AUROC, F1, RMSE (when labels are available)

System Health: Latency, throughput, error rate

The Alert Threshold Problem

Setting alert thresholds is one of the most underappreciated engineering challenges in MLOps. Too sensitive: engineers get paged at 3am for noise, start ignoring alerts, and eventually miss a real degradation event. Too lenient: a genuine drift goes undetected for weeks while business metrics silently worsen.

Four approaches have emerged as best practice in production environments:

Static Thresholds

Fixed rules: "Alert if PSI > 0.25" or "Alert if null rate > 5%." Simple to explain and audit. Problem: they don't adapt to seasonal patterns. A retail demand model will show normal drift every December that fires false alerts all winter.

Relative / Rolling Baselines

Compare the current window against the rolling average of the past N weeks, not against training. Alert if the current window deviates more than K standard deviations from the rolling mean. Adapts to seasonality; harder to tune initially.
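
A rolling-baseline rule of this kind might be sketched as follows. The window length `n`, the multiplier `k`, and the weekly PSI values are illustrative assumptions, and `rolling_alert` is a hypothetical name.

```python
import statistics


def rolling_alert(history, current, k=3.0, n=8):
    """Alert when `current` is more than k standard deviations away from
    the mean of the last n values in `history`."""
    window = history[-n:]
    if len(window) < 2:
        return False  # not enough history to estimate spread
    mean = statistics.mean(window)
    spread = statistics.stdev(window)
    if spread == 0:
        return current != mean
    return abs(current - mean) > k * spread


weekly_psi = [0.04, 0.05, 0.03, 0.06, 0.05, 0.04, 0.05, 0.06]
print(rolling_alert(weekly_psi, 0.05))  # inside the usual band
print(rolling_alert(weekly_psi, 0.20))  # far outside the usual band
```

Note that 0.20 would not breach a static "PSI > 0.25" rule, yet it is far outside what this particular feature has shown in recent weeks.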

CUSUM / Sequential Testing

Cumulative Sum control charts detect small, sustained shifts that don't spike above static thresholds. Particularly powerful for catching gradual concept drift. Used heavily in manufacturing quality control; increasingly adopted in ML monitoring.
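
The idea can be sketched with a one-sided (upward) CUSUM. The `target`, `slack`, and `threshold` values and the example series below are illustrative assumptions; in practice they would be calibrated from historical data.

```python
def cusum_alarm(values, target, slack, threshold):
    """One-sided (upward) CUSUM. Returns the index of the first alarm, or None."""
    s = 0.0
    for i, x in enumerate(values):
        # Accumulate only the excess over target + slack; never drop below zero.
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return i
    return None


# Daily mean score: stable near 0.30, then a small sustained rise to 0.33.
series = [0.30, 0.31, 0.29, 0.30, 0.31] + [0.33] * 10
alarm_index = cusum_alarm(series, target=0.30, slack=0.01, threshold=0.07)
print(alarm_index)
print(max(series) > 0.35)  # a static threshold at 0.35 never fires
```

Each day at 0.33 adds only 0.02 to the running sum, so no single point looks alarming, but the accumulated excess crosses the threshold on the fourth shifted day.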

Tiered Severity

Route alerts to different channels by severity: INFO logs minor drift, Slack warns on moderate drift, PagerDuty pages on critical. Prevents alert fatigue while ensuring the worst events get immediate human attention.

Monitoring Output Distributions Without Labels

When ground truth is delayed, the model's score distribution becomes the primary proxy for health. If a fraud model that historically scored 0.3% of transactions above the decision threshold now scores 4% above it — but the fraud rate hasn't changed — the model is clearly misbehaving. This technique is called prediction monitoring or output distribution monitoring.
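
One simple way to quantify such a jump is to compare today's flag count with what the historical flag rate would predict. The sketch below uses a normal approximation to the binomial; the daily volume of 100,000 transactions and the function name are assumptions for illustration.

```python
import math


def flag_rate_z(flagged, total, baseline_rate):
    """z-score of today's flag count against the historical flag rate,
    using the normal approximation to the binomial."""
    expected = total * baseline_rate
    std = math.sqrt(total * baseline_rate * (1.0 - baseline_rate))
    return (flagged - expected) / std


# 100,000 transactions; historically 0.3% score above the decision threshold.
z_normal = flag_rate_z(flagged=310, total=100_000, baseline_rate=0.003)
z_shifted = flag_rate_z(flagged=4_000, total=100_000, baseline_rate=0.003)
print(f"{z_normal:.1f} {z_shifted:.1f}")
```

A z-score within a few units of zero is ordinary day-to-day noise; a 4% flag rate against a 0.3% baseline is hundreds of standard deviations out and cannot be explained by sampling variation.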

Google's ML monitoring guidelines, published in their 2022 ML Engineering Practices documentation, explicitly recommend tracking the full output distribution (not just average score) because distribution shape changes often precede accuracy drops by days or weeks.

Real Case — Spotify Recommendations, circa 2019

Spotify's recommendation teams have publicly described (in MLOps community talks) how their personalization models monitor the entropy of recommendation distributions as a health signal. When entropy collapses — meaning the model is recommending a narrow set of songs to nearly everyone — it's a leading indicator of model degradation, long before any accuracy metric would catch it.
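
An entropy signal of this kind can be sketched as below. This is a generic illustration rather than Spotify's implementation; normalizing by the log of the catalog size, and the tiny four-item catalog, are assumptions of the example.

```python
import math
from collections import Counter


def recommendation_entropy(recommended, catalog_size):
    """Shannon entropy of recommended-item frequencies, scaled to [0, 1]
    by the maximum possible entropy for the catalog."""
    counts = Counter(recommended)
    total = len(recommended)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(catalog_size)


healthy = ["a", "b", "c", "d"] * 25          # recommendations spread evenly
collapsed = ["a"] * 97 + ["b", "c", "d"]     # nearly everyone gets item "a"

print(round(recommendation_entropy(healthy, 4), 2))
print(round(recommendation_entropy(collapsed, 4), 2))
```

Tracking this value per scoring window, with the same alerting rules used for other metrics, turns "the model recommends the same thing to everyone" into a measurable signal.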

Separating Model Failure from System Failure

A critical discipline in monitoring is distinguishing model quality degradation from infrastructure failures. If your feature pipeline sends null values because an upstream API went down, your fraud model will start predicting differently — but the model itself is fine. Correlating model output changes with upstream data quality metrics and system logs before escalating to a retraining decision saves enormous wasted effort.

Design Principle

Build monitoring in layers: data quality first, then drift, then output distribution, then business metrics. When an alert fires, work backwards from the metric to isolate whether the issue is data, the model, or the world. Retraining before understanding is expensive and often ineffective.

Lesson 2 Quiz

Performance Metrics and Alert Thresholds · 3 questions
Which alerting approach is best suited to detecting small, gradual shifts in model performance that never spike above a static threshold?
Correct. CUSUM accumulates small deviations over time, making it ideal for detecting sustained gradual drift that never breaches a point-in-time static threshold. It was designed for exactly this problem in manufacturing quality control.
CUSUM is the right tool here. Static thresholds miss gradual drift; label accuracy requires labels (which may be delayed); Chi-squared detects large distributional shifts, not slow accumulation. CUSUM tracks cumulative deviation over time.
Your fraud model historically flags 0.4% of transactions. Today it flags 5.2% — but the actual fraud team has only confirmed slightly elevated fraud rates. What does this most likely indicate?
Correct. A 13x increase in flagging rate while confirmed fraud rose only slightly is a red flag for model degradation or input data corruption — not a genuine fraud spike. Output distribution monitoring catches exactly this kind of signal before it becomes a costly business problem.
A 13x increase in flag rate without a corresponding 13x increase in confirmed fraud strongly suggests model degradation or upstream data issues — not genuine fraud. This is why monitoring output distributions is so valuable.
Why should engineers check data quality metrics BEFORE assuming a model needs retraining when performance drops?
Correct. If a feature pipeline sends nulls or corrupted values due to an upstream API failure, the model will behave poorly — but the model is not the problem. Retraining it on that corrupted data would learn wrong patterns. Always trace the signal back to its source before deciding on a remedy.
The key issue is causal ordering: bad inputs cause bad outputs. If data quality has degraded due to a pipeline failure, retraining on that data would embed the corruption into the new model. Fix the data problem first, then reassess model performance.

Lab 2 · Alert Threshold Design

Design a monitoring alert strategy for a real production scenario

Your Mission

You've been asked to design the monitoring alert system for a loan approval model at a mid-size bank. The model scores 50,000 applications per month. Ground truth (default/repayment) arrives 12–18 months after scoring. Work with the AI advisor to choose your metrics, set alert thresholds, assign severity tiers, and defend your choices. Complete at least 3 exchanges.

Start by telling the advisor: what are the top 2–3 metrics you'd monitor first for this loan model, and why? Consider the label latency problem.
Monitoring Advisor: Welcome to the alert design lab. You're building monitoring for a loan approval model with 12–18 month label latency. What metrics would you prioritize, and why does label latency change your strategy?
Lesson 3 · Module 4

Retraining Strategies and Trigger Logic

Scheduled, triggered, or continuous — the tradeoffs that determine whether your model stays current or becomes a liability.
How often should you retrain — and what should force an unscheduled retrain?

Google's search ranking systems require some form of continuous or near-continuous updating because the web itself changes by billions of pages every day. By 2003, Google had already developed systems to handle "freshness" — ensuring recently published content could rank appropriately without waiting for a full re-index cycle. The lesson that has rippled through the ML industry since: for systems where the data-generating process itself is continuously evolving, a retrain cadence measured in months is architecturally inappropriate. The question is never whether to retrain — it is how to build the infrastructure to do so efficiently and safely.

Three Retraining Philosophies

Organizations typically settle into one of three retraining postures, or a hybrid that combines them, each with distinct tradeoffs in cost, risk, and freshness:

Scheduled / Calendar-Based

Retrain on a fixed cadence: weekly, monthly, quarterly. Simple to operate, predictable infrastructure costs, easy to audit. Risk: if the world shifts faster than your schedule, the model degrades between retrains. Good for stable domains — insurance actuarial models, annual forecast cycles.

Triggered / Event-Based

Retrain when a monitoring signal crosses a threshold: PSI above 0.25, an accuracy drop of 3+ percentage points, or a business metric outside its control bands. More adaptive, but requires robust monitoring infrastructure and careful threshold calibration. Many production ML teams at tech companies use this approach.

Continuous / Online Learning

The model updates incrementally as new labeled data arrives — sometimes after every batch of transactions, sometimes hourly. Maximizes freshness. Requires careful safeguards against poisoning attacks, runaway learning rates, and catastrophic forgetting. Used by streaming recommendation systems and high-frequency trading models.

Hybrid Approaches

Scheduled retrain as a baseline with triggered retrain if signals cross thresholds earlier. The scheduled retrain prevents drift from compounding silently; the triggered retrain handles sudden shocks like the COVID-19 pandemic distribution shift. This is the dominant pattern at mature ML organizations.

Retrain Triggers: What Should Force an Unscheduled Retrain?

Not every drift alert should trigger a retrain. The cost of retraining — engineering time, compute, validation, deployment risk — must be weighed against the cost of operating a degraded model. Common trigger criteria include:

Accuracy Breach: When labeled data is available: measured accuracy (AUROC, F1, RMSE) drops below a defined floor established during initial model validation. This is the clearest possible signal but requires label availability.

PSI Threshold: PSI above 0.25 on critical features, especially features with high feature importance, sustained over multiple scoring windows (not just a single spike).

Business KPI Divergence: A business metric the model directly drives (approval rate, click-through rate, fraud catch rate) deviates more than N standard deviations from its control band for K consecutive periods.

Known World Event: A documented external shock (a new regulation, a market disruption, a pandemic) that is known to invalidate training assumptions. Organizations with mature ML operations have incident playbooks that pre-authorize retraining in response to defined world events.

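
The "sustained, not a single spike" rule, combined with a scheduled fallback, might be sketched like this. The three-window requirement, the 90-day maximum age, and the function name `should_retrain` are illustrative assumptions, not recommended values.

```python
def should_retrain(psi_history, days_since_retrain, *, psi_limit=0.25,
                   sustained_windows=3, max_age_days=90):
    """Hybrid rule: retrain on schedule, or earlier if PSI stays above the
    limit for several consecutive scoring windows.

    Returns "scheduled", "triggered", or None.
    """
    if days_since_retrain >= max_age_days:
        return "scheduled"
    recent = psi_history[-sustained_windows:]
    if len(recent) == sustained_windows and all(p > psi_limit for p in recent):
        return "triggered"
    return None


print(should_retrain([0.05, 0.28, 0.06], days_since_retrain=10))  # single spike
print(should_retrain([0.27, 0.30, 0.31], days_since_retrain=10))  # sustained
print(should_retrain([0.02, 0.03, 0.02], days_since_retrain=120)) # overdue
```

A real trigger would usually also require a completed root cause check before any retrain is started; this sketch only covers the threshold logic.
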
The Retraining Data Problem

Retraining is not simply "run the training pipeline again on new data." Several critical decisions must be made about what data to retrain on:

Full historical window: Use all available historical data. Preserves long-run patterns. Risk: historical data may include exactly the obsolete patterns you're trying to escape.

Sliding window: Use only the most recent N months of data. More responsive to current patterns. Risk: loses long-run signal, may overfit to a transient period.

Weighted recency: Train on the full history but assign higher loss weights to recent examples. Balances freshness and historical depth. Technically more complex but often the right answer.
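
One common way to implement weighted recency is exponential decay with a chosen half-life, sketched below. The 90-day half-life and the function name are assumptions for illustration.

```python
def recency_weights(ages_in_days, half_life_days=90.0):
    """Exponential-decay sample weights: an example that is `half_life_days`
    old counts half as much as one from today."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]


weights = recency_weights([0, 90, 180, 365])
print([round(w, 3) for w in weights])
```

Weights like these are typically passed to the training routine as per-example weights, for instance through the `sample_weight` argument that many scikit-learn estimators accept. A sliding window is the special case where the weight is 1 inside the window and 0 outside it.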

Real Case — Instacart, 2020

Instacart's data science team published a post-mortem in 2020 describing how the COVID-19 pandemic invalidated their demand forecasting models within days. Demand for staples exploded while demand for prepared foods collapsed. Their response involved rolling back to simpler heuristic models temporarily, then rapidly retraining on the emerging post-March 2020 distribution — explicitly excluding pre-pandemic data from the training window because it actively harmed forecast quality. The lesson: sometimes the right data window is a clean break, not an extension of history.

Operational Principle

Retraining is a deployment event, not just a training event. Every retrain must go through the same validation, shadow testing, and staged rollout as the original model launch. The worst outcomes in production AI come from retraining fast and deploying recklessly after a drift alert.

Lesson 3 Quiz

Retraining Strategies and Trigger Logic · 3 questions
Instacart's response to the COVID-19 demand shock illustrates which retraining data strategy?
Correct. Instacart explicitly excluded pre-pandemic data from their retrain window because that historical data encoded patterns that were not just obsolete but actively distorting new predictions. Sometimes the right answer is a clean break from history.
Instacart's key decision was to exclude pre-pandemic data — treating it as actively harmful, not just less relevant. This is a clean-break sliding window approach, not a weighting scheme or full historical approach.
What is the primary risk of online learning / continuous model updating?
Correct. Continuous learning systems are powerful but brittle. A malicious actor who controls a fraction of the incoming data stream can systematically shift the model's behavior. Runaway learning rates can amplify noise. And rapid adaptation to new patterns can erase important historical knowledge — known as catastrophic forgetting.
The core risk is adversarial and stability-related: poisoning attacks, unstable learning dynamics, and catastrophic forgetting. These require careful engineering safeguards — gradient clipping, anomaly detection on updates, and regularization to anchor model behavior.
A single PSI spike of 0.28 on a moderately important feature fires an alert on Tuesday. What is the correct immediate response according to best practice?
Correct. A single spike requires investigation, not immediate action. Check whether data quality degraded, a pipeline changed, or there's a genuine distributional shift. Sustained drift across multiple windows, combined with root cause analysis, should inform the retraining decision.
Reacting immediately to a single spike leads to expensive false-retrain cycles. Best practice is to investigate: is this a pipeline issue? A one-time event? Does the drift persist? Retraining is a deployment event with real risk — it must be justified by sustained, root-caused drift.

Lab 3 · Retrain Decision Simulation

Work through a realistic retrain-or-wait decision under uncertainty

Your Mission

You manage a retail demand forecasting model. It's late November — Black Friday week — and your model is behaving unexpectedly. Walk through the decision with the AI advisor: should you retrain, wait, roll back, or switch to a fallback heuristic? What data window would you use if you retrain? Complete at least 3 exchanges.

Scenario: Your demand forecast model is over-predicting demand for electronics by 40% this week. PSI on "product_category" has been elevated for 3 weeks. Ground truth (actual sales) is available with a 2-day lag. Last year's Black Friday data is in the training set. What's your call?
Retrain Advisor: Black Friday pressure, a misbehaving forecast, and a 3-week drift signal: this is a classic high-stakes retraining decision under time pressure. Walk me through your initial read: what's the most likely root cause, and what's your first move?
Lesson 4 · Module 4

Shadow Testing, Canary Deployments, and Champion–Challenger

The production validation methods that let you compare model versions safely — without betting the entire system on a retrain that might be worse.
How do you know the new model is actually better before you commit to it?

In 2012, LinkedIn suffered a major site outage triggered by a deploy that took down its key member feeds service. The incident accelerated adoption of what LinkedIn called "Inversion of Control" deployment practices — canary releases that route a small percentage of traffic to new code before full rollout. The ML equivalent of this hard-won lesson is the champion–challenger framework: never replace a production model wholesale with an untested successor, no matter how good the offline metrics look.

Why Offline Metrics Lie

A retrained model that scores better on a held-out validation set does not necessarily perform better in production. The gap between offline and online performance has a name — training–serving skew — and it arises from multiple sources:

Feature pipeline differences: The model was validated on batch-computed features but production uses real-time features with slightly different logic.

Feedback loops: The model's own past predictions affected the labels it's now being evaluated against.

Survivor bias in validation data: The validation set only contains requests that made it through the production pipeline, excluding edge cases that crashed it.

Temporal leakage: Training data inadvertently includes signals from the future that aren't available at inference time.

Shadow Mode Testing

Shadow mode (also called shadow deployment or dark launch) runs the challenger model in parallel with the champion model. Real production traffic is scored by both models, but only the champion's predictions are acted upon. The challenger's outputs are logged and compared. No users are affected; no business decisions change. This is the safest possible way to validate a new model against real production traffic.

Netflix has described running challenger recommendation models in shadow mode for weeks before any A/B test, collecting output distribution data to verify the new model is behaving as expected before any user sees its recommendations.
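
The core of shadow mode can be sketched in a few lines: serve the champion, record the challenger, and never let a challenger failure reach the caller. The stand-in models, the in-memory log, and the function name `score_with_shadow` are hypothetical; a real system would typically score the challenger asynchronously and write to durable storage.

```python
import logging

logger = logging.getLogger("shadow")


def score_with_shadow(features, champion, challenger, log=None):
    """Return the champion's prediction; record the challenger's for comparison.

    A challenger failure is logged and never affects the served result.
    """
    served = champion(features)
    try:
        shadow = challenger(features)
    except Exception:  # the shadow path must not break serving
        logger.exception("challenger failed")
        shadow = None
    if log is not None:
        log.append({"features": features, "champion": served, "challenger": shadow})
    return served


# Hypothetical stand-in models: any callables taking a feature dict.
def champion(f):
    return 0.2 if f["amount"] < 500 else 0.7


def challenger(f):
    return 0.1 if f["amount"] < 800 else 0.9


records = []
decision = score_with_shadow({"amount": 600}, champion, challenger, records)
print(decision, records[0]["challenger"])
```

The logged pairs are what the comparison runs on: score distributions, disagreement rate between the two models, and challenger error counts.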

Shadow Testing

Challenger runs silently alongside champion. Zero user exposure. Validates output distribution, latency, feature availability, and error rates in production conditions. Duration: days to weeks. Best for: high-risk models where any degradation is unacceptable.

Canary Deployment

Route a small percentage of real traffic (1–5%) to the challenger. Users actually receive challenger predictions. Monitor business metrics, error rates, and latency. Gradually increase percentage if metrics hold. Best for: moderate-risk models where limited exposure is acceptable.
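
A canary split is often implemented by hashing a stable identifier so that each user consistently sees the same model. The sketch below is one such scheme; the identifier format, the 10,000-bucket granularity, and the function name `route` are assumptions.

```python
import hashlib


def route(request_id: str, canary_percent: float) -> str:
    """Deterministically assign an id to 'challenger' or 'champion'.

    Hashing keeps each id on the same side for a given percentage.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999
    return "challenger" if bucket < canary_percent * 100 else "champion"


assignments = [route(f"user-{i}", canary_percent=5) for i in range(20_000)]
share = assignments.count("challenger") / len(assignments)
print(f"challenger share: {share:.3f}")
```

Because the bucket for an id never changes, raising the percentage only adds ids to the challenger group; anyone already in the canary at 1% is still in it at 5%.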

A/B Testing

Randomly split traffic 50/50 (or other ratio) between champion and challenger. Statistically compare outcomes over a defined period. Requires sufficient sample size to detect meaningful differences. Best for: when you need statistical proof of improvement before full rollout.

Champion–Challenger

The formal framework that governs the above. The champion is the current production model. The challenger is the candidate retrain. Only when the challenger has demonstrated superiority through shadow testing, canary deployment, and A/B testing does it become the new champion. The loser is archived, not deleted.

Automated Rollback Triggers

Even with staged rollout, new models can fail in unexpected ways after deployment. Mature ML systems include automated rollback triggers that instantly switch back to the champion model if the challenger breaches defined thresholds within a rollback window (typically 24–72 hours post-deployment). Triggers include: error rate spike above N%, business KPI drop of more than M%, or latency increase above P milliseconds.

Uber's Michelangelo ML platform — described in their 2017 engineering blog — includes automated rollback as a core feature, treating it as a non-negotiable safety net for any ML deployment.
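
The trigger logic described above (the thresholds N, M, and P) can be sketched as a simple comparison of challenger metrics against the champion's. This is a generic illustration, not Michelangelo's API; the metric names, limit values, and class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RollbackLimits:
    max_error_rate: float           # N, as a fraction of requests
    max_kpi_drop: float             # M, as a fraction of the champion's KPI
    max_latency_increase_ms: float  # P, in milliseconds


def rollback_reasons(challenger, champion, limits):
    """List which limits the challenger breaches; an empty list means keep serving."""
    reasons = []
    if challenger["error_rate"] > limits.max_error_rate:
        reasons.append("error_rate")
    if (champion["kpi"] - challenger["kpi"]) / champion["kpi"] > limits.max_kpi_drop:
        reasons.append("kpi_drop")
    if challenger["latency_ms"] - champion["latency_ms"] > limits.max_latency_increase_ms:
        reasons.append("latency")
    return reasons


limits = RollbackLimits(max_error_rate=0.02, max_kpi_drop=0.05, max_latency_increase_ms=50)
champion_metrics = {"kpi": 0.40, "latency_ms": 120}
healthy = {"error_rate": 0.005, "kpi": 0.41, "latency_ms": 130}
failing = {"error_rate": 0.03, "kpi": 0.36, "latency_ms": 200}

print(rollback_reasons(healthy, champion_metrics, limits))
print(rollback_reasons(failing, champion_metrics, limits))
```

Returning the list of breached limits, rather than a bare boolean, gives the incident record a reason for each automated rollback.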

Real Case — Meta Ad Ranking, Ongoing

Meta runs hundreds of simultaneous A/B tests on its ad ranking models at any given time. Each test covers a slice of traffic; results are aggregated and fed into a cadenced decision process about which model versions become the new champion. This industrialization of champion–challenger is what allows a company operating at Meta's scale to continuously improve its models without introducing catastrophic rollout failures.

Documentation and the Model Card

Every model version — champion and challenger — should have a model card: a structured document recording training data window, feature list, evaluation metrics, known limitations, and drift monitoring thresholds. When a retrain occurs, the new model card is compared against the previous one as part of the champion–challenger review. This practice, formalized by Google researchers in a 2019 paper, creates an auditable trail of model evolution that is increasingly required by financial regulators (SR 11-7) and EU AI Act provisions for high-risk AI systems.
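
The comparison step can be made mechanical if each card is stored as structured data. The sketch below is one possible shape; the field names follow the list above, while the class name, version labels, and training-window dates are hypothetical.

```python
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ModelCard:
    version: str
    training_window: str
    features: tuple
    metrics: dict
    known_limitations: tuple = ()
    drift_thresholds: dict = field(default_factory=dict)


def diff_cards(old, new):
    """Return {field: (old_value, new_value)} for every field that changed."""
    a, b = asdict(old), asdict(new)
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}


champion_card = ModelCard(
    version="v1",
    training_window="2022-01..2022-12",  # hypothetical dates
    features=("amount", "device_type"),
    metrics={"auroc": 0.91},
    drift_thresholds={"psi": 0.25},
)
challenger_card = ModelCard(
    version="v2",
    training_window="2022-07..2023-06",  # hypothetical dates
    features=("amount", "device_type"),
    metrics={"auroc": 0.94},
    drift_thresholds={"psi": 0.25},
)

changes = diff_cards(champion_card, challenger_card)
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```

A diff like this makes unplanned changes easy to spot during review, for example a feature that silently dropped out of the list between versions.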

Closing Principle for This Module

Model monitoring is not a feature; it is the operational foundation that keeps deployed AI trustworthy over time. Drift happens. The world changes. The teams that build durable AI systems are those that treat monitoring, retraining, and validation as first-class engineering disciplines from day one of deployment.

Lesson 4 Quiz

Shadow Testing, Canary Deployments, and Champion–Challenger · 3 questions
In shadow mode testing, what happens when the challenger model's output differs from the champion's?
Correct. Shadow mode means the challenger runs silently — its predictions are logged but never acted upon. Users only ever receive the champion's output. This is what makes shadow testing the safest validation method: zero production risk.
Shadow mode is specifically designed to have zero user impact. The champion always serves predictions. The challenger's outputs are captured in logs for analysis only. No averaging, no human review loop — just silent parallel scoring.
What is "training–serving skew" and why does it matter for champion–challenger decisions?
Correct. Training–serving skew means offline metrics (on historical data) don't reliably predict production performance. A challenger that looks better on validation data may underperform in production due to real-time feature differences, feedback loops, or data leakage in the training set. This is exactly why shadow and canary testing exist.
Training–serving skew is the performance gap between offline evaluation and live production — not a time lag or cost difference. It's why offline metrics alone can't be trusted for champion–challenger decisions. Real-traffic validation is necessary.
A model card should be updated every time a model is retrained. What is the primary regulatory and operational reason for this?
Correct. Model cards serve as the official record of a model version's provenance, capabilities, and limitations. For high-risk AI systems under EU AI Act provisions and Fed/OCC guidance like SR 11-7, this kind of documentation is expected by regulators and is effectively mandatory, not optional. They also enable internal audit trails when questions arise about why the model was retrained and what changed.
Model cards are documentation artifacts — they don't improve accuracy or trigger rollbacks. Their value is audit and compliance: creating a versioned record of training decisions, data windows, known limitations, and performance benchmarks. This matters enormously for regulated industries and increasingly for AI regulation globally.

Lab 4 · Champion–Challenger Design

Design a full champion–challenger rollout plan for a retrained model

Your Mission

You've completed a retrain of a credit card fraud detection model. Offline metrics look good: AUROC improved from 0.91 to 0.94. Now design the full champion–challenger validation plan before this model touches production. Work with the AI advisor on shadow duration, canary traffic percentage, A/B success criteria, rollback triggers, and model card requirements. Complete at least 3 exchanges.

Start by proposing your shadow testing plan: how long would you run shadow mode, what outputs would you compare, and what would need to be true to advance to canary?
Deployment Validator: Your retrained fraud model looks promising offline: AUROC up from 0.91 to 0.94. But offline metrics aren't the full story. Walk me through your shadow testing plan: duration, what you'd compare, and your go/no-go criteria to move to canary deployment.

Module 4 · Final Test

Monitoring Model Quality Over Time · 15 questions · Pass at 80%
1. Which drift type specifically describes a change in the relationship between input features and the target label?
Correct. Concept drift = the X→Y mapping has changed.
Concept drift is the right term — it describes a changed relationship between inputs and labels, not just a shift in input distributions.
2. PSI of 0.08 on a feature in your production window indicates:
Correct. PSI below 0.1 = stable. No action required.
0.1 is the stability threshold for PSI: values below it indicate a stable distribution. 0.08 falls in that range, so no action is required.
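To make the PSI number concrete, here is a minimal sketch of how it can be computed from a reference sample and a production sample. It assumes the usual definition, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ) over bins, where eᵢ and aᵢ are the reference and production proportions in bin i. The function name, the bin edges, and the `eps` floor for empty bins are illustrative choices, not something prescribed by this module.

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a reference and a production sample.

    bin_edges are the interior cut points (hypothetical values chosen by the
    caller); eps keeps empty bins from producing log(0).
    """
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges strictly below the value.
            idx = sum(v > edge for edge in bin_edges)
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Reference window: values 1..100, split into quartile bins.
reference = list(range(1, 101))
edges = [25, 50, 75]

same = psi(reference, reference, edges)                        # no shift
shifted = psi(reference, [v + 50 for v in reference], edges)   # large shift

print(f"identical windows: {same:.3f}")
print(f"shifted window:    {shifted:.3f}")
```

An unchanged window scores zero, well under the 0.1 stability threshold, while the shifted window lands far above it.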
3. Why is a fraud detection model's output distribution useful as a monitoring signal when ground truth labels are delayed?
Correct. Output distribution monitoring is a leading indicator of model degradation — available immediately, without waiting for labels.
Output distribution shifts are early warning signals — they often appear before labeled accuracy metrics can detect degradation, making them invaluable when labels are delayed.
4. The Kolmogorov-Smirnov (KS) test is best suited for detecting drift in which type of feature?
Correct. KS is a non-parametric test comparing two continuous distributions — ideal for numeric features like transaction amount.
KS tests compare two continuous distributions. For categorical features, use Chi-squared. For high-dimensional embeddings, use MMD.
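As an illustration of what the KS test measures, the sketch below computes the two-sample KS statistic: the largest vertical gap between the empirical CDFs of a reference sample and a production sample. This is a plain-Python sketch with a hypothetical function name; it returns only the statistic. In practice a library routine such as `scipy.stats.ks_2samp` would also supply a p-value.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two numeric samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    largest_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, so i/len(a) and
        # j/len(b) are the two empirical CDFs evaluated at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap

# Hypothetical transaction amounts from a training and a production window.
train_amounts = [1, 2, 3, 4]
prod_amounts = [3, 4, 5, 6]
print(ks_statistic(train_amounts, prod_amounts))
```

A statistic near 0 means the two distributions overlap closely; a value of 1 means they do not overlap at all.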
5. What is the main problem with purely static alert thresholds (e.g., "alert if PSI > 0.25 always")?
Correct. Static thresholds fire during normal seasonal variation — holiday demand spikes, for example — creating alert fatigue that causes engineers to ignore genuine degradation signals.
The core problem is seasonality-blindness: static thresholds flag expected seasonal variation as drift. Rolling baselines or CUSUM adapt to these patterns.
6. Which real-world company's demand forecasting team explicitly excluded pre-pandemic training data because it actively harmed forecast quality?
Correct. Instacart published a post-mortem describing their decision to exclude pre-March 2020 data from their COVID-era retrain.
Instacart published this case study. Zillow suffered losses from inadequate monitoring; Netflix uses shadow testing; Google updates continuously for freshness.
7. A model retrain should be treated as a deployment event. What does this specifically mean?
Correct. A retrained model is a new model. It carries the same risks as the original deployment and must go through equivalent validation processes — shadow, canary, A/B — before becoming champion.
Treating retrain as a deployment event means applying the same rigor: shadow testing, canary rollout, A/B validation, rollback readiness. Never auto-promote a retrain to production based solely on offline metrics.
8. In a champion–challenger framework, what happens to the losing model after the challenger wins?
Correct. Archiving — not deleting — the losing model allows rollback if the new champion develops unexpected issues, and maintains an audit trail of model evolution.
Archiving is the correct practice. Deleting previous model versions removes your rollback capability and audit trail. Both are critical in production operations.
9. Maximum Mean Discrepancy (MMD) is preferred over PSI for which type of feature?
Correct. MMD is a kernel-based method designed for high-dimensional spaces where PSI and KS break down — particularly useful for NLP embeddings and computer vision features.
MMD shines on high-dimensional data — embeddings, image representations — where PSI and KS aren't adequate. For simple numeric features, PSI and KS are sufficient.
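For intuition about what MMD computes, here is a small sketch of the biased (V-statistic) estimate of squared MMD with an RBF kernel, written in plain Python. The kernel choice, the `gamma` bandwidth, and the toy two-dimensional "embeddings" are assumptions for illustration. Real embedding monitoring would use vectorized code and a bandwidth tuned to the data.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    return math.exp(-gamma * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two sets of vectors."""
    def mean_kernel(p, q):
        return sum(rbf_kernel(a, b, gamma) for a in p for b in q) / (len(p) * len(q))
    # Within-set similarity of each set, minus twice the cross-set similarity.
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)

# Toy 2-D vectors standing in for high-dimensional embeddings.
reference_embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
drifted_embeddings = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]

print(mmd_squared(reference_embeddings, reference_embeddings))  # close to 0
print(mmd_squared(reference_embeddings, drifted_embeddings))    # clearly positive
```

Because the comparison runs on whole vectors through the kernel, no per-feature binning is needed, which is why the method scales to embeddings where PSI does not.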
10. Spotify's recommendation teams monitor the entropy of recommendation distributions as a health signal. A collapse in entropy indicates:
Correct. Entropy collapse in recommendation outputs means the diversity of predictions has collapsed — everyone gets the same recommendations — which is a key symptom of model degradation.
Entropy collapse in recommendation distributions means the model has lost its ability to personalize — it's converging on a few popular items for everyone. This is a degradation signal, not a success signal.
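The entropy signal described here can be sketched in a few lines. The example computes the Shannon entropy, in bits, of the items served across users. The function name and item labels are hypothetical, and nothing here reflects how any particular company implements the check.

```python
import math
from collections import Counter

def recommendation_entropy(recommended_items):
    """Shannon entropy (bits) of the distribution of recommended items."""
    counts = Counter(recommended_items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Healthy: four items recommended equally often across users.
diverse = ["a", "b", "c", "d"] * 25
# Collapsed: every user receives the same item.
collapsed = ["a"] * 100

print(recommendation_entropy(diverse))
print(recommendation_entropy(collapsed))
```

Tracking this value per day and alerting on a sharp drop gives a label-free health signal for the recommender.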
11. What is the primary advantage of a "weighted recency" approach to retraining data selection over a simple sliding window?
Correct. Weighted recency assigns higher loss weights to recent examples without discarding historical data — you get the freshness of a sliding window with the depth of full history.
Weighted recency trains on all data but emphasizes recent examples through higher loss weights. A sliding window discards old data entirely. Weighted recency is the balanced middle path.
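One way to picture the difference is to look at the per-example weights each approach implies. The sketch below assumes an exponential decay with a 90-day half-life for weighted recency. That decay scheme, the half-life, and the function names are illustrative assumptions; the module does not prescribe a specific weighting formula.

```python
def recency_weights(ages_in_days, half_life_days=90.0):
    """Weighted recency: every example is kept, older ones count for less."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]

def sliding_window_weights(ages_in_days, window_days=90.0):
    """Sliding window: examples older than the window are dropped entirely."""
    return [1.0 if age <= window_days else 0.0 for age in ages_in_days]

def weighted_mean_loss(losses, weights):
    """Training objective sketch: per-example losses averaged by weight."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)

ages = [0, 90, 180]          # days since each example was collected
losses = [0.2, 0.4, 0.8]     # hypothetical per-example losses

print(recency_weights(ages))          # [1.0, 0.5, 0.25]
print(sliding_window_weights(ages))   # [1.0, 1.0, 0.0]
print(weighted_mean_loss(losses, recency_weights(ages)))
```

The 180-day-old example still contributes under weighted recency, at a quarter of the weight of a fresh one, whereas the sliding window ignores it.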
12. The Amazon Rekognition bias case (2018–2019) illustrates which drift concept?
Correct. The training distribution was skewed toward lighter faces; as the system was deployed to policing and hiring contexts, the operational population shifted toward darker-skinned faces — exactly where accuracy was worst. Classic covariate shift.
This is covariate shift: the deployed population looked different from the training population. The model's learned weights were fine for the training distribution but failed for the operational one.
13. CUSUM (Cumulative Sum) control charts are particularly valuable for detecting:
Correct. CUSUM accumulates deviations over time, which makes it well suited to gradual drift that never crosses a static threshold.
CUSUM's strength is detecting slow accumulation of small deviations — gradual concept drift that never spikes above a static threshold but compounds into significant model degradation.
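The accumulation idea can be sketched as a one-sided (upper) CUSUM: each observation's excess over a target, minus a small slack, is added to a running sum that is floored at zero, and an alarm fires when the sum passes a decision threshold. The target, slack, and threshold values below are hypothetical and would need tuning on real monitoring data.

```python
def cusum_alarm_index(values, target, slack, threshold):
    """Return the index of the first upper-CUSUM alarm, or None if none fires."""
    running_sum = 0.0
    for i, x in enumerate(values):
        # Accumulate only the excess above target + slack; never go below zero.
        running_sum = max(0.0, running_sum + (x - target - slack))
        if running_sum > threshold:
            return i
    return None

# Hypothetical daily error rate: stable at 0.10, then a small shift to 0.13.
error_rate = [0.10] * 5 + [0.13] * 20

static_alert_fired = any(v > 0.15 for v in error_rate)   # static threshold at 0.15
cusum_index = cusum_alarm_index(error_rate, target=0.10, slack=0.01, threshold=0.09)

print(static_alert_fired)   # False: no single day exceeds 0.15
print(cusum_index)          # index of the first CUSUM alarm
```

In this toy series the static check never fires, while the accumulated sum crosses its threshold five days after the shift begins.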
14. Zillow's Offers program shutdown (2021) most directly illustrates a failure of which monitoring capability?
Correct. The core failure was inadequate sensitivity in monitoring the gap between model predictions and actual transaction prices — the feedback loop that should have triggered retraining or de-risking of inventory positions.
Zillow's problem was that the monitoring feedback loop between model price predictions and actual transaction prices wasn't sensitive enough to trigger corrective action before $304M in inventory losses accumulated.
15. Model cards are increasingly required by regulations like the EU AI Act and SR 11-7 (Federal Reserve) primarily because they:
Correct. Model cards are documentation artifacts — they create the traceable, versioned record of model provenance, capabilities, and limitations that regulators require for high-risk AI systems in financial services and other regulated domains.
Model cards are documentation, not detection tools. They create the paper trail regulators need: what data was used, what limitations exist, what performance metrics were validated, and what changed between versions.