Module 8 · Lesson 1

Why Evaluation Never Stops

The gap between benchmark performance and production reality — and why closing it requires continuous work.

What happens to an AI system's performance between launch day and six months later?

Amazon built a machine-learning hiring tool beginning in 2014, training it on résumés submitted over a ten-year period. The system was evaluated on held-out test sets and showed strong performance, until internal audits revealed it was systematically downgrading résumés that included the word "women's" (as in "women's chess club"). The underlying training data reflected a decade of male-dominated hiring outcomes, and no ongoing evaluation process caught the problem before the audits did. Amazon ultimately scrapped the tool, as Reuters reported in 2018, without ever deploying it at scale. The lesson was stark: a one-time evaluation at launch is not sufficient when the world being modeled keeps changing.

The Illusion of the Frozen Model

Most ML workflows treat evaluation as a pre-deployment gate: train a model, evaluate it on a held-out test set, meet a threshold, ship it. This framing has a hidden assumption — that the statistical relationship between inputs and outputs will remain stable after deployment. In practice, that assumption fails in three recurring ways.

Data drift occurs when the distribution of incoming inputs shifts away from the distribution seen during training. A fraud-detection model trained on 2021 transaction patterns may never have seen the buy-now-pay-later structures that proliferated in 2022–2023.

Concept drift occurs when the underlying relationship between inputs and the correct output changes. After a recession, for example, the same financial profile can carry a different true default risk than it did when a credit-risk model was trained.

Feedback loops occur when the model's own outputs alter future inputs, as with recommender systems that shape the very consumption patterns they predict.

Each of these mechanisms means that a model which passed its pre-deployment evaluation can silently degrade once live — sometimes within weeks, sometimes over years.

~6 mo

Median time before significant drift in production ML systems

Per Google's 2022 internal study on deployed Vertex AI models across verticals.

74%

Of ML teams report unexpected post-launch degradation

Algorithmia / DataRobot State of Enterprise ML survey, 2021.

Continuous Evaluation Defined

Continuous evaluation means running a structured measurement process on live production traffic — not just at model-update time, but as an ongoing operational discipline. It answers three questions on a recurring schedule:

Is performance stable? Comparing current metrics to a baseline established at launch or last retrain.

Is the input distribution changing? Monitoring statistical properties of features and request patterns over time.

Are there segments degrading faster than others? Disaggregated analysis by demographic group, geography, or product category.
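The second question above, whether the input distribution is changing, can be made concrete with a drift statistic. The sketch below uses the Population Stability Index (PSI), one common choice among several; the function name, the ten equal-width bins, and the 1e-6 floor for empty bins are illustrative assumptions, not something the lesson prescribes.

```python
import math

def psi(baseline, current, n_bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so out-of-range production values land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = bin_fractions(baseline), bin_fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions and should be tuned per feature.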

Why This Is Hard

In production, ground-truth labels often arrive with significant delay or not at all. A content moderation model may flag a post, but whether that flag was correct might only be determinable after a human review that takes days. An ad click-through model may not know for 30 days whether a click converted. Continuous evaluation must account for this label latency by designing proxy metrics and delayed-label pipelines.
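A delayed-label pipeline can be sketched as a join between logged predictions and labels that arrive later, scoring only predictions old enough for their label to have had time to arrive. The data layout, the function name, and the 30-day default maturity window below are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def matured_accuracy(predictions, labels, now, maturity=timedelta(days=30)):
    """Accuracy over predictions old enough that their label should have arrived.

    predictions: dict of request_id -> (timestamp, predicted_label)
    labels:      dict of request_id -> true_label, filled in as feedback arrives
    Returns (accuracy, coverage); accuracy is None when nothing can be scored.
    """
    matured = {rid: p for rid, p in predictions.items() if now - p[0] >= maturity}
    scored = [(pred, labels[rid]) for rid, (_, pred) in matured.items() if rid in labels]
    if not scored:
        return None, 0.0
    accuracy = sum(pred == truth for pred, truth in scored) / len(scored)
    coverage = len(scored) / len(matured)  # share of matured predictions with a label
    return accuracy, coverage
```

Reporting coverage alongside accuracy matters: an accuracy figure computed on a small labelled fraction may not represent the traffic as a whole.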

The MLOps Evaluation Loop

Modern MLOps frameworks conceptualize continuous evaluation as a closed loop. Google's MLOps whitepaper (2020) describes a production ML system as requiring not just a model pipeline but a feedback pipeline that continuously routes production signals back into the evaluation and retraining workflow.

Meta's infrastructure team published in 2023 that their recommendation models are evaluated on live traffic slices hourly, with automated alerts triggering retraining workflows when metrics cross pre-defined thresholds. The infrastructure cost of this is non-trivial — but the cost of a degraded recommender at Meta's scale (billions of daily active users) vastly exceeds it.

Production Traffic Sampling

Log a representative slice of live requests and model outputs, preserving enough context to later evaluate correctness.

Label Acquisition

Route samples through human review, delayed feedback signals, or proxy metrics where ground truth is unavailable.

Metric Computation

Calculate performance metrics across the overall population and key subgroups. Compare to baseline and prior periods.

Alerting and Decision

Automated alerts surface degradation. Human judgment decides: retrain, rollback, or investigate further.
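The four stages above can be sketched as one evaluation cycle. This is a minimal illustration, not a production design: `model` and `get_label` are hypothetical callables supplied by the caller, accuracy stands in for whatever metric the team tracks, and the sample rate and tolerance are arbitrary defaults.

```python
import random

def evaluation_cycle(requests, model, get_label, baseline_accuracy,
                     sample_rate=0.1, tolerance=0.05, seed=0):
    """Run one pass of sample -> label -> measure -> alert."""
    rng = random.Random(seed)
    # 1. Production traffic sampling
    sample = [r for r in requests if rng.random() < sample_rate]
    # 2. Label acquisition: get_label returns None when ground truth is unavailable
    labelled = [(r, get_label(r)) for r in sample]
    labelled = [(r, y) for r, y in labelled if y is not None]
    if not labelled:
        return {"status": "no_labels", "accuracy": None, "n": 0}
    # 3. Metric computation against the baseline
    accuracy = sum(model(r) == y for r, y in labelled) / len(labelled)
    # 4. Alerting: the decision to retrain or roll back stays with a human
    status = "alert" if accuracy < baseline_accuracy - tolerance else "ok"
    return {"status": status, "accuracy": accuracy, "n": len(labelled)}
```

The function only raises the alert. Deciding what to do about it is deliberately left outside the code, matching the final stage described above.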

Core Principle

Continuous evaluation is not a quality-assurance afterthought — it is the mechanism by which a deployed AI system maintains its contract with users over time. Without it, every deployed model is running on borrowed reliability.

Lesson 1 Quiz

Why Evaluation Never Stops — check your understanding

What is "concept drift" in a deployed ML model?

Correct. Concept drift means the real-world process the model learned has changed — the same inputs now correspond to different correct outputs than at training time.

Not quite. Concept drift refers specifically to a change in the statistical relationship between inputs and outputs in the real world, not to software or infrastructure changes.

Amazon's recruiting tool was shut down in 2018 primarily because:

Correct. The tool penalized résumés containing words like "women's," having learned from a decade of male-dominated hiring outcomes. No continuous evaluation caught this before shutdown.

That's not the documented reason. Internal audits found systematic gender bias baked into its scoring — it downgraded résumés associated with women.

What is "label latency" in the context of continuous evaluation?

Correct. Label latency is a core challenge in production evaluation — a click-through model may not know for 30 days whether a click converted, making real-time accuracy measurement impossible.

Not quite. Label latency refers to the gap between prediction time and when the ground-truth answer becomes knowable — for example, waiting weeks to see if a fraud flag was correct.

According to the MLOps evaluation loop, what is the correct order of the four stages?

Correct. You first capture live traffic, then acquire ground truth, then compute metrics against baseline, then alert humans or trigger automated responses.

Review the pipeline. The correct sequence starts with sampling live production traffic, then acquiring labels, then computing metrics, then triggering alerts or decisions.

Lab 1 — Diagnosing Silent Degradation

Practice session · Continuous Evaluation Advisor

Your Scenario

You operate a loan-application approval model that was deployed 8 months ago. It passed all pre-launch evaluations with 91% accuracy. No retraining has been done since launch. You've just noticed that customer complaint rates have risen 40% in the past two months, but your offline test-set metrics look fine.

Ask the advisor: How do I tell whether my model has silently degraded? What signals should I look for, and in what order? What might explain the disconnect between my offline metrics and rising complaints?

Continuous Evaluation Advisor

Welcome. You've described a classic silent-degradation scenario — offline metrics look fine, but real-world outcomes are deteriorating. That disconnect is diagnostic in itself. What aspect would you like to dig into first: the types of drift you might be experiencing, the signals you can monitor without ground-truth labels, or how to structure an investigation plan?

Module 8 · Lesson 2

Metrics That Matter in Production

Choosing what to measure — and why the wrong metrics can give you false confidence for months.

If accuracy is 94% but one demographic group's error rate is 3× higher, is your model performing well?

In November 2019, software engineer David Heinemeier Hansson publicly reported that Apple Card's credit algorithm, operated by Goldman Sachs, had offered him 20× the credit limit offered to his wife, despite their shared assets and her higher credit score in some systems. Goldman responded that its credit decisions did not take gender into account. New York's Department of Financial Services opened an investigation and reported in 2021 that it had found no unlawful discrimination, while criticizing how poorly the decisions had been explained to customers. The episode shows why aggregate performance metrics are not enough: they were likely healthy throughout, and the question of disparate outcomes could only be examined by disaggregating results by gender.

The Metric Selection Problem

Every production evaluation system requires a deliberate choice: which metrics will you track, at what granularity, and how will you respond when they move? The wrong choices produce two failure modes. Metric blindness — tracking metrics that don't surface real problems, as in the Apple Card case. Metric overload — tracking so many metrics that teams can't distinguish signal from noise and alerts become meaningless.

The discipline starts by distinguishing between model metrics (properties of the model's statistical outputs), system metrics (infrastructure health), and outcome metrics (real-world effects on users).

Model Metrics

Accuracy, F1, AUC-ROC, calibration error, precision/recall. Computed from model outputs vs. labels.

System Metrics

Latency, throughput, error rates, uptime. Computed from infrastructure telemetry.

Outcome Metrics

User satisfaction, complaint rates, downstream business KPIs. Computed from user behavior and business data.

Disaggregated Evaluation

The Apple Card case, and dozens like it in credit, hiring, healthcare, and criminal justice, share a common structure: aggregate metrics mask subgroup disparities. In 2023, the National Institute of Standards and Technology released its AI Risk Management Framework, a voluntary framework whose guidance on fairness recommends that evaluation be disaggregated across protected characteristics where feasible.

Disaggregated evaluation means computing key metrics separately for meaningful subgroups — by geography, age bracket, device type, or demographic — and then comparing across groups. A model with 90% aggregate accuracy but 78% accuracy for a minority subgroup is not a 90%-accurate model for that population.
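A minimal sketch of that computation follows. The record layout (group, prediction, truth) and both function names are assumptions for illustration; a real pipeline would also report sample sizes per group, since small subgroups produce noisy estimates.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, prediction, truth) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, pred, truth in records:
        totals[group] += 1
        hits[group] += int(pred == truth)
    per_group = {g: hits[g] / totals[g] for g in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_group

def largest_gap(per_group):
    """Difference between the best-served and worst-served groups."""
    return max(per_group.values()) - min(per_group.values())
```

With eight correct predictions for group A and one correct out of two for group B, the aggregate is 90% while group B sits at 50%, which is exactly the pattern an aggregate-only dashboard hides.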

Precision vs. Recall Trade-offs in Production

A medical diagnostic model optimizing for recall (catching all positive cases) will generate false positives that require expensive follow-up. A model optimizing for precision (only flagging high-confidence positives) will miss true cases. Neither is universally correct — the right trade-off is determined by the cost structure of errors in the specific deployment context, and it may shift as the deployment context changes.
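One way to make the cost structure explicit is to pick the decision threshold that minimizes total error cost on labelled data. The sketch below assumes binary labels, a score where higher means more likely positive, and per-error costs (`cost_fn`, `cost_fp`) that the deployment team would have to estimate; all names are illustrative.

```python
def expected_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Total error cost when flagging every score at or above the threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if label == 1 and not flagged:
            cost += cost_fn   # missed true case
        elif label == 0 and flagged:
            cost += cost_fp   # unnecessary follow-up
    return cost

def best_threshold(scores, labels, cost_fn, cost_fp):
    """Pick the candidate threshold with the lowest total cost."""
    # Every observed score is a candidate; infinity means "flag nothing".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: expected_cost(scores, labels, t, cost_fn, cost_fp))
```

When misses are expensive the chosen threshold moves down (favoring recall); when false alarms are expensive it moves up (favoring precision). If the costs change after deployment, the threshold should be recomputed.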

Proxy Metrics When Labels Are Delayed

When true labels arrive days or weeks after a prediction, you need proxy metrics to evaluate model health in real time. Netflix's recommendation team has publicly described using engagement rate (whether a recommended title is actually watched beyond a threshold duration) as a proxy for recommendation quality, because user satisfaction surveys arrive too slowly. Uber Eats uses reorder rate as a proxy for delivery quality predictions.

The danger with proxies is that they can be optimized directly — a recommender trained to maximize watch duration may push sensational content that users regret watching. Proxy selection requires understanding why the proxy correlates with what you care about, and continuously checking that correlation.

Calibration — The Underappreciated Metric

A well-calibrated model is one whose predicted probabilities match actual frequencies. If a model says an event has a 70% probability, that event should actually occur 70% of the time across all predictions at that confidence level. Calibration error tends to drift even when accuracy stays stable — especially after distribution shift. A 2021 paper from Carnegie Mellon showed that models deployed on distribution-shifted data can maintain accuracy while their calibration degrades severely, causing downstream decision-makers to over-trust or under-trust outputs.
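Calibration can be tracked with expected calibration error (ECE): group predictions into confidence bins, then take the size-weighted average gap between mean predicted probability and observed frequency. The sketch below assumes binary outcomes and ten equal-width bins; both choices, and the function name, are illustrative.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE for binary predictions.

    probs:    predicted probabilities of the positive class, each in [0, 1]
    outcomes: observed results, 1 if the event occurred and 0 otherwise
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / len(probs)) * abs(mean_conf - observed)
    return ece
```

Ten predictions of 0.7 where the event occurs seven times give an ECE of about zero; if the event occurs only four times, the ECE is about 0.3.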

Practical Metric Dashboard Structure

Teams at Google, Microsoft, and Airbnb have converged on a tiered alert structure:

P0 · Critical Threshold Breach

Primary metric falls below hard floor. Triggers immediate on-call page and potential rollback. Example: fraud recall drops below 80%.

P1 · Trend Alert

Metric declining steadily over 7–14 days without crossing hard threshold. Triggers investigation and retraining consideration.

P2 · Subgroup Divergence

Aggregate metrics stable but a specific subgroup's metrics are deteriorating. Triggers fairness review and targeted retraining.
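The three tiers above can be sketched as a small classifier over daily metric series. Everything here is a simplifying assumption: "declining steadily" is approximated as a non-increasing series with a minimum total drop, the window is seven days, and the returned strings are illustrative labels.

```python
def is_steady_decline(values, min_drop):
    """True when no day is higher than the previous one and the total drop is at least min_drop."""
    non_increasing = all(b <= a for a, b in zip(values, values[1:]))
    return non_increasing and (values[0] - values[-1]) >= min_drop

def classify_alert(aggregate, subgroups, hard_floor, window=7, min_drop=0.02):
    """aggregate: daily metric values, oldest first. subgroups: dict of name -> daily values."""
    # Tier 1: the primary metric is below its hard floor right now.
    if aggregate[-1] < hard_floor:
        return "critical_threshold_breach"
    # Tier 2: no breach yet, but the aggregate has been sliding over the window.
    if len(aggregate) >= window and is_steady_decline(aggregate[-window:], min_drop):
        return "trend_alert"
    # Tier 3: the aggregate looks stable while one subgroup deteriorates.
    for name, series in subgroups.items():
        if len(series) >= window and is_steady_decline(series[-window:], min_drop):
            return f"subgroup_divergence:{name}"
    return None
```

Checking tiers in order of severity means a single evaluation pass produces at most one alert, which helps with the metric-overload problem described earlier.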

Key Principle

Choose the fewest metrics that collectively answer: "Is the model still doing what it was designed to do, for all the populations it serves?" Calibration, disaggregated accuracy, and at least one outcome metric form a minimum viable evaluation set for most production systems.

Lesson 2 Quiz

Metrics That Matter in Production — check your understanding

What did the Apple Card credit algorithm case demonstrate about aggregate performance metrics?

Correct. The algorithm's overall metrics were likely healthy, but disaggregating by gender revealed 20× credit limit disparities — invisible at the aggregate level.

The case showed the opposite — aggregate metrics gave false confidence while a significant gender disparity existed that was only visible through disaggregated analysis.

A model predicting loan default maintains 91% aggregate accuracy but has 78% accuracy for applicants over age 60. This is best described as:

Correct. This is a textbook subgroup disparity — the model performs significantly worse for a specific population that the aggregate metric masks.

This describes a subgroup performance disparity. The aggregate metric looks good, but one population group is receiving meaningfully worse predictions — exactly what disaggregated evaluation is designed to surface.

What does "calibration error" refer to in a production model?

Correct. A miscalibrated model that predicts 70% probability for an event might see that event occur only 40% of the time — making its confidence scores unreliable for decision-making.

Calibration error is specifically the gap between stated confidence and actual accuracy. A model saying "70% probability" should see that event occur 70% of the time if well-calibrated.

A P1 alert in a tiered monitoring system typically means:

Correct. P1 alerts catch slow degradation trends before they become critical failures — the kind of drift that wouldn't trigger an immediate P0 page but signals a need for investigation.

That's a P0 (critical) scenario. P1 refers to a worrying trend over 7–14 days that hasn't yet crossed a hard threshold but warrants investigation and potential retraining.

Lab 2 — Building a Metric Framework

Practice session · Production Metrics Advisor

Your Scenario

You are designing the evaluation framework for a healthcare triage model that predicts which patients need immediate escalation. The model serves a diverse patient population across 12 hospitals. Ground-truth labels (actual patient outcomes) arrive with a 48-hour delay on average. Your manager is asking for a dashboard covering "the right metrics."

Ask the advisor: What specific metrics should I track for a healthcare triage model? How do I handle the 48-hour label delay? What does a disaggregated evaluation look like in this context, and what proxy metrics could I use in the meantime?

Production Metrics Advisor

A healthcare triage model is one of the highest-stakes production evaluation contexts — the cost of false negatives (missed escalations) is severe, and subgroup disparities can directly harm vulnerable populations. Let's build a framework. What would you like to start with: the primary model metrics and their P0 thresholds, the proxy metrics for the 48-hour label gap, or the disaggregation structure across hospitals and patient demographics?

Module 8 · Lesson 3

Shadow Deployment and A/B Testing

How to evaluate new models safely in production — before they make any live decisions.

How do you know a new model is better than the old one — not on a test set, but on actual production traffic?

Microsoft's Bing team has published extensively on their controlled experiment platform, which runs thousands of simultaneous A/B tests across Bing's search ranking, ads, and UI. In a 2013 paper in KDD, Kohavi and colleagues reported that most proposed changes, including changes that engineers were confident would improve quality, showed no improvement or degraded key metrics when actually tested on live users. The Bing framework became a reference model for the industry: no model change ships to production without first demonstrating improvement on live traffic. By 2019, their platform was running over 10,000 concurrent experiments.

The Problem with Test-Set Validation Alone

A held-out test set is a snapshot of the world at collection time, filtered through the decisions made about what data to label and how. Production traffic is messier, more diverse, and arrives in patterns no offline dataset fully captures. The Bing team documented that improvements on their offline evaluation carried over to production only about 75% of the time, meaning roughly one in four changes that looked good offline failed to improve, or actively degraded, live user experience.

Two evaluation patterns address this gap: shadow deployment and A/B testing. They serve different purposes and answer different questions.

Shadow Deployment

In shadow deployment, a new model runs in parallel with the production model. Real production requests are duplicated and sent to both. The new model's outputs are logged but not served to users — the old model continues to make all live decisions. This allows teams to observe how the new model would have behaved on real traffic without any risk to users.

Shadow deployment is particularly valuable for catching operational failures — edge cases that crash the model, unexpected output distributions, or latency spikes — before they affect real users. Google's Site Reliability Engineering documentation describes shadow testing as the standard pre-production step for models handling sensitive user data.
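The core of a shadow setup can be sketched in a few lines: the production model's output is always what the user receives, while the shadow model's output, latency, and any crash are only logged. The function and field names are hypothetical, and the shadow call is synchronous here for brevity; a real system would typically run it asynchronously so it cannot add user-facing latency.

```python
import time

def serve_with_shadow(request, production_model, shadow_model, shadow_log):
    """Return the production output; record what the shadow model would have done."""
    served = production_model(request)
    entry = {"request": request, "served": served}
    start = time.perf_counter()
    try:
        entry["shadow"] = shadow_model(request)
    except Exception as exc:  # a shadow failure must never affect the user
        entry["shadow_error"] = repr(exc)
    entry["shadow_latency_s"] = time.perf_counter() - start
    shadow_log.append(entry)
    return served
```

Comparing the `served` and `shadow` fields offline shows how often the two models disagree, and the error and latency fields surface the operational failures described above before any user is exposed to them.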

Shadow Deployment Limitations

Shadow deployment cannot evaluate user response to new model outputs, because users never see them. A recommendation model may produce statistically higher-quality suggestions in shadow mode, but you can't know if users will actually engage with them differently. For that, you need A/B testing.

A/B Testing in Production

In an A/B test (also called a controlled experiment or online experiment), incoming traffic is randomly split: a control group receives the current model's outputs, and a treatment group receives the new model's outputs. Both groups interact normally with the product. The difference in outcomes between groups — measured on engagement, error rates, conversions, or other metrics — estimates the causal effect of the model change.

The randomization is critical. Without it, selection bias makes causal inference impossible. Airbnb's ML platform team published in 2019 that their experimentation framework assigns users to experiment buckets using hash functions over user IDs, ensuring consistent bucket assignment across sessions and preventing users from seeing different model outputs depending on which server handles their request.
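Hash-based assignment of this kind can be sketched as follows. This is a generic illustration rather than Airbnb's implementation: the choice of SHA-256, the key format, and the 5% default treatment share are assumptions.

```python
import hashlib

def assign_bucket(user_id, experiment, treatment_share=0.05):
    """Deterministically map a user to 'treatment' or 'control' for one experiment."""
    # Including the experiment name keeps assignments independent across experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and scale it to the unit interval.
    position = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if position < treatment_share else "control"
```

Because the bucket depends only on the user ID and experiment name, any server computes the same answer without shared state, and a user stays in the same bucket across sessions.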

Statistical Significance and Practical Significance

At high-traffic services, even trivial effects reach statistical significance because the sample sizes are enormous. Booking.com's data science team published in 2019 that they distinguish between statistical significance (unlikely to be due to chance) and practical significance (large enough to matter for the business or user). They set minimum detectable effect thresholds — a change must be large enough to be worth the operational cost of the switch — before any experiment is declared a winner.
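The distinction can be shown with a two-proportion z-test followed by a separate check on effect size. This is a simplified sketch: the pooled z-test, the 0.05 significance level, and the 0.5-percentage-point `min_effect` are illustrative assumptions, and the effect-size check here uses the point estimate rather than a confidence interval.

```python
import math

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift of B over A, two-sided p-value) from a pooled z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return p_b - p_a, p_value

def declare_winner(conv_a, n_a, conv_b, n_b, alpha=0.05, min_effect=0.005):
    """Require both statistical significance and a practically meaningful lift."""
    lift, p_value = two_proportion_test(conv_a, n_a, conv_b, n_b)
    return p_value < alpha and lift >= min_effect
```

With ten million users per arm, a lift from 10.00% to 10.05% is statistically significant yet falls well below the practical threshold, so `declare_winner` returns False.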

~75%

Offline-to-online improvement correlation

Bing's published finding: offline test improvements predicted production improvements only ~75% of the time (KDD 2013, Kohavi et al.).

10,000+

Concurrent experiments on Bing by 2019

Microsoft's ExP (Experimentation Platform) enables simultaneous testing across search ranking, ads, and UI layers.

Interleaving — A Faster Alternative

For ranking systems specifically, Netflix and others have used interleaving, a technique where results from model A and model B are merged into a single ranked list shown to a single user. The system tracks which model's results the user interacts with. Because both models are evaluated by the same user on the same query, interleaving reduces variance dramatically compared to A/B testing and reaches statistical significance with far less traffic. Netflix's recommendation team reported in 2017 that interleaving experiments required roughly 100× less traffic than equivalent A/B tests to reach the same statistical power.
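One published variant of this idea is team-draft interleaving, sketched below. The text does not say which variant any particular company uses, so treat this as a generic illustration; the function names are assumptions.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return (interleaved list, dict of item -> 'A' or 'B')."""
    rng = rng or random.Random()
    interleaved, credit = [], {}
    picks = {"A": 0, "B": 0}
    sources = {"A": ranking_a, "B": ranking_b}
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        # The team with fewer picks goes next; ties are broken by coin flip.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        remaining = [x for x in sources[team] if x not in credit]
        if not remaining:
            # This team has nothing new to add, so the other team picks.
            team = "B" if team == "A" else "A"
            remaining = [x for x in sources[team] if x not in credit]
        item = remaining[0]  # each team contributes its highest-ranked unused item
        interleaved.append(item)
        credit[item] = team
        picks[team] += 1
    return interleaved, credit

def score_clicks(clicked_items, credit):
    """Count clicks credited to each model's picks."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in credit:
            wins[credit[item]] += 1
    return wins
```

Aggregating `score_clicks` over many sessions gives a preference signal between the two rankers, with each user effectively acting as their own control.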

The Deployment Sequence

Shadow deployment → A/B test on a small traffic slice (1–5%) → Gradual rollout with monitoring → Full deployment. Each stage is a gate that requires the model to demonstrate improvement — not just on an offline benchmark, but on real production signals.

Lesson 3 Quiz

Shadow Deployment and A/B Testing — check your understanding

What is the primary purpose of shadow deployment?

Correct. Shadow deployment runs the new model in parallel with no risk to users — it's about catching operational failures and output distribution issues before any live user sees the new model's decisions.

Shadow deployment specifically withholds the new model's outputs from users. It evaluates behavior on real traffic without exposing users to it. User response measurement requires A/B testing.

The Bing team found that offline evaluation improvements predicted production improvements approximately what fraction of the time?

Correct. Kohavi et al. (KDD 2013) documented that offline improvements carried over to production about 75% of the time: strong enough to be informative, but imperfect enough that live testing remains mandatory.

The documented figure from Bing's KDD 2013 paper was approximately 75%, meaning roughly one in four offline "improvements" failed to improve, or actively degraded, live user metrics when deployed.

Why does randomization matter in A/B testing?

Correct. Randomization is what makes A/B tests causal rather than merely correlational. Without it, you can't know whether outcome differences are caused by the model or by pre-existing differences between user groups.

Randomization's purpose is causal inference. If assignment to control vs. treatment isn't random, differences in outcomes might reflect pre-existing differences between user groups rather than the effect of the model change.

Netflix's interleaving technique is advantageous over A/B testing for ranking systems because:

Correct. Netflix reported that interleaving reaches equivalent statistical power with roughly 100× less traffic than a standard A/B test, because each user acts as their own control.

The key advantage is variance reduction through within-user comparison. Netflix documented ~100× less traffic needed to reach the same statistical power as A/B testing, because both models face the same user on the same query.

Lab 3 — Designing a Safe Rollout

Practice session · Deployment Strategy Advisor

Your Scenario

You have a new content ranking model ready for production. Your current model has been running for 14 months. Your team's offline benchmarks show the new model outperforms on precision@10 by 8%. However, the new model has never been tested on real user traffic. Your platform serves 2 million daily active users. You need to design a safe rollout sequence.

Ask the advisor: How should I sequence the deployment — what does shadow testing tell me that offline tests don't? When should I move to A/B testing, and how do I size the initial traffic slice? What metrics should trigger a halt and rollback?

Deployment Strategy Advisor

Good framing — an 8% offline improvement is meaningful, but as the Bing team documented, offline improvements predict production improvements only about 75% of the time. Let's design a rollout that gives you real evidence at each stage before you increase exposure. Would you like to start with shadow deployment design, A/B test sizing, or rollback trigger criteria?

Module 8 · Lesson 4

Closing the Loop — Retraining and Governance

When evaluation finds a problem, what happens next? The human decisions that continuous monitoring must eventually produce.

Evaluation tells you the model is degrading — but who decides when to retrain, and what governance ensures the fix doesn't create new problems?

In September 2020, users began posting examples showing that Twitter's automated image-cropping algorithm consistently favored white faces over Black faces when framing photos. Twitter's engineering team ran their own analysis, confirmed the disparity, and published the results in May 2021. Their response was not to retrain: they removed the automated cropping feature entirely, concluding that no technically feasible retrain would guarantee elimination of the bias at acceptable cost. The case illustrates that "close the loop" doesn't always mean retrain; sometimes the evaluation finding points to a different intervention entirely.

The Retraining Decision

When continuous evaluation detects degradation, three categories of response are possible: retrain on new data, roll back to a prior model version, or intervene differently (as Twitter did). Each is appropriate in different circumstances.

Retraining is appropriate when the degradation is caused by drift (the input distribution, or the relationship between inputs and outcomes, has changed) and new data representing current conditions is available. Rollback is appropriate when a recent model update is the identified cause of the degradation. Alternative intervention is appropriate when the degradation reveals a fundamental problem with the task framing, training data, or deployment context that retraining alone cannot address.

Scheduled Retraining

Retraining on a fixed calendar cadence regardless of detected degradation. Simple to operate; may waste resources or lag behind fast-moving drift.

Triggered Retraining

Retraining initiated when monitoring detects metric degradation beyond a threshold. Responds to actual drift; requires reliable monitoring infrastructure.

Continuous Retraining

Models retrain on a rolling window of recent data without a discrete trigger. Used by high-frequency systems like ad bidding; expensive and operationally complex.

Data Governance for Retraining

Retraining on production data introduces new risks. The data that arrives in production may reflect the prior model's decisions — a recommendation model trained on its own recommendations can amplify existing biases. This is the feedback loop problem in data collection, distinct from the inference-time feedback loop problem.

Spotify's infrastructure team described in 2021 their approach to this: they maintain a causal data collection protocol that randomizes recommendations for a small fraction of traffic at all times, specifically to generate unbiased training data that isn't confounded by the prior model's choices. This data is ring-fenced for retraining use and validated against the broader production distribution before retraining begins.
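The general idea of reserving a randomized slice of traffic can be sketched as below. This is not Spotify's protocol, whose details the text does not give; the 1% exploration rate, the uniform choice over the catalog, and all names are assumptions for illustration.

```python
import random

def recommend(user_id, ranked_items, catalog, log, explore_rate=0.01, rng=None):
    """Serve the model's top item, except for a small randomized slice of requests."""
    rng = rng or random.Random()
    explored = rng.random() < explore_rate
    # On the exploration slice the item is chosen uniformly, ignoring the model.
    item = rng.choice(catalog) if explored else ranked_items[0]
    # The flag lets the training pipeline separate randomized from model-driven rows.
    log.append({"user": user_id, "item": item, "randomized": explored})
    return item

def unbiased_training_rows(log):
    """Keep only rows whose exposure was not decided by the prior model."""
    return [row for row in log if row["randomized"]]
```

The randomized rows are less useful to the user in the moment, which is the price paid for training data that the previous model's choices have not shaped.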

The Retraining Validation Problem

A retrained model must itself be validated before deployment — which returns you to the evaluation pipeline from Lesson 3. This creates a cycle: evaluation → decision → retrain → evaluation → deploy. Governance structures must ensure that time pressure (to fix a degrading model quickly) doesn't cause teams to skip validation steps. The EU AI Act's Article 9 explicitly requires that high-risk AI systems have documented risk management processes covering post-market monitoring and corrective action.

Model Versioning and Rollback Infrastructure

Continuous evaluation is only actionable if rollback is fast. MLflow, Weights & Biases, and SageMaker all provide model registries that store versioned model artifacts with associated metadata: training data hash, evaluation metrics at training time, and deployment history. When a rollback decision is made, the registry enables restoration to any prior version within minutes rather than hours.
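What a registry has to track for rollback to work can be sketched with an in-memory stand-in. This does not use or imitate the API of MLflow, Weights & Biases, or SageMaker; the class names, fields, and the example artifact paths are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    version: int
    artifact_uri: str          # hypothetical storage location of the model file
    training_data_hash: str
    eval_metrics: dict

@dataclass
class ModelRegistry:
    versions: dict = field(default_factory=dict)
    deployment_history: list = field(default_factory=list)  # versions in promotion order

    def register(self, mv):
        self.versions[mv.version] = mv

    def promote(self, version):
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        self.deployment_history.append(version)

    @property
    def live(self):
        return self.deployment_history[-1] if self.deployment_history else None

    def rollback(self):
        """Re-promote the version that was live before the current one."""
        if len(self.deployment_history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        previous = self.deployment_history[-2]
        self.deployment_history.append(previous)  # history stays append-only
        return previous
```

Keeping the deployment history append-only means a rollback is itself recorded as an event, which supports the audit trail that governance requires.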

DoorDash's ML platform team documented in 2022 that they maintain a "champion-challenger" architecture at all times — the current production model (champion) runs alongside the previously validated model (challenger) which is ready to receive traffic within 60 seconds if the champion is rolled back. This ensures that rollback is always a viable option, not a theoretical one.

Human Oversight in the Evaluation Loop

Fully automated retraining pipelines — where detected degradation triggers retraining and deployment without human approval — carry the risk of automated error amplification. If the evaluation system itself has a bug, or if the detected "degradation" is a measurement artifact, automated pipelines can retrain on corrupted signals and deploy a worse model faster than any human can intervene.

The industry standard at companies including Google, Microsoft, and Airbnb is to automate detection and preparation (surfacing degradation alerts, running retraining, running offline validation) but require human approval for promotion to production — at least for high-stakes models. The 2023 White House Executive Order on AI, and the EU AI Act's requirements for high-risk systems, both establish human oversight of consequential AI decisions as a baseline expectation.

Detection

Automated monitoring detects metric degradation and surfaces alert with context — which metrics, which segments, since when, magnitude.

Root Cause Analysis

Human or automated investigation determines cause: data drift, concept drift, upstream pipeline failure, or model update.

Intervention Selection

Team decides: retrain, rollback, or alternative intervention. Decision documented with rationale.

Validation and Deployment

New model or rolled-back version validated offline and in shadow/A/B mode before promotion. Human approval gate before full deployment.

Post-Intervention Monitoring

Heightened monitoring cadence for 72 hours post-deployment. Confirm resolution before returning to standard schedule.
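The approval gate in the workflow above can be sketched as a record that refuses deployment until validation and a named human sign-off are both present. The class and field names are illustrative assumptions, not a reference to any particular governance tool.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Intervention:
    kind: str                          # "retrain", "rollback", or "alternative"
    rationale: str                     # documented reason for the decision
    offline_validated: bool = False
    online_validated: bool = False     # shadow or A/B evidence
    approved_by: Optional[str] = None
    audit_log: list = field(default_factory=list)

    def approve(self, reviewer):
        """Record a human sign-off; automation prepares, a person promotes."""
        self.approved_by = reviewer
        self.audit_log.append(f"approved by {reviewer}")

    def can_deploy(self):
        """Deployment requires both validation stages and an accountable approver."""
        return (self.offline_validated
                and self.online_validated
                and self.approved_by is not None)
```

Automation can set the validation flags, but only a named person can satisfy the final condition, which keeps the decision traceable.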

The Governance Mandate

Continuous evaluation without governance is just continuous observation. The loop only closes when detected problems are met with documented decisions, validated fixes, and confirmed resolution — all traceable to accountable human decision-makers. This is both operational best practice and, for high-risk AI systems, increasingly a legal requirement.

Lesson 4 Quiz

Closing the Loop — Retraining and Governance

Twitter's response to the image-cropping bias finding in 2021 is significant because it demonstrates that:

Correct. Twitter concluded that no technically feasible retraining would guarantee elimination of the bias, so the feature was removed entirely — demonstrating that "close the loop" has multiple possible responses beyond retraining.

Twitter's case shows the opposite: they evaluated the problem, determined retraining wasn't sufficient, and removed the feature entirely. Evaluation findings can lead to interventions other than retraining.

Spotify's "causal data collection protocol" addresses which specific risk in retraining?

Correct. By randomizing a fraction of recommendations, Spotify generates training data uncorrupted by the prior model's choices — preventing the new model from learning the old model's biases through feedback-loop-contaminated data.

The specific risk is feedback-loop data contamination: training on production data that reflects the prior model's choices can amplify whatever biases that model had. Spotify's protocol generates unbiased training data through randomization.

What is "triggered retraining"?

Correct. Triggered retraining is event-driven — monitoring detects a threshold breach and initiates the retraining pipeline automatically, responding to actual drift rather than a fixed schedule.

That describes scheduled retraining. Triggered retraining is initiated by a monitoring event — specifically when metrics cross a degradation threshold, making the response proportional to actual drift.
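
The contrast between the two retraining modes can be made concrete with a small sketch. The function names, the 30-day cadence, and the 0.80 threshold are hypothetical values chosen for illustration.

```python
from datetime import date, timedelta


def scheduled_retrain_due(last_trained: date, today: date, cadence_days: int = 30) -> bool:
    """Scheduled retraining: fires on the calendar, regardless of model health."""
    return today - last_trained >= timedelta(days=cadence_days)


def triggered_retrain_due(metric_value: float, degradation_threshold: float = 0.80) -> bool:
    """Triggered retraining: fires only when a monitored metric breaches its threshold."""
    return metric_value < degradation_threshold
```

In both cases the function only decides whether to start the retraining pipeline. Promotion of the resulting model is a separate decision.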

Why do companies like Google and Airbnb maintain human approval gates before promoting retrained models to production?

Correct. Fully automated pipelines risk automating errors at scale. If the monitoring system itself has a bug, automated retraining and deployment could propagate a corrupted model faster than any human oversight could catch it.

The rationale is about error amplification risk. If automated evaluation systems misidentify degradation or are fed corrupted signals, a fully automated pipeline could deploy a bad model faster than any human can intervene.

Lab 4 — Retraining Decision Workshop

Practice session · MLOps Governance Advisor

Your Scenario

You run a credit-scoring model in production for a regional bank. Your continuous evaluation pipeline has just flagged three things: (1) overall AUC has dropped from 0.87 to 0.81 over 60 days; (2) the drop is concentrated in applicants with non-traditional income sources, a segment that grew 35% in volume over the same period; (3) a minor feature-engineering change you made three months ago is now a possible confound. You need to decide how to respond.

Ask the advisor: Is this data drift, concept drift, or a model update regression? How do I determine which it is? If I retrain, what governance steps do I need before promoting the new model? How do I ensure the retrained model doesn't amplify the subgroup disparity I've now detected?

MLOps Governance Advisor

This is a multi-hypothesis situation — you have three plausible causes for the degradation and they require different responses. Disentangling them is the first priority before any retraining decision. Let's work through it systematically. Where would you like to start: ruling out the feature engineering change as a regression, characterizing the drift in the non-traditional income segment, or designing the governance checklist for whatever intervention you choose?
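
One possible first diagnostic for this scenario is to break AUC down by segment and time period instead of looking only at the overall figure. The sketch below is an assumption-laden illustration: the row layout (segment, period, label, score), the function names, and the period labels are hypothetical, and the pairwise AUC computation is quadratic, so it suits small samples only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(rows: Iterable[Tuple[str, str, int, float]]) -> Dict[Tuple[str, str], float]:
    """rows are (segment, period, label, score); returns AUC per (segment, period)."""
    groups = defaultdict(lambda: ([], []))
    for segment, period, label, score in rows:
        labels, scores = groups[(segment, period)]
        labels.append(label)
        scores.append(score)
    return {key: auc(labels, scores) for key, (labels, scores) in groups.items()}
```

As a rough reading guide: if AUC falls in every segment starting at the period of the feature-engineering change, a regression is the more likely explanation. If it falls mainly in the non-traditional income segment and worsens as that segment grows, drift is more likely. Neither pattern is conclusive on its own.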

Module 8 Test

Continuous Evaluation in Production — 15 questions · 80% to pass

1. What term describes the phenomenon where the statistical relationship between inputs and correct outputs changes after a model is deployed?

Correct. Concept drift is when the real-world relationship between inputs and correct outputs changes — distinct from data drift, which is when the input distribution changes without the underlying relationship changing.

Concept drift is the correct term — the relationship between inputs and the correct output has shifted in the real world. Data drift refers to input distribution changes only.

2. Amazon's ML hiring tool was shut down in 2018 because:

Correct. The model penalized résumés containing "women's" (as in "women's chess club") because it had learned from historical hiring data where most successful candidates were male.

The documented cause was gender bias: the model penalized résumés associated with women, having learned from a decade of male-dominated hiring outcome data.

3. Which three question categories does continuous evaluation answer on a recurring basis?

Correct. These three questions form the core of continuous evaluation: overall stability, input distribution monitoring, and disaggregated subgroup analysis.

The three recurring questions of continuous evaluation are: Is performance stable? Is the input distribution changing? Are there segments degrading faster than others?

4. Label latency requires production evaluation systems to:

Correct. Delayed labels require proxy metrics (immediate signals correlated with the true outcome) and delayed-label pipelines that incorporate ground truth when it eventually arrives.

Label latency demands proxy metrics and delayed-label pipelines — ways to get useful evaluation signals in real time before ground truth is available.

5. The Apple Card credit algorithm case demonstrates primarily that:

Correct. The algorithm likely had healthy aggregate metrics — the 20× credit limit disparity was only visible through gender-disaggregated analysis of outcomes.

The Apple Card case shows that aggregate metrics gave false confidence while a severe gender disparity existed — only visible through disaggregated evaluation of outcomes by gender.

6. A model predicting churn shows 89% aggregate accuracy but 71% accuracy for customers aged 55+. This is a P2 alert scenario because:

Correct. P2 is specifically for the pattern where aggregate metrics are stable but a subgroup metric is diverging — requiring a targeted fairness review and potentially targeted retraining.

This matches the P2 pattern: aggregate metrics appear fine, but a subgroup is experiencing significantly worse performance. P0 is for critical aggregate threshold breach; P1 is for aggregate declining trends.
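
The three alert tiers can be expressed as a small decision rule. This is a sketch under stated assumptions: the function name classify_alert and every threshold (critical_floor, trend_tolerance, subgroup_gap) are hypothetical. Only the meaning of P0, P1, and P2 comes from the explanation above.

```python
from typing import Dict, Optional


def classify_alert(
    aggregate: float,
    aggregate_trend: float,
    subgroup_metrics: Dict[str, float],
    critical_floor: float = 0.80,
    trend_tolerance: float = -0.01,
    subgroup_gap: float = 0.10,
) -> Optional[str]:
    """Map monitoring readings to P0/P1/P2; every threshold here is an assumed value."""
    if aggregate < critical_floor:
        return "P0"  # critical aggregate threshold breach
    if aggregate_trend < trend_tolerance:
        return "P1"  # aggregate metric on a declining trend
    if any(aggregate - value > subgroup_gap for value in subgroup_metrics.values()):
        return "P2"  # aggregate looks fine, but a subgroup lags well behind
    return None
```

With the numbers from this question (0.89 aggregate, 0.71 for customers aged 55+, flat trend), the 0.18 gap exceeds the assumed 0.10 tolerance and the rule returns P2.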

7. Calibration error in a deployed model refers to:

Correct. A calibrated model saying "70% probability" should be correct 70% of the time. Calibration can degrade after distribution shift even when accuracy appears stable.

Calibration error is the gap between stated probability and actual outcome frequency. A model saying "90% confident" that is actually right only 60% of the time has severe calibration error.
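
One common way to put a number on this gap is expected calibration error: group predictions into probability bins, then take the weighted average of the difference between mean predicted probability and observed outcome frequency in each bin. The sketch below assumes binary outcomes and equal-width bins. The function name and the default of 10 bins are illustrative choices.

```python
from typing import List


def expected_calibration_error(probs: List[float], outcomes: List[int], n_bins: int = 10) -> float:
    """Weighted average gap between mean predicted probability and observed frequency per bin."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    bins: List[list] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - observed)
    return ece
```

For the example above, ten predictions of 0.9 with six positive outcomes give an error of about 0.3, while ten predictions of 0.7 with seven positive outcomes give an error of about 0.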

8. Shadow deployment differs from A/B testing in that:

Correct. Shadow deployment carries no user-facing risk, because users never see the new model's outputs, but it cannot measure user response. A/B testing exposes a traffic slice to the new model and measures actual user behavior.

The key difference is exposure: shadow deployment logs the new model's outputs without serving them; A/B testing actually serves the new model to a treatment group to measure real user response.

9. The Bing experimentation team found that offline evaluation improvements predicted production improvements approximately:

Correct. Kohavi et al. (KDD 2013) documented that offline improvements predicted production improvements roughly 75% of the time: strong enough to be informative, imperfect enough to require live testing before any model is promoted to production.

The published figure is approximately 75%, meaning roughly 1 in 4 offline "improvements" did not carry over to live user metrics, which is why Bing requires live experimentation before production promotion.

10. Netflix's interleaving technique achieves statistical significance with ~100× less traffic than A/B testing because:

Correct. Within-user comparison eliminates the between-user variance that dominates A/B tests, allowing the same statistical power with far less traffic.

The variance reduction comes from within-user comparison: the same user on the same query evaluates both models, so user-level variation cancels out. This requires ~100× less traffic than between-user A/B testing.

11. Triggered retraining is initiated when:

Correct. Triggered retraining is event-driven — a monitoring threshold is crossed, and the pipeline initiates retraining automatically, responding to detected drift rather than a schedule.

Triggered retraining is initiated by a monitoring event — a metric threshold breach — not a calendar or manual decision. Scheduled retraining uses a fixed cadence.

12. Spotify's causal data collection protocol addresses what specific retraining risk?

Correct. By randomizing a fraction of recommendations, Spotify generates training data not confounded by the prior model's choices — preventing bias amplification through feedback-loop contamination.

The risk being addressed is data-level feedback loops: production data reflects the prior model's choices, so retraining on it uncritically can amplify whatever biases that model had. Randomization breaks this loop.

13. DoorDash's champion-challenger architecture ensures that:

Correct. The architecture makes rollback operationally viable — a prior validated model is always warm and ready, preventing a situation where rollback is theoretically possible but practically slow.

The champion-challenger pattern specifically ensures rollback readiness: the prior validated model stays warm and can take over traffic within 60 seconds if the current production model (the champion) needs to be rolled back.

14. Why do organizations that follow industry-standard practice maintain human approval gates before promoting retrained models to production?

Correct. The risk is error amplification at speed: fully automated pipelines could propagate monitoring bugs or corrupted evaluation signals into production faster than any human can catch and correct them.

The rationale is automated error amplification risk. If automated monitoring has a bug, a fully automated pipeline deploys the mistake at speed. Human gates catch errors before they affect users.

15. According to the five-stage corrective action pipeline, what happens immediately after the "Intervention Selection" stage?

Correct. After selecting the intervention (retrain, rollback, or other), the resulting model must be validated — offline, then shadow/A/B — before promotion to production, with human sign-off required.

After intervention selection comes validation and deployment: the fix is validated offline and in shadow or A/B mode before promotion, with a human approval gate. Heightened monitoring comes after deployment, not before.