Amazon built a machine-learning hiring tool beginning in 2014, training it on résumés submitted over a ten-year period. The system was evaluated on held-out test sets and showed strong performance — until internal audits revealed it was systematically downgrading résumés that included the word "women's" (as in "women's chess club"). The underlying training data reflected a decade of male-dominated hiring outcomes, and no continuous evaluation caught the drift. Amazon shut the tool down in 2018 without ever deploying it at scale. The lesson was stark: a one-time evaluation at launch is not sufficient when the world being modeled keeps changing.
Most ML workflows treat evaluation as a pre-deployment gate: train a model, evaluate it on a held-out test set, meet a threshold, ship it. This framing has a hidden assumption — that the statistical relationship between inputs and outputs will remain stable after deployment. In practice, that assumption fails in three recurring ways.
Data drift occurs when the distribution of incoming inputs shifts away from the distribution seen during training. A fraud-detection model trained on 2021 transaction patterns may never have seen the buy-now-pay-later structures that proliferated in 2022–2023. Concept drift occurs when the underlying relationship between inputs and the correct output changes — a credit-risk model trained before a recession assigns different risk to identical financial profiles than one trained after. Feedback loops occur when the model's own outputs alter future inputs, as with recommender systems that shape the very consumption patterns they predict.
Each of these mechanisms means that a model which passed its pre-deployment evaluation can silently degrade once live — sometimes within weeks, sometimes over years.
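One way to make data drift measurable is to compare the distribution of a single input feature in live traffic against a reference sample kept from training time. The sketch below uses the Population Stability Index (PSI) for this; the function name, the equal-width binning, and the `eps` smoothing constant are illustrative assumptions, not something prescribed by the text above.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    live: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference (training-time) sample and a live sample of one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the feature is constant

    def bin_fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # Clamp so live values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    live_frac = bin_fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))
```

A commonly quoted rule of thumb treats PSI below roughly 0.1 as stable and above roughly 0.25 as a shift worth investigating, but those cut-offs are conventions and should be tuned per feature. Note that PSI only looks at inputs, so it can surface data drift but not concept drift.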
Continuous evaluation means running a structured measurement process on live production traffic — not just at model-update time, but as an ongoing operational discipline. It answers three questions on a recurring schedule:
In production, ground-truth labels often arrive with significant delay or not at all. A content moderation model may flag a post, but whether that flag was correct might only be determinable after a human review that takes days. An ad click-through model won't know if a click converted for 30 days. Continuous evaluation must account for this label latency by designing proxy metrics and delayed-label pipelines.
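A delayed-label pipeline can be sketched as a join between logged predictions and labels that arrive later, scored only over predictions old enough for their label to have plausibly arrived. The `Prediction` record, the `maturity` window, and the function name below are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Prediction:
    request_id: str
    predicted: int
    made_at: datetime


def matured_accuracy(
    predictions: list[Prediction],
    labels: dict[str, int],
    now: datetime,
    maturity: timedelta,
) -> tuple[Optional[float], float]:
    """Return (accuracy, label coverage) over predictions older than `maturity`."""
    # Only score predictions whose label has had time to arrive.
    matured = [p for p in predictions if now - p.made_at >= maturity]
    labeled = [p for p in matured if p.request_id in labels]
    coverage = len(labeled) / len(matured) if matured else 0.0
    if not labeled:
        return None, coverage
    correct = sum(p.predicted == labels[p.request_id] for p in labeled)
    return correct / len(labeled), coverage
```

Reporting coverage alongside accuracy matters here: if only a small or unrepresentative share of matured predictions ever receives a label, the accuracy figure describes that share, not the whole traffic.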
Modern MLOps frameworks conceptualize continuous evaluation as a closed loop. Google's MLOps whitepaper (2020) describes a production ML system as requiring not just a model pipeline but a feedback pipeline that continuously routes production signals back into the evaluation and retraining workflow.
Meta's infrastructure team reported in 2023 that their recommendation models are evaluated on live traffic slices hourly, with automated alerts triggering retraining workflows when metrics cross pre-defined thresholds. The infrastructure cost is non-trivial, but the cost of a degraded recommender at Meta's scale (billions of daily active users) vastly exceeds it.
Continuous evaluation is not a quality-assurance afterthought — it is the mechanism by which a deployed AI system maintains its contract with users over time. Without it, every deployed model is running on borrowed reliability.
You operate a loan-application approval model that was deployed 8 months ago. It passed all pre-launch evaluations with 91% accuracy. No retraining has been done since launch. You've just noticed that customer complaint rates have risen 40% in the past two months, but your offline test-set metrics look fine.
In November 2019, software engineer David Heinemeier Hansson publicly reported that Apple Card's credit algorithm, operated by Goldman Sachs, had offered him 20× the credit limit offered to his wife, despite their shared assets and her higher credit score in some systems. When Goldman investigated, they stated the algorithm was working as intended. New York's Department of Financial Services opened an investigation. Goldman later paid an $89 million settlement in 2024 acknowledging fair lending violations. The algorithm's overall performance metrics were likely healthy; the problem was only visible when outcomes were disaggregated by gender.
Every production evaluation system requires a deliberate choice: which metrics will you track, at what granularity, and how will you respond when they move? The wrong choices produce two failure modes. Metric blindness is tracking metrics that don't surface real problems, as in the Apple Card case. Metric overload is tracking so many metrics that teams can't distinguish signal from noise and alerts become meaningless.
The discipline starts by distinguishing between model metrics (properties of the model's statistical outputs), system metrics (infrastructure health), and outcome metrics (real-world effects on users).
The Apple Card case, and dozens like it in credit, hiring, healthcare, and criminal justice, share a common structure: aggregate metrics mask subgroup disparities. In 2023, the National Institute of Standards and Technology published its AI Risk Management Framework, voluntary guidance that encourages disaggregating evaluation results across demographic groups where feasible.
Disaggregated evaluation means computing key metrics separately for meaningful subgroups — by geography, age bracket, device type, or demographic — and then comparing across groups. A model with 90% aggregate accuracy but 78% accuracy for a minority subgroup is not a 90%-accurate model for that population.
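As a rough illustration of disaggregated evaluation, the sketch below computes accuracy per subgroup and flags groups that fall well below the aggregate. The function names, the 5-point gap threshold, and the minimum sample size are assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred). Returns {group: (accuracy, n)}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {g: (correct[g] / total[g], total[g]) for g in total}


def flag_gaps(per_group, overall, max_gap=0.05, min_n=30):
    """Groups whose accuracy trails the overall figure by more than max_gap.

    Groups with fewer than min_n examples are skipped because their
    accuracy estimate is too noisy to act on.
    """
    return sorted(
        g for g, (acc, n) in per_group.items()
        if n >= min_n and overall - acc > max_gap
    )


# Mirrors the 90% aggregate / 78% subgroup situation described above.
records = (
    [("majority", 1, 1)] * 744 + [("majority", 1, 0)] * 56
    + [("minority", 1, 1)] * 156 + [("minority", 1, 0)] * 44
)
per_group = accuracy_by_group(records)
overall = sum(int(t == p) for _, t, p in records) / len(records)
flagged = flag_gaps(per_group, overall)
```

With these made-up counts, the aggregate accuracy is 0.90 while the smaller group sits at 0.78, so only that group is flagged.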
A medical diagnostic model optimizing for recall (catching all positive cases) will generate false positives that require expensive follow-up. A model optimizing for precision (only flagging high-confidence positives) will miss true cases. Neither is universally correct — the right trade-off is determined by the cost structure of errors in the specific deployment context, and it may shift as the deployment context changes.
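The precision-recall trade-off can be made concrete by attaching a cost to each error type and picking the decision threshold that minimizes total cost on recent labeled data. This is a minimal sketch; the cost values, the candidate grid, and the function names are illustrative assumptions.

```python
def total_error_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of false positives and false negatives at a given threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def best_threshold(scores, labels, cost_fp, cost_fn, candidates=None):
    """Lowest-cost threshold among the candidates (default grid: 0.05 to 0.95)."""
    candidates = candidates or [i / 20 for i in range(1, 20)]
    return min(
        candidates,
        key=lambda t: total_error_cost(scores, labels, t, cost_fp, cost_fn),
    )


scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

# Missed cases are expensive: the chosen threshold drops to favor recall.
recall_leaning = best_threshold(scores, labels, cost_fp=1, cost_fn=10)
# False alarms are expensive: the chosen threshold rises to favor precision.
precision_leaning = best_threshold(scores, labels, cost_fp=10, cost_fn=1)
```

Because the costs are inputs rather than constants, the same routine can be re-run when the deployment context changes and the cost structure shifts with it.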
When true labels arrive days or weeks after a prediction, you need proxy metrics to evaluate model health in real time. Netflix's recommendation team has publicly described using engagement rate (whether a recommended title is actually watched beyond a threshold duration) as a proxy for recommendation quality, because user satisfaction surveys arrive too slowly. Uber Eats uses reorder rate as a proxy for delivery quality predictions.
The danger with proxies is that they can be optimized directly — a recommender trained to maximize watch duration may push sensational content that users regret watching. Proxy selection requires understanding why the proxy correlates with what you care about, and continuously checking that correlation.
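Continuously checking a proxy could look like the following: for cohorts where the slow ground-truth metric has finally arrived, correlate it with the proxy value recorded at the time, and raise a flag if the relationship weakens. The Pearson correlation, the 0.5 floor, and the function names are assumptions chosen for this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def proxy_health(proxy_by_cohort, truth_by_cohort, min_corr=0.5):
    """Return (correlation, still_trustworthy) for matched cohort-level values."""
    corr = pearson(proxy_by_cohort, truth_by_cohort)
    return corr, corr >= min_corr
```

A falling correlation does not say why the proxy stopped tracking the target, only that decisions based on it deserve a closer look.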
A well-calibrated model is one whose predicted probabilities match actual frequencies. If a model says an event has a 70% probability, that event should actually occur 70% of the time across all predictions at that confidence level. Calibration error tends to drift even when accuracy stays stable — especially after distribution shift. A 2021 paper from Carnegie Mellon showed that models deployed on distribution-shifted data can maintain accuracy while their calibration degrades severely, causing downstream decision-makers to over-trust or under-trust outputs.
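Calibration as defined above can be tracked with a binned calibration error: group predictions by predicted probability, then compare the average predicted probability in each bin with the observed event frequency. The sketch below computes a binary-event expected calibration error (ECE); the equal-width binning and the default of 10 bins are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between predicted probability and observed frequency.

    probs: predicted probabilities of the positive event, each in [0, 1].
    labels: 1 if the event occurred, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - observed)
    return ece
```

For instance, ten predictions of 0.7 where seven events occur give an error of approximately zero, while ten predictions of 0.9 where only five occur give roughly 0.4. Accuracy can be identical in both cases, which is why calibration is worth tracking as its own metric.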
Teams at Google, Microsoft, and Airbnb have converged on a tiered alert structure:
Choose the fewest metrics that collectively answer: "Is the model still doing what it was designed to do, for all the populations it serves?" Calibration, disaggregated accuracy, and at least one outcome metric form a minimum viable evaluation set for most production systems.
You are designing the evaluation framework for a healthcare triage model that predicts which patients need immediate escalation. The model serves a diverse patient population across 12 hospitals. Ground-truth labels (actual patient outcomes) arrive with a 48-hour delay on average. Your manager is asking for a dashboard covering "the right metrics."
Microsoft's Bing team has published extensively on their controlled experiment platform, which runs thousands of simultaneous A/B tests across Bing's search ranking, ads, and UI. In a 2013 paper in KDD, Kohavi, Longbotham, and colleagues reported that most proposed changes — including changes that engineers were confident would improve quality — showed no improvement or degraded key metrics when actually tested on live users. The Bing framework became a reference model for the industry: no model change ships to production without first demonstrating improvement on live traffic. By 2019, their platform was running over 10,000 concurrent experiments.
A held-out test set is a snapshot of the world at collection time, filtered through the decisions made about what data to label and how. Production traffic is messier, more diverse, and arrives in patterns no offline dataset fully captures. The Bing team documented that improvements on their offline evaluation correlated with production improvements only about 75% of the time — meaning one in four "improvements" that looked good offline actually degraded live user experience.
Two evaluation patterns address this gap: shadow deployment and A/B testing. They serve different purposes and answer different questions.
In shadow deployment, a new model runs in parallel with the production model. Real production requests are duplicated and sent to both. The new model's outputs are logged but not served to users — the old model continues to make all live decisions. This allows teams to observe how the new model would have behaved on real traffic without any risk to users.
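The request flow just described might be sketched as follows. The function name, the list used as a log sink, and the synchronous shadow call are simplifying assumptions; a real system would typically run the shadow model off the request path so it cannot add latency.

```python
import logging

logger = logging.getLogger("shadow")


def serve_with_shadow(request, production_model, shadow_model, shadow_log):
    """Serve the production model's output; record the shadow model's output only."""
    response = production_model(request)
    try:
        shadow_output = shadow_model(request)
        shadow_log.append(
            {"request": request, "production": response, "shadow": shadow_output}
        )
    except Exception:
        # A crashing shadow model must never affect what the user receives.
        logger.exception("shadow model failed")
        shadow_log.append(
            {"request": request, "production": response, "shadow_error": True}
        )
    return response
```

The logged pairs can then be compared offline, for example by measuring how often the two models disagree or how the shadow model's output distribution differs from production.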
Shadow deployment is particularly valuable for catching operational failures — edge cases that crash the model, unexpected output distributions, or latency spikes — before they affect real users. Google's Site Reliability Engineering documentation describes shadow testing as the standard pre-production step for models handling sensitive user data.
Shadow deployment cannot evaluate user response to new model outputs, because users never see them. A recommendation model may produce statistically higher-quality suggestions in shadow mode, but you can't know if users will actually engage with them differently. For that, you need A/B testing.
In an A/B test (also called a controlled experiment or online experiment), incoming traffic is randomly split: a control group receives the current model's outputs, and a treatment group receives the new model's outputs. Both groups interact normally with the product. The difference in outcomes between groups — measured on engagement, error rates, conversions, or other metrics — estimates the causal effect of the model change.
The randomization is critical. Without it, selection bias undermines causal inference. Airbnb's ML platform team reported in 2019 that their experimentation framework assigns users to experiment buckets using hash functions over user IDs, which keeps bucket assignment consistent across sessions and prevents users from seeing different model outputs depending on which server handles their request.
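Hash-based bucket assignment can be sketched in a few lines. This is a generic illustration, not a description of any particular company's implementation: the SHA-256 choice, the experiment-name salt, and the function name are all assumptions.

```python
import hashlib


def assign_bucket(user_id: str, experiment: str, treatment_fraction: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control' for one experiment."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 32 bits of the digest as a position in [0, 1).
    position = int(digest[:8], 16) / 2**32
    return "treatment" if position < treatment_fraction else "control"
```

Because the assignment depends only on the user ID and experiment name, any server computes the same bucket for the same user without shared state.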
At high-traffic services, even trivial effects reach statistical significance because the sample sizes are enormous. Booking.com's data science team reported in 2019 that they distinguish between statistical significance (unlikely to be due to chance) and practical significance (large enough to matter for the business or user). They set minimum detectable effect thresholds (a change must be large enough to be worth the operational cost of the switch) before any experiment is declared a winner.
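The distinction between statistical and practical significance can be illustrated with a two-proportion z-test combined with a minimum-lift check. The test choice, the 0.05 significance level, the 0.5-point minimum lift, and the function names are assumptions for this sketch.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift of B over A, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return p_b - p_a, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, p_value


def is_practical_winner(conv_a, n_a, conv_b, n_b, alpha=0.05, min_practical_lift=0.005):
    """Declare a winner only if the lift is both significant and large enough to matter."""
    lift, p_value = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
    return p_value < alpha and lift >= min_practical_lift
```

With ten million users per arm, a move from 10.00% to 10.05% conversion is statistically significant under this test yet falls short of the 0.5-point practical threshold, so it would not be declared a winner.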
For ranking systems specifically, Netflix and others have used interleaving — a technique where results from model A and model B are interleaved into a single ranked list shown to a single user. The system tracks which model's results the user interacts with. Because both models are evaluated by the same user on the same query, interleaving reduces variance dramatically compared to A/B testing and reaches statistical significance with far less traffic. Netflix's recommendation team reported in 2016 that interleaving experiments required roughly 100× less traffic than equivalent A/B tests to reach the same statistical power.
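The text above does not say which interleaving variant is used. The sketch below shows team-draft interleaving, one well-known variant, purely as an illustration: the two rankers take turns "drafting" their highest-ranked item not yet placed, and clicks are credited to the ranker that contributed the clicked item. Function names are hypothetical, and each input ranking is assumed to contain no duplicates.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return (interleaved list, {item: 'A' or 'B'})."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team_of = [], {}
    all_items = set(ranking_a) | set(ranking_b)

    while len(interleaved) < len(all_items):
        # The team with fewer picks goes next; ties are broken by coin flip.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        candidate = next((x for x in rankings[team] if x not in team_of), None)
        if candidate is None:
            # This team has nothing left to contribute; the other team picks.
            team = "B" if team == "A" else "A"
            candidate = next(x for x in rankings[team] if x not in team_of)
        interleaved.append(candidate)
        team_of[candidate] = team
        picks[team] += 1
    return interleaved, team_of


def credit_clicks(clicked_items, team_of):
    """Count clicks per team; the team with more credited clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins
```

Aggregating the per-impression wins over many users gives a preference signal between the two rankers, with each user acting as their own control.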
Shadow deployment → A/B test on a small traffic slice (1–5%) → Gradual rollout with monitoring → Full deployment. Each stage is a gate that requires the model to demonstrate improvement — not just on an offline benchmark, but on real production signals.
You have a new content ranking model ready for production. Your current model has been running for 14 months. Your team's offline benchmarks show the new model outperforms on precision@10 by 8%. However, the new model has never been tested on real user traffic. Your platform serves 2 million daily active users. You need to design a safe rollout sequence.
In September 2020, users began posting examples showing that Twitter's automated image-cropping algorithm consistently favored white faces over Black faces when framing photos. Twitter's engineering team ran their own analysis, confirmed the disparity, and published the results of their evaluation in May 2021. Their response was not to retrain: they removed the automated cropping feature entirely, concluding that no technically feasible retrain would guarantee elimination of the bias at acceptable cost. The case illustrates that "close the loop" doesn't always mean retrain; sometimes the evaluation finding recommends a different intervention entirely.
When continuous evaluation detects degradation, three categories of response are possible: retrain on new data, roll back to a prior model version, or intervene differently (as Twitter did). Each is appropriate in different circumstances.
Retraining is appropriate when the degradation is caused by data drift — the input distribution has changed and new data representing the current distribution is available. Rollback is appropriate when a recent model update is the identified cause of the degradation. Alternative intervention is appropriate when the degradation reveals a fundamental problem with the task framing, training data, or deployment context that retraining alone cannot address.
Retraining on production data introduces new risks. The data that arrives in production may reflect the prior model's decisions — a recommendation model trained on its own recommendations can amplify existing biases. This is the feedback loop problem in data collection, distinct from the inference-time feedback loop problem.
Spotify's infrastructure team described in 2021 their approach to this: they maintain a causal data collection protocol that randomizes recommendations for a small fraction of traffic at all times, specifically to generate unbiased training data that isn't confounded by the prior model's choices. This data is ring-fenced for retraining use and validated against the broader production distribution before retraining begins.
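The general idea of reserving a small randomized slice of traffic can be sketched as below. This is not a description of any company's actual protocol: the 1% exploration rate, the function names, and the log row format are all assumptions for illustration.

```python
import random


def choose_recommendation(ranked_candidates, exploration_rate=0.01, rng=None):
    """Return (item, explored). A small share of requests gets a uniformly random item."""
    rng = rng or random.Random()
    if rng.random() < exploration_rate:
        # Randomized choice: this outcome is not shaped by the current model's ranking.
        return rng.choice(ranked_candidates), True
    return ranked_candidates[0], False


def unbiased_training_rows(interaction_log):
    """Keep only rows from randomized traffic, set aside for retraining use."""
    return [row for row in interaction_log if row["explored"]]
```

Logging the `explored` flag at serving time is what makes the later separation possible; without it, randomized and model-driven interactions cannot be told apart.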
A retrained model must itself be validated before deployment — which returns you to the evaluation pipeline from Lesson 3. This creates a cycle: evaluation → decision → retrain → evaluation → deploy. Governance structures must ensure that time pressure (to fix a degrading model quickly) doesn't cause teams to skip validation steps. The EU AI Act's Article 9 explicitly requires that high-risk AI systems have documented risk management processes covering post-market monitoring and corrective action.
Continuous evaluation is only actionable if rollback is fast. MLflow, Weights & Biases, and SageMaker all provide model registries that store versioned model artifacts with associated metadata: training data hash, evaluation metrics at training time, and deployment history. When a rollback decision is made, the registry enables restoration to any prior version within minutes rather than hours.
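To show what a registry-backed rollback involves, here is a toy in-memory registry. It is deliberately not the API of MLflow, Weights & Biases, or SageMaker; every class, method, and field name is hypothetical, and a real registry would persist this state and load the actual model artifact.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    training_data_hash: str
    metrics: dict


class ModelRegistry:
    """Toy registry: versioned metadata plus an append-only deployment history."""

    def __init__(self):
        self._versions = {}
        self.history = []  # version numbers in deployment order, latest last

    def register(self, model_version):
        self._versions[model_version.version] = model_version

    def deploy(self, version):
        if version not in self._versions:
            raise KeyError(f"unknown version {version}")
        self.history.append(version)

    @property
    def live(self):
        return self._versions[self.history[-1]] if self.history else None

    def rollback_to(self, version):
        """Redeploy a version that has been live before; history is preserved."""
        if version not in self.history:
            raise ValueError(f"version {version} was never deployed")
        self.deploy(version)
        return self.live
```

Recording the rollback as a new deployment, rather than erasing the failed one, keeps the audit trail intact for later review.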
DoorDash's ML platform team documented in 2022 that they maintain a "champion-challenger" architecture at all times: the current production model (champion) runs alongside the previously validated model (challenger), which is ready to receive traffic within 60 seconds if the champion is rolled back. This ensures that rollback is always a viable option, not a theoretical one.
Fully automated retraining pipelines — where detected degradation triggers retraining and deployment without human approval — carry the risk of automated error amplification. If the evaluation system itself has a bug, or if the detected "degradation" is a measurement artifact, automated pipelines can retrain on corrupted signals and deploy a worse model faster than any human can intervene.
The industry standard at companies including Google, Microsoft, and Airbnb is to automate detection and preparation (surfacing degradation alerts, running retraining, running offline validation) but require human approval for promotion to production — at least for high-stakes models. The 2023 White House Executive Order on AI, and the EU AI Act's requirements for high-risk systems, both establish human oversight of consequential AI decisions as a baseline expectation.
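The split between automated preparation and human-approved promotion can be expressed as a simple gate. This sketch assumes a single offline metric compared against a baseline and a recorded approver name; the `Candidate` fields and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    offline_metric: float
    baseline_metric: float
    approved_by: Optional[str] = None  # filled in by a human reviewer, never by the pipeline


def can_promote(candidate: Candidate, high_stakes: bool = True) -> bool:
    """Automated checks are necessary; for high-stakes models, human sign-off is too."""
    passed_validation = candidate.offline_metric >= candidate.baseline_metric
    if not passed_validation:
        return False
    if high_stakes:
        return candidate.approved_by is not None
    return True
```

Storing the approver on the candidate record also provides the traceability to an accountable decision-maker that governance reviews tend to ask for.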
Continuous evaluation without governance is just continuous observation. The loop only closes when detected problems are met with documented decisions, validated fixes, and confirmed resolution — all traceable to accountable human decision-makers. This is both operational best practice and, for high-risk AI systems, increasingly a legal requirement.
You run a credit-scoring model in production for a regional bank. Your continuous evaluation pipeline has just flagged: (1) overall AUC has dropped from 0.87 to 0.81 over 60 days, (2) the drop is concentrated in applicants with non-traditional income sources — a segment that grew 35% in volume over the same period, (3) three months ago you did a minor feature engineering change that is now a possible confound. You need to decide how to respond.