In the early 1880s, Thomas Edison's Pearl Street Station in lower Manhattan began supplying direct-current electricity to a handful of city blocks. The technical achievement was real, but what made it lasting was something less celebrated: Edison's team built metering equipment and recorded consumption data from the very first day of operation. They knew, in near-real time, when a circuit was drawing too much load, when a generator was drifting out of spec, when a customer's meter had stopped registering. The competitors who deployed power grids without that instrumentation lost equipment, lost customers, and in several cases lost the business entirely. The observable grid outlasted the blind one by decades.
Machine learning systems in 2024 are in a strikingly similar position. Models ship continuously: GPT-4 went into production in March 2023, Meta's Llama 2 followed in July, and enterprise teams at companies from Stripe to Mayo Clinic began embedding these systems into critical workflows almost immediately. What those teams discovered, quickly, was that a model that performed well on a benchmark could degrade silently in production: the input distribution shifted, a downstream API changed, real-world language diverged from training data. Without instrumentation, the failure was invisible until a customer noticed.
This course covers the discipline that has emerged to address exactly that problem: AI observability. You will learn what signals to collect, how to detect drift before it becomes damage, how to trace failures back to their root cause, and how to build the feedback loops that let a deployed system improve over time. The material draws on documented production incidents and real tooling. It will not make deployment simple, but it will make the consequences of deployment legible.
In November 2021, Zillow's AI-powered home-purchasing program, Zillow Offers, took a 700-million-dollar write-down and shut down entirely. The model that priced homes had been accurate enough in testing. In production, however, the input distribution shifted faster than anyone had instrumented for: pandemic-era housing volatility made historical price patterns unreliable. The team was not flying blind in the traditional sense; they had dashboards, latency metrics, error rates. What they lacked was a way to see what the model was actually doing with the data it was receiving. By the time the pricing errors became visible in financial reports, the positions were already taken. Observability, for an AI system, means catching that kind of drift before the write-down.
This is not a story about a bad model. It is a story about insufficient visibility into a model's behavior in a live environment. Understanding why, and what sufficient visibility looks like, is where this module begins.
Classical software observability rests on three signal types that emerged from distributed systems engineering at companies like Google, Netflix, and Twitter between roughly 2010 and 2018. Each captures something the others miss.
Logs are timestamped records of discrete events: a request arrived, a function was called, an exception was thrown. They are the most granular signal and the most expensive to store at scale. Metrics are numerical measurements aggregated over time: requests per second, p99 latency, error rate. They are cheap to store and fast to query, but they compress detail. Traces follow a single request through every service it touches, recording timing and context at each hop. They are the primary tool for diagnosing latency in microservice architectures.
All three were designed for software that does exactly what its code says. Call function X with input Y, get output Z every time. AI inference breaks this contract: the same input can produce different outputs across model versions, the "function" is a billion-parameter statistical approximation, and the failure mode is not an exception thrown; it is a quietly wrong answer that looks perfectly normal from the outside.
A model endpoint can return HTTP 200, sub-100ms latency, and zero Python exceptions while simultaneously producing outputs that are factually wrong, biased, or completely out of distribution. Traditional monitoring would report that system as healthy. AI observability requires additional signal layers specifically designed to evaluate output quality and behavioral consistency.
AI observability extends the classical three pillars with signal types that are specific to statistical models in production. The field is young (the term was not widely used before 2021), but a consensus set of additional dimensions has emerged from the work of teams at organizations including Uber, Lyft, Google Brain, and a wave of dedicated tooling companies.
Data and feature monitoring tracks the statistical properties of inputs flowing into the model at inference time and compares them against the distribution seen during training. If a model was trained on loan applications from applicants aged 25 to 60 and begins receiving applications predominantly from applicants aged 18 to 22, a drift statistic, typically the Population Stability Index or Jensen-Shannon divergence, can flag the shift before accuracy degrades visibly.
Prediction monitoring tracks the distribution of the model's outputs. If a classification model begins emitting one class far more frequently than during validation, that shift is often a leading indicator of a data or model problem even before ground-truth labels arrive.
Model performance monitoring tracks accuracy, precision, recall, or task-specific metrics against ground truth as labels become available. The challenge is label latency: ground truth for a credit decision may not arrive for 30 to 90 days. Much of the practice of AI observability is about what to monitor in the absence of immediate ground truth.
Infrastructure observability asks: "Is the system running?" AI observability asks: "Is the system behaving correctly?" These are separate questions with separate toolchains. A production AI system needs both.
In a mature ML platform (Google's internal systems, Meta's PyTorch-based production infrastructure, or the open-source MLflow and Weights & Biases ecosystems), observability is implemented in layers. The lowest layer is infrastructure telemetry: GPU utilization, memory, network I/O, pod health in Kubernetes. The second layer is serving telemetry: request rate, latency percentiles, error codes from the model server (TorchServe, Triton Inference Server, or a custom FastAPI wrapper are common). The third layer is model telemetry: logged inputs and outputs, feature statistics, prediction distributions. The fourth layer, increasingly common since 2022, is quality telemetry: automated evaluation of outputs using secondary models, human-in-the-loop review pipelines, and feedback signals from downstream systems or users.
Each layer feeds different stakeholders. Infrastructure telemetry goes to SRE teams. Serving telemetry goes to platform engineers. Model telemetry goes to ML engineers and data scientists. Quality telemetry goes to everyone, including product managers and compliance teams. Building a coherent observability stack means deciding what to collect at each layer, where to store it, and who needs to see it in what time frame.
Uber's Michelangelo ML platform, described publicly in a 2017 engineering blog post and updated in subsequent documentation, built prediction logging into the serving layer from the start. Every inference is logged with the input features, the output score, and a unique prediction ID that can be joined against outcomes later. This design (log everything at prediction time, join against outcomes asynchronously) became a template for production ML observability at scale and influenced the architecture of tools including Tecton, Feast, and Vertex AI Feature Store.
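The log-then-join pattern can be illustrated with a minimal sketch. This is not Michelangelo's implementation; the function names, the in-memory list standing in for a log store, and the example feature and model-version values are all hypothetical.

```python
import time
import uuid


def log_prediction(log, features, score, model_version):
    """Append one inference record and return its prediction ID."""
    prediction_id = str(uuid.uuid4())
    log.append({
        "prediction_id": prediction_id,
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "score": score,
    })
    return prediction_id


def join_outcomes(log, outcomes):
    """Join logged predictions with outcomes that arrived later.

    `outcomes` maps prediction_id -> observed label. Predictions whose
    outcome has not arrived yet are skipped, not treated as errors.
    """
    return [
        {**record, "label": outcomes[record["prediction_id"]]}
        for record in log
        if record["prediction_id"] in outcomes
    ]


# Hypothetical usage: two predictions are logged, one outcome has arrived.
prediction_log = []
pid_a = log_prediction(prediction_log, {"trip_km": 4.2}, 0.91, "eta-v3")
pid_b = log_prediction(prediction_log, {"trip_km": 11.0}, 0.40, "eta-v3")
joined = join_outcomes(prediction_log, {pid_a: 1})
```

The prediction ID is the only coupling between serving and evaluation, which is what allows the join to run days or weeks after the inference.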
Through 2023 and into 2024, large language models moved from research artifacts to production dependencies at a pace that outran the observability tooling designed for classical ML. A regression model producing a continuous score is relatively straightforward to monitor statistically. A language model producing free-form text raises harder questions: how do you detect drift in a 512-token output? How do you define a distribution over natural language? How do you monitor for hallucination at scale without reading every response?
The emerging answers involve LLM-as-evaluator architectures, which use a second language model to score the outputs of the first, combined with deterministic checks for factual anchoring, citation accuracy, and format compliance. Companies including Arize AI, Weights & Biases, LangSmith, and Honeycomb have built tooling specifically for this layer. The field is moving fast, and the vocabulary and practices covered in this module are the stable foundation beneath that fast-moving surface.
You'll work through a real-world deployment scenario with the AI tutor. Describe what observability signals are present, identify the gaps, and reason through what additional instrumentation would have changed the outcome. Focus on distinguishing infrastructure monitoring from AI-specific observability.
In early 2020, a fraud detection model deployed by a major European bank abruptly began misclassifying legitimate transactions at a rate three times higher than its validation benchmark. No one had changed the model. No code had been deployed. What had changed was the world: COVID-19 lockdowns had fundamentally altered consumer spending patterns within a matter of weeks. Online grocery purchases spiked. Gym memberships dropped to zero. The transaction patterns that had defined "normal" in the training data were no longer representative of the real distribution. The bank's monitoring system, which tracked accuracy against a rolling 90-day window of labeled outcomes, could lag a shift by as much as 90 days. By the time the alert fired, the false positive rate had been elevated for six weeks.
This scenario, documented in a 2021 case study by the European Banking Authority on operational resilience of AI systems, became a canonical illustration of why drift detection cannot depend solely on lagged accuracy metrics.
Drift detection begins with a reference distribution, typically the feature statistics from the training or validation dataset, and compares incoming production data against it. Several statistical tests have become standard tools for this comparison.
The Kolmogorov-Smirnov (KS) test compares two continuous distributions by measuring the maximum absolute difference between their empirical cumulative distribution functions. It requires no assumptions about the underlying distribution shape and is sensitive to changes in both location and spread. It is the most commonly used test for continuous features in production ML monitoring systems.
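As a minimal sketch of the statistic itself, the function below computes the maximum gap between two empirical CDFs using only the standard library. The function name and sample values are illustrative; in practice a library routine such as SciPy's `scipy.stats.ks_2samp` would typically be used, since it also reports a p-value.

```python
from bisect import bisect_right


def ks_statistic(reference, production):
    """Two-sample KS statistic: max |ECDF_ref(x) - ECDF_prod(x)|.

    Both ECDFs are step functions that only change at observed values,
    so checking every observed value is enough to find the maximum gap.
    """
    ref = sorted(reference)
    prod = sorted(production)
    max_gap = 0.0
    for x in ref + prod:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        max_gap = max(max_gap, abs(cdf_ref - cdf_prod))
    return max_gap


# Illustrative values: the production sample is shifted upward by 2.
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
identical = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
```

A statistic of 0 means the two samples have identical empirical distributions, and 1 means they do not overlap at all. Turning the statistic into an alert still requires a threshold or p-value that accounts for sample size.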
The Population Stability Index (PSI) originated in credit risk modeling in the 1990s and remains widely used in financial services AI. It measures the shift in a feature distribution by bucketing values and comparing bucket proportions between reference and production populations. A PSI below 0.1 is generally considered stable; between 0.1 and 0.2 warrants investigation; above 0.2 indicates significant shift. PSI's interpretability, and the fact that it maps directly to regulatory reporting conventions, explains its persistence despite more statistically principled alternatives.
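The bucket comparison can be written out directly. The sketch below uses the common formulation PSI = sum over buckets of (p - r) * ln(p / r), where r and p are the reference and production proportions in each bucket. The bucket edges, the `eps` floor for empty buckets, and the age values are assumptions chosen for illustration.

```python
import math


def psi(reference, production, edges, eps=1e-6):
    """Population Stability Index over buckets defined by `edges`.

    `edges` are sorted interior cut points, giving len(edges) + 1 buckets.
    `eps` is a floor on bucket proportions that avoids log(0).
    """
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            bucket = sum(1 for e in edges if v >= e)
            counts[bucket] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    prod_p = proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref_p, prod_p))


# Illustrative applicant ages: production skews much younger than training.
training_ages = [22, 27, 31, 36, 42, 44, 48, 53, 57, 60]
production_ages = [18, 19, 20, 21, 22, 22, 25, 33, 41, 50]
age_psi = psi(training_ages, production_ages, edges=[30, 45])
```

With these values the result is well above the 0.2 level, so the feature would be flagged as significantly shifted. The choice of bucket edges affects the number, which is one reason PSI thresholds are conventions and not statistical guarantees.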
The Jensen-Shannon (JS) divergence is a symmetric, bounded version of KL divergence that measures how much two probability distributions differ. Unlike PSI, its definition does not depend on a bucketing scheme, although estimating it from finite samples of a continuous feature still requires a histogram or density estimate. It is increasingly used in ML monitoring libraries including Evidently AI and WhyLabs, which both launched production-ready open-source tooling in 2021.
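A minimal sketch of the computation follows, assuming the two distributions are already available as discrete probability vectors over the same categories (for example, category frequencies). The function names and probabilities are illustrative. Using a base-2 logarithm bounds the result between 0 and 1.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Illustrative category frequencies for a payment-method feature.
reference_freq = [0.70, 0.20, 0.10]
production_freq = [0.40, 0.25, 0.35]
js = js_divergence(reference_freq, production_freq)
```

Because the midpoint distribution is nonzero wherever either input is nonzero, the computation never divides by zero, which is the practical advantage over raw KL divergence.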
KS tests work well for continuous unimodal features. PSI is preferred in regulated industries due to its interpretability and established threshold conventions. JS divergence handles multimodal distributions and categorical variables more gracefully. In practice, most production monitoring systems run multiple tests and alert when any threshold is crossed, accepting the cost of some false alarms in exchange for earlier detection.
Monitoring each feature independently misses an important failure mode: the joint distribution of features can shift even when each individual feature looks stable. A fraud model trained on transaction data where high-value purchases and late-night activity are correlated may behave poorly in a world where that correlation has broken down, even if the marginal distributions of transaction value and time-of-day look unchanged.
Detecting multivariate drift requires different tools. The Maximum Mean Discrepancy (MMD) test operates in a kernel-induced feature space and can detect shifts in the joint distribution without specifying which features are involved. It is computationally expensive but powerful for high-dimensional data. Domain classifier drift detection trains a binary classifier to distinguish reference from production data: if the classifier achieves significantly above-chance accuracy, the distributions have diverged in some detectable way. Google's TFX platform and Arize AI both implement variants of this approach.
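The sketch below computes a biased estimate of squared MMD with an RBF kernel, on a toy example where both marginals are unchanged but the correlation between two binary features has flipped. The kernel choice, the `gamma` value, and the data are assumptions for illustration; a real test would also need a significance threshold, commonly obtained by permutation.

```python
import math


def rbf(x, y, gamma):
    """RBF kernel between two equal-length feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X and Y."""
    def mean_kernel(A, B):
        total = sum(rbf(a, b, gamma) for a in A for b in B)
        return total / (len(A) * len(B))

    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2 * mean_kernel(X, Y)


# Features: (high_value, late_night). Each feature is 1 half the time in
# both samples, so per-feature monitoring sees no change.
reference = [(0, 0), (1, 1), (0, 0), (1, 1)]   # features move together
production = [(0, 1), (1, 0), (0, 1), (1, 0)]  # correlation has flipped
joint_shift = mmd_squared(reference, production)
```

Here the estimate is clearly positive even though every marginal distribution is identical. The pairwise kernel sums are quadratic in sample size, which is the source of the computational cost noted above.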
A more recent approach, gaining adoption since 2022, uses learned data representations: pass inputs through an encoder (often the early layers of the production model itself) and monitor drift in the embedding space rather than the raw feature space. This detects semantically meaningful shifts even in high-dimensional inputs like text and images where raw feature statistics are uninformative.
Arize AI's 2023 documentation describes production deployments where embedding drift detection caught model degradation 3 to 4 weeks before accuracy metrics declined, by monitoring the centroid and spread of embedding clusters for incoming production data. For LLMs processing free-form text, this is currently the most practical approach to input drift detection at scale.
Monitoring the distribution of model outputs is distinct from monitoring inputs, and in many ways more directly actionable. If a binary classifier begins predicting the positive class at 35% instead of its historical 18%, that shift is a strong signal that something has changed, whether in the inputs, the model, or the deployment environment. Output drift monitoring does not require access to ground truth labels and can be implemented immediately upon deployment.
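Centroid and spread monitoring can be sketched in a few lines. This is an illustration of the general idea, not Arize AI's implementation; the function names, the two-dimensional vectors standing in for real embeddings, and the choice of mean Euclidean distance as the spread measure are assumptions.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def spread(vectors, center):
    """Mean Euclidean distance of the vectors from `center`."""
    return sum(math.dist(v, center) for v in vectors) / len(vectors)


def embedding_drift(reference, production):
    """Compare centroid position and spread of two embedding samples."""
    ref_c = centroid(reference)
    prod_c = centroid(production)
    return {
        "centroid_shift": math.dist(ref_c, prod_c),
        # Assumes the reference sample is not a single repeated point.
        "spread_ratio": spread(production, prod_c) / spread(reference, ref_c),
    }


# Toy 2-D "embeddings": production has the same shape but has moved.
ref_embeddings = [[0, 0], [2, 0], [0, 2], [2, 2]]
prod_embeddings = [[3, 4], [5, 4], [3, 6], [5, 6]]
drift = embedding_drift(ref_embeddings, prod_embeddings)
```

A large centroid shift with a spread ratio near 1 suggests the whole population has moved, while a spread ratio far from 1 suggests the inputs have become more or less varied.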
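One simple way to decide whether such a shift is more than sampling noise is a z-score of the observed positive rate against the historical rate. The sketch below assumes a window of 200 binary predictions and a normal approximation to the binomial; the function name and window size are illustrative.

```python
import math


def positive_rate_z(predictions, reference_rate):
    """z-score of the observed positive rate against a reference rate.

    `predictions` is a list of 0/1 predicted labels from a recent window.
    Uses the normal approximation to the binomial, so it is only
    reasonable for windows of at least a few dozen predictions.
    """
    n = len(predictions)
    observed = sum(predictions) / n
    standard_error = math.sqrt(reference_rate * (1 - reference_rate) / n)
    return (observed - reference_rate) / standard_error


# 70 positives out of 200 is a 35% positive rate, against a historical 18%.
window = [1] * 70 + [0] * 130
z = positive_rate_z(window, reference_rate=0.18)
```

For this window the z-score is above 6, far outside what sampling variation alone would produce, and no ground-truth labels were needed to compute it.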
For regression models, tracking the mean, variance, and percentile distribution of predicted values against a reference window is straightforward. For classification models, tracking class probability distributions and predicted class proportions provides equivalent signal. For ranking models, tracking the distribution of scores at the top-k positions detects the most operationally important shifts.
The limitation of output drift monitoring is that it is a lagging indicator relative to input drift: by the time prediction distributions shift meaningfully, the model has already been producing degraded outputs. It is most useful as a confirmation signal paired with upstream input monitoring, and as a canary metric for operational teams who do not have direct access to feature-level telemetry.
Detection without response is just surveillance. A production ML system needs a defined protocol for what happens when drift is detected, calibrated to the severity of the shift and the business impact of the model's decisions.
At the first tier (minor drift within warning thresholds), the appropriate response is typically increased logging fidelity and human review of a sample of recent predictions. At the second tier (drift exceeding action thresholds), common responses include model rollback to an earlier version, switching to a more conservative fallback model (often a simpler, more robust baseline), or temporarily routing traffic to human reviewers. At the third tier (severe or rapid drift), the model may need to be taken offline entirely until retraining on current data is complete.
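A tiered protocol of this kind can be encoded as a small lookup. In the sketch below the 0.1 and 0.2 cut-offs follow the PSI conventions described earlier in this module, while the 0.5 cut-off for the third tier, the tier names, and the response strings are assumptions that a team would calibrate to its own business impact.

```python
from enum import Enum


class Tier(Enum):
    NONE = 0
    WARN = 1
    ACT = 2
    SEVERE = 3


# Responses paraphrase the three tiers described above.
RESPONSES = {
    Tier.NONE: "no action",
    Tier.WARN: "increase logging fidelity; review a sample of predictions",
    Tier.ACT: "roll back, switch to fallback model, or route to reviewers",
    Tier.SEVERE: "take model offline until retraining completes",
}


def classify_drift(psi_value, warn=0.1, act=0.2, severe=0.5):
    """Map a PSI value to a response tier (severe cut-off is hypothetical)."""
    if psi_value >= severe:
        return Tier.SEVERE
    if psi_value >= act:
        return Tier.ACT
    if psi_value >= warn:
        return Tier.WARN
    return Tier.NONE


tier = classify_drift(0.27)
planned_response = RESPONSES[tier]
```

Keeping thresholds and responses in data instead of scattered conditionals makes the protocol reviewable by people who do not read the monitoring code.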
Netflix's 2022 engineering blog described a tiered response system for their recommendation models that automated the first two tiers while requiring human sign-off for the third. The key insight from that system: automating the response to minor drift actually made teams more willing to set sensitive detection thresholds, because they did not face the prospect of an alert triggering a manual escalation at 2am for every minor fluctuation.
Work through drift detection method selection with the AI tutor. You'll be given a deployment scenario and asked to choose appropriate statistical tests, justify your choices, and design a response protocol. Focus on matching the method to the data type, ground truth latency, and business stakes.
In 2021, researchers at MIT published a study examining AI diagnostic tools deployed in hospitals across the United States and found that many systems lacked any structured logging of their inputs and outputs. When clinicians reported that a sepsis prediction tool seemed to be performing differently on night-shift patients, there was no data with which to investigate the claim. The tool had been deployed with infrastructure monitoring (uptime, response time) but without any record of what features went in or what scores came out. The investigation required reconstructing a partial picture from EHR audit logs that had not been designed for this purpose, and the finding, that the model's performance differed significantly between day and night shifts due to systematic differences in which nursing notes were entered by the time of prediction, took over a year to confirm. Logging, designed properly at deployment, would have made that finding available in weeks.
ML logging decisions involve four fundamental trade-offs: completeness versus storage cost, granularity versus privacy, immediacy versus throughput impact, and structure versus flexibility. Getting these right requires thinking forward, imagining the investigations you might need to conduct, and not backward from what is easy to implement today.
At minimum, a production ML system should log: the timestamp of every inference, a unique prediction ID, the model version serving the request, and the model's output. This minimal record enables volume monitoring, version attribution, and output distribution tracking. It is not sufficient for diagnosing most production problems.
The next tier adds input feature values, or a hash or summary of them if full feature logging is cost-prohibitive or privacy-constrained. This enables input drift detection, feature importance debugging, and the ability to replay historical inferences against a new model version. Uber's Michelangelo, as noted in Lesson 1, built this into the serving layer from day one; teams that had not done so found themselves unable to diagnose model behavior during the pandemic distribution shifts of 2020.
The highest tier adds contextual metadata: user cohort, geographic region, product surface, A/B experiment assignment, and any contextual signals that might stratify model behavior. This data is often more valuable than the features themselves for diagnosing population-level performance differences: the night-shift versus day-shift discrepancy in the MIT hospital study would have been detectable from a single "shift" metadata field.
Design your logging so that you can, at any future point, take a historical window of production traffic and re-run it through any model version. This requires logged inputs (not just hashes), model version tags, and immutable storage. Teams that have this capability can validate a new model against real production traffic before deploying it, effectively running a simulation of the deployment on real data.
High-volume serving systems, such as a recommendation model processing tens of millions of requests per day, cannot log every inference in full fidelity without incurring prohibitive storage costs. Sampling is necessary, but the sampling strategy matters enormously for what the logs can tell you.
Random sampling at a fixed rate (e.g., 1% of all requests) is the simplest approach and produces a statistically representative sample for distribution monitoring. Its weakness is that rare but important events (edge cases, high-confidence wrong predictions, adversarial inputs) are underrepresented proportionally to their rarity.
Stratified sampling ensures that important subpopulations (demographic groups, product segments, geographic regions) are sampled at rates sufficient to support disaggregated monitoring. This is increasingly important for AI fairness monitoring, where an overall sample might look healthy while a minority subgroup experiences severe degradation.
Anomaly-based sampling preferentially logs predictions that are unusual in some way: high model uncertainty, inputs far from the training distribution, outputs near decision boundaries. This disproportionately captures the cases that are most likely to be problematic and most informative for model improvement. Google's production ML systems described in their 2018 "Rules of Machine Learning" paper use variants of this approach.
Systematic high-value sampling logs 100% of predictions for high-stakes or high-value cases (large transactions, high-acuity medical cases, predictions that will be acted on immediately) while sampling more lightly from routine cases. This prioritizes logging fidelity where the cost of a wrong prediction is highest.
Full-fidelity feature logging is often impossible or inadvisable. Medical features are protected under HIPAA; financial features are regulated under GDPR's right to explanation provisions in the EU; biometric features may be restricted by state laws like Illinois's BIPA. These constraints are not obstacles to observability; they are design parameters.
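The four strategies above can be combined into a single logging decision. The sketch below is one possible ordering, assuming a binary classifier with a 0.5 decision boundary; the field names (`amount`, `score`, `segment`), the thresholds, and the rates are all hypothetical.

```python
import random


def should_log(record, rng, base_rate=0.01, segment_rates=None,
               high_value_amount=10_000, boundary_band=0.05):
    """Return (log_it, reason) for one prediction record."""
    # Systematic high-value sampling: always keep high-stakes cases.
    if record["amount"] >= high_value_amount:
        return True, "high_value"
    # Anomaly-based sampling: keep scores near the decision boundary.
    if abs(record["score"] - 0.5) <= boundary_band:
        return True, "near_boundary"
    # Stratified sampling: per-segment rates for important subpopulations.
    segment_rates = segment_rates or {}
    if record["segment"] in segment_rates:
        return rng.random() < segment_rates[record["segment"]], "stratified"
    # Random sampling at a fixed base rate for everything else.
    return rng.random() < base_rate, "random"


rng = random.Random(0)
high_value = should_log(
    {"amount": 25_000, "score": 0.9, "segment": "core"}, rng)
uncertain = should_log(
    {"amount": 40, "score": 0.52, "segment": "core"}, rng)
minority = should_log(
    {"amount": 40, "score": 0.9, "segment": "new_market"}, rng,
    segment_rates={"new_market": 1.0})
```

Storing the reason alongside each logged record matters: a log assembled this way is deliberately not representative, so distribution monitoring needs to reweight or filter by sampling reason.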
Common privacy-preserving logging approaches include: logging feature statistics (mean, variance, percentiles) rather than individual values; logging hashed or tokenized versions of sensitive identifiers; logging in an encrypted form that requires key access to read; and separating the feature log from the prediction log with different access controls so that the union of the two tables is only joinable by authorized analysts.
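Two of these approaches, tokenized identifiers and feature summaries, can be sketched with the standard library. The function names and example values are hypothetical, and a keyed hash is pseudonymization, not anonymization: whether it satisfies a given regulation is a legal question this sketch does not answer.

```python
import hashlib
import hmac
import statistics


def pseudonymize(identifier, secret_key):
    """Keyed SHA-256 hash of an identifier.

    The same identifier and key always give the same token, so records
    remain joinable, but the raw identifier never reaches the log.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def summarize_feature(values):
    """Log summary statistics for a batch, not the individual values."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),
        "p50": statistics.median(values),
    }


# Hypothetical key and values; a real key would come from a secrets manager.
token = pseudonymize("patient-123", secret_key=b"example-key")
lactate_summary = summarize_feature([1.0, 2.0, 3.0, 4.0])
```

Using a keyed hash instead of a plain one means that someone who obtains the log cannot confirm a guess about an identifier without also holding the key.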
The GDPR's Article 22 provisions on automated decision-making have an important interaction with ML logging: if your system makes decisions with significant legal or similarly significant effects on individuals using solely automated processing, you may be required to log the basis of those decisions in a form that supports human review. This is a regulatory requirement for observability, not a technical choice.
The most common practical failure in ML logging is schema drift: the feature schema at inference time diverges from the schema at training time, and no one catches it because the serving system accepts the request and the log records whatever came in. Enforcing a schema contract at the serving layer, by rejecting or flagging requests with unexpected feature structures, is a critical observability primitive that is frequently omitted in early-stage ML systems.
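A schema contract can be as small as a dictionary of expected feature names and types checked before inference. The sketch below is a minimal illustration; the schema contents and function name are hypothetical, and production systems often use a dedicated validation library instead.

```python
# Hypothetical training-time schema: feature name -> expected Python type.
EXPECTED_SCHEMA = {
    "amount": float,
    "merchant_category": str,
    "hour_of_day": int,
}


def validate_features(features, schema=EXPECTED_SCHEMA):
    """Return a list of schema violations; an empty list means valid."""
    problems = []
    for name, expected_type in schema.items():
        if name not in features:
            problems.append(f"missing feature: {name}")
        elif not isinstance(features[name], expected_type):
            actual = type(features[name]).__name__
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {actual}")
    for name in features:
        if name not in schema:
            problems.append(f"unexpected feature: {name}")
    return problems


ok = validate_features(
    {"amount": 12.5, "merchant_category": "grocery", "hour_of_day": 23})
# An upstream change started sending amount as a string and dropped a field.
violations = validate_features(
    {"amount": "12.50", "merchant_category": "grocery", "extra": 1})
```

Whether a violation rejects the request or only flags it is a product decision; either way, the violation list belongs in the prediction log so schema drift shows up in monitoring.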
Unstructured logs are nearly useless for ML diagnostics at scale. A production ML logging schema should be structured (JSON or Parquet, not free text), versioned (include a schema version field), and designed with join keys in mind. The prediction ID must be stable and globally unique. If you use a feature store, log the feature retrieval query alongside the feature values so you can diagnose feature serving errors independently of model errors. If you serve multiple model versions simultaneously via A/B or canary routing, the model version and routing decision must be logged with every prediction.
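The properties listed above can be made concrete with a small record type. This is a sketch under assumptions: the field names, the schema version string, and the example values are hypothetical, and JSON is used here only because it is one of the structured formats the text mentions.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class PredictionRecord:
    model_version: str       # which model produced the output
    routing_decision: str    # e.g. "baseline" or "canary"
    features: dict           # inputs, or a privacy-preserving summary
    output: float
    context: dict = field(default_factory=dict)  # cohort, region, etc.
    schema_version: str = "1.0"
    # Globally unique join key, generated once and never rewritten.
    prediction_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self):
        """Serialize to one structured log line."""
        return json.dumps(asdict(self), sort_keys=True)


record = PredictionRecord(
    model_version="fraud-v7",
    routing_decision="canary",
    features={"amount": 12.5},
    output=0.03,
    context={"region": "eu-west"},
)
log_line = record.to_json()
```

Carrying `schema_version` in every record lets downstream jobs handle old and new log layouts side by side when the schema changes.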
MLflow, introduced by Databricks in 2018, was among the first open-source tools to enforce structured experiment and run logging for ML workflows. Its influence on production logging practices, particularly the emphasis on logging hyperparameters, metrics, and artifacts as first-class objects, is visible in Vertex AI Experiments, Azure ML, and SageMaker Experiments, all of which adopted similar schemas.
Work with the AI tutor to design a logging schema for a specific AI deployment. You'll need to decide what fields to include, what sampling strategy to apply, and how to handle privacy constraints. The tutor will challenge your design and help you think through failure modes.
In March 2023, GitHub Copilot's suggestion quality degraded noticeably for a subset of users programming in Python. GitHub's internal telemetry detected an uptick in user rejection rates (the fraction of Copilot suggestions that users dismissed without accepting) within hours. The signal chain worked as intended: the metric fired, an on-call engineer reviewed the degradation, identified that a recent infrastructure change had subtly altered the tokenization path for a specific Python version, and the rollback was complete within a working day. The incident was documented publicly in GitHub's transparency report and cited as an example of feedback loop design working under real conditions. The critical element was not just that the monitoring fired; it was that the alert was connected to a runbook, routed to someone with authority to act, and that the action was clear.
Most production ML incidents are not resolved this cleanly. The difference is usually not in the detection quality; it is in the response infrastructure that the detection feeds into.
Alert fatigue is a documented failure mode in both traditional SRE and ML operations. When monitoring systems generate too many alerts, or alerts with insufficient context, on-call engineers begin ignoring them. A 2023 survey by PagerDuty found that 62% of on-call engineers reported receiving alerts they considered uninformative at least weekly; 31% reported actively suppressing alert channels during periods of high volume. In ML systems, where many alerts are statistical in nature and do not correspond to discrete errors, this problem is more acute.
Effective ML alerts share several properties. They are actionable: each alert has a defined set of actions that can be taken in response, documented in a runbook. They are routed: they reach the person or team with the authority and context to act, not just the person who happens to be on call for infrastructure. They are contextualized: they include not just the raw metric value but the baseline, the trend, the affected model version, and the affected population segment. They are tiered: severity levels distinguish between "investigate during business hours" and "wake someone up now."
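These four properties can be reflected directly in the alert payload. The sketch below is illustrative only: the ratio-based severity rule, the team names, and the runbook path are hypothetical placeholders, not a convention from any particular alerting tool.

```python
from dataclasses import dataclass

# Hypothetical routing table: severity -> owning team.
OWNERS = {"page": "ml-oncall", "ticket": "model-owners"}


@dataclass
class DriftAlert:
    metric: str
    value: float
    baseline: float         # contextualized: baseline travels with the value
    model_version: str
    segment: str
    severity: str           # tiered: "ticket" or "page"
    owner: str              # routed: team with authority to act
    runbook: str            # actionable: where the response is documented


def build_alert(metric, value, baseline, model_version, segment,
                ticket_ratio=1.25, page_ratio=2.0):
    """Return a DriftAlert, or None if the change is below both tiers."""
    ratio = value / baseline
    if ratio >= page_ratio:
        severity = "page"
    elif ratio >= ticket_ratio:
        severity = "ticket"
    else:
        return None
    return DriftAlert(metric, value, baseline, model_version, segment,
                      severity, OWNERS[severity],
                      runbook=f"runbooks/{metric}.md")  # hypothetical path


alert = build_alert("positive_rate", 0.35, 0.18, "fraud-v7", "eu-west")
```

Returning `None` for small changes is what keeps the channel quiet enough that the alerts which do fire are read.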
The Prometheus and Alertmanager stack, widely used for infrastructure monitoring, supports all of these properties and has been extended for ML monitoring by tools like Evidently AI (which generates drift reports that can be exported as Prometheus metrics) and Grafana ML, which added ML-specific alert condition types in 2022.
A runbook is a documented procedure for responding to a specific alert: what the metric means, what values are concerning, what to investigate first, what actions are available, and who to escalate to if those actions are insufficient. ML runbooks differ from infrastructure runbooks in that they must address statistical ambiguity: an alert does not mean the system is broken, it means something has changed that warrants human judgment. Documenting that nuance is essential for teams where not everyone is a data scientist.
For high-stakes AI systems (medical devices, credit decisions, legal document analysis), monitoring is not sufficient without human review capacity. The question is how to design review pipelines that scale with prediction volume and provide actionable signal back into the model development loop.
The standard architecture involves a queue-based review system: predictions flagged by monitoring rules (high uncertainty, anomalous inputs, specific content categories) are routed to a review queue where human reviewers label the output as correct, incorrect, or ambiguous. These labels serve two purposes simultaneously: they generate immediate quality signal for the monitoring dashboard, and they accumulate as training data for the next model version.
Scale Control, acquired by Scale AI in 2021, and Labelbox both offer production review pipeline tooling that integrates this dual-purpose design. Google's data labeling service, launched as part of Vertex AI in 2021, builds the training data feedback loop directly into the annotation interface: reviewed production predictions can be added to the training dataset in one step.
The most sophisticated observability infrastructure is wasted if its outputs do not feed back into model improvement. Closing the feedback loop means building a pathway from production observations (drift detections, error analyses, human review labels) to model retraining and redeployment.
The simplest feedback loop is manual: data scientists review monitoring dashboards on a cadence, decide when retraining is warranted, trigger a retraining job, evaluate the new model offline, and deploy if it passes. This loop typically operates on weeks-to-months timescales and is appropriate for models where the deployment cost is high and the production distribution changes slowly.
Continuous training pipelines automate portions of this loop. Google's TFX (TensorFlow Extended) framework, described in a 2017 KDD paper, pioneered the automated retraining pipeline in production: new data meeting quality criteria is automatically added to the training set, a retraining job is triggered on a schedule or by a drift detection event, the new model is evaluated against held-out production data, and if it passes a comparison threshold against the deployed model, it enters a staged rollout. Meta described a similar architecture for their ad ranking models in a 2021 engineering blog post, noting that continuous training reduced model staleness from weeks to hours for their highest-velocity use cases.
Continuous retraining on production data can silently encode feedback loops that amplify model errors over time. If a model makes biased decisions that affect which users generate training data, as documented in a 2019 analysis of YouTube recommendation systems, retraining on that data reinforces the bias. Observability for continuous training systems must include monitoring of the training data distribution, not just the serving distribution.
Two deployment patterns work in direct support of observability by structuring how new model versions are introduced to production traffic.
In a canary deployment, a new model version receives a small percentage of live traffic, typically 1 to 5%, while the current version serves the rest. Monitoring compares the canary's behavior against the baseline in real time. If the canary's metrics degrade or diverge, traffic is routed back to the baseline before the impact is visible to most users. Canary deployments require that your monitoring infrastructure can disaggregate metrics by model version, which is why model version logging, discussed in Lesson 3, is a prerequisite for this pattern.
In shadow mode (also called dark launch or mirror testing), the new model receives a copy of live traffic and produces predictions, but those predictions are not served to users. Instead, they are logged and compared against the production model's predictions offline. Shadow mode enables testing a new model against real production traffic without any user-facing risk. It is particularly valuable for LLM applications where offline evaluation benchmarks are known to be poor predictors of production behavior, a finding documented by researchers at Stanford and DeepMind in separate 2023 papers on the gap between benchmark and production performance of language models.
Technical infrastructure for observability is necessary but not sufficient. The organizational practices around it (who reviews dashboards, how incidents are documented, whether post-mortems generate systemic improvements) determine whether monitoring creates learning or just records failure.
The blameless post-mortem practice, imported into ML teams from SRE culture, is particularly valuable. When a production ML incident occurs, the post-mortem focuses on what system properties made the incident possible and what changes would prevent recurrence, not on who made a mistake. The output is a set of action items that improve the system: new monitoring signals, tightened thresholds, improved runbooks, additional human review checkpoints. Google's Site Reliability Engineering book (Beyer et al., 2016) codified this practice, and it has been adopted by ML platform teams at Airbnb, Spotify, and Twitter (now X) as they built out production ML operations functions.
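The comparison step depends on grouping logged records by model version. The sketch below assumes each record carries a `model_version` and a binary `rejected` flag as a proxy quality signal; those field names, the minimum-traffic guard, and the 2-point tolerance are hypothetical.

```python
from collections import defaultdict


def rejection_rate_by_version(records):
    """Fraction of rejected predictions for each model version."""
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for r in records:
        totals[r["model_version"]] += 1
        rejected[r["model_version"]] += int(r["rejected"])
    return {version: rejected[version] / totals[version] for version in totals}


def canary_decision(records, baseline, canary,
                    max_increase=0.02, min_requests=100):
    """Return "wait", "rollback", or "continue" for the canary version."""
    canary_count = sum(1 for r in records if r["model_version"] == canary)
    if canary_count < min_requests:
        return "wait"  # too little canary traffic to judge
    rates = rejection_rate_by_version(records)
    if rates[canary] - rates[baseline] > max_increase:
        return "rollback"
    return "continue"


def make_records(version, total, rejected):
    """Helper that builds synthetic log records for the example."""
    return [{"model_version": version, "rejected": i < rejected}
            for i in range(total)]


# Baseline rejects 2% of suggestions; the canary rejects 10%.
traffic = make_records("v1", 950, 19) + make_records("v2", 200, 20)
decision = canary_decision(traffic, baseline="v1", canary="v2")
```

The minimum-traffic guard is there because a canary at 1 to 5% of traffic accumulates evidence slowly, and acting on a handful of requests would produce noisy rollbacks.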
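A minimal sketch of the shadow pattern follows, assuming both models are plain callables that return a numeric score. The function name, the tolerance, and the toy models are hypothetical; a real serving system would typically run the shadow call asynchronously so it cannot add latency to the served response.

```python
def shadow_compare(requests, production_model, shadow_model, tolerance=0.05):
    """Run both models on the same requests and log their disagreement.

    Only the production model's output would be returned to users; the
    shadow output is recorded for offline comparison.
    """
    comparisons = []
    for request in requests:
        served = production_model(request)   # this is what users receive
        shadow = shadow_model(request)       # logged only, never served
        comparisons.append({
            "request": request,
            "served": served,
            "shadow": shadow,
            "disagree": abs(served - shadow) > tolerance,
        })
    rate = sum(c["disagree"] for c in comparisons) / len(comparisons)
    return comparisons, rate


# Toy models: the candidate scores larger inputs very differently.
def current_model(x):
    return 0.5


def candidate_model(x):
    return 0.5 if x < 3 else 0.9


shadow_log, disagreement_rate = shadow_compare(
    [1, 2, 3, 4], current_model, candidate_model)
```

The disagreement rate alone does not say which model is right; the logged pairs are the starting point for human review or for joining against outcomes once labels arrive.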
Work with the AI tutor to design an alerting and escalation framework for a specific AI deployment. You'll define alert conditions, severity tiers, runbook content, and feedback loop design. The tutor will test your design against realistic incident scenarios.