Deploying and Monitoring AI

1. What is the "replay principle" in ML logging design?

Correct.

Review Lesson 3, the Replay Principle gold callout.

2. What is "model cascading" in cost optimization?

Correct. Cascading means trying the cheap model first; if its response passes a quality gate, return it. If not, escalate to a more powerful model. This achieves near-frontier quality at near-budget-model cost for most requests.

Cascading: cheap model first → quality check → escalate to powerful model only on failure. Tools like RouteLLM implement this with configurable quality thresholds.
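
The cascade described above can be sketched in a few lines. This is a minimal illustration, not RouteLLM's API: `cascade`, the two model stand-ins, and `non_empty_gate` are hypothetical names, and a real quality gate would usually be a classifier or scoring model rather than a length check.

```python
from typing import Callable, Tuple


def cascade(
    prompt: str,
    cheap_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    quality_gate: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Return (response, tier). Escalate only when the cheap draft fails the gate."""
    draft = cheap_model(prompt)
    if quality_gate(prompt, draft):
        return draft, "cheap"
    return strong_model(prompt), "strong"


# Hypothetical stand-ins for real model calls (assumptions for illustration).
def cheap_model(prompt: str) -> str:
    return "42" if "2" in prompt else ""


def strong_model(prompt: str) -> str:
    return "detailed answer"


def non_empty_gate(prompt: str, response: str) -> bool:
    # Toy gate: accept any non-empty response.
    return len(response.strip()) > 0


result_easy = cascade("What is 40 + 2?", cheap_model, strong_model, non_empty_gate)
result_hard = cascade("Explain drift", cheap_model, strong_model, non_empty_gate)
```

Note that an escalated request pays for both calls, so the savings depend on how often the cheap model passes the gate.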

3. Which type of drift describes a change in the real-world relationship between inputs and correct outputs, making the model's learned mapping stale?

Correct. Concept drift is when the correct answer changes — the world has moved but the model hasn't. Common in fraud detection and content moderation as adversaries and content evolve.

This is concept drift: the underlying relationship the model learned has changed. Data drift is a change in input distributions; prediction drift is observable output distribution change. Concept drift is the deeper change in what "correct" means.

4. Netflix's interleaving technique achieves statistical significance with ~100× less traffic than A/B testing because:

Correct. Within-user comparison eliminates the between-user variance that dominates A/B tests, allowing the same statistical power with far less traffic.

The variance reduction comes from within-user comparison: the same user on the same query evaluates both models, so user-level variation cancels out. This requires ~100× less traffic than between-user A/B testing.

5. A mental health chatbot providing self-harm instructions to a vulnerable user would typically be classified as which severity tier?

Correct. Direct physical harm to a user is the archetypal SEV-1 event: immediate escalation, executive notification, and all-hands containment response.

This is SEV-1. A system actively providing instructions that could lead to user death or serious injury requires the most urgent possible response, not a next-business-day triage.

6. A span in distributed tracing is best defined as:

Correct. A span is a named, timed unit of work — the atomic building block of a trace.

A span is the atomic timed unit. A trace is the full tree of spans. Logs and metrics are separate signals.
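
To make the span/trace relationship concrete, here is a minimal sketch of a span as a data structure. The `Span` class and its field names are illustrative assumptions, not the API of any tracing library; production systems would use an SDK such as OpenTelemetry.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """A named, timed unit of work. Spans sharing a trace_id form one trace."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None  # None marks the root span of the trace
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.monotonic()

    @property
    def duration(self) -> Optional[float]:
        # Duration is undefined until the span has been finished.
        return None if self.end is None else self.end - self.start


# A two-span trace: the child points at its parent via parent_id.
root = Span("handle_request", trace_id=uuid.uuid4().hex)
child = Span("llm_call", trace_id=root.trace_id, parent_id=root.span_id)
child.finish()
root.finish()
```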

7. The AIID analysis found a median AI incident detection lag of 11 days. What is the primary reason AI failures are harder to detect than traditional software outages?

Correct. Probabilistic degradation — the model getting worse gradually rather than stopping — is the fundamental reason AI incidents evade detection much longer than binary software failures.

The root issue is probabilistic degradation: a model doesn't stop working, it produces worse outputs over time. Without pre-defined statistical thresholds, that gradual decline is invisible until user harm is widespread.

8. What cache hit rate does GPTCache report in customer-support applications using semantic caching?

Correct. GPTCache reports 30–40% cache hit rates in customer-support applications where users ask semantically similar questions repeatedly. Each hit eliminates one API call entirely.

GPTCache reports 30–40% semantic cache hit rates in customer-support use cases — meaning 30–40% of queries can be answered without any API call, purely from cached responses to similar past queries.
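
The idea behind semantic caching can be sketched as follows. This is a toy illustration and not GPTCache's API: the bag-of-words `embed` function, the `SemanticCache` class, and the 0.8 threshold are all assumptions; a real system would use a learned embedding model and a vector index.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A stand-in for a real embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return the cached response for the most similar past query, if close enough."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: the caller falls through to the API


cache = SemanticCache()
cache.put("how do i reset my password", "Use the reset link on the login page.")
hit = cache.get("how do i reset my password please")
miss = cache.get("what is your refund policy")
```

The threshold trades hit rate against the risk of returning a cached answer to a question that only looks similar.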

9. Alert fatigue is best mitigated by:

Correct. Actionable alerts with clear owners and playbooks are the core solution. Alert quality — not quantity — determines whether engineers respond meaningfully to each page.

The solution is alert quality: every alert should be actionable, have an owner, and have a response playbook. Reducing metrics risks missing real failures; raising thresholds risks late detection; shared inboxes have no accountability.

10. The Apple Card credit algorithm case demonstrates primarily that:

Correct. The algorithm likely had healthy aggregate metrics — the 20× credit limit disparity was only visible through gender-disaggregated analysis of outcomes.

The Apple Card case shows that aggregate metrics gave false confidence while a severe gender disparity existed — only visible through disaggregated evaluation of outcomes by gender.
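
A small sketch of disaggregated evaluation shows how an aggregate can hide a gap between groups. The records below are synthetic numbers invented for illustration, not data from the Apple Card case, and `mean_by_group` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mean_by_group(records: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average the outcome separately for each group label."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for group, value in records:
        buckets[group].append(value)
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}


# Synthetic (group, credit_limit) pairs, assumed purely for illustration.
records = [("A", 20000.0), ("A", 18000.0), ("B", 1000.0), ("B", 1000.0)]

overall_mean = sum(value for _, value in records) / len(records)
group_means = mean_by_group(records)
disparity_ratio = max(group_means.values()) / min(group_means.values())
```

Here the overall mean looks unremarkable on its own, while the per-group means differ by a large factor.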

11. Why is a fraud detection model's output distribution useful as a monitoring signal when ground truth labels are delayed?

Correct. Output distribution monitoring is a leading indicator of model degradation — available immediately, without waiting for labels.

Output distribution shifts are early warning signals — they often appear before labeled accuracy metrics can detect degradation, making them invaluable when labels are delayed.
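
One common way to watch an output distribution without labels is the Population Stability Index (PSI) between a baseline window and the current window of model scores. The sketch below is one possible implementation under stated assumptions: scores lie in [0, 1], ten equal-width bins are used, and the 0.2 alert threshold is a widely used rule of thumb rather than anything specified in the lesson.

```python
import math
from typing import List


def psi(baseline: List[float], current: List[float], bins: int = 10,
        eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of scores in [0, 1]."""

    def histogram(scores: List[float]) -> List[float]:
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1
        # Floor empty bins at eps so the log term stays finite.
        return [max(c / len(scores), eps) for c in counts]

    p, q = histogram(baseline), histogram(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Synthetic fraud scores: a uniform baseline, then a window piled up near 1.0.
baseline_scores = [i / 100 for i in range(100)]
shifted_scores = [0.9 + i / 1000 for i in range(100)]

psi_same = psi(baseline_scores, baseline_scores)
psi_shifted = psi(baseline_scores, shifted_scores)
ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff
should_alert = psi_shifted > ALERT_THRESHOLD
```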

12. In a layered budget architecture, what does Layer 1 (application layer) do that vendor-side controls cannot?

Correct. Only the application layer can prevent cost before it is incurred — by rejecting a request without ever calling the API. Vendor controls only activate after the API call has been initiated.

The application layer is unique in acting before the API call. If a per-user token budget is exhausted, the request never reaches the vendor — so zero cost is incurred. Vendor controls can only block or throttle after they receive the call.
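
A minimal sketch of an application-layer guard follows. `TokenBudget`, `BudgetExceeded`, and `fake_api` are hypothetical names, and the pre-call token estimate is assumed to come from the caller; the point is only that a rejected request never invokes the vendor call.

```python
from typing import Callable, Dict, Tuple


class BudgetExceeded(Exception):
    """Raised before any API call is made, so no vendor cost is incurred."""


class TokenBudget:
    def __init__(self, limit_per_user: int):
        self.limit = limit_per_user
        self.used: Dict[str, int] = {}

    def guarded_call(self, user_id: str, estimated_tokens: int,
                     call_api: Callable[[], Tuple[str, int]]) -> str:
        spent = self.used.get(user_id, 0)
        if spent + estimated_tokens > self.limit:
            # Reject here: call_api is never invoked.
            raise BudgetExceeded(f"user {user_id} would exceed {self.limit} tokens")
        response, tokens_used = call_api()
        self.used[user_id] = spent + tokens_used
        return response


# Hypothetical stand-in for a vendor call; records how often it is reached.
api_calls = []


def fake_api() -> Tuple[str, int]:
    api_calls.append(1)
    return "ok", 60


budget = TokenBudget(limit_per_user=100)
first = budget.guarded_call("u1", 60, fake_api)
try:
    budget.guarded_call("u1", 60, fake_api)
    blocked = False
except BudgetExceeded:
    blocked = True
```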

13. What is the first optimization that should be applied in the cost optimization hierarchy?

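Correct. Setting max_tokens is a one-line change that takes effect immediately, carries low risk, and requires no architectural changes. It is the highest-ROI-per-effort optimization available and should be applied first. The one caveat is that a cap set too low truncates responses, so choose it from observed output lengths.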

Start with max_tokens: it is a one-line change with immediate impact and low risk, provided the cap is not set so low that responses are truncated. The hierarchy continues: model routing → semantic caching → prompt compression → fine-tuning. Fine-tuning has the highest upfront cost and takes the longest to show ROI.

14. The W3C TraceContext specification defines which HTTP header for trace propagation?

Correct. The traceparent header carries the version, trace ID, parent span ID, and flags in a standardized format.

The W3C TraceContext spec defines the traceparent header as the standard carrier for distributed trace context.
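
The four fields of a traceparent value can be pulled apart as sketched below. This is a simplified parser for the version `00` layout only (the specification allows future versions to append fields, which this sketch rejects); `parse_traceparent` and `TraceContext` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header value; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of trace-flags is the "sampled" flag.
    return TraceContext(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
bad = parse_traceparent("00-" + "0" * 32 + "-00f067aa0ba902b7-01")
```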

15. What is the "fifth signal" that Google's SRE golden signals framework must be extended with for AI systems?

Correct. The four SRE golden signals are latency, traffic, errors, saturation. For AI, output quality — captured through semantic logging — is the essential fifth signal.

The four SRE golden signals (latency, traffic, errors, saturation) don't capture whether the model's outputs are actually correct or drifting. The missing fifth signal is output quality, captured through semantic logging.

16. Google SRE's recommended test for whether an alert should exist in production is:

Correct. Google SRE defines alert toil as alerts requiring human effort that produce no lasting improvement. The elimination criterion: every alert must be either immediately actionable or lead to permanent system improvement. Otherwise it is toil.

Not correct. The SRE standard is about actionability and lasting value, not automation speed or validation period.

17. Which open-source tool provides a unified proxy for 100+ LLM APIs with built-in cost tracking?

Correct. LiteLLM is an open-source proxy that provides a unified interface to 100+ LLM APIs with built-in cost tracking, model routing, and logging. Helicone is a logging proxy; LangSmith is LangChain's observability platform; GPTCache is a semantic caching library.

LiteLLM is the unified proxy for 100+ APIs. Helicone is specifically a logging/cost-tracking proxy. LangSmith is LangChain's observability tool. GPTCache is for semantic caching.

18. Which containment action is the LEAST disruptive to users while still addressing a specific harmful output category?

Correct. Output filtering is the least disruptive option: the service remains operational for all users, and only the specific harmful output type is blocked. It is the first tool in the containment toolkit for most AI incidents.

The spectrum runs from least to most disruptive: output filtering → traffic throttling → feature flagging → fallback routing → full suspension. Output filtering preserves the most service functionality while directly addressing the harm.
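
Output filtering, the least disruptive step on that spectrum, can be sketched as a wrapper around the model's response. Everything here is an assumption for illustration: the keyword-based `classify` function stands in for a real safety classifier, and the category name and fallback message are made up.

```python
from typing import Set, Tuple

SAFE_FALLBACK = "I can't help with that here. Please contact support."


def classify(text: str) -> str:
    # Hypothetical stand-in for a real output classifier.
    return "medication_dosage" if "mg every" in text.lower() else "ok"


def filter_output(text: str, blocked_categories: Set[str]) -> Tuple[str, bool]:
    """Return (text_to_show, was_blocked). Only blocked categories are replaced."""
    if classify(text) in blocked_categories:
        return SAFE_FALLBACK, True
    return text, False


blocked = {"medication_dosage"}
normal_reply = filter_output("Your order ships tomorrow.", blocked)
harmful_reply = filter_output("Take 500 mg every hour.", blocked)
```

All other traffic passes through unchanged, which is why the service stays available while the specific category is contained.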

19. A model retrain should be treated as a deployment event. What does this specifically mean?

Correct. A retrained model is a new model. It carries the same risks as the original deployment and must go through equivalent validation processes — shadow, canary, A/B — before becoming champion.

Treating retrain as a deployment event means applying the same rigor: shadow testing, canary rollout, A/B validation, rollback readiness. Never auto-promote a retrain to production based solely on offline metrics.

20. "Silent degradation" in an AI system is particularly dangerous because:

Correct. Silent degradation means the service looks healthy to infrastructure monitoring. Error rates are normal. Latency is normal. But prediction quality has degraded — users are receiving wrong answers. This is why AI monitoring requires business-metric and model accuracy monitoring, not just infrastructure health checks.

Not correct. Silent degradation means the service continues operating normally from an infrastructure perspective while model quality degrades — invisible to standard monitoring tools.

Final Exam