Applied AI Development

1. You should set a high-refusal-rate alert because a sudden spike likely indicates:

Correct. Refusal rate is a behavioral signal. Sudden spikes indicate either adversarial users probing guardrails (prompt injection) or a system prompt change that altered how the model interprets what it should refuse.

Refusal rate is a model behavior metric, not an infrastructure metric. It points to prompt-level causes, not GPU or network issues.

2. What fundamental problem does RAG solve that fine-tuning cannot address efficiently?

Correct — RAG separates knowledge from weights, enabling instant knowledge updates by changing a database rather than retraining a model.

RAG's primary advantage is knowledge update speed. Fine-tuning is slow and expensive to refresh; RAG updates instantly when source documents change.

3. The top_p parameter controls:

Correct. Nucleus sampling (top-p) restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. At top_p=0.9, the model only samples from tokens that collectively account for 90% of probability mass, excluding unlikely tokens that could produce nonsense.

Top-p controls nucleus sampling. If top_p=0.9, the model constructs a set of tokens whose probabilities sum to 0.9, then samples from only that set — excluding the long tail of improbable tokens.

4. Few-shot prompting includes examples primarily to:

Correct. Examples work by narrowing the space of plausible outputs. They show rather than tell, constraining the model's probability distribution more precisely than verbal descriptions alone can achieve.

Few-shot examples constrain outputs by demonstration — showing the exact pattern rather than describing it. They don't teach new facts or trigger fine-tuning; they shift the probability distribution toward outputs that match the demonstrated pattern.

5. If a RAG system has low Context Recall (below 0.75), the most appropriate fix is:

Correct. Low Context Recall means the retrieval system is missing chunks that contain the answer. The fix is in retrieval coverage: more candidates (higher K), better chunk boundaries (key info isn't split), or a reranker that catches what ANN search missed.

Context Recall is about coverage — are you finding all the relevant chunks? Adding metadata filters would reduce recall further. The fix is improving coverage: more candidates or better chunk design.

6. In LLM-as-judge evaluation, "self-enhancement bias" refers to:

✓ Correct. Self-enhancement bias is a documented systematic preference: GPT-4 gives inflated scores to GPT-4 outputs; Claude inflates Claude outputs. This happens even in blind evaluation because models share stylistic tendencies and the judge model "resonates" with familiar patterns. Use cross-family judges when possible.

✗ Self-enhancement bias means LLMs prefer outputs from their own model family. GPT-4 judging GPT-4 vs Claude — GPT-4's judgments systematically favor GPT-4 style. This creates circular evaluation that inflates scores for the "home" model. Cross-family judging substantially reduces (not eliminates) this bias.

7. Which file should be listed in .gitignore to prevent API key exposure?

Correct. The .env file contains secrets and must never be committed. It belongs in .gitignore from the start of every project.

The .env file stores API keys and must be excluded via .gitignore. config.py should only load from environment variables, not store keys directly.

8. In a progressive canary rollout (1%→5%→25%→50%→100%), what should trigger an automatic rollback at any stage?

Correct. Rollback gates must be automated and threshold-based — not triggered by individual complaints (too noisy) or any difference at all (too sensitive). Define specific thresholds per metric before deploying.

Rollbacks should be automated on pre-defined metric thresholds, not individual complaints (too noisy) or GPU metrics (wrong layer) or any distribution difference (too sensitive).

9. A properly formed AI system requirement must be:

Correct. Requirements must be testable and architecture-agnostic — they define what success looks like, not how to achieve it.

Requirements must be architecture-agnostic and testable before any implementation begins.

10. A feature store's primary architectural benefit over separate training and serving pipelines is:

Correct. Feature stores solve training-serving skew by ensuring both training jobs and online inference consume identical feature computations.

The core benefit is consistency — one feature computation used by both training and serving, eliminating divergence.

11. Training-serving skew most commonly occurs when:

Correct. Training-serving skew arises from divergent implementations of the same logical feature computation — the canonical fix is a shared feature store.

Training-serving skew is a code-level problem — the same feature is computed differently in training vs. serving code paths.

12. In the three-layer monitoring model, which layer would alert on a sudden increase in prediction output entropy?

Correct. Output entropy is a model-level signal — it reflects the distribution of predictions, which is monitored at the model/concept layer.

Output distribution metrics belong to the model/concept monitoring layer — the layer that watches what the model produces, not what it receives.

13. An AI API is stateless. What is the direct consequence for conversation management in your application?

Correct. Stateless means zero server-side memory. Your application owns the conversation history. You store it, you manage it, you re-send it. This gives you full control — and full responsibility for managing its growth.

No server-side session memory exists. Your code must maintain and re-send the full conversation history. This is both the source of conversational capability and the source of unbounded cost growth if history isn't managed.

14. The most important reason to run regression tests (e.g., MMLU) after fine-tuning is:

Correct. Regression tests catch catastrophic forgetting — where fine-tuning on a narrow task degrades the model's general capabilities. A model that excels at your task but can no longer reason coherently about anything else is a deployment risk.

Regression testing's primary purpose is detecting catastrophic forgetting. Fine-tuning on narrow data can overwrite general capabilities. Comparing your fine-tuned model against its own base model on general benchmarks is the check — not comparison against commercial models.

15. The Stanford research on GPT-4's performance change between March and June 2023 is a case study in:

Correct. The endpoint name "gpt-4" was stable, but the underlying model changed — and coding task accuracy dropped from 95.2% to 86.8%. This is model version drift, and the mitigation is pinning to dated model versions.

The Stanford study is specifically about undocumented model checkpoint changes causing behavioral shifts — the definition of model version drift.

16. What does LangFuse offer that Prometheus + Grafana does NOT?

Correct. Prometheus tracks numeric metrics (latency, throughput, error rates). LangFuse captures the content of each pipeline step — what was the prompt, what was retrieved, what did the model say — enabling root cause analysis of bad outputs.

Prometheus/Grafana are for infrastructure metrics. LangFuse is for content-level tracing of what happened inside each LLM pipeline execution.

17. When constructing a test set for a user-facing AI product, what split strategy is most appropriate for user-level data?

✓ Correct. User-stable splits prevent data leakage through user-level patterns. If User A's examples appear in both train and test, the model may learn User A's specific writing style, vocabulary, and preferences — inflating test performance in ways that don't generalize to new users.

✗ Random splits of user data allow the same user's examples in both train and test. The model learns user-specific patterns that inflate test metrics without generalizing. Always split at the user level for user-generated content.

18. Dynabench's key innovation over traditional static benchmarks is:

✓ Correct. Dynabench (Kiela et al., Meta AI, 2021) uses human annotators who see the current best model's predictions and specifically write examples that fool it. These adversarial examples become the new test set — ensuring the benchmark perpetually stays ahead of model capabilities instead of being saturated.

✗ Dynabench is human-in-the-loop adversarial benchmarking. Annotators write examples designed to fool the current SOTA model. This creates a continuously hard benchmark that can't be saturated by contamination or overfitting — it adapts faster than models can memorize it.

19. Model collapse in synthetic data generation primarily refers to:

Correct. When models train repeatedly on their own outputs, rare but important output patterns disappear and the model converges toward a narrower, less diverse output distribution.

Model collapse describes the progressive narrowing of output diversity when training on model-generated data across multiple generations. Important minority patterns disappear as the model reinforces its most common outputs.

20. DPO (Direct Preference Optimization) requires which type of training data?

Correct. DPO trains on (prompt, chosen_response, rejected_response) triplets, directly maximizing the probability gap between preferred and rejected outputs without a separate reward model.

DPO requires preference pairs — for each prompt, you need both a preferred and a rejected completion. This allows DPO to optimize the model's behavior directly without training a separate reward model.

Final Exam