Module 8 · Lesson 1

Choosing Your Problem

The single most important decision in any AI project is not which model to use — it is whether the problem is worth solving with AI at all.

How do you identify problems where AI genuinely adds value versus problems that a simpler tool would handle better?

In March 2023, Duolingo announced it had integrated GPT-4 to power two features: Roleplay, which lets learners hold open-ended conversations with AI characters, and Explain My Answer, which gives personalized grammar feedback. Both features addressed a precise gap: the app had millions of users who could practice vocabulary but had no one to converse with freely at 2 a.m. The AI did not replace the curriculum; it filled a slot the curriculum could not fill with rule-based logic. Within two months, Duolingo Max subscribers completed 20% more daily lessons than the control group. The lesson from this case is structural: Duolingo defined the gap first, then found the tool.

The AI-Fit Test

Not every task benefits from AI. The clearest wins cluster around a set of problem characteristics. Before committing to an AI project, run your idea through four questions that practitioners call the AI-Fit Test.

1. Is the output variable? If the correct answer is always identical regardless of input, a lookup table or simple formula is faster and cheaper. AI earns its cost when outputs must adapt — to tone, to context, to incomplete information.

2. Is the pattern learnable from examples? AI systems excel at generalizing from data. If there are no examples, because the task is entirely novel or because the data has never been captured, the model has nothing to generalize from and will struggle.

3. Is the cost of error acceptable? A customer-service draft that occasionally sounds slightly off is fixable. An AI that miscalculates a medication dose is catastrophic. Problems where errors are high-stakes require more robust guardrails than a first project can typically provide.

4. Does the scale justify the effort? If you are doing this task twice a week, a prompt template might suffice. If you are doing it ten thousand times a day, a properly integrated pipeline pays for itself quickly.
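The four questions above can be recorded as a simple checklist. The following is a minimal sketch, not an established tool: the class name `FitAnswers`, the function `ai_fit_concerns`, and the wording of each concern are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FitAnswers:
    """Yes/no answers to the four AI-Fit Test questions (hypothetical structure)."""
    output_is_variable: bool
    pattern_is_learnable: bool
    error_cost_is_acceptable: bool
    scale_justifies_effort: bool


def ai_fit_concerns(answers: FitAnswers) -> list[str]:
    """Return one concern per failed question; an empty list means no red flags."""
    concerns = []
    if not answers.output_is_variable:
        concerns.append("Output is fixed: a lookup table or formula may be cheaper.")
    if not answers.pattern_is_learnable:
        concerns.append("No examples to learn from: capture data first.")
    if not answers.error_cost_is_acceptable:
        concerns.append("Errors are high-stakes: needs guardrails beyond a first project.")
    if not answers.scale_justifies_effort:
        concerns.append("Low volume: a prompt template may suffice.")
    return concerns


# Example: a drafting task that passes three questions but runs only twice a week.
idea = FitAnswers(True, True, True, False)
for concern in ai_fit_concerns(idea):
    print(concern)
```

A failed question is a prompt for discussion rather than an automatic veto; the checklist only makes the reasoning visible.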

Principle

The best AI projects solve a real bottleneck. They do not automate a task that nobody cares about, and they do not replace human judgment in domains where that judgment is the entire value. They fill a specific, documented gap.

Four Problem Categories Worth Starting With

For a first AI project, four categories offer favorable risk-to-reward ratios because they have clear success metrics, tolerate moderate error rates, and produce outputs humans can quickly review.

Content Generation

Drafting product descriptions, summarizing reports, generating FAQ answers from documentation. Errors are visible and easy to catch in review.

Classification & Routing

Sorting support tickets by category, tagging customer sentiment, routing emails to the right team. Measurable accuracy and low catastrophic risk.

Information Extraction

Pulling structured data from unstructured text — names, dates, entities from contracts or emails. Downstream validation is straightforward.

Conversational Interface

Answering questions from a defined knowledge base, booking workflows, guided troubleshooting. Scope is bounded; off-topic answers can be redirected.

Defining Success Before You Build

Every project needs a measurable definition of done. Vague goals — "make our support better with AI" — produce vague projects. Strong goal statements specify the metric, the baseline, and the target. For example: Reduce average email drafting time from 8 minutes to under 3 minutes for customer-success agents handling tier-1 tickets, measured over 30 days post-launch.

In 2022, Klarna set a similarly specific objective when deploying its AI assistant: handle 67% of customer-service chats without human escalation within 30 days. The company published that it hit the target within the first month, attributing the success partly to how precisely the objective was scoped — the bot handled only specific transaction-related queries, not general complaints.

Your success definition should include: the task being automated or augmented, the metric you will track, the baseline (current performance), the target, and who owns measuring it.
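One way to keep all five components together is to write the success definition as a small record. This is a sketch under stated assumptions: the `SuccessDefinition` class, its field names, and the owner shown are hypothetical, and the numbers come from the email-drafting example above.

```python
from dataclasses import dataclass


@dataclass
class SuccessDefinition:
    """A measurable goal for a lower-is-better metric (hypothetical structure)."""
    task: str
    metric: str
    baseline: float
    target: float
    owner: str
    window_days: int = 30

    def is_met(self, measured: float) -> bool:
        """True when the measured value is under the target."""
        return measured < self.target

    def reduction_needed(self) -> float:
        """Fractional reduction from baseline required to reach the target."""
        return (self.baseline - self.target) / self.baseline


goal = SuccessDefinition(
    task="Draft replies to tier-1 support tickets",
    metric="average drafting time (minutes)",
    baseline=8.0,
    target=3.0,
    owner="customer-success team lead",  # hypothetical owner
)
print(f"Reduction needed: {goal.reduction_needed():.1%}")  # Reduction needed: 62.5%
print(goal.is_met(2.5))  # True
```

If any field is hard to fill in, that is usually a sign the goal is still too vague to build against.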

Key Takeaway

Good problem selection is a filter, not a formula. Run the AI-Fit Test, pick a category with visible outputs, and write a measurable success definition before touching a single line of code or crafting a single prompt. The project you do not build because it fails the fit test is as valuable as the one you build correctly.

Lesson 1 Quiz

Choosing Your Problem — 5 questions

1. In the Duolingo GPT-4 integration (2023), what was the primary strategic reason the company chose AI for those two features?

Correct. Duolingo identified a precise gap — real conversation practice at any hour — that deterministic logic genuinely could not fill, then selected AI to fill that gap specifically.

Incorrect. The Duolingo case was about filling a capability gap (free conversation), not cutting headcount or replacing the curriculum.

2. According to the AI-Fit Test, which characteristic most clearly signals that AI is a poor fit for a task?

Correct. When outputs are always identical, a lookup table or formula is faster and cheaper. AI earns its cost when outputs must adapt to variable context.

Incorrect. A fixed, context-independent correct answer is the clearest sign that simpler tooling suffices. AI's value comes from handling variability.

3. Klarna's 2022 AI assistant deployment succeeded partly because the objective was precisely scoped. What was that scope?

Correct. Klarna set a precise metric (67% of chats handled autonomously), a clear time horizon (30 days), and a scoped domain (transaction-related queries only).

Incorrect. Klarna scoped the bot to specific transaction queries and set a 67% autonomous-handling target within 30 days — a precise, bounded objective.

4. Which of the following is the strongest example of a measurable AI project success definition?

Correct. This definition specifies the task, the metric (drafting time), the baseline (8 min), the target (under 3 min), and the measurement window (30 days) — all required components.

Incorrect. The other options lack baselines, metrics, or measurable targets. A strong definition includes task, metric, baseline, target, and timeframe.

5. For a first AI project, why are content-generation and classification tasks considered lower-risk starting points than, say, medical diagnosis support?

Correct. Low catastrophic risk and easy human review are the key factors. A slightly off product description is correctable; an AI diagnostic error can harm a patient.

Incorrect. The risk distinction is about the cost of error and reviewability — not about accuracy rates or data requirements.

Lab 1 — Problem Scoping Assistant

Practice the AI-Fit Test and define a measurable project goal with an AI coach.

Your Task

Describe a real or hypothetical task you would like to automate or augment with AI. The assistant will guide you through the four AI-Fit Test questions and help you write a measurable success definition. Aim for at least three exchanges to complete the lab.

Starter prompt: "I want to use AI to help my team [describe your task]. Can you run me through the AI-Fit Test?"

Problem Scoping Coach

Welcome to Lab 1. Describe the task or problem you are considering for your first AI project — even a rough idea works. I will guide you through four diagnostic questions to decide whether AI is genuinely the right fit, then help you draft a measurable success definition. What task are you thinking about?

Module 8 · Lesson 2

Designing Your Workflow

A project is not a prompt. It is a sequence of decisions about data, interfaces, human checkpoints, and failure modes — all of which must be designed before the first API call.

What does the architecture of a simple, functional AI workflow actually look like — and where do first-time builders most commonly break it?

In February 2023, Spotify launched its AI DJ feature in the United States and Canada. The feature — built on OpenAI technology — generated personalized spoken commentary between tracks, mimicking the patter of a radio host who knows your listening history. What made the engineering distinctive was not the language model itself but the pipeline surrounding it. Spotify's system pulled a user's listening data, passed it through a recommendation model to choose tracks, then sent structured context (genre, mood, listening history, time of day) to the language model to generate commentary, then routed the output through a text-to-speech voice model trained on a real DJ's voice. Each stage had a defined input format, a defined output format, and a fallback behavior if any stage failed. Users experienced a seamless product; underneath was a carefully sequenced workflow with no single magical black box.

The Five Stages of a Minimal AI Workflow

Every working AI project — from a simple email-drafting tool to Spotify's DJ — moves through the same five stages. Understanding each stage prevents the most common architectural mistakes.

1. Input collection. Where does the raw material come from? A form, a database query, an uploaded file, a webhook from another system? Define the exact structure and validate it before passing it downstream.
2. Context assembly. The AI model needs context to do its job. This stage collects the relevant background (user history, retrieved documents, system instructions) and packages it into a prompt or API payload. This is where prompt engineering lives.
3. Model call. The actual API call to the language model, image model, or other AI component. Specify the model, temperature, max tokens, and any stop sequences. Include retry logic for transient failures.
4. Output processing. Raw model output is rarely ready to use. Parse it, validate it against expected formats, run safety filters, and handle cases where the output is malformed or off-topic.
5. Delivery and logging. Send the processed output to its destination (user interface, database, email, downstream system). Log the input, the output, and any metadata needed for evaluation and debugging.
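The five stages can be sketched as five small functions chained together. This is an illustrative skeleton, not a production design: every function name is hypothetical, the ticket-reply task is an assumed example, and the model call is injected as a plain function (`call_model`) so that no real API is implied.

```python
import json
from typing import Callable


def collect_input(raw: dict) -> dict:
    """Stage 1: validate the raw input structure before passing it downstream."""
    text = raw.get("ticket_text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("ticket_text must be a non-empty string")
    return {"ticket_text": text.strip()}


def assemble_context(data: dict) -> str:
    """Stage 2: package instructions and runtime data into a prompt."""
    return 'Reply as JSON: {"reply": "..."}\n\nTicket:\n' + data["ticket_text"]


def process_output(raw_output: str) -> dict:
    """Stage 4: parse and validate the raw model output."""
    parsed = json.loads(raw_output)  # raises ValueError on malformed JSON
    if not isinstance(parsed, dict) or not isinstance(parsed.get("reply"), str):
        raise ValueError("model output is missing a 'reply' string")
    return parsed


def run_pipeline(raw: dict, call_model: Callable[[str], str], log: list) -> dict:
    data = collect_input(raw)              # Stage 1: input collection
    prompt = assemble_context(data)        # Stage 2: context assembly
    raw_output = call_model(prompt)        # Stage 3: model call (stand-in for an API call)
    result = process_output(raw_output)    # Stage 4: output processing
    log.append({"prompt": prompt, "output": raw_output})  # Stage 5: logging
    return result                          # Stage 5: delivery


def fake_model(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return json.dumps({"reply": "Thanks for reaching out."})


log: list = []
print(run_pipeline({"ticket_text": "My invoice is wrong."}, fake_model, log))
```

Passing the model in as a function keeps each stage testable on its own, which makes the defined input and output of every stage explicit.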

Common Failure Point

Most first-project failures happen at Stage 4 — output processing. Builders test the happy path (model returns clean JSON) but do not handle the case where the model returns partial JSON, appends an explanation, or refuses to answer. Always write explicit output-validation logic before deploying.
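As an illustration of explicit output validation, the sketch below handles the three cases named above: clean JSON, JSON followed by an explanation, and partial JSON or a refusal. The function name `parse_model_json` and the `category` field are hypothetical; the parsing relies on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json
from typing import Optional


def parse_model_json(raw_output: str, required_keys: set) -> Optional[dict]:
    """Return the first JSON object containing all required keys, else None.

    Scans for each "{" and tries to decode an object starting there, so text
    before or after the object is ignored. Partial JSON and refusals yield None.
    """
    decoder = json.JSONDecoder()
    start = raw_output.find("{")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(raw_output, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict) and required_keys.issubset(value):
            return value
        start = raw_output.find("{", start + 1)
    return None


samples = [
    '{"category": "billing"}',                                  # clean JSON
    '{"category": "billing"}\n\nI chose billing because ...',   # appended explanation
    '{"category": "bil',                                        # partial JSON
    "I'm sorry, I can't help with that.",                       # refusal
]
for sample in samples:
    print(parse_model_json(sample, {"category"}))
```

A `None` result is the signal to retry, fall back, or route to a human, rather than passing unvalidated text downstream.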

Human-in-the-Loop Checkpoints

Not every AI workflow should operate fully autonomously. The question is not whether to include humans, but where in the workflow human review adds the most value relative to its cost.

In 2023, the Washington Post's internal AI tool for generating first-draft summaries of statistical sports stories went through an explicit human-review gate before any summary was published. Editors could accept, edit, or reject the draft. The workflow was not "AI does it automatically" — it was "AI drafts, human publishes." That checkpoint halved the drafting time without removing editorial accountability. The human review step was not an afterthought; it was designed into the workflow from the start.

For your first project, draw a simple diagram with boxes and arrows. Label each box as automated or human-reviewed. If any automated box produces output that could be wrong in a costly way, move a human-review gate upstream of the consequence.

Failure Mode Planning

Every stage of the workflow can fail. The AI model may time out. The retrieved document may be stale. The output may fail validation. Good workflow design specifies, for each failure mode: what the system will do (retry, fallback, alert), who will be notified, and what the user will see.

A simple framework: for each stage, write one sentence answering "If this stage fails, the system will ___." Systems that cannot answer that question for every stage are not ready to deploy.
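For the model-call stage, the sentence "If this stage fails, the system will retry, then fall back" might translate into something like the sketch below. The function name, the retry count, and the backoff delays are assumptions chosen for illustration, not recommended values.

```python
import time
from typing import Callable


def call_with_retry(
    stage: Callable[[], str],
    fallback: str,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, bool]:
    """Run a stage, retrying on TimeoutError with exponential backoff.

    Returns (result, used_fallback). After max_attempts failures the fallback
    text is returned so the user still sees a defined response.
    """
    for attempt in range(max_attempts):
        try:
            return stage(), False
        except TimeoutError:
            if attempt < max_attempts - 1:
                sleep(base_delay_s * (2 ** attempt))  # 0.5 s, then 1 s, ...
    return fallback, True


def always_times_out() -> str:
    raise TimeoutError("model did not respond")


result, used_fallback = call_with_retry(
    always_times_out,
    fallback="We could not generate a draft. A teammate will follow up.",
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(used_fallback, result)
```

The `used_fallback` flag is what an alerting or logging step would watch to decide who gets notified.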

Key Takeaway

Design the workflow as a sequence of stages with defined inputs, outputs, and failure behaviors — not as a single prompt. Place human checkpoints where errors would be costly. Before building, you should be able to draw the full pipeline on a whiteboard in under five minutes.

Lesson 2 Quiz

Designing Your Workflow — 5 questions

1. In Spotify's 2023 AI DJ feature, what was architecturally significant beyond the language model itself?

Correct. The key insight is that Spotify's pipeline chained multiple specialized models with structured handoffs — a workflow, not a single AI call.

Incorrect. The architectural significance was the multi-stage pipeline — recommendation model → language model → TTS — with defined data formats at each stage.

2. Which of the five workflow stages is identified as the most common failure point for first-time AI project builders?

Correct. Builders often test the happy path but fail to handle malformed, partial, or refused outputs. Output validation logic is essential before deployment.

Incorrect. Output processing is the most common failure point — specifically the failure to handle cases where the model does not return a clean, expected result.

3. How did the Washington Post design human oversight into its AI-assisted sports-summary workflow in 2023?

Correct. The human-review checkpoint was designed into the workflow from the start — AI drafts, human publishes — halving drafting time while preserving editorial accountability.

Incorrect. Every summary went through an editor who could accept, edit, or reject. The checkpoint was not random or post-publication.

4. What does "context assembly" specifically refer to in the five-stage workflow framework?

Correct. Context assembly is where prompt engineering lives — gathering the right background information and structuring it so the model has what it needs to produce a useful output.

Incorrect. Context assembly is the stage of collecting and packaging relevant information into the prompt — this is where prompt engineering decisions happen.

5. A well-designed AI workflow should be able to answer one sentence for each stage: "If this stage fails, the system will ___." What is the purpose of this exercise?

Correct. Systems that cannot answer this question for every stage are not ready to deploy. Failure mode planning prevents silent failures that erode user trust.

Incorrect. The one-sentence exercise ensures every stage has an explicit, pre-planned response to failure — it is a deployment-readiness check, not a cost or compliance exercise.

Lab 2 — Workflow Design Coach

Map out your AI pipeline, identify human checkpoints, and plan failure responses.

Your Task

Describe the AI project you scoped in Lab 1 (or a new one). The assistant will help you map each of the five workflow stages, decide where human review should sit, and write a one-sentence failure response for each stage. Aim for at least three exchanges.

Starter prompt: "I want to build a workflow that [describe your project]. Help me map the five stages and identify where I need human checkpoints."

Workflow Design Coach

Welcome to Lab 2. Let us design your AI workflow together. Describe your project — what it takes in, what it produces, and who uses the output. I will help you walk through all five stages, flag where human review makes sense, and draft failure-response sentences for each stage. What is your project?

Module 8 · Lesson 3

Prompting for Your Project

A project's system prompt is its constitution — it defines what the AI is, what it does, and what it must never do. Writing it carefully is not optional.

How do you craft prompts that are specific enough to produce consistent outputs but flexible enough to handle real-world variation?

When GitHub released Copilot in October 2021, the underlying system prompt — the set of instructions given to the Codex model before any user code appeared — specified the model's role as an AI pair programmer, defined the expected output format (code completions rather than explanations), and instructed it to infer the programmer's intent from surrounding context. Over the following two years of iterative refinement, GitHub's engineering team made one finding consistent enough to publish: the specificity of the system prompt had a larger effect on output quality than switching between model versions. A more precise role description and output format specification produced meaningfully better completions from the same underlying model. This finding is now a standard reference point in enterprise AI deployment discussions.

The Four Elements of a Project System Prompt

A well-structured system prompt for a production AI workflow contains four elements in roughly this order:

Role and identity. Who is the AI in this context? Not "you are a helpful assistant" — that is too broad. Instead: "You are a customer-support draft writer for a B2B SaaS company. You produce concise, professional email replies to inbound support tickets."
Task and output format. What exactly should the AI produce, and in what format? Specify structure (bullet points, JSON, prose paragraph), length constraints (under 150 words), tone (formal/conversational), and any required fields.
Constraints and prohibitions. What must the AI never do? Never promise a refund amount. Never mention competitor products. Never provide legal or medical advice. Always escalate requests about account deletion to human support.
Context injection placeholder. Where in the prompt will runtime context (the actual ticket text, the user's account tier, the relevant FAQ section) be inserted? Mark it clearly so your code can reliably substitute values.

Prompt Pattern

A reliable template: [Role sentence] → [Task and format instructions] → [Constraints list] → [Context placeholder marker]. Keep the role sentence under 30 words. List constraints as numbered items so they are easy to audit. Mark the context placeholder with a distinctive token like {{TICKET_TEXT}} that your code replaces at runtime.
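The template pattern above could look like this in code. This is a minimal sketch: the prompt text reuses the customer-support example from this lesson, while `SYSTEM_PROMPT_TEMPLATE` and `render_prompt` are hypothetical names. Plain string replacement is used instead of `str.format` because the double-brace token would otherwise be treated as an escaped brace.

```python
PLACEHOLDER = "{{TICKET_TEXT}}"

SYSTEM_PROMPT_TEMPLATE = """\
You are a customer-support draft writer for a B2B SaaS company.

Task: Write a concise, professional email reply to the ticket below.
Format: plain prose, under 150 words.

Constraints:
1. Never promise a refund amount.
2. Never mention competitor products.
3. Never provide legal or medical advice.
4. Always escalate requests about account deletion to human support.

Ticket:
{{TICKET_TEXT}}
"""


def render_prompt(template: str, ticket_text: str) -> str:
    """Substitute the runtime ticket text for the placeholder token."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError("template must contain the placeholder exactly once")
    return template.replace(PLACEHOLDER, ticket_text)


print(render_prompt(SYSTEM_PROMPT_TEMPLATE, "I was charged twice this month."))
```

Checking that the placeholder appears exactly once catches a common editing mistake: a prompt revision that accidentally deletes or duplicates the token.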

Iterating Toward Consistency

No prompt is right on the first draft. The professional standard, documented in Anthropic's 2023 prompt engineering guide and echoed in OpenAI's production cookbook, is to evaluate prompts against a test set of representative inputs before deployment. A test set of 20 to 50 varied examples — covering typical cases, edge cases, and adversarial cases — will surface most prompt failures before they reach users.

The iteration loop is: write prompt → run test set → identify failure patterns → revise the prompt element responsible for the failure → repeat. Common failure patterns include: the model ignoring format instructions when the input is long (fix: move format instructions to the end of the prompt, closer to the output); the model adding caveats not requested (fix: add an explicit "do not add disclaimers" constraint); the model misidentifying the task when the input is ambiguous (fix: add an example in the system prompt).
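The "run test set, identify failure patterns" steps can be sketched as a small harness. Everything here is illustrative: `run_test_set` is a hypothetical helper, each test case pairs an input with a check function you write yourself, and `stub_generate` stands in for a real model call.

```python
from typing import Callable

# One test case: (input text, check that returns True when the output is acceptable).
TestCase = tuple[str, Callable[[str], bool]]


def run_test_set(generate: Callable[[str], str], cases: list) -> dict:
    """Run every case and report the pass rate plus the inputs that failed."""
    if not cases:
        raise ValueError("test set is empty")
    failures = [text for text, check in cases if not check(generate(text))]
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases), "failures": failures}


def stub_generate(text: str) -> str:
    """Stand-in for a model call; always adds an unrequested disclaimer."""
    return "Thanks for your message. Note: I am an AI and may be wrong."


cases = [
    ("Where is my invoice?", lambda out: len(out.split()) < 150),   # length constraint
    ("Cancel my plan.", lambda out: "Note:" not in out),            # no disclaimers
]
report = run_test_set(stub_generate, cases)
print(report["pass_rate"], report["failures"])
```

Keeping the failed inputs, not just the pass rate, is what lets you trace a failure back to the prompt element responsible.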

In 2023, Notion's AI team reported that their prompt for generating structured meeting summaries required eleven revision cycles before output consistency exceeded 90% on their internal test set. Eleven cycles is not unusual; it is the norm for production-grade prompts.

Few-Shot Examples in System Prompts

For tasks with consistent output formats, including two to three worked examples directly in the system prompt typically improves output quality more than any other single intervention. The examples serve as a format specification that the model can pattern-match against, even when the instructions alone are ambiguous. Each example should include a realistic input and the exact output format you want the model to produce.

Keep examples under 200 words each. If your example set grows much larger, consider fine-tuning rather than few-shotting — the examples are trying to do the work that a fine-tuned model would internalize permanently.
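A short helper can format worked examples consistently and enforce the word limit suggested above. This is a sketch only: `format_examples`, the "Input:/Output:" labels, and counting words by whitespace are all assumptions made for illustration.

```python
def format_examples(examples: list, max_words: int = 200) -> str:
    """Format (input, output) pairs as a few-shot block, rejecting overlong examples."""
    blocks = []
    for number, (example_input, example_output) in enumerate(examples, start=1):
        words = len(example_input.split()) + len(example_output.split())
        if words > max_words:
            raise ValueError(f"example {number} has {words} words; limit is {max_words}")
        blocks.append(f"Example {number}\nInput: {example_input}\nOutput: {example_output}")
    return "\n\n".join(blocks)


few_shot = format_examples([
    ("Customer cannot log in after password reset.", '{"category": "account"}'),
    ("Charged twice for the March invoice.", '{"category": "billing"}'),
])
print(few_shot)
```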

Key Takeaway

Write your system prompt with four explicit sections: role, task and format, constraints, and context placeholder. Evaluate it against a test set of 20–50 examples before deployment. Expect ten or more revision cycles for a production-grade prompt. Include two to three worked examples for format-sensitive tasks.

Lesson 3 Quiz

Prompting for Your Project — 5 questions

1. What consistent finding did GitHub's engineering team report across two years of Copilot development regarding prompt specificity?

Correct. GitHub found that a more precise system prompt — better role description, clearer format spec — produced meaningfully better completions from the same model than simply upgrading the model.

Incorrect. The GitHub finding was the opposite: prompt specificity mattered more than the model version. This is a core principle of production AI deployment.

2. In the four-element system prompt framework, what is the purpose of the "context injection placeholder"?

Correct. The placeholder — e.g., {{TICKET_TEXT}} — is a marked position that application code replaces at runtime with the actual input data specific to each request.

Incorrect. The context injection placeholder is a marked token that code replaces with real runtime data. It makes the prompt a reusable template rather than a one-off string.

3. Notion's AI team reported in 2023 that their meeting-summary prompt required how many revision cycles before output consistency exceeded 90%?

Correct. Eleven revision cycles is presented as a norm, not an outlier, for production-grade prompts — the lesson being that extensive iteration is expected, not a sign of failure.

Incorrect. Notion reported eleven cycles. The point is that many revision rounds are the professional norm, not a sign that the approach is wrong.

4. When a model consistently ignores format instructions when the input is long, what is the recommended fix according to the prompt engineering guidance in this lesson?

Correct. Models tend to weight instructions closer to the output position more heavily. Moving format instructions to the end of the system prompt improves adherence when inputs are long.

Incorrect. The fix is positional: move format instructions to the end of the prompt so they are closer to the output and receive more weight relative to the long input.

5. According to the lesson, when should a builder consider fine-tuning a model rather than expanding few-shot examples in the system prompt?

Correct. A large few-shot example set is doing the work a fine-tuned model would do permanently. If you need dozens of examples to get consistent results, fine-tuning becomes more economical and effective.

Incorrect. The signal to consider fine-tuning is the size of the example set — if you need many examples to achieve consistency, fine-tuning internalizes that pattern more efficiently.

Lab 3 — System Prompt Builder

Draft, critique, and refine a production-ready system prompt for your project.

Your Task

Describe your project's task and the assistant will help you draft a four-element system prompt: role, task and format, constraints, and context placeholder. Then test it against two edge cases and revise based on what you find. Aim for at least three exchanges.

Starter prompt: "I need a system prompt for an AI that [describe the task]. The output should be [describe the format]. Help me build all four sections."

System Prompt Builder

Welcome to Lab 3. Let us build a production-quality system prompt for your project. Start by telling me: what task will this AI perform, what format should outputs take, and what should the AI never do? I will draft a four-element system prompt and we will test it together against edge cases. What is your project?

Module 8 · Lesson 4

Launching and Evaluating

Shipping is not the end of the project — it is the beginning of the measurement phase. The builders who improve fastest are the ones who define evaluation before they deploy.

How do you launch an AI project responsibly, measure whether it is working, and set up the feedback loops that drive continuous improvement?

Stripe launched Radar, its machine-learning fraud-detection system, in 2016. What distinguishes Radar's operational history is the discipline of its evaluation framework. From the beginning, Stripe published the key metrics it tracks: fraud rate, dispute rate, and false-positive rate (legitimate transactions incorrectly blocked). Each metric has a clear owner. Each degradation triggers a defined escalation process. By 2023, Stripe reported that Radar blocked over $4 billion in fraud annually, and its false-positive rate had declined year-over-year since 2019 — a result the company attributes explicitly to continuous model retraining driven by labeled feedback data. The system ships, measures, retrains, and ships again. The launch in 2016 was not the product; it was the starting condition.

The Minimum Viable Launch

A minimum viable launch for an AI project means deploying with the smallest user base needed to collect meaningful signal — not the full user population. Start with a small internal group, a single team, or a beta cohort of willing users. This limits exposure to errors while generating real behavioral data that synthetic testing cannot replicate.

Define the launch scope in advance: how many users, which use cases are in scope, what monitoring will run, and what threshold of errors or complaints triggers a rollback. A rollback plan is not pessimism — it is the evidence that you have thought seriously about production behavior.
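Writing the launch scope down as data makes the rollback rule unambiguous. The sketch below is hypothetical throughout: the `LaunchScope` fields, the 5% threshold, and the minimum of 50 requests before the rule applies are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class LaunchScope:
    """Launch scope agreed before deployment (hypothetical structure)."""
    max_users: int
    use_cases: tuple
    rollback_error_rate: float  # roll back if the observed error rate exceeds this


def should_roll_back(scope: LaunchScope, errors: int, requests: int,
                     min_requests: int = 50) -> bool:
    """True once enough traffic has been seen and the error rate exceeds the threshold."""
    if requests < min_requests:
        return False  # too little data to judge
    return errors / requests > scope.rollback_error_rate


scope = LaunchScope(max_users=25, use_cases=("tier-1 email drafts",),
                    rollback_error_rate=0.05)
print(should_roll_back(scope, errors=10, requests=100))  # True
print(should_roll_back(scope, errors=2, requests=20))    # False
```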

Four Metrics Every AI Project Should Track

The exact metrics depend on the project type, but four categories apply to nearly every production AI deployment:

Task Success Rate

What percentage of AI outputs were used without modification or rejection? Track this per output type. A declining rate signals prompt drift or data distribution shift.

Latency

How long does the full pipeline take from input to delivered output? Model calls add latency. Monitor p50 and p95 (the median and the 95th percentile), not just the average.

Error Rate

How often does the pipeline fail — API timeouts, output validation failures, fallback activations? A rising error rate is the earliest warning of a system problem.

User Correction Rate

How often do users edit, reject, or override AI output before using it? For augmentation tools, this is often the most sensitive quality signal available.
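The latency percentiles and the three rates can all be computed from a list of logged events. This sketch uses the nearest-rank percentile definition (other definitions interpolate and give slightly different values); the function names and the event flag names are hypothetical.

```python
import statistics


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling of pct/100 * n
    return ordered[rank - 1]


def rate(events: list, flag: str) -> float:
    """Fraction of events where the given boolean flag is set."""
    return sum(1 for event in events if event.get(flag)) / len(events)


# Nine fast requests and one slow one, in seconds.
latencies = [1.0] * 9 + [12.0]
print(statistics.mean(latencies))  # 2.1
print(percentile(latencies, 50))   # 1.0
print(percentile(latencies, 95))   # 12.0

events = [
    {"used_unmodified": True},
    {"user_corrected": True},
    {"pipeline_error": True},
    {"used_unmodified": True},
]
print(rate(events, "used_unmodified"), rate(events, "user_corrected"), rate(events, "pipeline_error"))
```

The mean of 2.1 seconds hides the 12-second request entirely, which is why the tail percentile is tracked alongside it.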

Evaluation Principle

Log enough information to reconstruct every AI decision: the full prompt sent, the raw output received, the user action taken, and a timestamp. Without this log, post-deployment debugging is guesswork. Storage costs are far lower than the cost of a production bug you cannot diagnose.
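A log record covering those four items can be written as one JSON line per decision. This is a minimal sketch: `make_log_record` and the field names are hypothetical, and a real system would also need to decide where the lines are stored and how sensitive data is handled.

```python
import json
from datetime import datetime, timezone


def make_log_record(prompt: str, raw_output: str, user_action: str) -> str:
    """Serialize one AI decision as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,            # the full prompt sent
        "raw_output": raw_output,    # the output before any processing
        "user_action": user_action,  # e.g. "accepted", "edited", "rejected"
    }
    return json.dumps(record, ensure_ascii=False)


line = make_log_record(
    prompt="You are a customer-support draft writer ...",
    raw_output="Hi Sam, thanks for flagging the duplicate charge.",
    user_action="edited",
)
print(line)
```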

Feedback Loops and Continuous Improvement

The fastest-improving AI products close the loop between production behavior and prompt or model updates on a short cycle. In practice this means: reviewing a sample of outputs weekly (not monthly), categorizing failure patterns, updating the prompt or fine-tuning data to address the top failure category, and re-evaluating against the test set before re-deploying.

In 2022 and 2023, the team building Intercom's AI support product, Fin, documented a disciplined weekly review process: a small team reviewed 100 randomly sampled conversations, tagged them by failure type (hallucination, off-topic, format error, tone error), and produced a ranked list of the most common failure modes. The following week's prompt update addressed only the top-ranked failure. This single-focus iteration kept improvements measurable and prevented the common trap of trying to fix everything at once and breaking what was already working.
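A weekly review of this kind needs only two small steps in code: draw a random sample, then rank the tags reviewers assign. The sketch below is an illustration of the general technique, not a description of any company's tooling; the function names and tag strings are hypothetical.

```python
import random
from collections import Counter


def sample_conversations(conversation_ids, k: int = 100, seed=None) -> list:
    """Draw up to k conversation IDs at random, without replacement."""
    ids = list(conversation_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))


def rank_failures(tags: list) -> list:
    """Rank failure tags by frequency; None means the conversation had no failure."""
    return Counter(tag for tag in tags if tag is not None).most_common()


week_sample = sample_conversations(range(5000), k=100, seed=42)
print(len(week_sample))

# Tags a reviewer might assign to seven of the sampled conversations.
tags = ["hallucination", "tone error", "hallucination", None,
        "format error", "hallucination", "tone error"]
ranking = rank_failures(tags)
print(ranking[0])  # ('hallucination', 3)
```

Only `ranking[0]` drives the next prompt revision; the rest of the list waits for later weeks.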

When to Escalate Beyond a Prompt Fix

Three signals indicate that a prompt fix alone will not solve the problem and a more significant intervention — fine-tuning, retrieval augmentation, or workflow redesign — is needed:

1. The same failure persists across three or more prompt revision cycles. This points to a gap in the model's capability or knowledge rather than a formatting issue.

2. Task success rate is below 70% after ten iterations. The problem is either mis-scoped for AI or requires more specialized training data.

3. Outputs are accurate but users are not adopting the tool. The problem is workflow integration or trust, not output quality, and requires UX or process redesign rather than a better prompt.
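The three signals can be expressed as explicit checks against numbers you are already tracking. This is a sketch with hypothetical names; the thresholds (three cycles, ten iterations, 70%) are the ones stated above, and the returned strings simply restate each diagnosis.

```python
def escalation_signals(
    same_failure_cycles: int,
    iterations: int,
    task_success_rate: float,
    outputs_accurate: bool,
    adoption_low: bool,
) -> list[str]:
    """Return the diagnoses that suggest a prompt fix alone will not be enough."""
    signals = []
    if same_failure_cycles >= 3:
        signals.append("capability or knowledge gap")
    if iterations >= 10 and task_success_rate < 0.70:
        signals.append("mis-scoped problem or missing training data")
    if outputs_accurate and adoption_low:
        signals.append("workflow integration or trust problem")
    return signals


print(escalation_signals(
    same_failure_cycles=4,
    iterations=6,
    task_success_rate=0.82,
    outputs_accurate=True,
    adoption_low=False,
))  # ['capability or knowledge gap']
```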

Key Takeaway

Launch small with a defined rollback plan. Track task success rate, latency, error rate, and user correction rate from day one. Log every AI decision in full. Run a weekly sample review and fix one failure category at a time. Treat the launch as the start of the improvement cycle, not the end of the project.

Lesson 4 Quiz

Launching and Evaluating — 5 questions

1. What does Stripe's Radar case (2016–2023) most clearly demonstrate about the role of a product launch in an AI project?

Correct. Stripe's explicit framing — and its year-over-year improvement in false-positive rate — demonstrates that the 2016 launch began a continuous cycle of shipping, measuring, and retraining.

Incorrect. Stripe's case demonstrates the opposite of a one-time launch: a continuous ship-measure-retrain cycle that has improved performance every year since deployment.

2. When monitoring latency for an AI pipeline, why should builders track p95 in addition to the average?

Correct. Averages obscure tail behavior. A p95 of 12 seconds means 5% of users wait 12+ seconds — a significant UX problem that an average of 2 seconds would never reveal.

Incorrect. p95 captures tail latency — the slowest 5% of requests — which averages mask. High tail latency degrades real user experience even when average latency looks acceptable.

3. How did Intercom's team building the Fin AI product structure their weekly improvement process in 2022–2023?

Correct. The single-focus iteration — fix only the top-ranked failure each week — kept improvements measurable and prevented regressions from trying to fix too many things at once.

Incorrect. Intercom's Fin team sampled 100 conversations weekly, categorized failures, and addressed only the top failure category. The discipline of fixing one thing at a time was central to their approach.

4. Which signal indicates that the problem requires workflow or UX redesign rather than a better prompt?

Correct. When output quality is high but adoption is low, the problem is trust, workflow fit, or interface design — none of which a prompt change can solve.

Incorrect. The signal for a workflow/UX problem specifically is accurate outputs + low adoption. The other options signal prompt or model problems, not integration problems.

5. What is the minimum information the lesson recommends logging for every AI decision in production?

Correct. These four elements — prompt, output, user action, timestamp — allow you to reconstruct every AI decision and debug production issues that sampling or aggregates would hide.

Incorrect. The recommended minimum is: full prompt sent, raw output received, user action taken, and timestamp. This set enables complete reconstruction of every AI decision for debugging.

Lab 4 — Launch Planning Coach

Define your launch scope, metrics, logging plan, and weekly review process.

Your Task

Describe your project and the assistant will help you define a minimum viable launch scope, select the four core metrics to track, design your logging plan, and outline a weekly failure-review process. Aim for at least three exchanges to complete the lab.

Starter prompt: "I am ready to launch my AI project that [describe it]. Help me define the launch scope, the metrics I should track, and how to set up a weekly review process."

Launch Planning Coach

Welcome to Lab 4 — the final lab. Let us plan a responsible launch for your AI project. Tell me about your project: who will use it, what it produces, and what you are most worried about. I will help you scope the initial rollout, select the right metrics, design your logging, and build a weekly review process. What is your project?

Module 8 — Module Test

Your First AI Project — 15 questions · Pass at 80% (12/15)

1. The AI-Fit Test asks whether the output is variable. What does this criterion reveal about a task?

Correct. Output variability is the core condition: AI earns its cost when the correct answer changes based on context, tone, or incomplete input.

Incorrect. Output variability determines whether AI adds value over simpler tools — fixed outputs belong in lookup tables, not language models.

2. Duolingo's 2023 GPT-4 integration achieved what measurable outcome within two months of launch?

Correct. Duolingo Max subscribers completed 20% more daily lessons — a behavioral engagement metric that validated the gap-filling rationale for the feature.

Incorrect. The measured outcome was 20% more daily lessons completed by Max subscribers compared to a control group.

3. Which problem category is considered safest for a first AI project because errors are visible and easy to catch in review?

Correct. Content generation outputs are visible and easily reviewed by humans, and their errors are correctable: the key criteria for a low-risk first project.

Incorrect. Content generation carries a low risk of catastrophic error and is easy for humans to review, both essential properties for a first AI project.

4. What is the architectural significance of Spotify's 2023 AI DJ feature relative to simpler AI deployments?

Correct. Spotify's DJ is a multi-stage pipeline — each model has defined inputs and outputs — illustrating that professional AI products are workflows, not single prompts.

Incorrect. The architectural significance is the multi-stage pipeline design — separate recommendation, language, and TTS models with defined handoffs between each stage.

5. In the five-stage workflow framework, which stage is responsible for preventing malformed or refused AI outputs from reaching end users?

Correct. Output processing parses, validates, filters, and handles malformed or refused outputs before delivery — the critical stage most first-time builders skip.

Incorrect. Output processing is where raw model responses are validated and sanitized. Skipping it is the most common first-project failure point.
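
To make the output-processing stage concrete, here is a minimal Python sketch. It assumes the model was asked to return a JSON object with a `summary` field. The function name, the refusal markers, the reason labels, and the fallback message are all illustrative assumptions; a real project would tune them to its own model and format.

```python
import json

# Hypothetical refusal phrases; a real system would tune these to its model.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")
FALLBACK = "Sorry, something went wrong. Please try again."


def process_output(raw: str, required_keys: tuple = ("summary",)) -> dict:
    """Parse and validate a raw model response before it reaches a user.

    Returns {"ok": True, "data": ...} for a valid response, otherwise
    {"ok": False, "reason": ..., "message": FALLBACK}.
    """
    text = raw.strip()
    if not text:
        return {"ok": False, "reason": "empty", "message": FALLBACK}
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not valid JSON: distinguish a refusal from plain malformed output.
        refused = any(marker in text.lower() for marker in REFUSAL_MARKERS)
        reason = "refused" if refused else "malformed"
        return {"ok": False, "reason": reason, "message": FALLBACK}
    if not isinstance(data, dict) or any(k not in data for k in required_keys):
        return {"ok": False, "reason": "missing_fields", "message": FALLBACK}
    return {"ok": True, "data": data}
```

The caller shows `data` only when `ok` is true. In every other case the user sees the fallback message, and the `reason` can be logged for later review.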

6. The Washington Post's AI sports-summary workflow in 2023 is cited as an example of what design principle?

Correct. The Post's workflow built a human-review gate into the pipeline by design — editors decided what to publish — halving drafting time while preserving accountability.

Incorrect. The Post built a deliberate human-review gate: AI drafts, editor publishes. This was designed in from the start, not added as an afterthought.

7. GitHub Copilot's engineering team found that, for improving output quality, prompt specificity was more important than which other factor?

Correct. GitHub found that a more precise system prompt on the same model improved completions more than upgrading to a newer model version did, a key finding for practitioners.

Incorrect. GitHub's consistent finding was that prompt specificity outperformed model version upgrades in improving output quality.

8. In the five-stage workflow framework, at which stage does prompt engineering primarily live?

Correct. Context assembly is where prompt engineering decisions happen — which background to include, how to structure it, what format to use for the payload.

Incorrect. Prompt engineering lives in context assembly — the stage that collects and packages the right information into the prompt for each request.

9. Notion's AI team reported that their meeting-summary prompt required eleven revision cycles to reach 90% consistency. What conclusion should a builder draw from this?

Correct. Eleven cycles is presented as normal, not exceptional. Builders who expect a production prompt on the first or second draft will be consistently disappointed.

Incorrect. The lesson is that eleven revision cycles is a normal expectation for production-grade prompts — a calibration for builders managing their own expectations.

10. When should a builder consider fine-tuning instead of expanding few-shot examples in the system prompt?

Correct. A large few-shot set does on every request, at inference time, what fine-tuning does once and permanently at training time. When the example set grows large, fine-tuning is more economical and effective.

Incorrect. The fine-tuning signal is a large example set — the examples are doing the work fine-tuning would do more efficiently and permanently.

11. Stripe's Radar system improved its false-positive rate year-over-year from 2019 through 2023. What operational practice drove this improvement?

Correct. Stripe attributes its year-over-year improvement explicitly to continuous retraining on labeled feedback, structured metric ownership, and defined escalation processes.

Incorrect. Stripe's improvement came from continuous retraining on feedback data and disciplined metric ownership — not from tooling changes or increased human review.

12. What is the specific purpose of tracking the user correction rate as a production metric for an AI augmentation tool?

Correct. User correction rate captures quality through real behavior rather than automated metrics — a rising correction rate is often the first sign of output quality degradation.

Incorrect. User correction rate measures quality through behavior: how often users must fix AI output before using it. It is often more sensitive than automated quality metrics.
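
As a rough illustration, the user correction rate can be computed from logged user actions. The Python sketch below assumes three hypothetical action labels ("accepted", "edited", "rejected") and counts only outputs the user went on to use. Both choices are assumptions, not definitions from the lesson.

```python
def correction_rate(user_actions: list[str]) -> float:
    """Share of used outputs that the user had to edit first.

    Rejected outputs are excluded from the denominator (an assumption);
    they could be tracked as a separate rejection rate.
    """
    used = [a for a in user_actions if a in ("accepted", "edited")]
    if not used:
        return 0.0
    return sum(a == "edited" for a in used) / len(used)


week = ["accepted", "edited", "edited", "rejected", "accepted"]
rate = correction_rate(week)  # 2 edited out of 4 used outputs
```

Comparing this number week over week is what makes it useful: a rate that climbs while automated metrics stay flat points to a quality problem those metrics are missing.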

13. Three signals indicate that a prompt fix alone will not resolve a problem and a more significant intervention is needed. Which of the following is one of those three signals?

Correct. Persistence across three or more revision cycles signals that the model lacks the capability or knowledge required — a prompt fix is chasing a problem that requires architectural intervention.

Incorrect. The three escalation signals are: persistent failure across 3+ revision cycles, task success below 70% after 10 iterations, and accurate outputs with low adoption.
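
The three escalation signals can be written as a simple check. The Python sketch below is illustrative only. The thresholds of three cycles, ten iterations, and 70% come from the lesson. The cut-offs for "accurate outputs" (0.9) and "low adoption" (0.3), along with all the names, are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ProjectStatus:
    cycles_with_same_failure: int  # revision cycles in which the failure persisted
    iterations: int                # total prompt iterations so far
    task_success_rate: float       # 0.0 to 1.0
    output_accuracy: float         # 0.0 to 1.0
    adoption_rate: float           # 0.0 to 1.0


def escalation_signals(s: ProjectStatus,
                       high_accuracy: float = 0.9,  # assumed cut-off
                       low_adoption: float = 0.3    # assumed cut-off
                       ) -> list[str]:
    """Return the escalation signals that currently apply."""
    signals = []
    if s.cycles_with_same_failure >= 3:
        signals.append("persistent_failure")
    if s.iterations >= 10 and s.task_success_rate < 0.70:
        signals.append("low_task_success")
    if s.output_accuracy >= high_accuracy and s.adoption_rate <= low_adoption:
        signals.append("workflow_or_ux")
    return signals
```

An empty list suggests that prompt iteration is still a reasonable next step. Any entry suggests looking beyond the prompt.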

14. The full logging requirement — prompt, output, user action, timestamp — serves what primary operational purpose?

Correct. Without full decision logs, debugging a production AI bug is guesswork. The four-element log (prompt, output, user action, timestamp) enables precise reconstruction of any failure.

Incorrect. The primary operational purpose is debugging: full logs allow you to reconstruct every AI decision and diagnose failures precisely rather than speculating.

15. Which practice, documented in Intercom's Fin development process, is recommended to prevent regressions when improving an AI product?

Correct. Intercom's Fin team fixed one failure category per week. Addressing multiple issues simultaneously makes it impossible to attribute improvements or catch regressions introduced by one change.

Incorrect. Intercom's documented practice was to fix only the top-ranked failure each iteration. Fixing everything at once makes regressions undiagnosable.
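
The weekly review loop described for Fin (sample 100 conversations, categorize failures, address only the top category) can be sketched in a few lines of Python. The sketch assumes each conversation is a dictionary whose hypothetical `failure_category` key has been filled in by a human reviewer, with `None` meaning no failure. It is not Intercom's actual tooling.

```python
import random
from collections import Counter


def top_failure(conversations: list[dict], sample_size: int = 100, seed=None):
    """Sample conversations and return the most common failure category.

    Returns a (category, count) tuple, or None if the sample has no failures.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(conversations))
    sample = rng.sample(conversations, k)
    failures = Counter(
        c["failure_category"] for c in sample if c.get("failure_category")
    )
    top = failures.most_common(1)
    return top[0] if top else None


# Example data: 100 reviewed conversations with hypothetical categories.
reviewed = (
    [{"failure_category": "hallucination"}] * 40
    + [{"failure_category": "tone"}] * 15
    + [{"failure_category": None}] * 45
)
result = top_failure(reviewed)
```

Returning only one category is deliberate. It keeps each week's change small enough that any improvement or regression can be traced to it.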