In March 2023, Duolingo announced it had integrated GPT-4 to power two features: Roleplay, which lets learners hold open-ended conversations with AI characters, and Explain My Answer, which gives personalized grammar feedback. Both features addressed a precise gap: the app had millions of users who could practice vocabulary but had no one to converse with freely at 2 a.m. The AI did not replace the curriculum; it filled a slot the curriculum genuinely could not fill with rule-based logic. Within two months, Duolingo Max subscribers completed 20% more daily lessons than the control group. The lesson from this case is structural: Duolingo defined the gap first, then found the tool.
Not every task benefits from AI. The clearest wins cluster around a set of problem characteristics. Before committing to an AI project, run your idea through four questions that practitioners call the AI-Fit Test.
1. Is the output variable? If the correct answer is always identical regardless of input, a lookup table or simple formula is faster and cheaper. AI earns its cost when outputs must adapt to tone, to context, or to incomplete information.
2. Is the pattern learnable from examples? AI systems excel at generalizing from data. If there are no examples, because the task is entirely novel or because the data has never been captured, AI struggles too.
3. Is the cost of error acceptable? A customer-service draft that occasionally sounds slightly off is fixable. An AI that miscalculates a medication dose is catastrophic. Problems where errors are high-stakes require more robust guardrails than a first project can typically provide.
4. Does the scale justify the effort? If you are doing this task twice a week, a prompt template might suffice. If you are doing it ten thousand times a day, a properly integrated pipeline pays for itself quickly.
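The four questions can be treated as a pre-project checklist. The sketch below is one hypothetical way to encode them; the `FitAnswers` fields and the `min_runs_per_day` cutoff are assumptions made for illustration, not part of any published test.

```python
from dataclasses import dataclass


@dataclass
class FitAnswers:
    output_varies: bool            # Q1: must the output adapt to the input?
    learnable_from_examples: bool  # Q2: do representative examples exist?
    error_cost_acceptable: bool    # Q3: can a wrong output be caught and fixed?
    runs_per_day: float            # Q4: how often is the task performed?


def ai_fit_test(answers: FitAnswers, min_runs_per_day: float = 50.0) -> list[str]:
    """Return the questions the idea fails; an empty list means it passes.

    min_runs_per_day is a hypothetical cutoff; pick one that fits your costs.
    """
    failed = []
    if not answers.output_varies:
        failed.append("Q1: output is fixed, so a lookup table or formula is cheaper")
    if not answers.learnable_from_examples:
        failed.append("Q2: no examples to learn from")
    if not answers.error_cost_acceptable:
        failed.append("Q3: errors are too costly for a first project")
    if answers.runs_per_day < min_runs_per_day:
        failed.append("Q4: volume is low, so a prompt template may suffice")
    return failed
```

Returning the list of failed questions, rather than a single yes or no, keeps the reason for rejecting an idea visible when the decision is revisited later.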
The best AI projects solve a real bottleneck. They do not automate a task that nobody cared about and they do not replace human judgment in domains where that judgment is the entire value. They fill a specific, documented gap.
For a first AI project, four categories offer favorable risk-to-reward ratios because they have clear success metrics, tolerate moderate error rates, and produce outputs humans can quickly review.
Drafting product descriptions, summarizing reports, generating FAQ answers from documentation. Errors are visible and easy to catch in review.
Sorting support tickets by category, tagging customer sentiment, routing emails to the right team. Measurable accuracy and low catastrophic risk.
Pulling structured data from unstructured text, such as names, dates, and entities from contracts or emails. Downstream validation is straightforward.
Answering questions from a defined knowledge base, booking workflows, guided troubleshooting. Scope is bounded; off-topic answers can be redirected.
Every project needs a measurable definition of done. Vague goals ("make our support better with AI") produce vague projects. Strong goal statements specify the metric, the baseline, and the target. For example: Reduce average email drafting time from 8 minutes to under 3 minutes for customer-success agents handling tier-1 tickets, measured over 30 days post-launch.
In 2024, Klarna set a similarly specific objective when deploying its AI assistant: handle 67% of customer-service chats without human escalation within 30 days. The company published that it hit the target within the first month, attributing the success partly to how precisely the objective was scoped: the bot handled only specific transaction-related queries, not general complaints.
Your success definition should include: the task being automated or augmented, the metric you will track, the baseline (current performance), the target, and who owns measuring it.
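One way to keep a success definition honest is to write it down as a small record that can be checked against observed numbers. The following is a minimal sketch using the email-drafting example from above; the class name, field names, and the owner value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SuccessDefinition:
    task: str
    metric: str
    baseline: float
    target: float
    owner: str
    lower_is_better: bool = True

    def is_met(self, observed: float) -> bool:
        """Compare an observed value with the target in the right direction."""
        if self.lower_is_better:
            return observed < self.target
        return observed > self.target


goal = SuccessDefinition(
    task="email drafting for tier-1 tickets",
    metric="average drafting time in minutes over 30 days post-launch",
    baseline=8.0,
    target=3.0,
    owner="customer-success team lead",  # hypothetical owner
)
```

With this in place, `goal.is_met(2.5)` evaluates to true and `goal.is_met(8.0)` to false, so the launch review reduces to a comparison rather than a debate.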
Good problem selection is a filter, not a formula. Run the AI-Fit Test, pick a category with visible outputs, and write a measurable success definition before touching a single line of code or crafting a single prompt. The project you do not build because it fails the fit test is as valuable as the one you build correctly.
Describe a real or hypothetical task you would like to automate or augment with AI. The assistant will guide you through the four AI-Fit Test questions and help you write a measurable success definition. Aim for at least three exchanges to complete the lab.
In February 2023, Spotify launched its AI DJ feature in the United States and Canada. The feature, built on OpenAI technology, generated personalized spoken commentary between tracks, mimicking the patter of a radio host who knows your listening history. What made the engineering distinctive was not the language model itself but the pipeline surrounding it. Spotify's system pulled a user's listening data, passed it through a recommendation model to choose tracks, sent structured context (genre, mood, listening history, time of day) to the language model to generate commentary, and routed the output through a text-to-speech voice model trained on a real DJ's voice. Each stage had a defined input format, a defined output format, and a fallback behavior if any stage failed. Users experienced a seamless product; underneath was a carefully sequenced workflow with no single magical black box.
Every working AI project, from a simple email-drafting tool to Spotify's DJ, moves through the same five stages. Understanding each stage prevents the most common architectural mistakes.
Most first-project failures happen at Stage 4, output processing. Builders test the happy path (model returns clean JSON) but do not handle the case where the model returns partial JSON, appends an explanation, or refuses to answer. Always write explicit output-validation logic before deploying.
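The three failure cases named above (partial JSON, an appended explanation, a refusal) can each be handled in a few lines. This is a minimal sketch using only the Python standard library; the `REQUIRED_KEYS` schema and the function name are assumptions chosen for illustration.

```python
import json
from typing import Optional

REQUIRED_KEYS = {"category", "summary"}  # hypothetical output schema


def parse_model_output(raw: str) -> Optional[dict]:
    """Return the validated JSON object, or None if the output is unusable."""
    start = raw.find("{")
    if start == -1:
        return None  # refusal or plain prose: no JSON object at all
    try:
        # raw_decode parses the first JSON value and ignores trailing text,
        # which covers the case where the model appends an explanation.
        obj, _end = json.JSONDecoder().raw_decode(raw[start:])
    except json.JSONDecodeError:
        return None  # partial or truncated JSON
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return None  # valid JSON, but not the shape the pipeline expects
    return obj
```

A `None` result is the signal for the pipeline's fallback behavior (retry, canned response, or human handoff); the validator itself should not decide which.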
Not every AI workflow should operate fully autonomously. The question is not whether to include humans, but where in the workflow human review adds the most value relative to its cost.
In 2023, the Washington Post's internal AI tool for generating first-draft summaries of statistical sports stories went through an explicit human-review gate before any summary was published. Editors could accept, edit, or reject the draft. The workflow was not "AI does it automatically"; it was "AI drafts, human publishes." That checkpoint halved the drafting time without removing editorial accountability. The human review step was not an afterthought; it was designed into the workflow from the start.
For your first project, draw a simple diagram with boxes and arrows. Label each box as automated or human-reviewed. If any automated box produces output that could be wrong in a costly way, move a human-review gate upstream of the consequence.
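The boxes-and-arrows diagram can also be checked mechanically. The sketch below assumes the end of the pipeline is where the consequence lands (for example, publication), and flags any automated, costly stage that has no human-review gate after it. The `Stage` class and the stage names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    human_reviewed: bool = False
    costly_if_wrong: bool = False


def unguarded_stages(pipeline: list[Stage]) -> list[str]:
    """Names of automated, costly stages with no human-review gate after them."""
    unguarded = []
    for i, stage in enumerate(pipeline):
        if stage.costly_if_wrong and not stage.human_reviewed:
            gate_follows = any(later.human_reviewed for later in pipeline[i + 1:])
            if not gate_follows:
                unguarded.append(stage.name)
    return unguarded


# A draft-then-publish workflow with an editor in between.
workflow = [
    Stage("draft_summary", costly_if_wrong=True),
    Stage("editor_review", human_reviewed=True),
    Stage("publish"),
]
```

For `workflow` as written the function returns an empty list; removing the `editor_review` stage makes it return `["draft_summary"]`.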
Every stage of the workflow can fail. The AI model may time out. The retrieved document may be stale. The output may fail validation. Good workflow design specifies, for each failure mode: what the system will do (retry, fallback, alert), who will be notified, and what the user will see.
A simple framework: for each stage, write one sentence answering "If this stage fails, the system will ___." Systems that cannot answer that question for every stage are not ready to deploy.
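The one-sentence answer per stage translates naturally into a small wrapper: retry a few times, then alert someone and fall back to a safe default. This is a sketch under stated assumptions; `run_stage`, its parameters, and the default of two retries are illustrative choices, and a real system would likely catch narrower exception types.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_stage(
    name: str,
    action: Callable[[], T],
    fallback: T,
    retries: int = 2,
    alert: Callable[[str], None] = print,
) -> T:
    """If this stage fails, the system will retry, then alert and use the fallback."""
    last_error: Optional[Exception] = None
    for _attempt in range(1 + retries):
        try:
            return action()
        except Exception as exc:  # broad on purpose for the sketch
            last_error = exc
    alert(f"stage '{name}' failed after {1 + retries} attempts: {last_error}")
    return fallback
```

The `fallback` value is what the user sees when the stage cannot recover, and the `alert` callable is who gets notified, so the three parts of the failure specification each map to one argument.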
Design the workflow as a sequence of stages with defined inputs, outputs, and failure behaviors, not as a single prompt. Place human checkpoints where errors would be costly. Before building, you should be able to draw the full pipeline on a whiteboard in under five minutes.
Describe the AI project you scoped in Lab 1 (or a new one). The assistant will help you map each of the five workflow stages, decide where human review should sit, and write a one-sentence failure response for each stage. Aim for at least three exchanges.
When GitHub released Copilot as a technical preview in June 2021, the underlying system prompt (the set of instructions given to the Codex model before any user code appeared) specified the model's role as an AI pair programmer, defined the expected output format (code completions rather than explanations), and instructed it to infer the programmer's intent from surrounding context. Over the following two years of iterative refinement, GitHub's engineering team made one finding consistent enough to publish: the specificity of the system prompt had a larger effect on output quality than switching between model versions. A more precise role description and output format specification produced meaningfully better completions from the same underlying model. This finding is now a standard reference point in enterprise AI deployment discussions.
A well-structured system prompt for a production AI workflow contains four elements in roughly this order:
A reliable template: [Role sentence] → [Task and format instructions] → [Constraints list] → [Context placeholder marker]. Keep the role sentence under 30 words. List constraints as numbered items so they are easy to audit. Mark the context placeholder with a distinctive token like {{TICKET_TEXT}} that your code replaces at runtime.
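To make the template concrete, here is a minimal sketch of a four-element prompt and the runtime substitution of its placeholder. The prompt wording, the constraints, and the `render_prompt` helper are all hypothetical; only the {{TICKET_TEXT}} token comes from the text above.

```python
PLACEHOLDER = "{{TICKET_TEXT}}"

SYSTEM_PROMPT_TEMPLATE = """\
You are a support assistant who drafts replies to tier-1 customer tickets.

Task: write a reply to the ticket below. Format: plain text, under 120 words.

Constraints:
1. Do not promise refunds.
2. Do not add disclaimers.
3. If the ticket is out of scope, reply with the single word ESCALATE.

Ticket:
{{TICKET_TEXT}}
"""


def render_prompt(template: str, ticket_text: str) -> str:
    """Replace the placeholder, failing loudly if the template is malformed."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError(f"template must contain {PLACEHOLDER} exactly once")
    return template.replace(PLACEHOLDER, ticket_text)
```

Checking that the placeholder appears exactly once guards against a quiet failure mode: a template edited during iteration that no longer receives any context at all.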
No prompt is right on the first draft. The professional standard, documented in Anthropic's 2023 prompt engineering guide and echoed in OpenAI's production cookbook, is to evaluate prompts against a test set of representative inputs before deployment. A test set of 20 to 50 varied examples, covering typical cases, edge cases, and adversarial cases, will surface most prompt failures before they reach users.
The iteration loop is: write prompt → run test set → identify failure patterns → revise the prompt element responsible for the failure → repeat. Common failure patterns include: the model ignoring format instructions when the input is long (fix: move format instructions to the end of the prompt, closer to the output); the model adding caveats not requested (fix: add an explicit "do not add disclaimers" constraint); the model misidentifying the task when the input is ambiguous (fix: add an example in the system prompt).
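The "run test set, identify failure patterns" steps of the loop can be sketched as a small harness. Everything here is an assumption for illustration: `run_model` stands in for whatever function calls your model, and `check` is a function you write that returns a failure label or `None` for a passing output.

```python
from collections import Counter
from typing import Callable, Optional


def evaluate(
    test_set: list[dict],
    run_model: Callable[[str], str],
    check: Callable[[dict, str], Optional[str]],
) -> tuple[float, Counter]:
    """Run every test case and return (pass rate, failure-pattern counts)."""
    if not test_set:
        raise ValueError("test set must not be empty")
    failures: Counter = Counter()
    for case in test_set:
        output = run_model(case["input"])
        label = check(case, output)  # None means the output passed
        if label is not None:
            failures[label] += 1
    pass_rate = 1 - sum(failures.values()) / len(test_set)
    return pass_rate, failures
```

Calling `failures.most_common(1)` on the result gives the pattern to address in the next revision; re-running the same test set after the change shows whether the fix helped without breaking other cases.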
In 2023, Notion's AI team reported that their prompt for generating structured meeting summaries required eleven revision cycles before output consistency exceeded 90% on their internal test set. Eleven cycles is not unusual; it is the norm for production-grade prompts.
For tasks with consistent output formats, including two to three worked examples directly in the system prompt typically improves output quality more than any other single intervention. The examples serve as a format specification that the model can pattern-match against, even when the instructions alone are ambiguous. Each example should include a realistic input and the exact output format you want the model to produce.
Keep examples under 200 words each. If your example set grows much larger, consider fine-tuning rather than few-shotting: the examples are trying to do the work that a fine-tuned model would internalize permanently.
Write your system prompt with four explicit sections: role, task and format, constraints, and context placeholder. Evaluate it against a test set of 20 to 50 examples before deployment. Expect ten or more revision cycles for a production-grade prompt. Include two to three worked examples for format-sensitive tasks.
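A worked-examples section can be assembled from input and output pairs, with the 200-word guideline enforced in code. This is a hypothetical helper; the "Example N / Input / Output" layout is one reasonable format rather than a required one.

```python
def build_few_shot_section(examples: list[tuple[str, str]], max_words: int = 200) -> str:
    """Format (input, output) pairs for inclusion in a system prompt."""
    parts = []
    for i, (example_input, example_output) in enumerate(examples, start=1):
        words = len(example_input.split()) + len(example_output.split())
        if words > max_words:
            raise ValueError(f"example {i} has {words} words; limit is {max_words}")
        parts.append(f"Example {i}\nInput: {example_input}\nOutput: {example_output}")
    return "\n\n".join(parts)
```

The output of each example should match the exact format you want from the model, since the model tends to copy the pattern it sees more faithfully than the instructions it reads.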
Describe your project's task and the assistant will help you draft a four-element system prompt: role, task and format, constraints, and context placeholder. Then test it against two edge cases and revise based on what you find. Aim for at least three exchanges.
Stripe launched Radar, its machine-learning fraud-detection system, in 2016. What distinguishes Radar's operational history is the discipline of its evaluation framework. From the beginning, Stripe published the key metrics it tracks: fraud rate, dispute rate, and false-positive rate (legitimate transactions incorrectly blocked). Each metric has a clear owner. Each degradation triggers a defined escalation process. By 2023, Stripe reported that Radar blocked over $4 billion in fraud annually, and its false-positive rate had declined year-over-year since 2019, a result the company attributes explicitly to continuous model retraining driven by labeled feedback data. The system ships, measures, retrains, and ships again. The launch in 2016 was not the product; it was the starting condition.
A minimum viable launch for an AI project means deploying with the smallest user base needed to collect meaningful signal, not the full user population. Start with a small internal group, a single team, or a beta cohort of willing users. This limits exposure to errors while generating real behavioral data that synthetic testing cannot replicate.
Define the launch scope in advance: how many users, which use cases are in scope, what monitoring will run, and what threshold of errors or complaints triggers a rollback. A rollback plan is not pessimism; it is the evidence that you have thought seriously about production behavior.
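Writing the launch scope down as data makes the rollback decision mechanical rather than a judgment call made under pressure. The sketch below is illustrative only; the field names and the example thresholds are assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchScope:
    max_users: int
    use_cases: tuple[str, ...]
    max_error_rate: float   # fraction of requests that may fail
    max_complaints: int     # complaints tolerated in the review window


def should_roll_back(scope: LaunchScope, errors: int, requests: int, complaints: int) -> bool:
    """True when either the error-rate or the complaint threshold is exceeded."""
    if complaints > scope.max_complaints:
        return True
    if requests == 0:
        return False  # no traffic yet, so no error rate to judge
    return errors / requests > scope.max_error_rate


beta = LaunchScope(
    max_users=25,
    use_cases=("tier-1 email drafts",),
    max_error_rate=0.05,
    max_complaints=3,
)
```

Here `should_roll_back(beta, errors=2, requests=100, complaints=0)` is false, while six errors in the same hundred requests would trigger the rollback.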
The exact metrics depend on the project type, but four categories apply to nearly every production AI deployment:
Task success rate: What percentage of AI outputs were used without modification or rejection? Track this per output type. A declining rate signals prompt drift or data distribution shift.
Latency: How long does the full pipeline take from input to delivered output? Model calls add latency. Monitor p50 and p95 (the median and the 95th percentile), not just the average.
Error rate: How often does the pipeline fail (API timeouts, output validation failures, fallback activations)? A rising error rate is the earliest warning of a system problem.
User correction rate: How often do users edit, reject, or override AI output before using it? For augmentation tools, this is often the most sensitive quality signal available.
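The advice to watch p50 and p95 rather than the average is easy to see with a small calculation. The following sketch uses the nearest-rank definition of a percentile (one of several common definitions) and made-up latency values.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integers
    return ordered[rank - 1]


# Hypothetical sample: nine fast requests and one very slow one.
latencies_ms = [100.0] * 9 + [5000.0]
mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
```

For this sample the mean is 590 ms, which describes no request that actually happened: the median is 100 ms and the 95th percentile is 5000 ms. Tracking both percentiles shows the typical experience and the slow tail separately.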
Log enough information to reconstruct every AI decision: the full prompt sent, the raw output received, the user action taken, and a timestamp. Without this log, post-deployment debugging is guesswork. Storage costs are far lower than the cost of a production bug you cannot diagnose.
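A log record with the four fields named above can be written as one JSON line per AI decision. This is a minimal sketch with hypothetical field names; a production system would likely add identifiers such as a request ID and the prompt version.

```python
import json
from datetime import datetime, timezone


def make_log_record(prompt: str, raw_output: str, user_action: str) -> str:
    """Serialize one AI decision as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,            # the full prompt sent, after substitution
        "raw_output": raw_output,    # exactly what the model returned
        "user_action": user_action,  # e.g. "accepted", "edited", "rejected"
    }
    return json.dumps(record, ensure_ascii=False)
```

Storing the raw output rather than the parsed result matters: when validation fails in production, the unparsed text is usually the only evidence of what went wrong.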
The fastest-improving AI products close the loop between production behavior and prompt or model updates on a short cycle. In practice this means: reviewing a sample of outputs weekly (not monthly), categorizing failure patterns, updating the prompt or fine-tuning data to address the top failure category, and re-evaluating against the test set before re-deploying.
In 2022 and 2023, the team building Intercom's AI support product, Fin, documented a disciplined weekly review process: a small team reviewed 100 randomly sampled conversations, tagged them by failure type (hallucination, off-topic, format error, tone error), and produced a ranked list of the most common failure modes. The following week's prompt update addressed only the top-ranked failure. This single-focus iteration kept improvements measurable and prevented the common trap of trying to fix everything at once and breaking what was already working.
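The sample, tag, and rank steps of such a weekly review fit in a few lines. This is a sketch of the general technique, not a description of any company's tooling; the function names are hypothetical and the tagging itself remains a human task.

```python
import random
from collections import Counter
from typing import Optional


def weekly_sample(conversations: list[str], k: int = 100, seed: Optional[int] = None) -> list[str]:
    """Draw up to k conversations uniformly at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(conversations, min(k, len(conversations)))


def rank_failures(tags: list[Optional[str]]) -> list[tuple[str, int]]:
    """Rank failure types by frequency; None marks a conversation with no failure."""
    return Counter(tag for tag in tags if tag is not None).most_common()
```

The first entry of `rank_failures(...)` is the single failure mode to address in the next prompt update; the rest of the list waits for later weeks.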
Three signals indicate that a prompt fix alone will not solve the problem and a more significant intervention (fine-tuning, retrieval augmentation, or workflow redesign) is needed:
Launch small with a defined rollback plan. Track task success rate, latency, error rate, and user correction rate from day one. Log every AI decision in full. Run a weekly sample review and fix one failure category at a time. Treat the launch as the start of the improvement cycle, not the end of the project.
Describe your project and the assistant will help you define a minimum viable launch scope, select the four core metrics to track, design your logging plan, and outline a weekly failure-review process. Aim for at least three exchanges to complete the lab.