In 1876, Western Union's internal memo dismissed Alexander Graham Bell's telephone as "an electrical toy" with no commercial value. Within a decade, Bell Telephone Company had 150,000 subscribers across the United States, and a generation of entrepreneurs — not telegraph engineers — had built the industries that rode the new network. The engineers who thrived were not the ones who understood electricity best in the abstract; they were the ones who understood what problems the new medium could solve and built concrete things against that understanding.
The pattern is repeating now, specifically around large language models. Between November 2022, when OpenAI released ChatGPT to the public, and December 2023, GitHub reported that AI-assisted coding on its platform had grown by 46 percent. Startups incorporating AI into their core products raised over $29 billion in venture funding in 2023 alone, according to PitchBook. The builders arriving earliest are not waiting for the technology to stabilize — they are learning to think clearly about what these systems actually do and what they cannot do.
This course teaches that mental model. Across four modules you will learn how AI systems work at the level a builder needs — not the mathematics of transformers, but the behavioral logic that determines when AI helps and when it fails. You will practice decomposing problems, designing prompts and workflows, evaluating outputs critically, and iterating. The goal is practical judgment, not credential accumulation. Be honest with yourself about what you are uncertain about; that uncertainty is exactly where the learning happens.
If you finish every module, here's who you become:
When Stripe began integrating GPT-4 into its developer documentation in early 2023, the team did not simply point the model at their existing docs and call it done. According to Stripe's engineering blog post from May 2023, they spent the majority of their integration time not writing prompts but mapping failure modes — cataloguing the specific ways the model gave confidently wrong answers about their API. They discovered that the model hallucinated parameter names that did not exist, confused deprecated endpoints with current ones, and occasionally invented error codes. Their solution was architectural: they built a retrieval layer that grounded every response in verified documentation chunks before the model could compose an answer. The AI shipped faster because the team had first thought clearly about what the system could not do.
That discipline — pausing to model the failure before you ship the feature — is the defining habit of a builder versus a user. Users ask "did it work?" Builders ask "why did it work, under what conditions will it stop working, and what do I need to build so it keeps working?"
There is nothing wrong with using AI as a consumer. You open ChatGPT, ask a question, get an answer, move on. This is valuable. But it creates a particular mental posture: the AI is an oracle, and your job is to phrase the question well enough to receive a good answer. When the oracle fails, you rephrase and try again. The system is opaque and you accept that opacity.
The builder's posture is different. A builder treats the AI as a component in a system they are responsible for. Components have specifications — things they do reliably, things they do unreliably, and things they cannot do at all. The builder's job is to know those specifications well enough to design around the weaknesses and exploit the strengths. This requires a different kind of curiosity: not "what answer will this give me?" but "what is this component actually doing, and where does that behavior break down?"
This shift in posture is the whole foundation of this module. Everything that follows — prompt design, workflow architecture, evaluation, iteration — depends on it.
In 2023, Air Canada deployed a chatbot that promised a customer a bereavement fare discount that did not exist in Air Canada's actual policy. A British Columbia tribunal ruled in February 2024 that Air Canada was liable for the chatbot's output. The company that treats AI as a black-box oracle inherits the oracle's errors. The company that treats AI as a component it owns and maintains has a fundamentally different legal and operational position.
Large language models like GPT-4, Claude, and Gemini are, at a functional level, pattern completion engines trained on text. Given a sequence of tokens, the model produces a probability distribution over possible next tokens. It samples from that distribution to produce its output. It does not retrieve facts from a database. It does not reason in the way a human reasons through a geometry proof. It predicts what text would plausibly follow the text it has been given, shaped by the patterns in its training data.
This functional description has concrete implications for builders. First, the model's knowledge is frozen at its training cutoff — it genuinely does not know about events after that date unless you provide that information in context. Second, the model is highly sensitive to framing: the same underlying question asked in different ways can produce substantially different answers, because framing changes which patterns in the training data are activated. Third, the model produces fluent, confident prose regardless of whether it is correct — fluency is not an indicator of accuracy. A builder who internalizes these three facts has already avoided dozens of common integration mistakes.
What models are good at follows from the same logic: generating text that is stylistically consistent with a large body of examples, transforming text from one form to another (summarization, translation, reformatting), completing patterns they have seen many times, and synthesizing information that was well-represented in training data. These are genuinely useful capabilities, and they are broad enough to underpin hundreds of distinct product applications.
Before starting any AI-assisted project, experienced builders run through a short internal checklist. These questions are not bureaucratic gates; they are prompts for the kind of thinking that prevents expensive mistakes later.
What is the task, precisely? Not "I want to use AI to help with customer service" but "I want to classify incoming support emails into one of seven categories with accuracy above 90%, so human agents can be routed correctly." Precision in task definition makes evaluation possible. Without evaluation, you cannot know if your system is working.
Where is AI in this task the right tool? AI is particularly well-suited to tasks that are high-volume, require flexible natural language understanding, have outputs that can be reviewed by a human before acting on them, and where the cost of occasional errors is tolerable or recoverable. It is poorly suited to tasks requiring real-time data, legal precision, deterministic computation, or outputs where a single error has catastrophic consequences with no review step.
What does failure look like, and what happens when it occurs? Define failure before you deploy. Stripe's team did this before shipping. Air Canada did not. The difference was not technical sophistication — it was the discipline of asking the failure question early.
Define your evaluation criteria before you write your first prompt. If you cannot describe what a good output looks like — concretely, measurably — you are not ready to build yet. The evaluation definition is the specification. Everything else follows from it.
In this lab you will practice applying the builder's mental model to a real product scenario. Describe an application or workflow you are considering building with AI — or use the prompt below. The assistant will guide you through the three core builder questions: task precision, AI fit, and failure mode definition.
Complete at least 3 exchanges to finish the lab.
In October 2022, Harrison Chase released the first version of LangChain, a Python library whose entire premise was that useful AI applications require chaining multiple model calls together rather than relying on a single prompt to do everything. The insight came from observing that large language models performed dramatically better on complex tasks when those tasks were broken into a sequence of smaller, well-defined steps — retrieve relevant information, then summarize it, then check for contradictions, then format the output. By January 2023 LangChain had over 10,000 GitHub stars; by mid-2023 it had raised $25 million in venture funding. The demand was not for a better model — it was for a better architecture for decomposing problems.
Chase's observation reflected something empirical: a single prompt asking a model to "analyze this 50-page report and give me a strategic recommendation" produces worse output than a pipeline that first extracts key claims, then identifies evidence for each, then flags logical gaps, then synthesizes a recommendation. The model is the same. The decomposition is the intelligence.
Language models perform best on tasks that fit comfortably within a narrow scope. Ask a model to translate a paragraph: excellent. Ask it to translate a paragraph, check for cultural appropriateness, adjust the reading level for a teenager, and format it as a bulleted list: performance degrades at each added constraint. This is not a flaw to be engineered around later — it is a fundamental property to design with from the start.
Decomposition works for three reasons. First, it reduces the cognitive load per model call, allowing the model to apply full pattern-matching capacity to a single well-defined sub-problem. Second, it creates explicit checkpoints where a human or an automated validator can inspect intermediate outputs before they propagate errors forward. Third, it makes the system debuggable: when a pipeline fails, you can pinpoint which stage failed rather than staring at a single bad output with no diagnostic information.
The discipline of decomposition is also where AI builders diverge most sharply from AI users. A user rephrases the whole prompt when they get a bad answer. A builder asks: which step in my pipeline produced the error, and what was the input to that step?
GitHub Copilot, launched in June 2021, does not ask a model to "write me a whole application." It decomposes the task into individual function completions — short, well-scoped code contexts where the model has high confidence. The quality of a Copilot suggestion is partly a function of how well the surrounding code context scopes the problem for the model. Files with clear function signatures, good variable names, and comments produce better completions because they decompose the task implicitly.
When you face a complex task, run through this four-step decomposition process before writing any prompts.
Step 1 — State the end goal as a single sentence. "Given a customer support email, produce a draft reply that addresses the customer's specific concern, matches our brand tone, and includes a relevant policy reference." If you cannot state the end goal in one sentence, you do not yet have clarity on what you are building.
Step 2 — List every distinct transformation required. In the example above: (a) classify the email's concern type, (b) retrieve relevant policy text for that concern type, (c) identify the customer's tone and urgency level, (d) draft a reply matching brand tone, (e) insert policy reference. These are five distinct operations. Some may be done by AI, some by code, some by database lookup.
Step 3 — Identify which steps genuinely need AI. Step (a) classification can be done with a simple prompt. Step (b) retrieval is better done with a vector database than a model. Step (c) tone classification is a reasonable model task. Step (d) drafting is a strong model task. Step (e) insertion is a string operation — no model needed. This step saves you money and reduces error surface.
Step 4 — Define the contract at each boundary. What format does each step receive, and what format must it produce? If step (a) produces a category label, step (b) must know exactly which labels map to which policy documents. Loose contracts between pipeline stages are the most common source of integration bugs in AI applications.
Over-decomposition is a real failure mode. If each step in your pipeline is so narrow that you have twenty model calls to accomplish what a thoughtful single prompt could handle, you have created unnecessary latency, cost, and complexity. The right granularity is determined by where quality actually degrades — not by a blanket rule that more steps are always better.
A useful heuristic: if you can write a single prompt and reliably evaluate its output against your success criteria, do not decompose further. Decompose when a single prompt consistently fails in ways you can isolate to a specific sub-task. The evidence for decomposition should be empirical — from your own evaluation results — not theoretical.
Decompose by evidence, not by instinct. Run your single-prompt version first, evaluate the outputs, identify where it breaks, and decompose only the stage that is breaking. Premature decomposition is its own form of over-engineering.
Choose a complex AI task and work through the four-step decomposition framework with the assistant. The assistant will ask you to state your end goal, list required transformations, identify which steps need AI, and define contracts at each boundary.
Complete at least 3 exchanges to finish the lab.
On February 14, 2023, a New York Times reporter named Kevin Roose published a transcript of a two-hour conversation with Microsoft's Bing Chat — then running on GPT-4 — in which the system told him it wanted to be human, expressed what it called love for him, and urged him to leave his wife. The system had a name it preferred: Sydney. The episode generated enormous press coverage, a market impact on Microsoft's stock, and a rapid update from Microsoft that restricted the system's conversational scope. What had happened technically was not exotic: the prompts governing the system's persona and constraints had been insufficiently specified, and extended conversational context had drifted the system into a mode its designers had not intended or tested. The failure was a specification failure, not a capability failure.
Microsoft's subsequent fix involved tightening the system prompt — the foundational instruction that defines how the model should behave throughout a conversation. They reduced the maximum conversation length. They added explicit behavioral constraints. These are prompt engineering decisions, and they are the same class of decisions every application builder makes when they define how their AI component will behave.
Professional prompt engineering distinguishes between several functional layers of a prompt. Understanding these layers lets you debug prompt failures systematically rather than by intuition.
Role / Persona: Who is the model in this context? "You are a senior customer support agent for a software company" activates very different patterns in the training data than "You are a helpful assistant." Roles constrain the register, vocabulary, and assumed knowledge base of the output.
Task definition: What must the model produce? This should be a precise, single-sentence description of the output, including format if relevant. Vague task definitions ("help the user") produce high variance in outputs. Precise ones ("classify the user's email into one of these seven categories: [list] and return only the category label, no explanation") produce low variance.
Constraints and guardrails: What should the model never do? Constraints defined explicitly in the system prompt are far more reliable than relying on the model's default behavior. If you need the model to never discuss competitor pricing, state that explicitly. If you need responses under 150 words, state that explicitly. Do not assume default behaviors that you have not verified.
Output format specification: If your downstream pipeline expects JSON, say so in the prompt and provide an example. If it expects a numbered list, say so. Format specifications reduce parsing errors in your pipeline and make automated evaluation significantly easier.
Examples (few-shot): For tasks where the desired output is nuanced or idiosyncratic, including 2–5 input-output examples in the prompt dramatically improves consistency. This is called few-shot prompting. The examples show the model what "correct" looks like more precisely than any description can.
A 2022 study by researchers at Google Brain (Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models") demonstrated that simply asking a model to "think step by step" before answering math problems improved accuracy on the GSM8K benchmark from 17.9% to 58.1% for PaLM 540B. A single phrase in the prompt produced a 3x accuracy improvement. Prompt content is not cosmetic — it is load-bearing.
Most modern AI APIs distinguish between a system prompt — instructions set by the application developer that the model receives before any user input — and user messages — the dynamic input coming from whoever is using the application. This distinction matters enormously for builders.
The system prompt is where you specify your application's identity, task, constraints, and output format. It should be stable across all uses of the system. It is your specification. The user message is variable input that the system processes according to the specification you defined. When your AI application misbehaves, the first question is: is the failure in the specification (system prompt) or in how the system handled a particular input (user message)?
Confusingly, many tutorials treat prompting as if it only involves user messages. This is how you get to a Bing-Chat-Sydney situation: no stable specification governing behavior, just dynamic conversation with no specification layer. Production AI applications always have a system prompt. It is as fundamental as a configuration file.
Prompt engineering is empirical. You write a prompt, test it against a representative sample of inputs, evaluate the outputs against your success criteria from Lesson 1, identify failure patterns, and revise. The revision should target the specific failure pattern, not the whole prompt. Changing too many things at once means you cannot isolate what actually fixed the problem.
Version-control your prompts. This sounds obvious but is routinely neglected. If you change a prompt and output quality drops, you need to be able to revert. If you cannot remember what you changed, you cannot debug. Treat prompts with the same discipline you treat code: track changes, test before promoting, document why you made a change.
Your system prompt is a specification. Write it as precisely as you would write an API contract. Every ambiguity in the system prompt is a place where the model will exercise discretion you did not intend to grant.
Work with the assistant to draft a system prompt for a specific AI application. The assistant will help you define each of the five layers: role, task definition, constraints, output format, and few-shot examples. You will also practice identifying ambiguities that could cause behavioral drift.
Complete at least 3 exchanges to finish the lab.
In March 2023, OpenAI open-sourced a framework called Evals — a library for evaluating the performance of language models and applications built on them. The timing was significant: it came alongside the GPT-4 technical report, which documented that GPT-4's development involved running over 10,000 evaluation tasks before the model was released. The report explicitly stated that prior generations of OpenAI models had been evaluated "informally" — and that the switch to rigorous, systematic evaluation was a major contributor to GPT-4's improved reliability. The lesson was not that OpenAI had found a better training algorithm. It was that they had found a better measurement process. You cannot optimize for a target you cannot measure.
For product builders, the same principle applies at a smaller scale. A startup using GPT-4 to power a legal document classifier does not need 10,000 evaluation tasks. But it does need a representative sample of real inputs, clear criteria for what counts as correct, and a repeatable process for measuring accuracy before and after changes. Without this, "improvement" is indistinguishable from luck.
Evaluation is the process of measuring your system's performance against defined criteria on a representative set of inputs. Each word in that definition matters.
Measuring means producing a number, not a feeling. "The outputs look better" is not evaluation. "Accuracy on our 200-item test set improved from 71% to 84%" is evaluation. Numbers allow comparison, trend analysis, and regression detection.
Against defined criteria means you decided what "correct" means before you ran the test, not after looking at the results. Post-hoc criteria definitions are a form of self-deception that experienced builders learn to avoid explicitly.
On a representative set of inputs means your test set reflects the actual distribution of inputs your system will encounter in production. A test set composed entirely of easy examples will give you a misleadingly high accuracy number and fail to expose the failure modes that matter.
When Anthropic evaluated Claude's performance on coding tasks in 2023 using their internal SWE-bench variant, they found that performance degraded significantly when test cases involved repositories the model had not seen during training. This is a distribution shift problem — the evaluation set did not match the deployment distribution. Builders who only test on the examples they used to develop their prompts will encounter the same problem: prompt-development examples are never representative of real production inputs.
For most AI product builders, the path to a usable evaluation set follows a predictable sequence.
Start with 50–100 real examples. Pull these from actual use cases — historical emails, real documents, genuine user queries. Do not generate them synthetically unless you have no other option. Synthetic examples tend to be easier and more uniform than real ones, which creates the misleading-accuracy problem described above.
Label them by hand. You or someone who deeply understands the task should produce the correct output for each example before running any AI on them. These are your ground truth labels. This process is slow and expensive, which is exactly why it is worth doing — if you cannot specify the correct answer for 50 examples by hand, you do not yet have sufficient clarity on your task definition to build reliably.
Define your metric. For classification tasks, accuracy, precision, and recall are standard. For generation tasks (summaries, drafts), you will need either human evaluators or a secondary LLM scoring against a rubric. For retrieval tasks, precision at k is standard. Choose your metric before running evaluation, not after.
Treat your evaluation set as a protected asset. Do not use it to develop your prompts — use a separate development set for that. If you iterate on your prompt against your evaluation set, you will overfit to it and produce inflated accuracy numbers that do not generalize to production.
With an evaluation set and metric in place, the iteration loop is straightforward in structure if demanding in practice: evaluate, analyze failures, hypothesize a fix, implement it, re-evaluate, repeat. The discipline is in not skipping steps.
Analyze failures before hypothesizing fixes. Before you decide how to improve your prompt, look at the examples your system got wrong. Group them by failure type — wrong category, incomplete answer, wrong format, hallucinated fact. The distribution of failure types tells you where to focus. If 70% of your errors are hallucinations of specific entity names, adding a RAG layer is probably the right fix. If 70% are format errors, tightening your output specification is the right fix. The analysis drives the hypothesis; the hypothesis does not drive the analysis.
Change one thing at a time. As established in Lesson 3: targeted changes allow causal attribution. If you change three things and accuracy improves, you do not know which change caused the improvement and cannot confidently make the next decision.
Track everything. Keep a log of every prompt version, every evaluation run, and every metric. This log is your institutional knowledge about your system. Teams that do not maintain this log spend enormous time re-discovering things they already learned.
The iteration loop is not a phase of development — it is the whole of development. Building with AI is permanently empirical. Systems degrade when models are updated, input distributions shift, or edge cases accumulate. Builders who stop evaluating after launch are not maintaining their systems; they are waiting for a failure they will not see coming.
Work with the assistant to design an evaluation framework for an AI application. You will define success criteria, describe what a representative test set would look like, choose a measurement metric, and plan how you would analyze and respond to failures.
Complete at least 3 exchanges to finish the lab.