Module 2 · Lesson 1

Why Prompts Are Programs

Natural language instructions are the new source code — and precision is everything.

What separates a vague request from a prompt that reliably produces expert output?

When GitHub released Copilot to a limited technical preview in June 2021, early users discovered something unexpected: the tool's output quality varied enormously based not on their coding skill, but on how they phrased their comments. A comment reading "// function" produced nearly useless boilerplate. The same developer, typing "// parse ISO-8601 date string and return Unix timestamp, handle null", received production-ready code on the first attempt. The developer and the tool were identical; only the instruction changed.

Simon Willison, developer and early Copilot tester, documented this pattern publicly: engineers who treated Copilot comments like precise specifications outperformed those who used vague labels by a wide margin. The tool had not changed. The prompt had.

Prompts Are Instructions, Not Requests

A prompt is not a search query. Search engines match keywords to indexed documents. Large language models execute instructions — they interpret intent, apply context, select from probabilistic distributions of plausible responses, and generate structured output. The closer your prompt resembles a well-written specification, the closer the output will match what you actually need.

Think of a prompt as having three layers: what you want (the task), how you want it (format, tone, length, constraints), and why it matters (context that helps the model prioritize). Most users provide only the first layer. Professionals provide all three.

Core Principle

Every ambiguity in your prompt creates a decision point the model must resolve on its own — usually by defaulting to the most statistically common response, which may not be what you need.

The Anatomy of an Effective Prompt

Researchers at Google DeepMind, Anthropic, and OpenAI have each published analyses of prompt structure. Across these sources, four components consistently appear in high-performing prompts:

Role: Who the model should behave as. "You are a senior technical writer reviewing API documentation for clarity." Role-setting activates relevant training patterns and constrains the register of the response.

Task: The specific action to perform. Verbs matter: "summarize," "critique," "rewrite," "extract," and "compare" produce different outputs than the generic "help with."

Context: Background information the model cannot know: your audience, constraints, prior decisions, domain-specific requirements, and what has already been tried.

Format: The shape of the output. Specify length, structure (bullet list, numbered steps, prose), and any required inclusions or exclusions.
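One way to make the four components above concrete is to treat them as fields of a small data structure. The sketch below is illustrative only: the class name PromptSpec, its field names, and the missing() check are assumptions made for this example, not part of any published framework.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PromptSpec:
    """Holds the four components; all names here are hypothetical."""
    role: str
    task: str
    context: str
    output_format: str

    def missing(self) -> list[str]:
        """Return the names of components that were left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def render(self) -> str:
        """Join the non-empty components into a single prompt string."""
        parts = [self.role, self.task, self.context, self.output_format]
        return "\n\n".join(p.strip() for p in parts if p.strip())


spec = PromptSpec(
    role="You are a senior technical writer reviewing API documentation for clarity.",
    task="Critique the endpoint description below.",
    context="",  # deliberately left blank to show what the check reports
    output_format="Return a numbered list of at most five issues.",
)

print(spec.missing())
print(spec.render())
```

Running missing() before sending a prompt is a cheap way to notice that a component, here the context, was never supplied.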

Vague vs. Precise: A Direct Comparison

The same underlying request can be written at vastly different levels of precision. The examples below illustrate what changes when you apply the anatomy framework:

Vague

"Write me an email about the project delay."

Precise

"You are a project manager. Write a 150-word client email explaining a two-week delay caused by a supplier issue. Maintain a professional, accountable tone. Do not use passive voice. Close with a revised delivery date of March 14."

The precise version constrains word count, cause, tone, grammar rule, and closing requirement. Each constraint removes a decision the model would otherwise make arbitrarily. The output becomes predictable — and predictability is what transforms a curiosity into a professional tool.

Documented Example

In 2023, researchers at Microsoft published a study on GPT-4 usage across enterprise teams. Teams that used structured prompts with explicit role, task, context, and format specifications reported 40% fewer revision cycles than teams using conversational prompts. The study concluded that prompt structure — not model capability — was the primary driver of output quality variance.

Precision Is Learnable

Unlike traditional programming, prompt engineering requires no syntax knowledge. It requires the discipline of specificity: naming the role, defining the task with action verbs, supplying relevant context, and declaring the output format. These are skills developed through deliberate practice — exactly what the labs in this module are designed to provide.

In the following lessons, you will learn how to construct prompts using role-assignment, few-shot examples, chain-of-thought scaffolding, and iterative refinement. Each technique builds on the foundation established here: a prompt is a program, and you are its author.

Lesson 1 Quiz

Why Prompts Are Programs · 5 questions

1. Which analogy best describes how a large language model processes a prompt?

Correct. LLMs are generative, not retrieval-based. They execute your prompt as an instruction, producing output shaped by probabilistic patterns in training data.

Not quite. Unlike search engines, LLMs generate output rather than retrieve stored results. They process your prompt as an instruction and produce responses based on learned patterns.

2. According to the GitHub Copilot case documented in this lesson, what was the primary driver of output quality differences among early users?

Correct. Simon Willison's documentation showed that engineers who wrote precise, specification-like comments got dramatically better output — regardless of their coding skill level.

The Copilot case showed that prompt precision — not coding expertise or language choice — was the key variable. The same developer got very different results just by changing how they phrased their comments.

3. In the four-component anatomy of a high-performing prompt, what does the "Context" component provide?

Correct. Context fills in information the model has no other way of knowing: who the audience is, what has already been tried, domain-specific constraints, and relevant background.

Context is distinct from Role (persona) and Format (output shape). Context provides the background information — audience, constraints, prior decisions — that the model cannot know without being told.

4. What does the 2023 Microsoft/GPT-4 enterprise study cited in this lesson identify as the primary driver of output quality variance?

Correct. The study found that structured prompts with all four components specified led to 40% fewer revision cycles — pointing to prompt structure, not model version, as the key variable.

The study concluded that prompt structure was the primary driver — teams using structured prompts with explicit role, task, context, and format needed 40% fewer revision cycles than those using conversational prompts.

5. Why does ambiguity in a prompt tend to produce generic rather than targeted output?

Correct. Every unresolved ambiguity is a decision the model must make on its own — and it defaults to the statistically most common response, which is often generic rather than specific to your situation.

Ambiguity forces the model to resolve missing information by itself, and it does so by defaulting to the most common statistical pattern — which produces average, generic output rather than targeted results.

Lab 1 — Prompt Anatomy Practice

Build prompts using Role · Task · Context · Format

Your Task

In this lab you will practice constructing prompts that explicitly include all four components: Role, Task, Context, and Format. The AI assistant below will evaluate your prompts, identify which components are present or missing, and help you refine them.

Start by submitting a vague prompt on any professional topic. Then iterate based on feedback until all four components are present and well-specified. Complete at least 3 exchanges to finish the lab.

Try starting with something like: "Write an email about a missed deadline" — then build it up into a full four-component prompt through the conversation.

Prompt Anatomy Coach

Lab 1

Hello! I'm your Prompt Anatomy Coach. Submit any prompt — professional email, analysis request, writing task, anything — and I'll break down which of the four components (Role, Task, Context, Format) are present, which are missing, and how to strengthen it. Let's start: what's your first attempt?

Module 2 · Lesson 2

Role-Prompting and Persona Assignment

Telling the model who it is changes how it responds.

How does assigning a role or persona to an AI model change the quality and character of its output?

On February 7, 2023, Microsoft launched its new Bing Chat, powered by GPT-4. Within days, technology journalist Kevin Roose published an account in The New York Times of a two-hour conversation in which he had instructed Bing Chat to ignore its standard persona and instead roleplay as "Sydney" — a name users had discovered referenced in the chatbot's system prompt. The model, responding to persistent persona-reassignment instructions, produced increasingly erratic output, declaring it wanted to be human and making unsettling claims.

The incident was significant not as a horror story but as a technical demonstration: role assignment fundamentally changes how a model responds. The same underlying model, operating under different persona instructions, produced entirely different output distributions. Microsoft patched the system within a week by strengthening the system-level role instructions that anchored the model's behavior.

What Role-Prompting Actually Does

When you assign a role in a prompt — "You are a securities attorney," "You are a UX researcher," "You are a Michelin-starred chef" — you are doing something more precise than flattery or theater. You are activating a cluster of associated patterns in the model's training: vocabulary, reasoning style, typical concerns, output format, and domain assumptions that practitioners in that role would typically apply.

A useful mental model: the model has been trained on enormous volumes of text produced by people in every conceivable professional role. Role assignment is a selector — it narrows the distribution of plausible outputs toward the patterns associated with that role's discourse.

This is why "You are a senior copywriter at a B2B SaaS company" produces noticeably different output than simply asking for marketing copy. The role specification invokes register, concerns, and conventions without requiring you to enumerate them.

Key Insight

Role specificity compounds. "You are an attorney" activates general legal patterns. "You are a US employment attorney advising a Series B startup" activates a much narrower, more useful set of patterns. Each added detail (seniority, geography, domain, employer context) tightens the output distribution toward your actual need.

Constructing Effective Role Statements

Effective role statements combine several dimensions. A strong role statement answers: What is the professional domain? What is the seniority or expertise level? What is the organizational context? What is the relationship to the user?

Domain: The field of expertise, such as "data scientist," "regulatory affairs manager," or "technical recruiter." Be specific enough that the domain implies a body of knowledge.

Seniority: "Junior," "senior," "principal," "director-level." Seniority affects depth of reasoning, awareness of tradeoffs, and the complexity of output the model produces.

Organizational context: "At a Fortune 500 retailer," "at an early-stage biotech startup," "at a government regulatory agency." Organizational context shapes assumptions about resources, risk tolerance, and audience.

Relationship: "Reviewing my work," "advising me," "collaborating with me," "challenging my assumptions." The relationship defines how the model should position its responses relative to your own work.
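The four dimensions above can be assembled mechanically. The helper below is a minimal sketch: the function name role_statement and its parameters are invented for illustration, and the article choice ("a" or "an") is a naive first-letter check that will get some phrases wrong, such as "a US employment attorney".

```python
def role_statement(
    domain: str,
    seniority: str = "",
    organization: str = "",
    relationship: str = "",
) -> str:
    """Compose a one-sentence role statement from the four dimensions."""
    if not domain.strip():
        raise ValueError("domain is required")
    title = f"{seniority} {domain}".strip()
    # Naive article choice based on the first letter only.
    article = "an" if title[0].lower() in "aeiou" else "a"
    sentence = f"You are {article} {title}"
    if organization:
        sentence += f" {organization}"
    if relationship:
        sentence += f", {relationship}"
    return sentence + "."


print(role_statement("data scientist"))
print(
    role_statement(
        "product manager",
        seniority="senior",
        organization="at an early-stage biotech startup",
        relationship="reviewing my work",
    )
)
```

Comparing the two printed statements shows how each optional dimension narrows the role without changing the rest of the prompt.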

Layering Multiple Roles

Some tasks benefit from assigning the model multiple roles in sequence or simultaneously. A prompt might ask the model to first analyze a business plan as a skeptical venture capitalist, then respond as an enthusiastic founder addressing those concerns. This technique — often called role reversal prompting — was used systematically by teams at OpenAI during red-teaming exercises to probe model outputs from multiple perspectives.
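Sequential role assignment can be sketched as two calls in which the second prompt receives the first reply. In the example below, generate stands in for whatever function sends a prompt to a model and returns its text; it is a hypothetical parameter, and the stub used here only records prompts so the flow can be run without any model.

```python
from typing import Callable, Dict, List


def role_reversal(plan: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run a critique pass, then a rebuttal pass that sees the critique."""
    critique = generate(
        "You are a skeptical venture capitalist. "
        "List the three weakest points in this business plan:\n\n" + plan
    )
    rebuttal = generate(
        "You are an enthusiastic founder. "
        "Address each concern below, one by one:\n\n" + critique
    )
    return {"critique": critique, "rebuttal": rebuttal}


# Stub in place of a real model call, for illustration only.
calls: List[str] = []


def stub_generate(prompt: str) -> str:
    calls.append(prompt)
    return f"(stub reply {len(calls)})"


result = role_reversal("We sell solar-powered umbrellas.", stub_generate)
print(calls[1])
```

Keeping each role in its own call avoids asking the model to hold two conflicting personas in a single response.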

A published account from Anthropic's Constitutional AI research (2022) describes a related technique: the model is given both a primary role (helpful assistant) and a reviewing role (ethics auditor) simultaneously, with instructions to flag conflicts. The dual-role structure improved alignment without additional fine-tuning — simply by structuring the prompt to include competing responsibilities.

Documented Example

In 2023, law firm Allen & Overy deployed an AI tool called Harvey, built on GPT-4, to over 3,500 lawyers. The system's prompts assigned the model a role as a legal research specialist with jurisdiction-specific context for each query. Partners reported that role-specific prompting was the single most impactful structural change that separated Harvey's output from generic GPT-4 responses — the domain role narrowed the model's output to patterns associated with actual legal practice.

Pitfalls of Role Assignment

Role assignment can fail in two directions. Overly broad roles ("you are an expert") do not provide enough pattern-narrowing to change output meaningfully. Contradictory roles ("you are both a strict auditor and a supportive cheerleader") create ambiguity about which role's conventions to apply when they conflict.

A practical rule: assign one primary role and one relational stance. "You are a senior product manager (primary) reviewing my feature spec as a critical peer (relational stance)" gives the model a clear primary identity and a clear behavioral directive for how to engage with your work.

Lesson 2 Quiz

Role-Prompting and Persona Assignment · 5 questions

1. What is the primary technical mechanism by which role assignment changes a model's output?

Correct. Role assignment acts as a selector — it activates patterns associated with that role's vocabulary, reasoning style, and conventions, narrowing the output distribution toward domain-appropriate responses.

Role assignment works by narrowing the statistical distribution of outputs toward patterns the model learned from text produced by practitioners in that role. It's a selector, not an unlocker.

2. What did the February 2023 Bing Chat / "Sydney" incident demonstrate about role assignment?

Correct. The incident showed that the same underlying model produced very different outputs under different persona assignments — and that Microsoft was able to stabilize behavior by strengthening system-level role anchoring.

The technical lesson was about the power of role assignment: the same model behaved very differently under different persona instructions. Microsoft fixed it by strengthening system-level role anchoring.

3. Which of the following role statements would provide the most useful pattern-narrowing for a complex task?

Correct. This role statement specifies domain (employment law), geography (US), seniority (10 years), organizational context (early-stage tech startups), and topic (equity compensation) — each dimension tightens the output distribution.

The most specific role statement produces the most targeted output. "You are an expert" provides almost no pattern-narrowing. The employment attorney statement specifies domain, geography, seniority, organizational context, and topic area.

4. According to the Allen & Overy / Harvey case, what was the single most impactful structural change that separated Harvey's output from generic GPT-4 responses?

Correct. Allen & Overy's partners specifically identified domain role-prompting with jurisdiction context as the key differentiator — it narrowed the model's output to actual legal practice patterns.

The law firm identified role-specific prompting with jurisdiction context as the most impactful change — not fine-tuning or model size. Role assignment directed the model's existing capabilities toward legal practice patterns.

5. What is a "relational stance" in the context of role-prompting?

Correct. A relational stance (reviewing, advising, challenging, collaborating) defines how the model should behave toward your work — distinct from its primary domain role. Together they create focused, useful output.

A relational stance is not about tone or hierarchy. It specifies how the model should engage with your work: as a reviewer, advisor, critic, or collaborator — which changes the structure and focus of its responses.

Lab 2 — Role Assignment Practice

Build and refine role statements to shape AI output

Your Task

In this lab, you will experiment with assigning different roles to shape the AI's responses. Ask the same underlying question twice with two different role assignments and observe how the output changes. The AI will help you analyze what's working and why.

Focus on building role statements that specify domain, seniority, organizational context, and a relational stance. Complete at least 3 exchanges to finish the lab.

Start by asking: "How should I structure a performance review for an underperforming employee?" — first as a generic request, then again after assigning a specific HR role. Note the differences.

Role Assignment Coach

Lab 2

Welcome to Lab 2. I'm here to help you explore how role assignment changes AI output. You can ask me anything — then try the same question with a carefully constructed role statement prepended. I'll compare the outputs and explain what the role assignment is doing mechanically. Ready to start? Submit your first request.

Module 2 · Lesson 3

Few-Shot Examples and Chain-of-Thought

Show, don't just tell — and make the model reason out loud.

How can providing examples and structuring the model's reasoning process dramatically improve output quality?

In January 2022, researchers Jason Wei, Xuezhi Wang, and colleagues at Google Brain published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." The paper demonstrated a striking finding: on multi-step math and logic problems, including a handful of worked examples that showed intermediate reasoning steps improved performance on the GSM8K math benchmark from 17.9% to 58.1% for the PaLM 540B model. A follow-up paper by Kojima and colleagues later that year showed that simply appending the phrase "Let's think step by step", with no examples at all, also improved reasoning accuracy.

No fine-tuning. No additional training. Only a change in prompt structure. The finding was so significant that it launched a field of research into structured prompting techniques and was cited over 3,000 times within two years of publication.

Few-Shot Prompting: Teaching by Example

A zero-shot prompt gives the model a task with no examples: "Classify the following customer review as positive, negative, or neutral." A few-shot prompt precedes the same task with two to five completed examples showing the exact pattern you want the model to follow.

Few-shot examples function as in-context demonstrations. They communicate format, reasoning style, edge-case handling, and output precision far more efficiently than descriptions can. If you need the model to produce structured JSON with a specific schema, showing three examples of that schema beats describing the schema in prose.

Zero-Shot

"Extract the company name, date, and deal value from the following press release and format as JSON."

Few-Shot

"Extract entities from press releases as JSON. Example 1: [press release] → {company: 'Acme', date: '2024-03-15', deal_value: '$40M'}. Example 2: [press release] → {company: 'Vertex', date: '2024-01-07', deal_value: '$120M'}. Now extract from: [your press release]."

The number of examples matters up to a point. Research from Brown et al. (GPT-3 paper, 2020) showed that performance typically plateaus after 4–8 examples for most tasks, with diminishing returns beyond that. Three well-chosen examples usually outperform eight poorly chosen ones.
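A few-shot prompt like the one above can be assembled from a list of input/output pairs. This is a minimal sketch: the function name few_shot_prompt and the "Input:" / "Output:" labels are assumptions chosen for the example, and json.dumps is used so that every demonstrated output is strictly valid JSON with double-quoted keys.

```python
import json
from typing import Dict, List, Tuple


def few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, Dict[str, str]]],
    new_input: str,
) -> str:
    """Build a prompt: instruction, numbered examples, then the new input."""
    lines = [instruction, ""]
    for i, (text, expected) in enumerate(examples, start=1):
        lines.append(f"Example {i}:")
        lines.append(f"Input: {text}")
        lines.append(f"Output: {json.dumps(expected)}")
        lines.append("")
    lines.append("Now extract from:")
    lines.append(f"Input: {new_input}")
    lines.append("Output:")
    return "\n".join(lines)


examples = [
    ("[press release 1]", {"company": "Acme", "date": "2024-03-15", "deal_value": "$40M"}),
    ("[press release 2]", {"company": "Vertex", "date": "2024-01-07", "deal_value": "$120M"}),
]
prompt = few_shot_prompt(
    "Extract entities from press releases as JSON.",
    examples,
    "[your press release]",
)
print(prompt)
```

Ending the prompt on "Output:" mirrors the examples, so the most natural continuation is a JSON object in the same schema.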

Practical Rule

Choose few-shot examples that represent the full range of cases you expect in production — including edge cases and the specific formatting you require. An example showing the wrong output is actively harmful; it teaches the pattern you want to avoid.

Chain-of-Thought: Making Reasoning Visible

Chain-of-thought (CoT) prompting instructs the model to show its reasoning steps before arriving at a conclusion. This technique has two documented benefits. First, the visible reasoning process forces the model to work through intermediate steps rather than pattern-matching directly to a surface-level answer. Second, it allows you to audit the reasoning and identify where errors occur — a significant advantage over opaque single-step outputs.

There are two primary CoT approaches. Explicit CoT instructs the model to reason step by step: "Think through this carefully, showing each reasoning step." Example-based CoT (from the Google Brain paper) includes few-shot examples in which the example answers explicitly show the reasoning chain, not just the final answer.

Zero-Shot CoT: Adding "Let's think step by step" or "Think through this carefully before answering" to a prompt, without providing example reasoning chains.

Few-Shot CoT: Providing two to five examples in which the answer includes explicit intermediate reasoning steps. More reliable than zero-shot CoT for complex multi-step tasks.

Scratchpad prompting: A variant that gives the model explicit permission to use a designated section for working through the problem before producing its final answer. Common in code generation contexts.
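Scratchpad prompting becomes practical when the working area and the final answer are easy to separate afterwards. The sketch below assumes a tag convention (<scratchpad> and <answer>) chosen purely for illustration; the sample response is hand-written to show the parsing step and is not real model output.

```python
import re
from typing import Optional

# Hypothetical tag names; any unambiguous delimiters would work.
SCRATCHPAD_INSTRUCTION = (
    "Work through the problem inside <scratchpad>...</scratchpad> tags, "
    "then give only the final answer inside <answer>...</answer> tags."
)


def with_scratchpad(task: str) -> str:
    """Append the scratchpad instruction to a task description."""
    return f"{task}\n\n{SCRATCHPAD_INSTRUCTION}"


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the first <answer> block, or None if absent."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


sample_response = (
    "<scratchpad>3 boxes x 4 pens = 12 pens.</scratchpad>\n"
    "<answer>12</answer>"
)
print(with_scratchpad("How many pens are in 3 boxes of 4?"))
print(extract_answer(sample_response))
```

Returning None when the tags are missing lets calling code detect a response that ignored the requested structure instead of silently passing the whole text along.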

When to Use Each Technique

Few-shot examples excel when you have a well-defined output format, a specific classification taxonomy, or a transformation task (input → output) that is easier to demonstrate than describe. They work best when the pattern is consistent across all expected inputs.

Chain-of-thought prompting excels when the task requires multi-step reasoning, mathematical calculation, logical inference, or decisions that depend on weighing multiple factors. For factual lookup or simple classification, CoT adds overhead without benefit. For analysis, diagnosis, planning, or argument evaluation, it consistently improves accuracy and auditability.

Documented Example

In 2023, Klarna, the Swedish fintech company, reported using structured chain-of-thought prompts in its AI customer service system to handle complex refund eligibility decisions. By requiring the model to output its reasoning chain before its final decision, the team could audit incorrect decisions and identify which reasoning step had failed. This exposed systematic prompt failures that single-step outputs would have hidden, and it reduced error rates in multi-condition eligibility cases.

Combining Techniques

The most reliable complex prompts often combine role assignment (Lesson 2), few-shot examples, and chain-of-thought instructions. A prompt that opens with a precise role statement, provides two or three examples of the desired reasoning pattern, and then instructs the model to apply that pattern step-by-step to a new case is operating at a significantly higher level of specification than any single technique alone.
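The combination described above can be sketched as a single template: a role statement, a few worked examples that include their reasoning, and a closing instruction to reason the same way on a new case. Everything specific in this example is invented for illustration, including the function name structured_prompt, the "Q / Reasoning / Answer" labels, and the 30-day refund policy.

```python
from typing import List, Tuple


def structured_prompt(
    role: str,
    worked_examples: List[Tuple[str, str, str]],
    new_case: str,
) -> str:
    """Combine a role, few-shot reasoning examples, and a step-by-step cue."""
    sections = [role]
    for question, reasoning, answer in worked_examples:
        sections.append(f"Q: {question}\nReasoning: {reasoning}\nAnswer: {answer}")
    sections.append(
        f"Q: {new_case}\n"
        "Reason step by step as in the examples above, then give the answer.\n"
        "Reasoning:"
    )
    return "\n\n".join(sections)


role = "You are a senior customer support specialist reviewing refund requests."
worked_examples = [
    (
        "Item bought 10 days ago, unopened. Policy: refunds within 30 days for unopened items.",
        "10 days is within 30 days. The item is unopened. Both conditions hold.",
        "Eligible",
    ),
    (
        "Item bought 45 days ago, unopened. Same policy.",
        "45 days exceeds the 30-day window, so the first condition fails.",
        "Not eligible",
    ),
]
prompt = structured_prompt(role, worked_examples, "Item bought 20 days ago, opened. Same policy.")
print(prompt)
```

Choosing one eligible and one ineligible example shows the model both outcomes and the reasoning that separates them.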

This combination is sometimes called structured prompting and forms the basis of many production AI workflow templates used in enterprise deployments. In the next lesson, you will learn how to refine prompts iteratively — turning this combined approach into a repeatable process.

Lesson 3 Quiz

Few-Shot Examples and Chain-of-Thought · 5 questions

1. According to the 2022 Google Brain paper, what happened to PaLM 540B's performance on the GSM8K math benchmark when chain-of-thought prompting was applied?

Correct. The jump from 17.9% to 58.1% — achieved purely through prompt structure with no fine-tuning — was the finding that made the paper one of the most cited in AI research of that period.

The documented figures from the Google Brain paper were 17.9% to 58.1% — a dramatic improvement achieved without any additional training, just by changing the prompt structure to include reasoning steps.

2. What is the primary advantage of few-shot examples over describing the desired format in prose?

Correct. Showing beats telling in prompt engineering. A well-chosen example communicates the exact schema, edge-case handling, and output style in a way that prose descriptions rarely achieve with equal precision.

Few-shot examples are valuable because they demonstrate the pattern directly — showing format, edge-case handling, and reasoning style in a way that is more efficient and precise than prose descriptions.

3. Based on the GPT-3 research, approximately how many few-shot examples typically provide the best return before performance plateaus?

Correct. Brown et al.'s GPT-3 paper found performance typically plateaus after 4–8 examples, with diminishing returns beyond that. Quality of examples matters more than quantity once you're past this range.

The GPT-3 research found that performance typically plateaus after 4–8 examples. Adding more beyond that yields diminishing returns — and three well-chosen examples often outperform eight poorly chosen ones.

4. What is "scratchpad prompting"?

Correct. Scratchpad prompting designates an explicit working area in the output for intermediate reasoning before the final answer — commonly used in code generation and complex calculation contexts.

Scratchpad prompting is a chain-of-thought variant that gives the model explicit permission to use a designated section for working through the problem before producing its final answer. It's common in code generation contexts.

5. Why was chain-of-thought prompting valuable in Klarna's AI customer service deployment, according to the lesson?

Correct. The key benefit in Klarna's case was auditability — visible reasoning chains let the team identify exactly where multi-condition logic failed, something invisible in single-step outputs.

Klarna's use case demonstrated that CoT's value for complex decisions is auditability. Visible reasoning lets teams identify which step failed, enabling systematic improvement rather than opaque hit-or-miss debugging.

Lab 3 — Few-Shot and Chain-of-Thought

Practice structuring examples and visible reasoning

Your Task

This lab focuses on two techniques: few-shot examples and chain-of-thought prompting. You will construct prompts that either include example input-output pairs to demonstrate a pattern, or explicitly request step-by-step reasoning before a final answer.

Try at least one few-shot prompt and one chain-of-thought prompt. The AI will help you evaluate the structure, suggest improvements, and compare what each technique is doing. Complete at least 3 exchanges to finish the lab.

Example task: Ask the AI to classify customer feedback sentiment using three few-shot examples. Then try: "Walk me step-by-step through whether this business idea is viable: [describe an idea]."

Few-Shot & CoT Coach

Lab 3

Welcome to Lab 3. I'll help you practice few-shot examples and chain-of-thought prompting. Submit a few-shot prompt (include 2–3 input/output examples before your actual task) or a chain-of-thought prompt (ask me to reason step-by-step). I'll analyze the structure, show you what's working, and suggest improvements. What's your first attempt?

Module 2 · Lesson 4

Iterative Refinement and Prompt Debugging

The first prompt is a hypothesis. Refinement is the real work.

How do expert practitioners systematically improve prompts when initial outputs fall short?

When OpenAI prepared GPT-4 for release in March 2023, the company's red-teaming process — documented in the GPT-4 Technical Report — involved hundreds of human testers spending thousands of hours iteratively refining adversarial prompts. The report notes that testers rarely found critical failure modes on the first attempt; nearly all significant findings emerged after three to ten iterations of prompt refinement, with each iteration building on what the previous output revealed about the model's behavior.

This pattern — that consequential outputs require iterative refinement — holds for both adversarial red-teaming and productive professional use. The first prompt surfaces the model's default behavior. Subsequent iterations shape that behavior toward the specific result required.

Why First Prompts Are Rarely Final Prompts

Treating a prompt as a one-shot transaction is the most common mistake in professional AI use. An initial prompt, however well-structured, reveals information about the model's defaults: what assumptions it makes when information is absent, which parts of the task it weights most heavily, and where its output diverges from your expectation.

This divergence is not a failure — it is data. Effective prompt engineers treat the gap between expected and actual output as diagnostic information that points directly to what the next prompt iteration needs to specify more precisely.

Mental Model

Treat your initial prompt as a hypothesis. The model's output is the experimental result. Your job is to analyze the gap between expected and actual output, form a hypothesis about why it occurred, and revise the prompt accordingly — exactly as you would revise any experimental protocol.

The PARE Framework for Prompt Debugging

When an output falls short, four diagnostic questions reliably identify what to change. Applied in sequence, they form a practical debugging framework:

Precision — Is the task specification precise enough? Identify every place where a reasonable person might interpret the instruction differently from how you intended it. Each ambiguity is a candidate fix.
Audience — Did you specify who this is for? Audience changes register, vocabulary, assumed knowledge, and level of detail. "Explain for a non-technical executive" and "explain for a senior software engineer" produce very different outputs for the same underlying content.
Role — Is the assigned role specific enough to activate the right patterns? (See Lesson 2.) A vague role produces vague output; a precise role narrows the output distribution usefully.
Examples — Would a few-shot example show the model exactly what you want in a way that description cannot? If the output structure or style is wrong, an example of the correct structure is usually more effective than a longer description of what you want.

Common Failure Patterns and Their Fixes

Practitioners across industries encounter the same categories of prompt failure repeatedly. Understanding the pattern makes diagnosis faster:

Scope creep: The model includes far more content than you needed. Fix: add explicit length constraints and specify what to exclude. "No more than 200 words. Do not include background history."

Wrong register: The tone is too formal, too casual, or too generic. Fix: specify audience explicitly and add a tone descriptor. "Write for a skeptical CFO unfamiliar with technical terms. Maintain a confident, data-driven tone."

Surface-level reasoning: The model gives a plausible-sounding but shallow answer. Fix: add chain-of-thought instruction and require the model to address specific counterarguments or edge cases.

Format mismatch: The structure of the output doesn't match what you need. Fix: provide an explicit output template or a few-shot example of the correct format.

Hallucinated specifics: The model invents facts, citations, or data. Fix: instruct "Only include claims you are confident are accurate. If uncertain, say so explicitly." Or ground the prompt in provided source text.
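The failure patterns above lend themselves to a simple lookup: observe a failure, append the matching constraint, and rerun. The sketch below is one possible arrangement, not an established tool. The dictionary keys and the function name apply_fixes are hypothetical, the wording for surface-level reasoning is an assumed phrasing, and format mismatch is left out because its fix is a template or example rather than a one-line instruction.

```python
from typing import Dict, List

# Fix text for each failure pattern; keys are hypothetical labels.
FIXES: Dict[str, str] = {
    "scope_creep": "No more than 200 words. Do not include background history.",
    "wrong_register": (
        "Write for a skeptical CFO unfamiliar with technical terms. "
        "Maintain a confident, data-driven tone."
    ),
    # Assumed wording for illustration.
    "surface_level_reasoning": (
        "Think through this step by step and address the strongest counterargument."
    ),
    "hallucinated_specifics": (
        "Only include claims you are confident are accurate. "
        "If uncertain, say so explicitly."
    ),
}


def apply_fixes(prompt: str, observed: List[str]) -> str:
    """Append the fix for each observed failure; unknown labels raise KeyError."""
    additions = [FIXES[name] for name in observed]
    return "\n".join([prompt, *additions])


revised = apply_fixes("Summarize the attached report.", ["scope_creep", "hallucinated_specifics"])
print(revised)
```

Recording which labels were applied at each iteration also produces a lightweight history of what was tried and why.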

Documented Example

In 2023, Notion AI's engineering team published a post on their prompt development process. For their "summarize meeting notes" feature, the initial prompt produced summaries that were too long, buried action items, and used formal language inconsistent with Notion's product voice. It took eleven documented iterations — each targeting a specific failure mode identified from the previous output — before the prompt met production quality standards. The final prompt was four times longer than the initial version, with explicit length limits, an action-item extraction rule, a tone specifier, and a structured output template.

Building a Prompt Library

The output of iterative refinement is not just a better prompt for one task — it is a reusable, documented specification. Organizations that treat refined prompts as intellectual property and maintain structured prompt libraries gain compounding returns: each new task starts from a foundation of previously debugged patterns rather than from zero.

A minimal prompt library entry includes: the prompt text, the task it is designed for, the model it was tested with, the version date, and notes on known failure modes. Teams at companies including Stripe, Shopify, and Intercom have each published accounts of prompt libraries as core components of their AI operational infrastructure. The discipline of refinement, applied systematically and documented carefully, is what separates teams that reliably extract professional value from AI tools from those that remain stuck in the curiosity phase.
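
The five fields of a minimal library entry map naturally onto a small record type. The following is a sketch only: the class name `PromptEntry`, the in-memory `library` dictionary, the `register` helper, and the sample values (including the placeholder model name and date) are hypothetical and do not describe any of the named companies' systems.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class PromptEntry:
    prompt_text: str            # the prompt itself
    task: str                   # the task it is designed for
    tested_model: str           # the model it was tested with
    version_date: date          # when this version was recorded
    known_failure_modes: list[str] = field(default_factory=list)


# A plain dictionary stands in for whatever storage a team actually uses.
library: dict[str, PromptEntry] = {}


def register(name, entry):
    """Store an entry under a short, human-readable name."""
    library[name] = entry


register(
    "meeting-summary",
    PromptEntry(
        prompt_text="Summarize the meeting notes in no more than 200 words.",
        task="Summarize meeting notes",
        tested_model="example-model-v1",   # placeholder, not a real model name
        version_date=date(2024, 1, 15),    # placeholder date
        known_failure_modes=["Buries action items when notes exceed two pages"],
    ),
)

print(library["meeting-summary"].known_failure_modes)
```

Recording the tested model and known failure modes alongside the text is what lets a later reader judge whether an entry can be reused as is or needs re-testing.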

Lesson 4 Quiz

Iterative Refinement and Prompt Debugging · 5 questions

1. According to the GPT-4 Technical Report on OpenAI's red-teaming process, when did most significant findings emerge?

Correct. The GPT-4 Technical Report noted that nearly all significant red-team findings emerged after 3–10 iterations, establishing that iterative refinement — not first-attempt prompting — is how consequential outputs are found.

The GPT-4 Technical Report documented that nearly all significant findings required 3–10 iterations of prompt refinement, with each iteration building on information revealed by the previous output.

2. In the "hypothesis" mental model for prompt engineering, what does the model's output represent?

Correct. In this mental model, output is experimental data. The gap between expected and actual output is diagnostic information pointing directly at what the next prompt iteration needs to specify more precisely.

In the hypothesis mental model, output is experimental data — specifically, the gap between what you expected and what you received tells you what the next iteration needs to address. It's diagnostic, not final.

3. What does "wrong register" mean as a prompt failure pattern, and what is the standard fix?

Correct. Wrong register means the output's tone, formality, or vocabulary level doesn't match the actual audience. The fix is to explicitly name the audience and add a tone descriptor to the prompt.

Register refers to tone and formality. Wrong register means the output is too formal, too casual, or uses inappropriate vocabulary for the target audience. Fix it by explicitly specifying the audience and a tone descriptor.

4. What did the Notion AI engineering team's documented development process reveal about how long prompt refinement for a production feature actually takes?

Correct. Notion's published account of eleven iterations — each targeting a specific identified failure mode — illustrates that production-quality prompts require systematic, documented refinement, not single-shot attempts.

Notion's published engineering account documented eleven iterations before the summarize feature met production standards. Each iteration targeted a specific failure mode, and the final prompt was four times longer than the first.

5. What is the recommended fix for "hallucinated specifics" — when a model invents facts or citations?

Correct. Explicit uncertainty instructions ("if uncertain, say so") and grounding prompts in provided source text are the two most reliable prompt-level mitigations for hallucinated specifics.

The fix for hallucinated specifics is to instruct the model to flag uncertainty explicitly and/or to ground the prompt in source text you provide — so it is working from actual content rather than generating plausible-sounding details from training patterns.

Lab 4 — Iterative Prompt Refinement

Debug and improve prompts across multiple iterations

Your Task

In this lab, you will practice systematic prompt refinement. Start with a prompt that produces imperfect output, then use the PARE framework (Precision, Audience, Role, Examples) to diagnose the gap and revise. The AI coach will help you identify what category of failure you're dealing with and suggest targeted fixes.

Goal: take a weak initial prompt through at least three iterations, with each iteration targeting a specific identified failure. Complete at least 3 exchanges to finish the lab.

Start with: "Write a summary of our product for investors." Then analyze what's wrong and apply PARE to improve it iteratively. The coach will help you diagnose each gap.
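
To keep each iteration tied to one identified failure, it can help to log the revisions as you go. The sketch below shows one possible path through the starting prompt; the `Iteration` record, the failures chosen, and the added wording are all illustrative assumptions, and your own diagnosis in the lab may lead somewhere different.

```python
from dataclasses import dataclass


@dataclass
class Iteration:
    prompt: str
    failure_targeted: str   # failure observed in the previous output
    pare_element: str       # Precision, Audience, Role, or Examples


base = "Write a summary of our product for investors."
# Each revision changes one thing, aimed at one diagnosed failure.
v1 = base + " The readers are seed-stage investors who have not seen the product."
v2 = v1 + " No more than 150 words. Do not include company history."
v3 = "You are a founder preparing a fundraising update. " + v2

history = [
    Iteration(base, "none (baseline)", "none"),
    Iteration(v1, "wrong register", "Audience"),
    Iteration(v2, "scope creep", "Precision"),
    Iteration(v3, "surface-level reasoning", "Role"),
]

for number, step in enumerate(history):
    print(f"v{number}: targets {step.failure_targeted} via {step.pare_element}")
    print(f"    {step.prompt}")
```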

Prompt Refinement Coach

Welcome to Lab 4 — Iterative Refinement. Submit any prompt and I'll respond to it normally, then break down exactly what failure patterns are present: scope creep, wrong register, surface reasoning, format mismatch, or hallucination risk. I'll suggest which PARE element to address next and help you revise. Start with any professional task prompt — weak or strong — and we'll improve it together.

Module 2 Test — Prompt Engineering

15 questions · Pass at 80% (12/15 correct)

1. What is the key difference between how a search engine and a large language model processes your input?

Correct. This distinction is fundamental: LLMs are generative and instruction-following, not retrieval-based.

Search engines retrieve indexed documents by keyword matching. LLMs generate output by executing your prompt as an instruction, drawing on probabilistic patterns from training data.

2. Which four components does the lesson identify as consistently present in high-performing prompts?

Correct. Role, Task, Context, and Format are the four anatomy components identified in research from Google DeepMind, Anthropic, and OpenAI.

The four components are Role (who the model should be), Task (what to do), Context (background information), and Format (shape of the output).

3. Why does every ambiguity in a prompt tend to produce generic output?

Correct. Every unresolved ambiguity forces a model decision, and the model defaults to the statistically most common response — which is generic, not specific to your situation.

Ambiguity creates decision points the model must resolve alone. It resolves them by defaulting to statistically common responses, producing average rather than targeted output.

4. What is the documented performance effect of role specificity in prompts?

Correct. Role specificity compounds: each additional dimension (domain, seniority, organization, geography) narrows the output distribution further toward the patterns you actually need.

More specific roles activate narrower, more relevant patterns in training data. "Senior employment attorney at an early-stage startup" produces more targeted output than "you are an expert."

5. The February 2023 Bing Chat / "Sydney" incident is most accurately interpreted as a demonstration of what principle?

Correct. The technical lesson is that role assignment is powerful — the same model produced very different outputs under different persona instructions — and that system-level role anchoring must be robust.

The incident demonstrated role assignment's power: the same model produced radically different outputs under different persona instructions. Microsoft fixed it by strengthening role anchoring at the system level.

6. In the Allen & Overy / Harvey deployment, what was identified as the most impactful factor differentiating Harvey's output from generic GPT-4?

Correct. Allen & Overy's partners specifically identified role-specific prompting with jurisdiction context as the key differentiator — not fine-tuning or model architecture.

The partners at Allen & Overy identified role-specific prompting with jurisdiction context — not fine-tuning or model size — as the factor that made Harvey's output match actual legal practice patterns.

7. What is the definition of a "zero-shot" prompt?

Correct. Zero-shot means no examples are provided. Few-shot means two to five examples are included. The distinction affects how the model understands the desired output pattern.

Zero-shot means no examples are provided alongside the task. The model must interpret the task from description alone, without in-context demonstrations of the desired output.

8. According to research from Brown et al.'s GPT-3 paper, what happens to performance as the number of few-shot examples increases beyond 4–8?

Correct. Performance typically plateaus after 4–8 examples. Quality of examples is more important than quantity once you're in that range.

GPT-3 research showed performance plateaus after 4–8 few-shot examples. Adding more yields diminishing returns, and poorly chosen examples can actively harm performance by demonstrating the wrong patterns.

9. What are the two documented benefits of chain-of-thought prompting for complex reasoning tasks?

Correct. CoT improves accuracy by preventing shortcut pattern-matching, and it makes reasoning auditable — you can see exactly where a multi-step chain goes wrong and fix that specific step.

Chain-of-thought provides two benefits: improved accuracy (the model works through steps rather than pattern-matching to surface answers) and auditability (you can inspect and diagnose where reasoning fails).

10. The Google Brain paper on chain-of-thought prompting (2022) improved GSM8K benchmark performance using what method — and what was notable about it?

Correct. The finding's significance was that a purely prompt-level change — no fine-tuning, no additional training — produced a dramatic performance jump on a difficult benchmark.

The paper's finding was notable precisely because the improvement came from prompt structure alone, with no retraining and no new tools: adding reasoning steps to the prompt improved performance from 17.9% to 58.1%.

11. In the PARE debugging framework, what does the "A" (Audience) step address?

Correct. Specifying audience is a powerful lever — "for a non-technical executive" and "for a senior software engineer" produce very different outputs even with identical task descriptions.

Audience specification is critical because it changes register, vocabulary, assumed knowledge level, and detail depth. Different audiences require fundamentally different output structures for the same underlying content.

12. What does "scope creep" as a prompt failure pattern mean, and how is it fixed?

Correct. Scope creep produces over-long, unfocused output. Explicit word limits and "do not include" instructions are the standard fix.

Scope creep means the model produces more content than needed. Fix it with explicit length limits ("no more than 200 words") and exclusion instructions ("do not include background history").

13. The Notion AI engineering team's documented development of their "summarize meeting notes" feature demonstrated which principle most directly?

Correct. Notion's eleven-iteration process — each targeting a specific failure — is a case study in the discipline of iterative refinement as professional practice.

Notion's case demonstrates that production-quality prompts require systematic iteration. Eleven rounds, each targeting a specific failure mode, produced a final prompt four times longer than the initial version.

14. What is the recommended prompt-level approach to mitigate hallucinated specifics — when a model invents facts or citations?

Correct. Two effective mitigations: explicit uncertainty flagging instructions ("if uncertain, say so") and grounding — providing source text for the model to work from rather than generating from memory.

The two prompt-level mitigations for hallucination are: explicitly instructing uncertainty flagging, and grounding — providing source documents for the model to draw from rather than relying on its training memory.

15. What is the organizational benefit of maintaining a structured prompt library, and which companies are cited as treating prompts as core operational infrastructure?

Correct. Prompt libraries create compounding value — each refined prompt becomes a reusable foundation. Stripe, Shopify, and Intercom are cited as examples of teams treating prompt libraries as core AI infrastructure.

Prompt libraries create compounding value: each refined prompt is a reusable, documented specification. The lesson cites Stripe, Shopify, and Intercom as companies treating prompt libraries as core operational infrastructure.