L1
Β·
Quiz
Β·
Lab
L2
Β·
Quiz
Β·
Lab
L3
Β·
Quiz
Β·
Lab
L4
Β·
Quiz
Β·
Lab
Module Test
Module 3 Β· Lesson 1

What Is Few-Shot Prompting?

Teaching by example β€” the fastest way to align a model's output format, tone, and logic to your exact needs.
Why does showing a model two or three examples often outperform paragraphs of instructions?

In June 2020, OpenAI published the GPT-3 paper, "Language Models are Few-Shot Learners." The title was the finding. Researchers discovered that a large language model shown just a handful of input-output pairs in its context window could perform tasks it had never been explicitly fine-tuned for β€” translation, arithmetic reasoning, even novel analogical reasoning β€” at accuracy levels that rivaled purpose-trained models. The technique required no gradient updates, no retraining, no labeled dataset of thousands. Just examples, placed directly in the prompt.

The Core Idea

Few-shot prompting means prepending one or more worked examples to your actual request. Each example is an input-output pair that demonstrates exactly what you want. The model reads your examples, infers the pattern, and applies it to the new input you provide.

This is distinct from zero-shot prompting (no examples, just instructions) and fine-tuning (updating model weights with thousands of examples). Few-shot sits in the middle: no weight changes, but far more guidance than zero-shot alone.

Zero-shot A prompt with no examples β€” just a task description. The model must infer desired behavior from instructions alone.
One-shot A single input-output example before the real query. Establishes format and basic pattern.
Few-shot Two to eight (typically) input-output examples. Provides enough variation to convey edge cases, tone, and schema.
Many-shot Dozens or hundreds of examples in a long-context window. Explored in Google DeepMind's 2024 "Many-Shot In-Context Learning" paper β€” shows continued gains on complex tasks.

A Concrete Minimal Example

Suppose you want a model to classify customer support tickets as BILLING, TECHNICAL, or ACCOUNT. Zero-shot:

Zero-Shot (Ambiguous)

Classify the following ticket into one of: BILLING, TECHNICAL, ACCOUNT.

Ticket: "I can't log in after changing my email."

Few-Shot (Anchored)

Ticket: "My invoice shows twice the amount." β†’ BILLING
Ticket: "The app crashes on startup." β†’ TECHNICAL
Ticket: "I can't log in after changing my email." β†’ ?

The few-shot version shows the model exactly what a "BILLING" response looks like, what a "TECHNICAL" one looks like, and β€” crucially β€” that the answer is a single label with no additional text. The model learns format, vocabulary, and brevity all at once from those two lines.

GPT-3 Paper Finding (Brown et al., 2020)

On the SuperGLUE benchmark, GPT-3 with few-shot prompting matched or exceeded fine-tuned BERT-large on several tasks β€” despite having seen only a handful of in-context examples rather than thousands of labeled training samples. The researchers noted that performance scaled with the number of examples, with diminishing returns past roughly 8–10 examples for most tasks.

Why Examples Beat Instructions

Instructions are inherently ambiguous. "Write a brief summary" leaves open: How brief? What tense? Does it include the conclusion? One sentence or three? Examples answer all of these implicitly β€” the model sees the exact output schema and matches it.

This is why, in practice, teams at companies like Anthropic and OpenAI recommend leading with examples whenever you have a fixed output format. The 2023 Anthropic prompt engineering guide states: "If you want Claude to output JSON with a specific schema, showing an example output is faster and more reliable than describing every field in prose."

Developer Rule of Thumb

When your task has a consistent format β€” structured data extraction, classification, code transformation, templated writing β€” start with 2–3 examples. When your task is open-ended reasoning or creative work, instructions often serve better, since rigid examples can over-constrain the model's approach.

Lesson 1 Quiz

What Is Few-Shot Prompting? β€” 3 questions
The GPT-3 paper (Brown et al., 2020) was subtitled "Language Models are Few-Shot Learners." What was the core empirical finding?
Correct. Brown et al. showed GPT-3 could match or exceed fine-tuned baselines on multiple SuperGLUE tasks using only a handful of in-context examples β€” the defining insight of the paper.
Not quite. The central finding was that just a few in-context examples sufficed to rival fine-tuned models, with no weight updates required.
What primary advantage does a few-shot example provide over a written instruction for a fixed-format output task?
Correct. A single example answers "how brief?", "what tense?", "which fields?" simultaneously β€” all things that prose instructions leave open to interpretation.
The key advantage is specificity of format, not token count or weight updates. Prose instructions are inherently ambiguous; examples are not.
According to the GPT-3 paper's findings on example count, what generally happens beyond roughly 8–10 in-context examples for most tasks?
Correct. Brown et al. observed increasing performance with more examples but noted diminishing marginal returns past roughly 8–10 for most benchmark tasks.
The paper found diminishing returns, not a sharp drop or ignoring of later examples. Each additional example adds less marginal improvement after the first several.

Lab 1 β€” Building Your First Few-Shot Prompt

Practice constructing few-shot prompts for classification and formatting tasks

Your Task

In this lab you'll practice writing few-shot prompts. Ask the assistant to help you build a few-shot prompt for a classification or data-extraction task of your choice. Try at least one of these challenges:

β‘  Ask for help building a 3-example few-shot prompt for sentiment classification (positive/negative/neutral). β‘‘ Ask what makes a good few-shot example vs. a poor one. β‘’ Share a prompt you've written and ask for feedback on its example quality.
Few-Shot Lab Assistant L1
Welcome to Lab 1. I'm here to help you build and critique few-shot prompts. Tell me about a task you'd like to tackle β€” classification, extraction, transformation β€” and we'll construct good examples together. What task do you have in mind?
Module 3 Β· Lesson 2

Anatomy of a Good Example

What separates a few-shot example that reliably steers the model from one that confuses it?
Why do two prompts with the same number of examples sometimes produce wildly different output quality?

In 2021, researchers at Google Brain published "Calibrate Before Use: Improving Few-Shot Performance of Language Models." They found that label bias in the examples themselves β€” such as always listing "positive" examples last β€” could shift the model's output distribution significantly. The order, label balance, and even surface formatting of examples affected accuracy, sometimes by 20+ percentage points on classification benchmarks. Few-shot prompting was powerful but fragile in ways that weren't obvious until studied carefully.

Five Dimensions of Example Quality

Representativeness. Each example should cover a case type the model will actually encounter. If your real inputs are long paragraphs but your examples are two-word snippets, the model will calibrate to the wrong input distribution.

Label Balance. For classification tasks, include examples from each class. A prompt with 3 "positive" examples and 1 "negative" example will bias the model toward positive, as confirmed by Zhao et al. (2021) in the Calibrate Before Use paper.

Format Consistency. Every example must use the exact same format. If one example uses "Answer: Yes" and another uses "Yes.", the model will oscillate between formats on new inputs.

Correct Labels. This sounds obvious, but matters deeply: a 2022 study by Min et al. at the University of Washington ("Rethinking the Role of Demonstrations") found that even randomly-labeled examples improved performance over zero-shot β€” but correctly-labeled examples still outperformed random labels by a meaningful margin on tasks requiring precise classification.

Coverage of Edge Cases. If there's a common boundary condition in your task β€” e.g., a support ticket that is both billing AND technical β€” include an example showing how you want that handled, even if it appears rarely.

The Min et al. Finding: Format Matters More Than Labels?

The 2022 University of Washington paper "Rethinking the Role of Demonstrations for In-Context Learning" produced a striking result: when researchers replaced correct labels with random labels in few-shot examples, GPT-3's performance dropped, but only modestly β€” often staying within 5–10% of full-accuracy performance on sentiment and topic classification.

The interpretation: models are partly learning format and input distribution from examples, not just label semantics. The input-output structure tells the model what kind of answer is expected. The actual label correctness adds additional signal on top of that structural information.

For developers, this has a practical implication: format consistency is not optional. A prompt where examples have mismatched whitespace, punctuation, or field ordering will harm performance even if all labels are correct.

Practical Implication

Before blaming example count for poor few-shot performance, audit format first. Check that every example has identical delimiters, identical field order, and identical label style. In many real-world debugging cases, a format inconsistency is the culprit β€” not insufficient examples.

Bad vs. Good: A Side-by-Side

Problematic Examples

Input: "great product" β†’ Positive
Input: "okay" β†’ neutral
Input: "I loved it so much!" β†’ positive sentiment
Input: "Terrible." β†’ NEGATIVE

Well-Formed Examples

Input: "great product" β†’ POSITIVE
Input: "okay, I guess" β†’ NEUTRAL
Input: "Terrible experience." β†’ NEGATIVE
Input: "Works as described." β†’ NEUTRAL

The problematic set has four different label formats (Positive, neutral, positive sentiment, NEGATIVE), unbalanced classes (3 positive, 1 negative, 0 neutral), and inconsistent input lengths. The well-formed set uses all-caps labels consistently, covers all three classes, and shows realistic input variation.

Token-Efficiency Note

Each example adds tokens. On GPT-4o at the 2024 pricing tier, a prompt with 6 lengthy examples could cost 3–5Γ— more per call than a 2-example prompt. Measure accuracy vs. example count empirically β€” for most production classification tasks, 2–4 well-chosen examples hit the accuracy sweet spot at a fraction of the token cost of 8+ examples.

Lesson 2 Quiz

Anatomy of a Good Example β€” 3 questions
Zhao et al. (2021, "Calibrate Before Use") found that label distribution in few-shot examples affects output. What specific artifact did they identify?
Correct. Zhao et al. showed that label position and frequency distribution in examples introduced measurable bias β€” a prompt with three "positive" examples and one "negative" would over-predict positive.
Zhao et al. found the opposite: label imbalance in examples created predictable output bias, shifting accuracy by up to 20+ percentage points toward the over-represented class.
Min et al. (2022, "Rethinking the Role of Demonstrations") found that replacing correct labels with random labels had what effect on GPT-3's few-shot performance?
Correct. The finding implied that models learn task format and input distribution from examples, not only label semantics β€” making format consistency critically important.
Min et al. found a modest drop (not zero, not zero effect) when labels were randomized, suggesting format matters as much as or more than correct labeling in many cases.
You're debugging a few-shot prompt that produces inconsistent output. Before increasing the number of examples, what should you audit first?
Correct. Format inconsistency is a common culprit for inconsistent outputs, and fixing it often resolves the issue without requiring more examples or higher cost.
While those factors can matter, format consistency is the first thing to audit in few-shot prompts. Mismatched delimiters, label casing, or field order often cause inconsistent outputs.

Lab 2 β€” Diagnosing Example Quality

Practice identifying and fixing format, balance, and coverage problems in few-shot prompts

Your Task

This lab focuses on critiquing and improving existing few-shot examples. Try one or more of these challenges with the assistant:

β‘  Paste a few-shot prompt you've written (or invent one) and ask the assistant to audit it for format consistency, label balance, and coverage gaps. β‘‘ Ask: "What's wrong with these examples?" and paste a deliberately flawed set. β‘’ Ask for a revised version of your examples with all quality issues fixed.
Example Quality Lab L2
Welcome to Lab 2. I'll help you audit and improve few-shot examples. Paste a prompt with examples β€” real or invented β€” and I'll check format consistency, label balance, coverage, and example representativeness. What would you like me to review?
Module 3 Β· Lesson 3

Few-Shot for Structured Output

Using examples to reliably generate JSON, tables, code, and other schema-constrained formats at production scale.
When your application depends on parseable output, how do you use examples to eliminate format errors without fine-tuning?

In 2023, Stripe's developer blog described using few-shot prompting to power a natural-language-to-API-query interface for their dashboard. The core challenge: getting GPT-4 to reliably return valid Stripe API filter objects as JSON, without wrapping them in prose or markdown code fences. Their solution was a five-example prompt where each example showed a natural-language question followed by a bare JSON object β€” no explanation, no code fences, just the object. Parse error rates in production dropped from roughly 18% (zero-shot) to under 2% (five-shot). No fine-tuning was required.

The Structured Output Problem

Language models are trained to produce natural text. When you ask them to produce JSON, they often add preambles ("Here is the JSON:"), use inconsistent quoting, add trailing commas, or wrap output in markdown fences. For an application that calls JSON.parse() on the response, any of these artifacts causes a runtime error.

Few-shot examples are the most direct solution: show the model exactly what valid output looks like β€” no prose, no fences, no explanation β€” and it will match that pattern.

A JSON Extraction Example
Example 1
User
Extract name and email from: "Hi, I'm Sarah Chen, reach me at sarah@example.com"
Assistant
{"name":"Sarah Chen","email":"sarah@example.com"}
Example 2
User
Extract name and email from: "Contact Tom Rivera at tomr@corp.io for details."
Assistant
{"name":"Tom Rivera","email":"tomr@corp.io"}
Example 3 (Edge Case)
User
Extract name and email from: "No contact info provided in this message."
Assistant
{"name":null,"email":null}

Note what the examples accomplish: they establish JSON without code fences, double-quoted keys, and the null-handling pattern for missing data. Three examples cover the happy path and the edge case, making the schema unambiguous.

Combining Few-Shot with System Prompts

Most production APIs support a system message distinct from the conversation. A powerful pattern is to put your schema description and instructions in the system message, and put your few-shot examples in the first few user/assistant turns. This separates concerns cleanly: instructions explain intent, examples demonstrate format.

OpenAI's API documentation (2024) explicitly recommends this pattern for JSON-mode tasks, noting that few-shot examples in the conversation history significantly improve schema adherence even when JSON mode is enabled β€” because examples reinforce not just that JSON is required, but which fields, types, and null conventions to use.

When to Use JSON Mode + Few-Shot Together

JSON mode (available in OpenAI's API via response_format: {type: "json_object"}) guarantees parseable JSON but does not guarantee your specific schema. Few-shot examples fill that gap. Use both: JSON mode as a safety net, examples to pin the schema.

Few-Shot for Code Transformation

The same principle applies to code tasks. If you want to transform JavaScript functions into TypeScript equivalents with a specific annotation style, write two or three examples showing the exact input and output. The model learns your team's type annotation conventions, naming patterns, and comment style from examples far more reliably than from written style guides.

GitHub's Copilot team has noted in public talks that the most effective internal use of few-shot is for code refactoring tasks with idiosyncratic conventions β€” cases where the desired output is correct by company-specific standards that aren't in the model's training data.

Schema Versioning Tip

When your output schema changes in production, update your examples first β€” before updating your schema description. The examples are what the model pattern-matches against most strongly. An inconsistency between your prose description and your examples will usually resolve in favor of the examples.

Lesson 3 Quiz

Few-Shot for Structured Output β€” 3 questions
In the Stripe dashboard natural-language-to-API-query use case (2023), what did five-shot prompting achieve compared to zero-shot?
Correct. The Stripe case demonstrated that a small number of well-formed examples dramatically reduced format errors in production β€” a real-world validation of the technique's reliability.
The reported outcome was a dramatic reduction in parse error rates (18% β†’ under 2%), achieved without fine-tuning or changes to model architecture.
When using OpenAI's JSON mode (response_format: {type: "json_object"}), why should you still include few-shot examples?
Correct. JSON mode is a safety net for parseability; few-shot examples are what enforce your specific schema β€” field names, types, null handling, and ordering.
JSON mode guarantees valid JSON but leaves the schema unconstrained. Examples are what pin your specific field names, types, and null conventions.
When your output schema changes in production, what should you update first to avoid model confusion?
Correct. Examples carry more behavioral weight than prose descriptions. An out-of-date example will override a correctly updated system message instruction.
Because examples carry stronger behavioral signal than prose, you should update examples first. A stale example will override even a correctly updated system message.

Lab 3 β€” Few-Shot for JSON & Structured Output

Build and test few-shot prompts that reliably produce schema-constrained output

Your Task

Practice building few-shot prompts for structured output tasks. Work through one or more of these with the assistant:

β‘  Ask the assistant to help you write a 3-example few-shot prompt that extracts structured data (e.g., product info, event details) as JSON. β‘‘ Ask how to handle null/missing fields in examples. β‘’ Share a JSON extraction prompt and ask what schema ambiguities the examples leave unresolved.
Structured Output Lab L3
Welcome to Lab 3. I specialize in helping you build few-shot prompts for JSON and other structured output formats. Tell me what data you want to extract or transform, and I'll help you design examples that pin your exact schema β€” including edge cases like missing fields and nested objects. What's your use case?
Module 3 Β· Lesson 4

Advanced Few-Shot Strategies

Dynamic example selection, chain-of-thought examples, and measuring few-shot effectiveness in production.
Once basic few-shot works, how do you scale it to handle diverse inputs and measure whether it's actually helping?

In 2022, Wei et al. at Google introduced chain-of-thought (CoT) prompting in the paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." The key contribution was showing that few-shot examples containing step-by-step reasoning traces β€” not just input-output pairs β€” dramatically improved performance on multi-step math and logic tasks. On the GSM8K grade-school math benchmark, CoT few-shot prompting improved PaLM 540B accuracy from 17.9% to 56.9%. The example itself was doing the cognitive work of teaching the model to reason, not just format.

Chain-of-Thought Few-Shot Examples

Standard few-shot: Input β†’ Output. Chain-of-thought few-shot: Input β†’ Reasoning Steps β†’ Output. By including the intermediate reasoning in your examples, you teach the model to reason through new problems the same way, rather than jumping to an answer.

Chain-of-Thought Example
User
A server handles 120 requests/minute. If traffic grows 40% on weekends, how many requests per minute on Saturday?
Assistant
Base rate: 120 req/min. Weekend increase: 40% Γ— 120 = 48 additional req/min. Saturday total: 120 + 48 = 168 req/min.

The reasoning trace in the example teaches the model to show its work β€” not because you instructed it to, but because it pattern-matched your example. This is more reliable than writing "show your reasoning" in a system message, especially for models that tend to skip steps.

Dynamic Example Selection

Static few-shot prompts use the same examples for every query. For diverse inputs, a better approach is dynamic retrieval: maintain a library of labeled examples and retrieve the most similar ones to the current input at runtime using embedding-based search.

This was formalized in the KATE paper (Liu et al., 2022 β€” "What Makes Good In-Context Examples for GPT-3?"). The finding: examples retrieved by semantic similarity to the query outperformed random static examples by 6–10% on several NLP benchmarks. The intuition: a model helps more when the examples it sees are structurally similar to the problem at hand.

Dynamic Selection in Practice

A practical implementation: store 50–100 curated (input, output) pairs in a vector database. At inference time, embed the incoming query, retrieve the 3 most similar examples, inject them into the prompt. Libraries like LangChain and LlamaIndex have built-in example selectors that implement this pattern with minimal boilerplate.

Measuring Few-Shot Effectiveness

Few-shot prompting should be treated as an engineering artifact subject to measurement, not a qualitative judgment. A practical evaluation loop:

Build an evaluation set. Collect 50–200 real inputs with known correct outputs. This is your ground truth. Even a small eval set exposes systematic failures invisible in spot-checking.

Baseline with zero-shot. Run your eval set with no examples. Record accuracy or your chosen metric. This establishes what the model knows without examples.

Sweep example counts. Run eval with 1, 2, 4, 8 examples. Plot accuracy vs. example count. The curve usually flattens β€” find the knee point where adding examples no longer meaningfully improves accuracy.

Test example selection strategies. Compare random static examples vs. curated static examples vs. dynamically retrieved examples. For high-variance input distributions, dynamic retrieval almost always wins.

Track cost-per-correct-output. More examples improve accuracy but increase token cost per call. The optimal example count is where marginal accuracy gain per additional example exceeds marginal cost.

Many-Shot Frontier (2024)

Google DeepMind's 2024 paper "Many-Shot In-Context Learning" extended few-shot research to hundreds of examples enabled by long-context models (Gemini 1.5 Pro with 1M token context). For complex tasks β€” medical diagnosis, legal reasoning β€” performance continued improving past 100 examples with no saturation plateau. For most production use cases, 2–8 examples remain optimal on cost grounds, but the research suggests ceiling is much higher when task complexity justifies it.

Self-Consistency: Few-Shot at Scale

Wang et al. (2022, Google Brain) introduced self-consistency as an extension of chain-of-thought few-shot: sample multiple reasoning paths for the same question, then take the majority-vote answer. On the MATH benchmark, self-consistency + CoT few-shot improved accuracy by 17.9 percentage points over single-path CoT. The technique is particularly effective for any task with a verifiable correct answer β€” math, code, factual lookup β€” where running the model 5–10 times and voting is cheaper than the cost of errors.

Lesson 4 Quiz

Advanced Few-Shot Strategies β€” 3 questions
Wei et al. (2022) introduced chain-of-thought few-shot prompting. What was the reported impact on PaLM 540B accuracy on the GSM8K math benchmark?
Correct. The jump from 17.9% to 56.9% was one of the most striking results in the prompting literature at the time, demonstrating the power of reasoning traces in examples.
The documented result was a dramatic improvement from 17.9% to 56.9% on GSM8K β€” achieved purely by including step-by-step reasoning in the few-shot examples.
The KATE paper (Liu et al., 2022) studied dynamic example selection. What was the key finding about retrieval-based vs. random static examples?
Correct. The KATE paper showed that retrieval-based example selection β€” finding examples structurally similar to the current query β€” consistently outperformed random selection.
KATE (2022) found that semantically similar examples retrieved at inference time outperformed random static examples by 6–10%, particularly for diverse input distributions.
What is the self-consistency technique (Wang et al., 2022) and when is it most effective?
Correct. Self-consistency runs the model multiple times and votes on the most common answer β€” particularly powerful for math, code, and factual tasks where correctness is verifiable and errors are costly.
Self-consistency means sampling multiple reasoning paths and voting on the majority answer β€” most effective for tasks with verifiable correct answers where the cost of a single wrong answer exceeds the cost of multiple API calls.

Lab 4 β€” Chain-of-Thought & Advanced Few-Shot

Practice chain-of-thought examples, dynamic selection strategy, and evaluation design

Your Task

Explore advanced few-shot techniques with the assistant. Try one or more challenges:

β‘  Ask the assistant to help you write a chain-of-thought few-shot prompt for a multi-step reasoning task (billing calculation, logic puzzle, code debugging). β‘‘ Ask how you'd design a 50-query evaluation set to measure whether your few-shot prompt is working. β‘’ Ask when dynamic example retrieval is worth the engineering overhead vs. static curated examples.
Advanced Few-Shot Lab L4
Welcome to Lab 4. I can help you with chain-of-thought example construction, evaluation set design, dynamic retrieval strategy, or self-consistency implementation. What advanced few-shot challenge would you like to tackle? Describe your task or ask about any of the techniques from the lesson.

Module 3 Test

Few-Shot Examples β€” 15 questions. Score 80% or above to pass.
1. What technical term describes the technique of placing a small number of input-output examples directly in a prompt, with no model weight updates?
Correct. Few-shot in-context learning β€” placing examples in the context window without updating weights β€” is the defining technique.
The technique is called few-shot in-context learning. Fine-tuning involves updating model weights; this technique does not.
2. The GPT-3 paper was published in which year, and what was its subtitle?
Correct. Brown et al., 2020: "Language Models are Few-Shot Learners." The title was itself the central finding.
The correct answer is 2020 β€” "Language Models are Few-Shot Learners" by Brown et al. at OpenAI.
3. Which of the following is the most critical quality dimension to audit when few-shot output is inconsistent, before adding more examples?
Correct. Format inconsistency is the most common cause of inconsistent few-shot outputs and should be audited before any other intervention.
Format consistency is the primary thing to audit. Mismatched delimiters, casing, or field order across examples is the most common source of inconsistent outputs.
4. Zhao et al. (2021, "Calibrate Before Use") found that what aspect of few-shot examples introduces measurable output bias?
Correct. Zhao et al. showed label imbalance in examples shifted output distribution by up to 20+ percentage points toward the over-represented class.
Zhao et al. specifically identified label distribution imbalance as introducing measurable bias β€” not word count or jargon.
5. Min et al. (2022) replaced correct labels with random labels in few-shot examples. What happened to performance?
Correct. The modest drop highlighted that models learn task structure and format from examples, not only label semantics β€” a finding with major practical implications for format hygiene.
Min et al. found a modest drop β€” not collapse β€” when labels were randomized. This showed that format/distribution signal is substantial even without correct labels.
6. For a production classification task, what is the typical accuracy-cost optimal range of few-shot examples for most use cases?
Correct. For most production classification tasks, 2–4 well-crafted examples hit the sweet spot between accuracy improvement and additional token cost per call.
For most production tasks, 2–4 examples is the practical optimum β€” enough to convey format, balance, and edge cases without excessive token cost.
7. The Stripe dashboard use case (2023) used few-shot prompting to achieve what specific outcome in production?
Correct. Five carefully constructed examples reduced parse errors from ~18% to under 2% β€” a real production validation of few-shot for structured output.
The Stripe case showed few-shot examples reducing JSON parse error rates from ~18% to under 2% in production, without fine-tuning.
8. Chain-of-thought (CoT) few-shot differs from standard few-shot in what key way?
Correct. The reasoning trace is what distinguishes CoT β€” it teaches the model a reasoning process, not just an answer format, enabling generalization to novel problems.
CoT adds intermediate reasoning steps between input and output in each example. The trace teaches the model how to reason, not just what format to use.
9. Wei et al. (2022) demonstrated CoT few-shot on PaLM 540B for the GSM8K math benchmark. What accuracy improvement was reported?
Correct. 17.9% β†’ 56.9% on GSM8K was one of the landmark results showing how much reasoning traces in examples can improve multi-step task performance.
Wei et al. reported a jump from 17.9% to 56.9% β€” more than tripling accuracy by including step-by-step reasoning in the few-shot examples.
10. The KATE paper (Liu et al., 2022) tested dynamic example selection. What method outperformed random static examples?
Correct. Semantic similarity retrieval at inference time consistently outperformed random static selection, with gains of 6–10% across several benchmarks.
KATE showed that embedding-based semantic retrieval β€” finding examples closest to the current query β€” outperformed random selection by 6–10%.
11. OpenAI's JSON mode (response_format: {type: "json_object"}) guarantees what, and what does it NOT guarantee?
Correct. JSON mode is a syntax safety net only. Schema specifics β€” field names, types, null conventions β€” require few-shot examples to enforce reliably.
JSON mode guarantees valid, parseable JSON β€” but not your specific schema. Few-shot examples are what enforce field names, types, and null handling.
12. Self-consistency (Wang et al., 2022) extends chain-of-thought by doing what?
Correct. Self-consistency samples N reasoning paths and votes β€” most effective for math, code, and factual tasks where correctness is verifiable and errors are costly.
Self-consistency means sampling multiple completions for the same prompt and taking the majority-vote answer β€” particularly useful when a single wrong answer is costly.
13. When your output schema changes in production, what should be updated FIRST to avoid the model using the old format?
Correct. Examples are the dominant behavioral signal. A stale example overrides correct prose instructions β€” always update examples first when the schema changes.
Examples must be updated first. Because they carry stronger behavioral signal than prose, a stale example will override a correctly updated system message instruction.
14. Google DeepMind's 2024 "Many-Shot In-Context Learning" paper found what about performance as example count grew into the hundreds?
Correct. Many-shot with long-context models showed continued gains on complex tasks past 100 examples β€” expanding the ceiling well beyond traditional 8–10 example guidance.
The DeepMind paper found continued performance gains past 100 examples for complex tasks enabled by long-context windows β€” no plateau was observed for tasks like medical diagnosis.
15. For a production few-shot prompt serving diverse input types, which strategy is most likely to improve accuracy over a static curated example set?
Correct. Dynamic retrieval ensures the model sees examples structurally closest to the current query, outperforming static selection when inputs are diverse β€” as validated by the KATE paper.
Dynamic retrieval (embedding the query and fetching semantically similar examples) consistently outperforms static example sets when input distribution is diverse β€” the key finding of the KATE paper.