In June 2020, OpenAI published the GPT-3 paper, "Language Models are Few-Shot Learners." The title was the finding. Researchers discovered that a large language model shown just a handful of input-output pairs in its context window could perform tasks it had never been explicitly fine-tuned for β translation, arithmetic reasoning, even novel analogical reasoning β at accuracy levels that rivaled purpose-trained models. The technique required no gradient updates, no retraining, no labeled dataset of thousands. Just examples, placed directly in the prompt.
Few-shot prompting means prepending one or more worked examples to your actual request. Each example is an input-output pair that demonstrates exactly what you want. The model reads your examples, infers the pattern, and applies it to the new input you provide.
This is distinct from zero-shot prompting (no examples, just instructions) and fine-tuning (updating model weights with thousands of examples). Few-shot sits in the middle: no weight changes, but far more guidance than zero-shot alone.
Suppose you want a model to classify customer support tickets as BILLING, TECHNICAL, or ACCOUNT. Zero-shot:
Classify the following ticket into one of: BILLING, TECHNICAL, ACCOUNT.
Ticket: "I can't log in after changing my email."
Ticket: "My invoice shows twice the amount." β BILLING
Ticket: "The app crashes on startup." β TECHNICAL
Ticket: "I can't log in after changing my email." β ?
The few-shot version shows the model exactly what a "BILLING" response looks like, what a "TECHNICAL" one looks like, and β crucially β that the answer is a single label with no additional text. The model learns format, vocabulary, and brevity all at once from those two lines.
On the SuperGLUE benchmark, GPT-3 with few-shot prompting matched or exceeded fine-tuned BERT-large on several tasks β despite having seen only a handful of in-context examples rather than thousands of labeled training samples. The researchers noted that performance scaled with the number of examples, with diminishing returns past roughly 8β10 examples for most tasks.
Instructions are inherently ambiguous. "Write a brief summary" leaves open: How brief? What tense? Does it include the conclusion? One sentence or three? Examples answer all of these implicitly β the model sees the exact output schema and matches it.
This is why, in practice, teams at companies like Anthropic and OpenAI recommend leading with examples whenever you have a fixed output format. The 2023 Anthropic prompt engineering guide states: "If you want Claude to output JSON with a specific schema, showing an example output is faster and more reliable than describing every field in prose."
When your task has a consistent format β structured data extraction, classification, code transformation, templated writing β start with 2β3 examples. When your task is open-ended reasoning or creative work, instructions often serve better, since rigid examples can over-constrain the model's approach.
In this lab you'll practice writing few-shot prompts. Ask the assistant to help you build a few-shot prompt for a classification or data-extraction task of your choice. Try at least one of these challenges:
In 2021, researchers at Google Brain published "Calibrate Before Use: Improving Few-Shot Performance of Language Models." They found that label bias in the examples themselves β such as always listing "positive" examples last β could shift the model's output distribution significantly. The order, label balance, and even surface formatting of examples affected accuracy, sometimes by 20+ percentage points on classification benchmarks. Few-shot prompting was powerful but fragile in ways that weren't obvious until studied carefully.
Representativeness. Each example should cover a case type the model will actually encounter. If your real inputs are long paragraphs but your examples are two-word snippets, the model will calibrate to the wrong input distribution.
Label Balance. For classification tasks, include examples from each class. A prompt with 3 "positive" examples and 1 "negative" example will bias the model toward positive, as confirmed by Zhao et al. (2021) in the Calibrate Before Use paper.
Format Consistency. Every example must use the exact same format. If one example uses "Answer: Yes" and another uses "Yes.", the model will oscillate between formats on new inputs.
Correct Labels. This sounds obvious, but matters deeply: a 2022 study by Min et al. at the University of Washington ("Rethinking the Role of Demonstrations") found that even randomly-labeled examples improved performance over zero-shot β but correctly-labeled examples still outperformed random labels by a meaningful margin on tasks requiring precise classification.
Coverage of Edge Cases. If there's a common boundary condition in your task β e.g., a support ticket that is both billing AND technical β include an example showing how you want that handled, even if it appears rarely.
The 2022 University of Washington paper "Rethinking the Role of Demonstrations for In-Context Learning" produced a striking result: when researchers replaced correct labels with random labels in few-shot examples, GPT-3's performance dropped, but only modestly β often staying within 5β10% of full-accuracy performance on sentiment and topic classification.
The interpretation: models are partly learning format and input distribution from examples, not just label semantics. The input-output structure tells the model what kind of answer is expected. The actual label correctness adds additional signal on top of that structural information.
For developers, this has a practical implication: format consistency is not optional. A prompt where examples have mismatched whitespace, punctuation, or field ordering will harm performance even if all labels are correct.
Before blaming example count for poor few-shot performance, audit format first. Check that every example has identical delimiters, identical field order, and identical label style. In many real-world debugging cases, a format inconsistency is the culprit β not insufficient examples.
Input: "great product" β Positive
Input: "okay" β neutral
Input: "I loved it so much!" β positive sentiment
Input: "Terrible." β NEGATIVE
Input: "great product" β POSITIVE
Input: "okay, I guess" β NEUTRAL
Input: "Terrible experience." β NEGATIVE
Input: "Works as described." β NEUTRAL
The problematic set has four different label formats (Positive, neutral, positive sentiment, NEGATIVE), unbalanced classes (3 positive, 1 negative, 0 neutral), and inconsistent input lengths. The well-formed set uses all-caps labels consistently, covers all three classes, and shows realistic input variation.
Each example adds tokens. On GPT-4o at the 2024 pricing tier, a prompt with 6 lengthy examples could cost 3β5Γ more per call than a 2-example prompt. Measure accuracy vs. example count empirically β for most production classification tasks, 2β4 well-chosen examples hit the accuracy sweet spot at a fraction of the token cost of 8+ examples.
This lab focuses on critiquing and improving existing few-shot examples. Try one or more of these challenges with the assistant:
In 2023, Stripe's developer blog described using few-shot prompting to power a natural-language-to-API-query interface for their dashboard. The core challenge: getting GPT-4 to reliably return valid Stripe API filter objects as JSON, without wrapping them in prose or markdown code fences. Their solution was a five-example prompt where each example showed a natural-language question followed by a bare JSON object β no explanation, no code fences, just the object. Parse error rates in production dropped from roughly 18% (zero-shot) to under 2% (five-shot). No fine-tuning was required.
Language models are trained to produce natural text. When you ask them to produce JSON, they often add preambles ("Here is the JSON:"), use inconsistent quoting, add trailing commas, or wrap output in markdown fences. For an application that calls JSON.parse() on the response, any of these artifacts causes a runtime error.
Few-shot examples are the most direct solution: show the model exactly what valid output looks like β no prose, no fences, no explanation β and it will match that pattern.
Note what the examples accomplish: they establish JSON without code fences, double-quoted keys, and the null-handling pattern for missing data. Three examples cover the happy path and the edge case, making the schema unambiguous.
Most production APIs support a system message distinct from the conversation. A powerful pattern is to put your schema description and instructions in the system message, and put your few-shot examples in the first few user/assistant turns. This separates concerns cleanly: instructions explain intent, examples demonstrate format.
OpenAI's API documentation (2024) explicitly recommends this pattern for JSON-mode tasks, noting that few-shot examples in the conversation history significantly improve schema adherence even when JSON mode is enabled β because examples reinforce not just that JSON is required, but which fields, types, and null conventions to use.
JSON mode (available in OpenAI's API via response_format: {type: "json_object"}) guarantees parseable JSON but does not guarantee your specific schema. Few-shot examples fill that gap. Use both: JSON mode as a safety net, examples to pin the schema.
The same principle applies to code tasks. If you want to transform JavaScript functions into TypeScript equivalents with a specific annotation style, write two or three examples showing the exact input and output. The model learns your team's type annotation conventions, naming patterns, and comment style from examples far more reliably than from written style guides.
GitHub's Copilot team has noted in public talks that the most effective internal use of few-shot is for code refactoring tasks with idiosyncratic conventions β cases where the desired output is correct by company-specific standards that aren't in the model's training data.
When your output schema changes in production, update your examples first β before updating your schema description. The examples are what the model pattern-matches against most strongly. An inconsistency between your prose description and your examples will usually resolve in favor of the examples.
response_format: {type: "json_object"}), why should you still include few-shot examples?Practice building few-shot prompts for structured output tasks. Work through one or more of these with the assistant:
In 2022, Wei et al. at Google introduced chain-of-thought (CoT) prompting in the paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." The key contribution was showing that few-shot examples containing step-by-step reasoning traces β not just input-output pairs β dramatically improved performance on multi-step math and logic tasks. On the GSM8K grade-school math benchmark, CoT few-shot prompting improved PaLM 540B accuracy from 17.9% to 56.9%. The example itself was doing the cognitive work of teaching the model to reason, not just format.
Standard few-shot: Input β Output. Chain-of-thought few-shot: Input β Reasoning Steps β Output. By including the intermediate reasoning in your examples, you teach the model to reason through new problems the same way, rather than jumping to an answer.
The reasoning trace in the example teaches the model to show its work β not because you instructed it to, but because it pattern-matched your example. This is more reliable than writing "show your reasoning" in a system message, especially for models that tend to skip steps.
Static few-shot prompts use the same examples for every query. For diverse inputs, a better approach is dynamic retrieval: maintain a library of labeled examples and retrieve the most similar ones to the current input at runtime using embedding-based search.
This was formalized in the KATE paper (Liu et al., 2022 β "What Makes Good In-Context Examples for GPT-3?"). The finding: examples retrieved by semantic similarity to the query outperformed random static examples by 6β10% on several NLP benchmarks. The intuition: a model helps more when the examples it sees are structurally similar to the problem at hand.
A practical implementation: store 50β100 curated (input, output) pairs in a vector database. At inference time, embed the incoming query, retrieve the 3 most similar examples, inject them into the prompt. Libraries like LangChain and LlamaIndex have built-in example selectors that implement this pattern with minimal boilerplate.
Few-shot prompting should be treated as an engineering artifact subject to measurement, not a qualitative judgment. A practical evaluation loop:
Build an evaluation set. Collect 50β200 real inputs with known correct outputs. This is your ground truth. Even a small eval set exposes systematic failures invisible in spot-checking.
Baseline with zero-shot. Run your eval set with no examples. Record accuracy or your chosen metric. This establishes what the model knows without examples.
Sweep example counts. Run eval with 1, 2, 4, 8 examples. Plot accuracy vs. example count. The curve usually flattens β find the knee point where adding examples no longer meaningfully improves accuracy.
Test example selection strategies. Compare random static examples vs. curated static examples vs. dynamically retrieved examples. For high-variance input distributions, dynamic retrieval almost always wins.
Track cost-per-correct-output. More examples improve accuracy but increase token cost per call. The optimal example count is where marginal accuracy gain per additional example exceeds marginal cost.
Google DeepMind's 2024 paper "Many-Shot In-Context Learning" extended few-shot research to hundreds of examples enabled by long-context models (Gemini 1.5 Pro with 1M token context). For complex tasks β medical diagnosis, legal reasoning β performance continued improving past 100 examples with no saturation plateau. For most production use cases, 2β8 examples remain optimal on cost grounds, but the research suggests ceiling is much higher when task complexity justifies it.
Wang et al. (2022, Google Brain) introduced self-consistency as an extension of chain-of-thought few-shot: sample multiple reasoning paths for the same question, then take the majority-vote answer. On the MATH benchmark, self-consistency + CoT few-shot improved accuracy by 17.9 percentage points over single-path CoT. The technique is particularly effective for any task with a verifiable correct answer β math, code, factual lookup β where running the model 5β10 times and voting is cheaper than the cost of errors.
Explore advanced few-shot techniques with the assistant. Try one or more challenges:
response_format: {type: "json_object"}) guarantees what, and what does it NOT guarantee?