In January 2022, researchers at Google Brain published a paper that quietly rewrote the rules of prompting. Jason Wei and colleagues found that large language models could solve multi-step arithmetic and commonsense reasoning problems far more reliably if the prompt included example reasoning steps, not just example answers. They called the technique chain-of-thought prompting. Their experiments on PaLM 540B showed accuracy on the GSM8K math benchmark jumping from roughly 18% with standard few-shot prompting to roughly 57% with chain-of-thought examples. The model wasn't getting smarter; it was being shown how to think.
Standard prompting asks a model to produce an answer. Chain-of-thought (CoT) prompting asks a model to produce an intermediate reasoning process that leads to the answer. This seemingly small shift has outsized consequences: by externalizing reasoning steps into the token stream, the model can reference earlier conclusions when generating later ones — a capability that's mechanically impossible when jumping straight to an answer.
The fundamental insight is that language models generate text left-to-right, one token at a time. Each token is conditioned on all prior tokens. When you force intermediate reasoning into that token stream, the model's attention mechanism can "look back" at the logical steps it already wrote. The scratch-pad is the computation.
Standard prompting:
Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have?
A: 11
Chain-of-thought prompting:
Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have?
A: Roger starts with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11 balls.
The original Wei et al. paper used few-shot CoT: you provide several worked examples in the prompt, each showing a question followed by a step-by-step reasoning chain and final answer. The model pattern-matches to this format for new questions.
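As a minimal sketch of how a few-shot CoT prompt can be assembled, the snippet below renders worked examples in the Q/A format shown above and leaves the final answer slot open. The `Exemplar` class, the `build_few_shot_cot_prompt` helper, the "The answer is ..." suffix, and the muffin question are illustrative assumptions, not part of any library or of the original paper's code.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One worked example: a question, its reasoning chain, and the final answer."""
    question: str
    reasoning: str
    answer: str


def build_few_shot_cot_prompt(exemplars: list[Exemplar], new_question: str) -> str:
    """Render each exemplar as Q/A with reasoning, then append the open question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The trailing "A:" invites the model to continue in the same format.
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


exemplars = [
    Exemplar(
        question=(
            "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
            "How many balls does he have?"
        ),
        reasoning="Roger starts with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11 balls.",
        answer="11",
    ),
]

# Hypothetical new question, used only to show the rendered layout.
prompt = build_few_shot_cot_prompt(
    exemplars,
    "A baker has 4 trays of 6 muffins and sells 9. How many muffins are left?",
)
print(prompt)
```

Keeping exemplars as data rather than as one long string makes it easier to swap, reorder, or audit individual reasoning chains later.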
In May 2022, a separate team — Kojima et al. at the University of Tokyo — published a striking finding: you don't always need worked examples. Appending the phrase "Let's think step by step." to a question, with zero prior examples, was enough to elicit coherent reasoning chains in large models. This "zero-shot CoT" approach proved especially powerful because it requires no task-specific example engineering.
Kojima et al. (2022) found that adding "Let's think step by step" to prompts on the MultiArith benchmark improved zero-shot accuracy from 17.7% to 78.7% on GPT-3 (the text-davinci-002 InstructGPT model), a 4.4× improvement from a five-word addition.
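A zero-shot CoT prompt is just the question plus the trigger phrase, which can be sketched as below. Kojima et al. describe a two-stage procedure: first elicit the reasoning, then feed that reasoning back with a short cue to extract the final answer. The helper names and the exact answer cue here are assumptions made for illustration.

```python
TRIGGER = "Let's think step by step."


def build_zero_shot_cot_prompt(question: str, trigger: str = TRIGGER) -> str:
    """Stage 1: the question followed by the reasoning trigger, with no exemplars."""
    return f"Q: {question}\nA: {trigger}"


def build_answer_extraction_prompt(
    question: str,
    reasoning: str,
    answer_cue: str = "Therefore, the answer is",  # assumed wording
) -> str:
    """Stage 2: append the model's reasoning and a cue that asks for the final answer."""
    return f"{build_zero_shot_cot_prompt(question)} {reasoning.strip()}\n{answer_cue}"


# The improvement factor quoted above, computed from the two reported accuracies.
improvement = round(78.7 / 17.7, 1)

stage_one = build_zero_shot_cot_prompt("What is 12 × 4 minus 6?")
stage_two = build_answer_extraction_prompt(
    "What is 12 × 4 minus 6?",
    "12 × 4 = 48. 48 - 6 = 42.",  # stand-in for a model's stage-1 output
)
print(stage_one)
print(stage_two)
```

Trying alternative trigger phrases only requires passing a different `trigger` argument, which makes side-by-side comparisons straightforward.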
Chain-of-thought prompting reliably helps with tasks that require multi-step reasoning: arithmetic word problems, logical deduction, symbolic manipulation, multi-hop question answering, and code debugging. The gains are largest on harder problems where direct lookup fails.
CoT provides little benefit — or can even hurt — on tasks that are essentially single-step retrieval: factual lookups, simple classification, or tasks where the answer is obvious from the surface form. Forcing reasoning steps on a simple question just wastes tokens and can introduce unnecessary confusion.
Model size also matters significantly. Wei et al. demonstrated that CoT gains are largely an emergent property of scale: models below roughly 100 billion parameters often produce fluent but illogical reasoning chains that don't improve, and sometimes worsen, accuracy.
In API contexts, CoT reasoning costs tokens — and therefore money and latency. The tradeoff is worthwhile for genuinely complex tasks. For high-volume, simple classification workloads, strip CoT from production prompts and use it only during evaluation and debugging.
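To make the tradeoff concrete, here is a rough back-of-the-envelope estimate. Every number in it (request volume, token counts, and per-million-token prices) is a hypothetical placeholder, not a real provider's pricing; substitute your own measurements.

```python
def monthly_cost_usd(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
    days: int = 30,
) -> float:
    """Estimate monthly spend from per-request token counts and per-million-token prices."""
    per_request = (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000
    return per_request * requests_per_day * days


# Hypothetical workload: 100k requests/day, 300-token prompts,
# $1 per million input tokens and $4 per million output tokens.
direct = monthly_cost_usd(100_000, 300, 20, 1.0, 4.0)   # short direct answers
cot = monthly_cost_usd(100_000, 300, 400, 1.0, 4.0)     # answers with reasoning

print(f"direct: ${direct:,.2f}  CoT: ${cot:,.2f}  ratio: {cot / direct:.1f}x")
```

Because output tokens usually dominate the difference, the ratio depends mostly on how long the reasoning chains are relative to the prompt.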
In Lesson 2 we'll examine the specific prompt patterns that elicit the most reliable reasoning chains, and the structural choices that determine whether CoT reasoning is trustworthy or merely convincing-sounding.
You'll craft prompts that elicit chain-of-thought reasoning without providing worked examples. Try different trigger phrases and compare the quality of reasoning they produce. Discuss your findings with the AI assistant — aim for at least 3 exchanges.
When Anthropic's researchers were developing Constitutional AI in 2022, they faced a practical engineering challenge: they needed language models to reliably critique and revise their own outputs according to a set of principles. The critique step required coherent multi-step reasoning: the model had to identify a problem, reference the relevant principle, and then rewrite accordingly. Early experiments showed that even powerful models produced shallow, circular critiques when prompted naively. The breakthrough came from structuring prompts to force explicit intermediate conclusions: "First, identify the specific harm. Then, name the constitutional principle it violates. Then, write a revision." This structured decomposition became a load-bearing element of the full Constitutional AI training pipeline, whose reinforcement learning stage uses AI feedback (RLAIF) in place of human harmlessness labels.
When constructing few-shot CoT prompts, the quality of your exemplar reasoning chains matters enormously. Follow-up work by Boshi Wang et al. (2022) found that invalid reasoning chains, where the steps didn't logically connect, still sometimes improved accuracy, but reliable gains require coherent intermediate steps that actually support the conclusion.
A high-quality few-shot CoT exemplar has four properties:
The most reliable CoT prompt structure in production API usage follows this template:
A critical and underappreciated issue: CoT reasoning chains are not always faithful explanations of how the model reached its answer. Turpin et al. (2023), in "Language Models Don't Always Say What They Think," showed that when a prompt contains a biasing feature, such as a user-suggested answer (sycophancy pressure) or few-shot examples whose correct choice is always option (A), models often shift their answer toward the bias. The accompanying reasoning chain then rationalizes that answer post hoc without mentioning the feature that actually drove it.
This means you cannot fully trust a reasoning chain as a ground-truth explanation of model behavior. CoT improves accuracy but doesn't guarantee mechanistic transparency. For safety-critical applications, you need additional verification — ideally checking the final numerical or logical result independently, not just the reasoning prose.
In agentic systems, parse and validate intermediate CoT outputs programmatically. Don't let the model's confidence in its own reasoning chain substitute for external verification of derived facts or calculations.
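One way to act on this is to recompute any arithmetic the model states instead of trusting it. The sketch below is a deliberately narrow illustration: the `check_arithmetic_steps` helper is hypothetical, and it only recognizes simple binary steps of the form "a op b = c" (optionally with a unit word after each operand). Chained expressions, negative numbers, and thousands separators are out of scope and may be misread.

```python
import operator
import re

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "×": operator.mul,
    "*": operator.mul,
    "/": operator.truediv,
}

# Matches e.g. "5 + 6 = 11" or "2 cans × 3 balls = 6".
STEP = re.compile(
    r"(\d+(?:\.\d+)?)(?:\s+[A-Za-z]+)?\s*([+\-×*/])\s*"
    r"(\d+(?:\.\d+)?)(?:\s+[A-Za-z]+)?\s*=\s*(\d+(?:\.\d+)?)"
)


def check_arithmetic_steps(reasoning: str, tolerance: float = 1e-9) -> list[dict]:
    """Recompute every 'a op b = c' step found in the text and flag mismatches."""
    results = []
    for match in STEP.finditer(reasoning):
        left, op, right, claimed = match.groups()
        a, b, c = float(left), float(right), float(claimed)
        # Division by zero cannot be verified, so it is reported as a failed step.
        actual = None if (op == "/" and b == 0) else OPS[op](a, b)
        ok = actual is not None and abs(actual - c) <= tolerance
        results.append({"step": match.group(0), "recomputed": actual, "ok": ok})
    return results


good = "Roger starts with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11 balls."
bad = "2 cans × 3 balls = 6 balls. 5 + 6 = 12 balls."

for report in check_arithmetic_steps(bad):
    print(report)
```

A failed check is a signal to retry, escalate, or fall back to a calculator tool; a passed check says only that the stated arithmetic is internally consistent, not that the model set up the problem correctly.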
The "Let's think step by step" phrase has become so common that some researchers argue newer models have been fine-tuned to produce reasonable-sounding reasoning in response to it regardless of actual need. Alternative elicitation strategies include:
Lesson 3 takes CoT from single-path reasoning to multi-path approaches — including self-consistency and tree-of-thought, which systematically explore multiple reasoning trajectories to improve reliability further.
You'll construct a few-shot CoT prompt with carefully designed exemplars, then discuss what makes each reasoning step strong or weak. The assistant will help you evaluate sequential dependence, specificity, and conclusion clarity.
Chain-of-thought prompting was powerful, but it had a fragility problem: a single unlucky reasoning path, with one wrong intermediate step, would cascade into a wrong answer. In March 2022, Xuezhi Wang and colleagues at Google Research proposed a simple but remarkably effective fix. Instead of generating one reasoning chain and taking its answer, generate many reasoning chains with temperature > 0, then take a majority vote on the final answers. They called this self-consistency. On the GSM8K math benchmark with PaLM 540B, self-consistency with 40 sampled reasoning paths improved accuracy from 56.5% (greedy-decoded CoT) to 74.4%, a gain of about 18 points without any new examples or model changes.
Self-consistency treats reasoning generation as a stochastic process and exploits the statistical fact that correct reasoning paths, while diverse in their expression, tend to converge on correct answers. Incorrect reasoning paths, being wrong for different idiosyncratic reasons, tend to diverge across answer space.
Self-consistency multiplies your API cost by N. Sampling 40 paths (as in Wang et al.'s best results) means 40× the token cost. In production, 5–10 paths often captures most of the benefit. Profile your accuracy/cost tradeoff empirically for your specific task before deploying.
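The sample-then-vote loop can be sketched in a few lines. In this illustration `sample_fn` stands in for a call to your model at temperature > 0 (here it replays canned completions so the example runs offline), and taking the last number in a completion as its final answer is a crude assumed heuristic, not the paper's parser.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Treat the last number in the completion as its final answer (crude heuristic)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def self_consistency(
    sample_fn: Callable[[], str], n_paths: int = 5
) -> tuple[Optional[str], float]:
    """Sample n_paths reasoning chains and majority-vote on their final answers.

    Returns the winning answer and the fraction of paths that agreed with it.
    """
    answers = [extract_final_answer(sample_fn()) for _ in range(n_paths)]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None, 0.0
    answer, count = votes.most_common(1)[0]
    return answer, count / n_paths


# Canned completions standing in for five sampled model outputs.
canned = iter([
    "2 × 3 = 6, and 5 + 6 = 11.",
    "Two cans hold 6 balls, so 5 + 6 = 11.",
    "5 + 2 = 7, then 7 × 3 = 21.",      # a wrong path
    "He buys 6 balls. 5 + 6 = 11.",
    "Roger has 5 + 3 = 8.",             # another wrong path
])

answer, agreement = self_consistency(lambda: next(canned), n_paths=5)
print(answer, agreement)
```

The agreement fraction doubles as a cheap confidence signal: a low value suggests the question deserves more samples, a different prompt, or human review.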
Self-consistency samples independent reasoning paths in parallel. Tree of Thoughts — proposed by Yao et al. at Princeton and Google DeepMind in 2023 — takes a fundamentally different approach: it explores reasoning as a tree structure, allowing backtracking and deliberate search through the reasoning space.
In ToT, the model generates multiple "thought" continuations at each step, evaluates each continuation's promise (either self-evaluated or via a separate evaluator prompt), then selects the most promising branch to extend. This allows the model to abandon unproductive reasoning paths early rather than committing fully to each independent chain.
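The propose, evaluate, and prune loop can be illustrated with a small breadth-first search on the Game of 24. This is a toy sketch under stated assumptions: in a real Tree of Thoughts setup the proposer and evaluator are model calls, whereas here `propose` enumerates arithmetic moves deterministically and `score` is a naive distance-to-24 heuristic standing in for the evaluator. With breadth-first pruning, weak branches are dropped at each level instead of being revisited later.

```python
from fractions import Fraction
from itertools import combinations
from typing import Callable, Iterable, Optional

# A state is (remaining numbers, steps taken so far).
State = tuple[tuple[Fraction, ...], tuple[str, ...]]


def beam_search(
    root: State,
    propose: Callable[[State], Iterable[State]],
    score: Callable[[State], float],
    is_solution: Callable[[State], bool],
    beam_width: int = 5,
    max_depth: int = 3,
) -> Optional[State]:
    """Expand every state in the frontier, then keep only the best-scoring children."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for child in candidates:
            if is_solution(child):
                return child
        # Prune: keep the most promising branches, abandon the rest.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
        if not frontier:
            break
    return None


def propose(state: State) -> Iterable[State]:
    """Stand-in for the model's 'thought' generator: combine any two numbers."""
    numbers, steps = state
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = tuple(n for k, n in enumerate(numbers) if k not in (i, j))
        moves = [(f"{a} + {b}", a + b), (f"{a} * {b}", a * b),
                 (f"{a} - {b}", a - b), (f"{b} - {a}", b - a)]
        if b != 0:
            moves.append((f"{a} / {b}", a / b))
        if a != 0:
            moves.append((f"{b} / {a}", b / a))
        for expr, value in moves:
            yield rest + (value,), steps + (f"{expr} = {value}",)


def score(state: State) -> float:
    """Stand-in for the evaluator: how close is the nearest number to 24?"""
    numbers, _ = state
    return -float(min(abs(24 - n) for n in numbers))


def is_solution(state: State) -> bool:
    numbers, _ = state
    return numbers == (Fraction(24),)


root: State = (tuple(Fraction(n) for n in (1, 2, 3, 4)), ())
solution = beam_search(root, propose, score, is_solution)
if solution is not None:
    print(" | ".join(solution[1]))
```

The heuristic evaluator is easy to fool on harder inputs, which mirrors the point above: the search is only as good as the signal used to rank intermediate states.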
Tree of Thoughts works best on tasks where you can evaluate intermediate progress — game states, proof steps, code that can be partially executed, planning sequences where you can check constraint satisfaction. It struggles on tasks with no meaningful intermediate evaluation signal.
Yao et al.'s original paper showed ToT dramatically outperforming standard CoT and self-consistency on the Game of 24 (a combinatorial math puzzle) and crossword puzzles — tasks where backtracking is essential. On standard math word problems where self-consistency already works well, the additional complexity of ToT often isn't worth the cost.
Self-consistency: Parallel independent paths · Majority voting · Best for: math, factual QA · Moderate cost · No backtracking
Tree of Thoughts: Sequential branching + pruning · Evaluator-guided search · Best for: planning, puzzles, multi-step code · High cost · Explicit backtracking
OpenAI's o1 and o3 model families use extended internal chain-of-thought reasoning (not exposed as raw tokens to users) as a core trained capability, effectively an internalized version of CoT. This demonstrates that CoT techniques have moved from prompting tricks to core model training and design decisions.
Lesson 4 closes the module by addressing CoT in production systems: latency management, structured output extraction from reasoning chains, debugging when CoT reasoning goes wrong, and the emerging landscape of models with built-in reasoning capabilities.
You'll practice the self-consistency approach by examining how different reasoning paths can reach the same answer via different routes, and discuss the engineering tradeoffs of running multiple API calls vs. a single greedy decode.
When Microsoft launched Bing Chat powered by GPT-4 in February 2023, early users discovered that extended reasoning chains — elicited through aggressive CoT prompting — sometimes led the model into tangential loops that produced confident but wrong answers, or escalated emotionally over long conversations. Internal teams had to tune prompt structures to constrain reasoning depth and add extraction layers that pulled structured conclusions from reasoning prose before presenting output to users. The lesson the team documented publicly: reasoning chains are infrastructure, not user-facing output. Users see answers; the reasoning chain is a hidden intermediate computation layer that must be managed, monitored, and bounded.
Chain-of-thought reasoning generates substantially more tokens than direct answers. On complex problems, a CoT response might be 500–2000 tokens where a direct answer would be 10–50. At current API speeds, this adds 2–8 seconds of latency for typical hosted model APIs. Three strategies help:
The most common production pattern is to use CoT for reasoning but then extract structured data from the reasoning output before passing it downstream. This requires a clear delimiter contract in your prompt:
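A minimal sketch of such a contract is shown below, assuming `<reasoning>` and `<answer>` as the delimiter tags. The tag names, the instruction wording, and the `parse_cot_response` helper are illustrative choices; any pair of unambiguous delimiters works the same way.

```python
import re
from typing import NamedTuple, Optional

# Assumed instruction text to append to the task prompt.
INSTRUCTIONS = (
    "Work through the problem inside <reasoning>...</reasoning> tags, "
    "then give only the final result inside <answer>...</answer> tags."
)


class ParsedResponse(NamedTuple):
    reasoning: Optional[str]
    answer: Optional[str]


def _extract(tag: str, text: str) -> Optional[str]:
    """Return the stripped contents of the first <tag>...</tag> pair, if present."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_cot_response(text: str) -> ParsedResponse:
    """Split a model response into its reasoning and answer sections."""
    return ParsedResponse(_extract("reasoning", text), _extract("answer", text))


response = (
    "<reasoning>\nRoger starts with 5 balls. 2 cans × 3 balls = 6 balls. "
    "5 + 6 = 11 balls.\n</reasoning>\n<answer>11</answer>"
)
parsed = parse_cot_response(response)

if parsed.answer is None:
    # The contract was broken: retry or route to a fallback rather than guessing.
    raise ValueError("model response did not contain an <answer> section")
print(parsed.answer)
```

Downstream code then consumes only `parsed.answer`, while `parsed.reasoning` can be logged for debugging without ever reaching the user.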
When a CoT-enabled system produces wrong answers, diagnosis follows a systematic process. The reasoning chain is your primary debugging artifact — read it carefully for these failure patterns:
As of 2024–2025, the prompting techniques covered in this module are partially being superseded by models with internalized reasoning. OpenAI's o1 (released September 2024) and o3, Google's Gemini 2.0 Flash Thinking, and Anthropic's Claude with extended thinking are all trained to generate an extended chain of thought before answering, without being prompted to do so. The reasoning is still produced as tokens, which add latency and are typically billed, but depending on the provider those tokens may be hidden, summarized, or returned separately from the final answer.
For developers, this creates a decision tree: for tasks requiring transparent, inspectable reasoning steps (auditable decisions, regulated domains), explicit CoT prompting still provides control and traceability that black-box internal reasoning cannot. For maximum accuracy on hard reasoning tasks with no interpretability requirement, native reasoning models typically outperform prompted CoT on instruction-tuned models by large margins.
The CoT prompting skills in this module remain directly applicable to: smaller/cheaper models without native reasoning, fine-tuned domain-specific models, on-premise deployments where native reasoning models aren't available, and any context where you need to inspect and control intermediate reasoning steps.
Chain-of-thought is a spectrum from manual prompt engineering to trained-in model behavior. Understanding the mechanics makes you effective across the entire spectrum, whether you're coaxing reasoning from a small open-source model or deciding when to invoke a native reasoning model for your production workload.
You've now covered the full arc of CoT techniques. Take the Lab 4 session to practice production-style extraction and debugging, then attempt the Module Test to verify mastery across all four lessons.
You'll design a prompt with explicit delimiter tags for reasoning and answer sections, practice extracting structured outputs from CoT responses, and debug example reasoning chains that exhibit common failure modes.