In early 2022, Google Brain researcher Jason Wei and colleagues noticed something strange. When they prefaced math word problems with a few worked examples, each with its reasoning written out step by step, accuracy on multi-step arithmetic jumped dramatically: from below 20% on some benchmarks to above 50%. The model had not been retrained. Nothing had changed except the prompt.
Their paper, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," first posted to arXiv in January 2022, would become one of the most cited AI papers of the decade. The insight was deceptively simple: intermediate reasoning steps, when made explicit, restructure how a model processes a problem.
A language model predicts the next token based on everything that came before it. When you ask a complex question directly — "What is 17% of 340?" — the model must leap from question to answer in a single bound. The probability distribution over answer tokens is shaped by training patterns, not by a deliberate computation the model just performed.
Chain-of-thought changes this. By prompting the model to generate intermediate steps — "First, 10% of 340 is 34. Then 7% of 340 is 23.8. Adding them: 57.8" — each reasoning step becomes context for the next. The model is, in a meaningful sense, using its own prior output as a scratchpad. The answer token now follows from a sequence of coherent reasoning tokens, not a cold jump from question to conclusion.
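The decomposition in that example can be written out as ordinary arithmetic, with each line reusing the result of the one before it. This is only an illustration of the scratchpad pattern, not a claim about what happens inside a model; the variable names are ours.

```python
from fractions import Fraction

# Exact fractions keep the intermediate steps free of floating-point noise.
base = 340

ten_percent = Fraction(10, 100) * base     # step 1: 10% of 340
seven_percent = Fraction(7, 100) * base    # step 2: 7% of 340
total = ten_percent + seven_percent        # step 3: add the partial results

print(float(ten_percent), float(seven_percent), float(total))
```

Each step is small enough to check on its own, which is the property the written-out chain gives the model.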
Researchers including Denny Zhou at Google DeepMind have described the mechanism as a computational scratchpad. The model's context window is finite but rich. By writing out steps, the model distributes the cognitive load of the problem across many tokens rather than compressing it into one. Each intermediate conclusion anchors the next inference.
This is not entirely unlike how humans benefit from writing out solutions to hard problems. The act of writing doesn't just communicate the answer — it constrains what you can say next, forcing logical coherence. The same constraint operates on a language model, though the underlying mechanism (statistical next-token prediction) differs fundamentally from human cognition.
Importantly, CoT doesn't help for simple factual recall. Asking "What is the capital of France?" gains nothing from chain-of-thought. The benefit appears specifically on tasks requiring multiple sequential reasoning steps — math, logic, commonsense inference, code debugging, multi-hop factual retrieval.
Chain-of-thought prompting only emerged as a useful capability in models above roughly 100 billion parameters. In smaller models, asking for step-by-step reasoning produced fluent but incorrect chains — confident nonsense. This "emergent" threshold was a major finding in itself, suggesting CoT ability scales with model size in a non-linear way.
Wei et al.'s chain-of-thought paper first appeared in January 2022. Over the following year, step-by-step reasoning became a common feature of how frontier models from OpenAI, Anthropic, and Google were prompted, evaluated, and presented. The phrase "Let's think step by step," introduced a few months later by Kojima et al., became so widely used that subsequent papers referenced it as a quasi-standard baseline: an almost accidental landmark in prompt engineering history.
You'll practice the core chain-of-thought technique: asking an AI to show its intermediate reasoning steps. Try giving a math or logic problem and compare the result with and without the "step by step" instruction. Aim for at least 3 exchanges.
In May 2022, a few months after Wei's paper first appeared, a team at the University of Tokyo and Google published a follow-up that surprised even the original authors. Takeshi Kojima and collaborators showed that you didn't need elaborate example chains at all. Adding the single phrase "Let's think step by step" after a question, with no examples whatsoever, produced large accuracy gains. They called it Zero-Shot-CoT, and it worked across arithmetic, symbolic reasoning, and commonsense tasks.
This was significant for a practical reason: constructing good few-shot examples requires human effort and expertise. Zero-shot CoT democratized the technique, because anyone could use it by adding five words to their prompt.
In few-shot CoT, you provide the model with several complete examples: a question, a full step-by-step reasoning chain, and the correct answer. The model then applies the same reasoning format to a new question. This is the approach used in Wei et al.'s original paper, where the examples were carefully constructed by researchers.
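A few-shot CoT prompt of this shape can be assembled mechanically. The sketch below is a minimal illustration: the `Demo` class, the `build_few_shot_prompt` helper, and the "Q:/A:" layout are our own assumptions, not the exact format used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One worked example: question, written-out reasoning, final answer."""
    question: str
    reasoning: str
    answer: str


def build_few_shot_prompt(demos, new_question):
    """Join the worked examples, then leave the final answer slot open."""
    blocks = [
        f"Q: {d.question}\nA: {d.reasoning} The answer is {d.answer}."
        for d in demos
    ]
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


demos = [
    Demo(
        question="What is 17% of 340?",
        reasoning="10% of 340 is 34. 7% of 340 is 23.8. 34 + 23.8 = 57.8.",
        answer="57.8",
    ),
]
prompt = build_few_shot_prompt(demos, "What is 15% of 260?")
print(prompt)
```

The prompt ends on an open "A:" so that the model continues in the same reasoning-then-answer format as the examples above it.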
The quality of examples matters enormously. In 2022, researchers at DeepMind (Shi et al.) showed that including even one misleading reasoning step in a few-shot example could dramatically degrade performance — the model would faithfully imitate the flawed reasoning pattern. This highlights a key risk: few-shot CoT transfers the format of reasoning, good or bad.
The costs of few-shot CoT: it requires expert-constructed examples; if the examples contain errors, the model imitates them; the examples take significant prompt space; and they must be re-crafted for each new domain.
The benefits: few-shot CoT precisely calibrates the reasoning style, works on domain-specific tasks where zero-shot fails, and lets you specify exactly what kind of reasoning you want, whether causal, mathematical, or legal.
Kojima et al. identified a two-stage structure in effective zero-shot CoT. First prompt: "[Question] Let's think step by step." This generates a reasoning chain. Second prompt: "[Question] [Generated reasoning chain] Therefore, the answer is:" This extracts the final answer from the chain. The two-stage approach outperformed a single combined prompt.
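The two-stage structure can be sketched as two calls to the same model. Everything named here is an assumption for illustration: `generate` stands for whatever function sends a prompt to your model and returns text, and `fake_generate` is a hard-coded stand-in so the example runs without a model.

```python
from typing import Callable

TRIGGER = "Let's think step by step."
EXTRACT = "Therefore, the answer is:"


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> tuple[str, str]:
    # Stage 1: elicit a reasoning chain.
    reasoning_prompt = f"Q: {question}\nA: {TRIGGER}"
    chain = generate(reasoning_prompt).strip()
    # Stage 2: feed the chain back and ask only for the final answer.
    answer_prompt = f"{reasoning_prompt} {chain}\n{EXTRACT}"
    answer = generate(answer_prompt).strip()
    return chain, answer


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; replies are hard-coded."""
    if prompt.endswith(EXTRACT):
        return " 57.8"
    return " 10% of 340 is 34. 7% of 340 is 23.8. 34 + 23.8 = 57.8."


chain, answer = zero_shot_cot("What is 17% of 340?", fake_generate)
print(answer)
```

Separating the stages means the second call only has to read an answer off a finished chain, which tends to make the output easier to parse.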
Different trigger phrases have slightly different effects. Zhou et al. (2022) used automated prompt search and found that "Let's work this out in a step by step way to be sure we have the right answer" outperformed the original phrase on some benchmarks. Yang et al. (2023) at Google DeepMind found that, for some models, the best-scoring instruction on grade-school math began "Take a deep breath and work on this problem step-by-step." The evidence suggests models have learned to associate certain metacognitive phrasings with more careful, structured outputs.
For most day-to-day tasks, zero-shot CoT ("Let's think step by step") is the right starting point — fast, free, and surprisingly effective. Escalate to few-shot CoT when you need the model to follow a specific reasoning format (e.g., legal analysis, medical differential diagnosis, structured financial modeling) or when zero-shot is producing reasoning chains with systematic errors.
In late 2022, Zhuosheng Zhang of Shanghai Jiao Tong University and colleagues proposed Auto-CoT: cluster the questions by similarity, pick a representative question from each cluster, and use zero-shot CoT to generate a reasoning chain for each representative. The generated chains then serve as few-shot examples. This automated the most labor-intensive part of few-shot CoT construction, combining the strengths of both approaches.
CoT-style reasoning now also appears behind the scenes in modern AI systems. Frontier models like GPT-4 and Claude often show structured reasoning without being explicitly prompted, most likely because their fine-tuning data incorporated CoT-style outputs.
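The Auto-CoT pipeline can be sketched in a few lines, under heavy simplification. The `cluster_key` function is a stand-in for real clustering, picking the shortest question as the representative is our own shortcut, and `generate` is a hypothetical model call; none of these choices is claimed to match the paper.

```python
from collections import defaultdict
from typing import Callable


def auto_cot_demos(
    questions: list[str],
    cluster_key: Callable[[str], str],
    generate: Callable[[str], str],
) -> list[str]:
    # 1. Group the questions; cluster_key stands in for real clustering.
    clusters = defaultdict(list)
    for q in questions:
        clusters[cluster_key(q)].append(q)

    demos = []
    for _, members in sorted(clusters.items()):
        # 2. Pick one representative per group (here: the shortest question).
        rep = min(members, key=len)
        # 3. Let zero-shot CoT write the reasoning chain for it.
        prompt = f"Q: {rep}\nA: Let's think step by step."
        demos.append(f"{prompt} {generate(prompt).strip()}")
    return demos


questions = [
    "What is 17% of 340?",
    "What is 5% of 80?",
    "If Ann is taller than Bo and Bo is taller than Cy, who is shortest?",
]
demos = auto_cot_demos(
    questions,
    cluster_key=lambda q: "percent" if "%" in q else "logic",
    generate=lambda prompt: "(model-written reasoning goes here)",
)
print("\n\n".join(demos))
```

The returned demonstrations can be placed in front of a new question exactly as hand-written few-shot examples would be.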
Practice the difference between few-shot and zero-shot CoT. First, try a problem with just "Let's think step by step." Then craft a few-shot example with your own reasoning chain and use it to solve a similar but harder problem. Aim for at least 3 exchanges.
In March 2022, Xuezhi Wang and colleagues at Google Brain published "Self-Consistency Improves Chain of Thought Reasoning in Language Models." The insight was elegant: instead of generating one reasoning chain, generate many, with temperature turned up so the model takes different paths, then take a majority vote on the final answers.
On the GSM8K math benchmark, self-consistency with 40 sampled paths raised accuracy from 56.5% (single CoT) to 74.4%. The paper reported gains on other arithmetic and commonsense reasoning benchmarks as well. The logic was simple: if multiple independent reasoning paths converge on the same answer, that answer is more likely to be correct, even if some individual chains contained errors.
The procedure has three steps. First, sample multiple diverse reasoning paths from the model using the same prompt but with non-zero temperature (so outputs vary). Second, extract the final answer from each path. Third, take the majority vote — the most common final answer across all paths wins.
This is a form of ensemble reasoning. Different reasoning paths make different errors, so a mistake in one path is unlikely to be repeated in the others. A path that makes an arithmetic mistake early will usually arrive at a wrong answer, but if 8 out of 10 paths get 42 and 2 paths get 44, the vote identifies 42 as the more reliable output.
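The three steps (sample, extract, vote) can be sketched as follows. The `sample` argument is assumed to call a model with non-zero temperature; here it is replaced by a hypothetical fake that replays ten canned chains, eight ending in 42 and two in 44. Taking the last number in a chain as its answer is our own simplification.

```python
import random
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Treat the last number in the chain as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistency(prompt: str, sample: Callable[[str], str], n: int = 10) -> str:
    # Step 1: draw n reasoning paths from the same prompt.
    chains = [sample(prompt) for _ in range(n)]
    # Step 2: extract each path's final answer, dropping unparseable chains.
    answers = [a for a in map(extract_answer, chains) if a is not None]
    # Step 3: majority vote over the extracted answers.
    return Counter(answers).most_common(1)[0][0]


def make_fake_sampler(seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for a model sampled at temperature > 0."""
    pool = ["so the result is 42."] * 8 + ["so the result is 44."] * 2
    random.Random(seed).shuffle(pool)
    chains = iter(pool)
    return lambda prompt: next(chains)


winner = self_consistency("What is 6 times 7?", make_fake_sampler(), n=10)
print(winner)
```

The vote ignores the reasoning text entirely and compares only final answers, so two chains that reach 42 by different routes still count as agreeing.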
Self-consistency has a real cost: it requires generating N completions instead of 1, multiplying inference compute. At 40 paths, it costs 40× the token budget. This is why self-consistency is valuable for high-stakes tasks (medical reasoning, legal analysis, security audits) but impractical for casual chat.
Around the same time, Denny Zhou et al. at Google introduced least-to-most prompting — a CoT variant for problems that decompose naturally into simpler sub-problems. The strategy: first ask the model to break the problem into sub-problems (from easiest to hardest), then solve them in order, where each solved sub-problem feeds context into the next.
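The decompose-then-solve loop can be sketched like this. The prompt wording, the `least_to_most` helper, and the canned `fake_generate` replies are all assumptions made for illustration; a real setup would replace `fake_generate` with a model call.

```python
from typing import Callable


def least_to_most(problem: str, generate: Callable[[str], str]) -> list[tuple[str, str]]:
    # Stage 1: ask for sub-problems, easiest first, one per line.
    plan = generate(
        "Break this problem into simpler sub-problems, easiest first, "
        f"one per line:\n{problem}"
    )
    subproblems = [line.strip() for line in plan.splitlines() if line.strip()]

    # Stage 2: solve in order; every solved pair is appended to the context
    # so that later sub-problems can build on earlier answers.
    context = f"Problem: {problem}\n"
    solved = []
    for sub in subproblems:
        answer = generate(f"{context}Q: {sub}\nA:").strip()
        solved.append((sub, answer))
        context += f"Q: {sub}\nA: {answer}\n"
    return solved


# Hypothetical canned replies so the sketch runs without a model.
CANNED = {
    "What is 10% of 340?": "34",
    "What is 7% of 340?": "23.8",
    "What is 17% of 340?": "57.8",
}


def fake_generate(prompt: str) -> str:
    if prompt.startswith("Break this problem"):
        return "\n".join(CANNED)
    question = prompt.rstrip().splitlines()[-2].removeprefix("Q: ")
    return CANNED[question]


steps = least_to_most("What is 17% of 340?", fake_generate)
for sub, answer in steps:
    print(sub, "->", answer)
```

By the time the last sub-problem is asked, the prompt already contains the answers to the two easier ones, which is what distinguishes this from solving each piece in isolation.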
On the SCAN compositional generalization benchmark, least-to-most prompting achieved 99.7% accuracy, compared to about 16% for chain-of-thought prompting. The key insight was that compositional problems (those requiring you to combine smaller known solutions into a larger solution) benefit enormously from this hierarchical decomposition strategy.
A simpler but highly practical technique: after getting an answer, prompt the model to verify its own work. "Check your reasoning carefully. Is there any step where you might have made an error?" Research by Lightman et al. at OpenAI (2023), the process reward model paper, showed that training a reward model on feedback for each reasoning step (process supervision) produced more reliable results than rewarding only the final answer (outcome supervision).
In practice, you can approximate this without fine-tuning by prompting: "Now, verify each step of your reasoning independently and flag any that seem uncertain." This often catches arithmetic errors, overlooked conditions, or faulty assumptions that the initial chain missed.
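A verification pass is just a second call that shows the model its own draft. In this sketch `generate` is again a hypothetical model call, and `fake_generate` returns hard-coded text (a draft with a planted arithmetic slip, then a review that catches it) purely so the example runs.

```python
from typing import Callable

VERIFY = (
    "Now, verify each step of your reasoning independently "
    "and flag any that seem uncertain."
)


def answer_then_verify(question: str, generate: Callable[[str], str]) -> tuple[str, str]:
    # First pass: get a reasoning chain and an answer.
    first_prompt = f"Q: {question}\nA: Let's think step by step."
    draft = generate(first_prompt).strip()
    # Second pass: append the draft and ask for a step-by-step check.
    review_prompt = f"{first_prompt} {draft}\n\n{VERIFY}"
    review = generate(review_prompt).strip()
    return draft, review


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model call; replies are hard-coded."""
    if prompt.endswith(VERIFY):
        return "Step 2 is wrong: 7% of 340 is 23.8, not 24.8. Corrected total: 57.8."
    return "10% of 340 is 34. 7% of 340 is 24.8. 34 + 24.8 = 58.8."


draft, review = answer_then_verify("What is 17% of 340?", fake_generate)
print(review)
```

A real model will not always catch its own errors this way, so the review is best treated as a second opinion rather than a guarantee.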
OpenAI's o1 (released in 2024) and o3 (announced that December) are trained to produce an extended internal chain of thought before answering. OpenAI has not published the full mechanism, but its reported results include further gains from sampling many reasoning traces and taking the consensus answer. What Wei and Wang demonstrated as manual prompting techniques in 2022 had become built-in model behavior by 2024.
Practice the self-consistency and verification approaches from Lesson 3. Ask for multiple reasoning paths on the same problem, then compare the conclusions. Also try a verification prompt — asking the AI to check its own work step by step. Aim for at least 3 exchanges.
After Wei's paper, researchers applied chain-of-thought to domains far beyond arithmetic. A 2022 paper by Kaizhong Huang et al. showed CoT improving performance on medical question answering — asking models to reason through symptom patterns before diagnosing. A DeepMind team applied it to code generation, finding that asking models to explain their approach before writing code reduced logical bugs. By 2023, CoT had been tested on legal reasoning, scientific hypothesis generation, ethical dilemma analysis, and strategy games.
But the limits became clearer too. In 2023, researchers at Stanford published "Large Language Models Are Not Yet Human-Level Problem Solvers," documenting systematic failures — models that wrote plausible-sounding reasoning chains that were internally inconsistent, or that reached correct answers through demonstrably wrong steps.
Mathematics and formal reasoning: The original and strongest domain. Multi-step arithmetic, algebra, geometry, and symbolic logic all benefit substantially from CoT. The scratchpad effect is most powerful here because each step is precisely verifiable.
Code generation and debugging: Asking models to explain their plan before coding, then trace through execution after, consistently reduces logical errors. GitHub Copilot's internal research (2023) showed that prompts asking models to "explain what this code should do first" produced fewer functional bugs than direct code generation.
Multi-hop factual reasoning: Questions requiring several retrieval steps ("Which country has the capital city whose name means 'muddy water'?") benefit from explicit intermediate steps rather than direct retrieval.
Unfaithful reasoning: A critical 2023 finding from Turpin et al. (researchers at NYU and Anthropic, among others) showed that models sometimes produce CoT reasoning that doesn't actually reflect their internal computation. The chain is post-hoc rationalization, not genuine derivation. The model reaches an answer (possibly influenced by subtle cues in the prompt) and then generates a plausible reasoning chain backward. This is a fundamental concern for trusting CoT explanations in high-stakes settings.
Overconfident wrong chains: CoT can make models more confidently wrong. A model that would output "I'm not sure" without CoT might produce a detailed five-step chain leading to an incorrect answer with full confidence. The appearance of reasoning can suppress appropriate uncertainty.
Creative and social tasks: For poetry, creative writing, style matching, and social reasoning, forcing explicit step-by-step reasoning often degrades quality. These tasks benefit from holistic pattern matching, not sequential decomposition.
In Turpin et al.'s experiments, when models were given biased context (e.g., a hint toward a wrong answer embedded in the prompt), their CoT explanations argued for the hinted answer without ever mentioning the hint, even when that answer was wrong. The model didn't reason to the wrong answer; it reasoned from the wrong answer backward. This "unfaithful CoT" finding means you cannot always trust a model's stated reasoning as an accurate account of how it reached its conclusion.
In September 2024, OpenAI released o1, a model trained specifically to use extended internal chain-of-thought reasoning before producing output. Unlike previous models where CoT was a prompting technique, o1 had CoT built into its training and inference pipeline — it spent variable compute "thinking" before answering, with longer thinking time correlating with harder problems.
On AIME 2024, a qualifying exam for the US Mathematical Olympiad, OpenAI reported that o1 solved 74% of problems with a single sample per problem, versus 12% for GPT-4o, and 83% when taking the consensus of 64 samples. On Codeforces competitive programming problems, it reached the 89th percentile, compared to GPT-4o's 11th. The improvement came primarily from scaled inference-time computation: longer reasoning chains, with the model checking and revising its own steps.
o3, announced in December 2024, extended this further. On the ARC-AGI benchmark (designed to require fluid intelligence), o3 scored 87.5% in a high-compute configuration, compared to o1's 32% and GPT-4o's roughly 5%. The leap from 2022 prompting tricks to 2024 reasoning architectures represents the full arc of what chain-of-thought unlocked.
Chain-of-thought demonstrated that intelligence in language models is not fixed — it can be elicited, structured, and amplified by how computation is organized, not just by model size. This principle — that reasoning quality scales with structured intermediate computation — is now one of the foundational beliefs driving AI development in 2024 and beyond.
Apply chain-of-thought beyond arithmetic. Try it on a medical diagnosis scenario, a legal question, or a code debugging task. Then deliberately try to break it — find a task where asking for step-by-step reasoning produces worse or strange results. Aim for at least 3 exchanges.