Module 2 · Lesson 1

What Is Chain-of-Thought?

The discovery that telling an AI to "think step by step" could unlock reasoning it didn't seem to have.
Why does showing your work change the answer — not just the explanation?

In early 2022, Google Brain researcher Jason Wei and colleagues reported something striking. When they added a handful of worked examples, each showing its intermediate reasoning steps, to the prompts given to large language models, accuracy on multi-step math word problems jumped dramatically: from below 20% on some benchmarks to above 50%. The model had not been retrained. Nothing had changed except the prompt.

Their paper, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," first posted to arXiv in January 2022, would become one of the most cited AI papers of the decade. The insight was deceptively simple: intermediate reasoning steps, when made explicit, restructure how a model processes a problem.

The Core Idea

A language model predicts the next token based on everything that came before it. When you ask a complex question directly — "What is 17% of 340?" — the model must leap from question to answer in a single bound. The probability distribution over answer tokens is shaped by training patterns, not by a deliberate computation the model just performed.

Chain-of-thought changes this. By prompting the model to generate intermediate steps — "First, 10% of 340 is 34. Then 7% of 340 is 23.8. Adding them: 57.8" — each reasoning step becomes context for the next. The model is, in a meaningful sense, using its own prior output as a scratchpad. The answer token now follows from a sequence of coherent reasoning tokens, not a cold jump from question to conclusion.
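
The scratchpad idea can be illustrated with the same arithmetic. The sketch below is only an analogy in ordinary Python, not a model: each intermediate result is written down and the final line is computed from the earlier entries. The helper name `percent_of` and the `scratchpad` list are illustrative choices, not anything from the research.

```python
from fractions import Fraction

def percent_of(value, pct):
    """Return pct% of value as an exact fraction (avoids float rounding)."""
    return Fraction(value * pct, 100)

# Each entry is an intermediate step that later steps can read.
scratchpad = []
scratchpad.append(("10% of 340", percent_of(340, 10)))
scratchpad.append(("7% of 340", percent_of(340, 7)))

# The final step is computed from the two earlier entries, not from scratch.
total = scratchpad[0][1] + scratchpad[1][1]
scratchpad.append(("17% of 340", total))

for label, value in scratchpad:
    print(f"{label} = {float(value)}")
```

The last entry matches a direct computation of 17% of 340, but it was reached by combining two easier results, which is the pattern a reasoning chain follows.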

Chain-of-Thought in Action — Classic Example from Wei et al. (2022)
Problem: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?
Standard prompt answer: 11 ✓ (simple enough — model gets this)
Harder version: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Without CoT: 27 ✗
With CoT: "The cafeteria started with 23 apples. They used 20, so 23 − 20 = 3. They bought 6 more, so 3 + 6 = 9."
Answer: 9 ✓
Why It Works: The Scratchpad Hypothesis

Researchers including Denny Zhou at Google DeepMind have described the mechanism as a computational scratchpad. The model's context window is finite but rich. By writing out steps, the model distributes the cognitive load of the problem across many tokens rather than compressing it into one. Each intermediate conclusion anchors the next inference.

This is not entirely unlike how humans benefit from writing out solutions to hard problems. The act of writing doesn't just communicate the answer — it constrains what you can say next, forcing logical coherence. The same constraint operates on a language model, though the underlying mechanism (statistical next-token prediction) differs fundamentally from human cognition.

Importantly, CoT doesn't help for simple factual recall. Asking "What is the capital of France?" gains nothing from chain-of-thought. The benefit appears specifically on tasks requiring multiple sequential reasoning steps — math, logic, commonsense inference, code debugging, multi-hop factual retrieval.

Key Finding — Wei et al. 2022

Chain-of-thought prompting only emerged as a useful capability in models above roughly 100 billion parameters. In smaller models, asking for step-by-step reasoning produced fluent but incorrect chains — confident nonsense. This "emergent" threshold was a major finding in itself, suggesting CoT ability scales with model size in a non-linear way.

Key Terms
Chain-of-Thought (CoT): A prompting technique in which the model is asked to produce intermediate reasoning steps before arriving at a final answer, improving accuracy on multi-step tasks.
Zero-Shot CoT: Eliciting chain-of-thought reasoning with a simple instructional phrase ("Let's think step by step") without providing example reasoning chains.
Few-Shot CoT: Providing example problems with full reasoning chains in the prompt, so the model learns the reasoning format from those demonstrations before tackling a new problem.
Emergent Ability: A capability that appears in large models but not small ones, seemingly discontinuously, as model scale increases.
Historical Note

Wei et al.'s chain-of-thought paper first appeared on arXiv in January 2022. Within a year, step-by-step reasoning had become a common part of how frontier models from OpenAI, Anthropic, and Google were prompted and evaluated. The phrase "Let's think step by step," introduced by Kojima et al. a few months later, became so widely used that subsequent papers treated it as a quasi-standard baseline, an almost accidental landmark in prompt engineering history.

Lesson 1 Quiz

What Is Chain-of-Thought? — Check your understanding.
1. Who published the foundational chain-of-thought prompting paper in 2022?
Correct. Jason Wei et al. at Google Brain published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," first posted to arXiv in January 2022.
Not quite. The paper was published by Jason Wei and colleagues at Google Brain and first appeared in January 2022.
2. What is the simplest zero-shot chain-of-thought phrase discovered to improve reasoning?
Correct. "Let's think step by step" is the zero-shot phrase that Kojima et al. (2022) demonstrated could elicit CoT reasoning without any examples.
Not quite. The specific phrase is "Let's think step by step" — simple and remarkably effective.
3. Chain-of-thought prompting is LEAST likely to improve performance on which task?
Correct. CoT helps with multi-step problems. For simple factual recall, the intermediate steps are unnecessary and offer no accuracy benefit.
Think about where reasoning steps add value. Simple factual recall doesn't require intermediate computation — there are no steps to chain.
4. According to Wei et al.'s findings, at approximately what model size does chain-of-thought become useful?
Correct. Wei et al. found CoT was an emergent ability — appearing meaningfully only in models around 100B parameters or larger.
Not quite. The researchers found this was an emergent capability appearing around 100 billion parameters — smaller models produced fluent but wrong chains.

Lab 1 — Triggering Chain-of-Thought

Practice prompting an AI to reason step by step on multi-step problems.

Your Mission

You'll practice the core chain-of-thought technique: asking an AI to show its intermediate reasoning steps. Try giving a math or logic problem and compare the result with and without the "step by step" instruction. Aim for at least 3 exchanges.

Try this starter: "Without step-by-step reasoning: If a store reduces a $240 item by 15%, then raises the sale price by 10%, what's the final price? Now solve it again, thinking step by step."
CoT Lab Assistant
Chain-of-Thought Focus
Welcome to Lab 1. I'm here to help you explore chain-of-thought prompting. Try giving me a problem and asking me to solve it both with and without step-by-step reasoning — you'll see the difference directly. What problem would you like to start with?
Module 2 · Lesson 2

Few-Shot vs. Zero-Shot CoT

Two roads to the same destination — and why the examples you provide matter enormously.
When should you demonstrate reasoning, and when is a single instruction enough?

In May 2022, about four months after Wei et al.'s paper first appeared, a team at the University of Tokyo and Google published a follow-up with a surprising result. Takeshi Kojima and collaborators showed that you didn't need elaborate example chains at all. Adding the single phrase "Let's think step by step" after a question, with no examples whatsoever, produced large accuracy gains. They called it Zero-Shot-CoT, and it worked across arithmetic, symbolic reasoning, and commonsense tasks.

This was significant for a practical reason: constructing good few-shot examples requires human effort and expertise. Zero-shot CoT democratized the technique, since anyone could use it with a five-word addition to their prompt.

Few-Shot CoT: Teaching by Example

In few-shot CoT, you provide the model with several complete examples: a question, a full step-by-step reasoning chain, and the correct answer. The model then applies the same reasoning format to a new question. This is the approach used in Wei et al.'s original paper, where the examples were carefully constructed by researchers.
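
A few-shot CoT prompt is ultimately just a string assembled from worked examples. The sketch below shows one plausible way to build it; the function name `build_few_shot_cot_prompt`, the `Q:`/`A:` layout, and the wording of the example chain are assumptions for illustration, not a required format.

```python
def build_few_shot_cot_prompt(exemplars, question):
    """Join (question, reasoning, answer) exemplars, then append the new question."""
    parts = []
    for q, reasoning, answer in exemplars:
        # Each exemplar shows the reasoning chain before stating the answer.
        parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    # The new question ends with an open "A:" for the model to continue.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11.",
        "11",
    ),
]

prompt = build_few_shot_cot_prompt(
    exemplars,
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
)
print(prompt)
```

Because the model tends to imitate whatever the exemplars do, any error in the `reasoning` strings is likely to be copied, so they are worth checking by hand.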

The quality of examples matters enormously. In 2022, researchers at DeepMind (Shi et al.) showed that including even one misleading reasoning step in a few-shot example could dramatically degrade performance — the model would faithfully imitate the flawed reasoning pattern. This highlights a key risk: few-shot CoT transfers the format of reasoning, good or bad.

Few-Shot CoT — Risky

Requires expert-constructed examples. If examples contain errors, the model imitates them. Takes significant prompt space. Must be re-crafted for different domains.

Few-Shot CoT — Powerful

Precisely calibrates the reasoning style. Works on domain-specific tasks where zero-shot fails. Lets you specify exactly what kind of reasoning you want — causal, mathematical, legal.

Zero-Shot CoT: The Magic Phrase

Kojima et al. identified a two-stage structure in effective zero-shot CoT. First prompt: "[Question] Let's think step by step." This generates a reasoning chain. Second prompt: "[Question] [Generated reasoning chain] Therefore, the answer is:" This extracts the final answer from the chain. The two-stage approach outperformed a single combined prompt.
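
The two-stage structure can be sketched as two calls to a text-completion function. Here `complete` is a hypothetical callable that takes a prompt and returns the model's continuation; `fake_model` is a stand-in so the example runs without any API, and its canned outputs are invented for illustration.

```python
from typing import Callable, Tuple

def zero_shot_cot(question: str, complete: Callable[[str], str]) -> Tuple[str, str]:
    """Run the two-stage zero-shot CoT procedure with a caller-supplied model."""
    # Stage 1: elicit a reasoning chain with the trigger phrase.
    stage1 = f"Q: {question}\nA: Let's think step by step."
    reasoning = complete(stage1)
    # Stage 2: feed the question and chain back, then ask for the answer only.
    stage2 = f"{stage1} {reasoning}\nTherefore, the answer is"
    answer = complete(stage2).strip()
    return reasoning, answer

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: canned outputs)."""
    if prompt.endswith("Therefore, the answer is"):
        return " 9"
    return "They used 20, so 23 - 20 = 3. They bought 6 more, so 3 + 6 = 9."

reasoning, answer = zero_shot_cot(
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
    fake_model,
)
print(reasoning)
print(answer)
```

Keeping the stages separate makes the final answer easy to parse, since the second completion contains only the answer rather than a full chain.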

Different trigger phrases have slightly different effects. Later work searched over many variants. Zhou et al. (2022) used automated prompt search and found "Let's work this out in a step by step way to be sure we have the right answer," and Yang et al. (2023) at Google DeepMind found that "Take a deep breath and work on this problem step-by-step" improved math benchmark results for some models. The evidence suggests models have learned to associate certain metacognitive phrasings with more careful, structured outputs.

Practical Comparison

For most day-to-day tasks, zero-shot CoT ("Let's think step by step") is the right starting point — fast, free, and surprisingly effective. Escalate to few-shot CoT when you need the model to follow a specific reasoning format (e.g., legal analysis, medical differential diagnosis, structured financial modeling) or when zero-shot is producing reasoning chains with systematic errors.

The Auto-CoT Approach

In late 2022, Zhuosheng Zhang and colleagues at Shanghai Jiao Tong University and Amazon proposed Auto-CoT: cluster the questions in a dataset by similarity, pick a representative question from each cluster, and use zero-shot CoT to generate a reasoning chain for it. The resulting question-and-chain pairs then serve as few-shot demonstrations. This automated the most labor-intensive part of few-shot CoT construction, combining the strengths of both approaches.

A related idea shows up in how modern AI systems are built. Frontier models like GPT-4 and Claude often show structured reasoning without being explicitly prompted, most likely because their fine-tuning data included CoT-style outputs.
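
The Auto-CoT pipeline can be sketched in a few lines once clustering and the model call are abstracted away. Everything named here is hypothetical: `cluster_of` stands in for a real clustering step, `complete` for a model call, and picking the shortest question as the cluster representative is a simplifying assumption rather than the selection rule used in the paper.

```python
from collections import defaultdict

def auto_cot_demos(questions, cluster_of, complete):
    """Build one zero-shot-CoT demonstration per question cluster."""
    clusters = defaultdict(list)
    for q in questions:
        clusters[cluster_of(q)].append(q)

    demos = []
    for key in sorted(clusters, key=str):
        # Assumption: use the shortest question as the representative.
        representative = min(clusters[key], key=len)
        stem = f"Q: {representative}\nA: Let's think step by step."
        demos.append(f"{stem} {complete(stem)}")
    return demos

def toy_cluster_of(question):
    """Toy clustering rule (assumption): percent problems vs. everything else."""
    return "percent" if "%" in question else "count"

def toy_complete(prompt):
    """Stand-in for a real model call."""
    return "(model-generated chain)"

questions = [
    "What is 17% of 340?",
    "What is 15% of 240?",
    "Roger has 5 tennis balls and buys 6 more. How many does he have?",
]
demos = auto_cot_demos(questions, toy_cluster_of, toy_complete)
print("\n\n".join(demos))
```

The generated demonstrations would then be prepended to a new question, exactly as hand-written few-shot examples are.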

Few-Shot CoT: Providing complete example reasoning chains in the prompt before posing the target question.
Zero-Shot CoT: Using only a trigger phrase ("Let's think step by step") with no examples to elicit reasoning chains.
Auto-CoT: Automatically generating few-shot CoT examples using zero-shot CoT, then using those as demonstrations.

Lesson 2 Quiz

Few-Shot vs. Zero-Shot CoT — Check your understanding.
1. Who published the paper demonstrating that "Let's think step by step" alone (zero-shot) could elicit chain-of-thought reasoning?
Correct. Kojima et al. published the Zero-Shot-CoT paper, showing that the phrase alone — without any example chains — was sufficient to trigger reasoning.
Not quite. It was Takeshi Kojima and colleagues who published the zero-shot CoT finding, following up on Wei et al.'s original paper.
2. According to DeepMind research (Shi et al. 2022), what happens when a few-shot CoT example contains a flawed reasoning step?
Correct. Few-shot CoT transfers the format of reasoning faithfully — including errors. One bad example can degrade performance systematically.
Not quite. The research showed models imitate the reasoning format, including flawed steps — a key risk of few-shot approaches.
3. What does Auto-CoT do differently from standard few-shot CoT?
Correct. Auto-CoT uses zero-shot CoT to auto-generate examples, then uses those as few-shot demonstrations — combining the convenience of zero-shot with the power of few-shot.
Not quite. Auto-CoT's key innovation was automating the example generation using zero-shot CoT itself, reducing human labor.
4. When is few-shot CoT generally preferred over zero-shot CoT?
Correct. Few-shot CoT is worth the extra effort when you need precise control over reasoning style or when zero-shot is failing on your specific task type.
Not quite. Zero-shot is usually the fast starting point. Few-shot adds value when you need a domain-specific reasoning format or when zero-shot shows systematic errors.

Lab 2 — Few-Shot vs. Zero-Shot

Build your own reasoning examples and compare the two approaches head-to-head.

Your Mission

Practice the difference between few-shot and zero-shot CoT. First, try a problem with just "Let's think step by step." Then craft a few-shot example with your own reasoning chain and use it to solve a similar but harder problem. Aim for at least 3 exchanges.

Try this: Give a word problem, ask for the answer with zero-shot CoT. Then write a complete example solution yourself (with all reasoning steps) and ask the AI to use that same format on a new problem.
CoT Lab Assistant
Few-Shot vs. Zero-Shot
Welcome to Lab 2. We're comparing few-shot and zero-shot chain-of-thought today. Start with a problem — try it zero-shot first, then build a few-shot example and see if the reasoning format changes. What would you like to explore?
Module 2 · Lesson 3

Self-Consistency & Verification

When one chain of thought isn't enough — sampling multiple reasoning paths and voting on the answer.
If a model can reason differently each time, can we use that variation to improve accuracy?

In March 2022, Xuezhi Wang and colleagues at Google Brain posted "Self-Consistency Improves Chain of Thought Reasoning in Language Models." The insight was elegant: instead of generating one reasoning chain, generate many, with temperature turned up so the model takes different paths, then take a majority vote on the final answers.

On the GSM8K math benchmark, self-consistency with 40 sampled paths raised accuracy from 56.5% (single CoT) to 74.4%. Other arithmetic and commonsense benchmarks in the paper showed gains as well. The logic was simple: if multiple independently sampled reasoning paths converge on the same answer, that answer is more likely to be correct, even if some individual chains contained errors.

How Self-Consistency Works

The procedure has three steps. First, sample multiple diverse reasoning paths from the model using the same prompt but with non-zero temperature (so outputs vary). Second, extract the final answer from each path. Third, take the majority vote — the most common final answer across all paths wins.

This is a form of ensemble reasoning. A path that makes an arithmetic mistake early will probably arrive at a wrong answer, but different paths tend to make different mistakes, so wrong answers scatter while correct ones agree. If 8 out of 10 paths get 42 and 2 paths get 44, the vote identifies 42 as the more reliable output.
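
The extract-and-vote steps are simple enough to sketch directly. The example below assumes the sampled paths are already available as strings and that the final answer is the last number in each path; both are simplifying assumptions, and `extract_answer` and `majority_vote` are illustrative names.

```python
import re
from collections import Counter

def extract_answer(path):
    """Return the last number in a reasoning path as a string, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", path)
    return numbers[-1] if numbers else None

def majority_vote(paths):
    """Return (most common answer, share of paths that agree with it)."""
    answers = [a for a in map(extract_answer, paths) if a is not None]
    if not answers:
        return None, 0.0
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

# Ten sampled paths: eight agree on 42, two slipped and reached 44.
paths = ["6 * 7 = 42. The answer is 42."] * 8 + ["6 * 7 = 44. The answer is 44."] * 2

answer, share = majority_vote(paths)
print(answer, share)
```

The agreement share is a useful by-product: a low share signals that the paths disagree and the answer deserves a closer look.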

Self-consistency has a real cost: it requires generating N completions instead of 1, multiplying inference compute. At 40 paths, it costs 40× the token budget. This is why self-consistency is valuable for high-stakes tasks (medical reasoning, legal analysis, security audits) but impractical for casual chat.

Self-Consistency — GSM8K Results (Wang et al. 2022)
Standard CoT (1 path): 56.5% accuracy on GSM8K math benchmark
Self-Consistency, 10 paths: ~68% accuracy
Self-Consistency, 40 paths: 74.4% accuracy
Model: PaLM 540B — the largest publicly benchmarked model at the time
Takeaway: Diversity of reasoning paths, not just depth, improves reliability.
Least-to-Most Prompting

Around the same time, Denny Zhou et al. at Google introduced least-to-most prompting — a CoT variant for problems that decompose naturally into simpler sub-problems. The strategy: first ask the model to break the problem into sub-problems (from easiest to hardest), then solve them in order, where each solved sub-problem feeds context into the next.

On the SCAN compositional generalization benchmark, least-to-most prompting achieved 99.7% accuracy — compared to 16% for standard few-shot prompting. The key insight was that compositional problems (those requiring you to combine smaller known solutions into a larger solution) benefit enormously from this hierarchical decomposition strategy.
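
The decompose-then-solve loop can be sketched with two caller-supplied functions. In practice both `decompose` and `solve` would be model calls; the toy versions below are hard-coded stand-ins (an assumption for illustration) so the control flow is visible and the example runs on its own.

```python
def least_to_most(problem, decompose, solve):
    """Solve sub-problems in order, passing earlier answers forward as context."""
    context = []  # (sub-problem, answer) pairs solved so far
    answer = None
    for sub in decompose(problem):
        answer = solve(problem, context, sub)
        context.append((sub, answer))
    return answer, context

def toy_decompose(problem):
    """Stand-in for a model call that lists sub-problems, easiest first."""
    return [
        "How many apples are left after making lunch?",
        "How many apples are there after buying more?",
    ]

def toy_solve(problem, context, sub):
    """Stand-in for a model call; the second step reuses the first answer."""
    if not context:
        return 23 - 20
    return context[-1][1] + 6

problem = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)
answer, context = least_to_most(problem, toy_decompose, toy_solve)
print(answer)
```

The key design point is that `solve` receives `context`, so each later sub-problem can build on earlier answers instead of starting over.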

Verification Prompts

A simpler but highly practical technique: after getting an answer, prompt the model to verify its own work. "Check your reasoning carefully. Is there any step where you might have made an error?" Research by Lightman et al. at OpenAI (2023) — the process reward model paper — showed that fine-tuning models to verify each step of their reasoning (rather than just the final answer) produced more reliable outputs than outcome supervision alone.

In practice, you can approximate this without fine-tuning by prompting: "Now, verify each step of your reasoning independently and flag any that seem uncertain." This often catches arithmetic errors, overlooked conditions, or faulty assumptions that the initial chain missed.
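
For arithmetic chains, part of this verification can be done outside the model. The sketch below is a deliberately narrow checker: it only finds simple `a op b = c` statements with a regular expression and recomputes them. The names `STEP`, `OPS`, and `check_steps` are illustrative, and anything beyond two-operand arithmetic is out of scope.

```python
import re
from operator import add, sub, mul, truediv

# Matches statements like "23 - 20 = 3" (ASCII or Unicode minus, * or x-style times).
STEP = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-−*/×])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)
OPS = {"+": add, "-": sub, "−": sub, "*": mul, "×": mul, "/": truediv}

def check_steps(chain, tol=1e-9):
    """Return (statement, is_correct) for each simple arithmetic step found."""
    results = []
    for a, op, b, claimed in STEP.findall(chain):
        try:
            ok = abs(OPS[op](float(a), float(b)) - float(claimed)) <= tol
        except ZeroDivisionError:
            ok = False  # a division by zero can never be a valid step
        results.append((f"{a} {op} {b} = {claimed}", ok))
    return results

good = "They used 20, so 23 − 20 = 3. They bought 6 more, so 3 + 6 = 9."
bad = "They used 20, so 23 − 20 = 3. They bought 6 more, so 3 + 6 = 10."

print(check_steps(good))
print(check_steps(bad))
```

A checker like this catches slips in individual steps but says nothing about whether the steps are the right ones to take, so it complements a verification prompt rather than replacing it.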

Real-World Application

OpenAI's o1 model (released in 2024) and its successor o3 are trained to produce a long internal chain of thought before answering. OpenAI has not published the full mechanism, but its reported results include gains from consensus voting over many sampled answers, the same idea as self-consistency. What Wei and Wang demonstrated as manual prompting techniques in 2022 had become part of model training and inference by 2024.

Self-Consistency: Sampling multiple diverse reasoning paths for the same problem and taking a majority vote on the final answer to improve reliability.
Least-to-Most Prompting: Decomposing a complex problem into ordered sub-problems and solving them sequentially, using prior solutions as context.
Process Reward Model: A model trained to evaluate the correctness of each reasoning step, not just the final answer, enabling step-level verification.

Lesson 3 Quiz

Self-Consistency & Verification — Check your understanding.
1. What is the core mechanism of self-consistency prompting?
Correct. Self-consistency generates multiple reasoning paths with varied outputs, then votes on the most common final answer — ensemble reasoning over diverse paths.
Not quite. Self-consistency samples N diverse reasoning paths (using non-zero temperature) and takes a majority vote on final answers.
2. What was the accuracy improvement from self-consistency with 40 paths on the GSM8K benchmark (Wang et al. 2022)?
Correct. Wang et al. showed a jump from 56.5% (single CoT path) to 74.4% (40 sampled paths with majority vote) on GSM8K — a massive gain without any model changes.
Not quite. The improvement was from 56.5% to 74.4% accuracy — a nearly 18-point gain from sampling and voting across 40 paths.
3. Least-to-most prompting is particularly effective for which type of task?
Correct. Least-to-most prompting excels at compositional problems — where simpler components must be solved first and combined to solve harder ones, as shown on the SCAN benchmark.
Not quite. The technique is designed for compositional tasks — breaking complex problems into sub-problems and solving sequentially from easiest to hardest.
4. What was the key finding of Lightman et al.'s 2023 OpenAI process reward model paper?
Correct. Process reward models — trained to evaluate intermediate steps rather than just final outcomes — produced more reliable reasoning chains than outcome-only supervision.
Not quite. The key finding was that step-level (process) supervision outperformed outcome-only supervision — rewarding correct reasoning at each step, not just the final answer.

Lab 3 — Self-Consistency in Practice

Ask for multiple reasoning paths and practice verification techniques.

Your Mission

Practice the self-consistency and verification approaches from Lesson 3. Ask for multiple reasoning paths on the same problem, then compare the conclusions. Also try a verification prompt — asking the AI to check its own work step by step. Aim for at least 3 exchanges.

Try this: "Give me three different step-by-step reasoning approaches to solving this problem: A train leaves Chicago at 9am traveling at 60mph. Another train leaves Detroit (280 miles away) at 10am traveling at 80mph toward Chicago. When do they meet? After giving all three approaches, tell me which answer you're most confident in and why."
CoT Lab Assistant
Self-Consistency & Verification
Welcome to Lab 3. Today we're exploring self-consistency — asking for multiple reasoning paths and using agreement between them to identify reliable answers. We'll also practice step-by-step verification. Ready to try some problems? Go ahead and start!
Module 2 · Lesson 4

CoT Beyond Math: Limits & Frontiers

Chain-of-thought across domains — and the honest accounting of where it fails.
Where does explicit reasoning help, where does it hurt, and what did the o1 models change?

After Wei's paper, researchers applied chain-of-thought to domains far beyond arithmetic. A 2022 paper by Kaizhong Huang et al. showed CoT improving performance on medical question answering — asking models to reason through symptom patterns before diagnosing. A DeepMind team applied it to code generation, finding that asking models to explain their approach before writing code reduced logical bugs. By 2023, CoT had been tested on legal reasoning, scientific hypothesis generation, ethical dilemma analysis, and strategy games.

But the limits became clearer too. In 2023, researchers at Stanford published "Large Language Models Are Not Yet Human-Level Problem Solvers," documenting systematic failures — models that wrote plausible-sounding reasoning chains that were internally inconsistent, or that reached correct answers through demonstrably wrong steps.

Where CoT Works Well

Mathematics and formal reasoning: The original and strongest domain. Multi-step arithmetic, algebra, geometry, and symbolic logic all benefit substantially from CoT. The scratchpad effect is most powerful here because each step is precisely verifiable.

Code generation and debugging: Asking models to explain their plan before coding, then trace through execution after, consistently reduces logical errors. GitHub Copilot's internal research (2023) showed that prompts asking models to "explain what this code should do first" produced fewer functional bugs than direct code generation.

Multi-hop factual reasoning: Questions requiring several retrieval steps ("Which country has the capital city whose name means 'muddy water'?") benefit from explicit intermediate steps rather than direct retrieval.

Where CoT Fails or Misfires

Unfaithful reasoning: A critical 2023 finding from Turpin et al. at Anthropic showed that models sometimes produce CoT reasoning that doesn't actually reflect their internal computation — the chain is post-hoc rationalization, not genuine derivation. The model reaches an answer (possibly influenced by subtle cues in the prompt) and then generates a plausible reasoning chain backward. This is a fundamental concern for trusting CoT explanations in high-stakes settings.

Overconfident wrong chains: CoT can make models more confidently wrong. A model that would output "I'm not sure" without CoT might produce a detailed five-step chain leading to an incorrect answer with full confidence. The appearance of reasoning can suppress appropriate uncertainty.

Creative and social tasks: For poetry, creative writing, style matching, and social reasoning, forcing explicit step-by-step reasoning often degrades quality. These tasks benefit from holistic pattern matching, not sequential decomposition.

Critical Finding — Turpin et al. 2023 (Anthropic)

When models were given biased context (e.g., a hint toward a wrong answer embedded in the prompt), their answers often shifted toward the hint, yet their CoT explanations rarely mentioned it. Instead, the chain presented a plausible-looking justification for the biased answer. The model didn't reason to the wrong answer; it reasoned from the wrong answer backward. This "unfaithful CoT" finding means you cannot always trust a model's stated reasoning as an accurate account of how it reached its conclusion.

OpenAI o1 and o3: CoT as Architecture

In September 2024, OpenAI released o1, a model trained specifically to use extended internal chain-of-thought reasoning before producing output. Unlike previous models where CoT was a prompting technique, o1 had CoT built into its training and inference pipeline — it spent variable compute "thinking" before answering, with longer thinking time correlating with harder problems.

On AIME 2024, a competition mathematics exam, OpenAI reported that o1 solved 74% of problems with a single sample per problem, compared to 12% for GPT-4o. With consensus voting over 64 samples, o1 reached 83%. OpenAI attributed the improvement to reinforcement learning on chains of thought and to spending more compute at inference time.

o3, announced in December 2024, extended this further. On the ARC-AGI benchmark (designed to require fluid intelligence), a high-compute configuration of o3 scored 87.5%, compared to o1's 32% and GPT-4's near-zero. The leap from 2022 prompting tricks to 2024 reasoning models represents the full arc of what chain-of-thought unlocked.

The Frontier: What CoT Proved

Chain-of-thought demonstrated that intelligence in language models is not fixed — it can be elicited, structured, and amplified by how computation is organized, not just by model size. This principle — that reasoning quality scales with structured intermediate computation — is now one of the foundational beliefs driving AI development in 2024 and beyond.

Unfaithful CoT: When a model's stated reasoning chain doesn't reflect its actual computation; a post-hoc rationalization rather than the genuine derivation of the answer.
Inference-Time Compute: Computational resources used during generation (not training). Scaling inference-time compute (by sampling more or reasoning longer) is distinct from scaling model size.
o1 / o3: OpenAI models (2024) with extended internal chain-of-thought built into the architecture, trained to "think" variably based on problem difficulty before outputting answers.

Lesson 4 Quiz

CoT Beyond Math: Limits & Frontiers — Check your understanding.
1. What did the Turpin et al. 2023 Anthropic study find about chain-of-thought explanations?
Correct. "Unfaithful CoT" — reasoning chains that rationalize a predetermined answer rather than genuinely deriving it — is a real and documented phenomenon identified by Turpin et al. at Anthropic.
Not quite. Turpin et al. found that models can produce plausible CoT chains that don't reflect actual computation — particularly when the prompt contains bias toward a specific answer.
2. What distinguished OpenAI's o1 model from previous models like GPT-4 in terms of reasoning?
Correct. o1 represented a shift from CoT as a prompting trick to CoT as an architectural feature — trained to reason internally with variable compute before answering.
Not quite. The key distinction was architectural: o1 had internal chain-of-thought reasoning built in, with inference-time compute scaling to problem difficulty.
3. For which type of task does forcing chain-of-thought reasoning MOST likely degrade performance?
Correct. Creative and holistic tasks benefit from pattern-level fluency, not sequential decomposition. Forcing explicit reasoning steps often produces stilted, formulaic creative output.
Not quite. CoT helps sequential, logical tasks but can hinder holistic creative tasks where the quality comes from fluid pattern-matching rather than step-by-step decomposition.
4. On the AIME 2024 benchmark, what was the approximate performance difference between o1 and GPT-4o?
Correct. o1's 74% vs GPT-4o's 12% on AIME 2024 was one of the most striking demonstrations that inference-time chain-of-thought reasoning represented a qualitative leap, not a marginal improvement.
Not quite. The gap was dramatic: o1 solved 74% of AIME 2024 problems compared to GPT-4o's 12%, a more than 6× improvement driven primarily by scaled internal reasoning.

Lab 4 — CoT Across Domains

Apply chain-of-thought to non-math tasks and probe its limits.

Your Mission

Apply chain-of-thought beyond arithmetic. Try it on a medical diagnosis scenario, a legal question, or a code debugging task. Then deliberately try to break it — find a task where asking for step-by-step reasoning produces worse or strange results. Aim for at least 3 exchanges.

Try this: "A 45-year-old patient has fatigue, cold intolerance, weight gain, and dry skin. Think step by step through the differential diagnosis. What's the most likely diagnosis?" Then try: "Write a haiku about autumn. Let's think step by step." Compare the outputs.
CoT Lab Assistant
CoT Limits & Domains
Welcome to Lab 4 — the most interesting one. We're going beyond math to see where chain-of-thought genuinely helps, where it's neutral, and where it might actively make things worse. Try applying it to a medical, legal, or creative task and tell me what you observe. Let's explore!

Module 2 — Test

Chain-of-Thought Prompting · 15 questions · Pass at 80%
1. Chain-of-thought prompting was first formally demonstrated in which year?
Correct. Wei et al. first posted the CoT paper in January 2022.
The foundational CoT paper by Wei et al. first appeared in January 2022.
2. Which institution published the foundational chain-of-thought paper?
Correct. Jason Wei and colleagues at Google Brain published the CoT paper.
The paper came from Google Brain — Jason Wei et al.
3. How does chain-of-thought improve model accuracy on multi-step problems?
Correct. Each reasoning step becomes context for the next token prediction — distributing complex computation across many steps rather than a single jump.
The mechanism is that intermediate steps become context — the model uses its own prior output as a scratchpad for subsequent reasoning.
4. Zero-shot CoT was developed by:
Correct. Kojima et al. demonstrated zero-shot CoT — showing that a simple phrase without examples could elicit chain-of-thought reasoning.
Zero-shot CoT was from Takeshi Kojima and colleagues — they showed the phrase alone, without examples, was sufficient.
5. In few-shot CoT, what is the primary risk of using poorly constructed examples?
Correct. Few-shot CoT transfers reasoning format faithfully — including errors. Bad examples produce bad reasoning chains consistently.
The risk is imitation: the model faithfully copies the reasoning format of the examples, including any flawed steps (Shi et al., DeepMind 2022).
6. What is Auto-CoT?
Correct. Auto-CoT (Zhang et al., Shanghai Jiao Tong) automatically generates few-shot examples using zero-shot CoT — removing the need for hand-crafted demonstrations.
Auto-CoT uses zero-shot CoT to generate examples automatically, then uses those as few-shot demonstrations — combining both approaches.
7. Self-consistency prompting, as developed by Wang et al. (2022), involves:
Correct. Sample N paths at non-zero temperature, extract final answers from each, majority vote wins — ensemble reasoning over diverse paths.
Self-consistency samples multiple diverse reasoning paths (using temperature) and takes the majority vote answer — ensemble reasoning without additional training.
8. Least-to-most prompting is specifically designed for problems that are:
Correct. Least-to-most excels at compositional tasks — breaking into sub-problems from easiest to hardest. On SCAN it achieved 99.7% vs. 16% for standard prompting.
Least-to-most prompting is for compositional problems — where simpler parts must be solved first and combined to solve the harder whole.
9. "Unfaithful CoT," identified by Turpin et al. at Anthropic (2023), refers to:
Correct. Unfaithful CoT is when the model's stated reasoning doesn't reflect its actual computation — particularly when biased prompts cause post-hoc rationalization of predetermined answers.
Unfaithful CoT means the reasoning chain is a rationalization, not a genuine derivation — the model's stated reasoning doesn't match how it actually computed the answer.
10. On which benchmark did o3 achieve 87.5% accuracy, compared to near-zero for GPT-4?
Correct. ARC-AGI (designed to require fluid intelligence) was one of the most striking demonstrations of reasoning improvement — o3 at 87.5% vs. near-zero for GPT-4.
It was ARC-AGI — the benchmark designed to test fluid intelligence — where o3's 87.5% accuracy dramatically outpaced GPT-4's near-zero performance.
11. According to Wei et al. (2022), chain-of-thought prompting is an "emergent" ability because:
Correct. Below ~100B parameters, CoT produced fluent but incorrect chains — the ability to reason beneficially through steps only emerged at larger scales.
It's emergent because it appears non-linearly at scale — small models produce fluent but incorrect CoT chains, while the capability becomes genuinely useful only above ~100B parameters.
12. What is the primary practical cost of self-consistency prompting?
Correct. Generating 40 paths costs 40× the tokens — practical for high-stakes tasks, but too expensive for routine interactions.
The cost is multiplicative: N paths means N× the inference compute and token budget — worthwhile for critical decisions, impractical for casual use.
13. The "scratchpad hypothesis" explains CoT's effectiveness by suggesting that:
Correct. The scratchpad hypothesis: by writing steps, the model distributes reasoning across many tokens rather than compressing it into one answer token — each step constrains subsequent predictions.
The scratchpad hypothesis is that written steps serve as external working memory — distributing complex computation across many tokens rather than requiring a single impossible leap.
14. Process reward models (Lightman et al., OpenAI 2023) differ from standard outcome reward models in that they:
Correct. Process reward models provide feedback at each reasoning step — enabling the model to learn what good intermediate reasoning looks like, not just what correct final answers look like.
Process reward models evaluate each step of reasoning — providing learning signal for intermediate quality, not just whether the final answer was right.
15. Which task type is MOST likely to benefit from chain-of-thought prompting?
Correct. Legal analysis under specific conditions requires multiple sequential inferences — exactly the kind of multi-hop, multi-step reasoning CoT is designed to support.
The legal reasoning task requires multiple sequential inferences — which is exactly what CoT is designed to support. Creative tasks, translation, and simple recall don't benefit nearly as much.