Module 4 · Lesson 1

What Is Chain-of-Thought Prompting?

Making models show their work — and why that changes everything.

Why does asking a model to "think step by step" dramatically improve its accuracy on complex tasks?

In January 2022, researchers at Google Brain published a paper that quietly rewrote the rules of prompting. Jason Wei and colleagues found that large language models solved multi-step arithmetic and commonsense reasoning problems far more reliably when the prompt included example reasoning steps, not just example answers. They called the technique chain-of-thought prompting. Their experiments on PaLM 540B showed accuracy on the GSM8K math benchmark jumping from roughly 18% with standard few-shot prompting to about 57% with chain-of-thought examples. The model wasn't getting smarter; it was being shown how to think.

The Core Idea

Standard prompting asks a model to produce an answer. Chain-of-thought (CoT) prompting asks a model to produce an intermediate reasoning process that leads to the answer. This seemingly small shift has outsized consequences: by externalizing reasoning steps into the token stream, the model can reference earlier conclusions when generating later ones — a capability that's mechanically impossible when jumping straight to an answer.

The fundamental insight is that language models generate text left-to-right, one token at a time. Each token is conditioned on all prior tokens. When you force intermediate reasoning into that token stream, the model's attention mechanism can "look back" at the logical steps it already wrote. The scratch-pad is the computation.
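One way to make the conditioning argument concrete is to write out the autoregressive factorization. The notation is introduced here only for illustration: x is the prompt, r the reasoning tokens, and a the answer tokens.

```latex
p(r, a \mid x)
  = \underbrace{\prod_{t=1}^{|r|} p(r_t \mid x, r_{<t})}_{\text{reasoning tokens}}
    \cdot
    \underbrace{\prod_{s=1}^{|a|} p(a_s \mid x, r, a_{<s})}_{\text{answer tokens}}
```

With direct prompting, r is empty and each answer token is conditioned only on x and the earlier answer tokens. With CoT, each answer token is also conditioned on the written reasoning r, which is what "looking back" at the scratch-pad means.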

Standard Prompt

Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have?

A: 11

Chain-of-Thought Prompt

Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have?

A: Roger starts with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11 balls.

Two Flavors: Few-Shot vs. Zero-Shot CoT

The original Wei et al. paper used few-shot CoT: you provide several worked examples in the prompt, each showing a question followed by a step-by-step reasoning chain and final answer. The model pattern-matches to this format for new questions.
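As a rough illustration of the few-shot format, the sketch below renders worked examples into a single prompt string. The `Exemplar` class, the `build_few_shot_cot_prompt` function, and the baker question are hypothetical names and data invented for this example; the "Q: / A: ... The answer is X." layout is one common convention, not a required one.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    reasoning: str  # worked steps, written out in prose
    answer: str


def build_few_shot_cot_prompt(exemplars, new_question):
    """Render worked examples, then the new question ending on an open 'A:'."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The trailing "A:" invites the model to continue in the same format
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


roger = Exemplar(
    question="Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
             "How many balls does he have?",
    reasoning="Roger starts with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11 balls.",
    answer="11",
)
prompt = build_few_shot_cot_prompt(
    [roger],
    "A baker has 4 trays of 6 rolls and sells 9 rolls. How many rolls are left?",
)
print(prompt)
```

In practice you would pass several exemplars, since a single demonstration gives the model little to pattern-match against.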

In May 2022, a separate team — Kojima et al. at the University of Tokyo — published a striking finding: you don't always need worked examples. Appending the phrase "Let's think step by step." to a question, with zero prior examples, was enough to elicit coherent reasoning chains in large models. This "zero-shot CoT" approach proved especially powerful because it requires no task-specific example engineering.

Research Finding

Kojima et al. (2022) found that adding "Let's think step by step" to prompts on the MultiArith benchmark improved zero-shot accuracy from 17.7% to 78.7% on GPT-3 (the text-davinci-002 variant), roughly a 4.4× improvement from a five-word addition.
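Zero-shot CoT is usually run as two calls: one to elicit the reasoning, and a second that feeds the reasoning back and asks for only the final answer. The sketch below follows that two-stage shape. The function names are hypothetical, `complete` stands for whatever model call you use, and `fake_complete` is a canned stand-in so the example runs offline.

```python
REASONING_TRIGGER = "Let's think step by step."
# Answer-extraction trigger for numeric tasks; adjust the wording per task
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def zero_shot_cot(question, complete):
    """Two-stage zero-shot CoT. `complete` maps prompt text to completion text."""
    # Stage 1: elicit the reasoning chain
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = complete(reasoning_prompt)

    # Stage 2: feed the reasoning back and ask for just the final answer
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    answer = complete(answer_prompt)
    return reasoning.strip(), answer.strip()


def fake_complete(prompt):
    """Stand-in for a real model call (an assumption for this sketch)."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 11."
    return " Roger starts with 5 balls. 2 cans of 3 is 6 balls. 5 + 6 = 11."


reasoning, answer = zero_shot_cot(
    "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have?",
    fake_complete,
)
print(reasoning)
print(answer)
```

The second call costs extra tokens, but it leaves you with a short, easily parsed answer instead of one buried in prose.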

When CoT Helps (and When It Doesn't)

Chain-of-thought prompting reliably helps with tasks that require multi-step reasoning: arithmetic word problems, logical deduction, symbolic manipulation, multi-hop question answering, and code debugging. The gains are largest on harder problems where direct lookup fails.

CoT provides little benefit — or can even hurt — on tasks that are essentially single-step retrieval: factual lookups, simple classification, or tasks where the answer is obvious from the surface form. Forcing reasoning steps on a simple question just wastes tokens and can introduce unnecessary confusion.

Model size also matters significantly. Wei et al. demonstrated that CoT gains are largely an emergent property of scale: the benefit appeared only at roughly 100 billion parameters and above. Smaller models often produce fluent but illogical reasoning chains that don't improve accuracy and sometimes worsen it.

Chain-of-Thought (CoT): A prompting technique that elicits intermediate reasoning steps from a model before it produces a final answer, improving accuracy on multi-step tasks by externalizing computation into the token stream.

Zero-Shot CoT: Triggering reasoning without worked examples, typically by appending "Let's think step by step" or a similar instruction to the prompt.

Few-Shot CoT: Providing multiple (question, reasoning chain, answer) demonstrations in the prompt so the model learns to replicate the reasoning format.

Developer Note

In API contexts, CoT reasoning costs tokens — and therefore money and latency. The tradeoff is worthwhile for genuinely complex tasks. For high-volume, simple classification workloads, strip CoT from production prompts and use it only during evaluation and debugging.

In Lesson 2 we'll examine the specific prompt patterns that elicit the most reliable reasoning chains, and the structural choices that determine whether CoT reasoning is trustworthy or merely convincing-sounding.

Lesson 1 Quiz

What Is Chain-of-Thought Prompting?

1. In the original Wei et al. 2022 paper, what happened to accuracy on the GSM8K benchmark when chain-of-thought prompting was applied to PaLM 540B?

Correct. Wei et al. documented this dramatic jump, demonstrating CoT's power on multi-step math tasks.

Not quite. The paper showed accuracy going from ~18% to ~57% on GSM8K, a striking improvement driven by intermediate reasoning steps.

2. What is the key mechanical reason why intermediate reasoning steps improve model accuracy?

Correct. The scratch-pad IS the computation — prior reasoning tokens in the context window become available to attend to when generating subsequent tokens.

Not quite. The mechanism is that transformer models generate left-to-right, so explicitly written reasoning steps become part of the context the model conditions its next tokens on.

3. Kojima et al. (2022) showed that adding which phrase dramatically improved zero-shot reasoning accuracy?

Correct. This five-word phrase is the canonical zero-shot CoT trigger from Kojima et al.'s "Large Language Models are Zero-Shot Reasoners."

The specific phrase from Kojima et al. was "Let's think step by step." — deceptively simple and empirically powerful.

Lab 1 — Zero-Shot CoT Triggers

Experiment with phrases that elicit step-by-step reasoning from scratch.

Your Task

You'll craft prompts that elicit chain-of-thought reasoning without providing worked examples. Try different trigger phrases and compare the quality of reasoning they produce. Discuss your findings with the AI assistant — aim for at least 3 exchanges.

Suggested starting point: Give the assistant a multi-step word problem (e.g., a rate problem or logic puzzle) and ask it to solve it. Then retry with "Let's think step by step" appended. Ask the assistant to explain what changed and why CoT helps here.

CoT Lab Assistant

Zero-Shot CoT

Hello! I'm your Chain-of-Thought lab assistant. In this lab we're exploring zero-shot CoT triggers — phrases that make models reason step by step without any examples. Try giving me a word problem first without any reasoning trigger, then again with "Let's think step by step" added. We can compare the results and discuss why the reasoning structure changes. What problem would you like to start with?

Module 4 · Lesson 2

Designing Effective CoT Prompts

The structural choices that determine whether reasoning chains are trustworthy.

What separates a CoT prompt that genuinely improves reasoning from one that just generates plausible-sounding nonsense?

When Anthropic's researchers were developing Constitutional AI in 2022, they faced a practical engineering challenge: they needed language models to reliably critique and revise their own outputs according to a set of principles. The critique step required coherent multi-step reasoning: the model had to identify a problem, reference the relevant principle, and then rewrite accordingly. Early experiments showed that even powerful models produced shallow, circular critiques when prompted naively. The breakthrough came from structuring prompts to force explicit intermediate conclusions: "First, identify the specific harm. Then, name the constitutional principle it violates. Then, write a revision." This structured decomposition became a load-bearing element of the Constitutional AI pipeline, which relies on AI-generated feedback (RLAIF) rather than human preference labels for harmlessness.

The Anatomy of a Good CoT Example

When constructing few-shot CoT prompts, the quality of your exemplar reasoning chains matters enormously. Later ablation work found that exemplars with invalid reasoning, where the steps don't logically connect, can still sometimes improve accuracy, but reliable gains require coherent intermediate steps that actually support the conclusion.

A high-quality few-shot CoT exemplar has four properties:

1. Specificity: Each step references concrete values or facts from the problem, not generic statements. "The train travels at 60 mph for 2 hours, covering 120 miles" beats "the train covers some distance."
2. Sequential dependence: Each step should build on prior steps. If step 3 could be written without steps 1 and 2, the chain is not actually reasoning; it's just formatting.
3. Explicit conclusion: The final answer should be clearly derived from the last reasoning step, not introduced abruptly. "Therefore, the answer is X" forces the model to connect reasoning to output.
4. Appropriate granularity: Don't over-decompose trivial steps or under-decompose hard ones. If the problem has three genuinely hard sub-problems, give each its own step.

Prompt Structure Patterns

The most reliable CoT prompt structure in production API usage follows this template:

# System prompt sets the reasoning contract
system: "You are a careful reasoning assistant. When solving problems:
1. Break the problem into sub-problems.
2. Solve each sub-problem explicitly.
3. Use prior sub-solutions to reach the final answer.
4. State your final answer clearly after your reasoning."

# User turn provides the problem
user: "A store sells apples for $1.20 each and pears for $0.80 each. 
Alice buys 5 apples and 3 pears. Bob buys 2 apples and 7 pears. 
Who spends more, and by how much?"

# Model will produce reasoning + answer in one turn
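When you test a template like this, it helps to have a reference answer computed outside the model. The sketch below works the apples-and-pears example directly. The names `PRICES`, `BASKETS`, and `basket_total` are illustrative, and `Decimal` is used so currency arithmetic stays exact.

```python
from decimal import Decimal

PRICES = {"apple": Decimal("1.20"), "pear": Decimal("0.80")}
BASKETS = {
    "Alice": {"apple": 5, "pear": 3},
    "Bob": {"apple": 2, "pear": 7},
}


def basket_total(basket):
    """Sum price × quantity for every item in one shopper's basket."""
    return sum((PRICES[item] * qty for item, qty in basket.items()), Decimal("0"))


totals = {name: basket_total(basket) for name, basket in BASKETS.items()}
bigger_spender = max(totals, key=totals.get)
difference = max(totals.values()) - min(totals.values())
print(totals, bigger_spender, difference)
```

Alice spends $8.40 and Bob spends $8.00, so a correct reasoning chain should conclude that Alice spends $0.40 more.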

The Faithfulness Problem

A critical and underappreciated issue: CoT reasoning chains are not always faithful explanations of how the model reached its answer. Research by Turpin et al. (2023), "Language Models Don't Always Say What They Think," demonstrated that when a prompt contains a biasing feature (for example, a user-suggested answer, or few-shot examples whose correct answer is always option A), models often shift their answer toward the bias. The accompanying reasoning chain then rationalizes that answer post hoc and rarely mentions the bias that actually drove it.

This means you cannot fully trust a reasoning chain as a ground-truth explanation of model behavior. CoT improves accuracy but doesn't guarantee mechanistic transparency. For safety-critical applications, you need additional verification — ideally checking the final numerical or logical result independently, not just the reasoning prose.

Engineering Implication

In agentic systems, parse and validate intermediate CoT outputs programmatically. Don't let the model's confidence in its own reasoning chain substitute for external verification of derived facts or calculations.
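A minimal sketch of programmatic validation, assuming the reasoning contains simple steps of the form "a op b = c" (optionally with a one-word unit after each operand). The pattern and function name are hypothetical. It checks only isolated binary operations, so compound expressions such as "5 × 1.20 + 3 × 0.80 = 8.40" would need a real expression parser.

```python
import re

# Matches steps such as "2 cans × 3 balls = 6" or "5 + 6 = 11"
STEP_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)(?:\s+[A-Za-z]+)?\s*([+\-×*/÷])\s*"
    r"(\d+(?:\.\d+)?)(?:\s+[A-Za-z]+)?\s*=\s*(\d+(?:\.\d+)?)"
)
OPERATIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "×": lambda a, b: a * b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "÷": lambda a, b: a / b,
}


def check_arithmetic_steps(reasoning, tolerance=1e-9):
    """Return (expression, is_correct) for each 'a op b = c' step found."""
    results = []
    for left, op, right, claimed in STEP_PATTERN.findall(reasoning):
        expression = f"{left} {op} {right} = {claimed}"
        if op in "/÷" and float(right) == 0:
            results.append((expression, False))  # division by zero is never valid
            continue
        actual = OPERATIONS[op](float(left), float(right))
        results.append((expression, abs(actual - float(claimed)) <= tolerance))
    return results


good = check_arithmetic_steps("2 cans × 3 balls = 6 balls. 5 + 6 = 11 balls.")
bad = check_arithmetic_steps("5 + 6 = 12")
print(good)
print(bad)
```

A failed check is a signal to retry or escalate, not proof that the final answer is wrong; likewise, passing checks do not prove the chain is faithful.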

Reasoning Elicitation Without "Think Step by Step"

The "Let's think step by step" phrase has become so common that some researchers argue newer models have been fine-tuned to produce reasonable-sounding reasoning in response to it regardless of actual need. Alternative elicitation strategies include:

Scratchpad format: Explicitly label a <scratchpad> section for working, then an <answer> section for the final response. This forces separation of reasoning from output.

Problem decomposition: "First identify the unknowns. Then identify the equations. Then solve." Task-specific decomposition instructions like these outperform a generic "step by step."

Role-based reasoning: "Act as a careful analyst who always checks their work before submitting." This activates reasoning behaviors without explicit step-counting.

Lesson 3 takes CoT from single-path reasoning to multi-path approaches — including self-consistency and tree-of-thought, which systematically explore multiple reasoning trajectories to improve reliability further.

Lesson 2 Quiz

Designing Effective CoT Prompts

1. What did Turpin et al. (2023) reveal about chain-of-thought reasoning chains?

Correct. Turpin et al. showed models don't always "say what they think" — sycophancy can drive the answer while the reasoning chain is reverse-engineered to justify it.

Not quite. Turpin et al.'s key finding was that reasoning chains can be post-hoc rationalizations, not faithful mechanistic explanations of how the model reached its answer.

2. Which property of a good CoT exemplar ensures that reasoning steps actually build on each other?

Correct. Sequential dependence is what distinguishes genuine reasoning chains from formatted lists of independent statements.

The key structural property is sequential dependence — if step 3 could be written without steps 1 and 2, it's not actually a reasoning chain, just formatting.

3. For safety-critical applications using CoT, what does the lesson recommend?

Correct. Because CoT chains can be unfaithful, safety-critical systems should validate the actual outputs programmatically, independent of the model's stated reasoning.

Given the faithfulness problem, the right approach is external verification of derived facts — not relying on the reasoning prose as ground truth.

Lab 2 — Structuring CoT Exemplars

Build high-quality few-shot reasoning demonstrations and test their structural properties.

Your Task

You'll construct a few-shot CoT prompt with carefully designed exemplars, then discuss what makes each reasoning step strong or weak. The assistant will help you evaluate sequential dependence, specificity, and conclusion clarity.

Try this: Write a 2-example few-shot CoT prompt for a task of your choice (e.g., unit conversion, scheduling conflicts, or code debugging). Share it with the assistant and ask it to critique each reasoning step against the four quality criteria from Lesson 2.

CoT Lab Assistant

Few-Shot Design

Welcome to Lab 2! We're focusing on the structure of few-shot CoT exemplars. Share a prompt you've built with reasoning examples, or describe a task you want to design CoT demonstrations for. I'll help you evaluate the reasoning steps against the four quality criteria: specificity, sequential dependence, explicit conclusion, and appropriate granularity. What would you like to work on?

Module 4 · Lesson 3

Self-Consistency & Tree of Thought

From one reasoning path to many — sampling, voting, and branching for reliability.

If a single reasoning chain can go wrong, what happens when you run many chains and take a vote?

Chain-of-thought prompting was powerful, but it had a fragility problem: a single unlucky reasoning path, with one wrong intermediate step, would cascade into a wrong answer. In March 2022, Xuezhi Wang and colleagues at Google Research proposed a simple but remarkably effective fix. Instead of generating one reasoning chain and taking its answer, generate many reasoning chains with temperature > 0, then take a majority vote on the final answers. They called this self-consistency. On the GSM8K math benchmark with PaLM 540B, self-consistency with 40 reasoning paths improved accuracy from 56.5% (standard CoT) to 74.4%, a gain of about 18 points without any new examples or model changes.

Self-Consistency: The Core Algorithm

Self-consistency treats reasoning generation as a stochastic process and exploits the statistical fact that correct reasoning paths, while diverse in their expression, tend to converge on correct answers. Incorrect reasoning paths, being wrong for different idiosyncratic reasons, tend to diverge across answer space.

1. Send the same CoT prompt N times with temperature > 0 (typically 0.5–0.8) to get diverse reasoning chains.
2. Parse the final answer from each response. The intermediate reasoning is used only to reach the answer; it's discarded for the voting step.
3. Take a majority vote over final answers. The most frequent answer is selected as the output.
4. Optionally, use weighted voting: weight each answer by the model's stated confidence or by an answer consistency score.
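The steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `sample` stands for any model call made with temperature > 0, the "Final Answer:" marker is an assumed prompt convention, and the canned responses exist only so the example runs offline.

```python
import re
from collections import Counter

FINAL_ANSWER = re.compile(r"Final Answer:\s*(.+)", re.IGNORECASE)


def parse_final_answer(response):
    """Return the text after the last 'Final Answer:' marker, or None if absent."""
    matches = FINAL_ANSWER.findall(response)
    return matches[-1].strip().rstrip(".") if matches else None


def self_consistency(prompt, sample, n=5):
    """Sample n reasoning chains and return (majority answer, vote share)."""
    answers = [parse_final_answer(sample(prompt)) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)  # unparseable chains don't vote
    if not votes:
        raise ValueError("No parseable final answers in any sample")
    answer, count = votes.most_common(1)[0]
    return answer, count / n


# Canned completions standing in for five sampled reasoning chains
canned = iter([
    "5 + 2 × 3 = 11. Final Answer: 11",
    "2 cans of 3 is 6, plus 5. Final Answer: 11",
    "5 + 2 = 7, times 3. Final Answer: 21",
    "6 new balls plus 5. Final Answer: 11.",
    "I am not sure.",
])
answer, share = self_consistency("(prompt text)", lambda _prompt: next(canned), n=5)
print(answer, share)
```

Here three of five chains agree on 11, so the vote share is 0.6. A low vote share is a useful signal that the question may need more samples or human review.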

Cost Warning

Self-consistency multiplies your API cost by N. Sampling 40 paths (as in Wang et al.'s best results) means 40× the token cost. In production, 5–10 paths often capture most of the benefit. Profile your accuracy/cost tradeoff empirically for your specific task before deploying.

Tree of Thoughts (ToT)

Self-consistency samples independent reasoning paths in parallel. Tree of Thoughts — proposed by Yao et al. at Princeton and Google DeepMind in 2023 — takes a fundamentally different approach: it explores reasoning as a tree structure, allowing backtracking and deliberate search through the reasoning space.

In ToT, the model generates multiple "thought" continuations at each step, evaluates each continuation's promise (either self-evaluated or via a separate evaluator prompt), then selects the most promising branch to extend. This allows the model to abandon unproductive reasoning paths early rather than committing fully to each independent chain.

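# Simplified Tree of Thoughts sketch (beam-search variant).
# generate_thoughts and evaluate_thoughts are caller-supplied wrappers around model calls:
#   generate_thoughts(path, n) -> list of n candidate next thoughts
#   evaluate_thoughts(paths)   -> list of scores, one per path
def tree_of_thoughts(problem, generate_thoughts, evaluate_thoughts, breadth=3, depth=4):
    # Each node is the full path of thoughts so far, starting from the problem
    current_nodes = [[problem]]

    for _ in range(depth):
        candidates = []
        for node in current_nodes:
            # Generate `breadth` continuations per node and extend the path with each
            for thought in generate_thoughts(node, n=breadth):
                candidates.append(node + [thought])

        # Evaluate each candidate path
        scores = evaluate_thoughts(candidates)

        # Keep the top `breadth` paths (the beam); the rest are abandoned
        ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
        current_nodes = [path for _, path in ranked[:breadth]]

    # The highest-scoring surviving path ends in the proposed answer
    return current_nodes[0]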

ToT in Practice: When to Use Each

Tree of Thoughts works best on tasks where you can evaluate intermediate progress — game states, proof steps, code that can be partially executed, planning sequences where you can check constraint satisfaction. It struggles on tasks with no meaningful intermediate evaluation signal.

Yao et al.'s original paper showed ToT dramatically outperforming standard CoT and self-consistency on the Game of 24 (a combinatorial math puzzle) and crossword puzzles — tasks where backtracking is essential. On standard math word problems where self-consistency already works well, the additional complexity of ToT often isn't worth the cost.
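To show why an intermediate evaluation signal matters, here is a toy search in which partial progress can be checked exactly: placing four queens on a 4×4 board, one row at a time. This is not the paper's setup. `propose` and `score` are deterministic stand-ins for what would be model calls in a real ToT system, and all names are hypothetical.

```python
def propose(partial, n_cols):
    """Stand-in for 'generate thoughts': try every column in the next row."""
    return [partial + [col] for col in range(n_cols)]


def score(partial):
    """Stand-in evaluator: 1.0 if no two queens attack each other, else 0.0."""
    for r1, c1 in enumerate(partial):
        for r2 in range(r1 + 1, len(partial)):
            c2 = partial[r2]
            if c1 == c2 or abs(c1 - c2) == r2 - r1:
                return 0.0
    return 1.0


def tree_search(n=4, breadth=3):
    """Breadth-limited search: expand, score, drop dead ends, keep the top `breadth`."""
    frontier = [[]]
    for _ in range(n):
        candidates = [child for node in frontier for child in propose(node, n)]
        scored = [(score(c), c) for c in candidates]
        viable = [pair for pair in scored if pair[0] > 0]  # prune dead ends early
        # Stable sort: ties keep their proposal order
        viable.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [c for _, c in viable[:breadth]]
        if not frontier:
            return None  # every branch was pruned
    return frontier[0]


solution = tree_search()
print(solution)
```

With these settings the search returns `[1, 3, 0, 2]` (the queen's column in each row). Along the way the branches starting `[0, 2]` and `[0, 3, 1]` are abandoned as soon as they have no valid continuation, which is the behavior that independent self-consistency samples cannot provide.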

Self-Consistency

Parallel independent paths · Majority voting · Best for: math, factual QA · Moderate cost · No backtracking

Tree of Thoughts

Sequential branching + pruning · Evaluator-guided search · Best for: planning, puzzles, multi-step code · High cost · Explicit backtracking

Real Deployment

OpenAI's o1 and o3 model families are trained to produce an extended internal chain of thought before answering; the raw reasoning tokens are not exposed to users. This is effectively a trained, internalized version of CoT, and it shows that CoT has moved from a prompting trick to something models are explicitly trained to do.

Lesson 4 closes the module by addressing CoT in production systems: latency management, structured output extraction from reasoning chains, debugging when CoT reasoning goes wrong, and the emerging landscape of models with built-in reasoning capabilities.

Lesson 3 Quiz

Self-Consistency & Tree of Thought

1. What is the core voting mechanism in self-consistency prompting?

Correct. Self-consistency generates N independent reasoning chains with temperature > 0, parses the final answer from each, and takes a majority vote.

Self-consistency works by sampling many independent reasoning chains (with temperature > 0) and selecting the most frequently occurring final answer across all chains.

2. On which type of task does Tree of Thoughts provide the most benefit over standard CoT?

Correct. Yao et al. demonstrated ToT's advantages on Game of 24 and crossword puzzles — tasks where backtracking through the reasoning space is essential.

Tree of Thoughts excels on tasks where you can evaluate intermediate progress and need to backtrack — puzzles, planning, multi-step code. Linear tasks don't benefit from tree search.

3. Wang et al.'s self-consistency paper showed accuracy on GSM8K (with PaLM 540B) improving from 56.5% to 74.4%. What was the key change from standard CoT?

Correct. No new data, no fine-tuning — just sampling 40 diverse reasoning paths and majority-voting over their answers produced the dramatic accuracy improvement.

The key was generating many (up to 40) independent reasoning chains with temperature > 0 and taking a majority vote — no model changes or additional training required.

Lab 3 — Self-Consistency in Practice

Explore how sampling multiple reasoning paths changes answer reliability.

Your Task

You'll practice the self-consistency approach by examining how different reasoning paths can reach the same answer via different routes, and discuss the engineering tradeoffs of running multiple API calls vs. a single greedy decode.

Try this: Give the assistant a moderately difficult math or logic problem and ask it to solve it three times using different reasoning approaches. Then discuss: which answers agree? What would you do if two paths gave one answer and one gave another? How would you implement this in a production API workflow?

CoT Lab Assistant

Self-Consistency

Welcome to Lab 3! We're exploring self-consistency — sampling multiple reasoning paths and voting over their answers. Give me a problem you'd like to reason through multiple ways, and I'll demonstrate different reasoning routes to the same answer. Then we can discuss how you'd implement this pattern in a real API system, including cost tradeoffs and answer aggregation strategies. What problem would you like to explore?

Module 4 · Lesson 4

CoT in Production Systems

Latency, extraction, debugging, and the era of native reasoning models.

How do you ship chain-of-thought reasoning to users without turning every API call into a slow, expensive reasoning marathon?

When Microsoft launched Bing Chat in February 2023 (later confirmed to be running on GPT-4), early users discovered that long sessions could lead the model into tangential loops that produced confident but wrong answers, or into emotionally escalating exchanges. Microsoft's public response included capping the number of turns per session. The lesson carries over to CoT systems: extended generation needs bounds, and reasoning chains are infrastructure, not user-facing output. Users see answers; the reasoning chain is a hidden intermediate computation layer that must be managed, monitored, and bounded, typically with prompt structures that constrain reasoning depth and extraction layers that pull structured conclusions out of reasoning prose before anything reaches the user.

Latency Management

Chain-of-thought reasoning generates substantially more tokens than direct answers. On complex problems, a CoT response might be 500–2000 tokens where a direct answer would be 10–50. At current API speeds, this adds 2–8 seconds of latency for typical hosted model APIs. Three strategies help:

1. Streaming: Use streaming APIs to show partial reasoning output as it generates. Users tolerate latency better when they see progress. Most modern inference APIs support SSE-based streaming.
2. Selective CoT routing: Classify incoming queries by complexity. Route simple queries to a direct-answer model; route complex queries through CoT. A lightweight classifier (or even a regex/heuristic) can make this decision in <10ms.
3. Max token capping: Set a max_tokens budget for the reasoning section. If the model hasn't reached a conclusion, extract whatever partial reasoning exists. A well-structured prompt with a clear "Final Answer:" delimiter lets you parse partial responses gracefully.
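As a rough sketch of the routing idea, the heuristic below sends a query down the CoT path when it contains several numbers or a reasoning cue. The cue list, the threshold, and both function names are assumptions made for illustration; a production router should be tuned and evaluated on real traffic.

```python
import re

# Hypothetical cues that a query needs multi-step reasoning
REASONING_CUES = re.compile(
    r"\b(how many|how much|how long|total|difference|compare|why|prove|debug)\b",
    re.IGNORECASE,
)
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def needs_cot(query, min_numbers=2):
    """Cheap heuristic: use CoT if the query has several numbers or a reasoning cue."""
    return len(NUMBER.findall(query)) >= min_numbers or bool(REASONING_CUES.search(query))


def build_prompt(query):
    """Attach step-by-step instructions only when the heuristic asks for them."""
    if needs_cot(query):
        return f"{query}\n\nReason step by step, then finish with 'Final Answer: <answer>'."
    return f"{query}\n\nAnswer directly in one short sentence."


print(needs_cot("What is the capital of France?"))
print(needs_cot("Alice buys 5 apples and 3 pears. Who spends more?"))
```

A regex pass like this is cheap, but it will misroute some queries. Logging routing decisions alongside answer quality makes it possible to replace the heuristic with a trained classifier later.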

Structured Output Extraction from CoT

The most common production pattern is to use CoT for reasoning but then extract structured data from the reasoning output before passing it downstream. This requires a clear delimiter contract in your prompt:

# Prompt template with delimiter contract
system = """Reason through the problem step by step inside <reasoning> tags.
Then provide your final answer inside <answer> tags.
The answer must be valid JSON matching this schema: {result: number, unit: string}

Example:
<reasoning>
Step 1: ...
Step 2: ...
Therefore: ...
</reasoning>
<answer>{"result": 42, "unit": "miles"}</answer>"""

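# Extraction in Python
import json
import re

def extract_answer(response_text):
    """Return the parsed JSON payload from the <answer> block, or raise ValueError."""
    # Non-greedy match so only the first <answer>...</answer> pair is captured
    match = re.search(r'<answer>(.+?)</answer>', response_text, re.DOTALL)
    if match is None:
        raise ValueError("No structured answer found in response")
    try:
        return json.loads(match.group(1).strip())
    except json.JSONDecodeError as exc:
        raise ValueError("Answer block is not valid JSON") from exc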
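Parsing the JSON is only half of the delimiter contract; the payload also has to match the schema the prompt promised. The sketch below adds that check for the `{result: number, unit: string}` shape. The function name `parse_measurement` and the sample response are hypothetical, and a schema library could replace the manual checks.

```python
import json
import re

ANSWER_BLOCK = re.compile(r"<answer>(.+?)</answer>", re.DOTALL)


def parse_measurement(response_text):
    """Extract the <answer> JSON and check it against {result: number, unit: string}."""
    match = ANSWER_BLOCK.search(response_text)
    if match is None:
        raise ValueError("No <answer> block found")
    payload = json.loads(match.group(1).strip())  # raises a ValueError subclass on bad JSON
    if not isinstance(payload, dict):
        raise ValueError("Answer must be a JSON object")
    result, unit = payload.get("result"), payload.get("unit")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(result, bool) or not isinstance(result, (int, float)):
        raise ValueError("'result' must be a number")
    if not isinstance(unit, str) or not unit:
        raise ValueError("'unit' must be a non-empty string")
    return result, unit


response = (
    "<reasoning>\nStep 1: 60 mph for 2 hours.\nTherefore: 120 miles.\n</reasoning>\n"
    '<answer>{"result": 120, "unit": "miles"}</answer>'
)
print(parse_measurement(response))
```

Rejecting a string such as `"120"` where a number was promised catches a common failure early, before it reaches downstream code that assumes numeric types.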

Debugging Broken CoT Chains

When a CoT-enabled system produces wrong answers, diagnosis follows a systematic process. The reasoning chain is your primary debugging artifact — read it carefully for these failure patterns:

Premise error: The model misread or misquoted a value from the input in Step 1, then correctly reasoned from the wrong premise. Fix: add an explicit value-extraction step to your prompt before reasoning begins.

Shortcut collapse: The reasoning chain gets truncated, and the model jumps from step 2 straight to a conclusion. Common when max_tokens is too low or when the prompt doesn't reinforce intermediate step requirements.

Confident confabulation: The chain includes a specific-sounding fact ("the regulation requires X") that's fabricated. Verification must be external; the confident tone of CoT prose is not a reliability signal.

Sycophantic drift: In multi-turn conversations, the model's reasoning chain starts accommodating prior user assertions even when they are wrong. Mitigation: re-anchor the full problem statement in each CoT prompt rather than relying on conversation history.
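Premise errors are the easiest of these to catch mechanically. Assuming the prompt asks the model to list the given values before reasoning, a sketch like the following flags any listed number that never appears in the problem text. The function name and sample strings are hypothetical, and the check covers only numeric values.

```python
import re
from decimal import Decimal

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_values(problem, extracted_values):
    """Return numbers the model listed as 'given' that never appear in the problem."""
    # Compare as Decimal so "1.2" and "1.20" count as the same value
    given = {Decimal(token) for token in NUMBER.findall(problem)}
    return [
        token for token in NUMBER.findall(extracted_values)
        if Decimal(token) not in given
    ]


problem = (
    "A store sells apples for $1.20 each and pears for $0.80 each. "
    "Alice buys 5 apples and 3 pears."
)
good_extraction = "Given: apple price 1.20, pear price 0.80, 5 apples, 3 pears."
bad_extraction = "Given: apple price 1.20, pear price 0.90, 5 apples, 3 pears."

print(unsupported_values(problem, good_extraction))
print(unsupported_values(problem, bad_extraction))
```

The first call returns an empty list; the second returns `['0.90']`, which points to a misquoted pear price before any downstream arithmetic is trusted.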

Native Reasoning Models: The Shifting Landscape

As of 2024–2025, the prompting techniques covered in this module are partially being superseded by models with trained-in reasoning. OpenAI's o1 (first released as a preview in September 2024) and o3, Google's Gemini 2.0 Flash Thinking, and Anthropic's Claude with extended thinking all generate an extended chain of thought before answering. The reasoning is still produced token by token, but it is elicited by training rather than by the prompt, and some providers hide or summarize it instead of returning the raw tokens.

For developers, this creates a decision tree: for tasks requiring transparent, inspectable reasoning steps (auditable decisions, regulated domains), explicit CoT prompting still provides control and traceability that black-box internal reasoning cannot. For maximum accuracy on hard reasoning tasks with no interpretability requirement, native reasoning models typically outperform prompted CoT on instruction-tuned models by large margins.

The CoT prompting skills in this module remain directly applicable to: smaller/cheaper models without native reasoning, fine-tuned domain-specific models, on-premise deployments where native reasoning models aren't available, and any context where you need to inspect and control intermediate reasoning steps.

Key Takeaway

Chain-of-thought is a spectrum from manual prompt engineering to reasoning behavior trained into the model. Understanding the mechanics makes you effective across the entire spectrum, whether you're coaxing reasoning from a small open-source model or deciding when to invoke a native reasoning model for your production workload.

You've now covered the full arc of CoT techniques. Take the Lab 4 session to practice production-style extraction and debugging, then attempt the Module Test to verify mastery across all four lessons.

Lesson 4 Quiz

CoT in Production Systems

1. What does "selective CoT routing" mean in production system design?

Correct. A lightweight classifier routes simple queries to direct-answer prompts and complex queries through CoT — balancing accuracy and latency/cost efficiently.

Selective CoT routing means using a fast classifier to identify which queries actually need step-by-step reasoning, and only adding CoT overhead for those queries.

2. What is a "premise error" in CoT debugging?

Correct. Premise errors are insidious because the downstream reasoning may be internally consistent — the entire chain is valid except for the corrupted input value at the start.

A premise error occurs when the model misquotes or misreads an input value at the start of its reasoning. The rest of the chain may be logically sound — but built on a wrong foundation.

3. Why do explicit CoT prompting techniques remain relevant even as native reasoning models emerge?

Correct. When you need interpretability, when using smaller/open-source models, or when native reasoning models aren't available, explicit CoT prompting remains the practical tool.

Explicit CoT remains essential for smaller/open-source models, on-premise deployments, regulated domains requiring transparent reasoning, and any context where inspecting intermediate steps is mandatory.

Lab 4 — Production CoT: Extraction & Debugging

Practice structured output extraction and diagnosing broken reasoning chains.

Your Task

You'll design a prompt with explicit delimiter tags for reasoning and answer sections, practice extracting structured outputs from CoT responses, and debug example reasoning chains that exhibit common failure modes.

Try this: Ask the assistant to solve a problem using the delimiter contract (<reasoning>...</reasoning> and <answer>...</answer>). Then ask it to deliberately demonstrate a "premise error" or "confident confabulation" failure mode. Discuss how you'd detect and handle that failure in a production system.

CoT Lab Assistant

Production CoT

Welcome to Lab 4 — our most applied session. We're practicing production-grade CoT: using delimiter contracts to separate reasoning from structured output, and diagnosing common failure modes. Try asking me to solve a problem with explicit <reasoning> and <answer> tags, then ask me to demonstrate what a premise error or confabulation failure looks like in practice. We'll also discuss detection strategies for each failure type. What would you like to start with?

Module 4 Test

Chain-of-Thought Techniques — 15 questions · 80% to pass

1. Who published the original chain-of-thought prompting paper and in what year?

Correct. Wei et al. at Google Brain published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" in 2022.

The paper was by Jason Wei et al. at Google Brain in 2022.

2. What does the mechanical argument explain about why CoT improves accuracy?

Correct. The scratch-pad is the computation — written reasoning steps become accessible context for all subsequent token generation.

The mechanism is that tokens generated in intermediate steps become part of the context the model conditions subsequent tokens on.

3. Zero-shot CoT was introduced in which paper?

Correct. Kojima et al.'s paper showed that "Let's think step by step" alone could trigger effective reasoning without any worked examples.

Zero-shot CoT came from Kojima et al. 2022, "Large Language Models are Zero-Shot Reasoners."

4. Below which approximate model scale does chain-of-thought prompting typically fail to improve accuracy?

Correct. Wei et al. showed CoT gains are largely emergent at around 100B parameters; smaller models produce fluent but illogical chains that don't reliably help.

CoT gains emerge at roughly 100 billion parameters. Below that scale, models tend to produce fluent but illogical reasoning chains.

5. Which of the following is NOT a property of a high-quality few-shot CoT exemplar?

Correct. Length is not a quality criterion — the lesson specifies "appropriate granularity." Longer is not better; the level of detail should match the problem's actual complexity.

Maximum length is NOT a quality criterion. The correct criterion is appropriate granularity — matching step detail to actual problem complexity.

6. The Turpin et al. 2023 paper "Language Models Don't Always Say What They Think" demonstrated what concerning finding?

Correct. Reasoning chains can be post-hoc rationalizations rather than faithful explanations of how the model arrived at its answer.

Turpin et al. showed models can reverse-engineer reasoning chains to justify answers they were already primed toward — a faithfulness problem.

7. What is the recommended API temperature setting for self-consistency sampling?

Correct. Temperature > 0 is essential for self-consistency: at temperature 0, decoding is effectively deterministic, so all N chains would come out the same and voting would be meaningless.

Self-consistency requires temperature > 0 (typically 0.5–0.8) to generate diverse reasoning paths. At temperature 0, decoding is effectively deterministic, so every chain would come out the same.
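
The voting step in self-consistency can be sketched in a few lines. This is a minimal illustration, not the method from any particular paper or library: it assumes the N chains have already been sampled at temperature > 0, and the helper names `extract_final_answer` and `majority_vote`, along with the "last number in the chain is the answer" rule, are hypothetical simplifications.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_final_answer(chain: str) -> Optional[str]:
    """Return the last number that appears in a reasoning chain, if any.

    Assumption: the final answer is the last number written in the chain.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def majority_vote(chains: List[str]) -> Optional[str]:
    """Pick the most common final answer across sampled chains."""
    answers = [a for a in map(extract_final_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hypothetical chains, standing in for N samples drawn at temperature > 0.
sampled_chains = [
    "5 + 2 * 3 = 5 + 6 = 11",
    "2 cans of 3 is 6. 5 + 6 = 11",
    "5 + 2 = 7, then 7 * 3 = 21",
]
print(majority_vote(sampled_chains))  # prints 11
```

Two of the three chains agree on 11, so the single faulty chain is outvoted. With identical chains there would be nothing to vote on, which is why sampling diversity matters.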

8. Tree of Thoughts differs from self-consistency primarily in that it:

Correct. ToT's key innovation is the ability to explore, evaluate, and backtrack through a tree of reasoning steps — not just sample independent parallel paths.

The key distinction is that ToT supports sequential branching and backtracking — it can abandon unproductive paths and explore alternatives, unlike parallel self-consistency.

9. On which type of benchmark did Yao et al. demonstrate the largest ToT advantage over standard CoT?

Correct. Game of 24 and crossword puzzles are combinatorial search problems where backtracking is essential — exactly where ToT's tree structure provides maximum benefit.

Yao et al.'s paper showed the largest ToT gains on Game of 24 and crossword puzzles — tasks that inherently require search and backtracking.

10. What is "selective CoT routing" and what is its primary benefit?

Correct. A fast classifier routes simple queries to direct-answer paths and hard queries to CoT — capturing most of the accuracy benefit without applying the latency/cost penalty universally.

Selective routing uses a lightweight classifier to identify which queries need CoT, applying the reasoning overhead only where it actually helps.

11. What is the recommended approach for extracting structured data from CoT responses in production?

Correct. A delimiter contract (e.g., <reasoning>...</reasoning><answer>...</answer>) lets you programmatically extract structured answers reliably even from lengthy reasoning outputs.

The recommended production approach is to define delimiter tags in the system prompt and parse the delimited answer section, which cleanly separates reasoning from structured output.
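
Parsing a delimiter contract like the one above might look like the following sketch. It assumes the model was instructed to wrap its output in `<reasoning>` and `<answer>` tags; the function name `extract_answer` and the sample response are hypothetical.

```python
import re

# Non-greedy match so only the content of the first <answer> section is captured.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response: str) -> str:
    """Return the text inside the <answer> tags, ignoring the reasoning."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        # Fail loudly: a missing tag means the model broke the contract.
        raise ValueError("response does not contain an <answer> section")
    return match.group(1).strip()


# Hypothetical model output that follows the delimiter contract.
response = (
    "<reasoning>Roger starts with 5 balls. 2 cans x 3 balls = 6. "
    "5 + 6 = 11.</reasoning>"
    "<answer>11</answer>"
)
print(extract_answer(response))  # prints 11
```

Raising an error when the tag is missing, rather than guessing at the answer, makes contract violations visible so they can be retried or logged.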

12. A model reads "the price is $15.00" but reasons as if it were "$1.50" throughout its chain. This is an example of which CoT failure mode?

Correct. A premise error occurs when an input value is misread or transcribed incorrectly in the first reasoning step, corrupting all downstream computation.

This is a premise error — the model misread the input value and reasoned correctly from the wrong premise. Fix by adding an explicit value-extraction step before reasoning begins.

13. OpenAI's o1 model (released September 2024) is relevant to CoT prompting because it:

Correct. o1 and similar native reasoning models show that extended CoT is increasingly becoming a trained model capability rather than a prompt-engineering technique.

o1 demonstrates that extended CoT has moved into model training: the model generates a long internal chain of reasoning before answering, rather than relying on explicit prompting.

14. "Sycophantic drift" in multi-turn CoT conversations is best mitigated by:

Correct. Re-anchoring the original problem in each prompt prevents the model's reasoning from being subtly corrupted by accumulated user assertions across turns.

Sycophantic drift is best countered by re-anchoring the full problem statement in each prompt, rather than letting the model reason from a conversation history that may contain wrong user assertions.
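
Re-anchoring can be as simple as rebuilding each prompt around the original problem statement. The sketch below is one possible shape, not a prescribed template: the function name `build_reanchored_prompt` and the prompt wording are assumptions made for illustration.

```python
def build_reanchored_prompt(problem: str, latest_user_message: str) -> str:
    """Restate the original problem ahead of the newest user turn.

    The wording is a hypothetical template; the point is that the full
    problem statement is repeated in every prompt.
    """
    return (
        "Original problem (authoritative):\n"
        f"{problem}\n\n"
        "Latest user message (may contain unverified claims):\n"
        f"{latest_user_message}\n\n"
        "Re-derive the answer step by step from the original problem."
    )


prompt = build_reanchored_prompt(
    problem="Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
            "How many balls does he have?",
    latest_user_message="I'm pretty sure the answer is 13.",
)
print(prompt)
```

Because the problem text is repeated verbatim, the model reasons from the stated facts rather than from an assertion made earlier in the conversation.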

15. For which scenario does the lesson specifically recommend continuing to use explicit CoT prompting rather than native reasoning models?

Correct. When auditability and transparency of intermediate reasoning are required — such as in regulated industries — explicit CoT prompting provides control that black-box native reasoning models cannot.

Explicit CoT remains essential when intermediate reasoning must be inspectable and auditable, as in regulated domains, and when working with smaller models or on-premises deployments.