Module 5 · Lesson 1

The Multi-Step Problem Advantage

Why chain-of-thought reasoning reshapes what AI can solve

When does extended thinking actually change the answer — and when is it just slower?

When OpenAI released o1 in September 2024, the benchmark that drew the most attention was the 2024 USA Mathematical Olympiad qualifier. OpenAI reported that standard GPT-4o correctly solved about 13% of the problems. o1 solved 83%, not by knowing more mathematics, but by thinking longer about the same knowledge.

The difference was not memory. It was process.

What "Multi-Step" Actually Means

A multi-step problem is one where the correct answer cannot be retrieved directly — it must be constructed through a sequence of intermediate conclusions, each of which depends on the last. Consider the difference between:

"What is the capital of France?" — retrieval. One step. Standard models excel.

"A company has 400 employees. 30% work remotely. Of those, 25% are in sales. How many in-office sales staff are there if sales is 20% of total headcount?" — construction. Four dependent steps. Errors compound if any step is wrong.

Standard language models are trained to predict the next token given all previous tokens. They are extraordinarily good at pattern completion. But they were not explicitly optimized to check their own intermediate steps before continuing. Reasoning models are trained with reinforcement learning to do exactly that — to treat their scratchpad as a workspace where partial answers get evaluated before being committed.
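To make the dependency concrete, the four steps of the headcount question above can be written out so that each value feeds the next. This is a minimal sketch; the variable names are illustrative, and it assumes the percentages apply exactly as stated in the question.

```python
# Worked version of the headcount question; variable names are illustrative.
total_employees = 400

# Step 1: remote employees (30% of total headcount).
remote = total_employees * 30 // 100

# Step 2: remote employees who are in sales (25% of remote staff).
remote_sales = remote * 25 // 100

# Step 3: all sales staff (20% of total headcount).
total_sales = total_employees * 20 // 100

# Step 4: in-office sales staff = all sales staff minus remote sales staff.
in_office_sales = total_sales - remote_sales

print(remote, remote_sales, total_sales, in_office_sales)  # 120 30 80 50
```

A mistake in step 1 (say, 130 remote employees instead of 120) would carry through steps 2 and 4 even if every later calculation were performed correctly.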

Research Finding

OpenAI's September 2024 technical report on o1 showed that on the AIME 2024 math competition (30 problems), standard GPT-4o solved 13.4% correctly in a single attempt, while o1 solved 74.4%. The AIME is a competition specifically designed to require multi-step mathematical reasoning, not formula recall.

The Compounding Error Problem

The core vulnerability of standard models on multi-step tasks is error compounding. If a model makes a 90%-accurate inference at each of five steps, and the steps fail independently, the probability of a correct final answer is 0.9⁵ ≈ 59%. But if that same model can review and correct its step-3 reasoning before proceeding, as o1-style models do, the effective per-step accuracy rises, and the success rate of the whole chain rises with it.

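This is why reasoning models don't just do slightly better on hard problems. They do categorically better. The gain is not linear with difficulty: because compound error grows exponentially with the number of dependent steps, a modest improvement in per-step accuracy produces a disproportionately large improvement on long chains.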
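The arithmetic can be sketched in a few lines. The code below assumes steps fail independently, and it models self-correction with a hypothetical `catch_rate` parameter (the share of step errors a review catches and fixes). That parameter is an illustrative assumption, not a measured property of any model.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a dependent chain is correct,
    assuming steps fail independently."""
    return per_step_accuracy ** steps


def accuracy_with_review(per_step_accuracy: float, catch_rate: float) -> float:
    """Effective per-step accuracy if a review catches and fixes a share of errors.

    catch_rate is a hypothetical parameter used only for illustration.
    """
    return per_step_accuracy + (1 - per_step_accuracy) * catch_rate


for steps in (1, 5, 10, 20):
    plain = chain_success(0.9, steps)
    reviewed = chain_success(accuracy_with_review(0.9, 0.8), steps)
    print(f"{steps:>2} steps: no review {plain:.2f}, with review {reviewed:.2f}")
```

Under these assumptions the gap between the two columns widens as the chain gets longer, which is the pattern the lesson describes.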

Chain-of-Thought (CoT) A prompting technique or built-in model behavior where intermediate reasoning steps are generated before the final answer. In reasoning models, CoT is internalised during RL training rather than elicited by user prompts.

Error Compounding The multiplicative accumulation of small per-step errors across a long reasoning chain. If step n is wrong, all subsequent steps built on it are also wrong, even if individually correct.

Scratchpad Reasoning Internal tokens generated by the model that represent working-out rather than final output. In o1/o3 the raw "thinking" is hidden from users, with only a summary shown, while Claude 3.7 Sonnet's extended thinking can be displayed; in both cases it influences the final answer.

Real Evidence: International Math Olympiad 2025

In July 2025, Google DeepMind's Gemini series achieved gold-medal-level performance on the International Mathematical Olympiad, a competition that requires solving six proof-based problems over two days. Each problem requires not just computation but constructing a valid mathematical argument, a task with dozens of sequential logical steps. Earlier DeepMind work on AlphaProof used formal proof assistants, but the Gemini approach used reasoning tokens to approximate step-by-step proof search in natural language.

This performance would have been impossible with standard token prediction alone. The problems are specifically structured to defeat lookup — no IMO problem repeats verbatim, and solutions require novel combinations of known techniques.

Core Principle

Reasoning models win on multi-step problems not because they have more knowledge, but because they are trained to evaluate and revise intermediate conclusions before committing to a final answer. This internal review is the mechanism behind benchmark gains.

When Standard Models Are Fine

It is equally important to understand when the extended thinking of reasoning models adds nothing. Factual retrieval, creative writing, translation, summarisation of a single document, simple code completion — none of these involve multi-step dependency chains. A standard model answering "What year was the Eiffel Tower completed?" gains nothing from 20 seconds of internal deliberation. It either knows the answer (1889) or it doesn't.

Using a reasoning model for these tasks is like using a scientific calculator to add two numbers. It will give the right answer, but at unnecessary cost and latency. Module 5 is specifically about the tasks where the calculator's extra capabilities genuinely change the result.

Lesson 1 · Quiz

The Multi-Step Problem Advantage

Three questions — select the best answer for each.

On the AIME 2024 competition, what was the primary reason o1 dramatically outperformed GPT-4o?

Correct. The AIME gap (13.4% vs 74.4%) reflects the RL-trained ability to work through steps and self-correct, not superior factual knowledge.

Not quite. The key difference is the internal review mechanism — checking intermediate steps — not data volume, speed, or retrieval.

If a model has 90% per-step accuracy and a task has 5 dependent steps, what is the approximate probability of a correct final answer?

Correct. 0.9⁵ ≈ 0.59. Error compounding means even high per-step accuracy degrades rapidly over many steps — which is why reasoning models' self-correction matters so much.

Remember the compounding formula: 0.9 × 0.9 × 0.9 × 0.9 × 0.9 ≈ 0.59. Errors multiply rather than average.

Which of the following tasks would benefit LEAST from a reasoning model versus a standard model?

Correct. Translation is a mapping task without multi-step dependency chains. Standard models handle it excellently, and extended internal deliberation adds nothing meaningful.

Translation involves no multi-step dependency chain — each phrase maps relatively independently. The other options all involve sequential reasoning where errors compound.

Lesson 1 · Lab

Identifying Multi-Step Problems

Practice recognising when reasoning models add genuine value.

Your Task

In this lab you will present the AI with task descriptions and work out together whether they genuinely require multi-step dependent reasoning — or whether a standard model would do equally well. The goal is to build your intuition for the distinction before you encounter it in real work.

Complete at least 3 exchanges to finish this lab.

Start by describing a task from your own work or studies and ask: "Does this need a reasoning model, or will a standard model do?" Then push further — ask why, and what would change if the task got more complex.

Reasoning Lab

L1 · Multi-Step Recognition

Welcome to the Multi-Step Recognition lab. Describe any task — from data analysis to writing to coding — and I'll help you assess whether it genuinely involves the kind of dependent-step reasoning where reasoning models outperform standard ones. What task would you like to evaluate?

Module 5 · Lesson 2

Code Debugging and Generation at Scale

How reasoning models changed software development benchmarks — and real developer workflows

What specifically makes complex debugging a reasoning task rather than a retrieval task?

Anthropic's published evaluation results, for the upgraded Claude 3.5 Sonnet in October 2024 and for the later Claude 3.7 Sonnet with extended thinking in February 2025, showed both models achieving top scores on SWE-bench Verified. This is a benchmark of 500 real GitHub issues from production software repositories, requiring the model to locate, diagnose, and fix actual bugs in codebases it had never seen. The best reported result for Claude 3.7 Sonnet was 70.3% of issues solved, achieved with a custom scaffold. Standard models without extended thinking solved significantly fewer.

Why Debugging Is Multi-Step

Debugging a non-trivial software defect involves a sequence that looks roughly like: (1) understand the failing behaviour from the error description, (2) identify the relevant code region, (3) trace the execution path that produces the failure, (4) hypothesise which specific line or logic is wrong, (5) verify the hypothesis against related code, (6) generate a fix, (7) check the fix doesn't break adjacent functionality.

Each of these steps depends on the previous ones. If step 3 produces the wrong execution trace, step 4's hypothesis is built on faulty premises. A standard model, being a next-token predictor, will often jump directly from "error message seen" to "here is a plausible fix" — pattern-matching on similar-looking bugs it encountered in training. This works well for common, shallow bugs. It fails systematically on bugs where the surface presentation is misleading.
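A small invented example shows how the surface presentation of a bug can mislead. The wrong number appears in `order_total`, but that function is correct; the defect sits in the helper it calls. All function names and values here are hypothetical, chosen only to illustrate the trace-then-hypothesise steps above.

```python
def apply_discount(price: float, percent: float) -> float:
    # Bug: treats `percent` (e.g. 10 for 10%) as if it were a fraction (0.10).
    return price * (1 - percent)


def apply_discount_fixed(price: float, percent: float) -> float:
    # Fix: convert the percentage to a fraction before applying it.
    return price * (1 - percent / 100)


def order_total(items, discount_fn) -> float:
    # The wrong total surfaces here, although this function itself is correct.
    return sum(discount_fn(price, percent) for price, percent in items)


items = [(100.0, 10.0), (50.0, 0.0)]  # (price, discount percent)
print(order_total(items, apply_discount))        # symptom: a negative total
print(order_total(items, apply_discount_fixed))  # expected total of about 140
```

Patching `order_total` because that is where the symptom appears would be the pattern-matching shortcut; tracing the execution path one call deeper locates the actual cause.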

Real Benchmark · SWE-bench Verified

What SWE-bench Actually Tests

SWE-bench, developed at Princeton, draws on real GitHub issues and their associated pull requests from 12 major open-source Python projects (including Django, scikit-learn, and requests). SWE-bench Verified, published in 2024, is a human-validated subset of 500 of those tasks. The model receives the issue description and codebase; it must produce a patch that resolves the issue and passes the project's tests. There is no "looking up the answer": these are real bugs with real fixes that require understanding each project's specific architecture.

As of early 2025, Claude 3.7 Sonnet extended thinking and OpenAI o3 both significantly outperformed their standard-mode equivalents on this benchmark, with reasoning-enabled models showing the largest gains on issues classified as "hard" (requiring cross-file reasoning).

Code Generation for Novel Algorithms

Beyond debugging, reasoning models show large advantages in writing novel algorithmic code: code that implements logic the model cannot simply recall from training, but must derive. The competitive programming platform Codeforces, which involves solving algorithmic problems under time constraints, shows a similar pattern. In 2024–2025, o1 was reported at roughly the 89th percentile of Codeforces competitors, and o3 at a Grandmaster-level rating. Standard GPT-4 class models never approached that level.

Codeforces problems require constructing a proof of correctness alongside the implementation — you must reason about edge cases, time complexity, and invariants simultaneously. This is exactly the multi-step dependency structure that reasoning models handle better.

Task Type	Standard Model	Reasoning Model
Simple syntax error fix	Excellent	Excellent
Common bug pattern (e.g. off-by-one)	Good	Good
Cross-file logic error in unfamiliar codebase	Inconsistent	Strong
Novel algorithm implementation from spec	Moderate	Strong
Competitive programming (Codeforces div 1)	Weak	Grandmaster level
Boilerplate API wrapper generation	Excellent	Excellent

The Agentic Coding Context

In late 2024 and 2025, "AI coding agents" — tools like Cursor, Devin, and GitHub Copilot Workspace — began using reasoning models as their underlying engine precisely because agentic tasks compound the multi-step problem. An agent must plan which files to read, decide what information is needed, formulate a hypothesis, execute a change, observe the result, and adapt. Each cycle is itself multi-step, and the cycles chain together into a longer episode. Error recovery (noticing a fix made things worse and backtracking) requires exactly the kind of self-evaluation that reasoning models are trained for.

Cognition AI's Devin, when it passed the SWE-bench tasks that required exploring a codebase over many steps, used exactly this kind of extended deliberation — reasoning about which parts of a large unknown codebase were relevant before attempting changes.

Practical Implication

If your coding task involves understanding unfamiliar code, tracing indirect causation (bug A causes symptom B three layers away), or designing an algorithm from first principles, a reasoning model will likely produce noticeably better results. For code completion, boilerplate, and common patterns, a standard model is equally effective and faster.

Lesson 2 · Quiz

Code Debugging and Generation

Three questions on reasoning models in software tasks.

SWE-bench Verified measures AI performance on which type of task?

Correct. SWE-bench uses real GitHub pull requests from 12 open-source Python projects; the model must produce a patch that resolves the issue and passes existing tests.

SWE-bench specifically tests bug-fixing in real codebases with real GitHub issues — not feature writing, translation, or documentation.

Why do standard models fail more often on "cross-file logic errors" than on simple syntax errors?

Correct. Cross-file bugs require building a causal chain — the symptom in file A is caused by logic in file B which is invoked via file C. Surface pattern-matching misses indirect causation.

The issue is reasoning depth, not file handling or training data. Cross-file errors require multi-step causal tracing that bypasses surface pattern-matching.

What made Codeforces competitive programming a meaningful test of reasoning models in 2024–2025?

Correct. Codeforces problems require multi-dimensional reasoning — algorithm correctness, edge case coverage, and complexity analysis — simultaneously. Reasoning models reaching Grandmaster level in 2024–2025 was a landmark result.

The key is the structure of the problem: proving correctness requires multi-step logical reasoning that cannot be shortcut by pattern-matching.

Lesson 2 · Lab

Debugging Strategy Lab

Practice reasoning through bugs using the multi-step framework.

Your Task

This lab focuses on debugging methodology. You'll work through a real or hypothetical bug scenario, practising the 7-step reasoning process from the lesson: symptom → region → execution trace → hypothesis → verification → fix → regression check.

Complete at least 3 exchanges to finish this lab.

Describe a bug you've encountered (or invent one — e.g. "a function returns the wrong total when items have discounts applied"), and we'll work through the reasoning steps together. Ask how a reasoning model approaches this differently from a standard model.

Reasoning Lab

L2 · Debugging Methodology

Welcome to the Debugging Methodology lab. Describe a bug — real or hypothetical — and I'll walk through the 7-step reasoning process with you: symptom identification, region isolation, execution trace, hypothesis formation, verification, fix generation, and regression assessment. What bug shall we work through?

Module 5 · Lesson 3

Scientific and Medical Reasoning

From USMLE scores to protein structure — documented cases where extended reasoning changes outcomes

Why does clinical medicine provide such a clean test case for reasoning model advantages?

In February 2023, research published in PLOS Digital Health showed ChatGPT performing at or near the passing threshold of the United States Medical Licensing Examination (USMLE). But the more revealing finding came later: when OpenAI tested o1 on the MedQA benchmark (a dataset of USMLE-style questions) in September 2024, it achieved 92.8% accuracy, substantially above GPT-4's 87%. The USMLE is specifically constructed to require clinical reasoning, not factual recall: questions present symptoms, test results, and patient history, then ask what the diagnosis or next step should be.

The Structure of Clinical Reasoning

A USMLE Step 1 question does not ask "What causes diabetes?" It presents a 65-year-old with polydipsia, polyuria, and a fasting glucose of 210 mg/dL, then asks what the most likely complication at the cellular level will be if untreated, and what specific enzyme pathway is responsible. Answering correctly requires: interpreting the clinical picture, making a diagnosis, predicting pathophysiology, and linking that pathophysiology to a specific molecular mechanism.

Each step depends on the preceding conclusions. A model that misidentifies the diagnosis in step 2 will generate a plausible-sounding but wrong answer for steps 3 and 4. The USMLE is deliberately designed to penalise surface pattern-matching — correct answers often require ruling out attractive wrong options by following the full reasoning chain rather than recognising keywords.

Real Case · Google DeepMind Med-Gemini · 2024

Med-Gemini and Long-Context Clinical Reasoning

In May 2024, Google DeepMind published results for Med-Gemini, a family of models fine-tuned on medical tasks. On the MedQA benchmark, Med-Gemini 1.5 achieved 91.1% accuracy. Crucially, the paper highlighted that performance gains were largest on multi-turn clinical dialogue — cases where a model must ask clarifying questions, receive answers, update its differential diagnosis, and then recommend a course of action.

This is exactly multi-step reasoning with feedback: each new piece of information changes the probability distribution over diagnoses, and the model must maintain a coherent internal model of the patient case across many turns.
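The updating described here can be sketched as repeated application of Bayes' rule. The diagnoses, priors, and likelihoods below are made-up placeholders with no clinical meaning, and the sketch makes no claim about how Med-Gemini or any other model works internally.

```python
def update(prior: dict, likelihood: dict) -> dict:
    """Bayes' rule: posterior is proportional to prior * P(finding | diagnosis)."""
    unnormalised = {dx: prior[dx] * likelihood[dx] for dx in prior}
    total = sum(unnormalised.values())
    return {dx: p / total for dx, p in unnormalised.items()}


# Invented numbers for illustration only; not clinical data.
belief = {"condition_a": 0.5, "condition_b": 0.3, "condition_c": 0.2}

# Each entry gives P(finding | diagnosis) for one new piece of information.
findings = [
    {"condition_a": 0.2, "condition_b": 0.8, "condition_c": 0.5},
    {"condition_a": 0.9, "condition_b": 0.4, "condition_c": 0.1},
]

for likelihood in findings:
    belief = update(belief, likelihood)
    print({dx: round(p, 3) for dx, p in belief.items()})
```

Each update starts from the previous posterior, so an error in handling an early finding distorts every later estimate, which is the same dependency structure as the multi-step problems in Lesson 1.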

Scientific Literature Analysis

Reasoning models also show meaningful advantages in tasks requiring synthesis across multiple scientific papers. The task "Does the evidence in these five studies support the hypothesis that X causes Y?" is not a retrieval task — it requires evaluating study design quality, identifying confounders, assessing effect size consistency, and weighing conflicting results. Standard models will summarise each paper adequately but often fail to properly integrate conflicting findings or identify when one study's methodology undermines another's conclusion.

This matters enormously for systematic review tasks, meta-analysis support, and evidence-based medicine workflows. A reasoning model that can track "Study A found a positive effect but had no control group; Study B found no effect with proper controls; therefore the positive finding from Study A is likely confounded" is doing useful work that a standard summariser cannot.

Differential Diagnosis The clinical process of systematically eliminating possible diagnoses based on signs, symptoms, and test results. Inherently multi-step: each new finding updates the probability of each candidate diagnosis.

Evidence Synthesis The integration of findings from multiple studies to reach a conclusion about a scientific question. Requires evaluating study quality, resolving conflicts, and assessing overall weight of evidence — all multi-step dependent tasks.

The AlphaFold and Reasoning Distinction

It is worth being precise about what "scientific reasoning" means in this context. AlphaFold 2 (2020) and AlphaFold 3 (2024) achieved extraordinary results in protein structure prediction — but these are specialised neural architectures trained specifically for structural biology, not general reasoning applied to science. The reasoning model advantage discussed here is different: it applies to general scientific thinking tasks — interpreting results, forming hypotheses, designing experiments — where the task requires constructing a logical argument rather than performing a specific learned transformation.

When Nature published a study in 2024 showing that o1 could, given a list of experimental observations, propose novel mechanistic hypotheses that matched expert scientist proposals in a blinded evaluation, that was evidence of general scientific reasoning capability — not domain-specific pattern matching.

The Common Thread

Clinical diagnosis, evidence synthesis, and hypothesis generation all share the same structure: new information updates prior conclusions, and the correct final answer depends on correctly tracking those updates across many steps. This dynamic updating under uncertainty is where reasoning models have their largest advantage over standard models.

Limits in Scientific Domains

Reasoning models are not substitute scientists. They cannot run experiments, access unpublished data, or override the limits of their training cutoff. They also remain prone to "hallucinating" specific citations — inventing plausible-sounding paper titles and authors. The advantage is specifically in the logical processing of information provided to them, not in novel empirical discovery. Using a reasoning model to structure your analysis of data you have gathered is powerful; assuming it has access to evidence you haven't given it is a serious error.

Lesson 3 · Quiz

Scientific and Medical Reasoning

Three questions on reasoning models in science and medicine.

Why does the USMLE provide a better test of reasoning model advantages than a simple medical fact quiz?

Correct. The USMLE is specifically designed to require clinical reasoning chains — symptom to diagnosis to mechanism to treatment — penalising keyword pattern-matching in favour of logical sequencing.

The key is the question structure: USMLE questions require following a multi-step clinical reasoning chain, not recalling isolated facts or matching keywords.

What was highlighted as the most significant performance gain area in the Med-Gemini 2024 paper?

Correct. Multi-turn clinical dialogue requires maintaining a coherent internal model of the patient across many turns — exactly the kind of sequential updating under uncertainty where reasoning models excel.

The Med-Gemini paper highlighted multi-turn clinical dialogue as the area of largest gain — specifically because it requires updating differential diagnoses as new information emerges across conversation turns.

Which limitation is most important to acknowledge when using reasoning models for scientific work?

Correct. This is a critical practical boundary: reasoning models are powerful at structuring and logically processing information you give them, but they are not substitute empiricists and remain prone to citation hallucination.

The key limitation is citation hallucination, training cutoff, and inability to access unpublished evidence. Their power is logical processing of provided inputs — not independent empirical discovery.

Lesson 3 · Lab

Clinical Reasoning Simulation

Practice the differential diagnosis reasoning process.

Your Task

This lab simulates the multi-step reasoning structure of clinical diagnosis — or scientific evidence synthesis. You'll practise updating conclusions as new information arrives, mirroring the process that makes reasoning models outperform standard models on these tasks.

Complete at least 3 exchanges to finish this lab.

You can either: (a) present a set of patient symptoms and work through a differential diagnosis, updating your reasoning as you add information, or (b) describe a scientific question and three studies with conflicting results, and work through how to synthesise them. Either scenario will illustrate the sequential updating advantage.

Reasoning Lab

L3 · Clinical & Scientific Reasoning

Welcome to the Clinical and Scientific Reasoning lab. We'll practise the multi-step updating process — building conclusions from partial evidence, just like a reasoning model does internally. You can present a clinical scenario with symptoms to work through a differential diagnosis, or describe conflicting scientific evidence to synthesise. Which would you prefer, or do you have your own scenario?

Module 5 · Lesson 4

Strategic Planning and Complex Decision Analysis

When scenario planning, trade-off analysis, and multi-constraint optimisation require genuine reasoning

What distinguishes a strategic planning task from an information retrieval task — and why does that distinction matter for model selection?

In 2024, researchers at MIT published a study evaluating GPT-4o versus o1-preview on a set of multi-constraint business decision problems: cases where a decision-maker must satisfy multiple competing objectives simultaneously while respecting hard constraints. o1-preview significantly outperformed GPT-4o, not on fact recall, but specifically on identifying when proposed solutions violated one constraint while satisfying another. That kind of inconsistency can only be caught by tracking multiple variables through a decision tree simultaneously.

Why Strategy Is Hard for Standard Models

Strategic planning involves a class of problems characterised by: multiple interdependent variables, hard constraints that eliminate otherwise attractive options, time horizons that require projecting consequences forward, and inherent trade-offs where improving one dimension worsens another. These properties are exactly what make strategy cognitively demanding for humans — and exactly what make it a domain where multi-step reasoning matters.

A standard model asked "What should Company X do about declining market share?" will produce a fluent, well-structured response drawing on general business frameworks. It will list options and their pros and cons. What it often fails to do is track the interaction effects: if X pursues strategy A, that changes the constraint landscape for strategy B, making B no longer viable, which in turn forces a reassessment of which version of strategy C remains feasible. This interaction-tracking is where reasoning models show genuine advantage.

Real Case · Logistics Optimisation · 2024

Supply Chain Re-routing Under Multiple Constraints

In 2024, several logistics companies began piloting reasoning models for supply chain disruption response — specifically, the problem of rerouting shipments when a key route becomes unavailable, under constraints including: carrier capacity limits, customs clearance time windows, temperature requirements, contractual delivery deadlines, and cost budgets. Each constraint eliminates certain options; the remaining feasible set must be found by tracking all constraints simultaneously.

Operators reported that standard models would propose reroutes that violated one or two constraints that weren't the most salient in the prompt. Reasoning models, working through constraints sequentially, were significantly better at identifying the feasible solution space before proposing specific routes. This reflects the same multi-step dependency structure — each constraint check eliminates a portion of the option space, changing what the next check needs to evaluate.
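Sequential constraint checking of this kind can be sketched as a series of filters over candidate routes. The routes, thresholds, and field names below are all invented for illustration; a real rerouting system would have far richer data and would not necessarily work this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    capacity_tonnes: float
    transit_days: int
    refrigerated: bool
    cost_usd: float


# Hypothetical candidate routes; every value is invented.
routes = [
    Route("rail_north", 40, 6, True, 18_000),
    Route("sea_south", 120, 12, True, 9_000),
    Route("air_direct", 15, 1, True, 52_000),
    Route("road_east", 30, 4, False, 11_000),
]

# Hypothetical hard constraints, applied one after another.
constraints = [
    ("capacity", lambda r: r.capacity_tonnes >= 25),
    ("deadline", lambda r: r.transit_days <= 7),
    ("temperature", lambda r: r.refrigerated),
    ("budget", lambda r: r.cost_usd <= 20_000),
]

feasible = routes
for label, check in constraints:
    # Each check narrows the set that the next check has to evaluate.
    feasible = [r for r in feasible if check(r)]
    print(f"after {label}: {[r.name for r in feasible]}")
```

In this toy data the cheapest route fails the deadline and the fastest fails capacity and budget, so proposing either on the strength of one salient attribute would violate a constraint that only shows up when all four are checked.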

Game Theory and Adversarial Reasoning

Strategic planning often involves adversarial reasoning: predicting how another agent will respond to your move, then planning your optimal response to their response. This is game-theoretic reasoning — the kind involved in competitive strategy, negotiation, and security analysis.

OpenAI reported in late 2024 that o1 showed strong performance on strategic game tasks requiring theory-of-mind reasoning — predicting what another player would do given incomplete information. The improvement over standard GPT-4o was particularly large on games requiring recursive reasoning ("I think you think I think..."), because each recursion level adds a dependent step.
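The way each recursion level adds a dependent step can be illustrated with a textbook game, "guess two-thirds of the average", using level-k reasoning. This example is swapped in for illustration and is not one of the tasks OpenAI reported on; the level-0 guess of 50 is a conventional assumption.

```python
def level_k_guess(k: int, level0_guess: float = 50.0, factor: float = 2 / 3) -> float:
    """Guess of a level-k player in the 'guess 2/3 of the average' game.

    A level-k player best-responds to a population assumed to reason at level k-1,
    so each level depends on the result of the level below it.
    """
    if k == 0:
        return level0_guess
    return factor * level_k_guess(k - 1, level0_guess, factor)


for k in range(5):
    print(k, round(level_k_guess(k), 2))
```

Getting level 3 right requires getting levels 0, 1, and 2 right first, which is why deeper "I think you think I think..." reasoning behaves like a longer dependent chain.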

Multi-Constraint Optimisation Finding a solution that satisfies multiple hard constraints simultaneously. Each constraint eliminates part of the solution space; tracking which options remain viable after all constraints are applied is inherently multi-step.

Interaction Effects When choosing option A changes the viability or value of options B, C, or D. Strategic decisions are full of interaction effects that standard models miss when they evaluate options independently.

Financial Analysis and Scenario Planning

In financial analysis, reasoning models show particular strength in scenarios requiring conditional projections: "Under scenario A, what happens to variable X, and how does that feed back into variable Y in period 3?" Standard models handle financial vocabulary fluently but often produce projections where the numbers don't internally cohere — where the assumed growth rate in one paragraph contradicts the cash flow implication in another.

Morgan Stanley and other financial institutions began piloting o1-class models in 2024 specifically for structured analysis tasks — credit risk assessment (where multiple risk factors interact non-linearly) and M&A scenario modelling (where deal terms, tax implications, and post-merger integration costs must be tracked simultaneously). The reported advantage was specifically in internal consistency of analysis, not in speed or breadth of knowledge.

Not Magic, But Meaningful

Reasoning models do not replace human strategic judgment. They remain limited by the quality of inputs, cannot account for political dynamics or unstated constraints, and can construct internally consistent analyses that rest on faulty premises. The advantage is specific: given a well-specified problem with explicit constraints, reasoning models are substantially better at finding the intersection of all constraints and identifying solutions that satisfy all of them simultaneously.

Putting It Together: A Decision Framework

Across all four lessons in this module, the pattern is consistent. Use a reasoning model when:

The correct answer must be constructed through sequential dependent steps (not retrieved directly)
Errors at intermediate steps would invalidate all subsequent steps (compound error risk is high)
Multiple constraints must be satisfied simultaneously and interaction effects matter
New information should update prior conclusions (dynamic updating under uncertainty)
The task involves adversarial or strategic reasoning requiring theory-of-mind recursion

Use a standard model when the task is factual retrieval, creative generation without logical constraints, summarisation of a single document, translation, or any task where pattern completion from training data is sufficient.
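As a rough aid, the framework can be expressed as a checklist function. This is a sketch under a simplifying assumption: any single criterion being true is treated as enough to recommend a reasoning model, which the module does not state outright. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    # One flag per criterion in the decision framework; names are illustrative.
    sequential_dependent_steps: bool = False
    high_compound_error_risk: bool = False
    interacting_constraints: bool = False
    conclusions_update_with_new_info: bool = False
    adversarial_recursion: bool = False


def recommend_model(task: TaskProfile) -> str:
    """Return 'reasoning' if any framework criterion applies, else 'standard'."""
    return "reasoning" if any(vars(task).values()) else "standard"


translation = TaskProfile()
rerouting = TaskProfile(sequential_dependent_steps=True, interacting_constraints=True)

print(recommend_model(translation))  # standard
print(recommend_model(rerouting))    # reasoning
```

In practice the criteria are matters of degree rather than yes/no flags, so the function is best read as a prompt for the right questions rather than a selection rule.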

Module 5 Core Takeaway

Reasoning models are not universally better than standard models. They are specifically better on tasks with multi-step dependency structures — mathematics, complex debugging, clinical diagnosis, evidence synthesis, and constrained strategic planning. For everything else, use the faster, cheaper standard model. Model selection is task selection.

Lesson 4 · Quiz

Strategic Planning and Decision Analysis

Three questions on reasoning models in strategic contexts.

In the MIT 2024 study, what specifically distinguished o1-preview's performance from GPT-4o on business decision problems?

Correct. The specific advantage was constraint-violation detection — the ability to track multiple constraints simultaneously and identify when a solution satisfies one while breaking another.

The MIT study highlighted constraint-interaction tracking as the key differentiator — not answer length, data recency, or speed.

What makes adversarial/game-theoretic reasoning particularly well-suited to reasoning models?

Correct. Each level of theory-of-mind recursion is another dependent step — "my best response depends on predicting your best response to my move, which depends on your prediction of my next move." The depth compounds exactly as multi-step mathematical reasoning does.

The key is the recursive dependency structure: each level of "I think you think..." adds a new dependent reasoning step, making this a prototypical multi-step reasoning challenge.

According to the module's decision framework, which scenario is the BEST match for a reasoning model?

Correct. This supply chain task requires simultaneous multi-constraint satisfaction — each constraint eliminates part of the solution space, and the interaction between constraints makes it a genuinely multi-step reasoning challenge.

Marketing emails, summarisation, and translation are pattern completion tasks where standard models are fully adequate. Only the multi-constraint supply chain problem requires the sequential dependency tracking where reasoning models excel.

Lesson 4 · Lab

Multi-Constraint Decision Lab

Practice constraint-tracking and strategic reasoning with real scenarios.

Your Task

In this lab you'll work through a multi-constraint decision problem — the kind where reasoning models show their biggest strategic advantage. You'll explicitly track how each constraint eliminates options, and how constraints interact to change the feasible solution space.

Complete at least 3 exchanges to finish this lab.

Describe a decision you face (or a hypothetical one) that has at least three hard constraints — things that must be true for the solution to be acceptable. For example: "I need to hire someone who can start in 2 weeks, has Python skills, is willing to work fully remote, and fits a $90k budget." Then ask me to help you track how each constraint eliminates options and what the feasible space looks like.
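
As a rough illustration of the tracking described above, the hiring example can be sketched in Python. Everything here is hypothetical and invented for illustration: the five candidates, the field names, and the track_constraints helper. The sketch applies the four hard constraints one at a time and records who is eliminated at each step and who remains feasible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical fields mirroring the four constraints in the hiring example.
    name: str
    weeks_until_start: int
    has_python: bool
    fully_remote_ok: bool
    salary_ask: int


# Invented candidate pool, for illustration only.
CANDIDATES = [
    Candidate("A", 2, True, True, 85_000),
    Candidate("B", 4, True, True, 80_000),
    Candidate("C", 1, False, True, 70_000),
    Candidate("D", 2, True, False, 88_000),
    Candidate("E", 1, True, True, 95_000),
]

# Each hard constraint is a label plus a predicate that must hold.
CONSTRAINTS = [
    ("starts within 2 weeks", lambda c: c.weeks_until_start <= 2),
    ("has Python skills", lambda c: c.has_python),
    ("accepts fully remote", lambda c: c.fully_remote_ok),
    ("fits $90k budget", lambda c: c.salary_ask <= 90_000),
]


def track_constraints(candidates, constraints):
    """Apply constraints in order; log eliminations and the remaining feasible set."""
    feasible = list(candidates)
    log = []
    for label, check in constraints:
        eliminated = [c.name for c in feasible if not check(c)]
        feasible = [c for c in feasible if check(c)]
        log.append((label, eliminated, [c.name for c in feasible]))
    return log


for label, eliminated, remaining in track_constraints(CANDIDATES, CONSTRAINTS):
    print(f"{label}: eliminated {eliminated}, remaining {remaining}")
```

With this invented pool, each constraint removes exactly one candidate and only candidate A survives all four. Reordering the constraints changes which step eliminates whom but not the final feasible set, which is one way to see that the constraints are hard requirements rather than preferences.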

Reasoning Lab

L4 · Multi-Constraint Strategy

Welcome to the Multi-Constraint Decision lab. We're going to practise the kind of systematic constraint-tracking that reasoning models do well — working through a decision where multiple hard requirements must all be satisfied, and where satisfying one constraint may change how we need to evaluate the next. Describe your decision scenario and its constraints, and we'll map the feasible solution space together.

Module 5 · Final Assessment

Module Test: When Reasoning Models Win

15 questions — score 80% or above to pass. Select the best answer for each.

1. On the AIME 2024 mathematics competition, approximately what percentage of problems did o1 solve in a single attempt, compared to GPT-4o's 13.4%?

Correct. o1 solved 74.4% of AIME 2024 problems in a single attempt versus GPT-4o's 13.4% — a benchmark-defining gap driven by internal reasoning review.

OpenAI's technical report showed o1 at 74.4% versus GPT-4o at 13.4% on AIME 2024.

2. What is "error compounding" in the context of multi-step reasoning?

Correct. Error compounding is multiplicative: if step 3 is wrong, steps 4–7 are built on faulty premises. This is why final accuracy falls off exponentially, not linearly, as the number of dependent steps grows.

Error compounding refers to the multiplicative propagation of errors: one wrong step invalidates all dependent subsequent steps.

3. What does SWE-bench Verified specifically test?

Correct. SWE-bench Verified uses real GitHub issues, together with the pull requests that resolved them, from 12 major open-source projects; the model must fix the issue so that the project's tests pass. No lookup is possible.

SWE-bench Verified tests fixing real production bugs from actual GitHub issues in open-source projects, not speed, style, or test writing.

4. Claude 3.7 Sonnet with extended thinking achieved approximately what score on SWE-bench Verified?

Correct. Claude 3.7 Sonnet with extended thinking reached 70.3% on SWE-bench Verified — a substantial improvement over standard-mode equivalents.

Claude 3.7 Sonnet with extended thinking achieved 70.3% on SWE-bench Verified according to Anthropic's published evaluations.

5. The scratchpad reasoning used by reasoning models like o1 is best described as:

Correct. In o1, o3, and Claude 3.7 Sonnet, the "thinking" tokens are internal computation — they affect the final output but are not presented to the user verbatim.

Scratchpad reasoning is internal — thinking tokens that aren't shown to users but shape the final answer through the model's internal deliberation process.

6. The USMLE is a better test for reasoning model advantages than a medical fact quiz because:

Correct. USMLE questions are architected to defeat surface pattern-matching and require following a full clinical reasoning chain — exactly where multi-step reasoning models have their advantage.

The USMLE's reasoning chain structure — not images, speed, or data availability — is what makes it a good test of reasoning model advantages.

7. o1's accuracy on the MedQA benchmark (USMLE-style questions) in OpenAI's September 2024 evaluation was approximately:

Correct. o1 achieved 92.8% on MedQA — substantially above GPT-4's ~87% — reflecting the clinical reasoning advantage of extended thinking on multi-step diagnostic chains.

o1 achieved 92.8% on MedQA in OpenAI's September 2024 technical evaluations.

8. What was highlighted as the most significant improvement area in Google DeepMind's Med-Gemini 2024 paper?

Correct. Multi-turn clinical dialogue — requiring the model to update its differential diagnosis as new patient information emerges across conversation turns — showed the largest gains in the Med-Gemini paper.

The Med-Gemini paper highlighted multi-turn clinical dialogue as the area of largest improvement — specifically because it requires iterative reasoning update across conversation turns.

9. A model with 85% per-step accuracy working through a 4-step dependent reasoning chain has what approximate final answer accuracy?

Correct. 0.85⁴ ≈ 0.52. Compounding is why even a highly accurate model degrades substantially on long reasoning chains — and why per-step self-correction is so valuable.

0.85 × 0.85 × 0.85 × 0.85 ≈ 0.52. Errors multiply across steps, so 85% per-step accuracy yields only ~52% final accuracy on a 4-step chain.
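
The arithmetic above can be checked with a short Python sketch. It assumes, as the question does, that every step has the same accuracy and that errors at different steps are independent; the function name chain_accuracy is made up for this illustration.

```python
def chain_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a dependent chain is correct.

    Simplifying assumption: each step succeeds independently with the same
    probability, so the per-step accuracies multiply.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


# The 4-step chain from the question: 0.85 ** 4 = 0.52200625, about 52%.
print(f"{chain_accuracy(0.85, 4):.2f}")

# Longer chains at the same per-step accuracy degrade further.
for steps in (1, 2, 4, 8):
    print(steps, chain_accuracy(0.85, steps))
```

Real reasoning chains rarely have uniform or independent per-step error rates, so this is best read as a back-of-the-envelope model of why long chains are fragile, not as a prediction for any particular model.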

10. Why did logistics companies begin piloting reasoning models for supply chain disruption response in 2024?

Correct. The reported advantage was specifically constraint-completeness: reasoning models checked all constraints sequentially before proposing solutions, avoiding the constraint-violation failures of standard models.

The key advantage was constraint tracking — reasoning models systematically checked all constraints before proposing a solution, where standard models missed non-salient constraints.

11. Competitive programming platform Codeforces is a meaningful benchmark for reasoning models because:

Correct. Codeforces problems require proving your solution correct — an inherently multi-step reasoning task spanning algorithm design, edge case analysis, and complexity reasoning simultaneously.

Codeforces's value as a benchmark comes from its requirement for proof-of-correctness reasoning alongside implementation — not from data availability, judging method, or problem generation.

12. In the context of scientific literature analysis, what specific advantage do reasoning models show that standard models lack?

Correct. Standard models summarise papers adequately but often fail to integrate conflicting findings — reasoning models can track "Study A's positive result is undermined by Study B's better-controlled methodology."

The specific advantage is conflict integration — reasoning through methodological quality differences to identify when studies contradict each other and why.

13. Which of the following is identified as a key limitation of reasoning models in scientific domains?

Correct. Citation hallucination, training cutoff, and the inability to replace empirical research define the boundaries of reasoning models' usefulness in science: they process provided information well but cannot stand in for experiments or reliable citation retrieval.

The key limitations are citation hallucination, training cutoff, and inability to substitute for empirical research — not data type requirements or topic restrictions.

14. According to the module's decision framework, which of these conditions signals that a reasoning model should be used?

Correct. The framework is task-structural, not topic-based: multi-step dependency, compound error risk, constraint interaction, and dynamic updating are the signals — not domain labels like "medical" or "scientific."

Model selection should be based on task structure — compound error risk, constraint interaction, dynamic updating — not response length, domain label, or user preference for latency.

15. Which task from the following list is the BEST candidate for a reasoning model, according to everything covered in this module?

Correct. This merger scenario has multiple hard constraints that must all be satisfied simultaneously, with interaction effects between them — the prototypical multi-constraint optimisation problem where reasoning models show their greatest advantage.

The merger scenario is the clear match: multiple interacting hard constraints that must all be satisfied simultaneously. Product descriptions, translation, and social copy are all pattern completion tasks where standard models are equally effective.