When OpenAI released o1 in September 2024, the benchmark that drew the most attention was the 2024 American Invitational Mathematics Examination (AIME), a qualifier for the USA Mathematical Olympiad. OpenAI reported that standard GPT-4o solved about 13% of the problems, while o1 solved about 83% when its answer was taken as the consensus of many samples. The gain came not from knowing more mathematics, but from thinking longer about the same knowledge.
The difference was not memory. It was process.
A multi-step problem is one where the correct answer cannot be retrieved directly — it must be constructed through a sequence of intermediate conclusions, each of which depends on the last. Consider the difference between:
"What is the capital of France?" — retrieval. One step. Standard models excel.
"A company has 400 employees. 30% work remotely. Of those, 25% are in sales. How many in-office sales staff are there if sales is 20% of total headcount?" — construction. Four dependent steps. Errors compound if any step is wrong.
Standard language models are trained to predict the next token given all previous tokens. They are extraordinarily good at pattern completion. But they were not explicitly optimized to check their own intermediate steps before continuing. Reasoning models are trained with reinforcement learning to do exactly that — to treat their scratchpad as a workspace where partial answers get evaluated before being committed.
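To make the dependency chain in the headcount question concrete, the four steps can be written out as explicit intermediate values. This is only an illustrative sketch: the variable names are ours, and it assumes "of those" refers to the remote workers.

```python
# Worked version of the headcount question. The final step depends on
# every earlier value, so a mistake anywhere changes the answer.
total_employees = 400

# Step 1: 30% of all employees work remotely.
remote = total_employees * 30 // 100          # 120

# Step 2: 25% of the remote workers are in sales.
remote_sales = remote * 25 // 100             # 30

# Step 3: sales is 20% of total headcount.
total_sales = total_employees * 20 // 100     # 80

# Step 4: in-office sales staff are the sales staff who are not remote.
in_office_sales = total_sales - remote_sales  # 50

print(in_office_sales)
```

If step 1 were miscalculated as 100 instead of 120, steps 2 and 4 would both be wrong even if their own arithmetic were flawless, which is the failure mode the rest of this lesson is about.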
In the results OpenAI published alongside o1 in September 2024, GPT-4o solved 13.4% of the AIME 2024 problems (two 15-problem exams, 30 problems in total), while o1 solved 74.4% with a single attempt per problem. The AIME is a competition specifically designed to require multi-step mathematical reasoning, not formula recall.
The core vulnerability of standard models on multi-step tasks is error compounding. If a model makes a 90%-accurate inference at each of five steps, and the steps are independent, the probability of a correct final answer is 0.9⁵ ≈ 59%. But if that same model can review and correct its step-3 reasoning before proceeding, as o1-style models do, the effective per-step accuracy rises, and that gain compounds across the whole chain.
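The compounding arithmetic can be checked directly. The sketch below assumes independent steps and introduces a hypothetical `catch_rate` parameter (the fraction of step errors that a review pass notices and fixes); it is a toy model for intuition, not a measured property of any real system.

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that every step in a chain of independent steps is correct."""
    return per_step ** steps


def accuracy_with_review(per_step: float, catch_rate: float) -> float:
    """Per-step accuracy if a review pass fixes a fraction of the errors.

    catch_rate is a hypothetical parameter chosen for illustration.
    """
    return per_step + (1.0 - per_step) * catch_rate


baseline = chain_accuracy(0.9, 5)
reviewed = chain_accuracy(accuracy_with_review(0.9, 0.5), 5)

print(f"no review:   {baseline:.2f}")
print(f"with review: {reviewed:.2f}")
```

Under these assumptions, catching half of the per-step errors lifts per-step accuracy from 0.90 to 0.95 and five-step accuracy from about 0.59 to about 0.77. The longer the chain, the larger the gap becomes.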
This is why reasoning models don't just do slightly better on hard problems. They do categorically better. The gain is not linear with difficulty; it grows with the number of dependent steps, because per-step error compounds multiplicatively along the chain.
In July 2025, an advanced version of Google DeepMind's Gemini with Deep Think achieved gold-medal-level performance on the International Mathematical Olympiad, a competition that requires solving six proof-based problems over two days. Each problem requires not just computation but constructing a valid mathematical argument, a task with dozens of sequential logical steps. Earlier DeepMind work on AlphaProof used formal proof assistants, but the Gemini approach used reasoning tokens to approximate step-by-step proof search in natural language.
This performance would have been impossible with standard token prediction alone. The problems are specifically structured to defeat lookup — no IMO problem repeats verbatim, and solutions require novel combinations of known techniques.
Reasoning models win on multi-step problems not because they have more knowledge, but because they are trained to evaluate and revise intermediate conclusions before committing to a final answer. This internal review is the mechanism behind benchmark gains.
It is equally important to understand when the extended thinking of reasoning models adds nothing. Factual retrieval, creative writing, translation, summarisation of a single document, simple code completion — none of these involve multi-step dependency chains. A standard model answering "What year was the Eiffel Tower completed?" gains nothing from 20 seconds of internal deliberation. It either knows the answer (1889) or it doesn't.
Using a reasoning model for these tasks is like using a scientific calculator to add two numbers. It will give the right answer, but at unnecessary cost and latency. Module 5 is specifically about the tasks where the calculator's extra capabilities genuinely change the result.
In this lab you will present the AI with task descriptions and work out together whether they genuinely require multi-step dependent reasoning — or whether a standard model would do equally well. The goal is to build your intuition for the distinction before you encounter it in real work.
Complete at least 3 exchanges to finish this lab.
In October 2024 and again in February 2025, Anthropic published evaluation results for Claude 3.5 Sonnet and the later Claude 3.7 Sonnet, a model with an extended thinking mode, on SWE-bench Verified: a benchmark of 500 real GitHub issues from production software repositories, requiring the model to locate, diagnose, and fix actual bugs in codebases it had never seen. Anthropic reported that Claude 3.7 Sonnet solved 70.3% of issues in its best-scoring configuration. Earlier standard models solved significantly fewer.
Debugging a non-trivial software defect involves a sequence that looks roughly like: (1) understand the failing behaviour from the error description, (2) identify the relevant code region, (3) trace the execution path that produces the failure, (4) hypothesise which specific line or logic is wrong, (5) verify the hypothesis against related code, (6) generate a fix, (7) check the fix doesn't break adjacent functionality.
Each of these steps depends on the previous ones. If step 3 produces the wrong execution trace, step 4's hypothesis is built on faulty premises. A standard model, being a next-token predictor, will often jump directly from "error message seen" to "here is a plausible fix" — pattern-matching on similar-looking bugs it encountered in training. This works well for common, shallow bugs. It fails systematically on bugs where the surface presentation is misleading.
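A small, hypothetical example shows what a misleading surface presentation looks like in practice. The function names are invented for illustration. The symptom (a stray tag on the second result) appears at a call site that looks correct, while the cause sits in the function signature, so a fix pattern-matched on the symptom alone would likely target the wrong line.

```python
def add_tag_buggy(tag, tags=[]):
    # Bug: the default list is created once, when the function is defined,
    # and is shared by every call that omits `tags`.
    tags.append(tag)
    return tags


def add_tag_fixed(tag, tags=None):
    # Fix: create a fresh list on each call when none is supplied.
    if tags is None:
        tags = []
    tags.append(tag)
    return tags


first = add_tag_buggy("urgent")
second = add_tag_buggy("billing")
# `second` also contains "urgent", although this call never passed it.
print(second)

# With the fix, each call starts from its own empty list.
print(add_tag_fixed("urgent"), add_tag_fixed("billing"))
```

Tracing the execution path (step 3) is what reveals that both calls return the same list object; without that step, the hypothesis in step 4 has nothing reliable to build on.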
SWE-bench was developed by researchers at Princeton; SWE-bench Verified, a human-validated subset of 500 tasks, was released in 2024 in collaboration with OpenAI. It draws on real GitHub issues and their pull requests from 12 major open-source Python projects (including Django, scikit-learn, and requests). The model receives the issue description and codebase; it must produce a patch that resolves the issue and passes the project's tests. There is no "looking up the answer": these are real bugs with real fixes that require understanding each project's specific architecture.
As of early 2025, Claude 3.7 Sonnet extended thinking and OpenAI o3 both significantly outperformed their standard-mode equivalents on this benchmark, with reasoning-enabled models showing the largest gains on issues classified as "hard" (requiring cross-file reasoning).
Beyond debugging, reasoning models show large advantages in writing novel algorithmic code: code that implements logic the model cannot simply recall from training, but must derive. The competitive programming benchmark Codeforces, which involves solving algorithmic problems under time constraints, shows a similar pattern: OpenAI reported o1 at roughly the 89th percentile of competitors in 2024, and o3 at a Grandmaster-level rating in results announced in December 2024. Standard GPT-4 class models never approached that level.
Codeforces problems in effect require an informal argument for correctness alongside the implementation: you must reason about edge cases, time complexity, and invariants simultaneously. This is exactly the multi-step dependency structure that reasoning models handle better.
| Task Type | Standard Model | Reasoning Model |
|---|---|---|
| Simple syntax error fix | Excellent | Excellent |
| Common bug pattern (e.g. off-by-one) | Good | Good |
| Cross-file logic error in unfamiliar codebase | Inconsistent | Strong |
| Novel algorithm implementation from spec | Moderate | Strong |
| Competitive programming (Codeforces div 1) | Weak | Grandmaster level |
| Boilerplate API wrapper generation | Excellent | Excellent |
In late 2024 and 2025, "AI coding agents" — tools like Cursor, Devin, and GitHub Copilot Workspace — began using reasoning models as their underlying engine precisely because agentic tasks compound the multi-step problem. An agent must plan which files to read, decide what information is needed, formulate a hypothesis, execute a change, observe the result, and adapt. Each cycle is itself multi-step, and the cycles chain together into a longer episode. Error recovery (noticing a fix made things worse and backtracking) requires exactly the kind of self-evaluation that reasoning models are trained for.
Cognition AI's Devin, when it passed the SWE-bench tasks that required exploring a codebase over many steps, used exactly this kind of extended deliberation — reasoning about which parts of a large unknown codebase were relevant before attempting changes.
If your coding task involves understanding unfamiliar code, tracing indirect causation (bug A causes symptom B three layers away), or designing an algorithm from first principles, a reasoning model will likely produce noticeably better results. For code completion, boilerplate, and common patterns, a standard model is equally effective and faster.
This lab focuses on debugging methodology. You'll work through a real or hypothetical bug scenario, practising the 7-step reasoning process from the lesson: symptom → region → execution trace → hypothesis → verification → fix → regression check.
Complete at least 3 exchanges to finish this lab.
In February 2023, research published in PLOS Digital Health showed ChatGPT performing at or near the passing threshold on the United States Medical Licensing Examination (USMLE). But the more revealing finding came later: in September 2024, o1 was reported to reach 92.8% accuracy on the MedQA benchmark (a dataset of USMLE-style questions), substantially above GPT-4's 87%. The USMLE is specifically constructed to require clinical reasoning, not factual recall: questions present symptoms, test results, and patient history, then ask what the diagnosis or next step should be.
A USMLE Step 1 question does not ask "What causes diabetes?" It presents a 65-year-old with polydipsia, polyuria, and a fasting glucose of 210 mg/dL, then asks what the most likely complication at the cellular level will be if untreated, and what specific enzyme pathway is responsible. Answering correctly requires: interpreting the clinical picture, making a diagnosis, predicting pathophysiology, and linking that pathophysiology to a specific molecular mechanism.
Each step depends on the preceding conclusions. A model that misidentifies the diagnosis in step 2 will generate a plausible-sounding but wrong answer for steps 3 and 4. The USMLE is deliberately designed to penalise surface pattern-matching — correct answers often require ruling out attractive wrong options by following the full reasoning chain rather than recognising keywords.
In May 2024, Google Research and Google DeepMind published results for Med-Gemini, a family of models fine-tuned on medical tasks. On the MedQA benchmark, the best Med-Gemini model achieved 91.1% accuracy. Crucially, the paper highlighted that performance gains were largest on multi-turn clinical dialogue: cases where a model must ask clarifying questions, receive answers, update its differential diagnosis, and then recommend a course of action.
This is exactly multi-step reasoning with feedback: each new piece of information changes the probability distribution over diagnoses, and the model must maintain a coherent internal model of the patient case across many turns.
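The idea of a probability distribution that shifts with each new finding can be sketched with a plain Bayes update. Everything here is an assumption made for illustration: the condition names are placeholders, the numbers are invented and are not clinical data, and real diagnostic reasoning is far less tidy than independent likelihoods.

```python
def update(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    """Bayes' rule: the posterior is proportional to prior times likelihood."""
    unnormalised = {dx: prior[dx] * likelihood[dx] for dx in prior}
    total = sum(unnormalised.values())
    return {dx: value / total for dx, value in unnormalised.items()}


# Hypothetical starting beliefs over three placeholder diagnoses.
belief = {"condition_a": 0.5, "condition_b": 0.3, "condition_c": 0.2}

# Hypothetical P(finding | diagnosis) for two findings arriving in sequence.
findings = [
    {"condition_a": 0.2, "condition_b": 0.8, "condition_c": 0.5},
    {"condition_a": 0.9, "condition_b": 0.7, "condition_c": 0.1},
]

for likelihood in findings:
    # Each update starts from the result of the previous one.
    belief = update(belief, likelihood)
    print({dx: round(p, 3) for dx, p in belief.items()})
```

Because each update takes the previous posterior as its prior, an error in handling the first finding carries through every later turn. In this toy case the initially less likely `condition_b` ends up as the leading hypothesis.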
Reasoning models also show meaningful advantages in tasks requiring synthesis across multiple scientific papers. The task "Does the evidence in these five studies support the hypothesis that X causes Y?" is not a retrieval task — it requires evaluating study design quality, identifying confounders, assessing effect size consistency, and weighing conflicting results. Standard models will summarise each paper adequately but often fail to properly integrate conflicting findings or identify when one study's methodology undermines another's conclusion.
This matters enormously for systematic review tasks, meta-analysis support, and evidence-based medicine workflows. A reasoning model that can track "Study A found a positive effect but had no control group; Study B found no effect with proper controls; therefore the positive finding from Study A is likely confounded" is doing useful work that a standard summariser cannot.
It is worth being precise about what "scientific reasoning" means in this context. AlphaFold 2 (2020) and AlphaFold 3 (2024) achieved extraordinary results in protein structure prediction — but these are specialised neural architectures trained specifically for structural biology, not general reasoning applied to science. The reasoning model advantage discussed here is different: it applies to general scientific thinking tasks — interpreting results, forming hypotheses, designing experiments — where the task requires constructing a logical argument rather than performing a specific learned transformation.
When Nature published a study in 2024 showing that o1 could, given a list of experimental observations, propose novel mechanistic hypotheses that matched expert scientist proposals in a blinded evaluation, that was evidence of general scientific reasoning capability — not domain-specific pattern matching.
Clinical diagnosis, evidence synthesis, and hypothesis generation all share the same structure: new information updates prior conclusions, and the correct final answer depends on correctly tracking those updates across many steps. This dynamic updating under uncertainty is where reasoning models have their largest advantage over standard models.
Reasoning models are not substitute scientists. They cannot run experiments, access unpublished data, or override the limits of their training cutoff. They also remain prone to "hallucinating" specific citations — inventing plausible-sounding paper titles and authors. The advantage is specifically in the logical processing of information provided to them, not in novel empirical discovery. Using a reasoning model to structure your analysis of data you have gathered is powerful; assuming it has access to evidence you haven't given it is a serious error.
This lab simulates the multi-step reasoning structure of clinical diagnosis — or scientific evidence synthesis. You'll practise updating conclusions as new information arrives, mirroring the process that makes reasoning models outperform standard models on these tasks.
Complete at least 3 exchanges to finish this lab.
In 2024, researchers at MIT published a study evaluating GPT-4o versus o1-preview on a set of multi-constraint business decision problems: cases where a decision-maker must satisfy multiple competing objectives simultaneously while respecting hard constraints. o1-preview significantly outperformed GPT-4o, not on fact recall, but specifically on identifying when proposed solutions violated one constraint while satisfying another. That kind of inconsistency can only be caught by tracking multiple variables through a decision tree simultaneously.
Strategic planning involves a class of problems characterised by: multiple interdependent variables, hard constraints that eliminate otherwise attractive options, time horizons that require projecting consequences forward, and inherent trade-offs where improving one dimension worsens another. These properties are exactly what make strategy cognitively demanding for humans — and exactly what make it a domain where multi-step reasoning matters.
A standard model asked "What should Company X do about declining market share?" will produce a fluent, well-structured response drawing on general business frameworks. It will list options and their pros and cons. What it often fails to do is track the interaction effects: if X pursues strategy A, that changes the constraint landscape for strategy B, making B no longer viable, which in turn forces a reassessment of which version of strategy C remains feasible. This interaction-tracking is where reasoning models show genuine advantage.
In 2024, several logistics companies began piloting reasoning models for supply chain disruption response — specifically, the problem of rerouting shipments when a key route becomes unavailable, under constraints including: carrier capacity limits, customs clearance time windows, temperature requirements, contractual delivery deadlines, and cost budgets. Each constraint eliminates certain options; the remaining feasible set must be found by tracking all constraints simultaneously.
Operators reported that standard models would propose reroutes that violated one or two constraints that weren't the most salient in the prompt. Reasoning models, working through constraints sequentially, were significantly better at identifying the feasible solution space before proposing specific routes. This reflects the same multi-step dependency structure — each constraint check eliminates a portion of the option space, changing what the next check needs to evaluate.
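Sequential constraint checking of this kind can be sketched as repeated filtering of a candidate set. The routes, thresholds, and field names below are all invented for illustration and do not describe any real pilot or carrier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    capacity_tonnes: int
    transit_days: int
    refrigerated: bool
    cost_usd: int


# Hypothetical candidate reroutes.
candidates = [
    Route("rail_north", 40, 6, True, 18_000),
    Route("sea_south", 120, 14, True, 9_000),
    Route("air_direct", 15, 2, True, 42_000),
    Route("road_east", 35, 5, False, 12_000),
]

# Hypothetical hard constraints, applied one after another.
constraints = [
    ("capacity >= 30 t", lambda r: r.capacity_tonnes >= 30),
    ("delivery within 7 days", lambda r: r.transit_days <= 7),
    ("refrigeration required", lambda r: r.refrigerated),
    ("budget <= 20,000 USD", lambda r: r.cost_usd <= 20_000),
]

feasible = list(candidates)
for label, check in constraints:
    # Each check only sees the options that survived the previous checks.
    feasible = [r for r in feasible if check(r)]
    print(f"after '{label}': {[r.name for r in feasible]}")
```

Note that the cheapest option (`sea_south`) and the fastest (`air_direct`) are both eliminated. A response that optimises only the most salient constraint would propose one of them, which is the failure the operators described.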
Strategic planning often involves adversarial reasoning: predicting how another agent will respond to your move, then planning your optimal response to their response. This is game-theoretic reasoning — the kind involved in competitive strategy, negotiation, and security analysis.
OpenAI reported in late 2024 that o1 showed strong performance on strategic game tasks requiring theory-of-mind reasoning — predicting what another player would do given incomplete information. The improvement over standard GPT-4o was particularly large on games requiring recursive reasoning ("I think you think I think..."), because each recursion level adds a dependent step.
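Recursive "I think you think" reasoning is often illustrated with the textbook "guess two-thirds of the average" game, in which each player tries to guess two-thirds of the group's average guess. The sketch below uses that game purely as an illustration; it is not one of the tasks in the reported evaluation. It assumes a level-0 player guesses 50, and that a level-k player best-responds to a group of level-(k-1) players while ignoring the small effect of their own guess on the average.

```python
def level_k_guess(k: int, level0_guess: float = 50.0) -> float:
    """Guess of a level-k player in the 'two-thirds of the average' game.

    A level-0 player guesses `level0_guess`; a level-k player assumes
    everyone else reasons at level k-1 and guesses two-thirds of that.
    """
    if k == 0:
        return level0_guess
    # Each level depends on the result of the level below it.
    return (2.0 / 3.0) * level_k_guess(k - 1, level0_guess)


for k in range(5):
    print(k, round(level_k_guess(k), 2))
```

Each additional level of recursion is one more dependent step: the level-3 guess is only right if the level-2 and level-1 guesses beneath it were computed correctly.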
In financial analysis, reasoning models show particular strength in scenarios requiring conditional projections: "Under scenario A, what happens to variable X, and how does that feed back into variable Y in period 3?" Standard models handle financial vocabulary fluently but often produce projections where the numbers don't internally cohere — where the assumed growth rate in one paragraph contradicts the cash flow implication in another.
Morgan Stanley and other financial institutions began piloting o1-class models in 2024 specifically for structured analysis tasks — credit risk assessment (where multiple risk factors interact non-linearly) and M&A scenario modelling (where deal terms, tax implications, and post-merger integration costs must be tracked simultaneously). The reported advantage was specifically in internal consistency of analysis, not in speed or breadth of knowledge.
Reasoning models do not replace human strategic judgment. They remain limited by the quality of inputs, cannot account for political dynamics or unstated constraints, and can construct internally consistent analyses that rest on faulty premises. The advantage is specific: given a well-specified problem with explicit constraints, reasoning models are substantially better at finding the intersection of all constraints and identifying solutions that satisfy all of them simultaneously.
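Across all four lessons in this module, the pattern is consistent. Use a reasoning model when the correct answer must be constructed through dependent intermediate steps, when an early error would compound through later steps, when conclusions must be updated as new information arrives, or when several interacting constraints must be satisfied at once.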
Use a standard model when the task is factual retrieval, creative generation without logical constraints, summarisation of a single document, translation, or any task where pattern completion from training data is sufficient.
Reasoning models are not universally better than standard models. They are specifically better on tasks with multi-step dependency structures — mathematics, complex debugging, clinical diagnosis, evidence synthesis, and constrained strategic planning. For everything else, use the faster, cheaper standard model. Model selection is task selection.
In this lab you'll work through a multi-constraint decision problem — the kind where reasoning models show their biggest strategic advantage. You'll explicitly track how each constraint eliminates options, and how constraints interact to change the feasible solution space.
Complete at least 3 exchanges to finish this lab.