When OpenAI released o1 in September 2024, benchmark watchers noticed something strange. On a qualifying exam for the International Mathematics Olympiad, GPT-4o solved 13% of the problems. The new o1 model solved 83% of the same problems. OpenAI did not attribute the jump to a bigger model. It attributed it to a model trained to spend more time thinking before it answers.
Every large language model goes through two distinct phases. Training is where the model learns — consuming enormous quantities of text, adjusting billions of parameters, building compressed representations of language and knowledge. This phase is expensive and slow, consuming months of compute and tens of millions of dollars for frontier models.
Inference is where the model is actually used: you send it a prompt, it produces a response. Traditionally, inference was cheap and fast: the model ran one forward pass per generated token and emitted its answer directly. No second-guessing. No revision. Whatever came out first was the answer.
Test-time compute is the idea of spending significantly more compute during inference — at the moment the model is actually answering your question. Instead of one pass, the model might explore dozens of solution paths, check its own work, revise, or run specialized verification steps before committing to an answer.
Test-time compute (TTC) refers to additional computational resources devoted to the inference phase — after training — allowing a model to think longer, explore more paths, and self-verify before producing output. It is sometimes called "inference-time compute" or "thinking compute."
The traditional scaling paradigm held that more capable AI required bigger models trained on more data with more GPUs. This logic is not wrong, but it has limits. By Epoch AI's estimates, the compute used to train frontier models has been doubling roughly every six months, and the cost of each new frontier model has grown exponentially with it. At some point, simply making training bigger runs into fundamental constraints: data availability, energy, economics.
Test-time compute offers a different lever. Rather than asking "how do we train a smarter model," it asks "how do we get a trained model to think more carefully about hard problems?" The resource trade-off shifts: instead of spending more on training once, you spend more on inference per query — but only when the query is difficult enough to warrant it.
This is actually how human experts work. A cardiologist reading an ECG doesn't use the same effort for every tracing. For a routine case they glance and move on. For an ambiguous one they zoom in, compare to prior studies, consult a colleague. Intelligence scales effort to difficulty. Test-time compute allows AI to do the same.
Test-time compute does not make a model know more things. It gives a model time to figure out what it already knows how to do. The knowledge is in the weights; the test-time compute is the deliberation that correctly assembles that knowledge into an answer.
In this lab you will probe the concept of test-time compute by asking questions about it. Try to understand: what types of problems benefit most from extended thinking? When is it wasteful? How does the compute budget idea relate to how you personally think through hard problems?
When OpenAI published the o1 system card, they included a striking observation: on certain coding and math problems, the model's internal reasoning chain contained explicit self-corrections — moments where it started down a wrong path, recognized the error, and redirected. The model was not just thinking; it was thinking and auditing its own thinking. This double-loop process was responsible for much of its advantage over single-pass models.
A reasoning model's thinking phase is not a single linear stream of tokens. It is a structured process with several distinct activities happening in interleaved fashion:
One of the counterintuitive findings from reasoning model research is that allowing a model to explore and discard wrong paths actually improves final accuracy — even though those wrong paths cost compute. The reason is that exploration reveals the structure of the problem. A path that fails illuminates which constraints are binding, which assumptions are incorrect, which direction is more promising.
This is directly analogous to how expert human problem-solvers work. In his classic study of chess players, first published in 1946 and translated into English in 1965 as Thought and Choice in Chess, Adriaan de Groot found that grandmasters did not consider more moves than weaker players; they considered roughly the same number. But the moves they chose to explore were the right moves, because their pattern recognition directed search efficiently. Reasoning models learn a similar skill: their training teaches them which branches of a problem space are worth exploring.
Self-consistency sampling formalizes this: sample multiple independent reasoning paths, then take a majority vote over their final answers. In Google's work on self-consistency with chain-of-thought (Wang et al., posted in 2022 and published at ICLR 2023), this technique improved accuracy on mathematical reasoning benchmarks by as much as 17.9 percentage points (on GSM8K) over single-path chain-of-thought.
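The voting step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `sample_answer` is a hypothetical stand-in for "run the model once with sampling enabled and extract the final answer," and the canned answers are invented for the example.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> tuple[str, float]:
    """Return the most common final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


def self_consistency(sample_answer: Callable[[], str], n_samples: int = 5) -> tuple[str, float]:
    """Draw n_samples independent answers and vote over them."""
    return majority_vote([sample_answer() for _ in range(n_samples)])


# Hypothetical stand-in for a model: final answers from five sampled chains.
canned = iter(["42", "41", "42", "42", "17"])
answer, agreement = self_consistency(lambda: next(canned), n_samples=5)
print(answer, agreement)  # 42 0.6
```

The agreement share is a useful by-product: a low share suggests the samples disagree and the question may deserve a larger budget.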
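A training idea closely associated with o1-class models is the process reward model (PRM), described in OpenAI's 2023 paper "Let's Verify Step by Step." OpenAI has not published o1's full training recipe, so treat the connection as informed inference rather than confirmed detail. Outcome-based rewards, the usual setup in reinforcement learning from human feedback (RLHF), judge only the final output: was it good or bad? A PRM judges the reasoning steps individually. Each step in a chain of thought gets a score for whether it is logically valid and productive.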
Training with step-level feedback teaches the model not just to get right answers but to take good reasoning steps. This is a plausible explanation for why o1-class models can genuinely backtrack: if mid-chain corrections are rewarded rather than penalized, the model is incentivized to notice when a step is wrong and fix it, rather than committing to a flawed path because it started there.
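The difference between scoring an outcome and scoring a process can be shown with a toy example. Everything here is illustrative: `toy_arithmetic_scorer` is a hypothetical stand-in for a learned PRM (it only checks steps of the form "a op b = c"), and aggregating step scores by product is one common choice, assumed here rather than taken from the text.

```python
import math
from typing import Callable, Optional, Sequence


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-style signal: one score for the whole chain."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def toy_arithmetic_scorer(step: str) -> float:
    """Hypothetical PRM stand-in: 1.0 if 'a op b = c' holds, else 0.0."""
    left, right = step.split("=")
    a, op, b = left.split()
    ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y, "*": lambda x, y: x * y}
    return 1.0 if ops[op](int(a), int(b)) == int(right) else 0.0


def process_rewards(steps: Sequence[str], score_step: Callable[[str], float]) -> list[float]:
    """Process-style signal: one score per reasoning step."""
    return [score_step(s) for s in steps]


def first_weak_step(step_scores: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scoring below threshold, or None."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None


chain = ["3 * 4 = 12", "12 + 5 = 18", "18 - 2 = 16"]  # second step is wrong
scores = process_rewards(chain, toy_arithmetic_scorer)

print(outcome_reward("16", "15"))  # 0.0: the chain failed, but not where
print(scores)                      # [1.0, 0.0, 1.0]
print(first_weak_step(scores))     # 1: the error is localized to step 1
print(math.prod(scores))           # 0.0: product as a whole-chain score
```

The outcome signal says only that the chain failed. The per-step signal says which step to revisit, which is the information backtracking needs.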
In OpenAI's published evaluations, o1 ranked in the 89th percentile on Codeforces competitive programming problems, where GPT-4o ranked in the 11th. On the 2024 AIME, a qualifier for the USA Mathematical Olympiad (USAMO), its best configuration scored high enough to place among the top 500 students nationally and clear the USAMO cutoff. OpenAI's write-up attributes the gap to behaviors learned during training: the model recognizes and corrects its own mistakes, breaks tricky steps into simpler ones, and tries a different approach when the current one is not working.
OpenAI's o1 release write-up described a consistent finding: performance improved smoothly as the model was given more time to think, with accuracy on its AIME evaluation rising roughly linearly in the logarithm of test-time compute. Each doubling of the thinking budget moved the model measurably up the performance curve. This inference scaling behavior mirrors the training scaling laws described by Hoffmann et al. (Chinchilla, 2022).
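To make "doubling moves the model up the curve" concrete, suppose accuracy is approximately log-linear in the thinking budget over some range. This is an illustrative assumption: the symbols T (thinking tokens), a, and b are not from the source, and the form cannot hold indefinitely because accuracy is capped at 1.

```latex
\mathrm{acc}(T) \approx a + b \log_2 T
\quad\Longrightarrow\quad
\mathrm{acc}(2T) - \mathrm{acc}(T) \approx b \log_2 2 = b
```

Under that assumption each doubling buys the same fixed increment b, so every additional point of accuracy costs exponentially more tokens.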
The practical implication is that system designers can tune the compute budget to match the stakes of a task. A coding assistant helping with a routine function might use a minimal thinking budget. A system verifying a medical diagnosis recommendation might use the maximum available budget. Cost and latency scale with budget, so the tradeoff is explicit and controllable.
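A rough sketch of how budget translates into cost and latency follows. The tier names, token counts, price, and decoding speed are all invented for illustration; real numbers depend on the provider and model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetTier:
    name: str
    max_thinking_tokens: int


# Hypothetical tiers keyed by how much is at stake in the task.
TIERS = {
    "low": BudgetTier("minimal", 1_000),
    "medium": BudgetTier("moderate", 8_000),
    "high": BudgetTier("maximum", 32_000),
}


def estimate(tier: BudgetTier, usd_per_million_tokens: float,
             tokens_per_second: float) -> tuple[float, float]:
    """Worst-case (cost in USD, latency in seconds) if the full budget is used."""
    cost = tier.max_thinking_tokens * usd_per_million_tokens / 1_000_000
    latency = tier.max_thinking_tokens / tokens_per_second
    return cost, latency


for stakes, tier in TIERS.items():
    cost, latency = estimate(tier, usd_per_million_tokens=10.0, tokens_per_second=100.0)
    print(f"{stakes:>6}: {tier.name:<8} ~${cost:.2f}, ~{latency:.0f}s")
```

Both quantities grow linearly with the budget in this model, which is what makes the tradeoff explicit: a 32x larger budget costs 32x more and takes 32x longer in the worst case.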
The thinking chain is not output to the user, but it shapes everything the user sees. Understanding that a reasoning model is performing search — not retrieval — changes how you should prompt it. Give it a hard constraint problem with clear success criteria, not a vague question expecting a single lookup.
Explore how reasoning chains work internally. Ask questions about process reward models, self-consistency sampling, or how to design prompts that leverage a model's backtracking ability. Try asking the assistant to show you how it would structure a reasoning chain for a specific problem type.
After o1's release, researchers at Princeton, MIT, and several AI labs began stress-testing it. They found a revealing pattern: on well-defined hard problems — competition mathematics, formal logic, complex code debugging — o1 dramatically outperformed GPT-4o. But on tasks requiring up-to-date knowledge, creative writing style, or factual retrieval, the thinking tokens added latency without proportional benefit. The extra compute had solved the wrong problem.
Extended reasoning delivers the most value on problems with certain characteristics. They have verifiable correctness — there is a right answer and the model can check whether it has found one. They require multi-step deduction — no single lookup resolves them; you must chain inferences. And they have intermediate checkpoints — the path to the answer has natural sub-goals that can be validated before proceeding.
On poorly suited tasks, reasoning models are simply expensive. Tasks that primarily require factual retrieval ("What year did the French Revolution begin?") don't benefit from extended reasoning because no chain of inference is required; the answer is either in the weights or it isn't. Thinking tokens are wasted searching for something that doesn't require search.
Tasks requiring creative judgment — "Write a poem in the style of Mary Oliver" — also see minimal gains. There is no correct answer to verify, no backtracking path that is objectively better. The model's stylistic quality comes from training, not from deliberation.
Time-sensitive queries are another poor fit. If a user needs a quick answer, a 30-second thinking chain is useless even if it produces a marginally more accurate result. System designers at companies like Anthropic and Google have noted that routing queries to the appropriate tier of reasoning — minimal for simple tasks, maximum for complex ones — is itself a significant design challenge.
ARC-AGI (Abstraction and Reasoning Corpus) was designed by François Chollet specifically to resist pattern memorization: each puzzle requires novel reasoning from first principles. GPT-4o scored approximately 5% on ARC-AGI. In results announced in December 2024, o3 scored 75.7% in its high-efficiency configuration and 87.5% in a low-efficiency configuration that used roughly 172 times more compute. Even the high-efficiency setting was estimated to cost roughly $17 to $20 per puzzle, illustrating the sharp tradeoff between performance and cost that characterizes TTC-heavy workloads.
A reasoning model thinking for 30 seconds before answering is fine for a one-off hard math problem. It is catastrophic for a customer service chatbot handling thousands of simultaneous queries where users expect responses in under two seconds. OpenAI acknowledged this in the o1 product release: the model was positioned for tasks that benefit from careful reasoning rather than as a replacement for GPT-4o in latency-sensitive applications.
Google's December 2024 release of Gemini 2.0 Flash Thinking addressed part of this by optimizing a reasoning model for speed — producing thinking outputs with lower latency than o1, though with some accuracy tradeoff. This represents an emerging tier: "fast reasoning" models that apply moderate extended thinking rather than exhaustive search.
Test-time compute is a precision instrument, not a general upgrade. The question to ask for any task is: does this problem have a correct answer that can be verified, requiring multiple inference steps to reach? If yes, extended reasoning helps. If no, you're paying for thinking that isn't solving anything.
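That question can be written down as a routing rule. This is a deliberately crude sketch: the `Task` fields and the 30-second latency figure are assumptions for illustration, and a real router would have to estimate these properties from the query rather than receive them as labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    description: str
    verifiable: bool         # is there a correct answer that can be checked?
    multi_step: bool         # does reaching it require chained inference?
    latency_budget_s: float  # how long the user is willing to wait


def route(task: Task, reasoning_latency_s: float = 30.0) -> str:
    """Return 'reasoning' or 'standard' for a task (illustrative rule only)."""
    if task.latency_budget_s < reasoning_latency_s:
        return "standard"  # the user cannot wait for a long thinking chain
    if task.verifiable and task.multi_step:
        return "reasoning"
    return "standard"


examples = [
    Task("What year did the French Revolution begin?", True, False, 60.0),
    Task("Solve a competition geometry problem", True, True, 300.0),
    Task("Write a poem in the style of Mary Oliver", False, False, 60.0),
    Task("Customer service chat reply", True, True, 2.0),
]
for t in examples:
    print(f"{route(t):>9}  {t.description}")
```

Only the geometry problem is routed to the reasoning model. The interesting cases are the ones this rule cannot express, such as tasks that are only partly verifiable.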
You are a system designer deciding which queries to route to an expensive reasoning model vs. a fast standard model. Describe tasks to the assistant and get its analysis of whether extended reasoning compute would help. Then challenge its reasoning — when is the boundary ambiguous?
When DeepSeek released R1 with open weights on January 20, 2025, the AI industry's assumption that extended reasoning required billions in proprietary infrastructure was upended. R1 performed comparably to o1 on many reasoning benchmarks. It was trained largely with reinforcement learning using group relative policy optimization (GRPO), a method DeepSeek had introduced about a year earlier in its DeepSeekMath work, and it reached competitive reasoning quality at a fraction of the reported cost. Within weeks, researchers worldwide were fine-tuning reasoning models on consumer hardware.
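The widely quoted figure of approximately $5.6 million comes from DeepSeek's V3 technical report and covers only the final training run of the V3 base model on which R1 was built; it excludes earlier research, ablations, and R1's own reinforcement learning stage. Estimates for comparable frontier models at OpenAI and Google run to $100 million or more. Exact comparisons are difficult (different hardware, different objectives, different accounting), but the benchmark results showed that competitive reasoning quality was reachable at a substantially lower reported cost.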
The key innovation was GRPO, which eliminated the need for a separate critic model (as in standard PPO reinforcement learning) by using group-relative reward normalization. This made the training pipeline simpler and cheaper while achieving similar or better results on reasoning tasks. R1 also distilled its reasoning capabilities into smaller models — down to 1.5B parameters — that could run on consumer laptops while retaining meaningful reasoning ability.
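The group-relative normalization at the heart of GRPO is small enough to sketch directly. This shows only the advantage computation, not the full policy update (clipping, KL penalty, and so on). Using the population standard deviation and a small `eps` for numerical safety are assumptions of this sketch.

```python
import statistics
from typing import Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group.

    rewards holds the scores of several responses sampled for the SAME
    prompt. The group mean plays the role a learned critic's baseline
    would play in PPO, so no separate value model is needed.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt: two judged correct (1.0), two not (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct samples get an advantage close to +1 and incorrect ones close to -1. If every sample in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.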
The open-weights release had immediate effects. Within weeks of R1's release, researchers at institutions without frontier model access were running extended reasoning experiments. Startups were building reasoning-capable products on top of R1's open weights rather than paying API fees. The inference-scaling paradigm had escaped the closed frontier model ecosystem.
Several architectural directions are emerging for the next generation of test-time compute systems:
OpenAI's o-series research strongly suggested that inference scaling laws exist: more thinking tokens produce better answers, following a roughly log-linear relationship. But subsequent research has noted a complication: the scaling relationship may saturate for some task classes. Beyond a certain thinking budget, adding more tokens does not improve the answer; the model has exhausted the productive search space and begins generating redundant reasoning paths.
This means the optimal compute budget is task-dependent and potentially learnable. A model that can estimate "how much thinking is enough" for a given query would provide large economic benefits. This remains an active research problem as of mid-2025.
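One simple way to pick a budget from measurements is to stop where the next increase no longer pays for itself. The accuracy numbers below are made up for illustration, and the `min_gain` threshold is an arbitrary assumption; a learned estimator would predict this per query instead of reading it off a table.

```python
def smallest_sufficient_budget(curve: dict[int, float], min_gain: float = 0.01) -> int:
    """Pick the smallest budget past which the next step up gains < min_gain.

    curve maps a thinking-token budget to measured accuracy at that budget.
    """
    if not curve:
        raise ValueError("need at least one measurement")
    budgets = sorted(curve)
    for current, larger in zip(budgets, budgets[1:]):
        if curve[larger] - curve[current] < min_gain:
            return current
    return budgets[-1]  # no saturation observed in the measured range


# Invented measurements for one task class: gains flatten after 4,000 tokens.
measured = {1_000: 0.40, 2_000: 0.52, 4_000: 0.61, 8_000: 0.615, 16_000: 0.616}
print(smallest_sufficient_budget(measured))  # 4000
```

On a task class that has not saturated within the measured range, the function returns the largest budget tried, which signals that the measurement range should be extended.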
In December 2024, OpenAI reported o3's results on FrontierMath, a benchmark of original, research-level mathematics problems written by expert mathematicians. o3 solved approximately 25% of these problems, where previous models had solved under 2%. The benchmark's creators noted that the problems can take expert mathematicians hours or even days to solve. This performance relied on very long thinking chains, far beyond what a single-pass answer involves.
The emergence of test-time compute as a viable scaling axis has several important implications. First, capability gains no longer require only larger training runs — a well-trained model with an extended thinking budget can dramatically outperform a much larger model on the right tasks. This partially decouples capability from model size.
Second, the open-weights replication of reasoning capabilities (via DeepSeek-R1 and its successors) means that extended reasoning is rapidly democratizing. Tasks that required frontier model access in 2024 can be performed by researchers with consumer hardware in 2025. The capability gap between open and closed models narrowed significantly, and faster than most analysts predicted.
Third, and most important for practitioners: the relevant question has shifted from "is this model smart enough" to "is this model using its smartness in the right way on this problem." Prompting, task design, and compute budget allocation become as important as model selection.
Test-time compute is not merely a product feature — it represents a fundamental shift in how AI capability scales. Training scaling laws remain important, but inference scaling laws are now a parallel axis. The most capable AI systems of 2025 and beyond will likely be those that intelligently allocate compute across both dimensions: trained well, and thinking carefully.
Engage with the assistant about the future of inference scaling. Challenge assumptions, explore the generator-verifier architecture, debate whether reasoning saturation limits the paradigm, or dig into what DeepSeek-R1's success means for the competitive landscape. Think critically about what inference scaling can and cannot achieve.