When OpenAI exposed a temperature slider in its API Playground, developers noticed something striking: the same prompt sent ten times at temperature 0 returned nearly identical text on every run, while the same prompt at temperature 2 produced ten wildly different paragraphs, some poetic, some incoherent. The parameter was not a mystery; it had been in research papers since the 1980s. But watching it operate live made the mechanism viscerally clear.
At every generation step an LLM produces a vector of raw scores called logits — one number per token in its vocabulary (GPT-4 uses roughly 100,000 tokens). These numbers are not yet probabilities. A softmax function converts them: it exponentiates each logit and divides by the sum, so all values are positive and sum to 1.0. The result is a probability distribution over the next token.
Before softmax is applied, every logit is divided by the temperature T. That is the entire operation. No neural network weights change. No fine-tuning occurs. Just division by a scalar.
$$P(\text{token}_i) = \frac{\exp(\text{logit}_i / T)}{\sum_j \exp(\text{logit}_j / T)}$$
where T is the temperature. When T = 1.0 the distribution is unmodified. As T → 0 the highest logit dominates completely. As T grows large, all tokens converge toward equal probability.
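Imagine a five-token vocabulary to which the model has assigned these raw logits: "cat" = 4.0, "dog" = 3.0, "bird" = 1.5, "fish" = 0.5, "rock" = −1.0. After softmax at T = 1.0 the probabilities are approximately: "cat" 67.3%, "dog" 24.7%, "bird" 5.5%, "fish" 2.0%, "rock" 0.5%.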
At T = 0.2 (cold), dividing the logits by 0.2 multiplies every gap between them by five: "cat" absorbs about 99.3% of the probability mass. At T = 2.0 (hot), dividing by 2.0 halves every gap: "cat" falls to about 46.5% while "rock" rises from about 0.5% to about 3.8%, so all five tokens become more equally likely.
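The five-token example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `softmax_with_temperature` is an illustrative choice, and subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat T=0 as argmax")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["cat", "dog", "bird", "fish", "rock"]
logits = [4.0, 3.0, 1.5, 0.5, -1.0]

# Print the distribution at a cold, neutral, and hot temperature.
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    row = ", ".join(f"{tok}={p:.3f}" for tok, p in zip(tokens, probs))
    print(f"T={t}: {row}")
```

Running it should show "cat" near 0.993 at T = 0.2, near 0.673 at T = 1.0, and near 0.465 at T = 2.0. The ranking of the tokens never changes; only the gaps between them do.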
The word "temperature" is borrowed from statistical physics. In a cold system, particles settle into their lowest-energy states; in a hot system, they scatter. The mathematical form is identical — the Boltzmann distribution uses exactly this exponential-over-sum structure. When Geoff Hinton and colleagues adapted it for Boltzmann machines and later for language generation, the name came along naturally.
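This is more than a loose metaphor, because the mathematics is shared. The same Boltzmann weighting that describes how thermal agitation overwhelms magnetic order in iron above its Curie temperature also describes how a high sampling temperature washes out the model's preferences, which is why an LLM starts producing nonsense somewhere above temperature 1.5.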
Temperature does not change what the model knows. It changes how confidently the model commits to what it knows. A model at T=0 always picks the statistically most likely next token. A model at T=1.5 picks from a much wider range — including tokens that are plausible but unusual.
Anthropic's API documentation for Claude exposes three sampling parameters: temperature, top_p, and top_k. Its guidance is to alter either temperature or top_p, but not both, since both reshape the effective distribution and their combined effect is hard to predict. This reflects a genuine engineering tradeoff that shipped models navigated in practice: pure temperature is easy to reason about mathematically, but it can assign significant probability to tokens that are semantically absurd in context.
Suppose a model's vocabulary has 100,000 tokens. At temperature 1.0, most probability mass sits on perhaps a few hundred plausible continuations. But the remaining 99,700+ tokens still receive tiny nonzero probabilities. Over many sampling steps, the model will eventually land on one of them. A single highly-improbable token can derail a coherent paragraph.
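A rough calculation shows why "eventually" arrives quickly. The sketch below assumes, purely for illustration, that the implausible tail holds a constant 1% of the probability mass at every step and that steps are independent; neither assumption holds exactly for a real model, and the function name is hypothetical.

```python
def prob_at_least_one_tail_token(tail_mass, steps):
    """Chance that at least one of `steps` independent draws lands in the tail."""
    return 1.0 - (1.0 - tail_mass) ** steps

# Assumed tail mass of 1% per step (an illustrative figure, not a measurement).
for steps in (10, 100, 500):
    chance = prob_at_least_one_tail_token(0.01, steps)
    print(f"{steps} tokens: {chance:.3f}")
```

Under these assumptions the chance of hitting at least one tail token is roughly 10% after 10 tokens, 63% after 100, and over 99% after 500, which is why long generations are the ones that derail.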
The solution is to restrict which tokens are even eligible for sampling before a final draw is made. Four strategies dominate in practice.
Greedy decoding: Always select the argmax token. Equivalent to T→0. Completely deterministic. Fast and consistent, but prone to repetition loops and boring output. Used in translation tasks where precision matters most.
Top-k sampling: Keep only the K highest-probability tokens; set all others to zero probability; renormalise; sample. K=50 is a common default. Problem: K=50 means very different things when the model is confident (top token has 80%) versus uncertain (top 50 tokens all near 2%).
Top-p (nucleus) sampling: Keep the smallest set of tokens whose cumulative probability exceeds P. Introduced by Holtzman et al. (2019) in "The Curious Case of Neural Text Degeneration." P=0.9 means: keep tokens until we've covered 90% of the probability mass, then renormalise and sample. Adapts to confidence level automatically.
Min-p sampling: Keep any token whose probability is at least P × (probability of the top token). If the top token has 60% probability and min_p=0.05, keep all tokens with ≥3% probability. Proposed in 2023–2024 as an improvement for creative tasks, it handles both high- and low-confidence positions gracefully.
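The three truncation strategies above can be sketched as small functions over a list of probabilities. This is an illustrative standard-library version, not any particular library's implementation; the function names and the example distribution are assumptions made for the demonstration.

```python
def renormalise(probs):
    """Rescale the surviving probabilities so they sum to 1."""
    total = sum(probs)
    return [p / total for p in probs]

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    return renormalise([p if i in keep else 0.0 for i, p in enumerate(probs)])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalise([q if i in keep else 0.0 for i, q in enumerate(probs)])

def min_p_filter(probs, min_p):
    """Keep tokens with probability at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return renormalise([q if q >= threshold else 0.0 for q in probs])

# Hypothetical next-token distribution, already sorted for readability.
probs = [0.6, 0.2, 0.1, 0.05, 0.04, 0.01]
print("top-k (k=2):     ", top_k_filter(probs, 2))
print("top-p (p=0.85):  ", top_p_filter(probs, 0.85))
print("min-p (0.05):    ", min_p_filter(probs, 0.05))
```

On this distribution top-k with k=2 keeps two tokens, top-p with p=0.85 keeps three, and min-p with 0.05 keeps five (its threshold is 0.05 × 0.6 = 0.03). Greedy decoding is the special case `top_k_filter(probs, 1)`.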
Ari Holtzman and colleagues at the University of Washington published "The Curious Case of Neural Text Degeneration" (ICLR 2020), demonstrating that maximisation-based decoding leads to repetition, while sampling from the full, untruncated distribution leads to incoherence because of its unreliable tail. Their nucleus (top-p) sampling showed measurably better human preference scores on open-ended generation tasks.
The key insight: the number of tokens needed to cover 90% of probability mass varies enormously by context. When the next word is highly constrained ("The Eiffel Tower is located in ___"), the nucleus contains just a handful of tokens. When the context is ambiguous ("Once upon a time there was a ___"), the nucleus spans hundreds. Top-p handles both cases gracefully; top-k does not.
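The adaptivity is easy to see by counting how many tokens the nucleus contains for two contrasting distributions. The distributions below are invented for illustration (a 1,000-token vocabulary with either one dominant token or a perfectly flat spread), and `nucleus_size` is a hypothetical helper name.

```python
def nucleus_size(probs, p):
    """Number of tokens in the smallest set whose cumulative mass reaches p."""
    cumulative = 0.0
    for count, q in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += q
        if cumulative >= p:
            return count
    return len(probs)

# Hypothetical distributions over a 1,000-token vocabulary.
confident = [0.92] + [0.08 / 999] * 999   # one token dominates
ambiguous = [1.0 / 1000] * 1000           # every token equally likely

print("confident context:", nucleus_size(confident, 0.9))
print("ambiguous context:", nucleus_size(ambiguous, 0.9))
```

With p = 0.9 the confident case needs a single token, while the flat case needs on the order of 900. A fixed top-k of 50 would admit 49 implausible tokens in the first case and discard most of the plausible ones in the second.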
Human evaluators consistently rated nucleus-sampled text as more coherent and interesting than text from pure temperature sampling or beam search. The paper showed that maximisation strategies (greedy, beam search) lead to degenerate text that is bland, incoherent, or stuck in repetitive loops, because they systematically avoid the long tail of plausible continuations that give human text its richness.
In typical sampling pipelines, including those behind hosted APIs such as OpenAI's and Anthropic's, temperature and top-p interact: temperature reshapes the full distribution first, then top-p truncates it. Combining a high temperature with a low top-p is largely self-defeating: the temperature widens the distribution, and top-p cuts off the widened tail anyway. Both vendors advise adjusting one parameter and leaving the other alone. In practice: if you want creative output, raise temperature and leave top-p at its default (0.999 or 1.0). If you want focused output, lower top-p and leave temperature at 1.0.
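The two-stage interaction can be sketched as a single sampling function: scale by temperature, truncate to the nucleus, then draw. This is a simplified illustration under the assumption that temperature is applied before top-p; the function name and defaults are hypothetical, and hosted APIs may order or implement these steps differently.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature-scale logits, truncate to the nucleus, then draw one index."""
    if temperature <= 0:
        # Treat T=0 as greedy decoding rather than dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Stage 1: temperature reshapes the whole distribution.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Stage 2: top-p keeps the smallest high-probability set reaching top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Stage 3: draw from the nucleus (random.choices renormalises the weights).
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [4.0, 3.0, 1.5, 0.5, -1.0]
print([sample_next_token(logits, temperature=1.0, top_p=0.5, rng=rng) for _ in range(5)])
print([sample_next_token(logits, temperature=2.0, top_p=1.0, rng=rng) for _ in range(5)])
```

With top_p = 0.5 at T = 1.0 the top token alone already covers the nucleus, so every draw returns index 0 regardless of the random seed. Raising temperature to 2.0 with top_p = 1.0 lets any of the five indices appear.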
Real-world defaults as of 2024: the OpenAI API defaults to temperature=1.0, top_p=1.0 for GPT-4. Claude defaults to temperature=1.0. llama.cpp defaults to temperature=0.8, top_k=40, top_p=0.95, reflecting the open-source community's preference for slightly more conservative generation.
For factual or code tasks: temperature 0.0–0.3, top-p 1.0. For creative writing: temperature 0.8–1.2, top-p 0.9–0.95. Never push temperature above 1.5 in production — the quality degradation is rarely worth any diversity gain at that point.
When GitHub Copilot launched in technical preview in June 2021, teams at GitHub and OpenAI reportedly tested a range of temperature settings for code completion. Developer write-ups suggested that temperature values above 0.4 produced too many syntactically invalid suggestions, while values below 0.1 produced repetitive boilerplate. The production system is believed to have settled on approximately 0.2–0.3 for single-line completions and slightly higher values for multi-line block suggestions, where more diverse structure is acceptable. This range was never disclosed officially and is inferred from outside analysis, but it illustrates a core principle: the right temperature depends on the cost of errors in your task.
Different tasks have fundamentally different relationships between diversity and quality. A wrong character in a function name breaks code. A surprising metaphor in a poem is a feature. Engineers calibrate accordingly:
Code generation: T = 0.0–0.3. Correctness dominates. Syntax errors compound through subsequent tokens. Even T=0.5 can introduce enough randomness to produce subtle bugs in 10–20% of completions on complex tasks.
Factual Q&A and classification: T = 0.0–0.4. Consistent, reproducible answers. Low temperature reduces hallucination frequency by forcing the model to its highest-confidence outputs. OpenAI recommends T=0 for classification tasks.
Creative writing: T = 0.7–1.2. Diversity and surprise are valuable. The "curious case" phenomenon (Holtzman 2019) shows human raters prefer varied text. Claude's safety behaviour comes from training (Anthropic's constitutional AI approach) and from safeguards at other layers rather than from the sampling temperature, which is one reason higher temperatures remain usable.
Conversational assistants: T = 0.5–0.9. Balance between consistency and naturalness. The OpenAI API's default of T=1.0 sits just above this range, consistent with the view that robotic-sounding responses (low T) reduce user satisfaction even if they're technically correct.
One of the most practically important connections: higher temperature increases hallucination rate. This is a direct consequence of the mechanism — raising temperature gives non-maximal tokens higher sampling probability. Among those non-maximal tokens are often factually incorrect continuations that the model still assigns some probability mass to.
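The mechanism can be quantified directly: the probability of sampling anything other than the top-scoring token rises with temperature. The sketch below reuses the five-token logits from the earlier worked example; the helper name `non_argmax_mass` is illustrative, and the numbers describe this toy distribution only, not hallucination rates in a real model.

```python
import math

def non_argmax_mass(logits, temperature):
    """Probability of sampling anything other than the top-scoring token."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    return 1.0 - max(exps) / sum(exps)

logits = [4.0, 3.0, 1.5, 0.5, -1.0]  # "cat", "dog", "bird", "fish", "rock"
for t in (0.3, 0.7, 1.0, 1.5):
    print(f"T={t}: non-argmax mass = {non_argmax_mass(logits, t):.3f}")
```

At T = 1.0 roughly a third of the draws land on a token other than "cat"; at T = 0.3 the figure is a few percent. If some of those alternative tokens begin a factually wrong continuation, the chance of starting down that path scales the same way.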
A 2023 study from the University of Oxford and DeepMind (published as part of the TruthfulQA benchmark analysis) observed that LLaMA-based models showed approximately 15–20% higher hallucination rates at T=1.0 compared to T=0.3 on closed-domain factual questions. The effect was smaller for larger models — larger models assign more concentrated probability to correct tokens — but remained measurable across all model sizes tested.
This creates a fundamental engineering dilemma for RAG (retrieval-augmented generation) systems: you need enough temperature to produce natural-sounding prose, but not so much that the model departs from the retrieved context into invented facts.
OpenAI's API reference describes the temperature parameter this way: "Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both." This explicit caution reflects the production reality that stacking both parameters is confusing to reason about and rarely beneficial.
The concept extends beyond LLMs. In diffusion models like Stable Diffusion, "temperature" or "guidance scale" controls a related tradeoff. The classifier-free guidance scale (CFG) introduced by Ho and Salimans (2022) behaves analogously: low CFG produces diverse but prompt-misaligned images; high CFG produces prompt-faithful but artefact-heavy images. Midjourney's "stylize" parameter plays a similar role. Users quickly discover empirically what temperature researchers formalised: there is always a diversity-quality frontier.
Beam search — maintaining the top B partial sequences at each step — was the dominant decoding strategy in neural machine translation (Google Translate used it extensively through 2020). It produces high-BLEU-score translations but notoriously repetitive, stilted text in open-ended generation. The transition from beam search to temperature sampling with top-p was one of the key engineering decisions that made ChatGPT feel conversational rather than robotic. Google Brain's 2022 paper "Scaling Instruction-Finetuned Language Models" (FLAN-T5) confirmed that sampling-based decoding with moderate temperature outperformed beam search on human preference evaluations for dialogue tasks.
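A toy beam search makes the "top B partial sequences" idea concrete. The sketch below uses an invented transition table in place of a real model (next-token probabilities depend only on the previous token), so the table, the tokens, and the function name are all assumptions for illustration.

```python
import math

# Hypothetical toy "model": next-token probabilities given the last token.
TRANSITIONS = {
    "<s>": {"the": 0.55, "a": 0.45},
    "the": {"cat": 0.45, "dog": 0.35, "</s>": 0.20},
    "a":   {"dog": 0.90, "cat": 0.05, "</s>": 0.05},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(beam_width, max_len=5):
    """Keep the `beam_width` highest log-probability partial sequences per step."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":
                candidates.append((score, seq))  # finished sequences carry over
                continue
            for token, prob in TRANSITIONS[seq[-1]].items():
                candidates.append((score + math.log(prob), seq + [token]))
        # Prune to the best `beam_width` candidates by total log-probability.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == "</s>" for _, seq in beams):
            break
    return beams

for width in (1, 2):
    score, seq = beam_search(width)[0]
    print(f"beam width {width}: {' '.join(seq)}  (p = {math.exp(score):.4f})")
```

With a beam width of 1 (greedy) the search commits to "the" and ends with "the cat" at probability 0.2475. With a width of 2 it keeps "a" alive long enough to find "a dog" at 0.405. Both are deterministic, which is exactly why beam search suits translation and feels stilted in open-ended dialogue.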
Temperature is a dial on the diversity-accuracy frontier. Every production system must choose where it sits on that frontier based on what failure modes are most costly for its users. There is no universally correct setting — only context-appropriate ones.
When OpenAI released GPT-3 in June 2020, early API users discovered a persistent problem: even at moderate temperatures, the model would enter repetition loops — sometimes generating the same phrase dozens of times in a row. This wasn't a hallucination problem or a temperature problem. It was a structural quirk of autoregressive generation: once a token pattern achieves high probability, it can become self-reinforcing, creating a probability attractor. OpenAI's response was to introduce presence penalties and frequency penalties as first-class API parameters — logit adjustments that discount recently-used or frequently-used tokens. These parameters shipped before GPT-4 and remain in the API today.
Repetition in LLM output is not random. It arises because the model's probability distribution is conditioned on its own previous output. If the model generates "the" at position N, the context now contains "the", making "the" somewhat more likely at position N+1 (many common phrases begin with "the"). Over many steps, high-frequency tokens can create feedback loops.
This is distinct from the hallucination problem. A model can be repetitive and factually accurate simultaneously. The issue is purely about the dynamics of autoregressive sampling.
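OpenAI's API exposes two penalty parameters, both operating on logits before softmax. The frequency penalty grows with the number of times a token has already appeared; the presence penalty is a flat, one-time deduction applied to any token that has appeared at all.
The formula: adjusted_logit = logit − frequency_penalty × token_count − presence_penalty × (token_appeared ? 1 : 0). OpenAI's API accepts values between −2.0 and 2.0. Positive values discourage repetition; negative values increase it (occasionally useful for tasks like generating structured data where repeated tokens are expected).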
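The penalty formula translates almost line for line into code. This is a minimal sketch over a list of logits indexed by token id; the function name and the example values are illustrative, not OpenAI's implementation.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of `logits` with frequency and presence penalties applied.

    `generated_ids` is the list of token ids produced so far.
    """
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # Frequency penalty scales with the count; presence penalty is flat.
        adjusted[token_id] -= frequency_penalty * count + presence_penalty
    return adjusted

logits = [2.0, 1.0, 0.5]   # hypothetical three-token vocabulary
history = [0, 0, 1]        # token 0 used twice, token 1 once, token 2 never
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.2))
```

Here token 0 loses 0.5 × 2 + 0.2 = 1.2, token 1 loses 0.5 × 1 + 0.2 = 0.7, and token 2 is untouched, so the never-used token becomes relatively more attractive on the next draw.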
An important nuance: frequency penalty primarily prevents phrase-level looping. Presence penalty primarily prevents topic-level looping (mentioning the same concept repeatedly). They address different aspects of the repetition problem and are often used together.
For most chat applications: frequency_penalty = 0.5–1.0, presence_penalty = 0.0–0.6. For creative writing: presence_penalty 0.5–1.0 encourages narrative to move forward. For code generation: both near 0 — repetition in code is often intentional (variable reuse, standard patterns).
A different class of problem: getting LLMs to output valid JSON, SQL, or other structured formats reliably. Temperature and sampling handle randomness, but they don't enforce schema compliance. The solution is constrained decoding: at each token position, mask to zero any tokens that would make the output invalid according to the target schema, then sample from only the valid remainder.
Libraries like Outlines (from Normal Computing, 2023) and Microsoft's Guidance implement this. They convert a JSON schema or regular grammar into a set of logit masks that change at each step as the model generates partial output. The result: the model samples freely from its distribution but can only ever produce syntactically valid output. Temperature still controls diversity within valid outputs — it's not bypassed, just constrained.
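The masking idea can be shown with a toy grammar. The sketch below does not use Outlines or Guidance; it hand-writes a hypothetical rule ("one to three digits, then end-of-sequence") over an invented five-token vocabulary, and it reuses the same logits at every step where a real model would recompute them.

```python
import math
import random

VOCAB = ["7", "4", "x", "hello", "<eos>"]  # hypothetical tiny vocabulary

def allowed(prefix, token, max_digits=3):
    """Hypothetical grammar: one to three digits, then <eos>."""
    if token == "<eos>":
        return len(prefix) >= 1
    return token.isdigit() and len(prefix) < max_digits

def constrained_sample(logits, prefix, temperature=1.0, rng=random):
    """Mask invalid tokens to -inf, then sample from the valid remainder."""
    masked = [z / temperature if allowed(prefix, tok) else -math.inf
              for z, tok in zip(logits, VOCAB)]
    peak = max(masked)
    weights = [math.exp(m - peak) for m in masked]  # exp(-inf) == 0.0
    return rng.choices(VOCAB, weights=weights, k=1)[0]

def generate(logits, rng=random):
    """Sample tokens until <eos>; the mask changes as the prefix grows."""
    prefix = []
    while True:
        token = constrained_sample(logits, prefix, rng=rng)
        if token == "<eos>":
            return "".join(prefix)
        prefix.append(token)

# The model strongly "prefers" the invalid tokens, yet output stays valid.
biased_logits = [0.0, 0.0, 5.0, 5.0, 0.0]
rng = random.Random(0)
print([generate(biased_logits, rng) for _ in range(5)])
```

Even though the logits heavily favour "x" and "hello", every generated string is one to three digits long. Temperature still matters: it controls how the remaining probability is spread across the valid tokens.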
The Outlines library from Normal Computing demonstrated that constrained decoding with LLM-native grammars could reliably produce valid JSON, SQL, and regex-matched output with no accuracy cost versus unconstrained generation — and dramatically lower parsing error rates. This approach is now built into llama.cpp's grammar sampling feature (--grammar-file flag) and adopted by vLLM and other inference engines.
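A 2020 paper by Basu et al. proposed Mirostat, an algorithm that uses feedback at each token to hold the generated text near a target perplexity (surprise level). Instead of fixing the sampling parameters in advance, you set a target surprise value measured by cross-entropy, and the algorithm adjusts how aggressively it truncates the distribution (an adaptive top-k in the original paper) step by step to stay near that target. The effect resembles a temperature that retunes itself as generation proceeds.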
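The feedback loop can be sketched in a few lines. The version below follows the simplified Mirostat 2.0 variant as it is commonly implemented, where the controller moves a surprise cutoff `mu` rather than a temperature value; the parameter names `tau` (target surprise) and `eta` (learning rate) follow common usage, and the fixed probability list stands in for a real model that would produce a new distribution at every step.

```python
import math
import random

def mirostat_v2_step(probs, mu, tau, eta, rng=random):
    """One decoding step: truncate by surprise, sample, update the cutoff mu."""
    # Surprise of a token is -log2(p); drop tokens more surprising than mu.
    candidates = [(i, p) for i, p in enumerate(probs)
                  if p > 0 and -math.log2(p) <= mu]
    if not candidates:
        # Always keep at least the most likely token.
        best = max(range(len(probs)), key=lambda i: probs[i])
        candidates = [(best, probs[best])]

    ids, weights = zip(*candidates)
    total = sum(weights)
    choice = rng.choices(ids, weights=weights, k=1)[0]

    # Observed surprise is measured on the renormalised distribution.
    observed = -math.log2(probs[choice] / total)
    mu -= eta * (observed - tau)  # feedback: move the cutoff toward the target
    return choice, mu

rng = random.Random(0)
probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # hypothetical fixed distribution
tau = 1.0        # target surprise in bits
mu = 2 * tau     # a common initial cutoff
for _ in range(5):
    token, mu = mirostat_v2_step(probs, mu, tau=tau, eta=0.1, rng=rng)
    print(token, round(mu, 3))
```

When sampled tokens are more surprising than the target, `mu` shrinks and the next step truncates harder; when they are less surprising, `mu` grows and more of the tail becomes eligible.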
Mirostat is implemented in llama.cpp and Kobold AI. Users report it handles long-form generation better than fixed temperature, avoiding the quality collapse that sometimes occurs when temperature is set high and the model wanders into low-quality regions. It is a popular option for local model inference as of 2024 and illustrates the direction the field is moving: away from fixed hyperparameters toward adaptive, feedback-controlled decoding.
OpenAI's o1 and o3 models (late 2024) introduced a different paradigm: chain-of-thought reasoning tokens generated internally before the final answer. These reasoning traces are reportedly generated at moderate temperature to explore different solution paths, while the final answer extraction uses lower temperature for consistency. This "think hot, answer cold" approach, sampling creatively during deliberation but committing precisely at output, represents a sophisticated application of temperature as a cognitive control.
Google's Gemini team reported similar findings in their 2023 technical report: best-of-N sampling (generating N outputs at higher temperature, then selecting the best by a verifier) outperforms single greedy generation on mathematical reasoning by significant margins, despite introducing more randomness. The diversity creates a search space that the verifier can then optimise over.
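Best-of-N selection is simple to simulate. In the sketch below a noisy arithmetic "solver" stands in for an LLM sampled at some temperature, and a scoring function stands in for the verifier; both are invented for illustration, and the toy verifier has direct access to the right answer, which a real verifier would have to approximate.

```python
import random

def noisy_solver(a, b, temperature, rng):
    """Hypothetical stand-in for an LLM: higher temperature means larger errors."""
    error = round(rng.gauss(0, temperature))
    return a + b + error

def verifier_score(a, b, answer):
    """Hypothetical verifier: higher is better; 0 means the answer checks out."""
    return -abs((a + b) - answer)

def best_of_n(a, b, n, temperature, rng):
    """Sample n candidate answers and return the one the verifier likes best."""
    candidates = [noisy_solver(a, b, temperature, rng) for _ in range(n)]
    return max(candidates, key=lambda ans: verifier_score(a, b, ans))

def accuracy(n, temperature, trials=200, seed=0):
    """Fraction of trials in which best-of-n returns the correct sum."""
    rng = random.Random(seed)
    hits = sum(best_of_n(17, 25, n, temperature, rng) == 42 for _ in range(trials))
    return hits / trials

print("single sample:", accuracy(n=1, temperature=1.0))
print("best of 8:    ", accuracy(n=8, temperature=1.0))
```

In this toy setup a single sample is correct well under half the time, while best-of-8 is correct in the large majority of trials. The gain comes entirely from diversity plus selection: at temperature 0 all eight candidates would be identical and the verifier would have nothing to choose between.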
Temperature is the foundation, but production LLM decoding stacks multiple layers: temperature, top-p/top-k/min-p truncation, frequency/presence penalties, constrained grammar masks, and optional adaptive algorithms like Mirostat. The exact order in which these are applied varies between inference engines. Understanding each layer independently lets you diagnose which one to adjust when output quality problems arise.