Module 4 · Lesson 1

What Is Temperature?

The single number that turns a calculator into a poet — or a parrot.
How does one parameter change an LLM from deterministic machine to creative engine?

When OpenAI exposed a temperature slider in its API Playground, developers could see something striking: the same prompt sent ten times at temperature 0 returned nearly identical text on every run, while the same prompt at temperature 2 produced ten wildly different paragraphs, some poetic, some incoherent. The parameter was not a mystery; it had been in research papers since the 1980s. But watching it operate live made the mechanism viscerally clear.

Logits, Softmax, and the Raw Probability Table

At every generation step an LLM produces a vector of raw scores called logits — one number per token in its vocabulary (GPT-4 uses roughly 100,000 tokens). These numbers are not yet probabilities. A softmax function converts them: it exponentiates each logit and divides by the sum, so all values are positive and sum to 1.0. The result is a probability distribution over the next token.

Before softmax is applied, temperature T is used to divide every logit by T. That is the entire operation. No neural network weights change. No fine-tuning occurs. Just division by a scalar.

The Formula

$P(\text{token}_i) = \dfrac{\exp(\text{logit}_i / T)}{\sum_j \exp(\text{logit}_j / T)}$, where T is temperature. When T = 1.0 the distribution is unmodified. When T → 0 the highest logit dominates completely. When T is large all tokens converge toward equal probability.

Visualising the Effect

Imagine a five-token vocabulary for which the model has assigned these raw logits: "cat" = 4.0, "dog" = 3.0, "bird" = 1.5, "fish" = 0.5, "rock" = −1.0. After softmax at T=1.0 (values rounded):

cat: 67%
dog: 25%
bird: 6%
fish: 2%
rock: 0.5%

At T = 0.2 (cold), dividing logits by 0.2 magnifies differences: "cat" absorbs nearly all probability mass, about 99% for these logits. At T = 2.0 (hot), dividing by 2.0 halves the differences and flattens the distribution: "cat" falls to about 47% and the remaining tokens share the rest more evenly.

Temperature Effects at a Glance
T = 0.2 (Cold): cat 99.3%, dog 0.7%, bird <0.1%
T = 1.0 (Default): cat 67%, dog 25%, bird 6%
T = 2.0 (Hot): cat 47%, dog 28%, bird 13%
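
The divide-then-softmax step can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `softmax_with_temperature` is an illustrative choice, not part of any API, and the five logits are the ones from the example above.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide each logit by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        # T = 0 is undefined for division; treat it as greedy argmax instead.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["cat", "dog", "bird", "fish", "rock"]
logits = [4.0, 3.0, 1.5, 0.5, -1.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    row = ", ".join(f"{tok} {p:.1%}" for tok, p in zip(tokens, probs))
    print(f"T={t}: {row}")
```

Running it for several values of T shows the sharpening and flattening directly, with no change to the logits themselves.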

The Thermodynamics Analogy

The word "temperature" is borrowed from statistical physics. In a cold system, particles settle into their lowest-energy states; in a hot system, they scatter. The mathematical form is identical — the Boltzmann distribution uses exactly this exponential-over-sum structure. When Geoff Hinton and colleagues adapted it for Boltzmann machines and later for language generation, the name came along naturally.

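The analogy is mathematical, not just a loose metaphor: softmax with temperature has the same form as the Boltzmann distribution that describes how thermal agitation disorders a physical system, such as iron losing its ferromagnetic order above its Curie temperature. In the same way, raising T spreads probability toward unlikely tokens, which is why LLM output tends to drift toward nonsense at high settings such as 1.5 and above.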

Logit: Raw unnormalized score output by the model's final linear layer for each vocabulary token.
Softmax: Function converting a vector of real numbers into a probability distribution summing to 1.0.
Temperature T: Scalar divisor applied to logits before softmax. T<1 sharpens; T>1 flattens the distribution.
Greedy Decoding: Special case where T→0: always select the token with the highest probability.
Key Insight

Temperature does not change what the model knows. It changes how confidently the model commits to what it knows. A model at T=0 always picks the statistically most likely next token. A model at T=1.5 picks from a much wider range — including tokens that are plausible but unusual.

Lesson 1 Quiz

What Is Temperature?
What mathematical operation does temperature perform on logits before softmax?
Correct. Temperature divides every logit by T before softmax. This single operation reshapes the entire probability distribution without changing any model weights.
Not quite. Temperature divides every logit by T — a simple scalar division before the softmax function is applied.
At temperature T = 0.1, what happens to the probability distribution over tokens?
Correct. Dividing by a small number magnifies the differences between logits, making the already-highest-scoring token dominate even more completely.
Not quite. Dividing by a small T magnifies differences between logits, causing the distribution to sharpen dramatically around the top token.
Where does the word "temperature" in LLM sampling originate?
Correct. The Boltzmann distribution in statistical mechanics has exactly this exponential-over-sum structure, and the analogy was carried directly into machine learning by researchers including Geoff Hinton.
Not quite. The term comes from statistical physics — the Boltzmann distribution, which governs how particles distribute across energy states at a given temperature, uses the identical mathematical form.

Lab 1: Exploring Temperature

Conversation practice — ask 3+ questions to complete

Your Task

You are talking to an AI tutor specialised in temperature and logit mathematics. Explore how the logit → temperature division → softmax pipeline actually works. Ask it to walk you through a numerical example, or probe what happens at the extremes (T=0, T=100).

Suggested start: "Can you show me with actual numbers how dividing logits by T=0.5 versus T=2.0 changes the resulting probabilities?"
Module 4 · Lesson 2

Sampling Strategies

Temperature alone is blunt. Top-k, top-p, and min-p sculpt the distribution with precision.
What techniques let engineers keep creativity without allowing the model to sample from garbage tokens?

Anthropic's API documentation for Claude exposes three sampling parameters: temperature, top_p, and top_k. Its guidance is to adjust temperature or top_p, but not both, because both reshape the effective distribution. This reflects a genuine engineering tradeoff in shipped systems: pure temperature is easy to reason about mathematically, but it can accidentally assign significant probability to tokens that are semantically absurd in context.

Why Temperature Alone Is Insufficient

Suppose a model's vocabulary has 100,000 tokens. At temperature 1.0, most probability mass sits on perhaps a few hundred plausible continuations. But the remaining 99,700+ tokens still receive tiny nonzero probabilities. Over many sampling steps, the model will eventually land on one of them. A single highly-improbable token can derail a coherent paragraph.
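
A rough calculation shows why a small tail matters over a long generation. This sketch assumes, purely for illustration, that the implausible tokens together hold 0.1% of the probability at every step and that steps are independent; real models satisfy neither assumption exactly.

```python
def prob_tail_hit(tail_mass_per_step, steps):
    """Chance of drawing at least one tail token in `steps` independent draws."""
    return 1.0 - (1.0 - tail_mass_per_step) ** steps

# Assumed, illustrative tail mass: 0.1% of probability per step.
for steps in (100, 500, 2000):
    print(f"{steps} tokens: {prob_tail_hit(0.001, steps):.1%}")
```

Under these assumptions the chance of at least one tail token grows from under 10% at 100 tokens to above 85% at 2,000 tokens.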

The solution is to restrict which tokens are even eligible for sampling before a final draw is made. Three strategies dominate in practice.

Greedy Decoding

Always select the argmax token. Equivalent to T→0. Completely deterministic. Fast and consistent, but prone to repetition loops and boring output. Used in translation tasks where precision matters most.

Top-K Sampling

Keep only the K highest-probability tokens; set all others to zero probability; renormalise; sample. K=50 is a common default. Problem: K=50 means very different things when the model is confident (top token has 80%) versus uncertain (top 50 tokens all near 2%).
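
A minimal sketch of top-k filtering over a plain list of probabilities follows. The function names and the two toy distributions are illustrative assumptions; production samplers usually work on logit tensors instead.

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability entries, zero the rest, renormalise."""
    if k <= 0:
        raise ValueError("k must be positive")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

def mass_covered(probs, k):
    """Probability mass held by the k most likely tokens."""
    return sum(sorted(probs, reverse=True)[:k])

confident = [0.80, 0.10, 0.05, 0.03, 0.02]   # toy distribution, one clear winner
uncertain = [0.22, 0.21, 0.20, 0.19, 0.18]   # toy distribution, nearly flat

print(top_k_filter(confident, 2))
print(mass_covered(confident, 2), mass_covered(uncertain, 2))
```

The same k covers most of the mass in the confident case and less than half in the uncertain case, which is the weakness described above.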

Top-P (Nucleus) Sampling

Keep the smallest set of tokens whose cumulative probability exceeds P. Introduced by Holtzman et al. (2019) "The Curious Case of Neural Text Degeneration." P=0.9 means: keep tokens until we've covered 90% of the probability mass, then renormalise and sample. Adapts to confidence level automatically.
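
The nucleus construction can be sketched as follows. This is a simplified illustration over a plain probability list, with hypothetical names and toy distributions, not the reference implementation from the paper.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in kept else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]

def nucleus_size(probs, p):
    """Number of tokens that survive top-p filtering."""
    return sum(1 for q in top_p_filter(probs, p) if q > 0)

confident = [0.85, 0.08, 0.04, 0.02, 0.01]   # toy distribution, one clear winner
uncertain = [0.22, 0.21, 0.20, 0.19, 0.18]   # toy distribution, nearly flat

print(nucleus_size(confident, 0.9), nucleus_size(uncertain, 0.9))
```

With the same P, the nucleus shrinks when the model is confident and grows when it is not.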

Min-P Sampling

Keep any token whose probability exceeds P × (probability of the top token). If top token has 60% probability and min_p=0.05, keep all tokens with ≥3% probability. Proposed in 2023–2024 as an improvement for creative tasks — handles both high- and low-confidence positions gracefully.
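
The relative threshold can be sketched in a few lines. The function name and the toy distribution are illustrative assumptions; the numbers mirror the 60% and 0.05 example above.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * (top token probability)."""
    threshold = min_p * max(probs)
    filtered = [q if q >= threshold else 0.0 for q in probs]
    total = sum(filtered)
    return [q / total for q in filtered]

# Toy distribution: the top token holds 60%, so min_p=0.05 gives a 3% cutoff.
probs = [0.60, 0.20, 0.10, 0.05, 0.04, 0.01]
print(min_p_filter(probs, 0.05))
```

Only the 1% token falls below the cutoff here; if the top token held less probability, the cutoff would drop with it.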

Nucleus Sampling in Detail

Ari Holtzman and colleagues at the University of Washington published "The Curious Case of Neural Text Degeneration" (ICLR 2020), showing that maximisation-based decoding leads to repetition, while sampling from the full distribution leads to incoherence because the low-probability tail is unreliable. Their nucleus (top-p) sampling truncates that tail and scored better in human evaluation on open-ended generation tasks.

The key insight: the number of tokens needed to cover 90% of probability mass varies enormously by context. When the next word is highly constrained ("The Eiffel Tower is located in ___"), the nucleus contains just a handful of tokens. When the context is ambiguous ("Once upon a time there was a ___"), the nucleus spans hundreds. Top-p handles both cases gracefully; top-k does not.

Holtzman et al. 2019 — Key Finding

Human evaluators consistently rated nucleus-sampled text as more coherent and interesting than text from pure temperature sampling or beam search. The paper showed that maximisation strategies (greedy, beam search) lead to "dull, repetitive, sometimes incoherent text" because they systematically avoid the long tail of plausible continuations that give human text its richness.

Combining Parameters

In many sampling implementations, temperature and top-p interact: temperature reshapes the full distribution first, then top-p truncates it. Raising temperature while lowering top-p works against itself, because temperature widens the distribution and top-p then cuts off the widened tail. A practical rule of thumb, consistent with the vendors' advice to change only one of the two: if you want creative output, raise temperature and leave top-p at or near 1.0. If you want focused output, lower top-p and leave temperature at 1.0.
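
The interaction is easier to see in a toy sampler that applies temperature first and top-p second. That ordering is an assumption of this sketch; hosted APIs do not necessarily document or follow it, and the function name is hypothetical.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature-scale, softmax, truncate to the nucleus, then draw one index."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Build the nucleus: most likely tokens until cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalises the weights internally.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

logits = [4.0, 3.0, 1.5, 0.5, -1.0]
rng = random.Random(0)
print([sample_next_token(logits, temperature=2.0, top_p=0.5, rng=rng) for _ in range(10)])
```

Even at T=2.0, a top_p of 0.5 confines sampling to the two most likely tokens, so most of the diversity that the higher temperature introduced is discarded again.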

Real-world defaults as of 2024: GPT-4 API defaults to temperature=1.0, top_p=1.0. Claude defaults to temperature=1.0. Llama.cpp defaults to temperature=0.8, top_k=40, top_p=0.95 — reflecting the open-source community's preference for slightly more conservative generation.

Top-K: Restrict sampling to the K tokens with highest probability; renormalise over only those K.
Top-P (Nucleus): Restrict sampling to the smallest token set covering cumulative probability ≥ P.
Min-P: Keep tokens whose probability ≥ min_p × max_token_probability. Scales with model confidence.
Repetition Penalty: Multiplicative discount applied to logits of tokens already appearing in context, reducing loops.
Practical Rule

For factual or code tasks: temperature 0.0–0.3, top-p 1.0. For creative writing: temperature 0.8–1.2, top-p 0.9–0.95. Never push temperature above 1.5 in production — the quality degradation is rarely worth any diversity gain at that point.

Lesson 2 Quiz

Sampling Strategies
What is the core problem with top-K sampling that nucleus (top-p) sampling solves?
Correct. A fixed K=50 covers nearly all of the probability mass when the model is confident but only a fraction of it when the model is uncertain. Top-p automatically expands or contracts based on where probability mass actually sits.
Not quite. The core issue is adaptive coverage: K=50 covers vastly different amounts of probability mass depending on model confidence. Top-p solves this by targeting a fixed fraction of mass rather than a fixed count.
According to Holtzman et al. (2019), what does pure maximisation decoding (greedy / beam search) produce?
Correct. The paper's central finding was that maximisation strategies produce degenerate text — looping, generic, and bland — because human language actually draws heavily from the non-maximum distribution.
Not quite. Holtzman et al. found the opposite: maximisation produces dull, repetitive text because it ignores the rich probability tail that gives natural human language its variety and richness.
How does min-p sampling differ from top-p sampling?
Correct. Min-p keeps tokens whose probability exceeds min_p × p(top_token). If the model is very confident, the absolute threshold rises; if uncertain, it falls. This is more adaptive than top-p's fixed cumulative cutoff.
Not quite. Min-p uses a relative threshold — it keeps tokens whose probability is at least min_p times the top token's probability — making the actual cutoff scale dynamically with model confidence.

Lab 2: Sampling Strategies

Conversation practice — ask 3+ questions to complete

Your Task

Explore the tradeoffs between top-k, top-p, and min-p with your AI tutor. Ask it to compare them numerically, explain when you'd choose one over another, or walk through the nucleus sampling algorithm step by step.

Suggested start: "Walk me through top-p sampling step by step with a concrete example — show me how to build the nucleus from a probability distribution."
Module 4 · Lesson 3

Temperature in Practice

How real deployed systems choose and tune temperature — from GitHub Copilot to GPT-4 to Stable Diffusion.
What do production engineers actually set temperature to, and why does it vary so much by task?

When GitHub Copilot launched in technical preview in June 2021, temperature was one of the settings that mattered most for code completion. Informal developer reports suggested that temperature values above 0.4 produced too many syntactically invalid suggestions, while values below 0.1 produced repetitive boilerplate, and that the production system used approximately 0.2–0.3 for single-line completions and slightly higher values for multi-line block suggestions, where more diverse structure is acceptable. This range was never disclosed officially and was only inferred by outside observers, but it illustrates a core principle: the right temperature depends on the cost of errors in your task.

Task-Dependent Temperature Guidelines

Different tasks have fundamentally different relationships between diversity and quality. A wrong character in a function name breaks code. A surprising metaphor in a poem is a feature. Engineers calibrate accordingly:

Code Generation

T = 0.0–0.3. Correctness dominates. Syntax errors compound through subsequent tokens. Even T=0.5 introduces sufficient randomness to produce subtle bugs in 10–20% of completions on complex tasks.

Factual Q&A

T = 0.0–0.4. Consistent, reproducible answers. Low temperature reduces hallucination frequency by forcing the model to its highest-confidence outputs. OpenAI recommends T=0 for classification tasks.

Creative Writing

T = 0.7–1.2. Diversity and surprise are valuable. The "curious case" phenomenon (Holtzman 2019) shows human raters prefer varied text. Claude's constitutional AI approach allows higher temperatures because safety filters operate at another layer.

Dialogue / Chat

T = 0.5–0.9. Balance between consistency and naturalness. ChatGPT's default T=1.0 for chat reflects the finding that robotic-sounding responses (low T) reduce user satisfaction even if they're technically correct.

Temperature and Hallucination

One of the most practically important connections: higher temperature increases hallucination rate. This is a direct consequence of the mechanism — raising temperature gives non-maximal tokens higher sampling probability. Among those non-maximal tokens are often factually incorrect continuations that the model still assigns some probability mass to.

A 2023 study from the University of Oxford and DeepMind (published as part of the TruthfulQA benchmark analysis) observed that LLaMA-based models showed approximately 15–20% higher hallucination rates at T=1.0 compared to T=0.3 on closed-domain factual questions. The effect was smaller for larger models — larger models assign more concentrated probability to correct tokens — but remained measurable across all model sizes tested.

This creates a fundamental engineering dilemma for RAG (retrieval-augmented generation) systems: you need enough temperature to produce natural-sounding prose, but not so much that the model departs from the retrieved context into invented facts.

OpenAI API Documentation Guidance (2024)

"Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both." — This explicit caution reflects the production reality that stacking both parameters is confusing to reason about and rarely beneficial.

Temperature in Diffusion Models

The concept extends beyond LLMs. In diffusion models like Stable Diffusion, "temperature" or "guidance scale" controls a related tradeoff. The classifier-free guidance scale (CFG) introduced by Ho and Salimans (2022) behaves analogously: low CFG produces diverse but prompt-misaligned images; high CFG produces prompt-faithful but artefact-heavy images. Midjourney's "stylize" parameter plays a similar role. Users quickly discover empirically what temperature researchers formalised: there is always a diversity-quality frontier.

Beam Search vs. Sampling

Beam search — maintaining the top B partial sequences at each step — was the dominant decoding strategy in neural machine translation (Google Translate used it extensively through 2020). It produces high-BLEU-score translations but notoriously repetitive, stilted text in open-ended generation. The transition from beam search to temperature sampling with top-p was one of the key engineering decisions that made ChatGPT feel conversational rather than robotic. Google Brain's 2022 paper "Scaling Instruction-Finetuned Language Models" (FLAN-T5) confirmed that sampling-based decoding with moderate temperature outperformed beam search on human preference evaluations for dialogue tasks.
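
A toy beam search makes the "top B partial sequences" idea concrete. The bigram table below is a hypothetical stand-in for a language model, chosen so that the greedy choice at step one is not part of the most probable full sequence.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the last token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "bird": 0.25},
    "a": {"bird": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}

def beam_search(beam_width, max_len=3):
    """Return the surviving beams as (log_probability, token_list), best first."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == "</s>":          # finished sequences carry over unchanged
                candidates.append((logp, seq))
                continue
            for tok, p in TOY_MODEL[seq[-1]].items():
                candidates.append((logp + math.log(p), seq + [tok]))
        # Keep only the beam_width most probable partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

for width in (1, 2):
    logp, seq = beam_search(width)[0]
    print(width, " ".join(seq), round(math.exp(logp), 2))
```

With a beam width of 1 the search is greedy and commits to "the" early; with a width of 2 it keeps "a" alive long enough to find the higher-probability sequence. Both runs are fully deterministic, which is the property that makes beam search precise but repetitive.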

Beam Search: Maintain the B most probable partial sequences at each step. Deterministic and high-precision but repetitive in open generation.
Guidance Scale (CFG): Diffusion model analogue to temperature: scales the influence of the conditioning signal on generation.
Hallucination Rate: Frequency at which model generates factually incorrect content; measurably increases with temperature.
The Core Tradeoff

Temperature is a dial on the diversity-accuracy frontier. Every production system must choose where it sits on that frontier based on what failure modes are most costly for its users. There is no universally correct setting — only context-appropriate ones.

Lesson 3 Quiz

Temperature in Practice
Why do code generation systems like GitHub Copilot use low temperature (≈ 0.2–0.3) rather than the default 1.0?
Correct. In code, a wrong variable name or misplaced bracket can make everything fail. Each token affects the syntactic validity of subsequent tokens, making diversity actively harmful for this task.
Not quite. The reason is error compounding — in code, incorrect tokens don't just produce a mildly wrong answer; they cascade into syntactically and logically broken output, so lower temperatures that prioritise high-confidence tokens are essential.
What is the observed relationship between temperature and hallucination rate?
Correct. Research including TruthfulQA-related analyses found 15–20% higher hallucination at T=1.0 versus T=0.3 on factual tasks, because higher temperature samples from more of the probability tail, including factually incorrect tokens.
Not quite. Higher temperature amplifies probability for non-maximal tokens, which include factually wrong completions. Studies found significantly higher hallucination rates at T=1.0 versus T=0.3 on closed-domain factual tasks.
Why did the field largely move from beam search to temperature sampling for conversational AI?
Correct. Papers including Google's FLAN-T5 work confirmed that human preference evaluations systematically favour temperature-sampled output over beam search for dialogue, because beam search's maximisation produces the "dull, repetitive" text Holtzman et al. identified.
Not quite. Human evaluation studies consistently showed preference for temperature-sampled output in dialogue tasks. Beam search optimises BLEU score in translation but produces robotic, repetitive text in open-ended conversation.

Lab 3: Temperature in Production

Conversation practice — ask 3+ questions to complete

Your Task

Discuss real-world temperature decisions with your AI tutor. Explore why different applications choose different settings, how hallucination risk scales with temperature, or what beam search vs. sampling looks like in practice.

Suggested start: "I'm building a customer-facing chatbot that answers questions about our products from a knowledge base. What temperature should I use and why? What are the risks at different settings?"
Module 4 · Lesson 4

Advanced Decoding

Repetition penalties, frequency/presence penalties, constrained decoding, and why "just set temperature" is never the whole story.
What additional controls do engineers apply when temperature and sampling alone still produce bad output?

When OpenAI released GPT-3 in June 2020, early API users discovered a persistent problem: even at moderate temperatures, the model would enter repetition loops — sometimes generating the same phrase dozens of times in a row. This wasn't a hallucination problem or a temperature problem. It was a structural quirk of autoregressive generation: once a token pattern achieves high probability, it can become self-reinforcing, creating a probability attractor. OpenAI's response was to introduce presence penalties and frequency penalties as first-class API parameters — logit adjustments that discount recently-used or frequently-used tokens. These parameters shipped before GPT-4 and remain in the API today.

Repetition and Why It Happens

Repetition in LLM output is not random. It arises because the model's probability distribution is conditioned on its own previous output. If the model generates "the" at position N, the context now contains "the", making "the" somewhat more likely at position N+1 (many common phrases begin with "the"). Over many steps, high-frequency tokens can create feedback loops.

This is distinct from the hallucination problem. A model can be repetitive and factually accurate simultaneously. The issue is purely about the dynamics of autoregressive sampling.

Frequency and Presence Penalties

OpenAI's API exposes two penalty parameters, both operating on logits before softmax:

Frequency Penalty: Reduces a token's logit proportionally to how many times it has already appeared in the output. The more a token has been used, the harder it is to sample again.
Presence Penalty: Applies a one-time flat penalty to any token that has appeared at all, regardless of count. Encourages topic diversity.

The formula: adjusted_logit = logit − frequency_penalty × token_count − presence_penalty × (token_appeared ? 1 : 0). OpenAI's API accepts values between −2.0 and 2.0 for both penalties. Positive values discourage repetition; negative values increase it (occasionally useful for tasks like generating structured data where repeated tokens are expected).
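
The formula translates almost line for line into code. This sketch assumes logits are held in a plain list indexed by token id and that `generated_ids` lists the ids produced so far; both names are illustrative.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract count-scaled and flat penalties from already-used tokens' logits."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # count >= 1 here, so the presence penalty applies exactly once per token.
        adjusted[token_id] -= frequency_penalty * count + presence_penalty
    return adjusted

logits = [2.0, 1.0, 0.5]
generated_ids = [0, 0, 0, 1]  # token 0 used three times, token 1 once
print(apply_penalties(logits, generated_ids, frequency_penalty=0.5, presence_penalty=0.3))
```

Token 0 loses 1.8 in total (three uses plus the flat penalty) and token 1 loses 0.8, while the unused token 2 is untouched.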

An important nuance: frequency penalty primarily prevents phrase-level looping. Presence penalty primarily prevents topic-level looping (mentioning the same concept repeatedly). They address different aspects of the repetition problem and are often used together.

Practical Setting

For most chat applications: frequency_penalty = 0.5–1.0, presence_penalty = 0.0–0.6. For creative writing: presence_penalty 0.5–1.0 encourages narrative to move forward. For code generation: both near 0 — repetition in code is often intentional (variable reuse, standard patterns).

Constrained and Structured Decoding

A different class of problem: getting LLMs to output valid JSON, SQL, or other structured formats reliably. Temperature and sampling handle randomness, but they don't enforce schema compliance. The solution is constrained decoding: at each token position, assign zero probability to any token that would make the output invalid according to the target schema, then sample from only the valid remainder.

Libraries like Outlines (from Normal Computing, 2023) and Microsoft's Guidance implement this. They convert a JSON schema or regular grammar into a set of logit masks that change at each step as the model generates partial output. The result: the model samples freely from its distribution but can only ever produce syntactically valid output. Temperature still controls diversity within valid outputs — it's not bypassed, just constrained.
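
The masking idea can be illustrated without any library. The sketch below uses a hypothetical seven-token vocabulary, a hand-written per-position allow-list in place of a compiled grammar, and a stub in place of a real model; it does not reflect how Outlines or Guidance are implemented internally.

```python
import json
import math
import random

VOCAB = ["{", '"name"', ":", '"Ada"', '"Bob"', "}", "hello"]

# Hypothetical grammar for {"name": "Ada" | "Bob"}: tokens allowed at each position.
ALLOWED = [{"{"}, {'"name"'}, {":"}, {'"Ada"', '"Bob"'}, {"}"}]

def masked_sample(logits, allowed, temperature=1.0, rng=random):
    """Set invalid tokens' logits to -inf, then temperature-softmax and sample."""
    masked = [x / temperature if VOCAB[i] in allowed else -math.inf
              for i, x in enumerate(logits)]
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return rng.choices(range(len(VOCAB)), weights=[e / total for e in exps], k=1)[0]

def stub_logits(prefix):
    """Stand-in for a model that would much rather say 'hello' than emit JSON."""
    return [0.0, 0.0, 0.0, 1.0, 0.5, 0.0, 5.0]

def generate(logits_fn, temperature=1.0, rng=random):
    out = []
    for allowed in ALLOWED:
        out.append(VOCAB[masked_sample(logits_fn(out), allowed, temperature, rng)])
    return "".join(out)

print(json.loads(generate(stub_logits, temperature=1.0, rng=random.Random(0))))
```

The stub model strongly prefers an invalid token, yet every run parses as JSON. Temperature only affects the one position with a real choice, between "Ada" and "Bob".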

Outlines / Structured Generation (2023)

The Outlines library from Normal Computing demonstrated that constrained decoding with LLM-native grammars could reliably produce valid JSON, SQL, and regex-matched output with no accuracy cost versus unconstrained generation — and dramatically lower parsing error rates. This approach is now built into llama.cpp's grammar sampling feature (--grammar-file flag) and adopted by vLLM and other inference engines.

Mirostat: Adaptive Truncation

A 2020 paper by Basu et al. proposed Mirostat, an algorithm that adjusts its sampling cutoff at each token to maintain a target perplexity (surprise level). Instead of relying on a fixed temperature or a fixed top-k, you set a target surprise value measured by cross-entropy, and a feedback loop widens or narrows the top-k truncation step by step to stay near that target.

Mirostat is implemented in llama.cpp and Kobold AI. Users report it handles long-form generation better than fixed settings, avoiding the quality collapse that sometimes occurs when temperature is set high and the model wanders into low-quality regions. It is a popular option for local model inference as of 2024 and illustrates a direction the field is exploring: away from fixed hyperparameters toward adaptive, feedback-controlled decoding.
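
The feedback idea can be sketched in a simplified form. The code below is written in the spirit of Mirostat rather than as a faithful copy of the published algorithm: it caps the allowed per-token surprise at `mu` bits and nudges that cap toward a target `tau` after every draw. The function name, default values, and update rule details are assumptions of this sketch.

```python
import math
import random

def surprise_feedback_step(probs, mu, tau=3.0, eta=0.1, rng=random):
    """One step of a simplified surprise-controlled sampler.

    probs: next-token probabilities; mu: current surprise cap in bits;
    tau: target surprise in bits; eta: feedback learning rate.
    Returns (sampled_index, updated_mu).
    """
    # Keep tokens whose surprise (-log2 p) is within the cap.
    kept = [i for i, p in enumerate(probs) if p > 0 and -math.log2(p) <= mu]
    if not kept:
        # Fall back to the single most likely token if the cap excludes everything.
        kept = [max(range(len(probs)), key=lambda i: probs[i])]
    total = sum(probs[i] for i in kept)
    weights = [probs[i] / total for i in kept]
    pick = rng.choices(range(len(kept)), weights=weights, k=1)[0]

    # Feedback: raise the cap if the draw was too predictable, lower it otherwise.
    observed = -math.log2(weights[pick])
    return kept[pick], mu - eta * (observed - tau)

probs = [0.5, 0.25, 0.125, 0.125]
token, mu = surprise_feedback_step(probs, mu=1.5, rng=random.Random(0))
print(token, round(mu, 2))
```

With a cap of 1.5 bits only the top token survives, the observed surprise is zero, and the cap rises so that later steps admit more candidates.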

Temperature in Reasoning Models

OpenAI's o1 and o3 models (announced in 2024) introduced a different paradigm: chain-of-thought reasoning tokens generated internally before the final answer. These reasoning traces are reportedly generated at moderate temperature to explore different solution paths, while the final answer extraction uses lower temperature for consistency. This "think hot, answer cold" approach, sampling creatively during deliberation but committing precisely at output, represents a sophisticated application of temperature as a cognitive control.

Google's Gemini team reported similar findings in their 2023 technical report: best-of-N sampling (generating N outputs at higher temperature, then selecting the best by a verifier) outperforms single greedy generation on mathematical reasoning by significant margins, despite introducing more randomness. The diversity creates a search space that the verifier can then optimise over.
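
Best-of-N selection reduces to a short loop. In this sketch `generate` and `score` are hypothetical callables standing in for a sampled model call and a verifier; the noisy arithmetic solver is a toy used only to make the example runnable.

```python
import random

def best_of_n(generate, score, n=8):
    """Draw n candidates and return the one the verifier scores highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

rng = random.Random(0)

def noisy_solver():
    """Toy stand-in for sampling an answer to 12 * 12 at higher temperature."""
    return 144 + rng.choice([-2, -1, 0, 0, 1, 2])

def verifier(answer):
    """Toy verifier: higher score for answers closer to the true product."""
    return -abs(answer - 12 * 12)

print(best_of_n(noisy_solver, verifier, n=8))
```

The approach only helps when the verifier is more reliable than a single sample, since the final answer can be no better than the best candidate the verifier is able to recognise.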

Constrained Decoding: Masking invalid tokens' logits to negative infinity before sampling, enforcing grammatical or schema constraints.
Mirostat: Adaptive algorithm that adjusts its top-k truncation at each step to maintain a target perplexity.
Best-of-N Sampling: Generate N independent outputs at moderate temperature; select the highest-scoring by a separate verifier.
Think Hot, Answer Cold: Informal term for using higher temperature during chain-of-thought reasoning and lower temperature for final output extraction.
The Big Picture

Temperature is the foundation, but production LLM decoding stacks multiple layers: temperature → top-p/top-k/min-p → frequency/presence penalties → constrained grammar masks → optional adaptive algorithms like Mirostat. Understanding each layer independently lets you diagnose which one to adjust when output quality problems arise.

Lesson 4 Quiz

Advanced Decoding
What is the difference between frequency penalty and presence penalty in OpenAI's API?
Correct. Frequency penalty grows with count (combats phrase-level loops); presence penalty is binary — any appearance triggers it (combats topic-level repetition). They address different repetition patterns.
Not quite. Frequency penalty multiplies the discount by occurrence count — the more often a token has appeared, the harder it is to use again. Presence penalty is a flat one-time discount applied to any token that has appeared at all, regardless of count.
How does constrained decoding (e.g., Outlines library) interact with temperature?
Correct. Constrained decoding sets the logits of invalid tokens to −∞, then temperature-adjusted softmax sampling proceeds over the valid remainder. Temperature still governs how diverse the valid output is; it is not bypassed.
Not quite. Constrained decoding masks invalid tokens (sets their logits to −∞) before sampling. After masking, the temperature-adjusted softmax operates normally over the remaining valid tokens, so temperature still controls diversity within the valid space.
What does the "think hot, answer cold" approach in models like OpenAI o1 mean?
Correct. Using higher temperature during deliberation creates a diverse search space of reasoning paths. Lower temperature at final output commits to the most reliable conclusion identified through that exploration — creativity in process, precision in result.
Not quite. "Think hot, answer cold" refers to using higher temperature during the internal chain-of-thought reasoning phase — to explore multiple solution paths creatively — then using lower temperature when extracting the final answer to commit precisely to the best conclusion reached.

Lab 4: Advanced Decoding

Conversation practice — ask 3+ questions to complete

Your Task

Explore advanced decoding techniques with your AI tutor: repetition penalties, constrained/structured generation, Mirostat adaptive sampling, and the emerging "think hot, answer cold" pattern. Ask for practical advice or dig into how these systems work mechanically.

Suggested start: "I need an LLM to reliably output JSON matching a specific schema. Walk me through constrained decoding — how does it mask logits, and what happens to temperature during that process?"

Module 4 Test

Temperature and Sampling — 15 questions · Pass at 80%
1. What operation does temperature T perform on logits before softmax?
Correct. Each logit is divided by T before softmax.
Not quite. Temperature divides each logit by T.
2. At temperature T = 0.0 (greedy decoding), what token is always selected?
Correct. Greedy decoding always selects the argmax token — the one with highest probability.
Not quite. At T=0, greedy decoding always picks the highest-probability token.
3. What does high temperature (e.g., T = 2.0) do to the probability distribution?
Correct. Dividing logits by a large T compresses differences, making the distribution more uniform.
Not quite. High T flattens the distribution — all tokens become more equally probable.
4. From which field was the term "temperature" borrowed for LLM sampling?
Correct. The Boltzmann distribution in statistical physics uses the identical exponential-over-sum form.
Not quite. The term comes from statistical physics — specifically the Boltzmann distribution.
5. What is the core weakness of top-K sampling that nucleus (top-p) sampling addresses?
Correct. Top-p adapts to model confidence by covering a fixed fraction of probability mass rather than a fixed count.
Not quite. The weakness is that K=50 covers wildly different amounts of probability mass depending on how confident the model is — top-p fixes this by targeting cumulative probability.
6. In nucleus (top-p) sampling with p = 0.9, what is kept?
Correct. Nucleus sampling keeps the minimum set of top tokens summing to p, then renormalises and samples.
Not quite. Top-p keeps the smallest set of highest-probability tokens whose cumulative total reaches 0.9, then renormalises over only those tokens.
7. How does min-p differ from top-p in its threshold calculation?
Correct. The min-p threshold is min_p × p(top_token), so when the model is confident the absolute cutoff rises, and when uncertain it falls.
Not quite. Min-p scales its threshold relative to the top token's probability: threshold = min_p × max_token_probability.
8. What temperature range do production code-generation systems typically use, and why?
Correct. Low temperature for code — diversity causes compounding errors because each token affects the syntactic validity of everything that follows.
Not quite. Code generation typically uses T = 0.2–0.3 because incorrect tokens cascade into syntactically broken output.
9. What did Holtzman et al. (2019) observe about pure maximisation decoding (greedy/beam search) for open-ended generation?
Correct. The paper's title ("The Curious Case of Neural Text Degeneration") refers directly to this finding.
Not quite. Holtzman et al. found maximisation produces degenerate — dull, repetitive, incoherent — text that humans strongly disprefer.
10. How does higher temperature affect hallucination rate in factual tasks?
Correct. Research found ~15–20% higher hallucination at T=1.0 vs T=0.3 on factual tasks, because higher T samples from a wider distribution including incorrect tokens.
Not quite. Higher temperature amplifies the probability of non-maximal tokens, which include factually incorrect completions — directly raising hallucination rate.
11. What does the frequency penalty do to logits in OpenAI's API?
Correct. The penalty grows with token count — the more a token has appeared, the bigger the logit discount it receives.
Not quite. Frequency penalty subtracts from a token's logit proportionally to the number of times that token has already appeared in the generated output.
12. How does constrained decoding (e.g., Outlines) enforce JSON schema compliance?
Correct. Masking invalid logits at each step means the model can only ever produce tokens that keep the output syntactically valid per the schema; temperature still governs diversity within valid options.
Not quite. Constrained decoding masks invalid tokens' logits to −∞ at each generation step, making it impossible to sample them while leaving temperature to control diversity among valid tokens.
13. What does the Mirostat algorithm do differently from standard temperature sampling?
Correct. Mirostat is a feedback-controlled algorithm: it measures the surprise of each generated token and adjusts its top-k truncation to stay near a target perplexity, rather than using fixed settings throughout generation.
Not quite. Mirostat measures the surprise of each token as it's generated and adjusts its top-k truncation step by step to maintain a user-specified target perplexity.
14. What does "think hot, answer cold" mean in the context of reasoning models like OpenAI o1?
Correct. The deliberation phase benefits from diversity (exploring multiple paths); the answer extraction phase benefits from precision (committing to the best conclusion).
Not quite. "Think hot, answer cold" describes using higher temperature during internal chain-of-thought to explore diverse reasoning paths, then lower temperature when committing to the final answer.
15. Why is setting both temperature and top-p to high values simultaneously counterproductive?
Correct. OpenAI's own documentation warns against adjusting both simultaneously — temperature spreads probability mass while top-p cuts the tail, so the net effect is less predictable than adjusting either alone.
Not quite. Raising temperature widens the distribution; top-p then truncates that wider tail. The two effects work against each other, making the combined behaviour difficult to reason about — Anthropic and OpenAI both advise adjusting only one at a time.