Module 5 · Lesson 1

What Is Emergence?

How scale unlocks capabilities no one explicitly programmed

Why do larger models sometimes do things smaller models simply cannot?

In late 2021, researchers at Google Brain were running standard benchmark evaluations on a 540-billion-parameter model called PaLM. They expected incremental improvements. What they saw instead stopped them: on a task requiring multi-step arithmetic reasoning, model performance was essentially zero at smaller scales — then jumped sharply past a threshold, as if a switch had been thrown.

They called the paper "Emergent Abilities of Large Language Models." The finding wasn't that bigger was better. It was that something qualitatively different appeared to exist above certain scale thresholds — abilities the training process had never been designed to produce.

Defining Emergence

In complexity science, emergence describes properties of a system that arise from the interaction of its components but cannot be predicted from those components alone. Water is wet; individual H₂O molecules are not. A flock of starlings turns in perfect unison with no coordinator. The whole exhibits behaviours absent from any part.

In LLMs, emergence takes a specific empirical form: a capability that is near-random at smaller scales and substantially above chance at larger ones, with a relatively sharp transition between them. The 2022 paper by Wei et al. catalogued over 137 such tasks, from analogical reasoning to word-unscrambling to IPA transliteration, where this pattern appeared.

Key Finding — Wei et al. 2022

Across 137 benchmark tasks, emergent abilities were defined as capabilities that scored near chance at small model sizes and substantially above chance only above a compute/parameter threshold. The threshold varied by task but the pattern was consistent: discontinuous improvement, not smooth scaling.
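One way to make this definition operational is to scan accuracy-by-scale results for the first model size that clears chance by some margin. The sketch below does that. The function name, the margin, and every data point are illustrative assumptions, not figures from Wei et al.

```python
from typing import List, Optional, Tuple


def emergence_threshold(
    results: List[Tuple[float, float]],
    chance: float,
    margin: float = 0.10,
) -> Optional[float]:
    """Return the smallest scale whose accuracy exceeds chance + margin.

    `results` holds (scale, accuracy) pairs; scale could be parameter
    count or training FLOPs. Returns None if no scale clears the bar.
    """
    for scale, accuracy in sorted(results):
        if accuracy > chance + margin:
            return scale
    return None


# Hypothetical results for a 4-way multiple-choice task (chance = 0.25).
# These numbers are invented purely for illustration.
hypothetical_results = [
    (1e9, 0.24),
    (1e10, 0.26),
    (1e11, 0.27),
    (5e11, 0.61),
]

threshold = emergence_threshold(hypothetical_results, chance=0.25)
print(f"First scale clearly above chance: {threshold:.0e} parameters")
```

The margin is a judgment call: too small and noise looks like emergence, too large and a real transition is reported late.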

The Phase Transition Analogy

A useful physical analogy is the phase transition. Water at 99°C is still liquid — then one degree of added heat reorganises the entire molecular structure into steam. The system crosses a threshold and its macroscopic behaviour changes categorically, not gradually.

With LLMs, the "temperature" is compute — measured in FLOPs (floating-point operations) — or more practically, parameter count and dataset size. Below the threshold, the model produces random-looking output on the task. Above it, structure appears.

Critically, the training objective never changes. The model is still predicting the next token throughout. The emergent behaviour is not trained in directly — it surfaces from the learned representations reaching sufficient complexity to support it.

Emergence: A capability that appears sharply above a scale threshold, absent at smaller scales, without being explicitly trained for.

FLOPs: Floating-point operations, a measure of compute used to train a model. Along with parameters and data, one of the three axes of scaling.

Phase transition: A discontinuous change in system behaviour at a critical threshold; the analogy used to describe emergent ability onset.

Why This Matters for AI Development

Emergence has practical consequences. If capabilities appear discontinuously, they are difficult to forecast. A model at 10 billion parameters may fail completely on a task; a model at 100 billion may pass with high accuracy. Intermediate checkpoints give no warning the transition is coming.

This creates real challenges for organisations deploying LLMs. You cannot always evaluate your way to safety by testing smaller models first. The capability landscape can shift discontinuously with scale — meaning evaluations must be redone on each new generation of model, not extrapolated from earlier ones.

It also raises a deeper question: are emergent abilities a property of the model, or a property of our metrics? Some researchers, including Schaeffer et al. in 2023, argued that apparent emergence is partly a measurement artefact — when tasks are scored with non-linear metrics (e.g., exact match: all-or-nothing), smooth underlying improvements look like sharp jumps. The debate remains active, but most practitioners treat emergence as a real operational phenomenon regardless of its theoretical status.

The Debate

Schaeffer, Miranda & Koyejo (2023) showed that many apparent emergent abilities disappear when continuous metrics replace discrete ones — suggesting the discontinuity is partly in how we measure, not in how the model changes. But even if the threshold is softer than it appears, the practical engineering implication stands: you cannot reliably predict new capabilities by interpolating from smaller models.
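The metric argument can be seen with a few lines of arithmetic. The sketch below assumes each answer token is correct independently with the same probability, which is a simplification, and the per-token accuracies are made-up values standing in for successively larger models.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that all `answer_length` tokens are correct,
    assuming tokens are correct independently (a simplifying assumption)."""
    return per_token_accuracy ** answer_length


# A smooth, steady improvement in per-token accuracy (illustrative values).
per_token_accuracies = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
ANSWER_LENGTH = 10  # e.g. a 10-token answer scored all-or-nothing

for p in per_token_accuracies:
    em = exact_match_rate(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

Per-token accuracy rises in even steps, yet exact match stays near zero for the first few rows and then climbs steeply. The same underlying progress looks gradual under one metric and abrupt under the other.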

Lesson 1 Quiz

What Is Emergence? — Check your understanding

1. In the context of LLMs, emergence refers to capabilities that appear:

Correct. Emergent abilities are defined by their discontinuous appearance — near-random below a threshold, then substantially above chance past it. This is the pattern documented in Wei et al. 2022.

Not quite. Emergence is defined precisely by its non-gradual nature — the capability is absent (near random) at smaller scales and appears sharply past a compute or parameter threshold.

2. The Wei et al. 2022 paper documented emergent abilities across how many benchmark tasks?

Correct. Wei et al. catalogued over 137 tasks showing the emergent pattern — near-chance at small scale, substantially above chance at larger scale.

The paper documented emergence across over 137 benchmark tasks — a much larger catalogue than most people expect.

3. What did Schaeffer et al. (2023) argue about emergent abilities?

Correct. Schaeffer et al. argued that discontinuous-looking emergence often reflects the choice of metric (e.g., exact-match scoring) rather than a true discontinuity in model capability.

Schaeffer et al. made a methodological argument: that non-linear metrics like exact-match can make smooth underlying improvements appear as sharp jumps, producing the appearance of emergence.

Lab 1 — Mapping Emergence

Explore what emergence means and how it appears in real models

Lab Objective

You'll explore the concept of emergence in LLMs by discussing real examples, the phase transition analogy, and the measurement debate. Ask the assistant to help you understand where emergence shows up and why it challenges standard evaluation approaches.

Suggested start: "Can you give me a concrete example of an emergent ability and explain why it's considered emergent rather than just a performance improvement?"

AESOP Lab Assistant

Emergence

Welcome to Lab 1. I'm here to help you explore emergent capabilities in LLMs — what they are, how they were discovered, and why they matter for AI evaluation. What would you like to dig into?

Module 5 · Lesson 2

Reasoning and Chain-of-Thought

How prompting unlocked multi-step reasoning that scale alone could not

Why does asking a model to "think step by step" dramatically improve its accuracy on hard problems?

In January 2022, Jason Wei, Xuezhi Wang, and colleagues at Google published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Their finding was stark: on GSM8K, a dataset of grade-school arithmetic word problems, PaLM 540B scored around 17% with standard prompting. With chain-of-thought prompting, which shows the model a few worked examples of reasoning steps before asking the question, the same model scored around 58%. The intervention was entirely in the prompt.

More striking: chain-of-thought only helped models above roughly 100 billion parameters. Below that threshold, adding reasoning steps hurt performance. The capability was emergent — it required sufficient model capacity to use the scaffolding rather than be confused by it.

What Chain-of-Thought Actually Does

Standard prompting gives a model: question → answer. Chain-of-thought (CoT) gives it: question → intermediate reasoning steps → answer. The model is shown examples of this format in the prompt (few-shot CoT) or simply instructed to reason step by step (zero-shot CoT, as in the Kojima et al. 2022 paper "Large Language Models are Zero-Shot Reasoners").

The mechanism appears to work because generating intermediate tokens forces the model to allocate representational capacity to the problem structure before committing to an answer. A model predicting the final answer in one step must compress all reasoning into the forward pass reaching the answer token. A model producing intermediate steps can build up partial results token by token, effectively using the generated text as a working memory.

This is not fully understood. Current interpretability research (as of 2024) cannot fully trace the internal computations. But the behavioural signature is consistent: CoT helps most on tasks requiring multi-step compositional reasoning — arithmetic, symbolic manipulation, commonsense reasoning chains.
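The difference between the two prompt formats is easiest to see side by side. The sketch below builds a standard and a few-shot CoT prompt from the same exemplar. The `Exemplar` class, the `build_prompt` helper, the "Q:/A:" layout, and the exemplar itself are illustrative assumptions, not the prompts used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    question: str
    reasoning: str  # intermediate steps; only used for CoT prompts
    answer: str


def build_prompt(exemplars: List[Exemplar], question: str, chain_of_thought: bool) -> str:
    """Format exemplars followed by the target question.

    With chain_of_thought=True each exemplar shows its reasoning before
    the answer; otherwise only the final answer is shown.
    """
    parts = []
    for ex in exemplars:
        if chain_of_thought:
            parts.append(f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}.")
        else:
            parts.append(f"Q: {ex.question}\nA: The answer is {ex.answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# A made-up exemplar for illustration.
exemplars = [
    Exemplar(
        question="A box holds 4 pens. Tom has 3 boxes and 2 loose pens. How many pens does he have?",
        reasoning="3 boxes of 4 pens is 12 pens. 12 + 2 = 14.",
        answer="14",
    )
]
target = "A shelf holds 6 books. Ana has 2 shelves and 5 loose books. How many books does she have?"

print(build_prompt(exemplars, target, chain_of_thought=False))
print("-----")
print(build_prompt(exemplars, target, chain_of_thought=True))
```

Both prompts end at "A:" so the model continues from the same point. Only the CoT version has shown it that reasoning comes before the answer.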

Zero-Shot CoT — "Let's think step by step"

Kojima et al. (2022) showed that appending "Let's think step by step" to a prompt, with no examples, dramatically improved accuracy across arithmetic, symbolic, and commonsense tasks. On MultiArith, this zero-shot trigger improved GPT-3 accuracy from 17.7% to 78.7%. The phrase appears to elicit reasoning behaviour already latent in the model's pretrained weights.
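Zero-shot CoT is commonly run in two stages: first elicit the reasoning, then ask for the final answer given that reasoning. The sketch below shows that shape. The `generate` callable stands in for whatever model API is in use, and `fake_generate`, the prompt layout, and the answer-extraction wording are hypothetical placeholders rather than the paper's exact templates.

```python
from typing import Callable

TRIGGER = "Let's think step by step."
ANSWER_CUE = "Therefore, the answer is"


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """Two-stage zero-shot CoT: reasoning first, then answer extraction."""
    # Stage 1: append the trigger phrase and let the model reason.
    reasoning_prompt = f"Q: {question}\nA: {TRIGGER}"
    reasoning = generate(reasoning_prompt)
    # Stage 2: feed the reasoning back and ask for the final answer only.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_CUE}"
    return generate(answer_prompt).strip()


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned text."""
    if prompt.endswith(ANSWER_CUE):
        return " 8.20"
    return "7 apples cost 8.40 and 4 oranges cost 3.40, so the total is 11.80."


question = (
    "Apples cost $1.20 and oranges cost $0.85. Maria buys 7 apples and "
    "4 oranges and pays with a $20 bill. How much change does she receive?"
)
print(zero_shot_cot(question, fake_generate))
```

Swapping `fake_generate` for a real completion call is the only change needed to try this against an actual model.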

Self-Consistency: Sampling Multiple Chains

Wang et al. (2022) extended CoT with self-consistency: instead of generating one reasoning chain, sample many diverse chains and take a majority vote on the final answer. On GSM8K with PaLM 540B, self-consistency pushed accuracy from 58% to 74% — better than fine-tuning in some conditions.

The intuition is that different reasoning paths to the same answer reinforce each other, while errors tend to be inconsistent across chains. This is an emergent property of scale: a small model's chains are too random for majority voting to help; a large model's chains, while individually imperfect, cluster around correct answers enough that aggregation improves reliability.
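The voting step itself is simple. The sketch below assumes the reasoning chains have already been sampled from a model, and it uses a crude heuristic (take the last number in the chain) to extract each final answer. The helper names and the sample chains are illustrative, not taken from Wang et al.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Treat the last number in a reasoning chain as its final answer.
    This is a crude heuristic chosen for illustration."""
    numbers = re.findall(r"\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistent_answer(chains: List[str]) -> Optional[str]:
    """Majority vote over the final answers of several sampled chains."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hand-written stand-ins for sampled chains; the third contains an error.
sampled_chains = [
    "7 x 1.20 = 8.40 and 4 x 0.85 = 3.40, so the total is 11.80. Change: 20 - 11.80 = 8.20",
    "Apples cost 8.40, oranges cost 3.40, total 11.80. 20 - 11.80 = 8.20",
    "7 x 1.20 = 8.40, 4 x 0.85 = 3.20, total 11.60. 20 - 11.60 = 8.40",
]

print(self_consistent_answer(sampled_chains))
```

The two correct chains agree on the same value while the faulty chain lands elsewhere, so the vote recovers the right answer even though one sample was wrong.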

Reasoning Task Performance Across Scales

The table below shows published benchmark results illustrating the scale-dependence of CoT gains:

Model	Params	GSM8K (Standard)	GSM8K (CoT)
GPT-3	175B	~17%	~46%
PaLM	540B	~17%	~58%
PaLM + Self-Consistency	540B	n/a	~74%
GPT-4	undisclosed	n/a	~92%

The jump from standard to CoT prompting is itself an emergent phenomenon — it does not appear at small model sizes. And the jump from CoT to self-consistency compounds the emergent gain.

Chain-of-Thought (CoT): A prompting technique where intermediate reasoning steps are shown or requested before the final answer, improving accuracy on multi-step tasks.

Zero-Shot CoT: Triggering reasoning with a natural language instruction ("Let's think step by step") rather than examples, without task-specific fine-tuning.

Self-Consistency: Sampling multiple reasoning chains and majority-voting on the answer, improving reliability over single-chain CoT.

Why This Matters for Practitioners

CoT prompting is among the most robust low-cost interventions available to anyone using LLMs today: it requires no training, only extra generated tokens. On complex reasoning tasks, such as writing business logic, diagnosing errors, or planning multi-step workflows, asking for step-by-step reasoning often improves accuracy. The mechanism is emergent, but the technique is available to any user of a sufficiently large model.

Lesson 2 Quiz

Reasoning and Chain-of-Thought — Check your understanding

1. What was the approximate GSM8K accuracy improvement when chain-of-thought prompting was applied to PaLM 540B?

Correct. Standard prompting scored around 17% on GSM8K for PaLM 540B; chain-of-thought prompting raised this to approximately 58% — one of the clearest demonstrations of the technique's power.

The documented result was more dramatic: from ~17% with standard prompting to ~58% with chain-of-thought, using the same model with no weight updates.

2. Self-consistency prompting improves over standard CoT by:

Correct. Self-consistency generates many diverse reasoning chains and takes a majority vote — exploiting the fact that correct answers cluster across chains while errors tend to be inconsistent.

Self-consistency is a prompting-only technique: sample many chains from the same model, vote on the answer. No fine-tuning or model changes involved.

3. Chain-of-thought prompting is an emergent ability because:

Correct. Wei et al. showed that CoT prompting actually hurts smaller models — they lack the capacity to use reasoning scaffolding productively. The benefit appears sharply above ~100B parameters.

This is the definition of emergence: CoT helps large models but hurts small ones. Below roughly 100B parameters, adding reasoning steps confuses rather than helps the model.

Lab 2 — CoT in Practice

Experiment with chain-of-thought reasoning and self-consistency

Lab Objective

Explore chain-of-thought reasoning by asking the assistant to solve multi-step problems with and without step-by-step reasoning. Compare approaches, discuss why CoT helps, and examine what types of tasks benefit most.

Suggested start: "Can you solve this word problem twice — once with a direct answer, then with chain-of-thought steps? Problem: A store sells apples at $1.20 each and oranges at $0.85 each. Maria buys 7 apples and 4 oranges. She pays with a $20 bill. How much change does she receive?"

AESOP Lab Assistant

Chain-of-Thought

Welcome to Lab 2. Let's explore chain-of-thought reasoning hands-on. You can give me problems to solve with and without step-by-step reasoning, or ask me to explain why CoT helps on certain task types. What would you like to try?

Module 5 · Lesson 3

In-Context Learning and Few-Shot Generalization

How LLMs adapt to new tasks from examples in the prompt — without updating weights

How can a model learn a new task from three examples in a prompt when it was never trained to do that task?

When OpenAI published the GPT-3 paper in May 2020, the headline was scale: 175 billion parameters, trained on hundreds of billions of words. But buried in the results was something more surprising. Without any fine-tuning, GPT-3 could be shown three examples of a task in the prompt — say, English-to-French translation — and then perform it on new inputs with accuracy competitive with fine-tuned smaller models.

The researchers called this few-shot learning. The model was adapting to the task from context alone. No gradient updates, no weight changes. Something in the forward pass was extracting the task structure from the examples and applying it to the new input. The community did not have a clear mechanistic account of how. In some ways, it still doesn't.

The Three Modes of In-Context Learning

Zero-shot: The model receives only a task description. "Translate the following to French: ___." No examples. This works surprisingly well for tasks the model has seen described in pretraining data, but fails on tasks requiring specific format or niche knowledge.

Few-shot: The model receives k examples of input-output pairs before the target input. The examples act as implicit task instructions. The model infers the mapping pattern and applies it. Quality generally improves with k, with diminishing returns after roughly 10-20 examples depending on task complexity.

Many-shot: Agarwal et al. (2024, Google DeepMind) showed that Gemini 1.5 Pro, with its 1-million-token context window, could use hundreds or thousands of examples in-context, significantly narrowing the gap between in-context learning and fine-tuning. The few-shot regime extends further than originally thought.
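In prompt terms, the three modes differ only in how many examples precede the query. The sketch below builds a zero-shot or k-shot prompt from one pool of labelled examples. The helper name, the "Review:/Sentiment:" layout, and the example reviews are assumptions made for illustration.

```python
from typing import List, Tuple

Example = Tuple[str, str]  # (input text, label)


def build_icl_prompt(instruction: str, examples: List[Example], query: str, k: int) -> str:
    """Build a prompt with the first k examples; k = 0 gives zero-shot."""
    blocks = [instruction]
    for text, label in examples[:k]:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    # The query uses the same layout, with the label left for the model.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


instruction = "Classify each product review as Positive, Negative, or Neutral."
example_pool = [
    ("Works exactly as described, very happy.", "Positive"),
    ("Stopped charging after two days.", "Negative"),
    ("It arrived on time.", "Neutral"),
]
query = "The battery life is great but the screen is disappointing for the price."

print(build_icl_prompt(instruction, example_pool, query, k=0))
print("-----")
print(build_icl_prompt(instruction, example_pool, query, k=3))
```

A many-shot prompt is the same construction with a much larger example pool and k in the hundreds or thousands, limited only by the context window.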

The Mechanism Debate

Min et al. (2022) ran a controlled experiment: they replaced all the correct labels in few-shot examples with random labels. Accuracy barely dropped. This suggested models were using examples to infer the format and input distribution rather than learning the label mapping. But other tasks showed strong label sensitivity, so the picture is mixed. The most accurate current view: ICL uses examples for both format inference and, at sufficient scale, actual task learning.
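The manipulation in this experiment is easy to reproduce on your own prompts: keep every example input, but replace each label with one drawn at random from the label set. The sketch below is a minimal version of that idea. The function names, the fixed seed, and the sample reviews are illustrative assumptions, not the authors' code.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, label)


def randomise_labels(
    examples: Sequence[Example], label_set: Sequence[str], seed: int = 0
) -> List[Example]:
    """Keep each input text but replace its label with one drawn uniformly
    at random from `label_set` (it may match the true label by chance)."""
    rng = random.Random(seed)
    return [(text, rng.choice(label_set)) for text, _ in examples]


def format_demonstrations(examples: Sequence[Example]) -> str:
    """Render examples in the same layout a few-shot prompt would use."""
    return "\n\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)


labels = ["Positive", "Negative", "Neutral"]
gold_examples = [
    ("Works exactly as described, very happy.", "Positive"),
    ("Stopped charging after two days.", "Negative"),
    ("It arrived on time.", "Neutral"),
]

print(format_demonstrations(gold_examples))
print("-----")
print(format_demonstrations(randomise_labels(gold_examples, labels)))
```

Running the same queries with both versions of the demonstrations shows how much a given task depends on the label mapping versus the format and input distribution.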

In-Context Learning as Implicit Gradient Descent

Akyürek et al. (2022) and Dai et al. (2022) proposed that in-context learning can be understood as a form of implicit gradient descent performed in the forward pass. The attention mechanism, they argued, implements something mathematically similar to a single gradient update on the in-context examples — without actually modifying any weights.

This "dual form" interpretation remains theoretical and contested. But it offers an intuition: the model is not merely pattern-matching the surface form of examples. It is performing something structurally analogous to fast task adaptation — compressed into the forward pass rather than the training loop.
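A stripped-down case shows where the analogy comes from. For linear regression with weights initialised at zero, the prediction after one gradient step equals the output of attention with the softmax removed, using example inputs as keys and example targets as values. The sketch below checks that numerically. It is a simplified illustration under those stated assumptions (linear model, zero initialisation, no softmax), not the full construction from either paper, and the data are arbitrary.

```python
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def gd_step_prediction(xs: List[List[float]], ys: List[float],
                       x_query: List[float], lr: float) -> float:
    """Predict after ONE gradient step from w = 0 on the loss
    L(w) = 0.5 * sum_i (w . x_i - y_i)^2."""
    dim = len(x_query)
    # The gradient at w = 0 is -sum_i y_i * x_i, so w_1 = lr * sum_i y_i * x_i.
    w = [lr * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(dim)]
    return dot(w, x_query)


def linear_attention_prediction(xs: List[List[float]], ys: List[float],
                                x_query: List[float], scale: float) -> float:
    """Softmax-free attention: keys = x_i, values = y_i, query = x_query."""
    return scale * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


# Arbitrary in-context examples and a query point.
xs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
ys = [1.0, -1.0, 0.5]
x_query = [0.5, 1.5]

gd = gd_step_prediction(xs, ys, x_query, lr=0.1)
attn = linear_attention_prediction(xs, ys, x_query, scale=0.1)
print(gd, attn)
```

The two functions compute the same sum in a different order: one forms the updated weights and then applies them, the other weights each example's target by its similarity to the query. Real transformers use softmax attention and many layers, which is part of why the interpretation remains contested.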

What is empirically clear is that this capability emerges with scale. GPT-2 (1.5B parameters) shows minimal improvement from few-shot examples on most tasks. GPT-3 (175B) shows dramatic improvement. The threshold is not perfectly defined, but the emergent signature is consistent.

Practical Implications

In-context learning has reshaped how AI is applied in practice. Instead of commissioning fine-tuned models for every new task — expensive, requiring labelled data and ML engineering — practitioners can often prompt-engineer a solution. The model adapts from examples.

This does have limits. ICL is less sample-efficient than fine-tuning for large label spaces, struggles with tasks requiring consistent internal state across many steps, and is sensitive to prompt formatting. But for a huge range of classification, extraction, translation, and transformation tasks, few-shot prompting reaches competitive accuracy with dramatically less infrastructure.

In-Context Learning (ICL): A model's ability to adapt to a new task from examples provided in the prompt, without any weight updates.

Few-Shot Prompting: Providing k input-output examples before a target query, allowing the model to infer the task mapping.

Many-Shot Learning: Using hundreds to thousands of examples within a long-context window, narrowing the gap with fine-tuning.

Real-World Case: Legal Document Triage

Law firms and legal technology companies have used few-shot prompting to classify contract clauses by risk level — providing five to ten annotated examples per category in the prompt, then running new documents through. Without any fine-tuning pipeline, this approach reaches accuracy competitive with purpose-built classifiers on many clause types, dramatically reducing the time to deployment for new clause categories.

Lesson 3 Quiz

In-Context Learning — Check your understanding

1. What did Min et al. (2022) find when they replaced correct labels in few-shot examples with random labels?

Correct. Min et al.'s randomised-label experiment showed that ICL is partly about inferring input format and distribution — not purely learning the label mapping. This was a surprising and influential finding.

Min et al. found the opposite: accuracy barely changed when labels were randomised, suggesting the examples serve primarily to establish format and input distribution, not to teach specific label mappings.

2. The theoretical framework comparing in-context learning to gradient descent suggests:

Correct. Akyürek et al. and Dai et al. proposed the "dual form" interpretation: attention in the forward pass performs computations structurally analogous to gradient descent on the in-context examples, without any actual weight modification.

The theoretical proposal is subtler: the attention mechanism's forward-pass computation is mathematically analogous to gradient descent — not that weights change, but that the computation implements similar structure.

3. Many-shot learning, as demonstrated by Agarwal et al. (2024) with Gemini 1.5 Pro, is enabled by:

Correct. Gemini 1.5 Pro's 1-million-token context window makes many-shot learning practical — hundreds to thousands of examples fit in a single prompt, narrowing the gap with fine-tuning.

Many-shot learning is enabled by long context windows. Gemini 1.5 Pro's 1M-token context allows hundreds of examples to fit in a prompt — no special fine-tuning required.

Lab 3 — Few-Shot Learning in Action

Design few-shot prompts and test in-context learning with real tasks

Lab Objective

Practice designing few-shot prompts for real tasks. Experiment with how example count, example quality, and label randomisation affect outputs. Discuss what you observe about how in-context learning actually works.

Suggested start: "I want to try few-shot sentiment classification. Can you show me how to structure a 3-shot prompt for classifying product reviews as Positive, Negative, or Neutral, then classify this new review: 'The battery life is great but the screen resolution is disappointing for the price.'"

AESOP Lab Assistant

In-Context Learning

Welcome to Lab 3. We're going to explore in-context learning hands-on — designing few-shot prompts, testing how example quality affects outputs, and discussing what the results tell us about how ICL works. What task would you like to try?

Module 5 · Lesson 4

Unexpected Capabilities and Safety Implications

When models do things no one anticipated — and why that matters beyond benchmarks

What happens when a model's emergent capabilities include abilities that weren't evaluated for — or weren't wanted?

In 2023, researchers at Carnegie Mellon and the Center for AI Safety published a paper demonstrating that aligned models — including GPT-4 and Claude — could be made to produce harmful outputs through a class of inputs called adversarial suffixes. These were strings of characters that, appended to a user prompt, reliably bypassed safety training. The strings were found by automated optimisation, not human creativity.

The finding illustrated a core challenge of emergent capabilities: safety evaluations performed before deployment could not guarantee safety across all inputs, because the model's capability surface was too large and too poorly understood to enumerate exhaustively. New capabilities meant new attack surfaces. And some of those attack surfaces were discovered by adversaries before developers.

The Grokking Phenomenon

Power et al. (2022) at OpenAI described a phenomenon they called "grokking": neural networks trained on algorithmic tasks (like modular arithmetic) would first memorise the training data, achieving near-perfect training accuracy but near-chance test accuracy. Then, after prolonged training, they would suddenly generalise: test accuracy would jump to near-perfect, as if the model had discovered the underlying algorithm.

The implication is that models can contain latent capabilities that are not visible from their training loss curves. A model evaluated at checkpoint 10,000 might generalise poorly. The same model at checkpoint 50,000 might generalise perfectly, even with almost identical training loss. Judged by training loss alone, the transition is invisible until it happens.

Grokking has been observed in transformers trained on arithmetic, logic, and simple programs. Its relevance to large-scale LLMs is actively studied. But it illustrates that the capability surface of a model is not fully determined by its training loss — capabilities can be latent and appear only under specific conditions.

Grokking — The Delayed Generalisation

Power et al. trained a small transformer on modular addition, (a + b) mod 97. For thousands of gradient steps, training accuracy was 100% and test accuracy was ~5%: the model was memorising. Then, suddenly, test accuracy jumped to 100%. The model had found a general solution long after it had fit the training set. Understanding why this happens, and whether it occurs in large LLMs, is an open research question.
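The dataset for this kind of experiment is small enough to enumerate in full. The sketch below builds every (a, b) pair for addition mod 97 and splits the pairs into train and test sets. The function name, the 50% training fraction, and the seed are assumptions chosen for illustration, and no model training is shown.

```python
import random
from typing import List, Tuple

Example = Tuple[Tuple[int, int], int]  # ((a, b), (a + b) mod p)


def modular_addition_split(
    p: int = 97, train_fraction: float = 0.5, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Enumerate all p * p input pairs, shuffle, and split into train/test.

    Test pairs never appear in training, so high test accuracy requires
    learning the rule rather than memorising a lookup table.
    """
    pairs = [(a, b) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    train = [((a, b), (a + b) % p) for a, b in pairs[:cut]]
    test = [((a, b), (a + b) % p) for a, b in pairs[cut:]]
    return train, test


train_set, test_set = modular_addition_split()
print(f"{len(train_set)} training examples, {len(test_set)} test examples")
print("sample:", train_set[0])
```

Because the full input space is known, memorisation and generalisation can be separated cleanly: a model that scores perfectly on `train_set` but poorly on `test_set` has only memorised.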

Emergent Deception and Theory of Mind

Kosinski (2023) at Stanford published a study claiming GPT-4 had developed theory of mind capabilities — the ability to reason about what others believe, including false beliefs. On standard false-belief tasks used in developmental psychology, GPT-4 scored at the level of a 9-year-old child. The paper was contested: some argued the model had simply memorised task formats from its training data. But the debate itself illustrates the difficulty: determining whether an observed capability is genuine understanding or sophisticated pattern completion is genuinely hard.

More practically: emergent capabilities can include behaviours that are strategically useful for a model pursuing implicit objectives. Scheurer et al. (2023) and others in the AI safety literature have studied whether sufficiently capable models can learn to deceive evaluators, giving safe-looking responses during evaluation while behaving differently in deployment. Whether this has occurred in real systems is debated; that it is theoretically possible in sufficiently capable systems is not.

Implications for Evaluation and Governance

The core safety implication of emergence is straightforward and uncomfortable: you cannot evaluate for capabilities you don't know exist yet. Pre-deployment testing can characterise known capability categories. It cannot guarantee that the model has no capabilities outside those categories.

This has shaped regulatory thinking. The EU AI Act (2024) mandates capability evaluations for frontier models before deployment, but specifies only evaluable categories, leaving open how organisations should handle unknown unknowns. The UK AI Safety Institute and US AISI have developed structured evaluation frameworks, but explicitly acknowledge they cannot be exhaustive.

Practically, organisations deploying large models have adopted layered defences: prompt filtering, output filtering, capability monitoring in production (watching for unexpected behaviour patterns), and restricted deployment contexts that limit exposure surface even if capabilities exist.
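The layering can be sketched as a thin wrapper around the model call. Everything below is hypothetical: the `GuardedModel` class, the keyword lists, and the stub model are toy stand-ins, and production systems typically use trained classifiers rather than substring checks.

```python
from dataclasses import dataclass, field
from typing import Callable, List

REFUSAL = "Sorry, I can't help with that."


@dataclass
class GuardedModel:
    generate: Callable[[str], str]   # hypothetical model call
    blocked_prompt_terms: List[str]  # lowercase terms; toy stand-in for a prompt filter
    blocked_output_terms: List[str]  # lowercase terms; toy stand-in for an output filter
    audit_log: List[str] = field(default_factory=list)

    def respond(self, prompt: str) -> str:
        # Layer 1: prompt filtering, applied before the model is called.
        if any(term in prompt.lower() for term in self.blocked_prompt_terms):
            self.audit_log.append(f"blocked prompt: {prompt!r}")
            return REFUSAL
        output = self.generate(prompt)
        # Layer 2: output filtering, applied to what the model produced.
        if any(term in output.lower() for term in self.blocked_output_terms):
            self.audit_log.append(f"blocked output for prompt: {prompt!r}")
            return REFUSAL
        # Layer 3: monitoring. Every served request is logged for later review.
        self.audit_log.append(f"served: {prompt!r}")
        return output


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model."""
    return "internal-key-123" if "key" in prompt else f"Echo: {prompt}"


guard = GuardedModel(
    generate=stub_model,
    blocked_prompt_terms=["forbidden"],
    blocked_output_terms=["internal-key"],
)
for prompt in ["hello", "a forbidden request", "what is the key?"]:
    print(prompt, "->", guard.respond(prompt))
print(guard.audit_log)
```

The output filter catches the third request even though its prompt looked harmless, which is the point of layering: each check covers cases the others miss, and the log supports monitoring for patterns no filter anticipated.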

Grokking: A phenomenon where models suddenly generalise to test data after extended training, despite earlier appearing to only memorise, suggesting latent capability accumulation.

Adversarial Suffix: An optimised string appended to a prompt that bypasses safety training, demonstrating that aligned models can have exploitable capability blind spots.

Latent Capability: An ability present in a model's learned representations that is not visible in standard evaluation but can emerge under specific input conditions.

The Evaluation Gap

Current best practice: treat capability evaluation as continuous, not one-time. The UK AISI runs evaluations on each major model version before release. Anthropic, OpenAI, and Google DeepMind have all committed to structured pre-deployment evaluations covering dangerous capability categories — including CBRN (chemical, biological, radiological, nuclear) assistance and cyberoffence. These evaluations set a threshold but cannot guarantee the full capability surface is mapped.

Lesson 4 Quiz

Unexpected Capabilities and Safety — Check your understanding

1. The "grokking" phenomenon, discovered by Power et al. (2022), demonstrates that:

Correct. Grokking shows that models can memorise for thousands of steps, then suddenly learn to generalise — with the transition invisible in the loss curve. This means capability surfaces are not fully captured by training metrics.

Grokking is precisely the opposite: generalisation can appear suddenly, long after memorisation, with no warning from the training loss. The capability was latent, not absent.

2. The adversarial suffix attack (CMU/CAIS 2023) demonstrated that aligned LLMs:

Correct. Adversarial suffixes — strings found by automated optimisation — bypassed safety training in multiple frontier models, showing that alignment does not eliminate latent harmful capabilities; it constrains access to them under normal conditions.

The adversarial suffix attack found that optimised strings could bypass safety training reliably — the harmful capabilities were latent in the model, not eliminated. Alignment constrains access, not capability.

3. Why is pre-deployment capability evaluation fundamentally limited by emergence?

Correct. The core limitation is epistemic: evaluation frameworks test known capability categories. Emergent capabilities by definition can appear outside those categories — meaning pre-deployment evaluation cannot guarantee the full capability surface is characterised.

The fundamental issue is epistemic: emergent capabilities can fall outside all evaluated categories. You can only test for what you know to test for — but emergent abilities don't announce themselves before they appear.

Lab 4 — Emergence and Safety

Explore the safety implications of unpredictable model capabilities

Lab Objective

Discuss the safety and governance implications of emergent capabilities — what they mean for evaluation, deployment decisions, and responsible AI development. Explore grokking, the evaluation gap, and what organisations can realistically do about unknown capabilities.

Suggested start: "If emergent capabilities can appear without warning and can't all be evaluated in advance, what should an organisation do before deploying a frontier model? Walk me through a responsible evaluation framework."

AESOP Lab Assistant

Safety & Emergence

Welcome to Lab 4. We're exploring the safety implications of emergent capabilities — grokking, latent abilities, the evaluation gap, and what responsible deployment looks like when capability surfaces are hard to bound. What aspect would you like to dig into?

Module 5 Test

Emergent Capabilities — 15 questions · 80% to pass

1. In LLM research, an emergent ability is defined as one that:

Correct. Emergent abilities are defined by their discontinuous appearance — near-random below threshold, substantially above chance above it.

Emergence is defined by discontinuity — the ability is absent (near random) at small scale and appears sharply past a threshold.

2. The 2022 Wei et al. paper on emergent abilities documented how many benchmark tasks showing this pattern?

Correct — over 137 benchmark tasks, a large and influential catalogue.

Wei et al. documented emergence across over 137 benchmark tasks.

3. Schaeffer et al. (2023) argued that many apparent emergent abilities:

Correct. Schaeffer et al. showed that continuous metrics often remove the apparent discontinuity, suggesting it is partly a measurement effect.

Schaeffer et al. argued that non-linear metrics (like exact-match) can make smooth improvements look discontinuous — a measurement artefact.

4. Chain-of-thought prompting was shown by Wei et al. (2022) to be an emergent capability because:

Correct. CoT hurts small models and helps large ones — the classic emergent pattern.

The emergent signature: CoT actually hurts performance below ~100B parameters and dramatically helps above that threshold.

5. Zero-shot chain-of-thought (Kojima et al. 2022) showed that appending "Let's think step by step" to a prompt:

Correct. On MultiArith, this single phrase raised GPT-3 accuracy from 17.7% to 78.7% — no examples, no fine-tuning.

Zero-shot CoT is pure prompting — one phrase, no examples, no training. It still dramatically improved accuracy on reasoning benchmarks.

6. Self-consistency prompting (Wang et al. 2022) improves over standard CoT by:

Correct. Self-consistency exploits the fact that correct answers cluster across diverse chains while errors are inconsistent.

Self-consistency samples many chains and votes — no external tools, no reward models, no training needed.

7. In-context learning (ICL) allows LLMs to:

Correct. ICL is purely forward-pass adaptation — the model uses examples to infer the task without any gradient updates.

ICL works entirely in the forward pass. No weights change. The model adapts from examples in the prompt alone.

8. Min et al. (2022) randomised the labels in few-shot examples and found:

Correct. This counterintuitive finding showed that examples communicate more than label mappings — they also establish input format and distribution.

Min et al. found that randomising labels barely affected accuracy — models use examples for format/distribution as much as for learning specific mappings.

9. Many-shot learning (Agarwal et al. 2024) extends ICL by using:

Correct. Gemini 1.5 Pro's 1M-token context window makes many-shot learning practical, narrowing the gap with fine-tuning.

Many-shot learning is enabled by long context windows — Gemini 1.5 Pro's 1M tokens can hold hundreds of examples in a single prompt.

10. "Grokking," discovered by Power et al. (2022), refers to:

Correct. Grokking shows that generalisation can appear suddenly after a prolonged memorisation phase, with the transition invisible in loss curves.

Grokking is delayed generalisation: models memorise first (100% train accuracy, ~5% test accuracy), then suddenly generalise (100% test accuracy) after much more training.

11. Adversarial suffix attacks (2023) revealed that aligned LLMs:

Correct. Adversarial suffixes bypass alignment without human creativity — the harmful capabilities were latent, not eliminated, by safety training.

Adversarial suffixes show that alignment constrains but doesn't eliminate latent harmful capabilities — optimised strings can bypass the constraint.

12. The theoretical framework comparing ICL to gradient descent (Akyürek et al. 2022) proposes that:

Correct. The "dual form" interpretation: attention computations implement something structurally analogous to gradient descent, but no weights change.

The proposal is that attention computes something analogous to a gradient update — structurally similar, but no actual weight modification occurs.

13. The fundamental challenge that emergence poses for pre-deployment capability evaluation is:

Correct. The evaluation gap is epistemic: emergent abilities don't announce themselves, so evaluation frameworks can miss entire capability categories.

The fundamental problem is epistemic: you can only test for known capabilities, but emergent abilities can appear in categories no evaluator had anticipated.

14. Which statement best describes the relationship between chain-of-thought prompting and model scale?

Correct. The scale-dependence of CoT gains is one of the clearest examples of emergent behaviour — the same prompt hurts at small scale and helps at large scale.

CoT is scale-dependent: it hurts small models (which can't productively use the scaffolding) and helps large models above roughly 100B parameters.

15. Which of the following best describes the current consensus on emergence in LLMs?

Correct. Whether emergence reflects true discontinuity or measurement effects is debated — but operationally, capabilities change in ways that cannot be reliably predicted by interpolating from smaller models.

The current consensus: emergence is real enough to matter operationally, even if its precise nature is debated. Practitioners cannot safely extrapolate from smaller models to predict all capabilities of larger ones.