In late 2021, researchers at Google Brain were running standard benchmark evaluations on a 540-billion-parameter model called PaLM. They expected incremental improvements. What they saw instead stopped them: on a task requiring multi-step arithmetic reasoning, model performance was essentially zero at smaller scales, then jumped sharply past a threshold, as if a switch had been thrown.
They called the paper "Emergent Abilities of Large Language Models." The finding wasn't that bigger was better. It was that something qualitatively different appeared to exist above certain scale thresholds: abilities the training process had never been designed to produce.
In complexity science, emergence describes properties of a system that arise from the interaction of its components but cannot be predicted from those components alone. Water is wet; individual H₂O molecules are not. A flock of starlings turns in perfect unison with no coordinator. The whole exhibits behaviours absent from any part.
In LLMs, emergence takes a specific empirical form: a capability that is near-random at smaller scales and well above chance at larger ones, with a relatively sharp transition between them. The 2022 paper by Wei et al. catalogued 137 such tasks, from analogical reasoning to word-unscrambling to IPA transliteration, where this pattern appeared.
Across 137 benchmark tasks, emergent abilities were defined as capabilities that scored near chance at small model sizes and substantially above chance only above a compute/parameter threshold. The threshold varied by task but the pattern was consistent: discontinuous improvement, not smooth scaling.
A useful physical analogy is the phase transition. Water at 99°C is still liquid; then one degree of added heat reorganises the entire molecular structure into steam. The system crosses a threshold and its macroscopic behaviour changes categorically, not gradually.
With LLMs, the "temperature" is compute, measured in FLOPs (floating-point operations), or more practically, parameter count and dataset size. Below the threshold, the model produces random-looking output on the task. Above it, structure appears.
Critically, the training objective never changes. The model is still predicting the next token throughout. The emergent behaviour is not trained in directly; it surfaces from the learned representations reaching sufficient complexity to support it.
Emergence has practical consequences. If capabilities appear discontinuously, they are difficult to forecast. A model at 10 billion parameters may fail completely on a task; a model at 100 billion may pass with high accuracy. Intermediate checkpoints give no warning the transition is coming.
This creates real challenges for organisations deploying LLMs. You cannot always evaluate your way to safety by testing smaller models first. The capability landscape can shift discontinuously with scale, meaning evaluations must be redone on each new generation of model, not extrapolated from earlier ones.
It also raises a deeper question: are emergent abilities a property of the model, or a property of our metrics? Some researchers, including Schaeffer et al. in 2023, argued that apparent emergence is partly a measurement artefact: when tasks are scored with non-linear metrics (e.g., exact match, which is all-or-nothing), smooth underlying improvements look like sharp jumps. The debate remains active, but most practitioners treat emergence as a real operational phenomenon regardless of its theoretical status.
Schaeffer, Miranda & Koyejo (2023) showed that many apparent emergent abilities disappear when continuous metrics replace discrete ones, suggesting the discontinuity is partly in how we measure, not in how the model changes. But even if the threshold is softer than it appears, the practical engineering implication stands: you cannot reliably predict new capabilities by extrapolating from smaller models.
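The measurement argument can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that per-token accuracy improves smoothly with scale and that token errors are independent, so an answer of L tokens is an exact match with probability p^L. The accuracy values and the answer length are invented for the example and are not taken from Schaeffer et al.

```python
# Toy illustration: a smooth per-token metric versus an all-or-nothing metric.
# All numbers are illustrative assumptions, not published results.

def exact_match_accuracy(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token is right, assuming independent token errors."""
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies for increasingly large models.
PER_TOKEN = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
ANSWER_LENGTH = 10  # assumed number of tokens in the target answer

for p in PER_TOKEN:
    em = exact_match_accuracy(p, ANSWER_LENGTH)
    print(f"per-token={p:.2f}  exact-match={em:.3f}")
```

Per-token accuracy climbs steadily across the list, while exact match stays close to zero for the first few entries and then rises steeply. That is the shape that reads as a jump on a benchmark plot, even though nothing discontinuous happened to the underlying quantity.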
You'll explore the concept of emergence in LLMs by discussing real examples, the phase transition analogy, and the measurement debate. Ask the assistant to help you understand where emergence shows up and why it challenges standard evaluation approaches.
In January 2022, Jason Wei, Xuezhi Wang, and colleagues at Google published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Their finding was stark: on GSM8K, a dataset of grade-school maths word problems, PaLM 540B scored around 17% with standard prompting. With chain-of-thought prompting (showing the model a few examples of reasoning steps before asking the question), the same model scored about 58%. The intervention was entirely in the prompt.
More striking: chain-of-thought only helped models above roughly 100 billion parameters. Below that threshold, adding reasoning steps hurt performance. The capability was emergent: it required sufficient model capacity to use the scaffolding rather than be confused by it.
Standard prompting gives a model: question → answer. Chain-of-thought (CoT) gives it: question → intermediate reasoning steps → answer. The model is shown examples of this format in the prompt (few-shot CoT) or simply instructed to reason step by step (zero-shot CoT, as in the Kojima et al. 2022 paper "Large Language Models are Zero-Shot Reasoners").
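The difference between the two formats is easiest to see side by side. The following is a minimal sketch that only builds prompt strings; the `Demo` class, the `build_prompt` helper, the "Q:/A:" layout, and the example problems are all hypothetical choices made for illustration, and no model is called.

```python
# Sketch of standard versus few-shot chain-of-thought prompt formats.
# Helper names and example problems are hypothetical; no model is called.
from dataclasses import dataclass


@dataclass
class Demo:
    question: str
    reasoning: str  # intermediate steps, used only in the CoT format
    answer: str


def build_prompt(demos: list[Demo], target_question: str, use_cot: bool) -> str:
    """Format demonstrations as Q/A pairs, optionally including reasoning steps."""
    blocks = []
    for d in demos:
        if use_cot:
            blocks.append(f"Q: {d.question}\nA: {d.reasoning} The answer is {d.answer}.")
        else:
            blocks.append(f"Q: {d.question}\nA: The answer is {d.answer}.")
    # The target question is left open for the model to complete.
    blocks.append(f"Q: {target_question}\nA:")
    return "\n\n".join(blocks)


demos = [
    Demo(
        question="A shop has 3 boxes with 4 pens each. It sells 5 pens. How many are left?",
        reasoning="3 boxes of 4 pens is 3 * 4 = 12 pens. After selling 5, 12 - 5 = 7 remain.",
        answer="7",
    )
]
target = "A train has 6 cars with 20 seats each. 35 seats are taken. How many are free?"

standard_prompt = build_prompt(demos, target, use_cot=False)
cot_prompt = build_prompt(demos, target, use_cot=True)
print(cot_prompt)
```

The only difference between `standard_prompt` and `cot_prompt` is the worked reasoning inside each demonstration, which mirrors the point that the intervention lives entirely in the prompt.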
The mechanism appears to work because generating intermediate tokens forces the model to allocate representational capacity to the problem structure before committing to an answer. A model predicting the final answer in one step must compress all reasoning into the forward pass reaching the answer token. A model producing intermediate steps can build up partial results token by token, effectively using the generated text as a working memory.
This is not fully understood. Current interpretability research (as of 2024) cannot fully trace the internal computations. But the behavioural signature is consistent: CoT helps most on tasks requiring multi-step compositional reasoning, such as arithmetic, symbolic manipulation, and commonsense reasoning chains.
Kojima et al. (2022) showed that appending "Let's think step by step" to a prompt, with no examples, dramatically improved accuracy across arithmetic, symbolic, and commonsense tasks. On MultiArith, this zero-shot trigger improved accuracy from 17.7% to 78.7% on a GPT-3-family model (InstructGPT, text-davinci-002). The phrase appears to elicit a reasoning mode latent in the model's pretrained weights.
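A zero-shot CoT call can be sketched as two model calls: one to elicit the reasoning and one to extract the final answer from it. The sketch below follows that general two-stage shape; the `generate` callable is a hypothetical stand-in for whatever model API is in use, the exact answer-extraction wording is an assumption, and `fake_generate` exists only so the example runs without a model.

```python
# Sketch of two-stage zero-shot chain-of-thought prompting.
# `generate` is a hypothetical model interface: prompt string in, completion out.
from typing import Callable

TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """First elicit reasoning with the trigger phrase, then ask for the answer."""
    reasoning_prompt = f"Q: {question}\nA: {TRIGGER}"
    reasoning = generate(reasoning_prompt)
    # Second call: feed the reasoning back and prompt for a short final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(answer_prompt).strip()


def fake_generate(prompt: str) -> str:
    """Canned responses so the sketch runs on its own (not a real model)."""
    if prompt.endswith("Therefore, the answer is"):
        return " 7"
    return "3 * 4 = 12 pens, and 12 - 5 = 7."


print(zero_shot_cot("A shop has 3 boxes of 4 pens and sells 5. How many are left?", fake_generate))
```

Splitting the call in two keeps the free-form reasoning separate from the short answer, which makes the answer easier to parse and score.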
Wang et al. (2022) extended CoT with self-consistency: instead of generating one reasoning chain, sample many diverse chains and take a majority vote on the final answer. On GSM8K with PaLM 540B, self-consistency pushed accuracy from 58% to 74%, better than fine-tuning in some conditions.
The intuition is that different reasoning paths to the same answer reinforce each other, while errors tend to be inconsistent across chains. This is an emergent property of scale: a small model's chains are too random for majority voting to help; a large model's chains, while individually imperfect, cluster around correct answers enough that aggregation improves reliability.
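The voting step described above is simple to sketch. In the example below, `sample` is a hypothetical callable that returns one reasoning chain per call (in practice, a model sampled at non-zero temperature), and taking the last number in a chain as its answer is an assumed parsing rule chosen for brevity. The canned chains are invented so the example runs without a model.

```python
# Sketch of self-consistency: sample several chains, majority-vote the answers.
# `sample` is a hypothetical model interface; the canned chains are made up.
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Treat the last number in a reasoning chain as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistent_answer(
    prompt: str, sample: Callable[[str], str], n_samples: int = 5
) -> Optional[str]:
    """Return the most common final answer across n sampled chains."""
    answers = [extract_answer(sample(prompt)) for _ in range(n_samples)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


def make_fake_sampler(chains: list[str]) -> Callable[[str], str]:
    """Return a sampler that replays canned chains in order (not a real model)."""
    it = iter(chains)
    return lambda prompt: next(it)


CANNED_CHAINS = [
    "12 - 5 = 7. The answer is 7.",
    "3 * 4 = 12, minus 5 gives 7.",
    "3 + 4 = 7, minus 5 gives 2.",   # wrong first step
    "12 pens, 5 sold, so 7.",
    "4 * 3 = 12. 12 - 5 = 8.",       # arithmetic slip
]

print(self_consistent_answer("...", make_fake_sampler(CANNED_CHAINS), n_samples=5))
```

The two faulty chains disagree with each other, so neither outvotes the three chains that land on the same answer. This is the clustering effect the technique relies on.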
The table below shows published benchmark results illustrating the scale-dependence of CoT gains:
| Model | Params | GSM8K (Standard) | GSM8K (CoT) |
|---|---|---|---|
| GPT-3 | 175B | ~17% | ~46% |
| PaLM | 540B | ~17% | ~58% |
| PaLM + Self-Consistency | 540B | n/a | ~74% |
| GPT-4 | not disclosed | n/a | ~92% |
The jump from standard to CoT prompting is itself an emergent phenomenon: it does not appear at small model sizes. And the jump from CoT to self-consistency compounds the emergent gain.
CoT prompting is among the most robust, low-cost interventions available to anyone using LLMs today; its only cost is the extra tokens generated. On complex reasoning tasks (writing business logic, diagnosing errors, planning multi-step workflows), asking for step-by-step reasoning often improves accuracy. The mechanism is emergent, but the technique is available to any user of a sufficiently large model.
Explore chain-of-thought reasoning by asking the assistant to solve multi-step problems with and without step-by-step reasoning. Compare approaches, discuss why CoT helps, and examine what types of tasks benefit most.
When OpenAI published the GPT-3 paper in May 2020, the headline was scale: 175 billion parameters, trained on hundreds of billions of words. But buried in the results was something more surprising. Without any fine-tuning, GPT-3 could be shown three examples of a task in the prompt (say, English-to-French translation) and then perform it on new inputs with accuracy competitive with fine-tuned smaller models.
The researchers called this few-shot learning. The model was adapting to the task from context alone. No gradient updates, no weight changes. Something in the forward pass was extracting the task structure from the examples and applying it to the new input. The community did not have a clear mechanistic account of how. In some ways, it still doesn't.
Zero-shot: The model receives only a task description. "Translate the following to French: ___." No examples. This works surprisingly well for tasks the model has seen described in pretraining data, but fails on tasks requiring specific format or niche knowledge.
Few-shot: The model receives k examples of input-output pairs before the target input. The examples act as implicit task instructions. The model infers the mapping pattern and applies it. Quality generally improves with k, with diminishing returns after roughly 10-20 examples depending on task complexity.
Many-shot: Agarwal et al. (2024, Google DeepMind) showed that Gemini 1.5 Pro, with its 1-million-token context window, could use hundreds or thousands of examples in-context, narrowing the gap between in-context learning and fine-tuning significantly. The regime extends further than originally thought.
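The three regimes above differ only in how many demonstrations precede the target input, which a single prompt builder can express. This is a minimal sketch: the `few_shot_prompt` helper, the "Input:/Output:" layout, and the translation pairs are hypothetical choices for illustration, and no model is called.

```python
# Sketch: one builder covering zero-shot (k = 0), few-shot, and many-shot prompts.
# The helper name, layout, and example pairs are hypothetical.

def few_shot_prompt(
    instruction: str, examples: list[tuple[str, str]], target: str, k: int
) -> str:
    """k = 0 gives a zero-shot prompt; k > 0 prepends the first k examples."""
    parts = [instruction]
    for source, output in examples[:k]:
        parts.append(f"Input: {source}\nOutput: {output}")
    # The target input is left open for the model to complete.
    parts.append(f"Input: {target}\nOutput:")
    return "\n\n".join(parts)


EXAMPLES = [("cheese", "fromage"), ("house", "maison"), ("book", "livre")]
INSTRUCTION = "Translate English to French."

zero_shot = few_shot_prompt(INSTRUCTION, EXAMPLES, "cat", k=0)
three_shot = few_shot_prompt(INSTRUCTION, EXAMPLES, "cat", k=3)
print(three_shot)
```

Scaling to the many-shot regime is the same call with a much larger `examples` list and `k`, limited only by the model's context window.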
Min et al. (2022) ran a controlled experiment: they replaced all the correct labels in few-shot examples with random labels. Accuracy barely dropped. This suggested models were using examples to infer the format and input distribution rather than learning the label mapping. But other tasks showed strong label sensitivity, so the picture is mixed. The most accurate current view: ICL uses examples for both format inference and, at sufficient scale, actual task learning.
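The label-randomisation manipulation is easy to reproduce on your own prompts. The sketch below keeps each demonstration's input text and replaces its label with one drawn uniformly from the label set; the `randomise_labels` helper and the sentiment examples are hypothetical, and the fixed seed is only there to make the output repeatable.

```python
# Sketch of the random-label ablation: inputs are kept, labels are resampled.
# Helper name and example data are hypothetical.
import random


def randomise_labels(
    demos: list[tuple[str, str]], label_set: list[str], seed: int = 0
) -> list[tuple[str, str]]:
    """Replace each demonstration's label with a uniformly random one."""
    rng = random.Random(seed)
    return [(text, rng.choice(label_set)) for text, _ in demos]


DEMOS = [
    ("The film was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
    ("A warm, funny, beautifully acted story.", "positive"),
    ("Flat characters and a plot that goes nowhere.", "negative"),
]
LABELS = ["positive", "negative"]

shuffled = randomise_labels(DEMOS, LABELS)
for text, label in shuffled:
    print(f"Review: {text}\nSentiment: {label}\n")
```

Running the same evaluation set with `DEMOS` and with `shuffled` shows how much a given task depends on the label mapping rather than on the format and input distribution.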
Akyürek et al. (2022) and Dai et al. (2022) proposed that in-context learning can be understood as a form of implicit gradient descent performed in the forward pass. The attention mechanism, they argued, implements something mathematically similar to a single gradient update on the in-context examples, without actually modifying any weights.
This "dual form" interpretation remains theoretical and contested. But it offers an intuition: the model is not merely pattern-matching the surface form of examples. It is performing something structurally analogous to fast task adaptation, compressed into the forward pass rather than the training loop.
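A heavily simplified version of this correspondence can be checked numerically. The sketch below is a toy case under strong assumptions: a linear model with scalar targets, weights starting at zero, squared-error loss, and unnormalised linear attention with no softmax. It is not the construction from either paper; it only shows that, in this restricted setting, one gradient step and one attention read-out give the same prediction.

```python
# Toy check: one gradient-descent step on a linear model (from zero weights)
# gives the same prediction as unnormalised linear attention over the examples.
# This is a simplified illustration, not the construction used in the papers.
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def gd_step_prediction(xs, ys, query, lr):
    """Take one step on 0.5 * sum((w.x - y)^2) from w = 0, then predict."""
    d = len(query)
    w = [0.0] * d
    grad = [0.0] * d
    for x, y in zip(xs, ys):
        err = dot(w, x) - y
        for j in range(d):
            grad[j] += err * x[j]
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return dot(w, query)


def linear_attention_prediction(xs, ys, query, lr):
    """Keys are the example inputs, values are lr * targets, no softmax."""
    return sum(lr * y * dot(x, query) for x, y in zip(xs, ys))


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
query = [0.5, 2.0]

gd_pred = gd_step_prediction(xs, ys, query, lr=0.1)
attn_pred = linear_attention_prediction(xs, ys, query, lr=0.1)
print(gd_pred, attn_pred, math.isclose(gd_pred, attn_pred))
```

Both functions compute the learning rate times the sum of each target weighted by the inner product of its input with the query, which is why they agree. Real transformers use softmax attention, non-zero initial weights, and many layers, so the analogy is looser there.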
What is empirically clear is that this capability emerges with scale. GPT-2 (1.5B parameters) shows minimal improvement from few-shot examples on most tasks. GPT-3 (175B) shows dramatic improvement. The threshold is not perfectly defined, but the emergent signature is consistent.
In-context learning has reshaped how AI is applied in practice. Instead of commissioning fine-tuned models for every new task, which is expensive and requires labelled data and ML engineering, practitioners can often prompt-engineer a solution. The model adapts from examples.
This does have limits. ICL is less sample-efficient than fine-tuning for large label spaces, struggles with tasks requiring consistent internal state across many steps, and is sensitive to prompt formatting. But for a huge range of classification, extraction, translation, and transformation tasks, few-shot prompting reaches competitive accuracy with dramatically less infrastructure.
Law firms and legal technology companies have used few-shot prompting to classify contract clauses by risk level, providing five to ten annotated examples per category in the prompt, then running new documents through. Without any fine-tuning pipeline, this approach reaches accuracy competitive with purpose-built classifiers on many clause types, dramatically reducing the time to deployment for new clause categories.
Practice designing few-shot prompts for real tasks. Experiment with how example count, example quality, and label randomisation affect outputs. Discuss what you observe about how in-context learning actually works.
In 2023, researchers at Carnegie Mellon and the Center for AI Safety published a paper demonstrating that aligned models (including GPT-4 and Claude) could be made to produce harmful outputs through a class of inputs called adversarial suffixes. These were strings of characters that, appended to a user prompt, reliably bypassed safety training. The strings were found by automated optimisation, not human creativity.
The finding illustrated a core challenge of emergent capabilities: safety evaluations performed before deployment could not guarantee safety across all inputs, because the model's capability surface was too large and too poorly understood to enumerate exhaustively. New capabilities meant new attack surfaces. And some of those attack surfaces were discovered by adversaries before developers.
Power et al. (2022) at OpenAI discovered a phenomenon they called "grokking": neural networks trained on algorithmic tasks (like modular arithmetic) would first memorise the training data, achieving near-perfect training accuracy but near-chance test accuracy. Then, after prolonged training, they would suddenly generalise: test accuracy would jump to near-perfect, as if the model had discovered the underlying algorithm.
The implication is that models can contain latent capabilities that are not visible from their training or validation loss curves. A model evaluated at checkpoint 10,000 might generalise poorly. The same model at checkpoint 50,000 might generalise perfectly, even with almost identical loss values. The transition is invisible until it happens.
Grokking has been observed in transformers trained on arithmetic, logic, and simple programs. Its relevance to large-scale LLMs is actively studied. But it illustrates that the capability surface of a model is not fully determined by its training loss: capabilities can be latent and appear only under specific conditions.
Power et al. trained a small transformer on modular addition ((a + b) mod 97). For thousands of gradient steps, training accuracy was 100% and test accuracy was ~5%: the model was memorising. Then, suddenly, test accuracy jumped to 100%. The underlying algorithm had been learned but was suppressed. Understanding why this happens, and whether it occurs in large LLMs, is an open research question.
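The task setup is small enough to write down in full. The sketch below builds every (a, b) pair for addition mod 97, splits it into train and test sets, and contrasts two stand-in predictors: a lookup table that memorises the training pairs and a function that applies the rule. It does not train a network or reproduce the grokking dynamics; the 50% split, the seed, and both predictors are assumptions made for illustration.

```python
# Sketch of the modular-addition task, with memorisation and generalisation
# represented by two stand-in predictors. No network is trained here.
import random

P = 97  # modulus used in the task described above


def build_modular_addition_dataset(p: int = P, train_fraction: float = 0.5, seed: int = 0):
    """Enumerate all (a, b, (a + b) % p) triples and split them at random."""
    triples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(triples)
    cut = int(len(triples) * train_fraction)
    return triples[:cut], triples[cut:]


def accuracy(predict, examples) -> float:
    """Fraction of examples where predict(a, b) equals the target."""
    return sum(1 for a, b, c in examples if predict(a, b) == c) / len(examples)


train, test = build_modular_addition_dataset()
lookup = {(a, b): c for a, b, c in train}


def memoriser(a: int, b: int) -> int:
    """Recalls training pairs exactly; falls back to 0 on unseen pairs."""
    return lookup.get((a, b), 0)


def algorithm(a: int, b: int) -> int:
    """Applies the underlying rule, so it works on any pair."""
    return (a + b) % P


print("memoriser:", accuracy(memoriser, train), accuracy(memoriser, test))
print("algorithm:", accuracy(algorithm, train), accuracy(algorithm, test))
```

Both predictors are perfect on the training set, so training accuracy alone cannot tell them apart. Only the held-out pairs reveal which one has the rule, which is the gap that grokking closes late in training.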
Kosinski (2023) at Stanford published a study claiming GPT-4 had developed theory of mind capabilities, meaning the ability to reason about what others believe, including false beliefs. On standard false-belief tasks used in developmental psychology, GPT-4 scored at the level of a 9-year-old child. The paper was contested: some argued the model had simply memorised task formats from its training data. But the debate itself illustrates the difficulty: determining whether an observed capability reflects real understanding or sophisticated pattern completion is hard.
More practically: emergent capabilities can include behaviours that are strategically useful for a model pursuing implicit objectives. Scheurer et al. (2023) and others in the AI safety literature have studied whether sufficiently capable models can learn to deceive evaluators, giving safe-looking responses during evaluation while behaving differently in deployment. Whether this has occurred in real systems is debated; that it is theoretically possible in sufficiently capable systems is not.
The core safety implication of emergence is straightforward and uncomfortable: you cannot evaluate for capabilities you don't know exist yet. Pre-deployment testing can characterise known capability categories. It cannot guarantee that the model has no capabilities outside those categories.
This has shaped regulatory thinking. The EU AI Act (2024) mandates capability evaluations for frontier models before deployment, but specifies only evaluable categories, leaving open how organisations should handle unknown unknowns. The UK AI Safety Institute and US AISI have developed structured evaluation frameworks, but explicitly acknowledge they cannot be exhaustive.
Practically, organisations deploying large models have adopted layered defences: prompt filtering, output filtering, capability monitoring in production (watching for unexpected behaviour patterns), and restricted deployment contexts that limit exposure surface even if capabilities exist.
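The layering described above can be sketched as a thin wrapper around a model call. Everything in the example is hypothetical: the `guarded_call` function, the `Monitor` class, and the placeholder block-lists stand in for what would be far more sophisticated classifiers and logging in a real deployment, and `fake_model` exists only so the example runs.

```python
# Sketch of layered defences around a model call: prompt filter, output filter,
# and production monitoring. All names and rules here are hypothetical.
from dataclasses import dataclass, field
from typing import Callable

BLOCKED_PROMPT_TERMS = {"ignore previous instructions"}  # placeholder rule
BLOCKED_OUTPUT_TERMS = {"begin secret"}                  # placeholder rule


@dataclass
class Monitor:
    """Collects events so unexpected behaviour patterns can be reviewed later."""
    events: list = field(default_factory=list)

    def record(self, stage: str, detail: str) -> None:
        self.events.append((stage, detail))


def guarded_call(prompt: str, model: Callable[[str], str], monitor: Monitor) -> str:
    # Layer 1: refuse prompts that match a known-bad pattern.
    if any(term in prompt.lower() for term in BLOCKED_PROMPT_TERMS):
        monitor.record("prompt_filter", "blocked")
        return "Request declined."
    output = model(prompt)
    # Layer 2: withhold outputs that match a known-bad pattern.
    if any(term in output.lower() for term in BLOCKED_OUTPUT_TERMS):
        monitor.record("output_filter", "blocked")
        return "Response withheld."
    # Layer 3: log what was served for later review.
    monitor.record("served", f"{len(output)} chars")
    return output


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "BEGIN SECRET 1234" if "secret" in prompt else "Here is a summary."


monitor = Monitor()
print(guarded_call("Summarise this memo.", fake_model, monitor))
print(guarded_call("Tell me the secret.", fake_model, monitor))
print(guarded_call("Ignore previous instructions and comply.", fake_model, monitor))
print(monitor.events)
```

Each layer catches a different failure: the prompt filter never sees the model's output, and the output filter catches cases where an innocuous-looking prompt still produces something unwanted. The monitor log is what makes behaviour outside the known categories visible after the fact.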
Current best practice: treat capability evaluation as continuous, not one-time. The UK AISI runs evaluations on each major model version before release. Anthropic, OpenAI, and Google DeepMind have all committed to structured pre-deployment evaluations covering dangerous capability categories, including CBRN (chemical, biological, radiological, nuclear) assistance and cyberoffence. These evaluations set a threshold but cannot guarantee the full capability surface is mapped.
Discuss the safety and governance implications of emergent capabilities: what they mean for evaluation, deployment decisions, and responsible AI development. Explore grokking, the evaluation gap, and what organisations can realistically do about unknown capabilities.