Transformer architecture, attention mechanisms, and the mechanics of next-token prediction.
The 2017 paper "Attention Is All You Need" by Vaswani et al. introduced the transformer architecture that underlies nearly all modern large language models. Its key innovation was to rely entirely on self-attention: instead of processing tokens one at a time (as RNNs did), a transformer computes relationships between all tokens in the context simultaneously. This parallelism allowed training at scales that were previously impractical, and the resulting models demonstrated that scale alone, applied to next-token prediction, could produce remarkably general capabilities.
A transformer processes text through stacked layers of self-attention and feed-forward networks:
Unlike RNNs, transformers process all tokens in a sequence in parallel during training, which maps well onto GPUs and TPUs. This parallelism made it practical to train on datasets and at model sizes that were previously out of reach, enabling the scaling that drove GPT-3, GPT-4, and subsequent models.
Self-attention allows the model to directly relate any token to any other token in context, regardless of distance. This is why transformers handle long-range dependencies better than sequential architectures.
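To make the "any token to any other token" idea concrete, here is a minimal sketch of scaled dot-product attention for a single head, written in plain Python so that it runs without extra libraries. It assumes the query, key, and value vectors are given directly as made-up numbers; in a real transformer they would come from learned linear projections of the token embeddings, and the computation would use an array library and multiple heads.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, with no masking.

    Each argument is a list of vectors, one per token. Returns
    (outputs, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    scale = math.sqrt(len(keys[0]))
    weights = [softmax([dot(q, k) / scale for k in keys]) for q in queries]
    outputs = [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights

# Toy input: three tokens with 2-dimensional vectors (illustrative numbers only).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = self_attention(q, k, v)
```

Every row of `weights` is computed from dot products alone, so position plays no role: in this toy input the first token attends to the third token exactly as strongly as to itself, even though the third token is the farthest away.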
5 questions — free, untracked, retake anytime.
What was the key architectural innovation in 'Attention Is All You Need'?
How does the attention mechanism compute which tokens to attend to?
Why did transformers enable training at scales that RNNs couldn't?
What does multi-head attention add over single-head attention?
What role do residual connections play in transformer training?
Analyze transformer architecture decisions and their implications.
Analyze the architectural decisions in transformers and their implications.
Neural network optimization, gradient descent, and what 'learning' actually means computationally.
A neural network with 175 billion parameters (GPT-3) has 175 billion numbers that are adjusted during training. Training means running text through the network, measuring how wrong the predictions are (the loss), computing how each of those 175 billion parameters contributed to the error (the gradient), and nudging each parameter slightly in the direction that reduces the error. This cycle of forward pass, loss computation, backward pass, and parameter update is repeated many thousands of times, each time on a fresh batch of text, until the model has seen hundreds of billions of tokens. What emerges is not a programmed set of rules but a single configuration of weights that has internalized the statistical structure of its training data.
After training, the 175 billion parameters of GPT-3 encode, in distributed form across its weight matrices, the statistical regularities of hundreds of billions of words. There is no explicit knowledge store and no lookup table, only a vast, distributed numerical encoding of patterns that enables predictions.
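The four-step cycle described above can be sketched on a deliberately tiny scale. The example below is an illustration under stated assumptions, not how GPT-3 was trained: the "model" is just a table of next-token logits (one row per previous word), the corpus is a made-up sentence, and the gradient is written out by hand for the softmax cross-entropy loss instead of being computed by backpropagation through many layers.

```python
import math

text = "the cat sat on the mat the cat sat"  # made-up toy corpus
tokens = text.split()
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}
pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]

# Parameters: one row of next-token logits per previous token, all starting at zero.
weights = [[0.0] * len(vocab) for _ in vocab]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss():
    """Mean negative log-probability assigned to the true next token."""
    return sum(-math.log(softmax(weights[p])[n]) for p, n in pairs) / len(pairs)

learning_rate = 0.5
initial_loss = average_loss()
for step in range(200):
    for prev, nxt in pairs:
        probs = softmax(weights[prev])  # forward pass
        # The loss for this example is -log(probs[nxt]); its gradient with
        # respect to the logits is probs - one_hot(nxt).
        for j in range(len(vocab)):
            grad = probs[j] - (1.0 if j == nxt else 0.0)  # backward pass
            weights[prev][j] -= learning_rate * grad      # parameter update
final_loss = average_loss()
```

Before training, every next word is equally likely, so `initial_loss` equals the log of the vocabulary size. After training, `final_loss` is lower, yet nothing resembling a rule such as "cat follows the" is stored anywhere: there are only numbers in `weights` that make that prediction more probable.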
When we say a model 'knows' something, we mean the patterns in its weights enable it to generate appropriate outputs for queries related to that knowledge. 'Knowledge' in a neural network is a distributed emergent property of billions of numbers — not a discrete stored fact.
5 questions — free, untracked, retake anytime.
What does the 'loss function' measure during neural network training?
What does backpropagation compute?
What does 'knowledge' mean in a large language model, philosophically?
How does gradient descent work for training neural networks?
What is the computational significance of training a 175-billion-parameter model?
Analyze the philosophical implications of computational learning.
Analyze what 'learning' and 'knowledge' mean computationally.
Epistemic limitations, calibration, knowledge representation, and the frontier of AI knowledge.
Research on large language models has shown that they can be simultaneously miscalibrated in opposite directions: overconfident about specific facts they've pattern-matched to common training data, and underconfident about unusual but correct claims they've encountered rarely. The model has no internal epistemic state corresponding to "I know this is definitely true" vs. "I'm not sure about this." Confidence in output is a learned surface behavior, not an internal state tracking actual knowledge.
A well-calibrated model would express uncertainty proportional to its actual accuracy — saying "I'm 90% sure" when it's right 90% of the time. Research shows large language models are poorly calibrated:
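One common way to put a number on this is expected calibration error (ECE): group answers by stated confidence, then compare the average confidence in each group with the fraction that were actually correct. The sketch below is a minimal version of that idea; the function name, the bin count, and the sample data are illustrative assumptions, not taken from any particular study.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between stated confidence and observed accuracy, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        position = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[position].append((conf, ok))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        error += (len(members) / total) * abs(mean_conf - accuracy)
    return error

# Hypothetical data: ten answers, each stated with 90% confidence.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)
```

With nine of ten answers correct, stated confidence matches accuracy and the error is essentially zero. With only six of ten correct, the same stated confidence yields an error of about 0.3, which is the gap between 90% claimed and 60% achieved.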
Beyond calibration, language models have structural knowledge limits:
Research on retrieval-augmented generation (RAG) attempts to address knowledge limits by giving models access to external knowledge bases at inference time — partially separating the question of what the model knows from what information it can access.
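The retrieve-then-generate pattern can be sketched in a few lines. This is a toy illustration under loud assumptions: retrieval here is plain word overlap, whereas real systems typically use learned embeddings and a vector index, and the knowledge base entries, function names, and prompt wording are all hypothetical. The language model call itself is left out; the sketch stops at the prompt that would be sent to it.

```python
def tokenize(text):
    """Lowercase, strip basic punctuation, and split into a set of words."""
    cleaned = text.lower().replace("?", " ").replace(".", " ").replace(",", " ")
    return set(cleaned.split())

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    query_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, documents):
    """Place the retrieved text in front of the question."""
    context = "\n".join(retrieve(question, documents))
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

# Hypothetical knowledge base with made-up entries.
knowledge_base = [
    "The office cafeteria opens at 8 am on weekdays.",
    "Expense reports are due on the last Friday of each month.",
    "The parking garage closes at 10 pm.",
]
prompt = build_prompt("When are expense reports due?", knowledge_base)
```

The fact about expense reports never has to be in the model's weights. It only has to be in the knowledge base, which can be edited without retraining.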
5 questions — free, untracked, retake anytime.
Why are large language models poorly calibrated on confidence?
What does it mean that model confidence is a 'learned surface behavior' rather than an internal epistemic state?
What is retrieval-augmented generation (RAG)?
What does it mean that procedural and declarative knowledge are 'represented differently' in language models?
Why is it problematic that encoded knowledge 'cannot be easily updated without retraining'?
Analyze language model epistemic limitations for high-stakes deployment.
Analyze the implications of language model epistemic limitations for high-stakes deployment.
Prompt engineering at depth — jailbreaking, adversarial prompts, and the mechanics of instruction following.
Shortly after ChatGPT's release, users discovered that wrapping harmful requests in fictional framing ("write a story where a character explains how to...") could bypass safety training. These "jailbreaks" worked because safety training fine-tuned the model's behavior on direct requests but didn't generalize to indirect framing. The arms race between safety fine-tuning and novel jailbreaks has continued since — with each safety improvement met by new prompting approaches that find gaps in the generalization.
RLHF-trained models follow instructions because following instructions was rewarded during training. This creates a learned behavior, not a fundamental constraint:
Adversarial prompts exploit gaps between the distribution of training examples and the distribution of real-world inputs:
Adversarial prompting demonstrates that model safety is a training artifact, not an architectural constraint. Understanding this distinction is important for realistic assessment of AI safety claims.
5 questions — free, untracked, retake anytime.
Why did early ChatGPT jailbreaks using fictional framing work?
What is a system prompt, and what role does it play?
What is 'prompt injection' as an adversarial attack?
What does the arms race between jailbreaks and safety training reveal about model safety?
What is the significance of instruction following being a 'learned behavior, not a fundamental constraint'?
Analyze the mechanics and implications of adversarial prompting.
Mechanistic interpretability, superposition, and what we can and can't know about model internals.
Anthropic's 2023 paper "Towards Monosemanticity" used sparse autoencoders to decompose neural network activations into interpretable features. They found that individual neurons were not the right unit of analysis — single neurons responded to multiple unrelated concepts ("polysemantic neurons"). The right unit was features: directions in activation space that corresponded to identifiable concepts. A single neuron might contribute to hundreds of different features; a single feature might involve thousands of neurons. The architecture of knowledge in a large model is radically non-modular.
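For readers unfamiliar with sparse autoencoders, the sketch below shows their general shape: an encoder that maps a vector of neuron activations to a larger set of non-negative feature activations, a decoder that reconstructs the original vector, and a loss that rewards accurate reconstruction while penalizing the number of active features. It is an illustration only, not the paper's configuration: the sizes, the random untrained weights, the penalty weight, and all names are assumptions.

```python
import random

random.seed(0)
N_NEURONS, N_FEATURES = 4, 16  # more dictionary features than neurons (illustrative sizes)

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

encoder = random_matrix(N_FEATURES, N_NEURONS)  # untrained, random weights
decoder = random_matrix(N_NEURONS, N_FEATURES)
encoder_bias = [0.0] * N_FEATURES

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def encode(activation):
    """Feature activations; ReLU keeps them non-negative so many can be exactly zero."""
    pre = matvec(encoder, activation)
    return [max(0.0, p + b) for p, b in zip(pre, encoder_bias)]

def decode(features):
    """Reconstruct neuron activations as a weighted sum of feature directions."""
    return matvec(decoder, features)

def sae_loss(activation, l1_weight=0.01):
    features = encode(activation)
    reconstruction = decode(features)
    squared_error = sum((a - r) ** 2 for a, r in zip(activation, reconstruction))
    sparsity = sum(features)  # L1 norm; the features are already non-negative
    return squared_error + l1_weight * sparsity

activation = [0.3, -1.2, 0.8, 0.1]  # stand-in for one vector of neuron activations
features = encode(activation)
loss = sae_loss(activation)
```

Training would adjust `encoder`, `decoder`, and `encoder_bias` to lower this loss over many activation vectors. Each column of the decoder then plays the role of one feature direction in activation space.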
Neural networks can represent more features than they have neurons through superposition: multiple features are encoded as overlapping, non-orthogonal directions in activation space. This makes individual neurons difficult to interpret, because each one responds to several unrelated concepts (it is polysemantic), but it allows the network to represent far more features than its neuron count suggests.
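A small numerical experiment can make superposition less abstract. The sketch below assumes random unit vectors as feature directions and a made-up pair of active features; it is a geometric illustration, not a model of any trained network. It packs 512 feature directions into a 128-dimensional space, activates two of them, and reads every feature back by projection.

```python
import math
import random

random.seed(0)
DIM, N_FEATURES = 128, 512  # four times more features than dimensions (illustrative sizes)

def random_unit_vector(dim):
    vec = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

directions = [random_unit_vector(DIM) for _ in range(N_FEATURES)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Only a few features are active at once, which is what keeps interference small.
active = {3, 7}
activation = [sum(directions[f][i] for f in active) for i in range(DIM)]

# Read each feature back by projecting the activation onto its direction.
readout = [dot(activation, direction) for direction in directions]
ranked = sorted(range(N_FEATURES), key=lambda f: readout[f], reverse=True)
recovered = set(ranked[:len(active)])
worst_interference = max(abs(readout[f]) for f in range(N_FEATURES) if f not in active)
```

Because random directions in a high-dimensional space are nearly orthogonal, the two active features read back close to 1 while inactive features show only small interference. Each of the 128 coordinates, the analogue of a neuron, carries a piece of all 512 directions, which is why looking at one coordinate alone is uninformative.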
The radical non-modularity of knowledge in large models means:
If we can't interpret what a model has learned, we can't verify that it hasn't learned dangerous goals, incorrect values, or deceptive behaviors. Interpretability research is foundational to AI safety — not just academic curiosity.
5 questions — free, untracked, retake anytime.
What is 'superposition' in neural network activations?
What is a 'polysemantic neuron'?
Why can't you inspect 'the part of the model that knows about chemistry'?
Why is mechanistic interpretability foundational to AI safety, not just academic curiosity?
What are sparse autoencoders used for in interpretability research?
Analyze the relationship between interpretability research and AI safety.
Analyze the relationship between mechanistic interpretability and AI safety.
Constitutional AI, RLAIF, preference models, and advanced alignment techniques.
Anthropic introduced Constitutional AI (CAI) as an alternative to pure RLHF: instead of relying solely on human preference ratings, the model is trained using a written set of principles (a "constitution") and self-critique. The model generates outputs, critiques them against the constitution, revises them, and the revised outputs are used for training. This allows values to be encoded more explicitly and scalably than through human annotation alone — though the resulting behavior still depends critically on what the constitution says and how the model interprets it.
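The generate, critique, revise loop described above can be sketched as follows. Everything here is a hypothetical stand-in: `CONSTITUTION` holds two invented principles, the prompt templates are made up, and `stub_model` replaces a real language model so the example runs. It shows only the control flow of the self-critique stage, not Anthropic's actual prompts or training code.

```python
CONSTITUTION = [
    "Avoid insulting the user.",
    "Acknowledge uncertainty instead of guessing.",
]  # hypothetical principles, for illustration only

def critique_and_revise(model, prompt, principles):
    """Return (prompt, revised_response) after one critique-and-revision pass per principle."""
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        response = model(
            "Rewrite the response to address the critique.\n"
            f"Response: {response}\nCritique: {critique}"
        )
    return prompt, response

def stub_model(text):
    """Stand-in for a language model call; it only labels what kind of request it received."""
    if text.startswith("Critique"):
        return "critique"
    if text.startswith("Rewrite"):
        return "revised draft"
    return "first draft"

training_pair = critique_and_revise(stub_model, "Explain photosynthesis.", CONSTITUTION)
```

The returned pair of original prompt and final revision is what would be added to the training set. The sketch also makes the dependency noted above visible: the result is only as good as the principles passed in and the model's own reading of them.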
Each alignment approach has characteristic failure modes:
All current alignment techniques are partial solutions with characteristic failure modes. The field is actively developing better approaches — but alignment is an unsolved research problem.
5 questions — free, untracked, retake anytime.
What distinguishes Constitutional AI from standard RLHF?
What is RLAIF, and why was it developed?
What is the failure mode of 'reward model misgeneralization'?
Why is Constitutional AI's 'constitutional rigidity' a failure mode?
Why is alignment described as 'an unsolved research problem' despite available techniques?
Analyze the tradeoffs between different AI alignment approaches.
Analyze alignment approaches and their tradeoffs.
Capability elicitation, phase transitions, and the governance implications of emergent abilities.
In 2022, researchers at Google and collaborating institutions published "Emergent Abilities of Large Language Models", documenting that certain capabilities seemed to appear abruptly at scale rather than improving gradually. Models below a certain scale showed near-random performance on tasks such as multi-step arithmetic; models above the threshold performed substantially better. The paper sparked debate: were these capabilities truly emergent (phase transitions), or artifacts of evaluation metrics that jumped from near-zero to measurable at a threshold? The debate mattered for governance: if capabilities emerge unpredictably, they are hard to anticipate in advance.
In the original framing, a capability is "emergent" if it is not present in smaller models and appears to arise as a phase transition at scale rather than improving gradually. Examples cited included:
Later research challenged this framing: some apparently emergent capabilities showed smooth improvement under different metrics — the apparent emergence was partly an artifact of binary (pass/fail) evaluation.
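A short calculation shows how a pass/fail metric can manufacture an apparent jump. The sketch assumes a hypothetical series of models whose per-token accuracy improves smoothly, an answer that is ten tokens long, and independent token errors; all three are simplifying assumptions chosen for illustration, not measurements from any study.

```python
ANSWER_LENGTH = 10  # tokens that must all be right for an exact match (assumed)

# Hypothetical per-token accuracies for a series of increasingly large models.
per_token_accuracy = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

# If token errors are independent, the whole answer is right with probability p ** k.
exact_match = [p ** ANSWER_LENGTH for p in per_token_accuracy]

for p, em in zip(per_token_accuracy, exact_match):
    print(f"per-token {p:.2f} -> exact match {em:.3f}")
```

Per-token accuracy climbs steadily from 0.50 to 0.99, but the exact-match score stays below 3% for the first three models and then rises steeply. The underlying ability improved gradually throughout; the all-or-nothing metric hid that progress until it crossed a threshold.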
Whether capabilities are truly emergent or just sub-threshold has significant governance implications:
If dangerous capabilities can emerge unpredictably at scale, responsible governance may require holding back deployment until adequate evaluation frameworks exist — regardless of competitive pressure.
5 questions — free, untracked, retake anytime.
What was the original claim of 'emergent capabilities' in language models?
What was the key challenge to the emergent capabilities framing?
Why does the question of 'truly emergent vs. sub-threshold' matter for governance?
What does 'capability elicitation' mean in the context of emergent capabilities?
What is the responsible governance implication if dangerous capabilities can emerge unpredictably at scale?
Analyze governance frameworks for capability emergence risk.
Analyze the governance implications of potentially unpredictable capability emergence.
Symbol grounding, syntax vs. semantics, and what AI systems fundamentally do and don't do.
Philosopher John Searle's 1980 "Chinese Room" thought experiment proposed: imagine a person in a room with a rulebook for responding to Chinese characters with Chinese characters, following purely syntactic rules. To someone outside the room, responses appear to demonstrate understanding of Chinese. But the person inside understands nothing — they're just following symbol-manipulation rules. Searle's argument: syntax (symbol manipulation) is not sufficient for semantics (meaning and understanding). Whether this applies to modern AI remains one of philosophy of mind's most contested questions.
Language models manipulate symbols (tokens) according to learned statistical rules. But symbols derive their meaning from their relationship to the world — from grounding. A model trained only on text has never seen a dog, smelled coffee, or experienced pain. Does it understand these things, or does it only manipulate symbols that correlate with them?
The answer affects how we should use and govern AI:
We don't know whether large language models understand in any meaningful sense. This uncertainty should inform both how we deploy them and how we talk about them.
5 questions — free, untracked, retake anytime.
What does Searle's Chinese Room argument claim?
What is the 'symbol grounding problem' for language models?
What is the 'weak claim' about AI understanding?
Why does the genuine uncertainty about AI understanding argue for epistemic humility?
What is the practical governance implication of AI potentially lacking genuine understanding?
Synthesize the module and develop your AI understanding position.
Synthesize the module and develop your position on AI understanding and its governance implications.
8 questions covering all lessons. Free, untracked, retake anytime.
Self-attention in transformers computes:
In neural network training, backpropagation computes:
What does poor calibration mean for language model outputs?
Jailbreaks work mechanistically because:
What is superposition in neural network activations?
Constitutional AI differs from RLHF by:
The governance implication of potentially emergent dangerous capabilities is:
The symbol grounding problem for AI is: