On 30 November 2020, the organisers of CASP14, the biennial protein-structure prediction competition, announced results that stunned the field (the meeting itself was held online that year). A team from DeepMind (now Google DeepMind) had submitted predictions with a median accuracy score of 92.4 GDT, a metric out of 100. The previous best, achieved two years earlier, was around 40. One veteran structural biologist told Science magazine: "I was speechless. It's not just a small improvement; it has essentially solved the problem."
The problem in question had stood for more than fifty years. Anfinsen's dogma, the work recognised by the 1972 Nobel Prize in Chemistry, held that a protein's amino-acid sequence completely determines its three-dimensional shape. But no one could reliably compute that shape from the sequence alone.
Proteins are the molecular machines of life. They act as enzymes that catalyse chemical reactions, as structural scaffolds (collagen, keratin), as signalling molecules (insulin), as immune defenders (antibodies), and as motors that contract muscle. The human genome encodes roughly 20,000 different proteins.
Every protein is a chain of amino acids — small organic molecules linked end-to-end like beads on a string. There are 20 standard amino acids. A typical protein chain is 300–500 amino acids long; the largest known human protein, titin, spans about 34,000. The linear sequence is called the primary structure.
But an unfolded chain is not functional. Shortly after being synthesised by a ribosome, typically within milliseconds to seconds, a protein collapses into a precise, compact three-dimensional shape. That shape, called the native conformation or tertiary structure, is what allows the protein to do its job. A misfolded protein is usually useless at best and toxic at worst: misfolding underlies Alzheimer's disease, Parkinson's disease, cystic fibrosis, and hundreds of other conditions.
The combinatorial explosion is staggering. For a chain of just 100 amino acids, if each bond has even two possible orientations, the number of possible conformations exceeds 10^30. Cyrus Levinthal noted in 1969 that if a protein randomly sampled conformations at the rate of one per nanosecond, it would take longer than the age of the universe to find the right one, yet real proteins fold in milliseconds. This became known as Levinthal's paradox.
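The arithmetic behind Levinthal's estimate can be checked in a few lines. This is a back-of-envelope sketch: the two states per residue and the one-sample-per-nanosecond rate come from the paragraph above, while the age of the universe (about 13.8 billion years) and the function name are assumptions introduced here.

```python
# Back-of-envelope check of Levinthal's estimate.
# From the text: 100 residues, 2 orientations each, 1 conformation per nanosecond.
# Assumption added here: the universe is roughly 1.38e10 years old.

SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10


def exhaustive_search_years(n_residues, states_per_residue=2,
                            samples_per_second=1e9):
    """Years needed to visit every conformation once at the given rate."""
    conformations = states_per_residue ** n_residues
    return conformations / samples_per_second / SECONDS_PER_YEAR


years = exhaustive_search_years(100)
print(f"conformations: {2 ** 100:.3e}")
print(f"search time:   {years:.3e} years")
print(f"multiples of the age of the universe: {years / AGE_OF_UNIVERSE_YEARS:.1e}")
```

Even with only two states per residue, a deliberately generous simplification, the exhaustive search takes tens of trillions of years, which is why folding cannot be a random search.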
The main experimental method for determining protein structures, X-ray crystallography, is accurate but slow, expensive, and requires the protein to form a crystal, which many proteins refuse to do. As of 2020, roughly 170,000 structures were in the Protein Data Bank (PDB), covering perhaps 0.1% of known protein sequences.
Computational physics approaches attempted to simulate folding from first principles, but the energy landscapes were too complex and the required computing power astronomical. By the 2010s, machine-learning methods began mining evolutionary data: within a family of related proteins, certain pairs of amino-acid positions co-evolve, because a mutation at one position must be compensated by a mutation at the other to preserve the fold. This co-evolutionary signal, extracted from millions of known sequences, provided geometric constraints that could triangulate the shape.
The evolutionary record of billions of sequences is itself a compressed archive of structural information. AI methods that read this archive can infer physical geometry without simulating every atom explicitly.
Since 1994, the Critical Assessment of protein Structure Prediction (CASP) has run every two years. Experimentalists solve a set of protein structures using crystallography or cryo-electron microscopy, then share only the sequences — not the answers — with competing prediction groups. Groups submit models, and the answers are revealed at a conference. The GDT (Global Distance Test) score measures what fraction of predicted atom positions fall within a few ångströms of the experimentally determined positions.
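To make the score concrete, here is a simplified sketch of GDT_TS, the commonly reported variant, which averages the fraction of Cα atoms within 1, 2, 4 and 8 ångströms of their experimental positions. It assumes the two structures are already superimposed; the official calculation also searches over many superpositions to maximise each fraction, which is omitted here. The coordinates are made-up illustrative values.

```python
import math


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0-100) for pre-superimposed C-alpha coordinates.

    Assumption: both structures share one coordinate frame already.
    The real score additionally optimises the superposition per cutoff.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    fractions = [sum(d <= c for d in distances) / len(distances)
                 for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Illustrative data: four residues, displaced by 0.5, 1.5, 3.0 and 10.0 angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 10.0, 0.0)]

score = gdt_ts(predicted, reference)
print(score)  # 56.25
```

One residue placed 10 ångströms away costs the model a quarter of every threshold, which is why scores above 90 require almost every residue to be placed nearly perfectly.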
For 26 years, progress was real but slow. CASP13 in 2018 saw DeepMind's first AlphaFold system (now called AlphaFold 1) win convincingly but still with major errors. CASP14 in 2020 was different. AlphaFold 2's results were, as Nature put it when the paper appeared in July 2021, "a solution to the 50-year-old grand challenge of protein structure prediction."
CASP14 target T1049 was a bacterial protein with no close relatives in the PDB. AlphaFold 2 produced a model with a Cα RMSD (root mean square deviation) of 0.96 ångströms from the crystal structure — roughly the width of a single hydrogen atom off.
You are a junior researcher encountering protein structure for the first time. Use the AI assistant below to deepen your understanding of the folding problem, Levinthal's paradox, and why the CASP competition mattered.
After their CASP13 win in 2018, the AlphaFold team, led by John Jumper under DeepMind co-founder and chief executive Demis Hassabis, tore down their model and rebuilt it from scratch. The new architecture, AlphaFold 2, would be published in Nature in July 2021. It combined three components that had never before been assembled for this problem: multiple sequence alignment to extract co-evolutionary signals, a transformer-based architecture called the Evoformer to reason about those signals jointly, and an equivariant structure module that predicted atom coordinates respecting the symmetries of 3D space.
AlphaFold 2 begins not with physics but with a database query. Given a target sequence, it searches three enormous sequence databases — UniRef90, BFD (Big Fantastic Database), and MGnify — to construct a Multiple Sequence Alignment (MSA): hundreds or thousands of related sequences from organisms across the tree of life.
Why? Because evolution is an experiment that ran for billions of years. If two positions in a protein must be close in 3D space, mutations at one position are often accompanied by compensating mutations at the other — to preserve the interface. Statistical analysis of which positions co-vary across species therefore encodes geometric constraints about the fold. AlphaFold 2 also searches the PDB for structural templates — related proteins already solved experimentally — to supplement this signal.
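The co-variation idea can be illustrated with mutual information between alignment columns, one of the simplest statistics used for this purpose. The sketch below is not AlphaFold 2's method: AlphaFold 2 learns the signal implicitly inside its network, and classical pipelines add corrections that are omitted here. The six-sequence alignment is invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    joint = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Toy alignment (hypothetical): column 1 (K/R) and column 3 (E/D) always
# change together, column 2 (L/I) varies independently, column 0 is conserved.
msa = ["AKLE", "AKLE", "ARLD", "ARLD", "AKIE", "ARID"]

scores = {(i, j): mutual_information(msa, i, j)
          for i, j in combinations(range(len(msa[0])), 2)}
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

Columns 1 and 3 score a full bit of mutual information while every other pair scores essentially zero, flagging them as candidates for being in contact in the folded structure.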
The core of AlphaFold 2 is the Evoformer, a stack of 48 transformer blocks that processes two representations simultaneously: an MSA representation (a matrix of MSA rows × amino acid positions) and a pair representation (a matrix of position × position relationships). These two representations update each other iteratively.
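One way information flows from the per-sequence representation to the pair representation is an outer-product mean: for each pair of positions, the feature vectors of the two positions are multiplied together and averaged over all sequences. The sketch below shows only that core operation with plain lists. The learned projections applied before and after it in the real network, and every other Evoformer sub-layer, are left out, and the tiny feature values are arbitrary.

```python
def outer_product_mean(msa_repr):
    """Turn per-sequence features into pairwise features.

    msa_repr[s][i] is a feature vector for sequence s at position i.
    Returns pair[i][j], the outer product of the features at positions
    i and j, averaged over sequences and flattened to length c * c.
    Simplification: learned linear projections are omitted.
    """
    n_seq = len(msa_repr)
    n_res = len(msa_repr[0])
    c = len(msa_repr[0][0])
    pair = [[[0.0] * (c * c) for _ in range(n_res)] for _ in range(n_res)]
    for row in msa_repr:
        for i in range(n_res):
            for j in range(n_res):
                for a in range(c):
                    for b in range(c):
                        pair[i][j][a * c + b] += row[i][a] * row[j][b] / n_seq
    return pair


# Arbitrary example: 2 sequences, 3 positions, 2 feature channels.
msa_repr = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
]
pair = outer_product_mean(msa_repr)
print(len(pair), len(pair[0]), len(pair[0][0]))  # 3 3 4
print(pair[0][1])  # [0.5, 0.5, 0.0, 0.0]
```

The output has one entry per pair of positions regardless of how many sequences went in, which is what lets evidence from thousands of aligned sequences accumulate in a fixed-size position × position map.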
The key mechanism is attention, the same family of operations that powers large language models. In AlphaFold, attention lets each residue "look at" every other residue and at patterns across the MSA, weighting relevance dynamically. After passing through all 48 Evoformer blocks, the pair representation encodes rich information about which residue pairs are likely to be close in space.
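For readers who have not met attention before, the basic operation (scaled dot-product attention) fits in a few lines. This is a generic sketch, not AlphaFold 2's variant, which adds multiple heads, gating, and biases derived from the pair representation. The three two-dimensional "residue" vectors are arbitrary.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[t] for w, v in zip(weights, values))
               for t in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Arbitrary feature vectors standing in for three residues.
residues = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(residues, residues, residues)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to one, so every residue's updated feature is a weighted blend of all residues, with the blend decided by the data rather than fixed in advance.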
Crucially, the model also recycles its own output: the predicted structure from one pass is fed back as input to the next, allowing the network to iteratively refine its prediction — much as a sculptor returns to a clay model to correct and smooth.
AlphaFold 2 has approximately 93 million trainable parameters. It was trained on ~170,000 known structures from the PDB, plus ~350,000 self-distillation predictions used as additional training signal.
The Evoformer outputs are fed to the Structure Module, which directly predicts the 3D coordinates of every backbone and side-chain atom. A critical design choice: the module uses equivariant operations — mathematical transformations that respect the symmetries of 3D space. If you rotate the entire protein, the predicted structure rotates with it rather than breaking. This means the network learns about geometry, not arbitrary coordinate systems.
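The symmetry property can be demonstrated numerically. In the sketch below, which is purely illustrative and not taken from AlphaFold's code, rotating a set of points leaves all pairwise distances unchanged (invariance), while a quantity such as the centroid rotates along with the points (equivariance). The coordinates and the choice of a rotation about the z-axis are arbitrary.

```python
import math


def rotate_z(points, angle):
    """Rotate 3D points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def centroid(points):
    """Mean position of the points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def pairwise_distances(points):
    """All distances between distinct pairs of points."""
    return [math.dist(p, q)
            for i, p in enumerate(points) for q in points[i + 1:]]


points = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.0, 1.0), (2.0, 5.0, -1.0)]
rotated = rotate_z(points, math.pi / 3)

# Invariant: internal geometry does not change under rotation.
invariant = all(math.isclose(a, b, abs_tol=1e-9)
                for a, b in zip(pairwise_distances(points),
                                pairwise_distances(rotated)))

# Equivariant: computing the centroid then rotating equals rotating then
# computing the centroid.
lhs = centroid(rotated)
rhs = rotate_z([centroid(points)], math.pi / 3)[0]
equivariant = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(lhs, rhs))

print(invariant, equivariant)  # True True
```

A network built from operations with these properties never has to learn separately that a protein and its rotated copy are the same object.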
The Structure Module represents each residue as a rigid-body "frame" — a local coordinate system — and updates these frames using Invariant Point Attention (IPA), a form of attention that operates in 3D space. The final output is a full-atom coordinate set plus confidence scores called pLDDT (predicted Local Distance Difference Test), which tell researchers how confident the model is at each residue position.
AlphaFold 2 was trained on a cluster of 128 TPUv3 cores for approximately 11 days. The loss function combined FAPE (Frame Aligned Point Error — a measure of 3D coordinate accuracy), side-chain torsion angle errors, distogram errors, and auxiliary losses. The CASP14 evaluation served as a prospective validation on unseen targets — the most rigorous possible test, because the experimental structures had not been released when predictions were made.
Jumper et al., "Highly accurate protein structure prediction with AlphaFold," Nature 596, 583–589 (2021). As of 2024 it has been cited over 25,000 times — among the most impactful papers of the decade.
You've just read about the Evoformer and Structure Module. Now dig deeper with the AI assistant. The goal is to understand why each design choice was made — not just what it is.
On 22 July 2021, a week after the AlphaFold 2 paper appeared in Nature, DeepMind and the European Bioinformatics Institute (EMBL-EBI) jointly released the AlphaFold Protein Structure Database, with predicted structures for the entire human proteome (about 20,000 proteins) and the proteomes of 20 other important organisms. Six months later, in January 2022, the database expanded to cover 48 organisms. By July 2022, it had exploded to over 200 million predicted structures, covering essentially every protein in UniProt, the universal protein sequence database.
The Protein Data Bank, accumulated over five decades of experimental work, contained roughly 170,000 structures. In about a year, AlphaFold had increased the number of available protein structures more than 1,000-fold, albeit with predicted rather than experimentally determined models.
Every entry in the AlphaFold Database includes: the predicted 3D coordinates in standard PDB format; pLDDT confidence scores colour-coded on the structure (blue = high confidence, orange = low); PAE (Predicted Aligned Error) maps showing confidence in the relative positions of domain pairs; and a downloadable structure file compatible with standard molecular visualisation tools like PyMOL and ChimeraX.
Researchers access structures via a simple web interface at alphafold.ebi.ac.uk, or programmatically via an API. A UniProt accession number is all that is required.
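Once a structure file has been downloaded, the per-residue confidence is easy to extract, because AlphaFold model files store pLDDT in the B-factor column of each PDB ATOM record. The sketch below reads that column for Cα atoms and assigns the database's four confidence bands (above 90, 70 to 90, 50 to 70, below 50). The helper that builds sample records exists only to create demo data, and the residues and scores are invented; no network access or database API call is shown.

```python
def format_ca_line(serial, res_name, res_seq, plddt):
    """Build a fixed-column PDB ATOM record for a C-alpha atom (demo data only)."""
    return (f"ATOM  {serial:5d}  CA  {res_name:>3s} A{res_seq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


def read_plddt(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor field of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # PDB is a fixed-column format: atom name in columns 13-16,
        # residue number in 23-26, B-factor in 61-66 (1-indexed).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Label a pLDDT value with the band used for colouring in the database."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


# Invented three-residue example standing in for a downloaded file.
sample = "\n".join([
    format_ca_line(2, "MET", 1, 95.2),
    format_ca_line(10, "LYS", 2, 82.0),
    format_ca_line(19, "GLY", 3, 41.5),
])

plddt = read_plddt(sample)
bands = {res: confidence_band(score) for res, score in plddt.items()}
print(plddt)
print(bands)
```

Filtering on these bands before any downstream analysis is a common first step, since low-scoring segments are frequently flexible or disordered rather than reliably modelled.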
Malaria vaccine target (2022). Researchers at the Wellcome Sanger Institute used AlphaFold structures of Plasmodium falciparum proteins — the malaria parasite — to identify RH5 interacting partners, accelerating the rational design of blood-stage vaccine candidates. The AlphaFold models provided atomic-level detail for proteins that had resisted crystallisation for years.
Antibiotic resistance (2022). A team at the University of California, San Francisco used AlphaFold structures of bacterial efflux pump proteins to identify new inhibitor binding pockets. Efflux pumps are a key mechanism by which bacteria expel antibiotics; structurally guided drug design against them had been limited by incomplete structural data. AlphaFold predictions filled the gaps.
Evolutionary dark matter (2022–2023). By comparing AlphaFold structures of proteins with no detectable sequence similarity, researchers discovered structural homologues: proteins with similar folds that had diverged so far in sequence that sequence-based methods could not detect the relationship. A landmark 2023 paper in Nature identified new protein superfamilies by clustering AlphaFold structures, revealing previously hidden evolutionary connections across the tree of life.
Intrinsically disordered proteins. Low-pLDDT regions in AlphaFold predictions are often not failures: in many cases they flag proteins or protein segments that are genuinely disordered under physiological conditions. This realisation has prompted a reclassification of thousands of proteins previously assumed to have fixed folds.
The AlphaFold Database is free to access at alphafold.ebi.ac.uk. All structures are released under Creative Commons CC BY 4.0: anyone, anywhere, may use them for any purpose, including commercially, provided they give appropriate attribution.
AlphaFold predicts single-chain structures in isolation. It does not predict: how proteins interact with each other (protein-protein complexes), how proteins bind small molecules (ligands) or DNA, conformational changes upon binding, or the effects of post-translational modifications such as phosphorylation and glycosylation. A predicted structure represents one conformation — typically the apo (ligand-free) ground state — not the full dynamic behaviour of the protein.
For drug discovery, the binding pocket geometry matters enormously: a structure solved in complex with a ligand may differ significantly from the apo AlphaFold model. Experimental structures remain essential for many applications, and AlphaFold is best understood as a powerful complement to rather than replacement for crystallography and cryo-EM.
By 2023 the AlphaFold Database contained predicted structures for about 214 million proteins, covering nearly every sequence in UniProt rather than only the 48 organisms of the early releases. The Protein Data Bank's entire 50-year experimental archive, by comparison, holds roughly 220,000 structures: about a thousand times fewer.
You are a structural biologist given access to the AlphaFold Database for the first time. Use the AI assistant to figure out how you would actually use AlphaFold predictions in a real research project — including where the limitations would bite you.
The year after AlphaFold's triumph, David Baker's lab at the University of Washington published RoseTTAFold, an independently developed system whose accuracy approached that of AlphaFold 2, using a three-track architecture that simultaneously processes sequence, distance, and coordinate information. Baker, who in 2024 would share the Nobel Prize in Chemistry with Demis Hassabis and John Jumper, then turned the same AI tools toward the inverse problem: not predicting how a sequence folds, but designing sequences that fold into desired shapes, yielding proteins that do not exist in nature.
DeepMind's own extension, published in late 2021, addressed AlphaFold's most significant limitation: the inability to model protein complexes. AlphaFold-Multimer takes multiple protein chains as input and predicts their assembled quaternary structure. Early benchmarks on protein-protein complexes showed performance significantly better than existing methods, though generally somewhat below single-chain accuracy.
AlphaFold-Multimer has been used to model: antibody-antigen complexes relevant to vaccine design; ribosomal sub-complexes to understand translation; viral spike protein interactions with host cell receptors; and large multi-subunit enzyme assemblies. The 2022 paper by Evans et al. reported a median interface RMSD of 4.0 ångströms on benchmark targets — imperfect but transformative compared to the near-absence of reliable computational tools before.
RoseTTAFold, published in Science in August 2021, demonstrated that the core ideas were reproducible independently and introduced the three-track architecture. Baker's lab released the code publicly on GitHub, and it became the basis for several subsequent tools. RoseTTAFold All-Atom (2024) extended the approach to predict not only protein structures but also the positions of bound small molecules and nucleic acids — addressing AlphaFold's blind spot for ligand interactions.
The inverse folding problem — design a sequence that adopts a target shape — is arguably more commercially important than prediction. Baker's lab has demonstrated several milestones:
Hallucination and inpainting (2021–2022). Using RoseTTAFold as a differentiable "oracle," researchers generated entirely novel protein sequences by gradient descent — iteratively adjusting sequence tokens until the predicted structure matched a target geometry. Related inpainting approaches fill in unknown protein segments while fixing known scaffolds.
ProteinMPNN (2022). A message-passing neural network trained to design sequences for given backbone structures. ProteinMPNN dramatically outperformed previous computational design methods: experimentally tested designs folded correctly at rates above 50%, compared to single-digit percentages for earlier methods.
RFdiffusion (2023). A diffusion model (related to the technology behind image generators like DALL-E) adapted to generate protein backbones de novo — creating entirely new protein architectures not found in nature. Designed binders produced by RFdiffusion showed nanomolar affinity for target proteins in laboratory tests, approaching pharmaceutical-grade performance.
The 2024 Nobel Prize in Chemistry was awarded one half to David Baker "for computational protein design" and the other half jointly to Demis Hassabis and John Jumper "for protein structure prediction". The Royal Swedish Academy of Sciences said that Hassabis and Jumper had developed an AI model "to solve a 50-year-old problem" and that Baker had "succeeded with the almost impossible feat of building entirely new kinds of proteins."
AlphaFold 3, published in Nature in May 2024, extended the system to predict structures of protein–DNA, protein–RNA, protein–ligand, and protein–ion complexes. It uses a diffusion-based architecture (replacing the Structure Module) and achieves state-of-the-art accuracy on protein-ligand docking benchmarks, directly addressing the drug-discovery gap. Access was initially restricted to a web server, with no code released, sparking debate in the scientific community about open science.
ESMFold, released by Meta AI in 2022, uses a single large protein language model (ESM-2, with roughly 3 billion parameters in the version used) to predict structures directly from sequence without requiring an MSA, making it roughly an order of magnitude faster than AlphaFold 2 at some accuracy cost. ESMFold was used to predict structures for more than 600 million proteins in the MGnify metagenomic database, expanding the structural atlas of microbial life.
The protein-structure revolution is one chapter in a larger story: AI systems trained on the data produced by 50 years of experimental science are now generating new scientific knowledge faster than any human team could. The same pattern — large datasets, self-supervised learning, transformer architectures — is appearing in genomics, drug discovery, materials science, climate modelling, and particle physics. AlphaFold was the proof of concept that this paradigm works at the highest level of scientific difficulty.
AlphaFold solved folding. Now you're a researcher or entrepreneur thinking about what comes next. Use the AI assistant to explore protein design, the limitations of current tools, and where the scientific and commercial opportunities lie.