Lesson 1 · AI in Science — Module 2

The Protein Folding Problem

Fifty years of biology's hardest puzzle — and the AI that solved it in eighteen months.

Why does the shape of a protein determine everything it does, and why was predicting that shape so hard?

On 30 November 2020, the organisers of CASP14, the biennial protein-structure prediction competition, announced results that stunned the field. A team from DeepMind (now Google DeepMind) had submitted predictions with a median accuracy score of 92.4 GDT, a metric out of 100. The previous best, achieved two years earlier, was around 40. One veteran structural biologist told Science magazine: "I was speechless. It's not just a small improvement; it has essentially solved the problem."

The problem in question had stood for more than fifty years. Anfinsen's dogma, the idea behind Christian Anfinsen's 1972 Nobel Prize in Chemistry, holds that a protein's amino-acid sequence completely determines its three-dimensional shape. But no one could reliably compute that shape from the sequence alone.

What Is a Protein?

Proteins are the molecular machines of life. They act as enzymes that catalyse chemical reactions, as structural scaffolds (collagen, keratin), as signalling molecules (insulin), as immune defenders (antibodies), and as motors that contract muscle. The human genome encodes roughly 20,000 different proteins.

Every protein is a chain of amino acids — small organic molecules linked end-to-end like beads on a string. There are 20 standard amino acids. A typical protein chain is 300–500 amino acids long; the largest known human protein, titin, spans about 34,000. The linear sequence is called the primary structure.

But an extended chain is not functional. Within microseconds to seconds of being synthesised by a ribosome, a typical protein collapses into a precise, compact three-dimensional shape. That shape, called the native conformation or tertiary structure, is what allows the protein to do its job. A misfolded protein is usually useless at best and toxic at worst: misfolding underlies Alzheimer's disease, Parkinson's disease, cystic fibrosis, and hundreds of other conditions.

Primary structure: The linear sequence of amino acids encoded in DNA.

Secondary structure: Local folded motifs (alpha-helices and beta-sheets) stabilised by hydrogen bonds.

Tertiary structure: The full 3D fold of a single protein chain.

Quaternary structure: The assembly of multiple protein chains into a functional complex (e.g., haemoglobin's four subunits).

Why Was Prediction So Hard?

The combinatorial explosion is staggering. For a chain of just 100 amino acids, if each bond has even two possible orientations, the number of possible conformations exceeds 10³⁰. Cyrus Levinthal noted in 1969 that if a protein randomly sampled conformations at the rate of one per nanosecond, it would take longer than the age of the universe to find the right one — yet real proteins fold in milliseconds. This became known as Levinthal's paradox.
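Levinthal's argument is easy to check with a few lines of arithmetic. The sketch below uses the figures quoted above (100 two-state choices, one conformation sampled per nanosecond); the age of the universe in seconds is an added assumption of roughly 13.8 billion years.

```python
# Back-of-envelope version of Levinthal's argument, using the numbers in the text.
N_BONDS = 100                  # one two-state choice per residue, as in the text
STATES_PER_BOND = 2
SAMPLES_PER_SECOND = 1e9       # one conformation per nanosecond
AGE_OF_UNIVERSE_S = 4.35e17    # assumption: about 13.8 billion years, in seconds

conformations = STATES_PER_BOND ** N_BONDS
search_seconds = conformations / SAMPLES_PER_SECOND
universe_ages = search_seconds / AGE_OF_UNIVERSE_S

print(f"conformations: {conformations:.2e}")
print(f"search time:   {search_seconds:.2e} s")
print(f"universe ages: {universe_ages:.0f}")
```

Under these assumptions an exhaustive search would take a few thousand times the age of the universe, which is why folding cannot be a random search.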

The workhorse experimental method for determining protein structures, X-ray crystallography, is accurate but slow, expensive, and requires the protein to form a crystal, which many proteins refuse to do. As of 2020, roughly 170,000 structures were in the Protein Data Bank (PDB), covering perhaps 0.1% of known protein sequences.

Computational physics approaches attempted to simulate folding from first principles, but the energy landscapes were too complex and the required computing power astronomical. By the 2010s, machine-learning methods began mining evolutionary data: proteins that share a function tend to share certain amino-acid pairs that co-evolve because mutations in one position must be compensated by mutations elsewhere to preserve the fold. This co-evolutionary signal, extracted from millions of known sequences, provided geometric constraints that could triangulate the shape.

Key Insight

The evolutionary record of billions of sequences is itself a compressed archive of structural information. AI methods that read this archive can infer physical geometry without simulating every atom explicitly.

The CASP Competition

Since 1994, the Critical Assessment of protein Structure Prediction (CASP) has run every two years. Experimentalists solve a set of protein structures using crystallography or cryo-electron microscopy, then share only the sequences, not the answers, with competing prediction groups. Groups submit models, and the answers are revealed at a conference. The GDT (Global Distance Test) score, reported on a scale of 0 to 100, measures what fraction of a model's Cα atoms fall within set distance cutoffs (1, 2, 4 and 8 ångströms in the common GDT_TS variant) of their experimentally determined positions.
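A simplified GDT_TS calculation can be sketched in a few lines of Python. This is an illustration rather than the official CASP scoring code: it assumes the predicted and experimental Cα coordinates are already superimposed, whereas the real measure searches over many superpositions to find the best one. The coordinates and the function name are made up for the example.

```python
import math

GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)  # distance cutoffs in angstroms

def gdt_ts(predicted, reference, cutoffs=GDT_TS_CUTOFFS):
    """Simplified GDT_TS for two pre-superimposed lists of C-alpha coordinates."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    # For each cutoff, the fraction of residues placed within that distance.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Toy example: four residues, with errors of 0.5, 1.5, 3.0 and 9.0 angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]

score = gdt_ts(predicted, reference)
print(f"GDT_TS = {score:.2f}")
```

Averaging over several cutoffs rewards models that are roughly right everywhere as well as models that are very precise in places, which is why a single badly placed residue lowers the score without zeroing it.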

For 26 years, progress was real but slow. CASP13 in 2018 saw DeepMind's first AlphaFold system (now called AlphaFold 1) win convincingly but still with major errors. CASP14 in 2020 was different. AlphaFold 2's results were, as Nature put it when the method was published in July 2021, "a solution to the 50-year-old grand challenge of protein structure prediction."

Historical Record

CASP14 target T1049 was a bacterial protein with no close relatives in the PDB. AlphaFold 2 produced a model with a Cα RMSD (root mean square deviation) of 0.96 ångströms from the crystal structure — roughly the width of a single hydrogen atom off.

Lesson 1 Quiz

The Protein Folding Problem — check your understanding

1. What does Anfinsen's dogma state about proteins?

Correct. Anfinsen's dogma, recognised by his 1972 Nobel Prize, holds that the primary sequence encodes all the information needed to reach the native fold.

Not quite. Anfinsen showed that the amino-acid sequence alone determines the native three-dimensional fold — the protein carries all the folding information in its sequence.

2. What is Levinthal's paradox?

Correct. Levinthal (1969) noted this paradox: the conformational space is astronomically large, yet folding is extraordinarily fast, implying proteins follow guided pathways rather than random searches.

Levinthal's paradox refers to the contradiction between the enormous number of possible conformations and the millisecond timescale of actual folding.

3. What score did AlphaFold 2 achieve at CASP14, and what was the previous best?

Correct. The jump from ~40 to 92.4 GDT was described by structural biologists as effectively solving the problem — not an incremental improvement but a qualitative leap.

AlphaFold 2 scored 92.4 GDT at CASP14; the previous record was roughly 40 — a gap that was described as paradigm-shifting.

4. Which level of protein structure describes the precise 3D fold of a single chain?

Correct. Tertiary structure is the full three-dimensional conformation of a single polypeptide chain, including how all its secondary-structure elements pack together.

Tertiary structure describes the full 3D fold of a single chain. Primary is the sequence; secondary is local motifs; quaternary is multi-chain assemblies.

Lab 1 · The Folding Problem

Discuss the protein-folding challenge with your AI research assistant

Your Task

You are a junior researcher encountering protein structure for the first time. Use the AI assistant below to deepen your understanding of the folding problem, Levinthal's paradox, and why the CASP competition mattered.

Suggested opening: "Why can't we just simulate protein folding using physics equations? What makes it computationally impossible?" — then follow the conversation wherever your curiosity leads.

AI Research Assistant

Protein Folding

Hello! I'm your AI research assistant for this module. I'm here to help you explore the protein-folding problem — one of biology's most famous grand challenges. What would you like to understand first?

Lesson 2 · AI in Science — Module 2

How AlphaFold 2 Works

Attention mechanisms, evolutionary databases, and the neural architecture that changed structural biology.

What specific AI techniques did DeepMind use — and why did they succeed where decades of physics and prior machine learning had not?

After their CASP13 win in 2018, the AlphaFold team, led by John Jumper under DeepMind co-founder and CEO Demis Hassabis, tore down their model and rebuilt it from scratch. The new architecture, AlphaFold 2, would be published in Nature in July 2021. It combined three components that had not previously been assembled for this problem: multiple sequence alignments to extract co-evolutionary signals, a transformer-based architecture called the Evoformer to reason about those signals jointly, and an equivariant structure module that predicted atom coordinates while respecting the symmetries of 3D space.

Step 1 — Mining Evolutionary History

AlphaFold 2 begins not with physics but with a database query. Given a target sequence, it searches three enormous sequence databases — UniRef90, BFD (Big Fantastic Database), and MGnify — to construct a Multiple Sequence Alignment (MSA): hundreds or thousands of related sequences from organisms across the tree of life.

Why? Because evolution is an experiment that ran for billions of years. If two positions in a protein must be close in 3D space, mutations at one position are often accompanied by compensating mutations at the other — to preserve the interface. Statistical analysis of which positions co-vary across species therefore encodes geometric constraints about the fold. AlphaFold 2 also searches the PDB for structural templates — related proteins already solved experimentally — to supplement this signal.
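The co-variation idea can be illustrated with a classic statistic, mutual information between alignment columns. This is a teaching sketch, not how AlphaFold 2 works internally: the network learns the signal end to end, and dedicated co-evolution methods use more careful statistics. The toy alignment and the function names are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations

def column(msa, index):
    """Return one alignment column as a list of residues."""
    return [sequence[index] for sequence in msa]

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    total = 0.0
    for (a, b), count in count_ab.items():
        p_ab = count / n
        total += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return total

def top_covarying_pair(msa):
    """Return the pair of column indices with the highest mutual information."""
    length = len(msa[0])
    scores = {
        (i, j): mutual_information(column(msa, i), column(msa, j))
        for i, j in combinations(range(length), 2)
    }
    return max(scores, key=scores.get)

# Toy alignment: positions 1 and 3 always change together (K with E, R with D).
toy_msa = ["AKLEG", "AKLEG", "ARLDG", "ARLDG", "AKVEG", "ARVDG"]

best_pair = top_covarying_pair(toy_msa)
best_score = mutual_information(column(toy_msa, 1), column(toy_msa, 3))
print(best_pair, round(best_score, 3))
```

In the toy alignment the lysine/arginine switch at position 1 is always matched by a glutamate/aspartate switch at position 3, so that pair stands out while the conserved and independently varying columns score near zero.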

MSA: Multiple Sequence Alignment, a matrix of related sequences aligned by position, used to identify co-evolving residue pairs.

Co-evolution: The correlated change of amino-acid pairs across species, used to infer spatial proximity in the folded structure.

Step 2 — The Evoformer

The core of AlphaFold 2 is the Evoformer, a stack of 48 transformer blocks that processes two representations simultaneously: an MSA representation (a matrix of MSA rows × amino acid positions) and a pair representation (a matrix of position × position relationships). These two representations update each other iteratively.

The key mechanism is attention, the same family of operations that powers large language models. In AlphaFold, attention lets each residue "look at" every other residue and at patterns across the MSA, weighting relevance dynamically. After passing through all 48 Evoformer blocks, the pair representation encodes rich information about which residue pairs are likely to be close in space.
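The basic operation can be sketched as plain scaled dot-product attention. This is the generic textbook form, not the Evoformer itself, which adds learned projections, multiple heads, gating and a bias from the pair representation. The two-dimensional residue embeddings below are hypothetical values chosen for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """How strongly one residue (the query) attends to each residue (the keys)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def attend(query, keys, values):
    """Weighted average of the value vectors, using the attention weights."""
    weights = attention_weights(query, keys)
    width = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(width)]

# Hypothetical 2-dimensional embeddings for four residues.
residues = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

weights = attention_weights(residues[0], residues)
updated = attend(residues[0], residues, residues)
print([round(w, 3) for w in weights])
```

Residue 0 gives most weight to itself and to the similar residue 1, and least to residue 3, whose embedding points the opposite way. The weights depend on the inputs, which is what "weighting relevance dynamically" means in practice.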

Crucially, the model also recycles its own output: the predicted structure from one pass is fed back as input to the next, allowing the network to iteratively refine its prediction — much as a sculptor returns to a clay model to correct and smooth.

Architecture Detail

AlphaFold 2 has approximately 93 million trainable parameters. It was trained on ~170,000 known structures from the PDB, plus ~350,000 self-distillation predictions used as additional training signal.

Step 3 — The Structure Module

The Evoformer outputs are fed to the Structure Module, which directly predicts the 3D coordinates of every backbone and side-chain atom. A critical design choice: the module uses equivariant operations — mathematical transformations that respect the symmetries of 3D space. If you rotate the entire protein, the predicted structure rotates with it rather than breaking. This means the network learns about geometry, not arbitrary coordinate systems.
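A small numerical check makes the distinction between equivariant and invariant quantities concrete. The sketch below is not AlphaFold code; it rotates a toy set of coordinates (made up for the example) about the z-axis and compares two simple functions: the centroid, which rotates along with the input, and the list of pairwise distances, which does not change at all.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by the given angle in radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def centroid(points):
    """Mean position of a set of points (an equivariant function)."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def pairwise_distances(points):
    """All inter-point distances (an invariant function)."""
    return [math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]

# Toy coordinates standing in for four residues.
points = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (8.2, 4.1, 1.5)]
angle = math.radians(40)
rotated = [rotate_z(p, angle) for p in points]

# Equivariance: processing the rotated input equals rotating the processed output.
centroid_of_rotated = centroid(rotated)
rotated_centroid = rotate_z(centroid(points), angle)

# Invariance: distances are the same before and after rotation.
distances_before = pairwise_distances(points)
distances_after = pairwise_distances(rotated)
print(centroid_of_rotated, rotated_centroid)
```

A network built from operations with these properties does not need to see the same protein in every possible orientation during training, because its answer cannot depend on the orientation in the first place.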

The Structure Module represents each residue as a rigid-body "frame" — a local coordinate system — and updates these frames using Invariant Point Attention (IPA), a form of attention that operates in 3D space. The final output is a full-atom coordinate set plus confidence scores called pLDDT (predicted Local Distance Difference Test), which tell researchers how confident the model is at each residue position.

pLDDT: Predicted Local Distance Difference Test, a per-residue confidence score from 0 to 100. Scores above 90 indicate high confidence; below 50 suggest the region may be intrinsically disordered.

Equivariance: A property of a neural network operation such that rotating the input produces a correspondingly rotated output, which is essential for correctly predicting 3D structures.

Training and Validation

AlphaFold 2 was trained on a cluster of 128 TPUv3 cores for approximately 11 days. The loss function combined FAPE (Frame Aligned Point Error — a measure of 3D coordinate accuracy), side-chain torsion angle errors, distogram errors, and auxiliary losses. The CASP14 evaluation served as a prospective validation on unseen targets — the most rigorous possible test, because the experimental structures had not been released when predictions were made.

Published Science

Jumper et al., "Highly accurate protein structure prediction with AlphaFold," Nature 596, 583–589 (2021). As of 2024 it has been cited over 25,000 times — among the most impactful papers of the decade.

Lesson 2 Quiz

How AlphaFold 2 Works — check your understanding

1. What is the Evoformer in AlphaFold 2?

Correct. The Evoformer is a stack of 48 transformer blocks that simultaneously update an MSA representation (MSA rows × positions) and a pair representation (position × position), allowing the model to reason about co-evolutionary geometry.

The Evoformer is a transformer-based neural network block — not a physics simulator or database. It processes the MSA and pair representations iteratively.

2. Why does AlphaFold 2 begin by searching sequence databases?

Correct. Co-evolving residue pairs in the MSA reveal which positions must be near each other in 3D space — evolutionary history encodes geometric information that the model extracts statistically.

The MSA search provides co-evolutionary signals. Positions that co-vary across thousands of species are likely to be in contact in the folded structure.

3. What does pLDDT measure in an AlphaFold prediction?

Correct. pLDDT (predicted Local Distance Difference Test) is a per-residue confidence score. Above 90 is high confidence; below 50 often signals intrinsically disordered regions rather than prediction failure.

pLDDT is a per-residue confidence score (0–100). It tells researchers which parts of the predicted structure to trust and which may reflect genuine disorder.

4. What does "equivariance" mean in the context of AlphaFold's Structure Module?

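Correct. Equivariance ensures the predicted structure transforms consistently under rotations and translations of the input, an essential geometric property for predicting physical 3D structures. Mirror images are deliberately not treated as equivalent, because proteins are chiral.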

Equivariance means the model's outputs transform consistently with rotations of the inputs — critical for a model predicting physical 3D structures that have no preferred orientation.

Lab 2 · Inside AlphaFold

Interrogate the architecture of AlphaFold 2 with your AI assistant

Your Task

You've just read about the Evoformer and Structure Module. Now dig deeper with the AI assistant. The goal is to understand why each design choice was made — not just what it is.

Suggested opening: "How is the attention mechanism in the Evoformer different from attention in a language model like GPT? What does it 'attend' to?" — then challenge the assistant with follow-ups.

AI Research Assistant

AlphaFold Architecture

Ready to go deeper into AlphaFold's architecture. The Evoformer is one of the most creative neural-network designs of the 2020s — let's unpack why it works. What would you like to explore?

Lesson 3 · AI in Science — Module 2

The AlphaFold Protein Structure Database

From 170,000 known structures to over 200 million predicted — and what science has done with them.

When AlphaFold's predictions were made freely available to every researcher on Earth, what actually changed in biology?

On 22 July 2021, a week after the AlphaFold 2 paper appeared online in Nature, DeepMind and the European Bioinformatics Institute (EMBL-EBI) jointly released the AlphaFold Protein Structure Database, with predicted structures for the human proteome (about 20,000 proteins) and the proteomes of 20 other important organisms. Six months later, in January 2022, the database expanded to cover 48 organisms. By July 2022, it had grown to over 200 million predicted structures, covering nearly every protein in UniProt, the universal protein sequence database.

The Protein Data Bank, accumulated over five decades of experimental work, contained roughly 170,000 structures. The AlphaFold Database had increased the number of available protein structure models more than 1,000-fold in about a year, although these are predictions rather than experimental measurements.

What the Database Contains

Every entry in the AlphaFold Database includes: the predicted 3D coordinates in standard PDB format; pLDDT confidence scores colour-coded on the structure (blue = high confidence, orange = low); PAE (Predicted Aligned Error) maps showing confidence in the relative positions of domain pairs; and a downloadable structure file compatible with standard molecular visualisation tools like PyMOL and ChimeraX.

Researchers access structures via a simple web interface at alphafold.ebi.ac.uk, or programmatically via an API. A UniProt accession number is all that is required.
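Once a structure file has been downloaded, the per-residue confidence can be read directly from it: in AlphaFold's PDB-format files the pLDDT value is stored in the B-factor column of each atom record. The sketch below parses that column using the fixed-width PDB layout. The helper that builds demo records, the coordinates and the function names are all invented for illustration, and the confidence bands follow the four ranges shown on the database's web pages.

```python
def demo_atom_line(serial, atom_name, res_name, res_num, plddt):
    """Build a synthetic fixed-width PDB ATOM record (placeholder coordinates)."""
    return (
        f"ATOM  {serial:>5} {atom_name:<4} {res_name:>3} A{res_num:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )

def plddt_per_residue(pdb_text):
    """Map residue number to pLDDT, read from the B-factor field of C-alpha atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue_number = int(line[22:26])      # PDB columns 23-26
            scores[residue_number] = float(line[60:66])  # PDB columns 61-66
    return scores

def confidence_band(plddt):
    """Label a pLDDT value with its confidence band."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"

# Synthetic three-residue file standing in for a downloaded model.
demo_pdb = "\n".join([
    demo_atom_line(1, " N  ", "MET", 1, 95.2),
    demo_atom_line(2, " CA ", "MET", 1, 95.2),
    demo_atom_line(3, " CA ", "LYS", 2, 74.0),
    demo_atom_line(4, " CA ", "GLY", 3, 38.5),
])

scores = plddt_per_residue(demo_pdb)
bands = {residue: confidence_band(value) for residue, value in scores.items()}
print(bands)
```

Slicing by column position rather than splitting on whitespace matters here, because adjacent PDB fields can run together when values are large.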

PAE: Predicted Aligned Error, a matrix showing AlphaFold's confidence in the relative position of every pair of residues, useful for identifying domain boundaries and flexible linkers.
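Reading a PAE matrix for domain structure amounts to comparing blocks of it. The following sketch uses a made-up 4 × 4 matrix (values in ångströms) in which residues 0 and 1 form one hypothetical domain and residues 2 and 3 another; real matrices have one row and column per residue, and the domain boundaries would come from inspection or a clustering step.

```python
def mean_pae(pae, rows, cols):
    """Average predicted aligned error over a rectangular block of the matrix."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)

# Made-up PAE matrix: low error inside each domain, high error between them.
pae = [
    [0.5, 1.2, 18.0, 21.0],
    [1.1, 0.5, 19.5, 20.5],
    [17.0, 22.0, 0.5, 1.4],
    [20.0, 19.0, 1.3, 0.5],
]
domain_a = range(0, 2)
domain_b = range(2, 4)

within_a = mean_pae(pae, domain_a, domain_a)
within_b = mean_pae(pae, domain_b, domain_b)
between = mean_pae(pae, domain_a, domain_b)
print(f"within A: {within_a:.2f}  within B: {within_b:.2f}  between: {between:.2f}")
```

Low error inside each block with high error between blocks suggests two well-predicted domains whose relative arrangement is uncertain, so the model's inter-domain orientation should not be trusted.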

Real Scientific Discoveries

Malaria vaccine target (2022). Researchers at the Wellcome Sanger Institute used AlphaFold structures of Plasmodium falciparum proteins — the malaria parasite — to identify RH5 interacting partners, accelerating the rational design of blood-stage vaccine candidates. The AlphaFold models provided atomic-level detail for proteins that had resisted crystallisation for years.

Antibiotic resistance (2022). A team at the University of California, San Francisco used AlphaFold structures of bacterial efflux pump proteins to identify new inhibitor binding pockets. Efflux pumps are a key mechanism by which bacteria expel antibiotics; structurally guided drug design against them had been limited by incomplete structural data. AlphaFold predictions filled the gaps.

Evolutionary dark matter (2022–2023). By comparing AlphaFold structures of proteins with no detectable sequence similarity, researchers discovered structural homologues: proteins with similar folds that had diverged so far in sequence that sequence-based methods could not detect the relationship. Landmark 2023 work published in Nature identified new protein families by clustering AlphaFold structures, revealing previously hidden evolutionary connections across the tree of life.

Intrinsically disordered proteins. Low-pLDDT regions in AlphaFold predictions are not necessarily failures: they often flag proteins or protein segments that are genuinely disordered under physiological conditions. This realisation has prompted a reclassification of thousands of proteins previously assumed to have fixed folds.

Access Note

The AlphaFold Database is free to access at alphafold.ebi.ac.uk. All structures are released under Creative Commons CC BY 4.0: anyone, anywhere, may use them, including commercially, provided they give attribution.

Limitations of the Database

AlphaFold predicts single-chain structures in isolation. It does not predict: how proteins interact with each other (protein-protein complexes), how proteins bind small molecules (ligands) or DNA, conformational changes upon binding, or the effects of post-translational modifications such as phosphorylation and glycosylation. A predicted structure represents one conformation — typically the apo (ligand-free) ground state — not the full dynamic behaviour of the protein.

For drug discovery, the binding pocket geometry matters enormously: a structure solved in complex with a ligand may differ significantly from the apo AlphaFold model. Experimental structures remain essential for many applications, and AlphaFold is best understood as a powerful complement to rather than replacement for crystallography and cryo-EM.

Scale in Numbers

By 2023 the AlphaFold Database contained predicted structures for about 214 million proteins, covering nearly all of UniProt rather than a fixed list of organisms. The Protein Data Bank's entire 50-year experimental archive covers ~220,000 structures, roughly a thousandth as many.

Lesson 3 Quiz

The AlphaFold Database — check your understanding

1. Approximately how many structures did the AlphaFold Database contain by July 2022?

Correct. By mid-2022 the database had expanded to cover essentially all of UniProt — over 200 million predicted structures, compared to the ~170,000 experimentally determined structures in the PDB after 50 years of work.

By July 2022 the AlphaFold Database contained over 200 million structures — effectively the entire UniProt sequence database, a roughly 1,000-fold increase over the experimental PDB.

2. What does a low pLDDT score (below 50) most likely indicate?

Correct. Low pLDDT regions often correctly flag intrinsically disordered protein segments — regions without a fixed 3D structure under physiological conditions. This is not a failure but biologically meaningful information.

Low pLDDT typically indicates that the protein region is genuinely disordered — it has no fixed fold in solution — rather than indicating a prediction error by AlphaFold.

3. Which of the following is a documented limitation of AlphaFold 2 predictions?

Correct. AlphaFold 2 predicts single-chain apo structures. It does not model interactions with binding partners, small molecules, DNA, or the conformational changes that occur upon binding — all critical for drug discovery.

AlphaFold predicts single-chain structures in isolation. It does not handle protein complexes, ligand binding, post-translational modifications, or the dynamics of conformational change.

4. What is PAE (Predicted Aligned Error) used for?

Correct. The PAE matrix shows AlphaFold's confidence in the relative orientation of every pair of residues. Low PAE between two domains means their relative position is well determined; high PAE signals a flexible or uncertain inter-domain arrangement.

PAE is a pairwise confidence measure — a matrix showing how confident AlphaFold is about the relative position of each pair of residues. It is especially useful for multi-domain proteins.

Lab 3 · Using the AlphaFold Database

Explore what a researcher actually does with 200 million predicted structures

Your Task

You are a structural biologist given access to the AlphaFold Database for the first time. Use the AI assistant to figure out how you would actually use AlphaFold predictions in a real research project — including where the limitations would bite you.

Suggested opening: "I'm trying to design a drug that inhibits a bacterial enzyme. The crystal structure hasn't been solved. Walk me through how I'd use AlphaFold predictions and what I'd need to watch out for." — then follow the conversation to understand real-world use.

AI Research Assistant

AlphaFold Database

Welcome! The AlphaFold Database is now one of the first places structural biologists go when starting a new project. I'm here to help you think through how to use it wisely — including its real limitations. What's your research question?

Lesson 4 · AI in Science — Module 2

Beyond AlphaFold — What Comes Next

AlphaFold-Multimer, RoseTTAFold, protein design, and the broader AI revolution in structural biology.

AlphaFold solved the folding problem — so what are the next grand challenges, and how is AI already attacking them?

The year after AlphaFold's triumph, David Baker's lab at the University of Washington published RoseTTAFold, an independently developed system that approached AlphaFold 2's accuracy using a three-track architecture that simultaneously processes sequence, distance, and coordinate information. Baker, who in 2024 would share the Nobel Prize in Chemistry with Demis Hassabis and John Jumper, then turned the same AI tools toward the inverse problem: not predicting how a sequence folds, but designing sequences that fold into desired shapes, including proteins that do not exist in nature.

AlphaFold-Multimer

DeepMind's own extension, published in late 2021, addressed AlphaFold's most significant limitation: the inability to model protein complexes. AlphaFold-Multimer takes multiple protein chains as input and predicts their assembled quaternary structure. Early benchmarks on protein-protein complexes showed performance significantly better than existing methods, though generally somewhat below single-chain accuracy.

AlphaFold-Multimer has been used to model: antibody-antigen complexes relevant to vaccine design; ribosomal sub-complexes to understand translation; viral spike protein interactions with host cell receptors; and large multi-subunit enzyme assemblies. The 2022 paper by Evans et al. reported a median interface RMSD of 4.0 ångströms on benchmark targets — imperfect but transformative compared to the near-absence of reliable computational tools before.

RoseTTAFold and Open Competition

RoseTTAFold, published in Science in August 2021, demonstrated that the core ideas were reproducible independently and introduced the three-track architecture. Baker's lab released the code publicly on GitHub, and it became the basis for several subsequent tools. RoseTTAFold All-Atom (2024) extended the approach to predict not only protein structures but also the positions of bound small molecules and nucleic acids — addressing AlphaFold's blind spot for ligand interactions.

RoseTTAFold: An independent protein-structure prediction system from the Baker lab at UW, published August 2021, using a three-track architecture processing sequence, pairwise distances, and 3D coordinates simultaneously.

AI-Driven Protein Design

The inverse folding problem — design a sequence that adopts a target shape — is arguably more commercially important than prediction. Baker's lab has demonstrated several milestones:

Hallucination and inpainting (2021–2022). Using RoseTTAFold as a differentiable "oracle," researchers generated entirely novel protein sequences by gradient descent — iteratively adjusting sequence tokens until the predicted structure matched a target geometry. Related inpainting approaches fill in unknown protein segments while fixing known scaffolds.

ProteinMPNN (2022). A message-passing neural network trained to design sequences for given backbone structures. ProteinMPNN dramatically outperformed previous computational design methods: experimentally tested designs folded correctly at rates above 50%, compared to single-digit percentages for earlier methods.

RFdiffusion (2023). A diffusion model (related to the technology behind image generators like DALL-E) adapted to generate protein backbones de novo — creating entirely new protein architectures not found in nature. Designed binders produced by RFdiffusion showed nanomolar affinity for target proteins in laboratory tests, approaching pharmaceutical-grade performance.

Nobel Prize · 2024

The 2024 Nobel Prize in Chemistry was awarded one half to David Baker for computational protein design, and the other half jointly to Demis Hassabis and John Jumper for protein structure prediction with AlphaFold. The Royal Swedish Academy of Sciences credited Hassabis and Jumper with solving a 50-year-old problem, and Baker with the almost impossible feat of building entirely new kinds of proteins.

AlphaFold 3 and ESMFold

AlphaFold 3, published in Nature in May 2024, extended the system to predict structures of protein–DNA, protein–RNA, protein–ligand, and protein–ion complexes. It uses a diffusion-based architecture (replacing the Structure Module) and achieves state-of-the-art accuracy on protein-ligand docking benchmarks, directly addressing the drug-discovery gap. Access was initially restricted to a web server, with no code released at publication, sparking debate in the scientific community about open science.

ESMFold, released by Meta AI in 2022, uses a single large protein language model (ESM-2, at the scale of billions of parameters) to predict structures directly from sequence without requiring an MSA. Skipping the MSA search makes it much faster than AlphaFold 2, on the order of seconds rather than minutes for a typical protein, at some cost in accuracy. ESMFold was used to predict structures for more than 600 million proteins in the MGnify metagenomic database, expanding the structural atlas of microbial life.

The Broader Picture

The protein-structure revolution is one chapter in a larger story: AI systems trained on the data produced by 50 years of experimental science are now generating new scientific knowledge faster than any human team could. The same pattern — large datasets, self-supervised learning, transformer architectures — is appearing in genomics, drug discovery, materials science, climate modelling, and particle physics. AlphaFold was the proof of concept that this paradigm works at the highest level of scientific difficulty.

Lesson 4 Quiz

Beyond AlphaFold — check your understanding

1. What does AlphaFold-Multimer predict that AlphaFold 2 cannot?

Correct. AlphaFold-Multimer extends the system to model protein complexes — taking multiple chain sequences as input and predicting their quaternary (assembled) structure, critical for understanding molecular machines and designing drugs that disrupt protein-protein interactions.

AlphaFold-Multimer's key addition is the ability to model multi-chain protein complexes — predicting how two or more protein chains assemble together in 3D space.

2. What is the "inverse folding problem" in protein science?

Correct. Inverse folding is protein design: given a target 3D shape, find or generate an amino-acid sequence that will spontaneously adopt it. Tools like ProteinMPNN and RFdiffusion have made this commercially viable for the first time.

Inverse folding means going backwards: given a desired 3D structure, design a sequence that will fold into it. This is the foundation of modern computational protein design.

3. Who shared the 2024 Nobel Prize in Chemistry for protein-structure work, and for what?

Correct. The 2024 Nobel Prize in Chemistry recognised David Baker for computational protein design and Demis Hassabis and John Jumper for AlphaFold — three researchers whose AI tools have transformed structural biology.

The 2024 Nobel Prize in Chemistry went to David Baker (University of Washington) for computational protein design, and to Demis Hassabis and John Jumper (both Google DeepMind) for AlphaFold.

4. What made ESMFold useful for large-scale metagenomic analysis, despite being less accurate than AlphaFold 2?

Correct. ESMFold's single-model architecture skips the MSA search entirely, making each prediction far faster than an AlphaFold 2 run. This allowed Meta AI to structure-annotate more than 600 million metagenomic sequences, a scale that would have been computationally prohibitive for AlphaFold 2.

ESMFold is faster because it bypasses the MSA database search, predicting from a single language model pass. This speed enabled structural annotation of hundreds of millions of metagenomic sequences in reasonable time.

Lab 4 · The Future of Protein AI

Think through the next challenges in AI-driven structural biology

Your Task

AlphaFold solved folding. Now you're a researcher or entrepreneur thinking about what comes next. Use the AI assistant to explore protein design, the limitations of current tools, and where the scientific and commercial opportunities lie.

Suggested opening: "If I wanted to design a brand-new enzyme that doesn't exist in nature — to break down plastic waste, for example — what's the current state of the AI tools that could help me do that?" — then explore the landscape of what's possible now versus what remains hard.

AI Research Assistant

Protein Design & Future

Protein design is one of the most exciting frontiers in all of science right now. Tools like RFdiffusion and ProteinMPNN have opened doors that were firmly shut just three years ago. What challenge or application would you like to explore?

Module 2 Test

AlphaFold and Protein Structure — 15 questions · 80% to pass

1. Who confirmed Anfinsen's dogma experimentally, earning the 1972 Nobel Prize in Chemistry?

Correct. Anfinsen's own experiments showed that denatured ribonuclease A spontaneously refolds to its native, active conformation — proving sequence determines structure.

Christian Anfinsen himself won the 1972 Nobel in Chemistry for showing ribonuclease A refolds spontaneously, demonstrating that the sequence encodes the fold.

2. How many amino acids are in the standard genetic code?

Correct. The standard genetic code encodes 20 amino acids (plus stop signals). Their chemical diversity — ranging from tiny glycine to bulky tryptophan, from charged arginine to hydrophobic leucine — is the raw material of all protein function.

There are 20 standard amino acids. The 64 codons in the genetic code map to these 20 amino acids (plus stop codons) with redundancy.
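
The redundancy described above can be checked by direct counting. The sketch below builds the standard codon table (NCBI translation table 1, written in its usual T, C, A, G base order) and tallies how many codons map to each amino acid. The names `CODON_TABLE` and `codon_summary` are illustrative and not part of the lesson.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1). Codons are enumerated
# with bases in T, C, A, G order at each position; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Hypothetical helper table: codon string -> one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_summary(table):
    """Return (total codons, distinct amino acids, stop codons, codons per amino acid)."""
    counts = Counter(table.values())
    stops = counts.pop("*")  # separate the stop signals from the amino acids
    return len(table), len(counts), stops, counts


total, n_amino, n_stop, per_residue = codon_summary(CODON_TABLE)
print(total, n_amino, n_stop)                # 64 20 3
print(per_residue["L"], per_residue["W"])    # 6 1
```

Four bases in three positions give 4 × 4 × 4 = 64 codons. Three are stop signals, so 61 codons are shared among 20 amino acids: leucine has six codons, while tryptophan has only one.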

3. What does a Multiple Sequence Alignment (MSA) provide to AlphaFold?

Correct. The MSA encodes billions of years of evolutionary experiments. Correlated mutations across species reveal which residue pairs are spatially close — providing geometric constraints that AlphaFold learns to decode.

The MSA provides co-evolutionary information: pairs of positions that tend to mutate together are likely in contact in the folded structure, giving AlphaFold geometric constraints without explicit physics simulation.
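
To make "positions that mutate together" concrete, here is a minimal sketch that scores covariation between two alignment columns with mutual information, a classical statistic used in pre-AlphaFold contact prediction. AlphaFold itself does not compute this quantity; it learns from the raw alignment. The toy alignment `TOY_MSA` and the function names are invented for illustration.

```python
from collections import Counter
from math import log2


def column(msa, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        joint = count / n
        independent = (count_i[a] / n) * (count_j[b] / n)
        mi += joint * log2(joint / independent)
    return mi


# Hypothetical alignment: positions 0 and 2 always swap charge together
# (K with E, E with K), while position 1 varies on its own.
TOY_MSA = ["KAEL", "KAEL", "EAKL", "EGKV", "KGEV", "EGKL"]

print(mutual_information(TOY_MSA, 0, 2))            # 1.0
print(round(mutual_information(TOY_MSA, 0, 1), 3))
```

Columns 0 and 2 carry a full bit of shared information because knowing one residue fixes the other, which is the kind of signal that suggests two positions are in contact. Columns 0 and 1 score close to zero.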

4. How many Evoformer blocks does AlphaFold 2 contain?

Correct. The Evoformer stack has 48 transformer blocks, each of which updates both the MSA representation and the pair representation in turn, progressively refining the geometric understanding of the sequence.

AlphaFold 2 uses 48 Evoformer blocks. Each block performs row-wise and column-wise attention on the MSA, plus triangle operations on the pair representation.

5. What competition did AlphaFold 2 dominate in November 2020?

Correct. CASP14 (November 2020) was the biennial protein-structure prediction competition where AlphaFold 2 scored 92.4 GDT, roughly doubling the previous best and prompting structural biologists to describe the 50-year-old problem as solved.

CASP14 — the 14th Critical Assessment of protein Structure Prediction competition — was where AlphaFold 2 achieved its landmark 92.4 GDT score in November 2020.

6. What is the GDT score, and what does 92.4 represent?

Correct. GDT (Global Distance Test) scores the fraction of Cα atoms predicted within distance cutoffs of the experimental structure, on a 0–100 scale. A score of 92.4 indicates the predicted and experimental structures are nearly superimposable.

GDT is the Global Distance Test — it measures what fraction of predicted atom positions fall within a set distance of the real structure. 92.4/100 is near-experimental accuracy.
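
As a rough illustration of the metric, the sketch below computes a GDT_TS-style score as the average fraction of Cα atoms within 1, 2, 4, and 8 Å of their experimental positions. It assumes the two structures are already superimposed; the real procedure also searches over many superpositions to maximise each fraction, which is omitted here. The function name `gdt_ts` and the toy coordinates are illustrative.

```python
from math import dist

# Distance cutoffs in angstroms used by the GDT_TS variant of the score.
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(predicted, experimental, cutoffs=GDT_TS_CUTOFFS):
    """GDT_TS-style score (0-100) for two pre-superimposed C-alpha traces.

    Simplification: no superposition search is performed, so this gives a
    lower bound on the score a full GDT implementation would report.
    """
    if not predicted or len(predicted) != len(experimental):
        raise ValueError("need two non-empty traces of equal length")
    distances = [dist(p, q) for p, q in zip(predicted, experimental)]
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(fractions)


# Toy example: four C-alpha atoms, with prediction errors of 0.5, 1.5, 3.0
# and 9.0 angstroms along the y axis.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]

print(gdt_ts(predicted, experimental))  # 56.25
```

Here the fractions at the four cutoffs are 1/4, 2/4, 3/4, and 3/4, which average to 56.25. A score of 92.4 therefore means that nearly every Cα atom sits within even the tightest cutoffs.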

7. What innovation did RoseTTAFold introduce compared to AlphaFold 2?

Correct. RoseTTAFold used a three-track architecture where information flows bidirectionally between sequence, distance, and coordinate representations — an independent approach that achieved comparable accuracy and was released as open-source code.

RoseTTAFold's distinctive contribution was its three-track architecture — sequence, pairwise distance, and 3D coordinate tracks that mutually inform each other throughout the network.

8. The AlphaFold Database was released jointly by DeepMind and which institution?

Correct. DeepMind partnered with EMBL-EBI to host and distribute the AlphaFold Database, making all structures freely available under CC BY 4.0 via alphafold.ebi.ac.uk.

The AlphaFold Database was jointly released by DeepMind and EMBL-EBI (the European Bioinformatics Institute, near Cambridge, UK), providing free access to all predictions.

9. What is ProteinMPNN, and what improvement did it offer over previous design methods?

Correct. ProteinMPNN (2022, Baker lab) designs amino-acid sequences for a given backbone geometry. Experimental validation showed designed proteins folding correctly at >50% success rates, a dramatic improvement over Rosetta-based design methods.

ProteinMPNN is an inverse folding model — given a backbone, it designs a sequence. It raised experimental success rates from single digits to over 50%, a step change in practical protein design.

10. AlphaFold 3 (2024) extended structural prediction to which new molecular entities?

Correct. AlphaFold 3 (Nature, May 2024) extended the system beyond proteins to predict complexes with DNA, RNA, small-molecule ligands, and ions — directly addressing the drug-discovery gap that limited earlier versions.

AlphaFold 3 added prediction of protein–DNA, protein–RNA, protein–small molecule, and protein–ion complexes, using a diffusion-based structure module rather than the invariant point attention (IPA) module of AlphaFold 2.

11. What aspect of protein biology does AlphaFold 2 fundamentally NOT capture?

Correct. AlphaFold 2 predicts a single static conformation. Proteins in solution are dynamic — they breathe, flex, and adopt multiple conformations. For many drug targets, the relevant state is an induced-fit conformation only populated upon ligand binding, which AlphaFold does not capture.

AlphaFold 2 predicts a single ground-state conformation. Protein dynamics — the fluctuations, breathing motions, and ligand-induced conformational changes — are not modelled and remain a key limitation for drug discovery.

12. What database contains the ~220,000 experimentally determined protein structures accumulated over 50 years?

Correct. The Protein Data Bank (PDB), established in 1971, is the repository for experimentally determined macromolecular structures. AlphaFold was trained on its ~170,000 structures (as of 2020) and has since produced predicted structures that dwarf that number.

The Protein Data Bank (PDB) is the global archive of experimentally determined protein structures — the ~170,000 entries it held in 2020 formed the training set for AlphaFold 2.

13. ESMFold (Meta AI, 2022) forgoes the MSA step. What is the tradeoff?

Correct. ESMFold uses a large protein language model to predict structure directly without MSA construction, which cuts prediction time to seconds per sequence. This comes at an accuracy cost, especially for "orphan" proteins with few evolutionary relatives.

ESMFold trades accuracy for speed — skipping the MSA makes it orders of magnitude faster but reduces accuracy compared to AlphaFold 2, particularly for proteins without close evolutionary relatives.

14. What did the 2023 Science study using AlphaFold structures discover about protein evolution?

Correct. By clustering AlphaFold structures rather than sequences, researchers found structural homologues invisible to BLAST and other sequence methods — revealing new superfamilies and previously hidden evolutionary relationships across the tree of life.

The 2023 Science study used structural clustering of AlphaFold predictions to find proteins related by fold but undetectable by sequence comparison — uncovering new superfamilies in the "evolutionary dark matter."

15. What type of AI architecture does RFdiffusion use, and what does it generate?

Correct. RFdiffusion (Baker lab, 2023) adapts the diffusion model paradigm — the same underlying approach as DALL-E and Stable Diffusion — to generate novel protein backbone geometries. Designed binders from RFdiffusion have shown nanomolar affinities in laboratory tests.

RFdiffusion is a diffusion model (like image-generating AI) adapted for protein backbones — it generates entirely new protein structures not found in nature, enabling the design of novel binders and enzymes.
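
For intuition about the diffusion paradigm, the sketch below shows only the forward (noising) half of a DDPM-style process applied to toy Cα coordinates. This is not RFdiffusion's implementation: the real model works on a richer backbone representation and uses a trained neural network for the reverse, denoising step, which is not shown. The function `noise_backbone` and the coordinates are hypothetical.

```python
import random
from math import sqrt


def noise_backbone(coords, alpha_bar, rng):
    """Blend clean coordinates with Gaussian noise (forward diffusion step).

    Implements x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps,
    where alpha_bar = 1 keeps the structure intact and alpha_bar = 0
    replaces it with pure noise.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    keep, add = sqrt(alpha_bar), sqrt(1.0 - alpha_bar)
    return [
        tuple(keep * x + add * rng.gauss(0.0, 1.0) for x in atom)
        for atom in coords
    ]


# Toy backbone: three C-alpha atoms spaced 3.8 angstroms apart on the x axis.
backbone = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
rng = random.Random(0)  # seeded so the run is repeatable

for alpha_bar in (1.0, 0.5, 0.0):
    print(alpha_bar, noise_backbone(backbone, alpha_bar, rng))
```

A generative model is trained to undo this corruption one small step at a time. Starting from pure noise and repeatedly denoising is what yields a backbone that was never in the training data.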