Module 5 · Lesson 1

The Inverse Problem: Designing Matter Backward

Traditional materials science found materials first and uses later. AI reverses the logic — specifying desired properties and searching for the material that satisfies them.

How does AI transform the century-old process of materials discovery into a computable search problem?

For most of the twentieth century, discovering a new functional material followed a predictable ritual: synthesize a compound, measure its properties, and — if lucky — find something useful. The average lag from laboratory discovery to commercial deployment was 18 to 23 years. The Materials Genome Initiative, launched by the Obama White House in 2011, set an explicit goal: cut that timeline in half by deploying high-throughput computation and data-sharing infrastructure across U.S. research institutions.

The philosophical shift that AI enables is more radical than mere speed. Rather than asking "what does this compound do?", researchers now ask: "what compound will do what I need?" This inverse design paradigm treats material structure as an output variable rather than an input.

The Scale of the Search Space

The challenge is combinatorial. Consider binary and ternary inorganic compounds alone: the number of plausible stoichiometries combining elements from the periodic table runs into the tens of millions. The Materials Project database, maintained by Lawrence Berkeley National Laboratory, had catalogued over 154,000 computed inorganic compounds by 2023 — yet this represents only a fraction of what is theoretically possible. Organic chemical space is even larger: estimates for drug-like organic molecules alone exceed 10⁶⁰ distinct structures.
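The order of magnitude can be checked with a short counting sketch. The inputs are assumptions chosen for illustration, not figures from the lesson: roughly 90 usable elements, integer subscripts from 1 to 8, and only reduced formulas (subscripts sharing no common factor) counted.

```python
from functools import reduce
from itertools import product
from math import comb, gcd

N_ELEMENTS = 90      # assumption: elements considered practical to work with
MAX_SUBSCRIPT = 8    # assumption: largest integer subscript allowed in a formula


def count_reduced_ratios(n_species: int, max_subscript: int) -> int:
    """Count subscript tuples such as (1, 2) or (2, 1, 4) with no common factor."""
    subscripts = range(1, max_subscript + 1)
    return sum(
        1
        for combo in product(subscripts, repeat=n_species)
        if reduce(gcd, combo) == 1
    )


def count_compositions(n_species: int) -> int:
    """Ways to choose the elements times the reduced ratios for those elements."""
    return comb(N_ELEMENTS, n_species) * count_reduced_ratios(n_species, MAX_SUBSCRIPT)


binary = count_compositions(2)
ternary = count_compositions(3)
print(f"binary:  {binary:,}")
print(f"ternary: {ternary:,}")
print(f"total:   {binary + ternary:,}")
```

Under these assumptions the total lands in the tens of millions, and it grows quickly if larger subscripts or a fourth element are allowed. The count covers compositions only; each composition can also adopt many different crystal structures.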

Classical high-throughput density functional theory (DFT) calculations can screen thousands of candidates per day on large computing clusters, but even this throughput cannot meaningfully sample spaces of this magnitude. Machine learning interatomic potentials — surrogate models trained on DFT data — can evaluate millions of structures at a fraction of the cost, making the search tractable for the first time.

Real Case — Roost and Graph Neural Networks (2020)

Rhys Goodall and Alpha Lee at Cambridge published the Roost model in 2020, demonstrating that a graph neural network trained on the composition of inorganic compounds — without explicit structural information — could predict formation energy, band gap, and bulk modulus with accuracy approaching DFT. The model processed composition strings directly, enabling screening of composition space before expensive structural relaxation was even attempted. Published in Nature Communications, the work illustrated how representation learning could compress the initial filtering stage from weeks to minutes.
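A composition-only model needs, as a first step, to turn a formula into element amounts. The sketch below shows only that parsing step, under the assumption of flat formulas without parentheses; the function name is hypothetical and this is not the Roost code. The published model then learns a representation over the elements present, which is not reproduced here.

```python
import re

# An element symbol followed by an optional count, e.g. "Fe", "O4", "Li0.5".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_composition(formula: str) -> dict:
    """Turn a flat formula such as 'LiFePO4' into fractional element amounts.

    Parentheses, hydrates and charges are not handled; this is deliberately minimal.
    """
    amounts = {}
    for symbol, count in TOKEN.findall(formula):
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(count) if count else 1.0)
    total = sum(amounts.values())
    return {symbol: amount / total for symbol, amount in amounts.items()}


composition = parse_composition("LiFePO4")
```

Here `composition` maps each of Li, Fe and P to 1/7 and O to 4/7. A vector of such fractions, one entry per element, is the kind of structure-free input a composition model can consume before any crystal structure is known.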

From Enumeration to Generation

Early computational screening was still fundamentally enumerative — it tested candidates from a pre-specified list. The next conceptual leap was generative modeling: training neural networks to produce novel candidate structures not seen in training data. Variational autoencoders (VAEs) and generative adversarial networks (GANs) were adapted from image generation to crystal structure generation around 2018–2020. The key challenge is that molecular and crystal structures obey hard physical constraints — atoms cannot overlap, charges must balance, lattice symmetries must be respected — that images do not.
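Two of the constraints named above, charge balance and non-overlapping atoms, can be expressed as simple validity checks on a generated candidate. This is a minimal sketch: the function names and the 1.0 Å distance cutoff are hypothetical, oxidation states are supplied by hand, and periodic boundary conditions are ignored.

```python
from itertools import combinations
from math import dist

# Hypothetical minimum allowed separation between any two atoms, in angstroms.
MIN_DISTANCE = 1.0


def is_charge_balanced(oxidation_states: dict, counts: dict) -> bool:
    """True when the formal charges in one formula unit sum to zero."""
    return sum(oxidation_states[el] * n for el, n in counts.items()) == 0


def has_overlapping_atoms(positions: list, min_distance: float = MIN_DISTANCE) -> bool:
    """True when any pair of Cartesian positions is closer than min_distance.

    Periodic images are ignored, so this only covers atoms inside one cell.
    """
    return any(dist(a, b) < min_distance for a, b in combinations(positions, 2))


# TiO2 with Ti(4+) and O(2-) balances; a hypothetical "TiO" with the same states does not.
tio2_ok = is_charge_balanced({"Ti": 4, "O": -2}, {"Ti": 1, "O": 2})
tio_ok = is_charge_balanced({"Ti": 4, "O": -2}, {"Ti": 1, "O": 1})
```

An image generator has no equivalent of these checks: any pixel grid is a valid image, whereas most random arrangements of atoms fail at least one of them.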

Two strategies emerged: latent space navigation, where a continuous latent space is searched by gradient methods toward desirable property predictions, and diffusion models, where structures are iteratively denoised from random configurations. The 2023 paper introducing DiffCSP (Crystal Structure Prediction via diffusion) by Jiao et al. demonstrated that diffusion-based approaches could recover experimental crystal structures from composition alone at competitive accuracy with far more expensive methods.
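Latent space navigation can be illustrated with a toy gradient ascent. Everything here is a stand-in: the two-dimensional latent space, the quadratic "property predictor" and its target point are hypothetical, and a real pipeline would use a trained neural predictor and decode the final latent point back into a structure.

```python
# Toy stand-in for a trained property predictor: it peaks at a hypothetical
# "ideal" latent point. A real predictor would be a neural network.
TARGET_LATENT = [1.5, -0.5]


def predicted_property(z):
    """Higher is better; the maximum value of 0 is reached at TARGET_LATENT."""
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, TARGET_LATENT))


def property_gradient(z):
    """Analytic gradient of predicted_property with respect to z."""
    return [-2.0 * (zi - ti) for zi, ti in zip(z, TARGET_LATENT)]


def navigate_latent_space(z_start, step_size=0.1, n_steps=100):
    """Plain gradient ascent on the predicted property."""
    z = list(z_start)
    for _ in range(n_steps):
        grad = property_gradient(z)
        z = [zi + step_size * gi for zi, gi in zip(z, grad)]
    return z


z_final = navigate_latent_space([0.0, 0.0])
```

With this step size the distance to the target shrinks by a constant factor each iteration, so `z_final` ends up very close to `TARGET_LATENT`. The hard part in practice is not the optimisation but ensuring that the decoded structure at that point is physically valid.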

Key Concepts

Inverse design: Specifying target properties first and using computation to identify structures that satisfy them, reversing the traditional discovery workflow.

High-throughput DFT: Automated application of density functional theory across thousands of candidate compounds to compute formation energies and electronic properties without experiment.

ML interatomic potential (MLIP): A machine learning model trained on quantum-mechanical data that approximates the energy landscape of atomic configurations millions of times faster than DFT.

Generative materials model: A neural network architecture (VAE, GAN, or diffusion model) trained to produce novel plausible crystal or molecular structures as output.

Materials Genome Initiative: A 2011 U.S. federal program aimed at halving materials discovery timelines through open databases, high-throughput computation, and data standards.

Why It Matters

The shift from enumeration to generation is analogous to the shift from searching a library to writing a new book. It does not just accelerate discovery — it opens regions of chemical space that no human researcher would have thought to explore, because they were never synthesized and therefore never catalogued.

Lesson 1 Quiz

Inverse Design Fundamentals

Three questions — click an answer to reveal feedback.

What is the core philosophical difference between traditional materials discovery and AI-enabled inverse design?

Correct. Inverse design reverses the logic: properties are the input, structure is the output. This is a genuine paradigm shift, not a speedup of the same workflow.

Not quite. The key difference is philosophical — treating structure as an output variable — not merely computational speed. Review the opening section of Lesson 1.

The Materials Genome Initiative, launched in 2011, set which explicit goal?

Correct. The MGI's explicit target was halving the 18–23 year average lag from discovery to deployment through high-throughput computation and data infrastructure.

Incorrect. The MGI is a materials science initiative, not a genomics program. Its goal was halving discovery timelines through computation and open data. Revisit the opening section of Lesson 1.

What advantage do ML interatomic potentials (MLIPs) have over standard density functional theory calculations?

Correct. MLIPs are surrogate models trained on DFT data. They sacrifice some accuracy for enormous speed gains, making large-scale screening tractable.

Incorrect. MLIPs are faster approximations of DFT, not more accurate replacements. They are trained on quantum-mechanical data and inherit DFT's assumptions and errors. Review the screening section.

Lesson 1 · Lab

Inverse Design Consultant

AI-assisted exploration — complete 3 exchanges to finish the lab.

Your Task

You are advising a research team that needs a new transparent conducting oxide to replace indium tin oxide (ITO) in flexible displays. ITO is expensive, brittle, and uses scarce indium. Your AI assistant knows the landscape of computational materials discovery tools and approaches.

Starter question: "What properties should we specify first when setting up an inverse design search for a transparent conductor, and why does the order of constraints matter?"

Materials Discovery AI

Inverse Design

Welcome. I'm your computational materials discovery assistant for this lab. You're looking to replace ITO in flexible displays — a well-defined inverse design problem. What would you like to explore first?

Module 5 · Lesson 2

GNoME and the Structural Prediction Revolution

In 2023, Google DeepMind's Graph Networks for Materials Exploration (GNoME) predicted 2.2 million new inorganic crystal structures, roughly 380,000 of them thermodynamically stable: more than the total discovered by humanity in the preceding two centuries.

What did GNoME actually predict, how was it validated, and what does it mean for experimental science?

In November 2023, Merchant et al. at Google DeepMind published "Scaling deep learning for materials discovery" in Nature. The headline number was startling: 2.2 million new crystal structures predicted by a graph neural network trained iteratively on expanding DFT-validated datasets. Of these, 380,000 were predicted to be thermodynamically stable, meaning they would not spontaneously decompose into competing phases, a figure roughly equal to the entire corpus of experimentally known inorganic compounds accumulated over two centuries of chemistry.

Simultaneously, a team at Lawrence Berkeley National Laboratory published a complementary paper in Nature describing an autonomous robotic laboratory that attempted the synthesis of 58 computationally predicted compounds and succeeded for 41 of them, providing early experimental validation of this kind of computational prediction.

How GNoME Works

GNoME uses a graph neural network architecture in which atoms are nodes and interatomic bonds are edges. The initial network was trained on approximately 69,000 DFT-computed structures from the Materials Project. The key innovation was an active learning (closed-loop) cycle: after initial training, the model predicted stability for a large candidate set; the most uncertain and most promising candidates were sent to DFT calculation; those results were added to the training set; and the loop repeated over hundreds of rounds.
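The loop described above can be sketched in a few lines. This is a toy under stated assumptions, not GNoME's implementation: candidates are numbers on a line, `dft_energy` is a made-up function standing in for an expensive DFT calculation, and the surrogate is a nearest-neighbour lookup whose distance to the closest labelled point serves as a crude uncertainty.

```python
def dft_energy(x: float) -> float:
    """Stand-in for an expensive DFT calculation (hypothetical toy function)."""
    return (x - 0.7) ** 2


def nearest_label(x, labelled):
    """Surrogate model: predict the energy of the closest labelled point.

    Returns (predicted_energy, distance); distance acts as a crude uncertainty.
    """
    nearest_x = min(labelled, key=lambda known: abs(known - x))
    return labelled[nearest_x], abs(nearest_x - x)


def active_learning(candidates, labelled, rounds=5, batch_size=3, explore_weight=1.0):
    """Repeatedly pick promising or uncertain candidates, label them, and retrain."""
    labelled = dict(labelled)
    for _ in range(rounds):
        pool = [x for x in candidates if x not in labelled]
        if not pool:
            break

        def acquisition(x):
            energy, uncertainty = nearest_label(x, labelled)
            # Low predicted energy is promising; large distance is uncertain.
            return -energy + explore_weight * uncertainty

        batch = sorted(pool, key=acquisition, reverse=True)[:batch_size]
        for x in batch:
            labelled[x] = dft_energy(x)  # "send to DFT" and grow the training set
    return labelled


candidates = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
seed = {0.0: dft_energy(0.0), 1.0: dft_energy(1.0)}
labelled = active_learning(candidates, seed)
best_x = min(labelled, key=labelled.get)
```

Because the surrogate is a lookup table, "retraining" is just adding entries to `labelled`; with a neural network each round would involve a real training step. The `explore_weight` parameter controls the balance between exploiting low predicted energies and exploring poorly sampled regions.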

The stability criterion used was the convex hull distance (e_hull): the energy of a compound above the convex hull formed by all competing phases. A material with e_hull = 0 meV/atom lies exactly on the convex hull and is predicted stable against decomposition into any mixture of competing phases. GNoME's 380,000 stable predictions were those the GNN placed on or below the hull of previously known phases (e_hull ≤ 0 meV/atom measured against that earlier hull). DFT recalculation of a subset showed the model's energies were accurate to within roughly 30 meV/atom for most compounds.
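For a two-element system the convex hull distance can be computed by hand. The sketch below assumes a hypothetical A-B system with formation energies in eV/atom, pure elements at zero, and composition expressed as the fraction of B; the function names and all numbers are invented for illustration.

```python
def lower_hull(points):
    """Lower convex hull of (composition, energy) points, left to right."""
    lowest = {}
    for x, e in points:
        lowest[x] = min(e, lowest.get(x, e))  # keep the lowest energy per composition
    hull = []
    for p in sorted(lowest.items()):
        # Pop the last vertex while it lies on or above the line from hull[-2] to p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def hull_energy(x, hull):
    """Energy of the hull at composition x, by linear interpolation."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            return y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    raise ValueError("composition outside the hull range")


def energy_above_hull(x, energy, known_phases):
    """e_hull of a candidate, with the candidate itself included in the hull."""
    hull = lower_hull(known_phases + [(x, energy)])
    return energy - hull_energy(x, hull)


# Hypothetical system: pure A, pure B, and one known compound AB.
known = [(0.0, 0.0), (1.0, 0.0), (0.5, -0.40)]
unstable = energy_above_hull(0.25, -0.15, known)   # sits above the A to AB tie line
stable = energy_above_hull(0.25, -0.30, known)     # sits below it, so it joins the hull
```

The first candidate lies 0.05 eV/atom (50 meV/atom) above the tie line between A and AB, so it would be expected to decompose into those two phases. The second falls below that line, becomes a hull vertex, and gets e_hull = 0. If the hull were built from the known phases only, the second candidate would show a negative distance to that older hull.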

Real Case — Autonomous Lab Validation (Berkeley 2023)

The A-Lab at Lawrence Berkeley National Lab, described in a concurrent Nature paper by Szymanski et al., used a robotic synthesis platform guided by AI planning to autonomously attempt synthesis of 58 GNoME-predicted compounds over 17 days. The robot mixed precursors, ran furnace reactions, and characterised products by X-ray diffraction — all without human intervention. 41 of the 58 targets (71%) were successfully synthesised, validating GNoME's predictions and demonstrating that the prediction-to-synthesis pipeline could operate end-to-end with minimal human involvement.

What "Stable" Actually Means

A critical nuance: thermodynamic stability (convex hull) is necessary but not sufficient for a material to be experimentally accessible. A compound might be thermodynamically stable but kinetically inaccessible — requiring impractical synthesis conditions. It might be stable in vacuum but decompose in air or moisture. It might be stable at 0 K but adopt a different phase at room temperature. GNoME's predictions are energies at 0 K without pressure, temperature, or chemical environment corrections.

Researchers distinguish: thermodynamic stability (lowest energy at given composition), metastability (local energy minimum not globally lowest, but kinetically trapped — diamond being the canonical example), and synthesizability (whether accessible synthesis routes exist). AI models in 2023–2024 began incorporating synthesizability predictions as a separate output, recognising that predicting a stable structure and predicting a makeable structure are different problems.

GNoME Predictions: 2.2 million. New inorganic crystal structures predicted stable or metastable.

Thermodynamically Stable: 380,000. Predicted e_hull = 0, stable against decomposition.

A-Lab Synthesis Rate: 71%. 41 of 58 predicted compounds successfully synthesised by robot.

Training Rounds: Hundreds. Active learning loops, each adding DFT-validated data.

Implications for Experimental Science

GNoME does not replace experimental chemists. It produces a priority list — a ranked catalogue of computationally plausible candidates that experimentalists can choose to pursue. The practical bottleneck shifts from "what to try" to "how to try it efficiently." Robotic synthesis platforms, automated characterization, and AI-guided experimental design (covered in Lesson 3) are the complementary technologies that determine whether the predicted pipeline actually accelerates real-world materials discovery.

Critics noted that publishing 2.2 million predictions creates a literature problem: how do other researchers prioritise which predictions to test? GNoME's supplementary data was released publicly, enabling the community to filter by element availability, property predictions, and structural type — an open-access approach that itself reflects the Materials Genome Initiative's philosophy.

Scope of Change

Pre-GNoME, roughly 10,000 inorganic compounds with known crystal structures were considered well-characterised. GNoME expanded the computationally predicted stable space by a factor of ~40. Whether laboratory capacity can ever fully explore that space remains an open question — but the bottleneck has definitively shifted from prediction to synthesis.

Lesson 2 Quiz

GNoME and Structural Prediction

Three questions on the 2023 GNoME results and their interpretation.

How many inorganic crystal structures did GNoME predict to be thermodynamically stable (e_hull = 0)?

Correct. 2.2 million structures were predicted stable or metastable overall; the 380,000 figure refers specifically to those on the convex hull (thermodynamically stable against decomposition).

Incorrect. The 2.2 million figure includes all stable and metastable predictions; 380,000 is the subset predicted thermodynamically stable (e_hull = 0). Review the GNoME data table in Lesson 2.

The A-Lab at Lawrence Berkeley validated GNoME predictions using what approach?

Correct. The A-Lab used a robotic platform to autonomously attempt synthesis of 58 predicted compounds in 17 days, achieving a 71% success rate — a landmark demonstration of end-to-end AI-to-robot materials synthesis.

Incorrect. The A-Lab's validation was experimental and robotic, not computational or manual. The robot mixed precursors, ran furnaces, and characterised products by XRD autonomously. Revisit the Berkeley callout in Lesson 2.

Why is thermodynamic stability (e_hull = 0) necessary but not sufficient to confirm a material is useful?

Correct. Stability predictions are 0 K, vacuum calculations. Real synthesis occurs at finite temperature and pressure in chemical environments. Kinetic accessibility, environmental stability, and functional properties are all separate questions.

Incorrect. The issue is not GNoME's error bars but rather what convex hull stability does and does not guarantee. Review the "What Stable Actually Means" section in Lesson 2.

Lesson 2 · Lab

Interpreting GNoME Predictions

Discuss and evaluate AI-predicted crystal stability — 3 exchanges to complete.

Your Task

Your research group has downloaded GNoME's public dataset. You have identified a ternary oxide predicted to be thermodynamically stable with e_hull = 0 meV/atom, containing titanium, niobium, and oxygen in a novel structure type. You need to decide whether to pursue experimental synthesis.

Starter question: "We have a Ti-Nb-O compound predicted stable by GNoME. What additional computational and experimental information should we gather before committing to synthesis?"

Materials Discovery AI

GNoME Analysis

Good question — this is exactly the decision point that separates useful predictions from the vast majority of the GNoME catalogue. Let me help you think through what to check before committing synthesis resources. What is your target application for this compound?

Module 5 · Lesson 3

Battery Materials and the Electrolyte Search

Solid-state electrolytes could make lithium batteries fundamentally safer. AI has accelerated the identification of viable candidates from a search space too large for conventional methods.

How did Microsoft and partner institutions use AI to identify a solid-state electrolyte candidate in days rather than years?

In January 2024, Microsoft Research and Pacific Northwest National Laboratory published results in ACS Energy Letters describing an AI-accelerated discovery campaign that identified a solid-state lithium-ion conductor from an initial pool of 32 million candidate materials. The campaign began with AI filtering, narrowed to 18 candidates for DFT evaluation, and concluded with experimental synthesis of a single high-priority compound, a lithium yttrium chloride with partial sodium substitution (NaₓLi₃₋ₓYCl₆), in under nine months from initiation to experimental validation.

The work demonstrated a complete pipeline: generative enumeration, machine learning property screening, DFT validation, and laboratory synthesis — each stage reducing the candidate pool by orders of magnitude. The compound identified showed ionic conductivity competitive with leading solid electrolytes, though further optimisation was required before device integration.

Why Solid Electrolytes Are Hard to Find

Liquid electrolytes in conventional lithium-ion batteries are flammable organic solvents, the source of high-profile fire incidents in electric vehicles and consumer electronics. Solid-state electrolytes would eliminate this risk but must simultaneously satisfy a stringent set of requirements that conventional liquid electrolytes satisfy almost automatically.

The target property profile includes: high lithium-ion conductivity (≥ 1 mS/cm at room temperature), low electronic conductivity (to prevent internal short circuits), electrochemical stability across the voltage window of the electrode pair, mechanical compatibility with electrode volume changes during charge/discharge, chemical stability against both the anode and cathode, and processability at scale. No single known material satisfies all criteria optimally — hence the active research field.
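The property profile above amounts to a conjunction of filters. The sketch below encodes three of the criteria as an illustration; only the 1 mS/cm ionic conductivity threshold comes from the text, while the electronic conductivity limit, the voltage range, the class and function names, and the candidate data are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    ionic_ms_cm: float        # room-temperature Li-ion conductivity, mS/cm
    electronic_ms_cm: float   # electronic conductivity, mS/cm
    window_v: tuple           # (lowest, highest) voltage without decomposition


MIN_IONIC_MS_CM = 1.0            # threshold quoted in the text
MAX_ELECTRONIC_MS_CM = 1e-6      # hypothetical placeholder
ELECTRODE_RANGE_V = (0.0, 4.2)   # hypothetical anode/cathode operating range


def failed_criteria(c: Candidate) -> list:
    """Return the names of the screening criteria that the candidate fails."""
    failures = []
    if c.ionic_ms_cm < MIN_IONIC_MS_CM:
        failures.append("ionic conductivity")
    if c.electronic_ms_cm > MAX_ELECTRONIC_MS_CM:
        failures.append("electronic conductivity")
    low, high = c.window_v
    if low > ELECTRODE_RANGE_V[0] or high < ELECTRODE_RANGE_V[1]:
        failures.append("electrochemical window")
    return failures


# Made-up candidates, purely to exercise the filter.
pool = [
    Candidate("candidate_a", 2.5, 1e-8, (0.0, 5.0)),
    Candidate("candidate_b", 0.1, 1e-8, (0.0, 5.0)),
    Candidate("candidate_c", 3.0, 1e-3, (1.0, 4.0)),
]
survivors = [c.name for c in pool if not failed_criteria(c)]
```

Returning the list of failed criteria, rather than a single pass/fail flag, makes it easy to see which requirement eliminates the most candidates. Mechanical compatibility, interfacial chemical stability and processability are harder to reduce to one number and are left out of this sketch.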

The Microsoft/PNNL Pipeline

The 2024 Microsoft-PNNL campaign used Azure Quantum Elements, Microsoft's cloud-based materials simulation infrastructure, in conjunction with large language model-assisted literature synthesis and ML property predictors. The pipeline operated in four stages:

Stage 1, Enumeration: 32 million candidate lithium-containing inorganic structures were generated by systematic substitution of elements into known crystal prototypes.

Stage 2, AI Filtering: Graph neural networks trained on known ionic conductors screened all 32 million candidates for predicted ionic conductivity, electrochemical window, and stability, reducing the pool to approximately 5,000.

Stage 3, DFT Validation: High-throughput DFT calculations refined this to 18 priority candidates.

Stage 4, Synthesis: PNNL chemists synthesised and characterised the top candidates experimentally.
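The size of each reduction is easier to appreciate as a ratio. The short calculation below uses the pool sizes quoted in the stages above; the variable and function names are illustrative.

```python
from math import log10

# Pool sizes as reported in the lesson text.
STAGES = [
    ("Enumeration", 32_000_000),
    ("AI filtering", 5_000),
    ("DFT validation", 18),
]


def reduction_factors(stages):
    """For each stage after the first, return (name, factor, orders_of_magnitude)."""
    results = []
    for (_, before), (name, after) in zip(stages, stages[1:]):
        factor = before / after
        results.append((name, factor, log10(factor)))
    return results


for name, factor, orders in reduction_factors(STAGES):
    print(f"{name}: pool shrinks {factor:,.0f}x (about {orders:.1f} orders of magnitude)")
```

The AI filter removes a factor of 6,400 (nearly four orders of magnitude) and DFT a further factor of about 280. The cheapest method does most of the elimination, so the expensive calculations are spent only on candidates that have already passed the inexpensive checks.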

Real Case — LLM-Assisted Literature Mining

An underappreciated element of the Microsoft-PNNL campaign was the use of a fine-tuned large language model to systematically extract ionic conductivity measurements from thousands of published papers — a task that would take human researchers months. The LLM parsed experimental tables, resolved inconsistent units, and flagged measurements from comparable synthesis conditions, creating a curated training dataset for the downstream property predictors. This text-to-database step is increasingly standard in AI-accelerated materials pipelines as the published literature grows faster than human reading capacity.

Other AI Battery Materials Milestones

The Microsoft-PNNL work was preceded by several landmark efforts. Sendek et al. at Stanford (2019) screened ~1,500 known lithium solid electrolytes drawn from the Materials Project, using logistic regression trained on 20 material features to predict ionic conductivity class and identifying 21 high-priority candidates. The NOMAD (Novel Materials Discovery) repository, maintained by the Fritz Haber Institute in Berlin, aggregated DFT calculations from hundreds of groups and enabled cross-institutional ML training. Samsung Advanced Institute of Technology used reinforcement learning in 2020 to optimise the composition of lithium-rich layered oxide cathodes, identifying compositions with improved cycle stability confirmed in coin-cell tests.

Key Concepts

Solid-state electrolyte: A non-liquid ionic conductor that replaces flammable organic solvents in batteries, enabling safer and potentially higher-energy-density cells.

Ionic conductivity: The ability of a material to allow ion transport; for solid electrolytes, a room-temperature conductivity ≥ 1 mS/cm is generally required for practical cells.

Electrochemical stability window: The voltage range within which an electrolyte does not decompose; must span the operating voltages of both anode and cathode.

Closed-loop pipeline: A discovery workflow where AI predictions, computational validation, experimental results, and model retraining are automated and integrated into a continuous cycle.

The Speed Question

Microsoft reported the compound identification took under nine months. The team estimated a comparable purely human-driven screening effort would have taken decades. Even if this estimate is generous, the compression factor is large enough to matter commercially — battery technology roadmaps operate on 5–10 year horizons, and reducing a 20-year discovery cycle to two years is a competitive advantage of the first order.

Lesson 3 Quiz

Battery Materials and AI Pipelines

Three questions on the electrolyte search and AI-accelerated pipelines.

How many candidate materials did the Microsoft-PNNL pipeline begin with before AI filtering?

Correct. The pipeline started with 32 million candidates generated by systematic crystal prototype substitution. AI filtering reduced this to ~5,000, then DFT to 18, before experimental synthesis of the top hits.

Incorrect. The pipeline began at 32 million candidates. 5,000 was the post-AI-filtering pool; 18 was the post-DFT pool; and the final synthesis involved the top hits. Review the pipeline stages in Lesson 3.

What role did a fine-tuned large language model play in the Microsoft-PNNL discovery campaign?

Correct. The LLM performed literature mining — parsing tables, resolving units, and flagging comparable measurements across thousands of papers — creating a curated ionic conductivity dataset that would have taken human researchers months to assemble.

Incorrect. The LLM's role was literature mining, not synthesis or structure generation. This text-to-database step is an increasingly important part of AI materials pipelines. Review the LLM callout in Lesson 3.

Which property is NOT listed as a requirement that solid-state electrolytes must satisfy?

Correct. Solid electrolytes must have LOW electronic conductivity — high electronic conductivity would cause internal short circuits. Ion transport must be high; electron transport must be suppressed.

Incorrect. A solid electrolyte needs LOW — not high — electronic conductivity. High electronic conductivity would short-circuit the cell. Review the property requirements in Lesson 3.

Lesson 3 · Lab

Battery Electrolyte Design Advisor

Design a screening strategy for solid electrolyte candidates — 3 exchanges to complete.

Your Task

You are part of a startup developing next-generation solid-state batteries for electric vehicles. Your computational team has access to the Materials Project database and a GPU cluster capable of running high-throughput DFT. You need to design a multi-stage screening strategy for solid electrolyte candidates.

Starter question: "We want to filter the Materials Project database for potential solid electrolytes. What sequence of computational filters should we apply, and in what order, to get from ~150,000 compounds to a manageable synthesis list?"

Materials Discovery AI

Electrolyte Screening

Great strategic question. Designing the filter sequence well is critical — applying expensive calculations too early wastes resources; applying cheap filters too late misses easy eliminations. Let me help you build a funnel. First: are you targeting oxide, sulfide, or halide solid electrolytes, or are you agnostic about anion chemistry?

Module 5 · Lesson 4

Protein Structure, Drug Design, and the AlphaFold Ripple

DeepMind's AlphaFold 2 solved a 50-year grand challenge in biology. Its methods are now reshaping drug discovery, enzyme engineering, and the design of entirely new proteins with no natural counterparts.

What has AlphaFold actually enabled downstream of structure prediction, and how are generative models extending its impact into new molecule design?

When DeepMind published AlphaFold 2 in Nature in July 2021, it demonstrated near-experimental accuracy on protein structure prediction across the CASP14 benchmark, achieving a median backbone accuracy of 0.96 Å RMSD over ordered residues. By July 2022, the AlphaFold Protein Structure Database had released predicted structures for more than 200 million proteins, covering nearly every catalogued protein in UniProt and including the entire human proteome. Structural biologists who had spent careers on single protein structures described the release as simultaneously a gift and a destabilisation of their field.

But the real story was not structure prediction per se — it was what downstream tools built on AlphaFold's representations could do. Structure prediction had been the bottleneck; once solved, it revealed the next bottleneck: function annotation, drug target validation, and the design of proteins that nature never evolved.

AlphaFold's Mechanism in Brief

AlphaFold 2 uses a deep neural network with two key innovations. The Evoformer module jointly processes a multiple sequence alignment (MSA) of evolutionarily related proteins and a pairwise residue representation, using attention mechanisms to extract evolutionary covariation signals. The structure module then iteratively updates residue frames (position + orientation in 3D space) to produce an all-atom coordinate prediction, along with per-residue confidence scores (pLDDT) and a predicted aligned error (PAE) for each pair of residues.

The training data was entirely from the Protein Data Bank — experimentally determined structures solved by X-ray crystallography, NMR, and cryo-EM. AlphaFold learned physical constraints implicitly from data rather than encoding them as explicit energy terms, as earlier physics-based methods (Rosetta, MODELLER) did.
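The per-residue confidence scores mentioned above are usually read in bands. The sketch below assumes the convention used by the AlphaFold database (cut-offs at 90, 70 and 50); the function names and the example scores are made up.

```python
def plddt_band(score: float) -> str:
    """Map a pLDDT score (0 to 100) to a qualitative confidence band."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"  # often read as a sign of disorder


def fraction_confident(scores, threshold: float = 70.0) -> float:
    """Share of residues whose pLDDT exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


example_scores = [95.2, 88.1, 71.4, 49.0, 33.7]  # made-up values
bands = [plddt_band(s) for s in example_scores]
```

A summary such as `fraction_confident` is a quick way to decide whether a predicted model is worth using for downstream design, although a high average can still hide a poorly predicted binding site.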

Real Case — Malaria Vaccine and Parasitic Disease Proteins (2022)

Within months of the AlphaFold database launch, researchers at the Wellcome Sanger Institute used AlphaFold predictions to structurally characterise proteins from Plasmodium falciparum (malaria parasite) and related parasites — organisms where experimental structure determination was severely limited by protein insolubility and difficulty of culture. A 2022 paper in Science from the Bhatt Lab used AlphaFold-predicted structures of Plasmodium surface antigens to identify conserved epitopes amenable to vaccine design, directly informing antigen selection for a next-generation malaria vaccine candidate. This was structure prediction converted to drug design in under 18 months from database availability.

From Prediction to Design: RFdiffusion and Beyond

AlphaFold predicts the structure of naturally occurring or close-homologue proteins. A distinct challenge is de novo protein design: creating amino acid sequences that fold into a specified 3D structure with specified function — sequences that evolution never explored. David Baker's group at the University of Washington has led this field.

In 2023, the Baker lab published RFdiffusion (RoseTTAFold diffusion) in Nature: a diffusion model, fine-tuned from the RoseTTAFold structure prediction network on experimental protein structures, that generates novel protein backbone conformations conditioned on functional constraints. Given a desired binding site geometry, RFdiffusion produces protein scaffolds designed to present that geometry. Paired with a sequence design model (ProteinMPNN) and AlphaFold structure validation, the pipeline generated de novo binders to multiple target proteins, including influenza hemagglutinin and a cancer-associated receptor, that showed nanomolar affinity in experimental binding assays. David Baker received the 2024 Nobel Prize in Chemistry in part for this body of work.

Small Molecule Drug Discovery

AlphaFold structures also accelerated structure-based small molecule drug discovery by providing high-quality 3D models of therapeutic targets previously inaccessible to crystallography. Insilico Medicine used AlphaFold-predicted structures of the cancer target CDK20 to run AI-guided docking and generative chemistry, identifying a clinical candidate (ISM9274) that entered Phase I clinical trials in 2023 in under 30 months from project initiation — an exceptionally fast timeline for oncology drug discovery. The company's AI platform (Chemistry42) uses a reinforcement learning generative model to propose molecules, scored by AlphaFold-enabled docking predictions and ADMET property models.

Isomorphic Labs, spun out of DeepMind in 2021 to apply AlphaFold to drug discovery, reported in 2024 research collaborations with Eli Lilly and Novartis targeting multiple disease areas, with undisclosed milestone payments that implied AlphaFold-derived structures were central to the drug design process.

Key Concepts

AlphaFold 2: DeepMind's 2021 deep learning system that predicts protein 3D structure from amino acid sequence with near-experimental accuracy, trained entirely on Protein Data Bank data.

pLDDT: Predicted Local Distance Difference Test, AlphaFold's per-residue confidence score. Values above 90 indicate high confidence; values below 50 indicate likely disordered regions.

De novo protein design: Creating amino acid sequences that fold into a specified 3D structure with specified function, without reference to naturally evolved sequences.

RFdiffusion: A diffusion model from the Baker lab (2023) that generates novel protein backbone structures conditioned on functional site constraints, enabling de novo binder design.

Structure-based drug discovery: Drug design that uses the 3D structure of a target protein to guide the selection and optimisation of small molecules that bind and modulate it.

The Nobel Recognition

The 2024 Nobel Prize in Chemistry was awarded to David Baker (for computational protein design) and to Demis Hassabis and John Jumper (for AlphaFold). The Nobel Committee described the work as having "unlocked the secret of proteins". It is the most consequential scientific recognition of AI-driven materials and molecular science to date.

Lesson 4 Quiz

AlphaFold, Protein Design, and Drug Discovery

Three questions on AlphaFold's downstream impact.

What is the purpose of AlphaFold's pLDDT score?

Correct. pLDDT (predicted Local Distance Difference Test) is AlphaFold's confidence estimate for each amino acid position. High pLDDT regions are reliable; low pLDDT regions may be genuinely disordered or poorly predicted.

Incorrect. pLDDT is a confidence score, not a binding affinity or conservation metric. Review the key concepts in Lesson 4.

RFdiffusion differs from AlphaFold 2 in what fundamental way?

Correct. AlphaFold is a predictor: given a sequence, it outputs a structure. RFdiffusion is a generator: given a functional constraint, it outputs novel backbone geometries that no natural protein has. These are complementary tools in different directions of the structure-function problem.

Incorrect. The key distinction is prediction versus generation. AlphaFold predicts; RFdiffusion generates. Both work on proteins. Review the design section of Lesson 4.

Insilico Medicine's CDK20 program achieved Phase I clinical trial entry in approximately how long from project initiation?

Correct. Under 30 months from initiation to Phase I clinical trial is remarkably fast for oncology — the typical industry average for reaching Phase I is 4–6 years from initial candidate identification. AlphaFold-enabled docking played a central role in this acceleration.

Incorrect. The timeline was under 30 months — fast for the field but not instant. Experimental validation, preclinical toxicology, and regulatory steps still occurred. Review the small molecule section of Lesson 4.

Lesson 4 · Lab

De Novo Protein Design Workshop

Explore the pipeline from target to designed binder — 3 exchanges to complete.

Your Task

Your biotech company wants to develop a de novo protein inhibitor for a cancer-related enzyme whose structure was recently predicted by AlphaFold 2. You need to understand the RFdiffusion-ProteinMPNN-AlphaFold validation pipeline and plan your design campaign.
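The validation end of the pipeline named above can be pictured as a filter over refolded designs. The sketch below is illustrative only: the `DesignResult` record, the function names, and the threshold values are assumptions made for this example, not outputs or APIs of RFdiffusion, ProteinMPNN, or AlphaFold. It assumes each design has already been generated, sequence-designed, and re-predicted, and that summary metrics from that re-prediction are available.

```python
from dataclasses import dataclass

@dataclass
class DesignResult:
    """Hypothetical summary of one designed binder after re-prediction."""
    design_id: str
    plddt: float            # mean confidence of the refolded binder (0-100)
    pae_interaction: float  # predicted aligned error across the interface (angstroms)
    backbone_rmsd: float    # designed vs. refolded backbone deviation (angstroms)

def passes_in_silico_filter(r, min_plddt=80.0, max_pae=10.0, max_rmsd=2.0):
    """Keep designs that refold confidently and close to the intended backbone.

    The default thresholds are illustrative assumptions, not validated cut-offs.
    """
    return (r.plddt >= min_plddt
            and r.pae_interaction <= max_pae
            and r.backbone_rmsd <= max_rmsd)

def shortlist(results, **thresholds):
    """Return passing designs, best predicted interface first."""
    kept = [r for r in results if passes_in_silico_filter(r, **thresholds)]
    return sorted(kept, key=lambda r: r.pae_interaction)

results = [
    DesignResult("d1", plddt=91.0, pae_interaction=6.2, backbone_rmsd=0.9),
    DesignResult("d2", plddt=72.0, pae_interaction=5.0, backbone_rmsd=1.1),   # low confidence
    DesignResult("d3", plddt=88.0, pae_interaction=14.0, backbone_rmsd=1.0),  # poor interface
    DesignResult("d4", plddt=85.0, pae_interaction=4.1, backbone_rmsd=1.6),
]
ranked_ids = [r.design_id for r in shortlist(results)]
```

Only designs that survive a filter of this kind would normally move on to expression and binding assays; the filter reduces experimental load but does not replace the experiments.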

Starter question: "We have an AlphaFold-predicted structure of our cancer target enzyme with a well-defined active site pocket. Walk me through how RFdiffusion would be used to design a protein that binds this pocket, and what validation steps we'd need before experimental testing."

Materials Discovery AI

Protein Design

Excellent. This is now a well-established pipeline, thanks to the Baker lab's work. Let me walk you through the full design-validate-test cycle. To start: is your target pocket supported by experimental data, or does it rely entirely on the AlphaFold model? That affects how much you should trust the pocket geometry for design.

Module 5 · Final Assessment

Materials Science and Discovery — Module Test

15 questions across all four lessons. Score ≥ 80% to pass the module.

1. What does "inverse design" mean in the context of AI-driven materials science?

Correct. Inverse design reverses the classic workflow — properties are inputs, structures are outputs.

Inverse design means starting from desired properties and computing the structure — not experimental correction or battery charging. Review Lesson 1.

2. The Materials Genome Initiative was launched in which year, and by which government?

Correct. The MGI launched in 2011 under President Obama, targeting a halving of materials discovery timelines.

Incorrect. The MGI was a 2011 U.S. federal initiative. Review Lesson 1.

3. What is the fundamental advantage of ML interatomic potentials (MLIPs) over DFT for materials screening?

Correct. MLIPs trade some accuracy for enormous speed, making large-scale screening tractable.

Incorrect. MLIPs are faster approximations, not more accurate alternatives. They require DFT training data. Review Lesson 1.

4. GNoME used an "active learning loop." What does this mean?

Correct. Active learning iteratively improves the model by targeting its own uncertainty — the most informative data points are selected for expensive DFT calculation.

Incorrect. Active learning is an automated loop between model predictions and DFT validation, not human correction. Review Lesson 2.
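The loop described in question 4 can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not GNoME's actual procedure: `expensive_oracle` is a hypothetical stand-in for a DFT calculation, the surrogate is a nearest-neighbour lookup, and "uncertainty" is approximated by the distance from a candidate to the nearest already-labelled point.

```python
import math

def expensive_oracle(x):
    """Hypothetical stand-in for a costly DFT calculation."""
    return math.sin(3 * x) + 0.5 * x

def nearest_labeled(x, labeled):
    """Labelled input closest to x."""
    return min(labeled, key=lambda xl: abs(xl - x))

def predict(x, labeled):
    """Toy surrogate: reuse the label of the nearest labelled point."""
    return labeled[nearest_labeled(x, labeled)]

def uncertainty(x, labeled):
    """Proxy uncertainty: distance to the nearest labelled point."""
    return abs(x - nearest_labeled(x, labeled))

def active_learning(pool, labeled, rounds):
    """Each round, label the candidate the surrogate is least sure about."""
    pool, labeled = list(pool), dict(labeled)
    for _ in range(rounds):
        if not pool:
            break
        query = max(pool, key=lambda x: uncertainty(x, labeled))
        labeled[query] = expensive_oracle(query)  # the expensive step
        pool.remove(query)
    return labeled, pool

pool = [i / 20 for i in range(1, 21)]           # unlabelled candidates
seed = {0.0: expensive_oracle(0.0)}             # one initial labelled point
labeled, remaining = active_learning(pool, seed, rounds=4)
worst_case = max(uncertainty(x, labeled) for x in remaining)
```

With only four oracle calls the labelled points spread evenly across the pool, which is the point of the technique: expensive calculations are spent where the model knows least.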

5. The Berkeley A-Lab's 2023 experiment autonomously synthesised compounds from GNoME predictions. What was the success rate?

Correct. 41 of 58 predicted compounds (71%) were confirmed by X-ray diffraction in just 17 days of autonomous robot operation.

Incorrect. The A-Lab achieved 71% (41/58). Review Lesson 2's Berkeley callout.

6. A compound predicted to have an energy above the convex hull (e_hull) of 0 meV/atom is described as thermodynamically stable. Why might it still not be synthesisable?

Correct. 0 K vacuum calculations do not capture synthesis reality. Kinetic accessibility, air stability, and processability are separate questions. Review Lesson 2.

Incorrect. The issue is the gap between 0 K vacuum thermodynamics and real synthesis conditions — not model error or metastability definition. Review Lesson 2.
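To make the e_hull quantity in question 6 concrete, here is a minimal sketch for a binary A-B system. The compositions and formation energies are invented toy values, and the function names are assumptions for this example. Each phase is a point (fraction of B, formation energy in eV/atom); e_hull is the vertical distance from a phase to the lower convex hull of all phases.

```python
def _cross(o, a, b):
    """z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull(points):
    """Lower convex hull of (x, energy) points (Andrew's monotone chain)."""
    best = {}
    for x, e in points:                      # lowest-energy phase per composition
        best[x] = min(e, best.get(x, e))
    hull = []
    for p in sorted(best.items()):
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()                       # previous point lies on or above the hull
        hull.append(p)
    return hull

def energy_above_hull(x, energy, phases):
    """Energy (eV/atom) of a phase above the hull built from all phases."""
    hull = lower_hull(list(phases) + [(x, energy)])
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            e_on_hull = e0 + (e1 - e0) * (x - x0) / (x1 - x0)
            return energy - e_on_hull
    raise ValueError("composition outside the range spanned by the phases")

# Toy 0 K formation energies (eV/atom); the elements A and B define zero.
phases = [(0.0, 0.0), (1.0, 0.0), (0.5, -0.40), (0.75, -0.30)]
e_hull_a3b = energy_above_hull(0.25, -0.10, phases) * 1000   # meV/atom
e_hull_ab = energy_above_hull(0.50, -0.40, phases) * 1000    # meV/atom
```

In this toy system the phase at x = 0.25 sits about 100 meV/atom above the hull, while the phase at x = 0.5 lies on it. A value of 0 only says that no competing combination of phases is lower in energy at 0 K; it says nothing about kinetics, air stability, or processability.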

7. Which property of a solid-state electrolyte must be LOW (not high) for the battery to function safely?

Correct. High electronic conductivity in an electrolyte creates an internal short circuit. Ion conduction must be high; electron conduction must be suppressed. Review Lesson 3.

Incorrect. The electrolyte must have LOW electronic conductivity — otherwise it short-circuits the cell. Review the property requirements in Lesson 3.

8. In the Microsoft-PNNL solid electrolyte discovery pipeline, how many candidates remained after the AI filtering stage?

Correct. AI filtering reduced 32 million candidates to ~5,000. DFT then reduced that to 18 for experimental synthesis. Review Lesson 3.

Incorrect. The post-AI-filtering pool was ~5,000. 32M was the starting pool; 18 was post-DFT; 1 was the final candidate. Review Lesson 3.
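The funnel in question 8 follows a general pattern: cheap filters run first on the whole pool, and expensive checks run only on the survivors. Going from 32 million to about 5,000 is a 6,400-fold reduction before any DFT is spent. The sketch below shows that pattern on a handful of invented candidates; the field names, thresholds, and the use of band gap as a proxy for low electronic conductivity are assumptions for this example, not details of the Microsoft-PNNL pipeline.

```python
def run_funnel(candidates, stages):
    """Apply (label, keep) stages in order and record survivor counts."""
    survivors = list(candidates)
    counts = [("start", len(survivors))]
    for label, keep in stages:
        survivors = [c for c in survivors if keep(c)]
        counts.append((label, len(survivors)))
    return survivors, counts

# Invented toy candidates with hypothetical predicted properties.
candidates = [
    {"name": "A", "ionic_mS_cm": 2.0,  "band_gap_eV": 4.5, "dft_stable": True},
    {"name": "B", "ionic_mS_cm": 0.01, "band_gap_eV": 5.0, "dft_stable": True},
    {"name": "C", "ionic_mS_cm": 3.0,  "band_gap_eV": 0.5, "dft_stable": True},
    {"name": "D", "ionic_mS_cm": 1.5,  "band_gap_eV": 3.8, "dft_stable": False},
    {"name": "E", "ionic_mS_cm": 0.8,  "band_gap_eV": 4.1, "dft_stable": True},
]

# Cheap surrogate checks first, the costly check last.
stages = [
    ("ML: ionic conductivity high", lambda c: c["ionic_mS_cm"] >= 0.5),
    ("ML: electronically insulating", lambda c: c["band_gap_eV"] >= 3.0),
    ("DFT: stable", lambda c: c["dft_stable"]),
]

survivors, counts = run_funnel(candidates, stages)
reduction_factor = 32_000_000 / 5_000   # size of the AI-filtering step in question 8
```

Ordering the stages by cost matters more than the individual thresholds: the expensive stage only ever sees what the cheap stages let through.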

9. What was the role of an LLM in the Microsoft-PNNL materials discovery campaign?

Correct. Literature mining via LLM — parsing tables, resolving units, standardising measurements — created the training data that downstream ML property predictors depended on. Review Lesson 3.

Incorrect. The LLM mined literature for ionic conductivity data. Review the LLM callout in Lesson 3.

10. AlphaFold 2's training data came exclusively from which source?

Correct. AlphaFold 2 learned physical folding constraints implicitly from PDB experimental structures — no simulated data were required. Review Lesson 4.

Incorrect. AlphaFold trained on the Protein Data Bank's experimentally determined structures. Review Lesson 4.

11. What does a pLDDT score below 50 in an AlphaFold prediction indicate?

Correct. Low pLDDT regions are either genuinely intrinsically disordered or poorly predicted — either way, 3D coordinates in those regions should not drive drug design decisions. Review Lesson 4.

Incorrect. pLDDT below 50 signals low confidence / likely disorder — not large size or poor homology. Review Lesson 4.
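In practice the 50 cut-off is usually applied per residue. AlphaFold-format PDB files store pLDDT in the B-factor column, so a few lines of parsing are enough to flag regions that should not drive design decisions. The sketch below is a minimal illustration: the function names and the three-residue example are made up for this purpose, and the band labels follow the conventional 90 / 70 / 50 boundaries.

```python
def plddt_band(score):
    """Map a pLDDT value (0-100) to the conventional confidence band."""
    if score >= 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"

def ca_plddt(pdb_text):
    """Return {residue_number: pLDDT} read from the CA atoms of PDB-format text."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            # Fixed-width PDB columns: residue number 23-26, B-factor 61-66.
            scores[int(line[22:26])] = float(line[60:66])
    return scores

def low_confidence_residues(pdb_text, cutoff=50.0):
    """Residue numbers whose pLDDT falls below the cutoff."""
    return sorted(r for r, s in ca_plddt(pdb_text).items() if s < cutoff)

def _ca_line(serial, res_name, res_seq, plddt):
    """Build one correctly aligned CA ATOM record (coordinates are dummies)."""
    return ("ATOM  {:5d}  CA  {:>3s} A{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
            .format(serial, res_name, res_seq, 0.0, 0.0, 0.0, 1.0, plddt))

# Made-up three-residue example, not taken from any real prediction.
example = "\n".join([
    _ca_line(1, "MET", 1, 95.2),
    _ca_line(2, "LYS", 2, 71.4),
    _ca_line(3, "GLY", 3, 38.9),
])
flagged = low_confidence_residues(example)
bands = {r: plddt_band(s) for r, s in ca_plddt(example).items()}
```

A run of flagged residues lining a binding pocket is a signal to seek experimental structure data before docking or binder design.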

12. RFdiffusion is best described as which type of model?

Correct. RFdiffusion is a diffusion model — it generates novel backbones by iteratively denoising random configurations, conditioned on specified functional geometry. Review Lesson 4.

Incorrect. RFdiffusion is a diffusion generative model for backbones, not a sequence model or physics minimiser. Review Lesson 4.
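The phrase "iteratively denoising random configurations" can be illustrated with a toy DDPM-style sampler. This is a sketch under a strong simplifying assumption: the "data distribution" is a single known target point, so the noise predictor can be written analytically instead of learned. RFdiffusion uses a trained neural network operating on protein backbones; nothing below reflects its architecture or code. The schedule values and function names are assumptions for this example.

```python
import math
import random

def make_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule: betas, alphas, and cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * t / (steps - 1)
             for t in range(steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars

def sample(target, steps=200, seed=0):
    """Reverse diffusion from Gaussian noise toward a single target point."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]          # start from pure noise
    for t in reversed(range(steps)):
        scale = math.sqrt(1.0 - alpha_bars[t])
        # Analytic noise estimate, valid only because the data is one point.
        eps_hat = [(xi - math.sqrt(alpha_bars[t]) * mi) / scale
                   for xi, mi in zip(x, target)]
        mean = [(xi - betas[t] / scale * e) / math.sqrt(alphas[t])
                for xi, e in zip(x, eps_hat)]
        if t > 0:
            sigma = math.sqrt(betas[t])                 # re-inject a little noise
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean                                    # final step is noise-free
    return x

target = [1.5, -2.0, 0.25]      # stand-in for a "desired geometry"
result = sample(target)
```

Each step removes a little of the estimated noise and adds back a smaller random perturbation, so the sample drifts from noise toward the target. In a learned model the conditioning information shapes the noise estimate, which is what steers generation toward a specified functional site.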

13. Insilico Medicine's CDK20 oncology program reached Phase I clinical trials in under 30 months. What primarily enabled this speed?

Correct. AlphaFold-enabled structural docking combined with the Chemistry42 generative platform allowed rapid candidate identification and optimisation. Review Lesson 4.

Incorrect. The speed came from AI-guided structure-based design using AlphaFold predictions — not regulatory shortcuts or drug repurposing. Review Lesson 4.

14. The 2022 malaria vaccine application of AlphaFold used predicted structures to do what?

Correct. AlphaFold predicted structures of insoluble parasite proteins that resisted crystallography, revealing conserved epitopes that guided vaccine antigen design. Review Lesson 4's malaria callout.

Incorrect. AlphaFold provided structural insight into previously intractable proteins — it identified epitopes for antigen selection, not simulated replication or replaced clinical trials. Review Lesson 4.

15. Which of the following best characterises the current state of AI in materials and molecular discovery?

Correct. This accurately captures the current state: AI compresses the search and prediction stages; the experimental pipeline still requires human expertise, robotic automation, and careful validation. This has been the module's recurring theme.

Incorrect. AI assists and accelerates but does not replace experimental science; it has been applied across both organic and inorganic materials; and prediction accuracy is demonstrably useful in real pipelines. Review the module's recurring themes.