AI in Science · Introduction

The Instrument That Changes Every Instrument

Science has always advanced by building better tools — and this one rewrites the rules of every laboratory on Earth.

In November 1895, Wilhelm Röntgen was experimenting with cathode ray tubes in Würzburg when he noticed a fluorescent screen glowing across the room, even though the tube was shielded and no direct beam could reach it. Within weeks he had photographed the bones inside his wife's hand. The X-ray did not merely add a new technique to medicine; it made the interior of the living body legible to science for the first time, and every discipline it touched (surgery, crystallography, astronomy) was fundamentally altered by the new kind of seeing it enabled.

Something structurally similar is happening right now across the biological and physical sciences. In November 2020, DeepMind's AlphaFold 2 predicted the three-dimensional structures of proteins in the CASP14 assessment with accuracy competitive with experimental methods such as X-ray crystallography. By July 2022 DeepMind and EMBL-EBI had released predicted structures for nearly every catalogued protein known to science, roughly 200 million entries, freely online. Computational biology, drug discovery, and evolutionary research are each absorbing consequences that researchers are still struggling to articulate. The pattern is the same as 1895: a new instrument arrives, and the questions scientists can ask expand overnight.

This course examines AI as that instrument — honestly, technically, and critically. You will learn how machine learning models are trained on scientific data, where they genuinely accelerate discovery, and where they introduce new failure modes that demand careful scrutiny. We will not pretend AI is infallible, nor that it is merely hype. The goal is fluency: by the end of this module you should be able to evaluate a real AI-assisted scientific claim the way you would evaluate any other experimental result.

If you finish every module, here's who you become:

You'll understand how machine learning models are trained on scientific data and where that training produces reliable results versus dangerous blind spots.
You'll be able to read an AI-assisted scientific claim — a protein structure prediction, a climate model output, a drug candidate — and evaluate it the way you would any experimental result.
You'll know the AlphaFold story in technical detail: what it solved, how it works, and why 200 million freely released protein structures changed biology faster than the field could absorb.
You'll recognize the specific reproducibility and peer-review failures that AI introduces into research pipelines, and you'll know what responsible validation looks like in response.
You'll become someone who can move across disciplines — from drug discovery to materials science to astronomy — and identify where AI is genuinely accelerating work versus where it is substituting confidence for rigor.
You'll trace a coherent line from Röntgen's fluorescent screen to AlphaFold 2, understanding AI not as a trend but as a new instrument that expands which questions science can ask.
You'll leave thinking like a critical practitioner — fluent enough to collaborate with AI tools, skeptical enough to know when to push back on what they return.

AI in Science · Module 1 · Lesson 1

Pattern Recognition at Scale

What AI actually does in a laboratory — and why it is not the same thing as understanding.

How does a model trained on past data tell a scientist something genuinely new?

On the morning of January 15, 2020, a radiology team at Massachusetts General Hospital ran a chest CT scan through an AI model called AI-Rad Companion, developed by Siemens Healthineers. The scan showed faint bilateral ground-glass opacities, a pattern the model had learned to flag from training on thousands of prior scans. The patient had recently traveled internationally. The attending radiologist, already alerted by the system's confidence score, escalated the case. It was one of the earliest confirmed COVID-19 cases in the northeastern United States. The AI did not diagnose the disease (it had never seen SARS-CoV-2 before), but it recognized a known visual pattern in a new context, and that pattern recognition bought time that mattered.

This is the core transaction of AI in science: not the discovery of new laws, but the recognition of known patterns in volumes of data that exceed human bandwidth. Understanding what that transaction can and cannot accomplish is the foundation everything else in this course is built on.

What Machine Learning Actually Does

Machine learning, in its most common (supervised) form, is the process of finding a mathematical function that maps inputs to outputs by adjusting parameters until predictions match a labeled training set. This sounds deceptively simple, but the practical consequences in science are significant. A model trained on roughly 128,000 labeled retinal photographs, as in the Google study published by Gulshan et al. in JAMA in 2016, can detect signs of diabetic retinopathy with sensitivity and specificity comparable to those of ophthalmologists. It does not "understand" the retina in any biological sense; it has learned which pixel patterns correlate with which labels.
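
To make "adjusting parameters until predictions match a labeled training set" concrete, here is a minimal sketch in plain Python. The toy data, the linear model, and the names `fit_line`, `lr`, and `steps` are assumptions chosen for illustration; real scientific models have far more parameters, but the loop has the same shape.

```python
# Minimal sketch: "learning" as adjusting parameters until predictions match labels.
# The data and the linear model are illustrative assumptions, not from any cited study.

def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each parameter a small step against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy labeled training set generated from y = 3x + 1 (noise-free for clarity).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3 * x + 1 for x in xs]

w, b = fit_line(xs, ys)
print(f"learned w={w:.3f}, b={b:.3f}")
```

The fitted parameters recover the rule that generated the labels, yet nothing in the loop represents why that rule holds. That is the sense in which the model matches patterns rather than understands them.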

This distinction between correlation-based pattern matching and mechanistic understanding is not a weakness unique to AI; much of classical statistics operates the same way. But it becomes consequential when scientists use model outputs to make causal claims. A model that proposes which chemical compounds will inhibit a protein kinase (as Insilico Medicine's generative pipeline did for an idiopathic pulmonary fibrosis candidate that went from target discovery to preclinical candidate in about 18 months and later reached Phase II trials) is not explaining biochemistry. It is extrapolating from the geometry of molecules it was trained on. That extrapolation can be spectacularly useful. It can also be spectacularly wrong in exactly the cases where the training distribution does not cover the query.
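
The last point, that a model can fail exactly where the training distribution does not cover the query, can be sketched with a deliberately simple case. Everything here is assumed for illustration: the "true process" is a sine curve, the model is a straight line, and the input ranges are arbitrary.

```python
import math

def least_squares_line(xs, ys):
    """Closed-form ordinary least squares fit of y ~ w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    return w, mean_y - w * mean_x

def mean_abs_error(w, b, xs, ys):
    return sum(abs(w * x + b - y) for x, y in zip(xs, ys)) / len(xs)

def true_process(x):
    """Hypothetical underlying process; the model only ever sees samples of it."""
    return math.sin(x)

train_x = [i / 20 for i in range(21)]             # covers 0.0 .. 1.0
in_dist_x = [0.025 + i / 20 for i in range(20)]   # new points, same range
out_dist_x = [3.0 + i / 20 for i in range(21)]    # covers 3.0 .. 4.0

w, b = least_squares_line(train_x, [true_process(x) for x in train_x])
err_in = mean_abs_error(w, b, in_dist_x, [true_process(x) for x in in_dist_x])
err_out = mean_abs_error(w, b, out_dist_x, [true_process(x) for x in out_dist_x])
print(f"error inside training range:  {err_in:.3f}")
print(f"error outside training range: {err_out:.3f}")
```

Inside the range it was trained on, the line is a good approximation; outside it, the same parameters produce confident and badly wrong numbers. The model gives no signal that the second set of queries is any different from the first.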

Three broad families of ML are used in scientific contexts. Supervised learning requires labeled examples and learns a mapping from input to label; the radiological AI above is an instance. Unsupervised learning finds structure in unlabeled data: dimensionality reduction techniques like UMAP, used extensively in single-cell RNA sequencing since 2018, let biologists visualize high-dimensional gene expression data as interpretable clusters without pre-specifying what the clusters should be. Reinforcement learning optimizes behavior through reward signals: in 2022, DeepMind and the Swiss Plasma Center at EPFL used deep reinforcement learning to train a controller for the magnetic coils of the TCV tokamak, with the reward tied to how well the plasma held a target shape. AlphaFold 2, by contrast, is a supervised model trained on known structures; its iterative refinement of predictions (called recycling) is part of the network, not a reinforcement learning loop.

Critical Distinction

A model that predicts an outcome and a model that explains an outcome are doing fundamentally different things. Science needs both. Conflating them is one of the most common sources of overblown AI claims in peer-reviewed literature.

The Training Data Problem

Every AI model inherits the biases and gaps of the data it was trained on. In 2019, a widely used clinical risk-prediction algorithm deployed across US hospital systems was found by Obermeyer et al. (published in Science, October 2019) to systematically underestimate the illness severity of Black patients. Race was not a direct input. Instead, the model used healthcare cost as a proxy for health need, and historical spending on Black patients was lower due to systemic access inequities. The algorithm was doing exactly what it was trained to do. The training data encoded a social disparity, and the model faithfully reproduced it.
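
The proxy mechanism can be shown with a small simulation. This is a sketch under invented assumptions: the two groups, the `access_factor` of 0.7, and the uniform severity distribution are made up and are not taken from the Obermeyer et al. data. No model is trained here; the code only compares ranking patients by true need against ranking them by the cost proxy.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

def simulate_patients(n_per_group, access_factor):
    """Return (group, need, cost) tuples. All numbers are invented for illustration."""
    patients = []
    for group in ("A", "B"):
        for _ in range(n_per_group):
            need = random.uniform(0, 10)  # true severity, same distribution in both groups
            # Group B receives less care per unit of need, so it generates lower cost.
            factor = 1.0 if group == "A" else access_factor
            patients.append((group, need, need * factor))
    return patients

def share_of_group_b(patients, key_index, top_fraction=0.2):
    """Rank by the chosen column and report group B's share of the top slice."""
    ranked = sorted(patients, key=lambda p: p[key_index], reverse=True)
    top = ranked[: int(len(ranked) * top_fraction)]
    return sum(1 for p in top if p[0] == "B") / len(top)

patients = simulate_patients(n_per_group=1000, access_factor=0.7)
share_by_need = share_of_group_b(patients, key_index=1)  # ranking on the true target
share_by_cost = share_of_group_b(patients, key_index=2)  # ranking on the proxy

print(f"group B share of top 20% by need: {share_by_need:.2f}")
print(f"group B share of top 20% by cost: {share_by_cost:.2f}")
```

Both groups are equally sick by construction, yet ranking on cost places far fewer group B patients in the high-priority slice. A model trained to predict cost would inherit the same skew without ever seeing a group label.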

This is not an argument against using AI in science. It is an argument for treating training data as a first-class scientific artifact — subject to the same scrutiny as experimental samples. When the Allen Institute for Brain Science assembled its Allen Cell Types Database starting in 2015, each electrophysiological recording was accompanied by detailed metadata about the animal, brain region, and recording conditions, precisely so that future models trained on the data could account for systematic variation. Good scientific AI starts with good scientific data curation.

Key Terms

Supervised Learning: Training a model on input-output pairs where the correct output (label) is provided. The model learns a function from inputs to labels.

Training Distribution: The statistical population of examples a model was trained on. Performance degrades when test inputs fall outside this distribution.

Generalization: A model's ability to perform correctly on data it was not trained on. The central challenge of applied machine learning.

Proxy Variable: A measurable quantity used as a stand-in for an unmeasured variable of interest. Can introduce systematic bias if the proxy is imperfectly correlated with the true target.

Takeaway

AI in science is most powerful as a pattern-recognition engine operating at scales no human team can match. Its reliability is bounded by the quality and representativeness of its training data. Neither the hype nor the dismissal captures this accurately.

Lesson 1 Quiz

Pattern Recognition at Scale · 5 questions

1. The chest CT AI system described in the lesson recognized COVID-19-like patterns in January 2020. What is the most precise description of what it actually did?

Correct. The model had never encountered SARS-CoV-2; it recognized a pattern from its training distribution (ground-glass opacities) in a new context. Pattern recognition is not diagnosis, and diagnosis is not mechanistic explanation.

Not quite. The AI had no knowledge of SARS-CoV-2 specifically — it matched a visual pattern it had learned from prior labeled scans. Review the distinction between pattern recognition, diagnosis, and mechanistic understanding.

2. The 2019 Obermeyer et al. study found that a clinical risk algorithm systematically underestimated illness severity in Black patients. What was the root cause?

Correct. The model optimized for healthcare cost as a proxy for need. Because historical spending on Black patients was lower — due to systemic access inequities, not lower illness severity — the model learned to underestimate their needs. The bias was in the training data's structure, not in an explicit racial feature.

Review the lesson section on training data. The problem was not the algorithm's architecture or an explicit race variable — it was a proxy variable (healthcare cost) that encoded a social disparity already present in the historical data.

3. Which of the following best distinguishes supervised from unsupervised learning in a scientific context?

Correct. Supervised learning maps labeled inputs to outputs (e.g., labeled retinal images to disease grades). Unsupervised learning discovers patterns like clusters or dimensions in data that has no pre-provided labels (e.g., UMAP of single-cell RNA-seq). Reinforcement learning — not unsupervised — uses reward signals.

Re-read the section on the three families of ML. Supervised learning is defined by the presence of labels, not by data volume or scientific domain. Reward signals belong to reinforcement learning.

4. AlphaFold 2 predicted protein structures with accuracy matching experimental crystallography. What does this achievement represent?

Correct. AlphaFold 2 learned geometric relationships from the Protein Data Bank's known structures and extrapolated them to novel sequences. It does not explain the physics of folding from first principles — it predicts structure with high accuracy by recognizing patterns in a massive training corpus.

AlphaFold 2's achievement is impressive precisely because it shows how far pattern recognition can go — but it does not constitute a mechanistic theory of folding. It learned from existing known structures and extrapolates. Review the lesson's core distinction.

5. A model performs extremely well on its test set but fails badly on data collected six months later from a new hospital site. The most likely explanation is:

Correct. When deployment data differs systematically from training data — different equipment, patient population, clinical protocols — performance degrades. This is a distribution shift and is one of the primary reasons that scientific AI models that work in one lab often fail when applied in another.

The key concept here is training distribution and generalization. If the new hospital's data differs systematically from what the model trained on — different scanner, different demographics, different imaging protocols — the model is being asked to extrapolate outside its training distribution.

Lab 1 · The Training Data Audit

Interrogate an AI model's training data assumptions before trusting its outputs.

Lab Objective

In this lab you will act as a scientific reviewer evaluating an AI model's reliability. Your AI lab assistant has been briefed on a hypothetical genomics model. Your job is to ask probing questions about its training data, identify potential failure modes, and assess whether its outputs would be trustworthy in a specific deployment context.

Aim for at least 3 exchanges. Ask about training data sources, potential biases, distribution mismatch risks, and whether the model's predictions should be treated as explanatory or merely correlational.

Suggested opener: "Tell me about the training data used for this genomics model. What populations are represented, and what might be missing?"

AI Lab Assistant

Lesson 1 · Training Data

Welcome to Lab 1. I'm your AI lab assistant for this module on AI as a scientific tool. I've been briefed on a genomics variant-effect prediction model — let's call it GenomePred — that your team is considering deploying to identify pathogenic mutations in clinical samples. Ask me anything about its training data, expected reliability, or potential failure modes. I'll be as specific and critical as I can.

AI in Science · Module 1 · Lesson 2

Accelerating Discovery: Where AI Wins

Drug design, materials science, and climate modeling — the domains where AI has produced documented, replicable gains.

What kinds of scientific problems are structurally suited to machine learning, and why?

In February 2022, Insilico Medicine announced that a drug candidate, INS018_055, a small-molecule inhibitor of the kinase TNIK for idiopathic pulmonary fibrosis, had entered Phase I clinical trials; Phase II trials followed in 2023. The molecule had been generated by a generative AI system called Chemistry42 and had gone from target discovery to preclinical candidate in roughly eighteen months. The conventional timeline for that same journey is typically several years. No magic was involved: the AI searched chemical space more efficiently than humans could by hand, guided by a target structure and a set of learned constraints about what makes drug-like molecules viable. Speed, not insight, was the gain. But in a disease where patients have a median survival of three to five years from diagnosis, speed is not a trivial thing.

Why Some Problems Are Good Fits

Machine learning excels when a problem has a specific profile: large labeled datasets exist, the input space is high-dimensional (many variables, hard for humans to intuit), feedback is fast and quantifiable, and approximate solutions have real value even when perfect solutions are unachievable. Drug discovery hits all four criteria: the PubChem database contains over 100 million chemical structures with associated biological activity data; molecular descriptors can run into thousands of features; binding assays are automatable; and a compound that is 80% as potent as the theoretical optimum is still a drug worth developing.

Materials science has a similar profile. In 2023, Google DeepMind's GNoME (Graph Networks for Materials Exploration) system predicted 2.2 million new inorganic crystal structures, of which roughly 380,000 were identified as the most stable candidates, an order-of-magnitude expansion of the stable materials known at the time. The model was trained initially on the Materials Project database, itself a massive DFT-computed collection of known crystal energies. GNoME's predictions do not replace experimental synthesis; they prioritize which of the enormous space of possible materials are worth the experimentalist's time to attempt.

Weather and climate science represent a third domain. NVIDIA's FourCastNet, released in 2022, produces a week-long global weather forecast in seconds, with accuracy at short lead times comparable to the European Centre for Medium-Range Weather Forecasts' numerical models, which require supercomputers running for hours. The AI model was trained on decades of ERA5 reanalysis data, a retrospective reconstruction of atmospheric states from satellite and surface observations. FourCastNet does not solve the Navier-Stokes equations; it learned to emulate the outputs of models that do. For operational forecasting, the distinction matters less than the speed gain.

The Pattern Across Domains

Drug design, materials science, weather forecasting: in each case, a large corpus of prior computations or experiments was used to train a model that can now approximate those computations orders of magnitude faster. AI is not replacing the underlying science — it is replacing the expensive, slow step of re-running it for each new query.
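
The emulation pattern can be sketched in a few lines. This is an illustration under stated assumptions, not how GNoME or FourCastNet are built: `slow_simulator` is a stand-in for an expensive computation, and a piecewise-linear interpolant (the hypothetical `Emulator` class) stands in for the neural network.

```python
import bisect
import math

def slow_simulator(x, steps=20000):
    """Stand-in for an expensive computation: integrate exp(-t^2) from 0 to x."""
    dt = x / steps
    return sum(math.exp(-((i + 0.5) * dt) ** 2) for i in range(steps)) * dt

class Emulator:
    """Piecewise-linear surrogate fitted to stored simulator runs."""

    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys

    def predict(self, x):
        # Locate the pair of training points that bracket x and interpolate.
        i = min(max(bisect.bisect_right(self.xs, x), 1), len(self.xs) - 1)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# "Training": pay for the slow simulator once on a grid of inputs.
train_x = [i * 0.1 for i in range(21)]             # 0.0 .. 2.0
train_y = [slow_simulator(x) for x in train_x]
emulator = Emulator(train_x, train_y)

# "Inference": a new query is answered without re-running the simulator.
query = 1.234
approx = emulator.predict(query)
exact = slow_simulator(query)
print(f"emulator={approx:.4f}  simulator={exact:.4f}")
```

The emulator answers a new query with a lookup and one interpolation instead of twenty thousand integration steps. It is also only as good as its training grid: a query outside 0.0 to 2.0 would be extrapolated from the nearest segment with no warning.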

Speed vs. Understanding

The gains described above are real and large. They also do not advance mechanistic understanding in the same way a controlled experiment does. When FourCastNet predicts a hurricane's track, it cannot tell you which physical process it weighted most heavily. When Chemistry42 generates a drug candidate, it cannot explain in chemical terms why it chose that scaffold. This is sometimes called the interpretability problem, and it is one of the active frontiers of AI research in science.

The 2022 Nobel Prize in Chemistry, awarded to Carolyn Bertozzi, Morten Meldal, and K. Barry Sharpless for click chemistry, was not given to an AI system — it was given to researchers who understood, mechanistically, why certain reactions proceed with such efficiency. Click chemistry is now being integrated into AI-guided drug synthesis pipelines. The hierarchy matters: AI accelerates the search; human mechanistic insight generates the principles that make the search space bounded and navigable.

Key Terms

Generative Model: A model that learns the distribution of a training set and can sample new examples from that distribution. Chemistry42 and similar drug-design systems are generative models over chemical space.

Emulation: Using a fast ML model to approximate the outputs of a slow, expensive simulator (e.g., a climate model or DFT calculation) without re-running the simulator.

Interpretability: The degree to which a model's predictions can be understood in terms of its internal representations and the features it attends to. Low interpretability limits scientific utility even when predictive accuracy is high.

Chemical Space: The vast set of all theoretically synthesizable molecules, commonly estimated at around 10⁶⁰ for drug-like small molecules. AI-guided search navigates this space; no experimental program could enumerate it.

Lesson 2 Quiz

Accelerating Discovery · 5 questions

1. Insilico Medicine's Chemistry42 system helped take a drug candidate from target discovery to preclinical candidate in about 18 months. The primary scientific contribution was:

Correct. The AI's contribution was speed — efficient search through chemical space guided by learned constraints. It did not produce new mechanistic insights about pulmonary fibrosis biology. Speed, in this context, is itself clinically valuable.

The AI compressed timeline by efficient search, not by discovering new mechanisms. Review the lesson section on why drug discovery is a good ML fit, and the distinction between speed gains and mechanistic understanding.

2. Google DeepMind's GNoME predicted 2.2 million stable inorganic crystal structures. What is the most accurate characterization of its role in materials science?

Correct. GNoME functions as a prioritization engine — narrowing the experimentally accessible subset of a vast theoretical space. Synthesis and validation still require lab work. AI predictions and experimental confirmation are complementary, not competing.

GNoME's predictions are computational, not experimental. They prioritize what is worth synthesizing, but do not replace synthesis or explain quantum mechanics from first principles. Review the lesson's treatment of GNoME.

3. NVIDIA's FourCastNet produces weather forecasts in seconds that rival those of supercomputer-based numerical models. What technique does it use?

Correct. FourCastNet is a learned emulator — it approximates the behavior of numerical weather models without solving the underlying physics equations. Trained on ERA5 reanalysis, it has learned to mimic what those equations produce, orders of magnitude faster.

FourCastNet does not solve Navier-Stokes. It learned to emulate the outputs of models that do. This is the emulation pattern described in the lesson — a fast ML surrogate for a slow physics-based simulator.

4. The interpretability problem in scientific AI refers to:

Correct. Interpretability is about understanding the model's internal reasoning — which inputs it weighted, which features were decisive. Without interpretability, a correct prediction cannot be used to build scientific knowledge, only to make further predictions.

Interpretability is a technical concept about the transparency of a model's internal computations. It is distinct from communication to lay audiences or code readability. Review the lesson's definition.

5. Why is drug discovery described as a structurally good fit for machine learning?

Correct. Drug discovery satisfies the four-part profile for ML suitability described in the lesson: scale, dimensionality, quantifiable feedback, and practical value of approximate solutions. These structural properties — not anything special about chemistry — explain why AI has made inroads here.

Re-read the section on why some problems are good ML fits. The four criteria — data scale, high dimensionality, fast feedback, value of approximation — explain the match. AI certainly cannot replace clinical trials.

Lab 2 · Domain Fit Analysis

Evaluate whether a proposed AI application is structurally suited to machine learning.

Lab Objective

A research team wants to use ML to predict volcanic eruption timing from seismic and GPS deformation data. Your AI lab assistant can help you think through whether this is a good structural fit for ML, what data challenges exist, and how to frame the problem responsibly.

Use the four-part framework from the lesson (data scale, dimensionality, feedback speed, value of approximation) to probe the assistant. Aim for at least 3 exchanges.

Suggested opener: "I want to apply ML to predict volcanic eruption timing. Walk me through the four-part ML fitness framework and how this problem scores on each criterion."

AI Lab Assistant

Lesson 2 · Domain Fit

Hello. I'm ready to work through the ML domain-fit analysis for volcanic eruption prediction with you. This is a genuinely interesting case — some aspects look favorable, others are serious obstacles. Ask me anything about the data landscape, the problem structure, or how you might frame this responsibly for a research proposal.

AI in Science · Module 1 · Lesson 3

Failure Modes: When AI Gets Science Wrong

Overfitting, hallucination, spurious correlation, and adversarial fragility — the specific ways AI misleads scientific inquiry.

What systemic failures does AI introduce into the scientific process, and how should they be anticipated?

In 2019, a deep learning model published in Nature Medicine claimed to diagnose fourteen different diseases from chest X-rays — including pneumonia, pneumothorax, and pleural effusion — at a level matching or exceeding that of radiologists. The paper made international headlines. When independent researchers examined the training and test data more carefully, they found that images from the same patient appeared in both the training and test sets. The model had, at least in part, learned to recognize specific patients rather than generalizable disease features. The leakage of training data into the test set inflated the apparent accuracy. The model itself was not fraudulent; the evaluation was flawed. The result could not be fully replicated under rigorously separated conditions. Science corrected itself — but not before the original numbers had circulated widely.

Overfitting and Data Leakage

Overfitting occurs when a model learns the specific noise and idiosyncrasies of its training data rather than the generalizable signal. A model with too many parameters relative to training examples can achieve near-perfect accuracy on training data while performing no better than chance on new data. In science this is not merely an engineering nuisance — it produces false claims about predictive capability that contaminate the literature.
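
A minimal sketch of the effect, using synthetic data: the labels follow a simple rule with 20% of them flipped at random, and a 1-nearest-neighbor classifier plays the role of the over-flexible model because it memorizes every training point. The data generator and the hard-coded threshold rule are assumptions made to keep the example short.

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

def make_data(n, noise=0.2):
    """Points x in [0, 1]; true label is x > 0.5, but a fraction of labels is flipped."""
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < noise:
            label = not label  # label noise the model should NOT learn
        data.append((x, label))
    return data

def nearest_neighbor_predict(train, x):
    """1-nearest-neighbor: effectively memorizes every training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def threshold_predict(x):
    """A one-parameter rule matching the true signal (given, not learned, for brevity)."""
    return x > 0.5

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

train, test = make_data(200), make_data(2000)
memorizer = lambda x: nearest_neighbor_predict(train, x)

train_acc_memorizer = accuracy(memorizer, train)
test_acc_memorizer = accuracy(memorizer, test)
test_acc_simple = accuracy(threshold_predict, test)

print(f"memorizer: train={train_acc_memorizer:.2f}, test={test_acc_memorizer:.2f}")
print(f"simple rule: test={test_acc_simple:.2f}")
```

The memorizer scores perfectly on the data it has seen and noticeably worse on fresh data, while the simpler rule does better on the test set. Reporting only the training figure would overstate what the model can do.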

Data leakage is a specific and particularly insidious form of this problem. A survey by Kapoor and Narayanan (published in Patterns in 2023, following a 2022 preprint) compiled prior reviews from 17 scientific and medical fields that had found leakage in published ML research, collectively affecting roughly 300 papers. The forms of leakage included test-set contamination, temporal leakage (training on future data relative to the test period), and feature leakage (including target-derived features as inputs). The implied consequence is that a significant fraction of published ML accuracy numbers in science are optimistic, potentially substantially.

The structural safeguard is rigorous cross-validation with properly separated folds, or better, a completely held-out test set that is touched exactly once. In medical applications, patients — not individual scans — must be the unit of separation.
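
Splitting by patient rather than by scan might look like the sketch below. The record layout, a list of `(patient_id, scan_id)` tuples, and the function name `split_by_patient` are hypothetical; a real dataset would use its own identifiers.

```python
import random

def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split scan records so that no patient appears on both sides.

    `records` is a list of (patient_id, scan_id) tuples; the field names are
    hypothetical and stand in for whatever identifiers a real dataset uses.
    """
    patient_ids = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, int(len(patient_ids) * test_fraction))
    test_patients = set(patient_ids[:n_test])
    # Assign every scan according to its patient, never according to the scan itself.
    train = [r for r in records if r[0] not in test_patients]
    test = [r for r in records if r[0] in test_patients]
    return train, test

# Toy dataset: 10 patients with 3 scans each.
records = [(f"patient_{p}", f"scan_{p}_{s}") for p in range(10) for s in range(3)]
train, test = split_by_patient(records)

overlap = {pid for pid, _ in train} & {pid for pid, _ in test}
print(f"train scans: {len(train)}, test scans: {len(test)}, shared patients: {len(overlap)}")
```

The key design choice is that the shuffle operates on patient identifiers, so all scans from one person land on the same side of the boundary. Libraries such as scikit-learn offer group-aware splitters (for example `GroupKFold`) that apply the same idea to cross-validation folds.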

Replication Crisis Meets AI

The broader scientific replication crisis (documented by Ioannidis in 2005 and amplified by subsequent work) is compounded when AI tools with opaque internals are involved. It is harder to audit an AI claim than a conventional statistical claim, because the model's "reasoning" is distributed across millions of parameters with no direct semantic interpretation.

Spurious Correlation

A widely cited example from dermatology AI, reported in the late 2010s, showed that a deep learning model trained to detect melanoma in dermoscopy images had learned, in part, to associate ruler markings (which clinicians often place next to suspicious lesions for scale) with malignancy, because suspicious lesions were more likely to be photographed with rulers. The confound was not malicious; it was a reflection of clinical photography practice. The model was optimizing for anything correlated with the label, including artifacts of the data collection process that had nothing to do with the lesion itself.

This class of failure is called shortcut learning or Clever Hans behavior (after the early-20th-century horse that appeared to do arithmetic but was actually reading subtle cues from its questioners). Geirhos et al. surveyed the phenomenon in a 2020 paper in Nature Machine Intelligence; in earlier work presented at ICLR 2019, Geirhos and colleagues had shown that ImageNet-trained convolutional networks classify images based heavily on texture rather than shape, the opposite of the human visual system's strategy. When these networks are transferred to scientific imaging tasks (microscopy, satellite imagery, medical imaging), their texture bias can produce spectacular failures on test distributions that differ in texture statistics from training.
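
Shortcut learning can be reproduced with a toy classifier. In this sketch each "image" is reduced to two synthetic binary features, and the model is a decision stump that simply keeps whichever feature best matches the training labels. The 80% and 95% agreement rates, the feature names, and the stump itself are assumptions for illustration, not a description of any published model.

```python
import random

random.seed(2)  # fixed seed so the illustration is repeatable

def make_images(n, ruler_agreement):
    """Each 'image' is reduced to two binary features plus a label (all synthetic).

    lesion_cue agrees with the label 80% of the time (the intended signal).
    ruler agrees with the label `ruler_agreement` of the time (a collection artifact).
    """
    rows = []
    for _ in range(n):
        malignant = random.random() < 0.5
        lesion_cue = malignant if random.random() < 0.8 else not malignant
        ruler = malignant if random.random() < ruler_agreement else not malignant
        rows.append({"lesion_cue": lesion_cue, "ruler": ruler, "label": malignant})
    return rows

def feature_accuracy(rows, feature):
    return sum(r[feature] == r["label"] for r in rows) / len(rows)

def train_stump(rows, features=("lesion_cue", "ruler")):
    """Pick the single feature that best matches the label on the training data."""
    return max(features, key=lambda f: feature_accuracy(rows, f))

train = make_images(2000, ruler_agreement=0.95)   # rulers track malignancy in this clinic
deploy = make_images(2000, ruler_agreement=0.50)  # rulers are unrelated to malignancy here

chosen = train_stump(train)
train_acc = feature_accuracy(train, chosen)
deploy_acc = feature_accuracy(deploy, chosen)
print(f"feature chosen: {chosen}, train acc={train_acc:.2f}, deploy acc={deploy_acc:.2f}")
```

The stump prefers the ruler because, in the training clinic, it predicts the label better than the lesion itself. Once photography practice changes, the shortcut carries no information and accuracy falls to roughly coin-flip level.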

Hallucination in Scientific AI

Large language models used in scientific contexts (to summarize papers, suggest hypotheses, or write code) produce hallucinations: fluent, confident, factually wrong outputs. In a 2023 evaluation by researchers at the University of Ottawa, GPT-3.5 and GPT-4 were asked to summarize 50 medical research papers. GPT-3.5 introduced fabricated citations or misattributed findings in roughly 47% of cases; GPT-4 reduced but did not eliminate the problem. The papers that were most confidently summarized were not reliably the ones most accurately summarized. The fluency and confidence of a language model's output are not a reliable guide to its accuracy.

For scientific use this is a significant operational constraint. An AI assistant that generates plausible-sounding but fabricated references is not just unhelpful — it can poison a literature review or suggest experimental directions based on studies that do not exist. The mitigation is not to avoid LLMs but to use them with retrieval augmentation (grounding outputs in specific cited documents) and to verify all factual claims against primary sources.
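
One small piece of that verification step can be automated: checking that every citation in a generated summary points to a document someone has actually retrieved. The sketch below assumes citations appear as bracketed keys such as `[smith2021]`; that format, the keys, and the function name `unverified_citations` are invented for this illustration. It flags citations for manual checking and does not confirm that a cited paper supports the claim attached to it.

```python
import re

def unverified_citations(summary, trusted_sources):
    """Return citation keys in `summary` that are absent from `trusted_sources`.

    Citations are assumed to appear as bracketed keys such as [smith2021];
    that format, and the keys below, are invented for this illustration.
    """
    cited = re.findall(r"\[([A-Za-z0-9_-]+)\]", summary)
    return sorted({key for key in cited if key not in trusted_sources})

# Documents a reviewer has actually retrieved and checked (placeholder entries).
trusted_sources = {
    "smith2021": "Placeholder title of a retrieved paper",
    "lee2019": "Placeholder title of another retrieved paper",
}

# Hypothetical model output mixing grounded and ungrounded citations.
summary = (
    "Treatment improved outcomes in two cohorts [smith2021][lee2019], "
    "and a later trial confirmed the effect [nguyen2022]."
)

flagged = unverified_citations(summary, trusted_sources)
print("needs manual verification:", flagged)
```

A check like this catches references that were never retrieved, which is the cheapest class of fabrication to detect. Misattributed findings still require reading the primary source.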

Key Terms

Overfitting: When a model fits the noise in training data rather than the underlying signal, resulting in poor generalization to new data.

Data Leakage: When information from the test set (or future time periods) inadvertently influences model training, producing inflated performance estimates.

Shortcut Learning: When a model learns spurious correlations between irrelevant features (artifacts, confounds) and the target label, rather than the intended causal features.

Hallucination: In large language models: generating fluent, confident text that is factually incorrect. Particularly dangerous in scientific summarization and citation tasks.

Lesson 3 Quiz

Failure Modes · 5 questions

1. The chest X-ray model from 2019 had inflated accuracy because images from the same patient appeared in both training and test sets. This is an example of:

Correct. When patients appeared in both training and test sets, the model could learn patient-specific features — effectively "memorizing" individuals — rather than generalizable disease features. This is test-set contamination, the classic form of data leakage.

The specific failure mechanism here was test-set contamination: patient data crossing the train/test boundary. Overfitting is related but more general; shortcut learning involves confounds; distribution shift is about deployment context. Review the lesson's opening scene.

2. Kapoor and Narayanan's 2023 survey documented data leakage in hundreds of ML-based science papers across 17 fields. The most important implication for science is:

Correct. The finding implies systemic optimism in published ML accuracy claims — not necessarily fraud, but flawed evaluation practice. The response is stronger replication standards, held-out test sets, and independent validation, not abandonment of ML in science.

The implication is not to abandon ML but to apply rigorously separated evaluation standards. The finding means published numbers should be treated with appropriate skepticism pending independent replication. Review the lesson section on overfitting and leakage.

3. A melanoma detection model learned to associate ruler markings in dermoscopy images with malignancy. This is best classified as:

Correct. Ruler markings are a clinical photography artifact correlated with (but not causally related to) malignancy. The model found this shortcut because it was optimizing for correlation with the label, not for the biological features that cause malignancy. This is the Clever Hans problem in medical AI.

This is shortcut learning: exploiting a spurious correlation (ruler presence) rather than learning the intended causal features (lesion morphology). It differs from overfitting (which is about memorizing noise) and leakage (which is about test contamination). Review the lesson.

4. The finding that ImageNet-trained networks classify based heavily on texture rather than shape is dangerous for scientific imaging because:

Correct. Transfer learning from ImageNet to scientific domains is common practice. If those networks rely on texture statistics that differ between natural photographs and microscopy or satellite imagery, the transferred representations may encode the wrong inductive biases, producing unreliable predictions on scientific imaging tasks.

The danger is distributional: scientific images have different texture statistics than natural photographs. A texture-biased network transferred to scientific imaging may fail in ways that are not apparent from ImageNet benchmark performance. Review the Geirhos et al. finding in the lesson.

5. The best mitigation for LLM hallucination in scientific summarization tasks is:

Correct. Retrieval-augmented generation (RAG) grounds the model's output in specific retrieved documents that it can cite, which makes fabrication easier to detect. Independent verification against primary sources provides the final safety check. As the lesson notes, confidence and accuracy are not correlated in LLM outputs, so self-reported confidence scores are not a reliable filter.

Model size reduces but does not eliminate hallucination. Self-reported confidence is unreliable — the lesson explicitly notes that confidence and accuracy are uncorrelated in LLM outputs. The practical answer is retrieval augmentation plus human verification of primary sources. Review the lesson section on hallucination.

Lab 3 · Failure Mode Diagnosis

Identify which AI failure mode is operating in real research scenarios.

Lab Objective

You will be presented with brief descriptions of AI-assisted research outcomes that produced unexpected or concerning results. Work with your AI lab assistant to diagnose the failure mode in each scenario, explain the mechanism, and suggest what evaluation changes would have caught the problem earlier.

Aim for at least 3 exchanges. Push the assistant to be specific about mechanisms, not just labels.

Suggested opener: "Here's my first scenario: An AI model predicting hospital readmission performed extremely well during internal validation but dropped to near-baseline accuracy when deployed at a different hospital. What failure mode is this, and what's the mechanism?"

AI Lab Assistant

Lesson 3 · Failure Mode Diagnosis

Ready for failure mode diagnosis. Bring me your scenarios — real or hypothetical — and I'll work through the mechanism with you: what went wrong, why the evaluation didn't catch it, and what rigorous evaluation would look like. The goal is not just to name the failure but to understand it well enough to prevent it.

AI in Science · Module 1 · Lesson 4

Scientific Validity in the Age of AI

Reproducibility, peer review, and the standards that keep AI-assisted science honest.

What institutional and methodological changes does AI-assisted research demand of science?

In March 2023, a team led by Ranga Dias at the University of Rochester published a paper in Nature claiming that a room-temperature superconductor, a material that carries electrical current without resistance at near-ambient conditions, had been synthesized. The finding was extraordinary; verified room-temperature superconductivity would represent one of the most consequential materials science discoveries in a century. Within weeks, multiple independent groups failed to replicate the core result. By November 2023, the paper was retracted at the request of most of its co-authors, who stated that it did not accurately reflect how the measurements had been taken and how the raw data had been processed. The Nature superconductor episode is a cautionary parallel: when computational pipelines, AI or otherwise, sit between raw data and published conclusions, auditing that pipeline is as important as auditing the data itself.

Reproducibility and the AI Pipeline

Classical scientific reproducibility requires that another team, using the same materials and methods, can obtain the same result. AI-assisted science introduces new layers where reproducibility can fail: the random seeds used during training, the exact version of a library, the preprocessing order applied to data, and the hardware on which training ran can all produce different model weights — and therefore different predictions — even from nominally identical procedures.
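The effect of an unrecorded seed can be shown with a minimal sketch. It uses only the Python standard library; run_experiment is a hypothetical stand-in for a training run (it draws a few numbers rather than training anything), and environment_record is an illustrative helper, not part of any reporting standard.

```python
import json
import platform
import random
import sys


def run_experiment(seed: int) -> list:
    """Hypothetical stand-in for training: draws 'weights' from a seeded generator."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    return [rng.gauss(0.0, 1.0) for _ in range(5)]


def environment_record(seed: int) -> dict:
    """Collect basic facts a second team would need to re-create this run."""
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "machine": platform.machine(),
        "platform": sys.platform,
    }


weights_a = run_experiment(seed=42)
weights_b = run_experiment(seed=42)  # same seed: identical draws
weights_c = run_experiment(seed=43)  # different seed: different draws

same_seed_matches = weights_a == weights_b
different_seed_matches = weights_a == weights_c
record_json = json.dumps(environment_record(42), indent=2, sort_keys=True)

print("same seed reproduces:", same_seed_matches)
print("different seed reproduces:", different_seed_matches)
print(record_json)
```

A real training pipeline has more sources of variation than this toy shows (framework-level generators, data-loading order, non-deterministic hardware kernels), so a recorded seed is necessary but may not be sufficient on its own.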

In 2019, the Papers With Code initiative began tracking ML reproducibility, and its findings were sobering: across major ML benchmarks, a substantial fraction of claimed state-of-the-art results could not be reproduced by independent teams using the reported hyperparameters, even when code was shared. In 2021, the Machine Learning Reproducibility Challenge (MLRC) systematically attempted to reproduce papers from NeurIPS, ICLR, and ICML; roughly 30% of attempted reproductions failed or produced substantially different quantitative results from those reported.

For scientific fields adopting AI tools, this matters beyond the ML community. A genomics result that depends on an unreproducible ML model is an unreproducible genomics result. The fix is not philosophical — it requires model cards (structured documentation of training data, architecture, hyperparameters, and evaluation conditions), frozen random seeds, containerized environments (e.g., Docker or Singularity), and deposition of trained model weights alongside data in supplementary materials.
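One way to picture a model card is as a small structured record that travels with the weights. The sketch below is an assumption about how such a record could be laid out, not a published schema: the field names follow the items listed above, and every value (model name, data description, container tag) is hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelCard:
    """Illustrative model card; field names are assumptions, not a standard."""
    model_name: str
    training_data: str
    architecture: str
    hyperparameters: dict
    evaluation_conditions: str
    random_seed: int
    container_image: str
    known_limitations: list = field(default_factory=list)

    def missing_fields(self) -> list:
        """Return names of fields left empty, so gaps are visible before submission."""
        empty_values = ("", [], {}, None)
        return [name for name, value in asdict(self).items() if value in empty_values]


# All values below are hypothetical placeholders.
card = ModelCard(
    model_name="example-variant-classifier",
    training_data="Hypothetical cohort of 5,000 labeled genomes",
    architecture="Gradient-boosted trees",
    hyperparameters={"n_estimators": 200, "learning_rate": 0.05},
    evaluation_conditions="Held-out test set, split by organism",
    random_seed=42,
    container_image="example-lab/variant-classifier:1.0",
)

card_json = json.dumps(asdict(card), indent=2)
gaps = card.missing_fields()
print(card_json)
print("fields still empty:", gaps)
```

Here the check reports that known_limitations was never filled in, which is the kind of omission a reviewer would otherwise have to spot by hand.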

Emerging Standard

As of 2024, journals including Nature Methods, Cell Systems, and eLife have adopted reporting guidelines requiring authors using ML models to provide training code, model weights, and a description of the computational environment. These requirements are not yet universal, but they represent the direction of travel for rigorous AI-assisted science.

Peer Review and AI Opacity

Traditional peer review can evaluate statistical methods because reviewers are trained in statistics and the methods are documented in textbooks. Deep learning models used in science are frequently not interpretable to reviewers — or even to the authors who trained them. A reviewer can check whether a t-test was applied correctly; checking whether a 40-million-parameter neural network is overfitting requires access to training curves, held-out test performance disaggregated by subgroup, and ideally the model weights themselves.
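Disaggregating by subgroup is simple arithmetic once predictions are stored alongside a group label. The sketch below uses synthetic data invented for illustration (two groups, A and B) and a hypothetical helper, accuracy_by_subgroup, to show how an acceptable overall figure can coexist with a much weaker result for a smaller group.

```python
from collections import defaultdict


def accuracy_by_subgroup(records):
    """records: iterable of (subgroup, y_true, y_pred). Returns {subgroup: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subgroup, y_true, y_pred in records:
        total[subgroup] += 1
        correct[subgroup] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


# Synthetic data: group A has 80 cases (76 correct), group B has 20 cases (14 correct).
records = (
    [("A", 1, 1)] * 76 + [("A", 1, 0)] * 4
    + [("B", 1, 1)] * 14 + [("B", 1, 0)] * 6
)

overall = sum(y_true == y_pred for _, y_true, y_pred in records) / len(records)
per_group = accuracy_by_subgroup(records)

print(f"overall accuracy: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.2f}")
```

With these numbers the overall accuracy is 90 correct out of 100, while group A sits at 76 of 80 and group B at 14 of 20. A manuscript reporting only the first figure would hide the gap.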

The REPAIRS checklist (Reporting Standards for AI in Peer-Reviewed Science), proposed by a cross-disciplinary group in 2022, enumerates ten reporting items including dataset provenance, train/test separation methodology, model selection procedure, and sensitivity analysis. At present, compliance with such checklists is voluntary at most journals. The consequence is a literature where AI-assisted results are published under the same review standards that were designed for conventional experiments — a mismatch that favors false positives.

There is, however, a constructive counterpart. AI is also beginning to assist peer review itself: tools like StatReviewer (used by several American Psychological Association journals as of 2023) automatically check submitted manuscripts for statistical reporting errors, missing confidence intervals, and sample size inconsistencies. The same pattern-recognition capability that creates failure modes in scientific AI can also be used to enforce standards.

What Valid AI-Assisted Science Looks Like

A well-conducted AI-assisted scientific study in 2024 has several identifiable properties. The training and test sets are separated in a way that respects the natural units of the data (patients, not scans; organisms, not cells; geographic regions, not coordinates). The model's performance is reported disaggregated across meaningful subgroups — not just overall accuracy. The evaluation includes an honest discussion of the training distribution and where the model is likely to fail. Trained weights and preprocessing code are deposited in a public repository. And the paper distinguishes clearly between what the model predicts and what it explains.
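Splitting by natural unit can be sketched in a few lines. The example below is a minimal standard-library illustration, assuming each sample carries an identifier for the unit it came from; split_by_group and the synthetic patient and scan IDs are hypothetical names introduced here.

```python
import random


def split_by_group(samples, group_of, test_fraction=0.2, seed=0):
    """Assign whole groups (e.g. patients) to train or test, never single samples."""
    groups = sorted({group_of(s) for s in samples})  # sort first for a stable order
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [s for s in samples if group_of(s) not in test_groups]
    test = [s for s in samples if group_of(s) in test_groups]
    return train, test


# Synthetic example: 10 patients with 3 scans each.
scans = [
    {"patient_id": f"P{p:02d}", "scan_id": f"P{p:02d}-S{s}"}
    for p in range(10)
    for s in range(3)
]

train, test = split_by_group(scans, group_of=lambda s: s["patient_id"])
train_patients = {s["patient_id"] for s in train}
test_patients = {s["patient_id"] for s in test}
overlap = train_patients & test_patients

print(len(train), "training scans from", len(train_patients), "patients")
print(len(test), "test scans from", len(test_patients), "patients")
print("patients in both sets:", sorted(overlap))
```

Because whole patients are assigned to one side, no patient contributes scans to both sets. Libraries such as scikit-learn offer group-aware splitters (for example GroupShuffleSplit) that serve the same purpose in practice.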

None of this is exotic. It is the application of existing scientific standards — rigor, transparency, honest acknowledgment of limitations — to a new class of computational tool. The challenge is institutional: creating incentives for publishing negative results and failed replications, building reviewer capacity in ML methods across science disciplines, and updating reporting requirements at a pace that keeps up with the technology.

Key Terms

Model Card: Structured documentation accompanying a trained ML model, describing its training data, architecture, hyperparameters, intended use, and known limitations.

Reproducibility: The ability of an independent team to obtain the same scientific result using the same methods and data. In ML, requires specifying random seeds, library versions, hardware, and full preprocessing pipelines.

Subgroup Analysis: Evaluating model performance separately across meaningful population subsets (age, sex, race, geography) to detect differential performance that aggregate accuracy masks.

Containerization: Packaging software and its dependencies (e.g., in Docker) so that a computational analysis can be re-run in an identical environment on any system.

Lesson 4 Quiz

Scientific Validity in the Age of AI · 5 questions

1. The 2023 Nature room-temperature superconductor retraction is invoked in the lesson primarily to illustrate:

Correct. The case is a structural parallel: AI or otherwise, computational steps between raw data and published conclusions must be transparent and auditable. The lesson uses it to motivate the need for pipeline documentation, not to imply AI was involved in that specific case.

AI was not involved in the superconductor case. The lesson uses it as a structural parallel: computational pipelines of any kind require auditing. Review the opening scene and its explicit framing as a "cautionary parallel."

2. The Machine Learning Reproducibility Challenge found that roughly 30% of attempted reproductions of published ML papers failed or showed substantially different results. The primary cause of ML irreproducibility is:

Correct. ML irreproducibility is primarily a documentation and engineering problem, not a fraud problem. Small differences in random seeds, library versions, or hardware floating-point behavior can propagate through training to produce models with meaningfully different performance. The fix is precise documentation and containerization.

The primary driver is not fraud but underdocumented technical dependencies. Random seeds, library versions, preprocessing order, and hardware all affect the training outcome. Review the lesson section on reproducibility and the practical fixes it describes.

3. A model card is best described as:

Correct. A model card is a transparency artifact — structured documentation that allows users, reviewers, and future researchers to understand what a model was trained on, how it was evaluated, and where it is likely to fail. It is a scientific document, not a regulatory one.

A model card is a documentation artifact, not a patient-facing document or regulatory filing. Review the key terms in the lesson. Its purpose is to make the model's provenance and limitations transparent to scientific users and peer reviewers.

4. Subgroup analysis is considered essential in AI-assisted science because:

Correct. A model with 90% aggregate accuracy might perform at 95% for one demographic and 70% for another. Aggregate numbers flatten this variation. Subgroup analysis is how differential performance — and its potential harms — becomes visible. This connects directly to the Obermeyer algorithm case from Lesson 1.

Subgroup analysis doesn't fix leakage — it's about revealing differential performance across population groups. Aggregate accuracy can hide disparities that are clinically significant and ethically important. Review the lesson section and its connection to Lesson 1's Obermeyer example.

5. What institutional change does the lesson identify as the most critical remaining gap in AI-assisted science?

Correct. The lesson's final section identifies the challenge as institutional: the underlying standards (rigor, transparency, honest acknowledgment of limitations) are not new. The gap is building structures that reward their application: incentives for negative results, peer reviewer capacity, and adaptive reporting requirements. These are social and organizational problems as much as technical ones.

The lesson explicitly argues against banning ML and does not propose RCT requirements or centralization. Re-read the final section, which identifies the challenge as creating institutional incentives and capacity that apply existing scientific values to the new class of tool.

Lab 4 · Peer Review Simulation

Apply scientific validity standards to an AI-assisted research manuscript.

Lab Objective

You are a peer reviewer for a journal that has received a manuscript describing an AI model for predicting antibiotic resistance from bacterial genome sequences. The model reports 94% accuracy on its test set. Your AI lab assistant will play the role of the corresponding author. Your task: ask the hard questions a rigorous reviewer must ask before this paper can be accepted.

Focus your questions on: train/test separation, subgroup performance, model documentation (model card), reproducibility, and whether the paper distinguishes prediction from explanation. Aim for at least 3 exchanges.

Suggested opener: "I'm reviewing your manuscript on antibiotic resistance prediction. Your 94% accuracy figure — how was your test set constructed, and are you certain there is no patient- or strain-level overlap with your training data?"

AI Lab Assistant

Lesson 4 · Peer Review Simulation

Thank you for agreeing to review our manuscript. We're happy to address your concerns. Our model — ResistNet — was trained on 18,000 bacterial whole-genome sequences from the PATRIC database, labeled with minimum inhibitory concentration data. We achieved 94% accuracy on our held-out test set of 2,000 sequences. Please ask me anything about our methodology, and I'll answer as fully as I can — including the things we perhaps should have been more explicit about in the manuscript.

Module 1 Test

AI as a Scientific Tool · 15 questions · Pass at 80%

1. The core transaction of AI in science, as described in this module, is best summarized as:

Correct.

Review Lesson 1's core framing.

2. AlphaFold 2 released structures for approximately 200 million proteins in July 2022. The training resource that made this possible was:

Correct.

AlphaFold 2 learned from the Protein Data Bank's experimentally solved structures. Review Lesson 1.

3. A model correctly predicts that a compound will inhibit a kinase, but cannot explain which molecular interactions drive that inhibition. This illustrates:

Correct.

Review Lesson 1's discussion of prediction vs. explanation.

4. UMAP dimensionality reduction used in single-cell RNA sequencing is an example of:

Correct.

UMAP finds structure in unlabeled data — it is unsupervised. Review Lesson 1.

5. The four structural criteria that make a scientific problem a good ML fit are:

Correct.

Review Lesson 2's framework for ML domain fit.

6. Google DeepMind's GNoME system is best described as a tool that:

Correct.

GNoME predicts stability to guide experimental prioritization, not to replace synthesis or derive new physics. Review Lesson 2.

7. The term "emulation" in ML-assisted science refers to:

Correct.

Emulation is the surrogate model pattern — fast ML standing in for a slow simulator. Review Lesson 2's treatment of FourCastNet.

8. Temporal leakage in a medical AI model occurs when:

Correct.

Temporal leakage is a specific data leakage type where future information contaminates training. Review Lesson 3's leakage section.

9. The "Clever Hans" metaphor in ML refers to:

Correct.

Clever Hans exploited trainer cues rather than solving arithmetic. The ML parallel is shortcut learning. Review Lesson 3.

10. LLM hallucination is particularly dangerous in scientific contexts because:

Correct.

The core danger is that hallucinations sound as confident and fluent as accurate outputs. Review Lesson 3's hallucination section.

11. The minimum requirement for ML reproducibility that goes beyond sharing code is:

Correct.

Code alone is insufficient for ML reproducibility. Review Lesson 4's reproducibility section and the role of containerization and seed documentation.

12. Which of the following is the correct unit of separation for train/test splits in a clinical imaging AI study?

Correct.

If the same patient appears in both sets (even in different scans), the model can exploit patient-specific features. Patients are the correct unit. Review Lesson 4 and the chest X-ray case from Lesson 3.

13. The Insilico Medicine Chemistry42 case (IPF drug candidate, 2022) is evidence that AI can:

Correct.

The lesson explicitly notes that the contribution was speed, not mechanistic insight. Review Lesson 2's opening scene.

14. The DeepMind retinal disease detection study trained on 1.4 million labeled photographs achieved expert-level performance. The most accurate statement about what the model learned is:

Correct.

The model learned correlations between pixel patterns and labels — not biological mechanisms. Review Lesson 1's treatment of this study.

15. Which combination of practices best characterizes valid AI-assisted science in 2024?

Correct. This combination addresses the full set of validity concerns covered in the module: leakage prevention, differential performance visibility, reproducibility, and epistemic honesty about what the model does and does not demonstrate.

Review Lesson 4's final section on what valid AI-assisted science looks like. High accuracy and benchmark performance alone are insufficient without rigorous separation, subgroup analysis, reproducibility infrastructure, and epistemic clarity.