In November 1895, Wilhelm Röntgen was experimenting with cathode ray tubes in Würzburg when he noticed a fluorescent screen glowing across the room, even though the tube was shielded and no direct beam could reach it. Within weeks he had photographed the bones inside his wife's hand. The X-ray did not merely add a new technique to medicine; it made the interior of the living body legible to science for the first time, and every discipline it touched, from surgery to crystallography to astronomy, was fundamentally altered by the new kind of seeing it enabled.
Something structurally similar is happening right now across the biological and physical sciences. In November 2020, DeepMind's AlphaFold 2 predicted the three-dimensional structures of proteins at the CASP14 assessment with accuracy competitive with experimental methods. By July 2022 the AlphaFold Protein Structure Database had released predicted structures for nearly every catalogued protein known to science, roughly 200 million entries, freely online; the human proteome, about 20,000 proteins, had been covered a year earlier. Computational biology, drug discovery, and evolutionary research are each absorbing consequences that researchers are still struggling to articulate. The pattern is the same as 1895: a new instrument arrives, and the questions scientists can ask expand overnight.
This course examines AI as that instrument — honestly, technically, and critically. You will learn how machine learning models are trained on scientific data, where they genuinely accelerate discovery, and where they introduce new failure modes that demand careful scrutiny. We will not pretend AI is infallible, nor that it is merely hype. The goal is fluency: by the end of this module you should be able to evaluate a real AI-assisted scientific claim the way you would evaluate any other experimental result.
On the morning of January 15, 2020, a radiology team at Massachusetts General Hospital ran a chest CT scan through an AI model called AI-Rad Companion, developed by Siemens Healthineers. The scan showed faint bilateral ground-glass opacities, a pattern the model had learned to flag from training on thousands of prior scans. The patient had recently traveled internationally. The attending radiologist, already alerted by the system's confidence score, escalated the case. It was one of the earliest confirmed COVID-19 cases in the northeastern United States. The AI did not diagnose the disease, since it had never seen SARS-CoV-2 before, but it recognized a known visual pattern in a new context, and that pattern recognition bought time that mattered.
This is the core transaction of AI in science: not the discovery of new laws, but the recognition of known patterns in volumes of data that exceed human bandwidth. Understanding what that transaction can and cannot accomplish is the foundation everything else in this course is built on.
Machine learning, in its most common supervised form, is the process of finding a mathematical function that maps inputs to outputs by adjusting parameters until predictions match a labeled training set. In science this sounds deceptively simple, but the practical consequences are significant. A model trained on roughly 15,000 labeled retinal OCT scans, as in DeepMind's collaboration with Moorfields Eye Hospital published in 2018, can recommend referrals for more than 50 eye conditions with accuracy matching that of expert clinicians. It does not "understand" the retina in any biological sense; it has learned which pixel patterns correlate with which labels.
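The idea of "adjusting parameters until predictions match a labeled training set" can be made concrete with a toy sketch. The example below fits a straight line by gradient descent; the data, the learning rate, and the function name `fit_line` are illustrative assumptions, not taken from any study mentioned in this course.

```python
# Toy supervised learning: fit y ~ w*x + b by gradient descent on mean squared error.
# Data and hyperparameters are illustrative assumptions.

def fit_line(xs, ys, learning_rate=0.01, steps=5000):
    """Adjust parameters (w, b) until predictions match the labeled examples."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error on each labeled example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step each parameter against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # labels generated by y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

The fitted parameters approach the slope and intercept that generated the labels. Nothing in the procedure "knows" why the relationship holds; it only reduces the mismatch between predictions and labels.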
This distinction between correlation-based pattern matching and mechanistic understanding is not a weakness unique to AI; much of classical statistics operates the same way. But it becomes consequential when scientists use model outputs to make causal claims. A model that proposes chemical compounds likely to inhibit a protein kinase (as Insilico Medicine's generative platform did for a novel idiopathic pulmonary fibrosis candidate that went from target discovery to preclinical candidate nomination in under 18 months and entered Phase II trials in 2023) is not explaining biochemistry. It is extrapolating from the geometry of molecules it was trained on. That extrapolation can be spectacularly useful. It can also be spectacularly wrong in exactly the cases where the training distribution does not cover the query.
Three broad families of ML are used in scientific contexts. Supervised learning requires labeled examples and learns a mapping from input to label; the radiological AI above is an instance. Unsupervised learning finds structure in unlabeled data: dimensionality reduction techniques like UMAP, used extensively in single-cell RNA sequencing since 2018, let biologists visualize high-dimensional gene expression data as interpretable clusters without pre-specifying what the clusters should be. Reinforcement learning optimizes behavior through reward signals: in 2022, DeepMind and the Swiss Plasma Center at EPFL used it to train a controller for the magnetic coils of the TCV tokamak, with the controller learning in simulation which adjustments kept the plasma in the desired shape. AlphaFold 2, by contrast, is a supervised model; its iterative refinement of a predicted structure is part of the network architecture, not a reinforcement learning loop.
A model that predicts an outcome and a model that explains an outcome are doing fundamentally different things. Science needs both. Conflating them is one of the most common sources of overblown AI claims in peer-reviewed literature.
Every AI model inherits the biases and gaps of the data it was trained on. In 2019, a widely used clinical risk-prediction algorithm deployed across US hospital systems was found by Obermeyer et al. (published in Science, October 2019) to systematically underestimate the illness severity of Black patients. Race was not a direct input; the model used healthcare cost as a proxy for health need, and historical spending on Black patients was lower due to systemic access inequities. The algorithm was doing exactly what it was trained to do. The training data encoded a social disparity, and the model faithfully reproduced it.
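The proxy-label mechanism can be illustrated with a deliberately simplified sketch. The patients, the two groups, and the `SPENDING_RATIO` values below are hypothetical and are not figures from the Obermeyer et al. study; the point is only to show how ranking by cost and ranking by need diverge when spending differs between groups with identical needs.

```python
from collections import Counter

# Hypothetical patients: both groups have identical health needs (1..10),
# but group B's recorded spending is lower for the same level of need.
# SPENDING_RATIO is an illustrative assumption, not a measured value.
SPENDING_RATIO = {"A": 1.0, "B": 0.65}

patients = [
    {"group": g, "need": need, "cost": need * SPENDING_RATIO[g]}
    for g in ("A", "B")
    for need in range(1, 11)
]

def flag_top(patients, key, k=6):
    """Flag the k patients ranked highest on the chosen score."""
    return sorted(patients, key=lambda p: p[key], reverse=True)[:k]

flagged_by_cost = Counter(p["group"] for p in flag_top(patients, "cost"))
flagged_by_need = Counter(p["group"] for p in flag_top(patients, "need"))
print("ranked by cost:", dict(flagged_by_cost))
print("ranked by need:", dict(flagged_by_need))
```

Ranking by true need flags the two groups equally, while ranking by cost flags mostly group A, even though group membership is never used as an input.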
This is not an argument against using AI in science. It is an argument for treating training data as a first-class scientific artifact — subject to the same scrutiny as experimental samples. When the Allen Institute for Brain Science assembled its Allen Cell Types Database starting in 2015, each electrophysiological recording was accompanied by detailed metadata about the animal, brain region, and recording conditions, precisely so that future models trained on the data could account for systematic variation. Good scientific AI starts with good scientific data curation.
AI in science is most powerful as a pattern-recognition engine operating at scales no human team can match. Its reliability is bounded by the quality and representativeness of its training data. Neither the hype nor the dismissal captures this accurately.
In this lab you will act as a scientific reviewer evaluating an AI model's reliability. Your AI lab assistant has been briefed on a hypothetical genomics model. Your job is to ask probing questions about its training data, identify potential failure modes, and assess whether its outputs would be trustworthy in a specific deployment context.
Aim for at least 3 exchanges. Ask about training data sources, potential biases, distribution mismatch risks, and whether the model's predictions should be treated as explanatory or merely correlational.
In June 2023, Insilico Medicine announced that a drug candidate, INS018_055, which targets the kinase TNIK for idiopathic pulmonary fibrosis, had entered Phase II clinical trials. The target had been identified with the company's AI platform and the molecule designed with its generative chemistry system, Chemistry42; the program went from target discovery to preclinical candidate nomination in under eighteen months. The conventional timeline for that stage is typically several years. No magic was involved: the AI searched chemical space more efficiently than humans could by hand, guided by a target structure and a set of learned constraints about what makes drug-like molecules viable. Speed, not insight, was the gain. But in a disease where patients have a median survival of three to five years from diagnosis, speed is not a trivial thing.
Machine learning excels when a problem has a specific profile: large labeled datasets exist, the input space is high-dimensional (many variables, hard for humans to intuit), feedback is fast and quantifiable, and approximate solutions have real value even when perfect solutions are unachievable. Drug discovery hits all four criteria: the PubChem database contains over 100 million chemical structures with associated biological activity data; molecular descriptors can run into thousands of features; binding assays are automatable; and a compound that is 80% as potent as the theoretical optimum is still a drug worth developing.
Materials science has a similar profile. In 2023, Google DeepMind's GNoME (Graph Networks for Materials Exploration) system predicted 2.2 million new inorganic crystal structures, of which roughly 380,000 were predicted to be stable, several times the approximately 48,000 stable crystals previously catalogued. The model was trained on the Materials Project database, itself a massive DFT-computed collection of known crystal energies. GNoME's predictions do not replace experimental synthesis; they prioritize which of the enormous space of possible materials are worth the experimentalist's time to attempt.
Weather and climate science represent a third domain. NVIDIA's FourCastNet, released in 2022, produces week-long global weather forecasts in seconds, with accuracy at short lead times comparable to the European Centre for Medium-Range Weather Forecasts' numerical model, which requires supercomputers running for hours. The AI model was trained on roughly four decades of ERA5 reanalysis data, a retrospective reconstruction of atmospheric states that blends satellite and surface observations with a numerical model. FourCastNet does not numerically integrate the equations of atmospheric motion; it learned to emulate the outputs of models that do. For operational forecasting, the distinction matters less than the speed gain.
Drug design, materials science, weather forecasting: in each case, a large corpus of prior computations or experiments was used to train a model that can now approximate those computations orders of magnitude faster. AI is not replacing the underlying science — it is replacing the expensive, slow step of re-running it for each new query.
The gains described above are real and large. They also do not advance mechanistic understanding in the same way a controlled experiment does. When FourCastNet predicts a hurricane's track, it cannot tell you which physical process it weighted most heavily. When Chemistry42 generates a drug candidate, it cannot explain in chemical terms why it chose that scaffold. This is sometimes called the interpretability problem, and it is one of the active frontiers of AI research in science.
The 2022 Nobel Prize in Chemistry, awarded to Carolyn Bertozzi, Morten Meldal, and K. Barry Sharpless for click chemistry, was not given to an AI system — it was given to researchers who understood, mechanistically, why certain reactions proceed with such efficiency. Click chemistry is now being integrated into AI-guided drug synthesis pipelines. The hierarchy matters: AI accelerates the search; human mechanistic insight generates the principles that make the search space bounded and navigable.
A research team wants to use ML to predict volcanic eruption timing from seismic and GPS deformation data. Your AI lab assistant can help you think through whether this is a good structural fit for ML, what data challenges exist, and how to frame the problem responsibly.
Use the four-part framework from the lesson (data scale, dimensionality, feedback speed, value of approximation) to probe the assistant. Aim for at least 3 exchanges.
In 2019, a deep learning model published in Nature Medicine claimed to diagnose fourteen different diseases from chest X-rays — including pneumonia, pneumothorax, and pleural effusion — at a level matching or exceeding that of radiologists. The paper made international headlines. When independent researchers examined the training and test data more carefully, they found that images from the same patient appeared in both the training and test sets. The model had, at least in part, learned to recognize specific patients rather than generalizable disease features. The leakage of training data into the test set inflated the apparent accuracy. The model itself was not fraudulent; the evaluation was flawed. The result could not be fully replicated under rigorously separated conditions. Science corrected itself — but not before the original numbers had circulated widely.
Overfitting occurs when a model learns the specific noise and idiosyncrasies of its training data rather than the generalizable signal. A model with too many parameters relative to training examples can achieve near-perfect accuracy on training data while performing no better than chance on new data. In science this is not merely an engineering nuisance — it produces false claims about predictive capability that contaminate the literature.
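A minimal sketch of this failure, under the assumption that the labels are pure noise: a one-nearest-neighbor "model" memorizes its training set perfectly, yet has nothing generalizable to offer on new data. The dataset sizes and function names are illustrative choices.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def make_noise_dataset(n):
    """Features and labels are independent: there is no signal to learn."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]

def predict_1nn(train, x):
    """Return the label of the closest training point (pure memorization)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def accuracy(train, data):
    """Fraction of examples in data whose label the 1-NN rule reproduces."""
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)

train = make_noise_dataset(200)
test = make_noise_dataset(1000)
train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Training accuracy is perfect because every training point is its own nearest neighbor; test accuracy hovers around one half because the labels carry no signal. Reporting only the first number would be a false claim about predictive capability.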
Data leakage is a specific and particularly insidious form of this problem. A survey by Kapoor and Narayanan (Patterns, 2023; first circulated as a preprint in 2022) compiled earlier reviews that had identified leakage in hundreds of published ML papers across 17 scientific fields. The forms of leakage they catalogued include test-set contamination, temporal leakage (training on data from after the test period), and feature leakage (including target-derived features as inputs). The implied consequence is that a significant fraction of published ML accuracy numbers in science are optimistic, potentially substantially.
The structural safeguard is rigorous cross-validation with properly separated folds, or better, a completely held-out test set that is touched exactly once. In medical applications, patients — not individual scans — must be the unit of separation.
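Patient-level separation can be sketched in a few lines. The record layout (`patient_id`, `scan_id`) and the helper names below are hypothetical; the essential step is that patient IDs, not scans, are shuffled and divided.

```python
import random

def split_by_patient(scans, test_fraction=0.2, seed=0):
    """Split scans so that every patient falls entirely in train or in test."""
    patient_ids = sorted({s["patient_id"] for s in scans})
    random.Random(seed).shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_patients = set(patient_ids[:n_test])
    train = [s for s in scans if s["patient_id"] not in test_patients]
    test = [s for s in scans if s["patient_id"] in test_patients]
    return train, test

def shared_patients(train, test):
    """Return patient IDs that appear on both sides (should be empty)."""
    return {s["patient_id"] for s in train} & {s["patient_id"] for s in test}

# Hypothetical records: 10 patients with 3 scans each.
scans = [
    {"patient_id": f"P{p:02d}", "scan_id": f"P{p:02d}-S{i}"}
    for p in range(10)
    for i in range(3)
]
train, test = split_by_patient(scans)
assert not shared_patients(train, test)  # no patient straddles the split
print(len(train), len(test))
```

A check like `shared_patients` is cheap to run and worth reporting alongside results, since a random split over individual scans would almost certainly place the same patient on both sides.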
The broader scientific replication crisis (documented by Ioannidis in 2005 and amplified by subsequent work) is compounded when AI tools with opaque internals are involved. It is harder to audit an AI claim than a conventional statistical claim, because the model's "reasoning" is distributed across millions of parameters with no direct semantic interpretation.
Dermatology supplies a widely cited example. In 2018, researchers discussing a deep learning skin-lesion classifier reported that it had learned, in part, to associate ruler markings (which clinicians often place next to suspicious lesions for scale) with malignancy, because suspicious lesions were more likely to be photographed with rulers. The confound was not malicious; it was a reflection of clinical photography practice. The model was optimizing for anything correlated with the label, including artifacts of the data collection process that had nothing to do with the lesion itself.
This class of failure is called shortcut learning or Clever Hans behavior (after the early-20th-century horse that appeared to do arithmetic but was actually reading subtle cues from its questioner). In a 2020 paper in Nature Machine Intelligence, Geirhos et al. surveyed such shortcuts, including their earlier finding that ImageNet-trained convolutional networks classify images based heavily on texture rather than shape, the opposite of the human visual system's strategy. When these networks are transferred to scientific imaging tasks (microscopy, satellite imagery, medical imaging), their texture bias can produce spectacular failures on test distributions that differ in texture statistics from training.
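Shortcut learning can be reproduced in miniature. In the hypothetical data below, a `ruler` feature tracks the label perfectly during training while a genuine but noisy `asymmetry` feature agrees with it 80% of the time; a one-feature classifier therefore prefers the ruler. The features, proportions, and function names are all illustrative assumptions.

```python
def stump_accuracy(data, feature):
    """Accuracy of the rule 'predict malignant (1) exactly when feature == 1'."""
    return sum(row[feature] == row["label"] for row in data) / len(data)

def fit_stump(data, features):
    """Pick the single feature that best matches the labels on the given data."""
    return max(features, key=lambda f: stump_accuracy(data, f))

def make_rows(ruler_follows_label):
    """Build 10 rows: 5 benign (0) and 5 malignant (1)."""
    rows = []
    for label in (0, 1):
        for i in range(5):
            # Genuine signal, but noisy: agrees with the label in 4 of 5 cases.
            asymmetry = label if i < 4 else 1 - label
            # Artifact: present only when the photo practice ties it to the label.
            ruler = label if ruler_follows_label else 0
            rows.append({"label": label, "asymmetry": asymmetry, "ruler": ruler})
    return rows

train = make_rows(ruler_follows_label=True)   # rulers accompany malignant lesions
test = make_rows(ruler_follows_label=False)   # new clinic: no rulers at all
chosen = fit_stump(train, ["asymmetry", "ruler"])
print("chosen feature:", chosen)
print("train accuracy:", stump_accuracy(train, chosen))
print("test accuracy:", stump_accuracy(test, chosen))
```

The stump selects the ruler because it scores best on the training set, then falls to chance once the artifact disappears, while the noisier genuine feature would have held up.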
Large language models used in scientific contexts, whether to summarize papers, suggest hypotheses, or write code, produce hallucinations: fluent, confident, factually wrong outputs. In a 2023 evaluation by researchers at the University of Ottawa, GPT-3.5 and GPT-4 were asked to summarize 50 medical research papers. GPT-3.5 introduced fabricated citations or misattributed findings in roughly 47% of cases; GPT-4 reduced but did not eliminate the problem. The papers that were most confidently summarized were not reliably the ones most accurately summarized. A confident tone is not a reliable indicator of accuracy in language model outputs.
For scientific use this is a significant operational constraint. An AI assistant that generates plausible-sounding but fabricated references is not just unhelpful — it can poison a literature review or suggest experimental directions based on studies that do not exist. The mitigation is not to avoid LLMs but to use them with retrieval augmentation (grounding outputs in specific cited documents) and to verify all factual claims against primary sources.
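One small piece of that verification can be automated: checking that every identifier cited in generated text appears in a trusted reference list. The sketch below assumes citations are given as DOIs and uses a simplified pattern; the DOIs shown are made-up placeholders, and `unverified_citations` is a hypothetical helper, not part of any library.

```python
import re

# Simplified DOI pattern; real DOIs can contain characters this sketch excludes.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\]\);,]+")

def unverified_citations(text, known_dois):
    """Return DOIs cited in the text that are absent from the trusted set."""
    cited = {match.rstrip(".") for match in DOI_PATTERN.findall(text)}
    return sorted(cited - set(known_dois))

# Placeholder identifiers for illustration only.
trusted = {"10.1234/example.001"}
summary = (
    "Prior work (doi:10.1234/example.001) agrees with "
    "a later trial, doi:10.5555/made.up.999."
)
suspect = unverified_citations(summary, trusted)
print(suspect)
```

A check like this catches only identifiers that do not exist in the trusted set. It cannot confirm that a real paper actually says what the summary attributes to it, so reading the primary source remains necessary.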
You will be presented with brief descriptions of AI-assisted research outcomes that produced unexpected or concerning results. Work with your AI lab assistant to diagnose the failure mode in each scenario, explain the mechanism, and suggest what evaluation changes would have caught the problem earlier.
Aim for at least 3 exchanges. Push the assistant to be specific about mechanisms, not just labels.
In March 2023, a team led from the University of Rochester published a paper in Nature claiming that a near-room-temperature superconductor, a nitrogen-doped lutetium hydride said to carry electrical current without resistance at near-ambient conditions, had been synthesized. The finding was extraordinary; verified room-temperature superconductivity would represent one of the most consequential materials science discoveries in a century. Within months, multiple independent groups failed to replicate the core result. In November 2023 the paper was retracted at the request of most of its co-authors, who stated that it did not accurately reflect the provenance of the samples, the measurements performed, or the data-processing protocols applied. The episode is a cautionary parallel: when computational pipelines, AI or otherwise, sit between raw data and published conclusions, auditing that pipeline is as important as auditing the data itself.
Classical scientific reproducibility requires that another team, using the same materials and methods, can obtain the same result. AI-assisted science introduces new layers where reproducibility can fail: the random seeds used during training, the exact version of a library, the preprocessing order applied to data, and the hardware on which training ran can all produce different model weights — and therefore different predictions — even from nominally identical procedures.
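The role of the random seed is the easiest of these layers to demonstrate. In the sketch below, `train_toy_model` is a hypothetical stand-in for a training run whose resulting "weights" depend only on the seed.

```python
import random

def train_toy_model(seed, n_weights=5):
    """Stand-in for training: the 'weights' depend only on the random seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_weights)]

same_a = train_toy_model(seed=42)
same_b = train_toy_model(seed=42)
different = train_toy_model(seed=43)

print(same_a == same_b)     # same seed: identical weights
print(same_a == different)  # different seed: different weights
```

Fixing and reporting the seed removes one source of variation. It does not address library versions, preprocessing order, or hardware-level nondeterminism, which need to be documented separately.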
In 2019, the Papers With Code initiative began tracking ML reproducibility, and its findings were sobering: across major ML benchmarks, a substantial fraction of claimed state-of-the-art results could not be reproduced by independent teams using the reported hyperparameters, even when code was shared. In 2021, the Machine Learning Reproducibility Challenge (MLRC) systematically attempted to reproduce papers from NeurIPS, ICLR, and ICML; roughly 30% of attempted reproductions failed or produced substantially different quantitative results from those reported.
For scientific fields adopting AI tools, this matters beyond the ML community. A genomics result that depends on an unreproducible ML model is an unreproducible genomics result. The fix is not philosophical — it requires model cards (structured documentation of training data, architecture, hyperparameters, and evaluation conditions), frozen random seeds, containerized environments (e.g., Docker or Singularity), and deposition of trained model weights alongside data in supplementary materials.
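A model card does not need elaborate tooling. The sketch below records the items listed above in a small data structure and serializes them to JSON for deposit alongside the weights; the field names and every example value are hypothetical, and published model card templates typically include more fields than this.

```python
import json
import platform
import sys
from dataclasses import asdict, dataclass, field

@dataclass
class ModelCard:
    """Minimal record of what is needed to rerun and audit a model."""
    model_name: str
    training_data: str
    architecture: str
    hyperparameters: dict
    random_seed: int
    evaluation: dict
    # Environment details are captured automatically when the card is created.
    python_version: str = field(default_factory=lambda: sys.version.split()[0])
    os_platform: str = field(default_factory=platform.platform)

# All values below are placeholders for illustration.
card = ModelCard(
    model_name="example-resistance-classifier",
    training_data="hypothetical: 5,000 genomes, split by isolate lineage",
    architecture="gradient-boosted trees",
    hyperparameters={"n_trees": 200, "max_depth": 4},
    random_seed=1234,
    evaluation={"test_accuracy": 0.91, "test_set": "held out, used once"},
)

card_json = json.dumps(asdict(card), indent=2, sort_keys=True)
print(card_json)
```

Writing the card at training time, rather than reconstructing it for publication, keeps the recorded seed and environment consistent with the weights actually deposited.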
As of 2024, journals including Nature Methods, Cell Systems, and eLife have adopted reporting guidelines requiring authors using ML models to provide training code, model weights, and a description of the computational environment. These requirements are not yet universal, but they represent the direction of travel for rigorous AI-assisted science.
Traditional peer review can evaluate statistical methods because reviewers are trained in statistics and the methods are documented in textbooks. Deep learning models used in science are frequently not interpretable to reviewers — or even to the authors who trained them. A reviewer can check whether a t-test was applied correctly; checking whether a 40-million-parameter neural network is overfitting requires access to training curves, held-out test performance disaggregated by subgroup, and ideally the model weights themselves.
The REFORMS checklist (Reporting Standards for Machine Learning Based Science), proposed by a cross-disciplinary group in 2023, comprises 32 items covering topics such as data quality, data leakage, modeling decisions, and generalizability. At present, compliance with such checklists is voluntary at most journals. The consequence is a literature where AI-assisted results are published under the same review standards that were designed for conventional experiments, a mismatch that favors false positives.
There is, however, a constructive counterpart. AI is also beginning to assist peer review itself: tools like StatReviewer (used by several American Psychological Association journals as of 2023) automatically check submitted manuscripts for statistical reporting errors, missing confidence intervals, and sample size inconsistencies. The same pattern-recognition capability that creates failure modes in scientific AI can also be used to enforce standards.
A well-conducted AI-assisted scientific study in 2024 has several identifiable properties. The training and test sets are separated in a way that respects the natural units of the data (patients, not scans; organisms, not cells; geographic regions, not coordinates). The model's performance is reported disaggregated across meaningful subgroups — not just overall accuracy. The evaluation includes an honest discussion of the training distribution and where the model is likely to fail. Trained weights and preprocessing code are deposited in a public repository. And the paper distinguishes clearly between what the model predicts and what it explains.
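Disaggregated reporting is simple to compute once predictions are stored with a subgroup label. The records below are hypothetical (two collection sites of very different size), and `accuracy_by_subgroup` is an illustrative helper.

```python
from collections import defaultdict

def accuracy_by_subgroup(records):
    """Return overall accuracy and accuracy within each subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["subgroup"]] += 1
        correct[r["subgroup"]] += int(r["prediction"] == r["label"])
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group

# Hypothetical results: site_A dominates the test set, site_B is small.
records = (
    [{"subgroup": "site_A", "label": 1, "prediction": 1}] * 88
    + [{"subgroup": "site_A", "label": 1, "prediction": 0}] * 2
    + [{"subgroup": "site_B", "label": 1, "prediction": 1}] * 4
    + [{"subgroup": "site_B", "label": 1, "prediction": 0}] * 6
)
overall, per_group = accuracy_by_subgroup(records)
print(f"overall: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"{group}: {acc:.2f}")
```

In this constructed case the headline accuracy looks strong while the smaller site fails more often than it succeeds, which is exactly what a single overall number conceals.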
None of this is exotic. It is the application of existing scientific standards — rigor, transparency, honest acknowledgment of limitations — to a new class of computational tool. The challenge is institutional: creating incentives for publishing negative results and failed replications, building reviewer capacity in ML methods across science disciplines, and updating reporting requirements at a pace that keeps up with the technology.
You are a peer reviewer for a journal that has received a manuscript describing an AI model for predicting antibiotic resistance from bacterial genome sequences. The model reports 94% accuracy on its test set. Your AI lab assistant will play the role of the corresponding author. Your task: ask the hard questions a rigorous reviewer must ask before this paper can be accepted.
Focus your questions on: train/test separation, subgroup performance, model documentation (model card), reproducibility, and whether the paper distinguishes prediction from explanation. Aim for at least 3 exchanges.