Module 7 · Lesson 1

The Reproducibility Crisis and AI's Role

Science's self-correcting mechanism is failing — and AI is both symptom and potential cure.

How has AI accelerated a crisis in scientific reliability that was already decades in the making?

In the years leading up to 2012, Amgen scientist Glenn Begley and his team attempted to reproduce 53 landmark cancer biology studies that had formed the basis of major drug development programs. They could replicate the findings in only 6 of 53, an 11% success rate. The results, published in Nature in 2012, sent shockwaves through the biomedical community. Billions of dollars in drug development had been built on results that could not be confirmed.

The problem predated AI, but AI tools would soon make it far easier to generate statistically significant-looking results at scale — and far harder to distinguish genuine discovery from sophisticated artifact.

What Is the Reproducibility Crisis?

Reproducibility — the ability of an independent researcher using the same methods to obtain the same results — is considered a foundational requirement of scientific knowledge. When a result cannot be reproduced, it may indicate fraud, error, inadequate methodology, or genuine variability in complex systems.

A 2016 Nature survey of 1,576 researchers found that more than 70% had failed to reproduce another scientist's experiments, and more than 50% had failed to reproduce their own. The problem is not confined to biology: psychology, economics, neuroscience, and materials science have all documented systematic failures. The 2015 "Reproducibility Project: Psychology" led by Brian Nosek at the University of Virginia successfully replicated only 36 of 100 published psychological studies.

Root causes include publication bias (journals favor positive results), small sample sizes, undisclosed researcher degrees of freedom, inadequate statistical power, and insufficient sharing of raw data and code. AI tools introduced after 2015 have exacerbated several of these factors while offering ways to address others.

The Begley-Ellis Rule

Begley and Lee Ellis (2012) proposed that landmark studies be independently replicated before drug development proceeds on their basis. The same logic underlies pre-registration and registered reports: study designs are locked before data collection to prevent post-hoc hypothesis fitting.

How AI Tools Interact with Reproducibility

AI tools interact with the crisis in three distinct ways. First, machine learning models are themselves often irreproducible: results vary with random seeds, hardware differences (GPU floating-point behavior), software library versions, and undisclosed hyperparameter tuning. A 2019 analysis by Joelle Pineau's team at McGill found that most published deep learning papers in NeurIPS and ICML lacked enough detail to reproduce their claimed results.
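The seed sensitivity described above can be illustrated with a toy sketch. It assumes a made-up `noisy_accuracy` function that stands in for a full train-and-evaluate run; it is not a real model, and the accuracy level and test-set size are arbitrary.

```python
import random
import statistics

def noisy_accuracy(seed: int, n_test: int = 200, true_accuracy: float = 0.80) -> float:
    """Simulate one evaluation run: each test item is correct with probability true_accuracy."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the run
    correct = sum(rng.random() < true_accuracy for _ in range(n_test))
    return correct / n_test

# Same seed, same number: the run is reproducible.
assert noisy_accuracy(seed=0) == noisy_accuracy(seed=0)

# Different seeds, different numbers: report the spread, not the best seed.
scores = [noisy_accuracy(seed) for seed in range(10)]
print(f"min={min(scores):.3f} max={max(scores):.3f} "
      f"mean={statistics.mean(scores):.3f} sd={statistics.stdev(scores):.3f}")
```

Reporting the mean and standard deviation over several seeds, together with the seeds themselves, lets a reader tell a real improvement from seed-to-seed noise.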

Second, AI tools used in scientific analysis can introduce hidden biases: image-processing algorithms applied inconsistently, natural language processing tools with version-specific behaviors, or foundation models used for data extraction that hallucinate values. When these tools are not documented, a subsequent researcher cannot determine whether their different result reflects a scientific difference or a software difference.

Third — and more hopefully — AI tools can improve reproducibility through automated code generation, containerized computational environments (Docker, Singularity), and tools like Weights & Biases or MLflow that log every parameter of an experiment automatically.
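As a rough illustration of what automatic experiment logging captures, the sketch below builds a run manifest by hand using only the Python standard library. It does not use the Weights & Biases or MLflow APIs; the function names, parameter values, and the inline dataset are all hypothetical.

```python
import hashlib
import json
import platform
import sys

def build_manifest(params: dict, seed: int, data_bytes: bytes) -> dict:
    """Collect the facts a later reader needs to re-run this experiment."""
    return {
        "params": params,
        "seed": seed,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }

def manifest_id(manifest: dict) -> str:
    """Stable fingerprint: identical manifests always give the same ID."""
    canonical = json.dumps(manifest, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

manifest = build_manifest(
    params={"learning_rate": 0.001, "batch_size": 32},  # hypothetical values
    seed=42,
    data_bytes=b"patient_id,label\n1,0\n2,1\n",  # stand-in for a real dataset file
)
print(json.dumps(manifest, indent=2, sort_keys=True))
print("run id:", manifest_id(manifest))
```

Storing the manifest next to every reported result means that a changed seed, parameter, or dataset file shows up as a different run ID rather than going unnoticed.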

Reproducibility: Same data, same code → same result, regardless of who runs the analysis. Distinct from replicability (independent researchers collecting new data).

p-hacking: Conducting multiple statistical tests and selectively reporting those with p < 0.05, which inflates the false-positive rate. AI tools that automate analysis make this far easier to do, accidentally or deliberately.

HARKing: Hypothesizing After Results are Known, that is, presenting exploratory findings as if they were pre-specified confirmatory tests. LLM-based writing tools make it easy to reframe a narrative convincingly after the fact.
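To see why the multiple testing behind p-hacking inflates false positives, the sketch below computes the chance of at least one "significant" result when every null hypothesis is true, and checks it by simulation. It assumes the tests are independent, which real analyses of a single dataset often are not.

```python
import random

ALPHA = 0.05

def familywise_error_rate(n_tests: int, alpha: float = ALPHA) -> float:
    """Chance of at least one false positive among n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

def simulate(n_tests: int, n_studies: int = 20_000, seed: int = 1) -> float:
    """Monte Carlo check: under a true null hypothesis, p-values are uniform on [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < ALPHA for _ in range(n_tests)):
            hits += 1
    return hits / n_studies

for k in (1, 5, 20):
    print(f"{k:>2} tests: exact={familywise_error_rate(k):.3f} simulated={simulate(k):.3f}")
```

With 20 independent tests the exact rate is 1 - 0.95^20, or about 0.64: a researcher who tries 20 analyses on pure noise has roughly a two-in-three chance of finding something reportable.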

The 2023 Science AI Image Integrity Crisis

In early 2023, the journal Science and multiple publishers began deploying AI-based image analysis tools — including ImageTwin and Proofig — to screen submitted manuscripts for duplicated or manipulated figures. Within months, thousands of papers were flagged, and several high-profile retractions followed at journals including Anesthesiology, Molecular Biology of the Cell, and Stem Cell Reports.

The irony was sharp: AI-generated figures — produced by diffusion models that could render plausible-looking Western blots, microscopy images, and flow cytometry plots — were simultaneously harder to detect using traditional human review and easier to detect using AI pattern matching. The episode established a clear principle: the same class of tool that enables scientific fraud at scale is also the most capable instrument for detecting it.

Key Insight

The reproducibility crisis is not primarily a story about dishonest scientists. Most failures arise from incentive structures that reward novelty over verification, methodological norms that were adequate for simpler analyses, and software complexity that makes exact reproduction nearly impossible without deliberate documentation. AI tools amplify all three factors — but also provide new levers to address each.

Lesson 1 Quiz

The Reproducibility Crisis and AI's Role · 4 questions

In Glenn Begley's 2012 Nature study attempting to reproduce 53 landmark cancer biology papers, approximately what fraction of results were successfully replicated?

Correct. Only 6 of 53 studies — roughly 11% — could be reproduced, revealing that billions in drug development rested on unverifiable findings.

Not quite. The success rate was far lower: only 6 of 53 (about 11%). The finding was alarming precisely because it was so low.

What does "HARKing" stand for, and why do AI writing tools make it a greater concern?

Correct. HARKing involves presenting post-hoc hypotheses as if pre-planned. AI writing assistance makes it easier to produce fluent, convincing narrative framing that obscures this practice.

HARKing = Hypothesizing After Results are Known. AI writing tools increase risk by making it easy to rewrite methods and introductions to match whatever result was found.

The 2015 Reproducibility Project in Psychology, led by Brian Nosek, found that approximately what percentage of 100 published psychological studies could be successfully replicated?

Correct. Only 36 of 100 studies replicated successfully, demonstrating that the reproducibility problem extends well beyond biomedical science.

The actual replication rate was about 36% — far lower than most psychologists had assumed, which made the study itself a landmark finding about scientific practice.

Which of the following best describes how AI image-analysis tools like ImageTwin relate to the problem of AI-generated fraudulent figures in scientific papers?

Correct. The 2023 wave of retractions demonstrated this dynamic clearly: AI pattern-matching could detect statistical regularities in generated images that human reviewers missed.

The key insight from 2023 is that AI detection tools flagged thousands of papers — showing that AI-generated fraud is detectable by AI, creating an ongoing adversarial dynamic rather than an unsolvable problem.

Lab 1: Diagnosing Reproducibility Failures

Discuss real cases of irreproducibility and how AI tools contributed to them or could have prevented them

Lab Scenario

You are a research integrity consultant reviewing a disputed machine learning study in biomedicine. The original paper claimed 94% accuracy on a cancer detection task; three independent groups failed to exceed 71%. The original authors used an unpublished preprocessing pipeline, a specific GPU model, and a private dataset split. Your AI assistant specializes in reproducibility analysis.

Starter prompts: "What are the most likely causes of the accuracy gap between the original paper and replications?" / "How should researchers document ML preprocessing pipelines to ensure reproducibility?" / "What role did the private dataset split likely play in the discrepancy?"
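One way to make a dataset split publishable, as opposed to the private split in the scenario, is to derive it from a hash of each sample's identifier. The sketch below is a minimal illustration under stated assumptions: the function name `assign_split`, the salt string, and the patient identifiers are hypothetical.

```python
import hashlib

def assign_split(sample_id: str, test_fraction: float = 0.2, salt: str = "study-v1") -> str:
    """Assign a sample to 'train' or 'test' from a hash of its ID, not from a random shuffle."""
    digest = hashlib.sha256(f"{salt}:{sample_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # map the first 32 bits to [0, 1)
    return "test" if bucket < test_fraction else "train"

sample_ids = [f"patient-{i:04d}" for i in range(1000)]  # hypothetical identifiers
split = {sid: assign_split(sid) for sid in sample_ids}

n_test = sum(1 for s in split.values() if s == "test")
print(f"{n_test} of {len(sample_ids)} samples assigned to test")
```

Because the assignment depends only on the identifier and the published salt, any group can regenerate the same split without access to the original authors' machine. Hashing a patient-level ID also keeps all of a patient's records on one side of the split.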

Reproducibility Analysis Assistant

Welcome to the reproducibility lab. I'm here to help you analyze failures of scientific reproducibility, particularly where AI and machine learning methods are involved. Describe a case, ask about root causes, or explore what documentation practices could have prevented the failure. What would you like to examine?

Module 7 · Lesson 2

Data Fabrication, Hallucination, and the Fraud Spectrum

From deliberate misconduct to honest error — AI blurs boundaries that scientific ethics once treated as clear.

When an AI tool inserts a fabricated citation or generates a plausible-looking dataset, where does error end and misconduct begin?

In February 2023, a preprint circulated on bioRxiv describing a novel protein-folding mechanism with apparently extensive literature support. When researchers at the University of Cambridge attempted to locate 23 of the cited papers, none existed. The citations were grammatically perfect, plausibly titled, attributed to real journals with realistic volume and page numbers — and entirely fabricated by a large language model the authors had used to draft the literature review section without adequate verification.

The authors did not intend to deceive. They had trusted the AI's output without systematic cross-referencing. The paper was withdrawn, but the episode crystallized a new category of scientific misconduct that existing frameworks were not designed to address: unintentional fabrication enabled by automated tools.

The Traditional Fraud Taxonomy

Scientific misconduct has historically been classified into three categories by the US Office of Research Integrity (ORI): fabrication (making up data or results), falsification (manipulating data, equipment, or processes), and plagiarism (appropriating others' ideas without credit). These three — collectively "FFP" — were designed to capture intentional, knowing violations.

The ORI definition requires that misconduct be a "significant departure from accepted practices" and that it be "committed intentionally, knowingly, or recklessly." AI-generated hallucinations challenge each element. A researcher who pastes AI-generated text without checking citations may have been reckless, but were they knowing? And if a researcher uses an AI tool to generate synthetic training data that subtly diverges from the claimed experimental distribution, is that fabrication when they did not understand the tool's behavior?

The Diederik Stapel Benchmark

Dutch social psychologist Diederik Stapel fabricated data in at least 55 published papers between 2004 and 2011, making up entire datasets that confirmed elegant hypotheses. His 2011 exposure became a benchmark case for deliberate fraud. The contrast with AI-enabled error is instructive: Stapel knew exactly what he was doing. The question now confronting research ethics boards is how to handle cases where fabrication is real but intent is ambiguous.

LLM Hallucination in Scientific Contexts

Large language models produce hallucinated content — fluent, confident-sounding text that is factually incorrect — at rates that depend on the task, the model, and the verification infrastructure around it. In scientific writing, hallucination concentrates in several high-risk areas: citations and bibliography generation, numerical values extracted from source texts, gene names and protein identifiers, drug dosages and clinical trial outcomes, and statistical results from papers not in the model's training data.

A 2023 study by Alkaissi and McFarlane in Cureus systematically tested ChatGPT by asking it to generate bibliographies on medical topics. In the test, 69% of generated references were fabricated and 46% of real references contained inaccurate details. The authors concluded that LLM-generated bibliographies should be treated as requiring full manual verification against primary sources.

The practical implication for scientific integrity is that researchers who use LLM tools without disclosure and verification protocols are introducing a new category of error into the literature that existing editorial processes were not designed to catch. Many journals now require explicit disclosure of AI tool use in submitted manuscripts.
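A verification protocol of the kind described above can start with a mechanical triage step. The sketch below assumes a hypothetical local set of DOIs that a human has already confirmed; the `Reference` class, the `triage` function, and the placeholder DOIs are all invented for illustration and do not query any real bibliographic service.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reference:
    title: str
    doi: str

def normalize_doi(doi: str) -> str:
    """Lowercase and strip common URL prefixes so equivalent DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def triage(references, verified_dois):
    """Split references into those found in a trusted index and those needing manual checks."""
    index = {normalize_doi(d) for d in verified_dois}
    found, unverified = [], []
    for ref in references:
        (found if normalize_doi(ref.doi) in index else unverified).append(ref)
    return found, unverified

# Hypothetical data: made-up placeholder DOIs, not real papers.
verified = {"10.5555/example.001", "10.5555/example.002"}
draft_bibliography = [
    Reference("A paper the author has actually read", "https://doi.org/10.5555/EXAMPLE.001"),
    Reference("A plausible title suggested by an LLM", "10.5555/example.999"),
]
found, unverified = triage(draft_bibliography, verified)
for ref in unverified:
    print("VERIFY MANUALLY:", ref.title, ref.doi)
```

A match only shows that the identifier exists in the trusted index. Whether the cited paper actually supports the claim attached to it still requires a human reading the primary source.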

Hallucination: LLM output that is grammatically fluent and contextually plausible but factually false. In scientific contexts, hallucinated citations, data values, and experimental results are the highest-risk categories.

Fabrication (ORI): Making up data or results and recording or reporting them. The ORI definition requires intentional, knowing, or reckless behavior, a requirement increasingly difficult to apply to AI-assisted research.

Synthetic Data Laundering: The practice of training AI models on AI-generated synthetic data and then reporting results as if derived from experimental observation, obscuring the lack of ground-truth validation.

The Image Manipulation Escalation: Acuna et al.

A 2018 analysis by Daniel Acuna and colleagues at Syracuse University applied computer vision algorithms to 760,000 biomedical figures from published papers. The system identified potential image duplications — cases where the same micrograph, blot, or flow cytometry plot appeared in multiple figures, sometimes in multiple papers — at a scale impossible for human review. The study estimated that approximately 3.8% of papers in their sample contained potentially problematic image duplications.
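To give a feel for how duplicated images can be found automatically, the sketch below uses average hashing, a simple generic perceptual-hash technique. It is not the method Acuna and colleagues used, and the tiny random grids stand in for downscaled grayscale figure panels.

```python
import random

def average_hash(pixels):
    """Return a bit tuple: 1 where a pixel is brighter than the image mean, else 0."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)

def hamming_distance(hash_a, hash_b):
    """Count positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

def random_image(seed, size=8):
    """Stand-in for a downscaled grayscale figure panel (values 0-199)."""
    rng = random.Random(seed)
    return [[rng.randrange(200) for _ in range(size)] for _ in range(size)]

original = random_image(seed=1)
brightened = [[value + 20 for value in row] for row in original]  # same panel, re-exposed
unrelated = random_image(seed=2)

print("original vs brightened:", hamming_distance(average_hash(original), average_hash(brightened)))
print("original vs unrelated: ", hamming_distance(average_hash(original), average_hash(unrelated)))
```

A uniform brightness change leaves the hash unchanged, so the re-exposed copy is still matched to its source, while an unrelated panel differs in many bits. Rotations, crops, and splices defeat a hash this simple, which is one reason production tools rely on more robust features.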

By 2023, AI image generation introduced an inverse problem: instead of duplicating real images, researchers could generate novel but fabricated images. Diffusion models trained on biological image datasets could produce Western blots, histology slides, and fluorescence microscopy images that were visually indistinguishable from real experimental results to human reviewers, but detectable by AI systems looking for statistical artifacts in pixel distributions.

Regulatory Response

In 2023 the National Institutes of Health (NIH) issued NOT-OD-23-149, explicitly stating that AI-generated text in grant applications must be disclosed and that existing policies on fabrication apply regardless of whether a human or AI tool generated the content. The European Research Council followed with similar guidance in early 2024. Both frameworks hold the researcher — not the tool — accountable for the accuracy of submitted content.

Lesson 2 Quiz

Data Fabrication, Hallucination, and the Fraud Spectrum · 4 questions

The US Office of Research Integrity (ORI) defines the three categories of scientific misconduct as fabrication, falsification, and plagiarism (FFP). What key element of the ORI definition makes it difficult to apply to AI-generated hallucinations?

Correct. The intent requirement creates a genuine gray zone: a researcher who unknowingly included AI-hallucinated citations may have been reckless without being deliberately deceptive.

The key difficulty is the intent requirement. The ORI definition requires intentional, knowing, or reckless behavior — which becomes complicated when AI tools introduce errors researchers did not know about and could not easily detect.

In the Alkaissi and McFarlane (2023) study published in Cureus, approximately what percentage of ChatGPT-generated medical bibliography references were found to be fabricated?

Correct. 69% of generated references were entirely fabricated, and 46% of references that were real contained inaccurate details — establishing that LLM bibliography generation requires systematic verification.

The study found 69% fabrication — a striking result that led the authors to recommend treating all LLM-generated citations as requiring full manual verification against primary sources.

What was the significance of the Diederik Stapel case (2011) in the context of AI-enabled scientific misconduct?

Correct. Stapel knew exactly what he was doing. His case anchors the "deliberate fraud" end of a spectrum, making visible how different AI-enabled unintentional fabrication is from traditional misconduct frameworks.

Stapel did not use AI tools — he manually fabricated data across 55 papers. His case is important as a benchmark of clear intent, which contrasts sharply with the ambiguous intent cases that AI-generated errors create.

The NIH guidance NOT-OD-23-149 (2023) regarding AI-generated content in grant applications established which key principle?

Correct. The NIH made clear that tool use does not transfer accountability: the researcher is responsible for the accuracy of all submitted content, disclosed or not.

The NIH guidance held researchers accountable — it required disclosure of AI use and affirmed that fabrication policies apply regardless of whether a human or AI tool generated the false content.

Lab 2: Navigating AI-Assisted Misconduct

Explore the ethics and detection of AI-generated hallucinations in scientific manuscripts

Lab Scenario

You are a journal editor reviewing a submitted manuscript that has been flagged by your editorial AI tool: 8 of 42 citations in the bibliography cannot be located in any database, and two appear to be plausible-sounding fictions. The corresponding author claims the lab used an AI writing assistant and "did not realize it could invent citations." You must decide how to respond and what policies to recommend.

Starter prompts: "Should hallucinated citations be treated as fabrication under ORI guidelines?" / "What verification protocol should journals require for AI-assisted manuscripts?" / "How do I distinguish reckless from merely negligent use of AI writing tools?"

Scientific Integrity Ethics Assistant

I'm your scientific integrity ethics consultant. I can help you think through cases where AI tools have introduced fabricated content into scientific manuscripts — exploring the ethics, the applicable policies, and what editorial responses are appropriate. What aspect of this case would you like to analyze first?

Module 7 · Lesson 3

Pre-Registration, Open Science, and AI Transparency

Structural solutions to the integrity crisis — and how AI tools are being integrated into them.

How are open science infrastructure and AI disclosure norms evolving to make AI-assisted research verifiable?

In 2013, the Center for Open Science launched the Open Science Framework (OSF) as a free, public infrastructure for pre-registering study designs before data collection begins. By 2024, OSF hosted over 250,000 pre-registered studies across disciplines from psychology to genomics to economics. The Registered Reports publishing format, adopted by over 300 journals, takes this further: peer review happens before data collection, and journals commit to publish regardless of outcome. The format eliminates publication bias at the source.

The arrival of AI tools created new challenges for this infrastructure. When a researcher uses an AI to propose hypotheses, optimize study designs, or suggest analysis pipelines — all now possible with tools like AlphaFold, GPT-4, and custom domain-specific LLMs — what must be disclosed in a pre-registration? The frameworks were not written for an era in which a machine might be the effective methodologist.

The Architecture of Open Science

Open science encompasses a cluster of practices designed to make research processes transparent and verifiable: pre-registration (documenting hypotheses and analysis plans before data collection), open data (sharing raw datasets in accessible repositories like Zenodo, Dryad, or NCBI), open code (publishing analysis scripts and computational environments), and open access (making publications available without paywalls).

The empirical case for these practices is strong. A 2018 meta-analysis by Schroeder and colleagues found that pre-registered studies reported smaller effect sizes on average than non-pre-registered studies in the same journals — consistent with the hypothesis that non-pre-registered studies are more susceptible to p-hacking and effect-size inflation. Pre-registration does not prevent poor science, but it constrains the most common post-hoc manipulations.

Containerization tools — Docker images, Binder notebooks, Code Ocean capsules — allow researchers to package entire computational environments so that any reader with internet access can re-run an analysis and obtain the same result. As of 2023, eLife, PLOS Computational Biology, and the Journal of Statistical Software require or strongly encourage executable code submission alongside manuscripts.
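A container image fixes the whole operating environment. A lighter complementary step, sketched below, is to record exact package versions from inside a running Python environment. This is an illustration using the standard-library `importlib.metadata` module, not a substitute for a Docker image; the function names are hypothetical.

```python
import sys
from importlib import metadata

def pinned_requirements():
    """List every installed distribution as an exact 'name==version' pin."""
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        version = meta.get("Version") if meta is not None else None
        if name and version:  # skip broken installs with incomplete metadata
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)

def environment_snapshot():
    """Combine the interpreter version with the package pins."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "packages": pinned_requirements(),
    }

snapshot = environment_snapshot()
print("python", snapshot["python"])
for pin in snapshot["packages"][:5]:  # preview only; write the full list to a file in practice
    print(pin)
```

Saving this snapshot with the analysis code gives a later reader the version list needed to rebuild the environment, whether by hand or inside a container.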

The FAIR Principles (2016)

The FAIR guidelines for scientific data management — Findable, Accessible, Interoperable, Reusable — were published in Scientific Data in 2016 by Wilkinson and colleagues. They have become a global standard referenced by funders including the NIH, Wellcome Trust, and European Commission. AI tools that interact with scientific data must increasingly demonstrate FAIR compliance for the outputs they generate.

AI Transparency Requirements: An Evolving Landscape

As of 2024, major scientific publishers have adopted AI disclosure policies that differ substantially in scope. Nature requires that large language models (LLMs) not be listed as authors and that their use be disclosed in Methods sections. Springer Nature's policy requires disclosure of any AI-assisted text generation but stops short of mandating specific technical detail about model versions or prompts. Science's policy is stricter: AI-generated text is prohibited in papers unless specifically approved, and AI tools may not be used to create or manipulate research images.

The machine learning community has developed more technically specific transparency standards. The Model Cards framework, introduced by Mitchell et al. (2019) at Google, requires documentation of a model's intended use, training data, evaluation results, and known limitations. Datasheets for Datasets (Gebru et al., 2018) applies similar principles to training data. The NeurIPS 2020 checklist — now standard at major ML venues — requires authors to explicitly confirm or deny that their code is available, that error bars are reported, and that training compute is disclosed.
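The four documentation areas named above (intended use, training data, evaluation results, known limitations) map naturally onto a small data structure. The sketch below is one possible layout, not the official Model Cards schema; the class, its methods, and all example values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    model_name: str
    intended_use: str
    training_data: str
    evaluation_results: dict = field(default_factory=dict)
    known_limitations: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of sections left empty, so gaps are visible before release."""
        return [name for name, value in vars(self).items() if not value]

    def render(self):
        """Format the card as plain text for a README or supplement."""
        lines = [f"Model card: {self.model_name}",
                 f"Intended use: {self.intended_use}",
                 f"Training data: {self.training_data}",
                 "Evaluation results:"]
        lines += [f"  {metric}: {value}" for metric, value in self.evaluation_results.items()]
        lines.append("Known limitations:")
        lines += [f"  - {item}" for item in self.known_limitations]
        return "\n".join(lines)

# All values below are hypothetical placeholders.
card = ModelCard(
    model_name="example-classifier",
    intended_use="Research triage only; not for clinical decisions.",
    training_data="",  # deliberately left empty to show the completeness check
    evaluation_results={"accuracy": 0.71},
    known_limitations=["Evaluated on a single site."],
)
print(card.render())
print("Missing:", card.missing_sections())
```

Keeping the card as structured data rather than free prose makes it possible to check for empty sections automatically, in the same spirit as a conference submission checklist.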

Pre-registration: Documenting hypotheses, sample sizes, and analysis plans in a time-stamped, publicly accessible repository before data collection begins. Prevents post-hoc hypothesis fitting and selective reporting.

Model Card: A structured document accompanying a published ML model that describes its intended uses, training data provenance, evaluation methodology, and known failure modes or biases.

Registered Report: A publication format in which peer review and editorial commitment to publish occur before data collection, eliminating publication bias at the structural level.

The ML Reproducibility Challenge

Starting in 2019, the machine learning community launched the ML Reproducibility Challenge — an annual event in which researchers attempt to reproduce results from papers accepted at top conferences including NeurIPS, ICML, and ICLR. The challenge produces structured reproduction reports that document successes, partial successes, and failures.

In the 2021 challenge, teams at universities in 12 countries attempted to reproduce 26 papers. Results were mixed: approximately 55% of core claims could be reproduced when the authors' code was available, but the rate dropped to around 30% when it was unavailable and replication required reimplementation. The most common obstacles were undisclosed random seeds, undocumented hyperparameter tuning, version-specific library behavior, and dataset preprocessing steps described in prose but not code.

The challenge demonstrated that prose descriptions of ML methods — even in top-tier venues — are systematically inadequate for reproducibility. Code must be executable, environments must be pinned, and random states must be fixed and documented. These requirements are now explicit in the submission checklists of major ML conferences.

The Positive Case

Open science and AI transparency infrastructure represent a genuine improvement in the structural conditions for scientific integrity. Pre-registered studies, executable code requirements, FAIR data standards, and model cards together create audit trails that simply did not exist a decade ago. The challenge is adoption: these tools require time, training, and institutional incentives that many research environments still lack.

Lesson 3 Quiz

Pre-Registration, Open Science, and AI Transparency · 4 questions

What is the key feature of the "Registered Reports" publication format that eliminates publication bias at the structural level?

Correct. By committing to publish before seeing results, journals remove the incentive structure that drives researchers to p-hack or selectively report findings.

The defining feature is that acceptance is decided before data collection — the journal commits to publish regardless of outcome, removing the publication bias that drives selective reporting and p-hacking.

The FAIR principles (2016) for scientific data management stand for which four properties?

Correct. FAIR — Findable, Accessible, Interoperable, Reusable — has become the global standard for scientific data management referenced by NIH, the European Commission, and major funders worldwide.

FAIR stands for Findable, Accessible, Interoperable, Reusable — a framework published in Scientific Data in 2016 that has become a global standard for scientific data management.

In the 2021 ML Reproducibility Challenge, what was the approximate reproduction rate for core claims from papers where the authors' original code was available versus unavailable?

Correct. The substantial gap between code-available and code-unavailable conditions demonstrated that prose methods descriptions in even top ML venues are insufficient for reproducibility.

The challenge found ~55% reproduction with code available versus ~30% without — a meaningful gap that confirmed prose descriptions are insufficient and executable code is essential for reproducibility.

Which of the following best characterizes the difference between Science magazine's AI policy and Nature's AI policy as of 2024?

Correct. Science's policy is more restrictive on text — prohibiting AI-generated text unless specifically approved — while Nature requires disclosure but does not broadly prohibit AI-assisted writing.

The policies differ in strictness: Science prohibits AI-generated text unless specifically approved and bans AI image manipulation, while Nature requires disclosure of AI-assisted text generation but does not categorically prohibit it.

Lab 3: Designing an Open Science Protocol

Build a transparency and pre-registration plan for an AI-assisted research project

Lab Scenario

You are a graduate student designing an ML-assisted study that will use a fine-tuned language model to extract clinical outcomes from electronic health records and test whether AI-extracted data matches manual chart review. Your advisor has asked you to draft a complete open science protocol before you begin. Your AI assistant specializes in open science infrastructure and transparency requirements.

Starter prompts: "What should I include in my OSF pre-registration for an NLP-assisted data extraction study?" / "How do I write a model card for a fine-tuned LLM used in clinical data extraction?" / "Which data repositories are appropriate for sharing de-identified EHR extraction outputs?"

Open Science Protocol Assistant

I'm your open science and transparency assistant. I can help you design pre-registration protocols, model cards, data management plans, and disclosure frameworks for AI-assisted research. Whether you're working with the OSF, drafting FAIR-compliant data sharing plans, or figuring out what to disclose when you use an LLM in your pipeline — I'm here to help. Where would you like to start?

Module 7 · Lesson 4

Peer Review Under Pressure and the Future of Scientific Trust

Peer review was never perfect — but AI is stressing it in ways that require structural responses.

How should the institutions that certify scientific knowledge adapt when both the generation and evaluation of evidence can be automated?

Between 2019 and 2024, investigative work by researcher Elisabeth Bik, the watchdog blog Retraction Watch, and automated tools at publishing houses revealed a sprawling industry of paper mills — organizations in China, Iran, Russia, and elsewhere that manufactured fake or manipulated scientific papers for sale to researchers needing to meet publication quotas. By 2023, estimates suggested tens of thousands of paper mill articles had entered the scientific literature.

The introduction of LLMs in 2022–2023 transformed the economics of paper mills. Previously, fabricating a convincing paper required domain expertise. Now, a plausible-looking manuscript in any field could be generated in hours. The bottleneck shifted to peer review — and peer review was already struggling.

The Structural Weaknesses of Peer Review

Traditional peer review relies on volunteer expert labor that is unpaid, unacknowledged in most career evaluation systems, and increasingly scarce as the volume of submissions grows. A 2023 analysis by Kyle Siler and colleagues found that the number of journal submissions had grown 50% over a decade, while the pool of qualified reviewers had not. Reviewer fatigue and declining review quality are documented phenomena: average review length and quality scores from editors have declined at major journals.

AI tools are being tested as both adversaries and assistants in this context. On the adversary side, researchers have documented cases where overwhelmed, unpaid reviewers have used LLMs to generate peer review reports, some of which contained the characteristic hedging and disclaimer language of AI-generated text. Nature and other journals have issued guidance explicitly prohibiting the use of AI tools to generate peer review content.

On the assistant side, publishers including Springer Nature have deployed AI tools (including the research integrity tool "Sniff" and statistical checking software "StatReviewer") that pre-screen manuscripts for statistical anomalies, duplicate image regions, and citation inconsistencies before human review begins. These tools do not replace expert judgment but reduce the probability that obvious problems reach publication.

The Retraction Watch Database

Founded by Ivan Oransky and Adam Marcus in 2010, Retraction Watch maintains the largest public database of scientific retractions — over 40,000 entries as of 2024. Analysis of the database reveals that retraction rates have increased dramatically since 2010, and that AI-related retractions (for fabricated images, hallucinated citations, and undisclosed AI text generation) appeared as a distinct category in 2023 for the first time.

AI-Assisted Peer Review: Promise and Risk

The potential benefits of AI assistance in peer review are substantial. Statistical checking tools can detect errors in reported p-values, confidence intervals, and sample sizes that human reviewers routinely miss. A 2016 study by Nuijten and colleagues found that approximately 50% of psychology papers published in top journals contained at least one p-value inconsistent with its own reported test statistic and degrees of freedom, an error that automated checking can catch. The statcheck R package, which extracts APA-formatted test results and recomputes their p-values, is now used by several major journals.
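The kind of internal-consistency check that statcheck automates can be sketched in a few lines. The version below is a simplified illustration in Python, not the statcheck package itself: it handles only z-tests, uses a fixed tolerance instead of statcheck's rounding rules, and runs on a hypothetical results sentence.

```python
import math
import re

# Matches APA-style z-test reports such as "z = 2.10, p = .036".
PATTERN = re.compile(r"z\s*=\s*(-?\d*\.?\d+)\s*,\s*p\s*=\s*(\d*\.\d+)")

def two_sided_p(z: float) -> float:
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def check_report(text: str, tolerance: float = 0.005):
    """Yield (z, reported p, recomputed p, consistent?) for each z-test found in text."""
    for match in PATTERN.finditer(text):
        z, reported = float(match.group(1)), float(match.group(2))
        recomputed = two_sided_p(z)
        yield z, reported, recomputed, abs(recomputed - reported) <= tolerance

# Hypothetical results section.
sample = "Group A differed from B, z = 1.96, p = .050. Group C did not, z = 1.20, p = .030."
for z, reported, recomputed, ok in check_report(sample):
    status = "consistent" if ok else "INCONSISTENT"
    print(f"z = {z:.2f}: reported p = {reported:.3f}, recomputed p = {recomputed:.3f} -> {status}")
```

The first result passes because z = 1.96 does correspond to p of about .05. The second is flagged because z = 1.20 corresponds to p of about .23, not .03. A flag of this kind signals a reporting inconsistency to investigate, not proof of misconduct.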

Methodological checking tools can verify that reported methods are consistent with claimed sample sizes, that the statistical tests applied are appropriate for the data type described, and that effect size calculations are correct. These are mechanical tasks that AI handles reliably and that human reviewers often skip under time pressure.

The risk is more subtle. When AI pre-screening becomes a quality filter that manuscripts must pass before human review, researchers come under pressure to optimize for passing the screen rather than for scientific quality, a form of Goodhart's Law applied to research integrity. If AI tools look for specific patterns associated with fraud, researchers (or paper mills) will learn to avoid those patterns while leaving the underlying problems in place.
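A toy Python sketch can make this dynamic concrete. It assumes a deliberately naive detector that flags manuscripts containing known giveaway phrases. The phrase list, the three-document corpus, and every function name are hypothetical and chosen only for illustration. Real screening tools are far more sophisticated, but the adaptation logic is the same.

```python
import re

# Hypothetical list of surface signals the screening tool looks for.
KNOWN_SIGNALS = ["as an ai language model", "regenerate response"]


def is_flagged(text: str) -> bool:
    """Naive detector: flag any text containing a known signal."""
    lowered = text.lower()
    return any(signal in lowered for signal in KNOWN_SIGNALS)


def evade(text: str) -> str:
    """Paper-mill adaptation: strip the known signals, change nothing else."""
    for signal in KNOWN_SIGNALS:
        text = re.sub(re.escape(signal), "", text, flags=re.IGNORECASE)
    return text


def recall(corpus: list[dict]) -> float:
    """Share of fabricated manuscripts that the detector catches."""
    fabricated = [doc for doc in corpus if doc["fabricated"]]
    caught = [doc for doc in fabricated if is_flagged(doc["text"])]
    return len(caught) / len(fabricated)


corpus = [
    {"text": "As an AI language model, I cannot access the data.", "fabricated": True},
    {"text": "The effect was robust. Regenerate response", "fabricated": True},
    {"text": "We preregistered the analysis and share all code.", "fabricated": False},
]
adapted = [
    {**doc, "text": evade(doc["text"])} if doc["fabricated"] else doc
    for doc in corpus
]

print(recall(corpus))   # 1.0: every fabricated manuscript is caught
print(recall(adapted))  # 0.0: same fabrications, none caught
```

The number of fabricated manuscripts is identical in both runs. Only the measured signal changed, which is the sense in which the detector's output stops being a good measure once it becomes a target.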

Paper Mill: A commercial organization that manufactures fraudulent or manipulated scientific manuscripts for sale to researchers who need publications to meet career or institutional quotas. LLMs have dramatically reduced the cost and expertise required to operate them.

statcheck: An automated tool that extracts statistical results from papers and checks them for internal consistency, verifying that reported test statistics, degrees of freedom, and p-values are mutually consistent.

Goodhart's Law (in research context): When a measure becomes a target, it ceases to be a good measure. In peer review: if fraud-detection AI looks for specific signals, paper mills optimize to avoid those signals while the underlying misconduct persists.

Toward Resilient Scientific Infrastructure

The long-term response to AI's disruption of scientific integrity will require changes at multiple levels. At the individual level: researchers must develop verification habits for AI-generated content — treating LLM outputs as drafts requiring fact-checking, not authoritative sources. At the institutional level: universities must reform incentive structures that reward publication quantity over quality, removing the demand side of the paper mill market. At the journal level: executable code requirements, mandatory data deposition, and AI disclosure mandates must become uniform standards rather than progressive exceptions.

Several structural innovations show promise. Post-publication peer review platforms, including PubPeer, where researchers can annotate published papers with concerns, have driven hundreds of retractions since 2012 and operate as a distributed quality control mechanism that supplements traditional pre-publication review. Overlay journals, which review and certify preprints already posted on public servers rather than hosting the papers themselves, separate the function of quality certification from the function of dissemination.

The most durable solution may be cultural: a shift in how scientific communities define rigor. The growing norm that a computational result is not scientific knowledge until its code and data are publicly executable — not just described in prose — represents a genuine evolution in what it means to make a scientific claim. AI tools, despite the integrity challenges they introduce, are also the primary drivers of this evolution: the complexity of AI-based analyses made it obvious, faster than any other development, that prose description of methods is fundamentally inadequate for reproducibility.
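What "executable, not just described" can mean in practice is sketched below in Python. The example is a minimal illustration, not a standard: the analysis is a placeholder bootstrap mean, and the manifest fields and function names are assumptions. The point is that the seed, interpreter version, and a fingerprint of the input data are recorded next to the result instead of being summarized in prose.

```python
import hashlib
import json
import platform
import random
import sys


def run_analysis(seed: int, data: list[float]) -> float:
    """Placeholder analysis: mean of one bootstrap resample of the data."""
    rng = random.Random(seed)  # local generator, so the seed fully fixes the run
    resample = [rng.choice(data) for _ in data]
    return sum(resample) / len(resample)


def build_manifest(seed: int, data: list[float], result: float) -> dict:
    """Record what a reader needs in order to rerun and compare."""
    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "data_sha256": hashlib.sha256(json.dumps(data).encode()).hexdigest(),
        "result": result,
    }


data = [2.1, 3.4, 1.9, 4.2, 2.8]
result = run_analysis(seed=42, data=data)
print(json.dumps(build_manifest(42, data, result), indent=2))

# Rerunning with the same seed and data reproduces the result exactly.
assert run_analysis(seed=42, data=data) == result
```

A real project would also pin library versions and archive the code and data themselves. The manifest only makes it possible to confirm that a rerun used the same inputs.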

The Long View

Science has survived previous integrity crises — from the Piltdown Man hoax (1912–1953) to the Schön affair in physics (2002) to Stapel in psychology (2011). Each crisis accelerated institutional reforms that improved the self-correcting mechanism science depends on. The current AI-driven crisis is larger in scale but not categorically different: the response will be structural, slow, and imperfect — but the direction is toward more transparency, more automation of mechanical verification, and higher standards for what counts as a reproducible scientific claim.

Lesson 4 Quiz

Peer Review Under Pressure and the Future of Scientific Trust · 4 questions

What role did LLMs play in transforming the paper mill industry between 2022 and 2024, compared to earlier paper mill operations?

Correct. The key economic change was the removal of the expertise requirement: a plausible-looking manuscript in any scientific field could now be generated quickly without domain knowledge.

LLMs changed the economics by removing the expertise bottleneck. Previously, fabricating convincing papers required domain knowledge — LLMs made it possible to generate plausible-looking manuscripts in any field in hours, dramatically lowering the cost of paper mill operations.

The Nuijten et al. (2016) study using the statcheck tool found which troubling pattern in psychology journals?

Correct. ~50% — a finding that justified the deployment of automated statistical checking in peer review, as these mechanical errors were clearly slipping through human review at scale.

The study found ~50% of psychology papers in top journals contained at least one internally inconsistent statistical value — not necessarily fraud, but mechanical errors that should have been caught by reviewers or authors.

How does Goodhart's Law apply as a risk in AI-assisted peer review screening?

Correct. As AI detection tools become known, the actors they're designed to catch will adapt their behavior to avoid the targeted signals — an adversarial dynamic that requires continuous updating of detection methods.

Goodhart's Law applies directly: when AI fraud-detection targets specific signals, those producing fraudulent work will optimize to avoid those signals. The detection measure becomes a target and ceases to be a reliable quality indicator.

What does the growing norm that "a computational result is not scientific knowledge until its code and data are publicly executable" represent, according to Lesson 4?

Correct. This norm shift is presented as a durable positive development — AI's complexity made the inadequacy of prose-only methods description undeniable, accelerating standards that improve reproducibility for all research.

The lesson frames this as a genuine and durable evolution in scientific epistemology — AI's complexity made it clear faster than any other development that prose description is inadequate for reproducibility, driving standards that benefit all research.

Lab 4: Stress-Testing Peer Review

Analyze the failure modes of peer review in an AI-accelerated publication environment

Lab Scenario

You are on the editorial board of a mid-tier biomedical journal. Your editor-in-chief has asked you to draft a proposal for an AI-assisted review workflow that addresses the paper mill threat without creating the Goodhart's Law trap — and that is fair to legitimate researchers using AI tools appropriately. Your AI assistant specializes in research integrity infrastructure and editorial policy.

Starter prompts: "What automated pre-screening tools should a journal deploy before papers reach human reviewers?" / "How do we design an AI disclosure policy that doesn't penalize legitimate use?" / "What is the strongest argument for post-publication peer review platforms like PubPeer as a complement to pre-publication review?"

Editorial Policy & Integrity Assistant

I'm your editorial policy and research integrity assistant. I can help you think through peer review reform, AI pre-screening tools, paper mill detection strategies, disclosure policy design, and the balance between catching misconduct and avoiding false accusations. What aspect of your journal's AI integrity challenge would you like to work on?

Module 7 Test

Reproducibility and Scientific Integrity · 15 questions · Pass at 80%

1. In Glenn Begley's attempt, published in 2012, to reproduce 53 landmark cancer biology studies, what was the approximate replication success rate?

Correct. Only 6 of 53 studies replicated — 11%.

The success rate was about 11% — only 6 of 53 studies replicated successfully.

2. P-hacking refers to which problematic practice?

Correct. P-hacking exploits the threshold of p < 0.05 by running many tests until one crosses it.

P-hacking involves running multiple tests and reporting only those that reach the p < 0.05 threshold, inflating false-positive rates.

3. Which of the following best describes the difference between "reproducibility" and "replicability" in scientific methodology?

Correct. The distinction matters: reproducibility is a computational property; replicability is a scientific property about generalizability.

Reproducibility (same data, same code) and replicability (new data, independent team) are distinct. AI failures often affect reproducibility while the underlying science may still be replicable.

4. The Alkaissi and McFarlane (2023) study on ChatGPT-generated medical bibliographies found that approximately what percentage of references were entirely fabricated?

Correct. 69% fabrication — a rate that clearly establishes LLM-generated bibliographies as requiring full manual verification.

The study found ~69% of ChatGPT-generated references were entirely fabricated, establishing the need for systematic verification of all LLM citation outputs.

5. The ORI (Office of Research Integrity) defines scientific misconduct as requiring which element that creates ambiguity when applied to AI-generated hallucinations?

Correct. The intent requirement is the crux: AI-generated hallucinations can introduce fabricated content without the researcher's knowing awareness.

The ORI requires intentional, knowing, or reckless behavior — which becomes ambiguous when AI tools silently introduce fabricated content that researchers did not know was false.

6. What did the 2016 Nature survey of 1,576 researchers find regarding reproducibility in practice?

Correct. The survey established that reproducibility failures are widespread, cross-disciplinary, and common even within researchers' own prior work.

Over 70% had failed to reproduce another's results and over 50% had failed to reproduce their own — establishing the crisis as widespread and cross-disciplinary.

7. The FAIR data principles require that scientific data be Findable, Accessible, Interoperable, and Reusable. Which of the following would MOST directly violate the "Interoperable" requirement?

Correct. Interoperability requires that data use open standards and formats so that systems and communities can exchange and interpret data without specialized proprietary tools.

Interoperability means data uses open standards so it can be used across systems. A proprietary format requiring licensed software directly undermines interoperability.

8. In the ML Reproducibility Challenge (2021), what was the most common obstacle to reproducing ML paper results when authors' code was unavailable?

Correct. These were all structural documentation failures — not deliberate concealment — demonstrating that prose-only methods descriptions are systematically inadequate.

The obstacles were technical documentation failures: random seeds, hyperparameter choices, library versions, and preprocessing details described in words but not implemented in shareable code.

9. The NIH guidance NOT-OD-23-149 (2023) on AI-generated content in grant applications established which key accountability principle?

Correct. The NIH held researchers fully accountable — the tool does not transfer responsibility, and disclosure is required.

The NIH principle is clear: the researcher is accountable for all submitted content regardless of source, and AI use must be disclosed. The tool provides no liability shield.

10. The Acuna et al. (2018) computer vision study of 760,000 biomedical figures estimated what approximate prevalence of potentially problematic image duplications?

Correct. ~3.8% — at the scale of biomedical publishing (millions of papers), this implies tens of thousands of potentially problematic publications.

The study found ~3.8% — low as a percentage but enormous in absolute numbers given the scale of biomedical publishing, justifying systematic AI-assisted screening.

11. Which of the following best characterizes the relationship between Science magazine's and Nature's AI policies for submitted manuscripts (as of 2024)?

Correct. Science is more restrictive on text generation; Nature requires disclosure but does not broadly prohibit AI-assisted writing.

Science is stricter — prohibiting AI text unless specifically approved and banning AI image manipulation — while Nature requires disclosure but permits AI-assisted text with appropriate acknowledgment.

12. The Registered Reports publication format addresses publication bias by:

Correct. Pre-acceptance before data collection means the publication decision is structurally independent of result direction.

The key mechanism: review and acceptance happen before data collection, so the publication decision cannot be influenced by whether results are positive, negative, or null.

13. How does Goodhart's Law apply specifically to AI-based fraud detection in peer review?

Correct. The adversarial dynamic: as detection signals become known, paper mills and dishonest researchers will optimize to avoid them while the underlying fraud persists.

Goodhart's Law: when a measure becomes a target, it stops being a good measure. In fraud detection, as AI signals become known, actors adapt to avoid them while maintaining underlying misconduct.

14. The statcheck tool developed by Nuijten and colleagues performs which specific function in peer review?

Correct. statcheck performs mechanical statistical consistency checking — finding errors that human reviewers routinely miss.

statcheck extracts statistical values from manuscripts and checks internal consistency — verifying that the test statistic, degrees of freedom, and p-value all match. It found such errors in ~50% of psychology papers.

15. According to Lesson 4, what is the most durable long-term response to AI's disruption of scientific integrity?

Correct. The lesson argues that the most durable solution is a combined cultural and structural shift: executable reproducibility standards, reformed incentives, and open science infrastructure working together.

The lesson argues for a durable cultural shift — where executable code and data are required for a result to count as scientific knowledge — combined with institutional reform of incentives that currently reward quantity over verifiability.