Module 5 · Lesson 1

How Research Agents Plan and Search

From single queries to coordinated multi-step investigations — the architecture of an agent that reads the internet.

What separates a research agent from a search engine, and why does the distinction matter for the quality of answers you can trust?

In late 2022, the AI research assistant Elicit began running structured literature reviews by decomposing a user question into sub-queries, searching Semantic Scholar's database of 200 million papers, extracting key claims from each abstract, and synthesising a structured summary — all autonomously. By 2023 the tool processed millions of queries monthly. Researchers at the Institute for Progress used Elicit to map the global biosafety literature in days rather than the weeks a manual review would require. The system's ability to iterate — refining search terms after inspecting early results — is what distinguished it from a simple keyword search.

What Makes a Research Agent Different

A conventional search engine returns a ranked list of documents for a single query. A research agent does something architecturally distinct: it plans before it searches, inspects intermediate results, and revises its strategy based on what it finds. This loop — plan → search → read → revise → synthesise — is the core capability that makes AI research agents qualitatively more powerful than retrieval alone.

The planning step typically involves decomposing a complex question into atomic sub-questions. If you ask "What is the evidence for omega-3 supplementation reducing cardiovascular events in diabetic patients?", a research agent breaks this into: What RCTs exist on omega-3 and cardiovascular outcomes? Which specifically enrol diabetic populations? What are the effect sizes and confidence intervals? This decomposition strategy was formalised in the paper "Decomposed Prompting" (Khot et al., Allen Institute for AI and collaborators, ICLR 2023), and similar strategies underpin agents like Elicit and Perplexity's "Deep Research" mode.
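The output of the decomposition step can be pictured as a small dependency-ordered plan. This is a minimal sketch, assuming a hypothetical `SubQuestion` record and a hand-written plan for the omega-3 example; a real agent would generate the sub-questions with a language model rather than hard-coding them.

```python
from dataclasses import dataclass, field


@dataclass
class SubQuestion:
    """One atomic question the agent can answer with a single search or read."""
    text: str
    depends_on: list[int] = field(default_factory=list)  # indices of prerequisites
    answered: bool = False


def next_ready(plan: list[SubQuestion]) -> list[int]:
    """Return indices of unanswered sub-questions whose prerequisites are all answered."""
    return [
        i for i, sq in enumerate(plan)
        if not sq.answered and all(plan[d].answered for d in sq.depends_on)
    ]


# Hand-written plan for the omega-3 example (illustrative, not model output).
plan = [
    SubQuestion("What RCTs exist on omega-3 and cardiovascular outcomes?"),
    SubQuestion("Which of those RCTs specifically enrol diabetic populations?", depends_on=[0]),
    SubQuestion("What are the effect sizes and confidence intervals?", depends_on=[1]),
]
```

Calling `next_ready(plan)` at each iteration tells the agent which sub-questions it can search for now and which must wait for earlier answers.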

The reading step involves more than fetching a URL. Modern research agents parse retrieved documents — extracting structured fields like sample size, methodology, and conclusions — before deciding whether to follow citations further or pivot to a new search arm. This is what the WebGPT system demonstrated at OpenAI in 2021: an agent trained to use a web browser to gather evidence before answering, rewarded explicitly for citing sources.

Real System — WebGPT (OpenAI, 2021)

OpenAI's WebGPT was fine-tuned from GPT-3 using human feedback to browse the web, quote relevant passages, and produce cited answers. On open-ended questions, evaluators preferred its answers to those written by human demonstrators 56% of the time, and to the highest-voted Reddit answers 69% of the time. The result suggested that retrieval plus synthesis can outperform parametric memory alone on research tasks.

The Search–Read–Revise Loop

Researchers at Anthropic and DeepMind have studied how agents allocate "compute budget" across a research task. The most effective agents spend roughly equal fractions of their token budget on search query formulation, document reading, and synthesis — not front-loading everything into the first search. This mirrors how expert human researchers work: a literature review iterates between finding papers and updating the conceptual map of what's already known.

The loop terminates when one of three conditions is met: (1) the agent judges that coverage is sufficient — it has found consistent evidence across multiple independent sources; (2) a budget constraint is hit (time, tokens, or API calls); or (3) the agent detects irresolvable contradiction and escalates to a human. Condition (3) is the critical safety valve — without it, agents can confidently synthesise contradictory sources into a false consensus.
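The three termination conditions above can be sketched as a loop with explicit exit points. This is an illustration under simplifying assumptions, not any product's implementation: `Finding`, `run_loop`, and the `min_sources` threshold are hypothetical, the budget is a plain step count, and "irresolvable contradiction" is reduced to "enough independent sources gathered and they still disagree".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Stop(Enum):
    SUFFICIENT_COVERAGE = "sufficient coverage"
    BUDGET_EXHAUSTED = "budget exhausted"
    ESCALATE_CONTRADICTION = "irresolvable contradiction, escalate to human"


@dataclass
class Finding:
    source: str      # identifier of an independent source
    supports: bool   # does the source support the claim under investigation?


def run_loop(search_step: Callable[[int], list[Finding]],
             max_steps: int = 5,
             min_sources: int = 3) -> tuple[Stop, list[Finding]]:
    """Iterate search steps until one of the three stop conditions fires."""
    findings: list[Finding] = []
    for step in range(max_steps):
        findings.extend(search_step(step))
        sources = {f.source for f in findings}
        verdicts = {f.supports for f in findings}
        if len(sources) >= min_sources:
            if len(verdicts) == 1:
                return Stop.SUFFICIENT_COVERAGE, findings   # condition (1)
            return Stop.ESCALATE_CONTRADICTION, findings    # condition (3)
    return Stop.BUDGET_EXHAUSTED, findings                  # condition (2)
```

Returning the stop reason alongside the findings lets the caller treat an escalation differently from a normal finish, which is the point of the safety valve.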

Decomposition: Breaking a complex research question into atomic sub-queries that can each be answered by a single search or document read.

Retrieval-Augmented Generation (RAG): The architecture in which a language model is given retrieved documents as context before generating an answer, grounding the response in external evidence.

Search–Read–Revise Loop: The iterative cycle (search, inspect results, update the plan, search again) that distinguishes an agent from a single-shot query.

Retrieval Tools Available to Research Agents

Research agents in 2024–25 typically integrate several retrieval surfaces: general web search (Bing, Google APIs), academic databases (Semantic Scholar, PubMed, arXiv), curated knowledge graphs (Wikidata), and private document stores via vector databases like Pinecone or Weaviate. Each surface has different freshness, authority, and coverage characteristics. A well-designed agent routes sub-questions to the appropriate surface, for instance asking PubMed about clinical trial evidence rather than scraping Reddit.
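Routing can be as simple as a lookup from sub-question features to a surface name. The sketch below assumes a hypothetical keyword table and surface names chosen for illustration; a production agent would more likely use a classifier or the language model itself to choose.

```python
# Hypothetical keyword-to-surface routing table (first match wins).
ROUTES = [
    (("trial", "rct", "clinical", "patients"), "pubmed"),
    (("preprint", "benchmark", "algorithm"), "arxiv"),
    (("citation", "paper", "literature"), "semantic_scholar"),
    (("internal", "our company", "policy document"), "private_vector_store"),
]
DEFAULT_SURFACE = "web_search"


def route(sub_question: str) -> str:
    """Return the name of the first retrieval surface whose keywords match."""
    text = sub_question.lower()
    for keywords, surface in ROUTES:
        if any(k in text for k in keywords):
            return surface
    return DEFAULT_SURFACE
```

Keeping the table as data rather than code makes it easy to audit which surface a given class of question will reach.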

The tool-use infrastructure is standardised in some ecosystems. OpenAI's function calling mechanism, released in June 2023, lets developers declare tools (search, calculator, code interpreter) as JSON schemas and receive structured tool calls from the model. This allowed commercial products like ChatGPT with browsing and Perplexity AI to build reliable research loops on top of a shared tool-call interface rather than each engineering its own custom integration.
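A tool declaration and the matching dispatch step might look like the sketch below. The declaration follows the JSON Schema style of OpenAI's chat `tools` parameter; the tool name `search_papers`, its parameters, and the stub implementation are hypothetical, and no API call is made here.

```python
import json

# Tool declaration in the JSON Schema style used for function calling.
# The tool itself (`search_papers`) is a made-up example.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_papers",
        "description": "Search an academic index and return matching titles.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer"},
            },
            "required": ["query"],
        },
    },
}


def search_papers(query: str, max_results: int = 3) -> list[str]:
    """Stand-in implementation; a real tool would call a search API here."""
    return [f"result {i + 1} for {query!r}" for i in range(max_results)]


REGISTRY = {"search_papers": search_papers}


def dispatch(name: str, arguments_json: str) -> str:
    """Run the tool the model asked for and return a JSON string result."""
    kwargs = json.loads(arguments_json)  # arguments arrive as a JSON string
    return json.dumps(REGISTRY[name](**kwargs))
```

The agent loop sends `SEARCH_TOOL` with the request, and when the model responds with a tool call, passes its name and argument string to `dispatch` and returns the result as the next message.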

Key Insight

The intelligence of a research agent is not primarily in its retrieval mechanism — web search is a commodity. It is in the planning and synthesis layers: how the agent decides what to look for next, how it reads for relevant claims rather than full comprehension, and how it reconciles conflicting evidence into a calibrated answer.

Lesson 1 Quiz

How Research Agents Plan and Search — 5 questions

1. What distinguishes a research agent from a conventional search engine?

Correct. The plan→search→read→revise loop is the structural difference, not raw retrieval speed or model size.

Not quite. The key distinction is the iterative planning loop, not a technical detail about indexing or model scale.

2. Which OpenAI system demonstrated in 2021 that browsing plus synthesis outperforms parametric memory on research tasks?

Correct. WebGPT was trained via human feedback to browse and cite, with evaluators preferring its answers to those of human demonstrators 56% of the time.

Incorrect. WebGPT, published in 2021, was the system specifically trained to browse the web and cite sources for research tasks.

3. What is "decomposition" in the context of research agents?

Correct. Decomposition is the planning step that converts a broad question into manageable sub-questions, each addressable by a targeted search.

Not right. Decomposition here means breaking the research question apart into sub-queries — a planning step, not a data engineering technique.

4. Which of the following is a valid termination condition for a research agent's search–read–revise loop?

Correct. Detecting unresolvable contradiction and escalating is the critical safety valve termination condition described in the lesson.

Incorrect. Valid termination conditions include sufficient coverage, budget exhaustion, or irresolvable contradiction — not arbitrary query counts or keyword presence.

5. OpenAI's function calling mechanism, released in June 2023, primarily enabled research agents by doing what?

Correct. Function calling gave agents a reliable, structured way to invoke tools like search and code interpreters — standardising the integration layer.

Incorrect. Function calling's value was in standardising tool invocation, not in expanding context or providing database access.

Lab 1: Designing a Research Plan

Practice decomposing a complex question and routing sub-queries to the right retrieval surface.

Your Task

You're designing a research agent to answer a hard policy question. Work through the decomposition and retrieval-routing steps with the AI assistant below. Engage in at least three exchanges to complete the lab.

Scenario: A policy team needs to know — "Does paid parental leave reduce child poverty rates, and what are the strongest RCTs or natural experiments on this?" Design the research plan: break down the question, identify which sources to search, and describe how the agent should handle contradictory findings.

Research Planning Assistant

Welcome to Lab 1. Let's design a research agent plan together. Start by telling me how you'd decompose the question about paid parental leave and child poverty into sub-queries. What are the first two or three questions your agent should answer?

Module 5 · Lesson 2

Autonomous Literature Review

How AI agents are replacing weeks of manual screening with hours of systematic evidence synthesis.

When an AI agent screens 10,000 papers in a day, what quality controls must be in place before you trust its conclusions?

The Cochrane Collaboration — the gold standard for systematic medical reviews — began piloting AI-assisted screening in 2023. Their tool, integrating machine learning classifiers trained on Cochrane's own inclusion criteria, reduced abstract screening time by roughly 65% across several pilot reviews. In one published pilot on a cardiovascular intervention review, the AI screened 8,200 abstracts in under four hours; two human reviewers then verified a 10% sample. The AI achieved 97.3% sensitivity for relevant papers — meaning it missed only 2.7% of studies that should have been included. This is the documented capability baseline that serious autonomous literature agents are now approaching.
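Sensitivity is the share of truly relevant papers that the screener kept. A minimal worked example follows; the counts used in the usage note are invented to reproduce a 97.3% figure and are not taken from the pilot.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of truly relevant papers that the screener included."""
    relevant = true_positives + false_negatives
    if relevant == 0:
        raise ValueError("no relevant papers in the verified sample")
    return true_positives / relevant
```

For instance, `sensitivity(973, 27)` gives 0.973. Sensitivity says nothing about how many irrelevant papers were let through, so a screening report should pair it with a measure of false positives.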

The Systematic Review Pipeline

A systematic literature review follows a defined protocol: (1) specify a PICO question (Population, Intervention, Comparison, Outcome); (2) search multiple databases with pre-registered search strings; (3) screen titles and abstracts for relevance; (4) retrieve full texts and apply inclusion/exclusion criteria; (5) extract data from included studies; (6) assess risk of bias; (7) synthesise evidence, often via meta-analysis. AI agents are now capable of automating steps 2 through 5 with human oversight on 6 and 7.
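Steps 1 and 2 can be connected by deriving the search string directly from the PICO question. This is a sketch with a hypothetical `Pico` class; it uses generic Boolean syntax and does not reproduce any database's exact query language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pico:
    population: list[str]
    intervention: list[str]
    comparison: list[str]
    outcome: list[str]

    def search_string(self) -> str:
        """OR synonyms within each element, AND across the non-empty elements."""
        groups = [self.population, self.intervention, self.comparison, self.outcome]
        clauses = [
            "(" + " OR ".join(f'"{term}"' for term in group) + ")"
            for group in groups if group
        ]
        return " AND ".join(clauses)


question = Pico(
    population=["type 2 diabetes", "diabetic"],
    intervention=["omega-3"],
    comparison=[],
    outcome=["cardiovascular events"],
)
```

Generating the string from a fixed record keeps the search reproducible: the same PICO question yields the same query on every run.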

The key breakthrough enabling step 5 — structured data extraction — was the combination of PDF parsing and prompted extraction. Elicit's "extract data" feature, launched in 2023, could read a full-text clinical trial PDF and populate a structured table with fields like sample size, intervention dosage, follow-up duration, and primary outcome effect size. Researchers at Harvard's Catalyst programme validated this against manual extraction and found 89% field-level agreement — good enough to flag for human check but too unreliable to accept without review.

Real Tool — Elicit Data Extraction (2023)

Elicit's automated extraction achieved 89% field-level agreement with manual extractors on a validation set of 200 RCTs. The 11% discrepancy was concentrated in complex fields like subgroup analyses and adjusted effect sizes — precisely the fields most consequential for meta-analysis.
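Field-level agreement is easier to act on when reported per field rather than as one number, since the discrepancies cluster in specific fields. The function below is an illustrative sketch using exact-match comparison; the field names in the usage note are hypothetical.

```python
def field_agreement(auto: list[dict], manual: list[dict], fields: list[str]) -> dict[str, float]:
    """Per-field share of studies where automated and manual extraction match exactly."""
    if not auto or len(auto) != len(manual):
        raise ValueError("extractions must be non-empty and cover the same studies in order")
    return {
        f: sum(a.get(f) == m.get(f) for a, m in zip(auto, manual)) / len(auto)
        for f in fields
    }
```

With two studies where the sample size matches in both but the effect size matches in only one, `field_agreement(auto, manual, ["n", "effect"])` returns `{"n": 1.0, "effect": 0.5}`, pointing reviewers at the field that needs checking.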

Bias Detection and Quality Assessment

Risk-of-bias assessment — evaluating whether a study's design could have produced a systematically skewed result — is the most technically demanding step for autonomous agents. The standard tool is the Cochrane Risk of Bias 2 (RoB 2) framework, which requires the assessor to make nuanced judgements about randomisation adequacy, blinding, and selective reporting. A 2024 pre-print from Oxford's Centre for Evidence-Based Medicine tested GPT-4 on 100 RCT risk-of-bias assessments and found 71% agreement with expert human assessors — significantly above random but below the 80%+ threshold typically required for automated replacement in high-stakes reviews.

The implication is architecturally important: even the most capable research agents in 2024 should be designed as human-in-the-loop for bias assessment, not as fully autonomous replacements. The agent flags, prioritises, and pre-populates; the expert decides. This division of labour is precisely what the Cochrane AI pilots implement.
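The "agent flags and prioritises, expert decides" division of labour can be expressed as a review queue in which nothing is auto-accepted. This is a sketch with a hypothetical `BiasDraft` record; the three judgement labels follow the RoB 2 wording.

```python
from dataclasses import dataclass


@dataclass
class BiasDraft:
    study_id: str
    domain: str           # e.g. "randomisation", "blinding", "selective reporting"
    agent_judgement: str  # "low", "some concerns", or "high"
    rationale: str        # the agent's pre-populated reasoning for the expert to check


def review_queue(drafts: list[BiasDraft]) -> list[BiasDraft]:
    """Every draft goes to an expert; likely-problematic ones are reviewed first."""
    priority = {"high": 0, "some concerns": 1, "low": 2}
    # Unrecognised labels sort to the front so they are never silently deprioritised.
    return sorted(drafts, key=lambda d: priority.get(d.agent_judgement, 0))
```

The function only reorders the list and never drops an item, so the agent's judgement affects the expert's reading order but not the final decision.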

Evidence Synthesis and Contradiction Handling

When an agent discovers conflicting findings — one RCT shows a strong positive effect, another shows null — it must not simply average or ignore the discrepancy. Proper evidence synthesis requires investigating why the results differ: different populations, different dosing regimes, different follow-up periods, or genuine heterogeneity in the underlying effect. This is called heterogeneity analysis in meta-analytic terms. AI agents that skip this step produce confidently wrong summaries.
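One common way to quantify heterogeneity before investigating its causes is Cochran's Q and the I² statistic. The sketch below computes both with inverse-variance weights; it is a standard textbook calculation offered as an illustration, not a description of what any named agent does, and the inputs in the usage note are invented.

```python
def heterogeneity(effects: list[float], std_errors: list[float]) -> tuple[float, float]:
    """Return (Cochran's Q, I-squared as a percentage) for a set of study effects."""
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching standard errors")
    weights = [1.0 / se ** 2 for se in std_errors]            # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i_squared
```

Two studies with effects 0.0 and 1.0 and standard errors of 0.1 give Q = 50 and I² = 98%, a signal that the pooled average would be misleading and that the agent should look for differences in population, dosing, or follow-up.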

Perplexity AI's Deep Research feature, released in February 2025, partially addresses this by showing its source list and flagging when sources contradict each other. Users reported on forums that the system would note "Source A reports X; Source B, using a different methodology, reports Y", a meaningful step toward transparent contradiction handling rather than silent averaging. This transparency design is the right direction, though verifying the quality of the heterogeneity detection itself remains an open research problem.

PICO Framework: Population, Intervention, Comparison, Outcome, the structured question format that systematic reviews use to define scope before searching.

Risk of Bias: A structured assessment of whether a study's design or conduct could have produced a systematically skewed result.

Heterogeneity: Variability in effect sizes across studies that must be investigated rather than averaged away during evidence synthesis.

Design Principle

Design autonomous literature agents as screening accelerators and extraction assistants, not as autonomous decision-makers for the highest-stakes synthesis steps. The agent does the volume work; the expert applies the critical judgement that the agent cannot yet reliably replicate.

Lesson 2 Quiz

Autonomous Literature Review — 5 questions

1. In the Cochrane AI screening pilot (2023), what sensitivity did the AI achieve for relevant papers?

Correct. The pilot achieved 97.3% sensitivity, meaning only 2.7% of relevant studies were missed by the AI screener.

Incorrect. The documented sensitivity was 97.3% in the Cochrane cardiovascular review pilot.

2. What field-level agreement did Elicit's automated data extraction achieve against manual extractors?

Correct. Elicit achieved 89% field-level agreement, with discrepancies concentrated in complex fields like subgroup analyses.

Incorrect. Elicit's validated field-level agreement was 89%, based on a 200-RCT validation set at Harvard's Catalyst programme.

3. The PICO framework stands for which four elements?

Correct. PICO — Population, Intervention, Comparison, Outcome — is the standard framework for structuring systematic review questions.

Incorrect. PICO stands for Population, Intervention, Comparison, Outcome — the structured question format for systematic reviews.

4. Why is heterogeneity analysis critical when an AI agent finds conflicting results across studies?

Correct. Heterogeneity analysis investigates the source of conflicting results — population differences, dosing, follow-up — rather than producing a misleadingly averaged conclusion.

Incorrect. Heterogeneity analysis means investigating why results differ — not averaging, not excluding, but understanding the source of variability.

5. According to a 2024 Oxford pre-print, what was GPT-4's agreement rate with expert human assessors on Cochrane Risk of Bias assessments?

Correct. 71% agreement — above random but below the ~80% threshold required for automated replacement in high-stakes systematic reviews.

Incorrect. The Oxford 2024 pre-print found 71% agreement — meaningful but below the threshold for autonomous bias assessment in high-stakes reviews.

Lab 2: Systematic Review Quality Control

Design the quality-control layer for an autonomous literature review agent.

Your Task

An autonomous agent has screened 5,000 abstracts and extracted data from 120 included RCTs on a nutrition intervention. Your job is to design the human-in-the-loop quality controls. Discuss with the assistant below. Engage in at least three exchanges.

Scenario: The agent reports that 40 studies show a significant positive effect, 35 show no effect, and 45 show a small positive effect. It has flagged 12 studies with "possible high risk of bias." What quality controls do you put in place, and how do you instruct the agent to handle the heterogeneity in the results?

Literature Review QC Assistant

Welcome to Lab 2. You're the methodologist overseeing this systematic review. Start by telling me what you'd do first with those 12 high-risk-of-bias studies — include them, exclude them, or something else? And why?

Module 5 · Lesson 3

Multi-Agent Research Networks

When a single agent isn't enough — how orchestrated agent networks tackle research problems too large or complex for one model.

What coordination problems emerge when multiple AI agents collaborate on the same research task, and how have real systems solved them?

The computational pipeline around AlphaFold 2 illustrates multi-agent research at scale. After AlphaFold's protein structure predictions were released in 2021, DeepMind and EMBL-EBI built automated pipelines that used structure predictions as inputs to downstream agents: one agent mined PubMed for papers describing the protein's known function, a second agent ran functional annotation queries against UniProt, and a third synthesised the structure and literature evidence into a summary for biologists. By 2023 this pipeline had processed over 200 million protein structures with automated literature linkage — a scale no single-agent or human team could approach.

Orchestrator–Worker Architecture

The dominant pattern for multi-agent research systems is the orchestrator–worker model. An orchestrator agent receives the high-level research question, decomposes it into tasks, assigns tasks to specialised worker agents, monitors their progress, and synthesises results. Worker agents have narrower competencies: one might specialise in PubMed queries, another in patent database search, another in statistical analysis of retrieved data.
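Stripped of any framework, the orchestrator–worker pattern is a coordinator that hands scoped tasks to specialised callables and collects what they return. The sketch below uses stub workers and a trivial one-task-per-worker decomposition; every name in it is hypothetical and it does not use AutoGen or any other library.

```python
from typing import Callable

Worker = Callable[[str], list[str]]  # takes a task description, returns findings


def pubmed_worker(task: str) -> list[str]:
    return [f"[pubmed] stub finding for: {task}"]    # stand-in for a real search


def patent_worker(task: str) -> list[str]:
    return [f"[patents] stub finding for: {task}"]   # stand-in for a real search


def orchestrate(question: str, workers: dict[str, Worker]) -> dict[str, list[str]]:
    """Decompose (trivially, one task per worker), delegate, and collect results."""
    results: dict[str, list[str]] = {}
    for name, worker in workers.items():
        task = f"{question} (scope: {name})"   # a real orchestrator would plan this with a model
        results[name] = worker(task)
    return results
```

Keeping results keyed by worker name preserves which specialist produced which finding, which matters later for attribution.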

This architecture was formalised in Microsoft's AutoGen framework (published September 2023), which allows developers to define multi-agent conversations where agents have different system prompts, tools, and roles. In documented research applications, AutoGen-based systems have been used to run competitive landscape analyses — a task that requires simultaneously searching patent databases, academic literature, and company news, then synthesising across all three with a single coherent narrative.

Real Framework — Microsoft AutoGen (2023)

AutoGen's multi-agent architecture allows an orchestrator to assign sub-tasks to worker agents with different tool access. In a documented competitive intelligence use case, a two-worker system (one handling academic search, one handling news/patent search) reduced synthesis time from several days of analyst work to under two hours, while maintaining source traceability through structured message passing.

Coordination Problems: Duplication, Contradiction, and Hallucination Amplification

Multi-agent systems introduce coordination failures that single-agent systems don't face. The three most documented in research contexts are:

Duplication: Two worker agents independently retrieve and process the same source, wasting compute and sometimes producing inconsistent extractions from the same document. AutoGen and similar frameworks address this via a shared document registry that marks sources as "claimed" when an agent begins processing them.

Contradiction propagation: If one worker agent makes an error in extraction and reports a false claim to the orchestrator, the orchestrator may incorporate that claim into the synthesis before a second worker agent can correct it. The temporal order of message passing matters. Graph-based frameworks like LangGraph (released by LangChain in January 2024) let developers add explicit validation nodes: checkpoints where a verification agent inspects claims before they enter the synthesis context.

Hallucination amplification: In a chain of agents, each agent's output becomes the next agent's input. A confident-sounding hallucination in step 2 can be treated as established fact by agents in steps 3, 4, and 5, each adding elaboration that makes the hallucination appear increasingly authoritative. This was documented in a 2023 Stanford study on multi-hop reasoning chains — error rates compounded significantly across hops.
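Two of these failure modes are easy to make concrete. The sketch below shows a shared registry in which the first agent to claim a source owns it, and a one-line calculation of how errors compound across a chain. Both are illustrations: `DocumentRegistry` is a hypothetical class, not AutoGen's API, and the compounding formula assumes each step errs independently with the same probability.

```python
import threading


class DocumentRegistry:
    """Shared registry: the first agent to claim a source processes it."""

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}
        self._lock = threading.Lock()

    def claim(self, source_id: str, agent: str) -> bool:
        """Return True if `agent` owns `source_id`, False if another agent claimed it first."""
        with self._lock:
            owner = self._owners.setdefault(source_id, agent)
            return owner == agent


def chain_error_rate(per_step_error: float, steps: int) -> float:
    """Probability that at least one step errs, assuming independent steps."""
    return 1.0 - (1.0 - per_step_error) ** steps
```

Under the independence assumption, a 5% error rate per step becomes roughly a 23% chance of at least one error across five steps, which is why a checkpoint early in the chain is worth more than one at the end.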

Source Attribution in Multi-Agent Outputs

When a multi-agent system produces a research report, attribution becomes a coordination challenge. Each worker agent may have processed dozens of sources; the orchestrator's synthesis must trace which claim came from which source without collapsing the attribution chain. Systems that lose this trail produce unreproducible research outputs.

Perplexity Deep Research (February 2025) handles this by requiring each step in its agent chain to pass source citations alongside claims, not just text. The final report shows nested citations: a claim in the synthesis links back to a specific agent step, which links to a specific document passage. This "citation chain" approach is the state-of-the-art pattern for trustworthy multi-step research outputs as of 2025.
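A citation chain can be modelled as nested records in which a claim cannot exist without the steps and passages behind it. The data structure below is a hypothetical sketch of that idea, not Perplexity's internal format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcePassage:
    document_id: str
    quote: str


@dataclass(frozen=True)
class AgentStep:
    agent: str
    passage: SourcePassage


@dataclass(frozen=True)
class Claim:
    text: str
    steps: tuple[AgentStep, ...]   # every step that contributed to the claim

    def trace(self) -> list[str]:
        """Human-readable trail from the claim back to each source passage."""
        return [f'{s.agent} -> {s.passage.document_id}: "{s.passage.quote}"' for s in self.steps]


def untraceable(claims: list[Claim]) -> list[Claim]:
    """Claims with no supporting step should not reach the final report."""
    return [c for c in claims if not c.steps]
```

Running `untraceable` over the orchestrator's draft before publication gives a simple gate: any claim it returns has lost its attribution trail and needs to be re-sourced or removed.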

Orchestrator–Worker: A multi-agent architecture where one orchestrator agent decomposes a task and delegates to specialised worker agents.

Hallucination Amplification: The compounding of errors when one agent's false output becomes authoritative input for subsequent agents in a pipeline.

Citation Chain: A traceability structure where every claim in a final synthesis can be traced back through agent steps to a specific source passage.

Design Principle

In any multi-agent research network, build validation checkpoints between worker agents and the synthesis layer. Never let an unverified claim from one worker propagate directly into the final output — insert a verification agent or human review step that scrutinises claims before they become part of the synthesis context.
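In code, a validation checkpoint is a filter that sits between the workers and the synthesis context. The sketch below is framework-free and uses a deliberately simple verifier (verbatim match against the cited passage); all names are hypothetical, and a real system would plug in a verification agent or a human review queue as the `verify` callable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WorkerClaim:
    worker: str
    text: str
    source_text: str   # the passage the worker says supports the claim


Verifier = Callable[[WorkerClaim], bool]


def checkpoint(claims: list[WorkerClaim],
               verify: Verifier) -> tuple[list[WorkerClaim], list[WorkerClaim]]:
    """Split claims into (verified, held_for_review); only the first list reaches synthesis."""
    verified: list[WorkerClaim] = []
    held: list[WorkerClaim] = []
    for claim in claims:
        (verified if verify(claim) else held).append(claim)
    return verified, held


def quoted_in_source(claim: WorkerClaim) -> bool:
    """Toy verifier: accept only claims that appear verbatim in the cited passage."""
    return claim.text.lower() in claim.source_text.lower()
```

Held claims are returned rather than discarded, so a reviewer can see what was blocked and why.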

Lesson 3 Quiz

Multi-Agent Research Networks — 5 questions

1. In the orchestrator–worker architecture, what is the orchestrator's primary role?

Correct. The orchestrator decomposes, delegates, monitors, and synthesises — it is the strategic coordinator, not the execution layer.

Incorrect. The orchestrator coordinates: it breaks the problem down and assigns work to specialised worker agents, then synthesises their outputs.

2. Which framework, published in September 2023, formalised multi-agent conversations with different roles and tool access?

Correct. Microsoft AutoGen, published September 2023, formalised the multi-agent conversation pattern with different system prompts and tool access per agent.

Incorrect. Microsoft AutoGen (September 2023) is the framework described in the lesson for formalising multi-agent research architectures.

3. What is "hallucination amplification" in a multi-agent research pipeline?

Correct. Hallucination amplification occurs when a confident-sounding error in an early agent step propagates and gains apparent authority through subsequent processing steps.

Incorrect. Hallucination amplification describes how one agent's error becomes the next agent's "established fact," compounding across the pipeline — documented in Stanford's 2023 study on multi-hop reasoning.

4. How does LangGraph address contradiction propagation in multi-agent research pipelines?

Correct. LangGraph's validation nodes intercept claims between worker agents and the synthesis layer, allowing a verification step before unverified claims enter the final context.

Incorrect. LangGraph uses validation checkpoints — explicit nodes in the graph where a verification agent checks claims before they are passed to synthesis.

5. What is the "citation chain" approach used by Perplexity Deep Research (2025)?

Correct. Perplexity's citation chain ensures every synthesised claim is traceable back through the agent pipeline to a specific source passage, making the research reproducible and auditable.

Incorrect. The citation chain is about traceability — linking each claim in the synthesis back through agent steps to the specific document passage that supports it.

Lab 3: Designing a Multi-Agent Research Network

Architect a multi-agent system and identify its coordination failure modes.

Your Task

You're building a competitive intelligence system using a multi-agent network. Work through the architecture and failure modes with the assistant. Engage in at least three exchanges.

Scenario: Your company needs a weekly report on competitor AI product launches. Design a multi-agent system: define the orchestrator's role, specify at least two worker agents with distinct tool access, and identify where hallucination amplification could occur and how you'd mitigate it.

Multi-Agent Architecture Assistant

Welcome to Lab 3. Let's architect your competitive intelligence multi-agent system. Start by describing the orchestrator agent — what is its system prompt and what decisions does it make? What tasks will it delegate, and to which types of worker agents?

Module 5 · Lesson 4

Trust, Verification, and the Future of AI Research

Why the hardest problem isn't making research agents faster — it's making their outputs trustworthy enough to act on.

How should organisations verify AI-generated research before it influences high-stakes decisions, and what verification failures have already caused harm?

In 2023, US attorney Steven Schwartz submitted a court brief citing six legal precedents, all generated by ChatGPT, none of which existed. The problem surfaced in May 2023 when opposing counsel could not locate the cited cases, and Judge P. Kevin Castel of the Southern District of New York subsequently ordered sanctions. Schwartz told the court he was unaware that ChatGPT could produce false information. The case, Mata v. Avianca, became the definitive documented example of AI-generated research entering high-stakes professional practice without verification. The AI had produced plausible-sounding citations complete with case names, docket numbers, and quoted passages, all invented.

This was not a research agent failure per se — it was a single-shot generation from ChatGPT — but it illustrates the verification problem that research agents must solve. An agent that retrieves real documents and cites them accurately addresses the citation fabrication problem, but introduces new trust questions: Did the agent correctly interpret what the retrieved document actually says? Did it select a representative passage or one cherry-picked to support a predetermined conclusion?

The Verification Stack

Organisations deploying research agents for high-stakes tasks need a verification stack — a layered set of checks that operates at different points in the pipeline. The layers, from automated to human, typically include:

Layer 1 — Source existence check: Verify that cited URLs, DOIs, or PubMed IDs resolve to real documents. This is automatable and should be done before any human reads the output. Services like CrossRef API enable DOI verification programmatically.

Layer 2 — Passage grounding check: Verify that the specific quoted passage appears in the retrieved document. Vector similarity between the claim and the source passage can flag misattribution. This is also automatable but less reliable — a paraphrased claim that accurately represents the source may score lower than a misleading direct quote.

Layer 3 — Interpretation check: A human expert (or a second, independent AI agent) reviews whether the agent's interpretation of the source is reasonable given the full document context. This is the hardest layer to automate and the most important for scientific literature.

Layer 4 — Synthesis coherence check: Does the overall synthesis follow logically from the sources, or has the agent introduced claims not supported by any retrieved document? This requires reading the final report against the full source set — feasible for a human reviewer, difficult to automate reliably.
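The two automatable layers can be sketched in a few lines. Layer 1 below checks a DOI against the Crossref REST API, which only covers DOIs registered with Crossref; the status fetcher is injectable so the check can be tested offline. Layer 2 uses Python's `difflib` as a simple stand-in for the vector similarity mentioned above, so its scores are character-based rather than semantic. Function names and the DOI pattern are assumptions made for illustration.

```python
import re
import urllib.error
import urllib.parse
import urllib.request
from difflib import SequenceMatcher
from typing import Callable

DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")   # common modern DOI shape, not exhaustive


def crossref_status(doi: str) -> int:
    """HTTP status for the DOI's Crossref record (200 if found, 404 if not)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def doi_exists(doi: str, fetch_status: Callable[[str], int] = crossref_status) -> bool:
    """Layer 1: reject malformed DOIs locally, then ask the registry."""
    return bool(DOI_PATTERN.match(doi)) and fetch_status(doi) == 200


def grounding_score(quote: str, document: str) -> float:
    """Layer 2: 1.0 for a verbatim match, else the best fuzzy match over sliding windows."""
    quote, document = quote.lower().strip(), document.lower()
    if not quote:
        return 0.0
    if quote in document:
        return 1.0
    width = len(quote)
    last_start = max(len(document) - width, 0)
    return max(
        SequenceMatcher(None, quote, document[i:i + width]).ratio()
        for i in range(0, last_start + 1, max(width // 4, 1))
    )
```

A low grounding score is a reason to route the claim to Layer 3, not proof of misattribution, since an accurate paraphrase can score poorly on a character-level comparison.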

Documented Failure — Mata v. Avianca (2023)

Six non-existent legal precedents were submitted to a federal court, generated by ChatGPT without retrieval grounding. The case established in legal practice that AI-generated citations must be independently verified before submission — a verification norm that research agents, by design, help satisfy through retrieval, but which still requires human confirmation of interpretation accuracy.

Confidence Calibration and Uncertainty Communication

A well-designed research agent should communicate not just findings but confidence levels. When the evidence base is thin — only one or two small studies — the agent should say so explicitly rather than presenting findings with the same grammatical confidence as a Cochrane review of 50 RCTs. This is called calibrated uncertainty communication.

Anthropic's 2022 Constitutional AI paper and its subsequent model cards describe how Claude is trained to express uncertainty with hedged language ("the evidence suggests…", "one study found…", "there is limited evidence that…") rather than asserting findings categorically. This epistemic humility is designed into the training objective, not layered on as a post-hoc filter. Research agents built on such models inherit some of this calibration, but users must be trained to read the hedges as meaningful signal rather than stylistic decoration.

The practical implication: when procuring or evaluating a research agent, test explicitly for calibration by asking it about a topic where the evidence is genuinely mixed or limited. A well-calibrated agent will reflect the uncertainty; a poorly calibrated one will sound equally confident on a well-established fact and on a contested claim with one small supporting study.
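Calibrated wording can be thought of as a mapping from evidence strength to phrasing. The function below is a toy illustration using the hedges quoted earlier; the cut-off of five studies is a hypothetical threshold, and real calibration would also weigh study size and quality.

```python
def hedge(finding: str, n_studies: int, consistent: bool) -> str:
    """Prefix a finding with language proportional to the evidence behind it."""
    if n_studies == 0:
        return f"No studies were found on whether {finding}."
    if n_studies == 1:
        return f"One study found that {finding}."
    if not consistent:
        return f"The evidence is mixed on whether {finding}."
    if n_studies < 5:            # hypothetical cut-off for a thin evidence base
        return f"There is limited evidence that {finding}."
    return f"The evidence suggests that {finding}."
```

When testing an agent for calibration, the check is whether its wording moves across these categories as the underlying evidence changes.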

The Near-Future Trajectory

Several developments are converging to expand research agent capability in 2025 and beyond. Extended context windows — GPT-4 Turbo's 128K token context, Gemini 1.5 Pro's 1M token context — allow agents to ingest entire research corpora in a single context, reducing the need for iterative retrieval loops. Tool-use and code execution allow agents to run statistical analyses on retrieved data rather than just summarising qualitative claims. And multi-modal agents can now read charts, figures, and tables in PDF papers — addressing a major limitation of earlier text-only research agents that missed visual evidence entirely.
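Whether a corpus can be read in one pass is a budgeting question. The sketch below estimates token count from character length using a rough four-characters-per-token heuristic for English text; that ratio and the reserve size are assumptions, and a real agent would count tokens with the model's own tokeniser.

```python
CHARS_PER_TOKEN = 4   # rough heuristic for English text; real tokenisers vary


def fits_in_context(documents: list[str], context_tokens: int,
                    reserve_tokens: int = 8_000) -> bool:
    """True if the corpus plus a reserve for the prompt and answer fits in one context."""
    estimated = sum(len(d) for d in documents) // CHARS_PER_TOKEN
    return estimated + reserve_tokens <= context_tokens
```

A corpus estimated at 200,000 tokens would not fit a 128K window but would fit a 1M window, so the same agent might use iterative retrieval with one model and single-pass reading with another.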

OpenAI's Deep Research feature, released in February 2025 for ChatGPT Pro subscribers, represents the current commercial frontier. It runs autonomously for 5–30 minutes on complex questions, executes dozens of searches, reads full documents, and produces cited reports of 1,000–5,000 words. Early benchmarks showed it outperforming PhD-level human researchers on structured information retrieval tasks — while still requiring expert review for interpretation and synthesis quality.

Verification Stack: Layered checks (source existence, passage grounding, interpretation, synthesis coherence) applied before acting on research agent output.

Calibrated Uncertainty: Expressing confidence levels proportional to evidence strength, using hedged language for weak evidence and strong language only for well-established findings.

Passage Grounding: Verifying that a specific claim or quote attributed to a source actually appears in that source document.

Core Principle

Research agents do not solve the trust problem; they transform it. Instead of worrying about whether an AI fabricated its citations, you worry about whether it correctly interpreted real ones. The verification burden shifts from existence checks, which can be automated, to interpretation checks, which still require human judgement. Design your workflows accordingly.

Lesson 4 Quiz

Trust, Verification, and the Future of AI Research — 5 questions

1. In Mata v. Avianca (2023), what specifically did attorney Steven Schwartz submit that led to sanctions?

Correct. Schwartz submitted six fabricated case citations — complete with case names, docket numbers, and invented quoted passages — all generated by ChatGPT.

Incorrect. The sanctions arose from submitting six non-existent legal precedents that ChatGPT invented, complete with plausible case names and docket numbers.

2. Which layer of the verification stack checks whether a quoted passage actually appears in the cited source document?

Correct. Layer 2 (passage grounding) verifies that the specific claim or quote attributed to a source actually appears in that document — distinct from simply checking whether the source exists.

Incorrect. Passage grounding is Layer 2. Layer 1 checks whether the source exists at all; Layer 3 evaluates whether the interpretation is reasonable.

3. What is calibrated uncertainty communication in the context of research agents?

Correct. Calibrated uncertainty means the agent's expressed confidence tracks the actual evidence strength — hedged language for limited evidence, stronger language only for well-established findings.

Incorrect. Calibrated uncertainty is about matching linguistic confidence to evidence strength — not about statistical thresholds or word limits.

4. What capability of Gemini 1.5 Pro is particularly relevant to reducing iterative retrieval loops in research agents?

Correct. A 1M token context allows a research agent to read an entire corpus at once rather than iterating through retrieval loops — fundamentally changing the architecture of the planning–search cycle.

Incorrect. The relevant capability is the 1 million token context window, which allows entire corpora to be ingested without iterative retrieval.

5. According to the lesson, how does deploying a well-designed research agent transform — rather than eliminate — the trust problem?

Correct. Retrieval agents reduce fabrication risk but introduce interpretation risk: you must now verify that real sources were correctly read, not just that sources exist at all.

Incorrect. Research agents transform the trust problem: fabrication risk decreases, but interpretation risk increases. Human verification remains necessary, just focused differently.

Lab 4: Building a Verification Protocol

Design a verification stack for a real research agent deployment scenario.

Your Task

A pharmaceutical company wants to use an AI research agent to produce competitive landscape reports on drug pipeline developments. Design the full verification protocol before the reports go to the executive team. Engage in at least three exchanges with the assistant.

Scenario: The research agent produces a 3,000-word report citing 28 sources (PubMed papers, clinical trial registry entries, SEC filings, and news articles). The report concludes that a competitor's Phase 3 trial shows "promising efficacy signals." Design a four-layer verification protocol for this report, and identify which layer is most likely to catch a dangerous interpretation error in this specific context.

Verification Protocol Assistant

Welcome to Lab 4. You're designing the verification stack for a high-stakes pharmaceutical research report. Start with Layer 1 — the source existence check. How would you automate this for a report that cites PubMed papers (DOIs), ClinicalTrials.gov entries, SEC filings, and news URLs? What APIs or tools would you use, and how would you handle sources that don't resolve?
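
One possible starting point for the Layer 1 question is to sort each citation by identifier type before attempting to resolve it. The sketch below makes no network calls. The regular expressions are simplified assumptions about what DOIs and ClinicalTrials.gov NCT numbers look like, the URL templates should be confirmed against each service's current documentation, and the function name is hypothetical. Anything unrecognised goes to a manual review queue rather than being dropped silently.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# ASSUMPTION: simplified patterns; real DOIs and registry IDs have more edge cases.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
NCT_RE = re.compile(r"^NCT\d{8}$", re.IGNORECASE)


def classify_citation(ref: str) -> Tuple[str, Optional[str]]:
    """Return (kind, url_to_check) for a citation string; url is None if unresolved."""
    ref = ref.strip()
    if DOI_RE.match(ref):
        return ("doi", f"https://doi.org/{ref}")
    if NCT_RE.match(ref):
        return ("clinical_trial", f"https://clinicaltrials.gov/study/{ref.upper()}")
    parsed = urlparse(ref)
    if parsed.scheme in ("http", "https") and parsed.hostname:
        host = parsed.hostname.lower()
        if host == "sec.gov" or host.endswith(".sec.gov"):
            return ("sec_filing", ref)
        return ("web", ref)
    return ("unresolved", None)  # route to a human instead of guessing


if __name__ == "__main__":
    for ref in ["10.1000/xyz123", "nct01234567", "Smith et al. 2020"]:
        print(ref, "->", classify_citation(ref))
```

A later step would fetch each URL and record which ones fail to resolve. A failure is a reason to inspect the citation, not proof of fabrication, since links also break for ordinary reasons.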

Module 5 Test

Research Agents — 15 questions · 80% to pass

1. What is the core structural difference between a research agent and a conventional search engine?

Correct. The plan→search→read→revise loop is what structurally distinguishes a research agent.

Incorrect. The distinction is the iterative planning and revision loop, not scale or database access.

2. The AI research tool Elicit primarily searches which database?

Correct. Elicit uses Semantic Scholar's database of over 200 million papers as its primary retrieval surface.

Incorrect. Elicit searches Semantic Scholar's database of 200 million papers.

3. WebGPT evaluators preferred its answers over base GPT-3 what percentage of the time?

Correct. Evaluators preferred WebGPT's cited, browsed answers 56% of the time on open-ended questions.

Incorrect. The figure was 56%, demonstrating that retrieval plus synthesis outperformed parametric memory.

4. In a systematic review pipeline, which steps is an AI agent most reliably able to automate as of 2024?

Correct. Steps 2–5 (search, screen, retrieve, extract) are automatable with oversight; bias assessment and synthesis require human expert judgment.

Incorrect. AI agents handle search through data extraction most reliably; bias assessment and high-stakes synthesis still require human expertise.

5. What is a PICO question used for in systematic reviews?

Correct. PICO defines the scope of the review question before any searching occurs — ensuring the agent searches for the right things.

Incorrect. PICO (Population, Intervention, Comparison, Outcome) is the question-structuring framework used before search strategy development.

6. What field-level agreement did Elicit achieve on structured data extraction from RCT PDFs?

Correct. 89% field-level agreement — sufficient to flag for human review, but not reliable enough for unverified acceptance.

Incorrect. Elicit achieved 89% agreement, validated on 200 RCTs at Harvard's Catalyst programme.

7. Which multi-agent coordination problem involves one agent's false claim becoming accepted fact for downstream agents?

Correct. Hallucination amplification describes how errors compound across pipeline steps when downstream agents treat earlier errors as established facts.

Incorrect. Hallucination amplification is the specific term for errors compounding across agent steps, not duplication or contradiction propagation.

8. Microsoft AutoGen was primarily designed to solve what problem?

Correct. AutoGen formalised the multi-agent conversation architecture with role-differentiated agents — enabling orchestrator–worker research pipelines.

Incorrect. AutoGen was designed for multi-agent conversation architectures with different system prompts and tool access per agent.

9. In the AlphaFold multi-agent pipeline, what did the agent that queried UniProt primarily do?

Correct. One worker agent ran functional annotation queries against UniProt, complementing the PubMed literature mining agent and AlphaFold's structure predictions.

Incorrect. The UniProt agent performed functional annotation — retrieving what is known about each protein's biological function.

10. What does the "citation chain" approach ensure in a multi-agent research output?

Correct. The citation chain is a traceability structure — each claim links back through the agent pipeline to the specific document passage that supports it.

Incorrect. The citation chain is about traceable provenance — every synthesised claim linked back to a specific source passage through the agent pipeline.

11. What was the core legal finding in Mata v. Avianca (2023) that is most relevant to research agents?

Correct. The case established the professional verification norm: AI-generated research outputs, specifically citations, must be independently confirmed before high-stakes use.

Incorrect. The key finding was that non-existent citations — fabricated by ChatGPT — require independent verification before submission, establishing a professional norm.

12. Which layer of the verification stack is the most important for catching dangerous interpretation errors in scientific literature?

Correct. Layer 3 (human expert or independent agent review of whether the source was correctly interpreted) is the hardest to automate and most consequential for scientific accuracy.

Incorrect. Layer 3 — the interpretation check — is most critical for catching misreadings of real sources, which is the dominant error type for retrieval-grounded agents.

13. GPT-4's agreement with expert human assessors on Cochrane Risk of Bias assessments was 71%. Why is this insufficient for autonomous deployment in high-stakes reviews?

Correct. The ~80% agreement threshold is the professional standard for automated replacement; 71% represents meaningful capability but falls short of autonomous deployment.

Incorrect. The issue is that high-stakes systematic reviews require ~80%+ agreement for automation; 71% is above random but below that threshold.

14. What does calibrated uncertainty communication require a research agent to do when evidence is thin?

Correct. Calibrated uncertainty means linguistic confidence tracks evidence strength — hedged language for limited evidence, stronger language only when evidence is robust.

Incorrect. Calibrated uncertainty means matching expressed confidence to evidence strength using hedged language — not refusing to answer or issuing generic disclaimers.

15. OpenAI's Deep Research (released February 2025) was notable for what capability benchmark?

Correct. Deep Research's benchmark results showed it outperforming PhD researchers on structured retrieval — with the important caveat that expert interpretation review remains necessary.

Incorrect. Deep Research's documented benchmark achievement was outperforming PhD-level researchers on structured information retrieval, while still needing expert interpretation review.