In late 2022, the AI research assistant Elicit began running structured literature reviews by decomposing a user question into sub-queries, searching Semantic Scholar's database of 200 million papers, extracting key claims from each abstract, and synthesising a structured summary — all autonomously. By 2023 the tool processed millions of queries monthly. Researchers at the Institute for Progress used Elicit to map the global biosafety literature in days rather than the weeks a manual review would require. The system's ability to iterate — refining search terms after inspecting early results — is what distinguished it from a simple keyword search.
A conventional search engine returns a ranked list of documents for a single query. A research agent does something architecturally distinct: it plans before it searches, inspects intermediate results, and revises its strategy based on what it finds. This loop — plan → search → read → revise → synthesise — is the core capability that makes AI research agents qualitatively more powerful than retrieval alone.
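The loop described above can be sketched in a few lines. This is a minimal illustration, not any product's implementation: the `ResearchState` class, the `run_research_loop` function, and the five callables passed into it are hypothetical names, and the toy corpus stands in for a real search backend.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ResearchState:
    question: str
    pending: list[str] = field(default_factory=list)  # sub-queries still to run
    notes: list[str] = field(default_factory=list)    # claims extracted so far
    steps: int = 0


def run_research_loop(
    question: str,
    plan: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
    read: Callable[[str], str],
    revise: Callable[[ResearchState], list[str]],
    synthesise: Callable[[ResearchState], str],
    max_steps: int = 10,
) -> str:
    # Plan first, then alternate search/read with revision until nothing is pending.
    state = ResearchState(question, pending=plan(question))
    while state.pending and state.steps < max_steps:
        query = state.pending.pop(0)
        for doc in search(query):
            state.notes.append(read(doc))
        # Revision may add new sub-queries based on what has been read so far.
        state.pending.extend(revise(state))
        state.steps += 1
    return synthesise(state)


# Toy run: the first search prompts one follow-up query, then the loop ends.
corpus = {"omega-3 RCTs": ["doc A"], "omega-3 diabetic cohorts": ["doc B"]}
answer = run_research_loop(
    "omega-3 and cardiovascular events",
    plan=lambda q: ["omega-3 RCTs"],
    search=lambda q: corpus.get(q, []),
    read=lambda d: f"claim from {d}",
    revise=lambda s: ["omega-3 diabetic cohorts"] if s.steps == 0 else [],
    synthesise=lambda s: "; ".join(s.notes),
)
```

The point of the sketch is the `revise` call inside the loop: a plain keyword search has no equivalent step, so it can never add the follow-up query.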
The planning step typically involves decomposing a complex question into atomic sub-questions. If you ask "What is the evidence for omega-3 supplementation reducing cardiovascular events in diabetic patients?", a research agent breaks this into: What RCTs exist on omega-3 and cardiovascular outcomes? Which of them specifically enrol diabetic populations? What are the effect sizes and confidence intervals? This decomposition strategy was formalised by researchers at the Allen Institute for AI and collaborators in the paper "Decomposed Prompting" (presented at ICLR 2023), and it underpins agents like Elicit and Perplexity's "Deep Research" mode.
The reading step involves more than fetching a URL. Modern research agents parse retrieved documents — extracting structured fields like sample size, methodology, and conclusions — before deciding whether to follow citations further or pivot to a new search arm. This is what the WebGPT system demonstrated at OpenAI in 2021: an agent trained to use a web browser to gather evidence before answering, rewarded explicitly for citing sources.
OpenAI's WebGPT was trained via human feedback to browse the web, quote relevant passages, and produce cited answers. Evaluators preferred its answers to those written by human demonstrators 56% of the time on open-ended questions, evidence that a model combining retrieval with synthesis can match human-written answers on research-style tasks.
Researchers at Anthropic and DeepMind have studied how agents allocate "compute budget" across a research task. The most effective agents spend roughly equal fractions of their token budget on search query formulation, document reading, and synthesis — not front-loading everything into the first search. This mirrors how expert human researchers work: a literature review iterates between finding papers and updating the conceptual map of what's already known.
The loop terminates when one of three conditions is met: (1) the agent judges that coverage is sufficient — it has found consistent evidence across multiple independent sources; (2) a budget constraint is hit (time, tokens, or API calls); or (3) the agent detects irresolvable contradiction and escalates to a human. Condition (3) is the critical safety valve — without it, agents can confidently synthesise contradictory sources into a false consensus.
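The three stopping conditions can be expressed as a small decision function. The sketch below is an assumption-laden simplification: the `LoopStatus` fields, the `Stop` enum, and the thresholds `min_agreeing` and `max_conflicts` are hypothetical, and "irresolvable contradiction" is reduced to a simple count of conflicting sources.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stop(Enum):
    COVERAGE = "coverage sufficient"
    BUDGET = "budget exhausted"
    ESCALATE = "irresolvable contradiction, escalate to human"


@dataclass
class LoopStatus:
    agreeing_sources: int     # independent sources supporting the leading answer
    conflicting_sources: int  # independent sources contradicting it
    tokens_used: int
    token_budget: int


def stop_reason(
    s: LoopStatus, min_agreeing: int = 3, max_conflicts: int = 2
) -> Optional[Stop]:
    # Contradiction is checked first so the safety valve is never masked
    # by an apparently sufficient number of agreeing sources.
    if s.conflicting_sources >= max_conflicts:
        return Stop.ESCALATE
    if s.agreeing_sources >= min_agreeing and s.conflicting_sources == 0:
        return Stop.COVERAGE
    if s.tokens_used >= s.token_budget:
        return Stop.BUDGET
    return None  # keep searching
```

Ordering matters here: if the coverage test ran first, five agreeing sources could hide two contradicting ones, which is exactly the false-consensus failure the third condition exists to prevent.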
Research agents in 2024–25 typically integrate several retrieval surfaces: general web search (Bing, Google APIs), academic databases (Semantic Scholar, PubMed, arXiv), curated knowledge graphs (Wikidata), and private document stores via vector databases like Pinecone or Weaviate. Each surface has different freshness, authority, and coverage characteristics. A well-designed agent routes sub-questions to the appropriate surface, for instance asking PubMed about clinical trial evidence rather than scraping Reddit.
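Routing can be as simple as scoring each sub-question against a per-surface vocabulary. The sketch below assumes a hypothetical `SURFACE_KEYWORDS` table and a `route` function; the keyword lists are invented for illustration, and a production agent would more likely use a trained classifier or the language model itself to make this decision.

```python
# Hypothetical keyword lists; real routing vocabularies would be far richer.
SURFACE_KEYWORDS = {
    "pubmed": ("trial", "rct", "clinical", "patients", "dosage"),
    "arxiv": ("preprint", "model", "algorithm", "benchmark"),
    "wikidata": ("founded", "headquarters", "population", "born"),
}


def route(sub_question: str, default: str = "web_search") -> str:
    """Pick the retrieval surface whose keywords best match the sub-question."""
    words = set(sub_question.lower().replace("?", " ").split())
    scores = {
        surface: sum(keyword in words for keyword in keywords)
        for surface, keywords in SURFACE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to general web search when no specialised surface matches.
    return best if scores[best] > 0 else default
```

For example, `route("What RCT evidence exists for omega-3 in diabetic patients?")` selects `"pubmed"`, while a question with no matching keywords falls through to `"web_search"`.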
The tool-use infrastructure is standardised in some ecosystems. OpenAI's function calling mechanism, released in June 2023, lets developers declare tools (search, calculator, code interpreter) as JSON schemas and have the model return structured calls to them. This allowed commercial products like ChatGPT with browsing and Perplexity AI to build reliable research loops on top of a shared tool-call interface rather than each engineering its own custom integration.
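A tool declaration and the code that executes a model-issued call might look like the sketch below. The schema follows the general shape used by function-calling interfaces (a name, a description, and JSON Schema parameters), but exact field names should be checked against the provider's current documentation. The `search_papers` tool, the `REGISTRY` table, and the `dispatch` helper are hypothetical, and no API request is made here.

```python
import json

# Tool declaration: a name, a description, and a JSON Schema for the arguments.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_papers",
        "description": "Search an academic index and return matching titles.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "limit": {"type": "integer"},
            },
            "required": ["query"],
        },
    },
}


def search_papers(query: str, limit: int = 3) -> list[str]:
    # Stub standing in for a real index lookup.
    return [f"{query} result {i}" for i in range(1, limit + 1)]


REGISTRY = {"search_papers": search_papers}


def dispatch(tool_call: dict) -> str:
    """Run the tool named in a model-issued call and return a JSON result string."""
    name = tool_call["name"]
    args = json.loads(tool_call["arguments"])  # arguments arrive as a JSON string
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps({"result": REGISTRY[name](**args)})


reply = dispatch({"name": "search_papers", "arguments": '{"query": "omega-3", "limit": 2}'})
```

The research loop then feeds `reply` back to the model as the tool result, which is what makes the interface reusable across products.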
The intelligence of a research agent is not primarily in its retrieval mechanism — web search is a commodity. It is in the planning and synthesis layers: how the agent decides what to look for next, how it reads for relevant claims rather than full comprehension, and how it reconciles conflicting evidence into a calibrated answer.
You're designing a research agent to answer a hard policy question. Work through the decomposition and retrieval-routing steps with the AI assistant below. Engage in at least three exchanges to complete the lab.
The Cochrane Collaboration — the gold standard for systematic medical reviews — began piloting AI-assisted screening in 2023. Their tool, integrating machine learning classifiers trained on Cochrane's own inclusion criteria, reduced abstract screening time by roughly 65% across several pilot reviews. In one published pilot on a cardiovascular intervention review, the AI screened 8,200 abstracts in under four hours; two human reviewers then verified a 10% sample. The AI achieved 97.3% sensitivity for relevant papers — meaning it missed only 2.7% of studies that should have been included. This is the documented capability baseline that serious autonomous literature agents are now approaching.
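Sensitivity is the share of truly relevant abstracts that the screener kept. The short calculation below shows how a figure like 97.3% arises; the counts (300 relevant abstracts, 8 missed) are invented for illustration, since the pilot figures quoted above do not state how many of the 8,200 abstracts were relevant.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of relevant items the screener correctly included."""
    relevant = true_positives + false_negatives
    if relevant == 0:
        raise ValueError("no relevant items, so sensitivity is undefined")
    return true_positives / relevant


# Hypothetical counts: 300 relevant abstracts, of which the screener missed 8.
kept, missed = 292, 8
rate = sensitivity(kept, missed)
miss_rate = 1 - rate
```

With these counts `rate` rounds to 0.973 and `miss_rate` to 0.027. Note that sensitivity says nothing about how many irrelevant abstracts were let through; that is a separate measure (specificity or precision).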
A systematic literature review follows a defined protocol: (1) specify a PICO question (Population, Intervention, Comparison, Outcome); (2) search multiple databases with pre-registered search strings; (3) screen titles and abstracts for relevance; (4) retrieve full texts and apply inclusion/exclusion criteria; (5) extract data from included studies; (6) assess risk of bias; (7) synthesise evidence, often via meta-analysis. AI agents can now automate steps 2 through 5, while steps 6 and 7 still depend on human judgement.
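The protocol lends itself to a simple data model. In the sketch below the `PICO` class, its `search_string` method, and the `PROTOCOL` table are hypothetical names; the search string is deliberately naive (real pre-registered strings use synonyms and controlled vocabulary), and assigning step 1 to a human is an assumption rather than something the protocol dictates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PICO:
    population: str
    intervention: str
    comparison: str
    outcome: str

    def search_string(self) -> str:
        # Naive boolean query over three of the four PICO elements.
        terms = [self.population, self.intervention, self.outcome]
        return " AND ".join(f'"{t}"' for t in terms)


# Who leads each protocol step: the agent for the volume work, a human elsewhere.
PROTOCOL = [
    ("specify PICO question", "human"),
    ("search databases", "agent"),
    ("screen titles and abstracts", "agent"),
    ("retrieve full texts and apply criteria", "agent"),
    ("extract data", "agent"),
    ("assess risk of bias", "human"),
    ("synthesise evidence", "human"),
]

question = PICO(
    population="adults with type 2 diabetes",
    intervention="omega-3 supplementation",
    comparison="placebo",
    outcome="cardiovascular events",
)
agent_steps = [i for i, (_, lead) in enumerate(PROTOCOL, start=1) if lead == "agent"]
```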
The key breakthrough enabling step 5, structured data extraction, was the combination of PDF parsing and prompted extraction. Elicit's "extract data" feature, launched in 2023, could read a full-text clinical trial PDF and populate a structured table with fields like sample size, intervention dosage, follow-up duration, and primary outcome effect size. Researchers at Harvard's Catalyst programme validated this against manual extraction and found 89% field-level agreement: good enough to pre-populate a table for human checking, but too unreliable to accept without review.
Elicit's automated extraction achieved 89% field-level agreement with manual extractors on a validation set of 200 RCTs. The 11% discrepancy was concentrated in complex fields like subgroup analyses and adjusted effect sizes — precisely the fields most consequential for meta-analysis.
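Field-level agreement is the fraction of extracted fields where the agent's value matches the human extractor's. A minimal sketch of how it could be computed is shown below; the `field_agreement` function and the two example trials are invented for illustration and are not drawn from the validation set described above.

```python
def field_agreement(agent_rows: list[dict], human_rows: list[dict]):
    """Return (overall agreement, per-field agreement) across paired extractions."""
    matches: dict[str, int] = {}
    totals: dict[str, int] = {}
    for agent, human in zip(agent_rows, human_rows):
        for name, human_value in human.items():
            totals[name] = totals.get(name, 0) + 1
            if agent.get(name) == human_value:
                matches[name] = matches.get(name, 0) + 1
    if not totals:
        raise ValueError("no fields to compare")
    per_field = {name: matches.get(name, 0) / n for name, n in totals.items()}
    overall = sum(matches.values()) / sum(totals.values())
    return overall, per_field


# Two invented trials; the agent gets one effect size wrong.
agent = [
    {"sample_size": 120, "follow_up_months": 12, "effect_size": 0.80},
    {"sample_size": 300, "follow_up_months": 24, "effect_size": 0.95},
]
human = [
    {"sample_size": 120, "follow_up_months": 12, "effect_size": 0.85},
    {"sample_size": 300, "follow_up_months": 24, "effect_size": 0.95},
]
overall, per_field = field_agreement(agent, human)
```

Reporting the per-field breakdown alongside the overall figure matters: a high overall score can hide the fact that the errors cluster in the fields a meta-analysis depends on most.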
Risk-of-bias assessment — evaluating whether a study's design could have produced a systematically skewed result — is the most technically demanding step for autonomous agents. The standard tool is the Cochrane Risk of Bias 2 (RoB 2) framework, which requires the assessor to make nuanced judgements about randomisation adequacy, blinding, and selective reporting. A 2024 pre-print from Oxford's Centre for Evidence-Based Medicine tested GPT-4 on 100 RCT risk-of-bias assessments and found 71% agreement with expert human assessors — significantly above random but below the 80%+ threshold typically required for automated replacement in high-stakes reviews.
The implication is architecturally important: even the most capable research agents in 2024 should be designed as human-in-the-loop for bias assessment, not as fully autonomous replacements. The agent flags, prioritises, and pre-populates; the expert decides. This division of labour is precisely what the Cochrane AI pilots implement.
When an agent discovers conflicting findings — one RCT shows a strong positive effect, another shows null — it must not simply average or ignore the discrepancy. Proper evidence synthesis requires investigating why the results differ: different populations, different dosing regimes, different follow-up periods, or genuine heterogeneity in the underlying effect. This is called heterogeneity analysis in meta-analytic terms. AI agents that skip this step produce confidently wrong summaries.
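One common way to quantify heterogeneity before investigating its causes is Cochran's Q and the derived I² statistic. The sketch below uses the standard fixed-effect formulas with inverse-variance weights; the `heterogeneity` function name and the example effect sizes are assumptions for illustration, and a real meta-analysis would use an established statistics package.

```python
def heterogeneity(effects: list[float], std_errors: list[float]):
    """Return (pooled effect, Cochran's Q, I-squared in percent)."""
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching standard errors")
    weights = [1 / se**2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of total variation attributable to heterogeneity, floored at 0.
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i_squared


# Invented example: one null result and one strong positive result.
pooled, q, i_squared = heterogeneity([0.0, 1.0], [0.1, 0.1])
```

Here the pooled effect is 0.5, but I² is about 98%, a signal that the average describes neither study and that the agent should look for differences in population, dose, or follow-up rather than report the mean.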
Perplexity AI's Deep Research feature, released in February 2025, partially addresses this by showing its source list and flagging when sources contradict each other. Users reported on forums that the system would note "Source A reports X; Source B, using a different methodology, reports Y", a meaningful step toward transparent contradiction handling rather than silent averaging. This transparency design is the right direction, though verifying the quality of the heterogeneity detection itself remains an open research problem.
Design autonomous literature agents as screening accelerators and extraction assistants, not as autonomous decision-makers for the highest-stakes synthesis steps. The agent does the volume work; the expert applies the critical judgement that the agent cannot yet reliably replicate.
An autonomous agent has screened 5,000 abstracts and extracted data from 120 included RCTs on a nutrition intervention. Your job is to design the human-in-the-loop quality controls. Discuss with the assistant below. Engage in at least three exchanges.
The computational pipeline around AlphaFold 2 illustrates multi-agent research at scale. After AlphaFold's protein structure predictions were released in 2021, DeepMind and EMBL-EBI built automated pipelines that used structure predictions as inputs to downstream agents: one agent mined PubMed for papers describing the protein's known function, a second agent ran functional annotation queries against UniProt, and a third synthesised the structure and literature evidence into a summary for biologists. By 2023 this pipeline had processed over 200 million protein structures with automated literature linkage — a scale no single-agent or human team could approach.
The dominant pattern for multi-agent research systems is the orchestrator–worker model. An orchestrator agent receives the high-level research question, decomposes it into tasks, assigns tasks to specialised worker agents, monitors their progress, and synthesises results. Worker agents have narrower competencies: one might specialise in PubMed queries, another in patent database search, another in statistical analysis of retrieved data.
This architecture was formalised in Microsoft's AutoGen framework (published September 2023), which allows developers to define multi-agent conversations where agents have different system prompts, tools, and roles. In documented research applications, AutoGen-based systems have been used to run competitive landscape analyses, a task that requires simultaneously searching patent databases, academic literature, and company news, then synthesising all three into a single coherent narrative.
AutoGen's multi-agent architecture allows an orchestrator to assign sub-tasks to worker agents with different tool access. In a documented competitive intelligence use case, a two-worker system (one handling academic search, one handling news/patent search) reduced synthesis time from several days of analyst work to under two hours, while maintaining source traceability through structured message passing.
Multi-agent systems introduce coordination failures that single-agent systems don't face. The three most documented in research contexts are:
Duplication: Two worker agents independently retrieve and process the same source, wasting compute and sometimes producing inconsistent extractions from the same document. Systems built on AutoGen and similar frameworks can address this with a shared document registry that marks sources as "claimed" when an agent begins processing them.
Contradiction propagation: If one worker agent makes an error in extraction and reports a false claim to the orchestrator, the orchestrator may incorporate that claim into the synthesis before the second worker agent corrects it. The temporal order of message passing matters. Systems like LangGraph (released by LangChain in January 2024) implement explicit validation nodes — checkpoints where a verification agent inspects claims before they enter the synthesis context.
Hallucination amplification: In a chain of agents, each agent's output becomes the next agent's input. A confident-sounding hallucination in step 2 can be treated as established fact by agents in steps 3, 4, and 5, each adding elaboration that makes the hallucination appear increasingly authoritative. This was documented in a 2023 Stanford study on multi-hop reasoning chains — error rates compounded significantly across hops.
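Two of the failure modes above are easy to make concrete. The sketch below shows a claim-based document registry for the duplication problem and a back-of-envelope calculation of how errors compound along a chain. `DocumentRegistry` and `chain_error_rate` are hypothetical names, not part of AutoGen or LangGraph, and the compounding formula assumes each step errs independently, which real agent chains may not satisfy.

```python
import threading
from typing import Optional


class DocumentRegistry:
    """Tracks which worker has claimed which source so each is processed once."""

    def __init__(self) -> None:
        self._claims: dict[str, str] = {}
        self._lock = threading.Lock()  # workers may call claim() concurrently

    def claim(self, doc_id: str, worker: str) -> bool:
        """Return True if this worker now owns doc_id, False if already claimed."""
        with self._lock:
            if doc_id in self._claims:
                return False
            self._claims[doc_id] = worker
            return True

    def owner(self, doc_id: str) -> Optional[str]:
        return self._claims.get(doc_id)


def chain_error_rate(per_step_error: float, steps: int) -> float:
    """Probability that at least one step errs, assuming independent steps."""
    return 1 - (1 - per_step_error) ** steps


registry = DocumentRegistry()
first = registry.claim("doi:10.1000/example", "worker_a")
second = registry.claim("doi:10.1000/example", "worker_b")
four_hop_risk = chain_error_rate(0.05, 4)
```

Under the independence assumption, a 5% per-step error rate becomes roughly an 18.5% chance of at least one error over four hops, which is why unchecked chains degrade so quickly.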
When a multi-agent system produces a research report, attribution becomes a coordination challenge. Each worker agent may have processed dozens of sources; the orchestrator's synthesis must trace which claim came from which source without collapsing the attribution chain. Systems that lose this trail produce unreproducible research outputs.
Perplexity Deep Research (February 2025) handles this by requiring each step in its agent chain to pass source citations alongside claims, not just text. The final report shows nested citations: a claim in the synthesis links back to a specific agent step, which links to a specific document passage. This "citation chain" approach is the state-of-the-art pattern for trustworthy multi-step research outputs as of 2025.
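A citation chain can be modelled as claims that carry their provenance with them. The sketch below is a generic illustration of the pattern, not a description of any product's internals; the `Passage` and `Claim` classes and the `trace` and `untraceable` helpers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str
    quote: str


@dataclass(frozen=True)
class Claim:
    text: str
    agent_step: str                 # which worker or step produced the claim
    passages: tuple[Passage, ...]   # supporting passages, possibly empty


def trace(claim: Claim) -> list[str]:
    """Render the chain: agent step -> document -> quoted passage."""
    return [f'{claim.agent_step} -> {p.doc_id}: "{p.quote}"' for p in claim.passages]


def untraceable(claims: list[Claim]) -> list[Claim]:
    """Claims with no supporting passage; these should not reach the report."""
    return [c for c in claims if not c.passages]


grounded = Claim(
    text="The trial reported fewer cardiovascular events.",
    agent_step="worker:academic_search",
    passages=(Passage("doc-17", "fewer cardiovascular events in the treatment arm"),),
)
orphan = Claim(text="Competitors plan a 2026 launch.", agent_step="worker:news", passages=())
```

Because the passages travel with the claim object, the orchestrator cannot summarise a claim without also inheriting its sources, which is what keeps the attribution chain from collapsing.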
In any multi-agent research network, build validation checkpoints between worker agents and the synthesis layer. Never let an unverified claim from one worker propagate directly into the final output — insert a verification agent or human review step that scrutinises claims before they become part of the synthesis context.
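The checkpoint idea reduces to a gate that splits worker output into accepted claims and claims held for review. In the sketch below, `validation_gate` and the `has_source` verifier are hypothetical; a real verifier would be a second agent or a human reviewer rather than a one-line check.

```python
from typing import Callable


def validation_gate(
    claims: list[dict], verify: Callable[[dict], bool]
) -> tuple[list[dict], list[dict]]:
    """Split claims into those allowed into synthesis and those held for review."""
    accepted, held = [], []
    for claim in claims:
        (accepted if verify(claim) else held).append(claim)
    return accepted, held


def has_source(claim: dict) -> bool:
    # Deliberately trivial verifier: a claim must cite at least one source.
    return bool(claim.get("sources"))


worker_output = [
    {"text": "Trial A enrolled 120 patients.", "sources": ["doc-3"]},
    {"text": "Trial B was stopped early.", "sources": []},
]
accepted, held = validation_gate(worker_output, has_source)
```

Only `accepted` is passed to the orchestrator's synthesis context; `held` goes to the verification agent or a human queue.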
You're building a competitive intelligence system using a multi-agent network. Work through the architecture and failure modes with the assistant. Engage in at least three exchanges.
In 2023, US attorney Steven Schwartz submitted a court brief citing six legal precedents, all generated by ChatGPT and none of which existed. Judge P. Kevin Castel of the Southern District of New York ordered sanctions after opposing counsel discovered the citations were fabricated. Schwartz told the court he was unaware that ChatGPT could produce false information. The case, Mata v. Avianca, became the definitive documented example of AI-generated research entering high-stakes professional practice without verification. The AI had produced plausible-sounding citations complete with case names, docket numbers, and quoted passages, all invented.
This was not a research agent failure per se — it was a single-shot generation from ChatGPT — but it illustrates the verification problem that research agents must solve. An agent that retrieves real documents and cites them accurately addresses the citation fabrication problem, but introduces new trust questions: Did the agent correctly interpret what the retrieved document actually says? Did it select a representative passage or one cherry-picked to support a predetermined conclusion?
Organisations deploying research agents for high-stakes tasks need a verification stack — a layered set of checks that operates at different points in the pipeline. The layers, from automated to human, typically include:
Layer 1 — Source existence check: Verify that cited URLs, DOIs, or PubMed IDs resolve to real documents. This is automatable and should be done before any human reads the output. Services such as the Crossref REST API enable DOI verification programmatically.
Layer 2 — Passage grounding check: Verify that the specific quoted passage appears in the retrieved document. Vector similarity between the claim and the source passage can flag misattribution. This is also automatable but less reliable — a paraphrased claim that accurately represents the source may score lower than a misleading direct quote.
Layer 3 — Interpretation check: A human expert (or a second, independent AI agent) reviews whether the agent's interpretation of the source is reasonable given the full document context. This is the hardest layer to automate and the most important for scientific literature.
Layer 4 — Synthesis coherence check: Does the overall synthesis follow logically from the sources, or has the agent introduced claims not supported by any retrieved document? This requires reading the final report against the full source set — feasible for a human reviewer, difficult to automate reliably.
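The two automatable layers can be sketched briefly. For Layer 1, the code below asks the Crossref REST API whether a DOI is registered, on the understanding that its `works/{doi}` endpoint returns HTTP 200 for a known DOI and 404 otherwise. For Layer 2, it uses the longest shared character run between quote and document, a much cruder stand-in for the vector similarity mentioned above. The function names `crossref_status`, `doi_exists`, and `grounding_score` are hypothetical, and the lookup is injectable so the logic can be exercised without network access.

```python
import urllib.error
import urllib.parse
import urllib.request
from difflib import SequenceMatcher
from typing import Callable


def crossref_status(doi: str) -> int:
    """Return the HTTP status Crossref gives for this DOI (requires network)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def doi_exists(doi: str, status_lookup: Callable[[str], int] = crossref_status) -> bool:
    """Layer 1: does the cited DOI resolve to a registered work?"""
    return status_lookup(doi) == 200


def grounding_score(quote: str, document: str) -> float:
    """Layer 2: fraction of the quote covered by its longest run found in the document."""
    q = " ".join(quote.lower().split())
    d = " ".join(document.lower().split())
    if not q:
        return 0.0
    matcher = SequenceMatcher(None, q, d, autojunk=False)
    match = matcher.find_longest_match(0, len(q), 0, len(d))
    return match.size / len(q)


document = "Omega-3 supplementation reduced major cardiovascular events in the treatment arm."
exact = grounding_score("reduced major cardiovascular events", document)
unrelated = grounding_score("the moon is made of green cheese", document)
```

A verbatim quote scores 1.0 and an unrelated sentence scores near zero, but an accurate paraphrase would also score low. That is the same weakness the Layer 2 description warns about, and it is why Layers 3 and 4 cannot be skipped.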
Six non-existent legal precedents were submitted to a federal court, generated by ChatGPT without retrieval grounding. The case established in legal practice that AI-generated citations must be independently verified before submission — a verification norm that research agents, by design, help satisfy through retrieval, but which still requires human confirmation of interpretation accuracy.
A well-designed research agent should communicate not just findings but confidence levels. When the evidence base is thin — only one or two small studies — the agent should say so explicitly rather than presenting findings with the same grammatical confidence as a Cochrane review of 50 RCTs. This is called calibrated uncertainty communication.
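One way to make this concrete is to tie the wording of a finding to the size of the evidence base behind it. The sketch below is purely illustrative: the `hedge` function and its thresholds (two studies, 500 participants, ten studies) are invented, and study count alone is a poor proxy for evidence quality.

```python
def hedge(finding: str, n_studies: int, total_participants: int) -> str:
    """Phrase a finding with confidence language matched to the evidence base."""
    if n_studies == 0:
        return f"No studies were found on whether {finding}."
    basis = f"({n_studies} studies, {total_participants} participants)"
    # Hypothetical thresholds; a real system would also weigh study quality.
    if n_studies <= 2 or total_participants < 500:
        return f"There is limited evidence that {finding} {basis}."
    if n_studies < 10:
        return f"The evidence suggests that {finding} {basis}."
    return f"A substantial body of evidence indicates that {finding} {basis}."


thin = hedge("omega-3 reduces cardiovascular events", 1, 80)
broad = hedge("omega-3 reduces cardiovascular events", 50, 40000)
```

The same finding is reported in two different registers, and printing the basis in parentheses lets the reader check the hedge against the numbers instead of taking it on trust.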
Anthropic's Constitutional AI paper (2022) and its subsequent model cards describe how Claude is trained to express uncertainty with hedged language ("the evidence suggests…", "one study found…", "there is limited evidence that…") rather than asserting findings categorically. This epistemic humility is designed into the training objective, not layered on as a post-hoc filter. Research agents built on such models inherit some of this calibration, but users must be trained to read the hedges as meaningful signal rather than stylistic decoration.
The practical implication: when procuring or evaluating a research agent, test explicitly for calibration by asking it about a topic where the evidence is genuinely mixed or limited. A well-calibrated agent will reflect the uncertainty; a poorly calibrated one will sound equally confident on a well-established fact and on a contested claim with one small supporting study.
Several developments are converging to expand research agent capability in 2025 and beyond. Extended context windows — GPT-4 Turbo's 128K token context, Gemini 1.5 Pro's 1M token context — allow agents to ingest entire research corpora in a single context, reducing the need for iterative retrieval loops. Tool-use and code execution allow agents to run statistical analyses on retrieved data rather than just summarising qualitative claims. And multi-modal agents can now read charts, figures, and tables in PDF papers — addressing a major limitation of earlier text-only research agents that missed visual evidence entirely.
OpenAI's Deep Research feature, released in February 2025 for ChatGPT Pro subscribers, represents the current commercial frontier. It runs autonomously for 5–30 minutes on complex questions, executes dozens of searches, reads full documents, and produces cited reports of 1,000–5,000 words. Early benchmarks showed it outperforming PhD-level human researchers on structured information retrieval tasks — while still requiring expert review for interpretation and synthesis quality.
Research agents do not solve the trust problem; they transform it. Instead of worrying about whether an AI fabricated its citations, you worry about whether it correctly interpreted real ones. The verification burden shifts from existence checks, which can be automated, to interpretation checks, which still require human judgement. Design your workflows accordingly.
A pharmaceutical company wants to use an AI research agent to produce competitive landscape reports on drug pipeline developments. Design the full verification protocol before the reports go to the executive team. Engage in at least three exchanges with the assistant.