Module 5 · Lesson 1

What Research Agents Do

From search box to autonomous investigator — how AI agents replaced the single query.

What separates a research agent from a search engine, and why does that gap matter?

When Microsoft integrated GPT-4 into Bing in February 2023, it didn't simply attach a chat window to a search index. The resulting system — internally called Sydney during testing — could issue follow-up search queries based on its own intermediate conclusions, synthesize results across multiple pages, and present a single coherent answer rather than a list of links. That architecture was a public, mass-scale deployment of what researchers had been calling an agentic loop: perceive, reason, act, observe, repeat.

Within weeks, rival teams at Google, Perplexity AI, and a dozen startups were racing to ship similar systems. The era of the research agent had arrived not in a laboratory but in a browser tab.

The Anatomy of a Research Agent

A research agent is an AI system that accepts an open-ended informational goal and autonomously plans and executes a sequence of information-gathering and synthesis steps to satisfy it. Unlike a single-shot language model query — where you ask once and receive one answer — a research agent operates in a loop.

The loop has four canonical stages: decompose (break the goal into sub-questions), retrieve (search web, databases, files, APIs), evaluate (judge source quality and relevance), and synthesize (combine findings into a coherent output). The agent may iterate this loop dozens of times before returning a result, adjusting its plan based on what it finds.

Stage 1

Decompose

Break a broad question into tractable sub-questions. Perplexity's internal "query fan-out" typically generates 3–7 parallel sub-queries per user question.

Stage 2

Retrieve

Issue searches or API calls. OpenAI's Deep Research (Feb 2025) was reported to browse dozens of URLs per task, reading full page content rather than just snippets.

Stage 3

Evaluate

Score retrieved content for relevance and credibility. Agents may re-query if results are thin, contradictory, or from low-authority sources.

Stage 4

Synthesize

Produce a structured output (report, table, citation list) grounded in the retrieved evidence. Tying each claim to a retrieved passage makes errors checkable and tends to reduce hallucination compared to closed-context generation.
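The four stages above can be sketched as a small Python loop. This is a minimal illustration, not how any product named in this lesson is implemented: the `Passage` type, the `CORPUS` list, and all four stage functions are hypothetical stand-ins, and retrieval is a plain substring match rather than a live search.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str   # identifier of the document the text came from
    text: str     # retrieved content
    score: float  # assumed relevance/credibility estimate in [0, 1]


# Hypothetical in-memory corpus standing in for web search, databases, or APIs.
CORPUS = [
    Passage("doc-a", "Watson won Jeopardy! in 2011.", 0.9),
    Passage("doc-b", "WebGPT answered long-form questions with citations.", 0.8),
    Passage("doc-c", "Watson-themed marketing copy.", 0.1),
]


def decompose(goal: str) -> list[str]:
    """Stage 1: split the goal into sub-questions (naively, on ' and ')."""
    return [part.strip() for part in goal.split(" and ") if part.strip()]


def retrieve(sub_question: str) -> list[Passage]:
    """Stage 2: return passages whose text mentions the sub-question."""
    needle = sub_question.lower()
    return [p for p in CORPUS if needle in p.text.lower()]


def evaluate(passages: list[Passage], min_score: float) -> list[Passage]:
    """Stage 3: keep only passages at or above the quality threshold."""
    return [p for p in passages if p.score >= min_score]


def synthesize(findings: list[Passage]) -> str:
    """Stage 4: combine the surviving passages into a cited summary."""
    return " ".join(f"{p.text} [{p.source}]" for p in findings)


def research(goal: str, min_score: float = 0.5) -> str:
    findings: list[Passage] = []
    for sub_question in decompose(goal):
        findings.extend(evaluate(retrieve(sub_question), min_score))
    return synthesize(findings)


report = research("Watson and WebGPT")
```

Here `doc-c` is retrieved for the "Watson" sub-question but dropped by `evaluate` because of its low score. A fuller agent would repeat the loop, re-querying whenever a sub-question is left without usable evidence.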

A Short History of the Category

The concept predates large language models. IBM's Watson defeated Jeopardy champions in 2011 by combining information retrieval with passage scoring — a primitive but recognizable research pipeline. What changed with transformer-based LLMs was the ability to reason flexibly about retrieved text rather than just pattern-match it.

In December 2021, OpenAI published WebGPT, demonstrating a fine-tuned GPT-3 variant that could browse the web using a text-based browser and answer long-form questions with citations. Evaluated on questions from the ELI5 dataset, the best WebGPT model's answers were preferred over those written by human demonstrators 56% of the time, a landmark result that directly influenced the design of subsequent commercial research agents.

By 2024, the category had fragmented into specialist variants: academic search agents (Elicit, Consensus), legal research agents (Harvey AI, Westlaw AI), competitive intelligence agents (Crayon, AlphaSense AI), and general-purpose deep research agents (Perplexity Pro, Gemini Deep Research, OpenAI Deep Research).

Key Distinction

Search engines retrieve and rank documents. Research agents reason across documents. The difference is not speed or coverage — it is whether the system can form novel conclusions that appear in none of the individual source documents.

Key Terms

Research Agent: An AI system that autonomously plans and executes multi-step information-gathering and synthesis to answer an open-ended goal.

Agentic Loop: The perceive → reason → act → observe cycle that an agent repeats until it reaches its termination condition.

Query Fan-Out: The generation of multiple parallel sub-queries from a single user question, increasing recall at the cost of latency and compute.

Grounded Generation: Text synthesis anchored to specific retrieved passages, with citations enabling verification; contrast with closed-context generation from memory alone.

Lesson 1 Quiz

What Research Agents Do — four questions

1. Which OpenAI system, published in December 2021, demonstrated a fine-tuned GPT-3 variant that browsed the web and answered questions with citations?

Correct. WebGPT (2021) used a text-based browser and human feedback to answer long-form questions; its answers were preferred over those of human demonstrators 56% of the time.

Not quite. OpenAI's WebGPT (2021) was the landmark system that combined web browsing with citation-grounded generation using a fine-tuned GPT-3 variant.

2. What is "query fan-out" in the context of research agents?

Correct. Query fan-out splits one high-level question into several targeted sub-queries, increasing recall. Perplexity typically generates 3–7 per question.

Not quite. Query fan-out means generating multiple parallel sub-queries from a single input to increase the breadth of information retrieved.

3. What is the primary advantage of grounded generation over closed-context generation?

Correct. Grounded generation ties claims to specific source documents with citations, allowing humans to verify and substantially reducing confabulated content.

Not quite. The key advantage is verifiability — outputs are anchored to retrieved passages with citations, making errors detectable and reducing hallucination rates.

4. In the four-stage research agent loop, what happens in the "evaluate" stage?

Correct. Evaluation assesses the quality of retrieved material before synthesis — the agent may re-query if results are thin, contradictory, or low-authority.

Not quite. The evaluate stage is where the agent scores retrieved content for relevance and credibility, deciding whether to proceed to synthesis or loop back for more retrieval.

Lab 1 — Mapping the Research Loop

Discuss research agent architecture with your AI lab assistant (3+ exchanges to complete)

Your Task

You've learned the four stages of a research agent loop: decompose, retrieve, evaluate, synthesize. In this lab, explore how those stages interact in practice. Ask about real systems, edge cases, failure modes, or design tradeoffs.

Suggested opener: "If a research agent decomposes a question into 6 sub-queries and two of them return contradictory information, what should the evaluate stage do?"

Research Agent Architecture Lab

Welcome to Lab 1. I'm your AESOP lab assistant for this module on research agents. We can explore how the decompose → retrieve → evaluate → synthesize loop works in systems like Perplexity, WebGPT, or OpenAI Deep Research. What aspect would you like to dig into?

Module 5 · Lesson 2

Real-World Deployments

How Perplexity, Google Gemini, and OpenAI Deep Research actually work in production.

What architectural choices define the leading research agent products, and what have their early failures revealed?

On February 2, 2025, OpenAI released Deep Research as a feature within ChatGPT Pro. In its launch blog post, the company reported that the agent scored 26.6% on Humanity's Last Exam — a benchmark designed to stump expert humans — compared to 3–4% for standard GPT-4o. The agent was described as capable of spending five to thirty minutes autonomously browsing the web, reading academic papers, and producing multi-thousand-word reports with inline citations.

The reaction from the research community was immediate. Independent testers praised the depth of outputs but documented cases where the agent cited paywalled papers it had clearly not read in full, instead inferring their content from abstracts and surrounding commentary. The quality gap between what it appeared to know and what it had actually retrieved became a central thread in AI-safety discussions throughout early 2025.

Perplexity AI — Architecture Overview

Perplexity AI, founded in 2022 and valued at approximately $9 billion by late 2024, built its product on a retrieval-augmented generation (RAG) architecture but with a critical addition: an online retrieval step that issues live web searches rather than querying a static pre-indexed corpus. Each user query triggers a query fan-out, parallel Bing API calls, result scraping, passage extraction, and finally synthesis via a fine-tuned LLM.

Perplexity's Pro Search mode, released in 2023, added multi-hop retrieval — where intermediate answers generate new queries. A question about "the effect of the 2023 Silicon Valley Bank collapse on venture capital dry powder" would trigger sub-queries about SVB's failure, about LP capital calls, and about Q2/Q3 2023 VC funding data, with each strand informing the final synthesis.
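Query fan-out and multi-hop retrieval can be sketched together in a few lines of Python. This is an illustrative sketch only: the `FAKE_INDEX` dictionary, the `search` stub, and the `follow_ups` rule are invented for the example and say nothing about Perplexity's actual pipeline. The sub-queries run in parallel, and each hop's results may propose new queries for the next hop.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical canned results standing in for live search calls.
FAKE_INDEX = {
    "SVB failure": ["(placeholder passage about SVB's failure)"],
    "LP capital calls": ["(placeholder passage about LP capital calls)"],
    "2023 VC funding": ["(placeholder passage about 2023 VC funding)"],
}


def search(query: str) -> list[str]:
    """Stand-in for one search API call."""
    return FAKE_INDEX.get(query, [])


def fan_out(queries: list[str]) -> dict[str, list[str]]:
    """Run all sub-queries in parallel and key the results by query."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(search, queries))
    return dict(zip(queries, results))


def multi_hop(initial, derive_follow_ups, max_hops=2):
    """Repeat fan-out, letting each hop's results propose new queries."""
    seen: dict[str, list[str]] = {}
    frontier = list(initial)
    for _ in range(max_hops):
        # Drop duplicates and anything already answered.
        frontier = [q for q in dict.fromkeys(frontier) if q not in seen]
        if not frontier:
            break
        batch = fan_out(frontier)
        seen.update(batch)
        frontier = [f for q, hits in batch.items() for f in derive_follow_ups(q, hits)]
    return seen


def follow_ups(query, passages):
    # Hypothetical rule: the first strand spawns the two follow-up strands.
    if query == "SVB failure" and passages:
        return ["LP capital calls", "2023 VC funding"]
    return []


evidence = multi_hop(["SVB failure"], follow_ups)
```

The `max_hops` cap matters in practice: without it, follow-up queries can keep generating further follow-ups, which is where the latency and compute cost of fan-out comes from.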

The company faced significant controversy in July 2024 when Wired and Forbes reported that Perplexity's web crawler was ignoring robots.txt directives — the industry-standard protocol for sites to opt out of crawling. Several news publishers, including Condé Nast and The Wall Street Journal, sent cease-and-desist letters. The episode illustrated a core tension in research agent design: comprehensive retrieval conflicts directly with publisher consent frameworks.
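For readers unfamiliar with robots.txt, the check a well-behaved crawler performs before fetching a page can be sketched with Python's standard-library `urllib.robotparser`. The robots.txt content, the `ExampleResearchBot` user agent, and the URLs below are made up for illustration; they are not taken from any real site or from Perplexity's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is barred from /premium/,
# every other agent is allowed everywhere.
ROBOTS_TXT = """\
User-agent: ExampleResearchBot
Disallow: /premium/

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the site's rules permit this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
blocked = may_fetch(parser, "ExampleResearchBot", "https://news.example.com/premium/story")
allowed = may_fetch(parser, "ExampleResearchBot", "https://news.example.com/free/story")
```

Nothing in the protocol enforces the result: honoring `blocked` is a choice made in the crawler's code, which is why the reported behavior became a consent question rather than a purely technical one.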

Documented Incident

In July 2024, Wired and Forbes published investigations showing Perplexity's crawler disregarding robots.txt on major news sites. This sparked an industry debate about whether research agents require a new consent standard beyond existing crawl protocols.

Google Gemini Deep Research

Google launched Gemini Deep Research in December 2024, integrated into Gemini Advanced (the $20/month tier). Unlike Perplexity's streaming approach, Deep Research first presents a research plan — a structured outline of the sub-questions it intends to investigate — and asks the user to approve or modify it before execution. This human-in-the-loop checkpoint at the planning stage was a deliberate design choice to increase user trust and reduce wasted compute on misdirected research plans.
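A plan-review checkpoint of this kind can be sketched as a gate between planning and execution. The function names and the callback-based design here are assumptions made for illustration, not Google's implementation; the point is only that no retrieval step runs until the reviewer has returned a plan.

```python
from typing import Callable, Optional


def run_with_plan_review(
    question: str,
    propose_plan: Callable[[str], list[str]],
    review: Callable[[list[str]], Optional[list[str]]],
    execute: Callable[[str], str],
) -> Optional[list[str]]:
    """Propose a plan, let a human edit or reject it, then execute it."""
    plan = propose_plan(question)
    approved = review(plan)  # edited plan, or None to cancel
    if approved is None:
        return None  # nothing was executed, so no compute was spent
    return [execute(step) for step in approved]


# Hypothetical stubs for the three roles.
def propose(question: str) -> list[str]:
    return [f"Background on {question}", f"Recent data on {question}", f"Criticism of {question}"]


def drop_last_step(plan: list[str]) -> list[str]:
    return plan[:-1]  # the reviewer removes a sub-question they do not need


executed: list[str] = []


def execute(step: str) -> str:
    executed.append(step)
    return f"notes for: {step}"


results = run_with_plan_review("heat pumps", propose, drop_last_step, execute)
```

Because the reviewer removed the third step, only two steps are ever executed.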

Google's version leverages its proprietary Knowledge Graph alongside live Search, giving it a structured-data layer that pure web-scraping agents lack. When researching entities (companies, people, legislation), Gemini can pull verified structured facts before turning to free-text retrieval — substantially reducing entity-level hallucinations.

OpenAI Deep Research

OpenAI's Deep Research, powered by a fine-tuned version of o3, was notable for its integration of Python code execution within the research loop. The agent can download datasets, run statistical analyses, generate charts, and embed the results in its final report — moving beyond pure text synthesis into what OpenAI called agentic data analysis. In early benchmark comparisons, it substantially outperformed Gemini Deep Research on tasks requiring numerical reasoning over retrieved tables.

The system's main limitation, documented by multiple independent reviewers in February–March 2025, was citation drift: the tendency to generate accurate high-level conclusions while assigning those conclusions to sources that contained only tangentially related material. The underlying cause — that the synthesis LLM and the retrieval system share an imperfect handoff — remains an open research problem.
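One crude way to surface possible citation drift is to check how much of a claim's wording actually appears in the passage it cites. The sketch below uses simple word overlap, which is an assumption chosen for brevity: it is far weaker than the entailment-style checks a production system would need, and the threshold, stopword list, and example strings are all invented.

```python
import re

# Small, hand-picked stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "by", "for", "that"}


def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def support_score(claim: str, passage: str) -> float:
    """Fraction of the claim's content words that appear in the cited passage."""
    claim_words = content_words(claim)
    if not claim_words:
        return 0.0
    return len(claim_words & content_words(passage)) / len(claim_words)


def flag_drift(citations: list[dict], threshold: float = 0.5) -> list[dict]:
    """Return citations whose passage shares too little wording with the claim."""
    return [c for c in citations if support_score(c["claim"], c["passage"]) < threshold]


# Invented example data.
citations = [
    {"claim": "Revenue grew 12 percent in 2023",
     "passage": "The company reported that revenue grew 12 percent in 2023."},
    {"claim": "Revenue grew 12 percent in 2023",
     "passage": "The company opened a new office in Berlin."},
]
suspect = flag_drift(citations)
```

Only the second citation is flagged: its passage is from a plausible source but says nothing about the claim, which is the pattern reviewers described.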

System	Plan Review	Code Execution	Primary Retrieval	Known Weakness
Perplexity Pro	None	No	Live web (Bing API + scraper)	robots.txt violations; thin citations
Gemini Deep Research	Yes — approve plan before execution	No	Google Search + Knowledge Graph	Slower; plan approval adds friction
OpenAI Deep Research	Partial	Yes (Python)	Live web browser + file uploads	Citation drift on complex synthesis tasks

Design Insight

The human-in-the-loop checkpoint that Gemini Deep Research places at the planning stage — rather than at the output stage — is architecturally significant. It costs almost nothing (a ten-second approval) but eliminates entire branches of wasted research caused by misunderstood intent.

Lesson 2 Quiz

Real-World Deployments — four questions

1. What score did OpenAI's Deep Research achieve on Humanity's Last Exam at its February 2025 launch?

Correct. OpenAI reported a 26.6% score on Humanity's Last Exam at launch, compared to 3–4% for standard GPT-4o — a substantial improvement on expert-level questions.

Not quite. OpenAI reported 26.6% on Humanity's Last Exam. Standard GPT-4o scored only 3–4% on the same benchmark.

2. What controversy did Perplexity AI face in July 2024?

Correct. Wired and Forbes documented that Perplexity's crawler was disregarding robots.txt, prompting cease-and-desist letters from publishers including Condé Nast and The Wall Street Journal.

Not quite. The July 2024 controversy involved Perplexity's web crawler ignoring robots.txt directives — the opt-out standard for web crawling — on major news publisher sites.

3. What unique human-in-the-loop checkpoint does Gemini Deep Research include that Perplexity Pro does not?

Correct. Gemini Deep Research presents a structured research plan and asks the user to approve or modify it before any retrieval begins — placing oversight at the planning stage.

Not quite. Gemini Deep Research's distinctive feature is showing the user a research plan outline and requesting approval before execution — a checkpoint at the planning stage rather than the output stage.

4. What is "citation drift" as documented in early evaluations of OpenAI's Deep Research?

Correct. Citation drift refers to the system generating high-level conclusions that are defensible but assigning them to source documents that support only peripheral aspects of the claim — a retrieval-synthesis handoff failure.

Not quite. Citation drift describes accurate-sounding conclusions being attributed to sources that don't actually support them well — an imperfect handoff between the retrieval and synthesis components.

Lab 2 — Comparing Research Agent Products

Discuss design tradeoffs across Perplexity, Gemini, and OpenAI Deep Research (3+ exchanges)

Your Task

You've studied three deployed research agent systems with distinct architectures and documented failure modes. Use this lab to explore their tradeoffs, ask about real incidents, or think through how you would design around their weaknesses.

Suggested opener: "Why would Google add a plan-approval step when it slows things down? What problem does it actually solve that Perplexity's design doesn't?"

Research Agent Products Lab

Welcome to Lab 2. I can help you compare the architectural choices behind Perplexity Pro, Gemini Deep Research, and OpenAI Deep Research — including their retrieval methods, known failure modes, and the design reasoning behind each. What would you like to explore?

Module 5 · Lesson 3

Specialized Research Agents

When the general-purpose loop isn't enough — vertical deployments in science, law, and intelligence.

How do domain-specific constraints reshape the design of a research agent, and what have real deployments revealed about the limits of general-purpose systems?

In February 2023, the legal research firm Harvey AI, backed by a16z and OpenAI, announced a partnership with Allen & Overy, one of the world's largest law firms. The deployment gave approximately 3,500 lawyers access to a research agent trained on legal corpora and integrated with PLC, Practical Law, and internal case databases. Partners reported that associates were using Harvey to draft research memos in hours rather than days.

But the deployment also surfaced a problem familiar to legal professionals: hallucinated citations. Unlike a general research task where a plausible but invented source is merely embarrassing, in legal work a non-existent case citation submitted to a court is a sanctionable ethical violation. Allen & Overy implemented mandatory human review of all Harvey-generated citations before any external use — a workflow layer that general-purpose research agents had never needed to design for.

Academic Research: Elicit and Consensus

Elicit, built by Ought (and later spun out as its own company), launched its research assistant in 2022 targeting systematic-review workflows in academic research. Its key architectural choice was restricting retrieval to Semantic Scholar's corpus of 200+ million academic papers rather than the open web. This sacrificed breadth for precision and traceability: every claim could be traced to a specific paper with a DOI.

Elicit's agent can extract structured data from papers (sample sizes, effect sizes, study designs), populate comparison tables across dozens of papers, and flag methodological inconsistencies. In 2023, a team at the University of Pennsylvania published an evaluation showing Elicit's paper-screening recall on systematic reviews was competitive with trained research assistants, while reducing screening time by roughly 65%.

Consensus took a similar academic focus but emphasized claim consensus scoring — aggregating evidence across papers to rate whether a given scientific claim has strong, moderate, weak, or conflicting support. Launched publicly in 2023, it gained rapid adoption among medical students and science journalists seeking quick literature overviews.
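The idea of rating a claim from the stances of individual papers can be sketched as a small aggregation function. The stance labels, the thresholds, and the extra "insufficient" category are assumptions made for this example; the lesson does not describe how Consensus actually computes its ratings.

```python
from collections import Counter


def consensus_label(stances: list[str]) -> str:
    """Rate a claim from per-paper stances: 'support', 'contradict', or 'neutral'."""
    counts = Counter(stances)
    decided = counts["support"] + counts["contradict"]
    if decided == 0:
        return "insufficient"  # no paper takes a position either way
    support_share = counts["support"] / decided
    if support_share >= 0.9:
        return "strong"
    if support_share >= 0.7:
        return "moderate"
    # A sizeable minority on either side means the literature disagrees.
    minority_share = min(counts["support"], counts["contradict"]) / decided
    if minority_share >= 0.3:
        return "conflicting"
    return "weak"


rating = consensus_label(["support"] * 7 + ["contradict"] * 3 + ["neutral"] * 2)
```

Neutral papers are ignored when computing the share, so a claim with many off-topic hits is not rated as weaker than one with few.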

Legal Research: Harvey AI and Westlaw AI

Harvey AI's architecture is notable for its retrieval-augmented approach using structured legal databases rather than web crawling. Case law, statutes, and regulatory filings are indexed with structured metadata (jurisdiction, date, court level, citing relationships), enabling the agent to apply jurisdictional filters and precedent hierarchies that a general web search cannot replicate.
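Structured metadata is what makes jurisdictional filters and precedent ordering possible. The following sketch shows the general idea with an invented `CaseRecord` type, placeholder jurisdictions, and a hypothetical three-level court ranking; it is not a description of Harvey's schema.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical court hierarchy: lower number means more authoritative.
COURT_RANK = {"highest": 0, "intermediate": 1, "trial": 2}


@dataclass(frozen=True)
class CaseRecord:
    name: str
    jurisdiction: str
    court_level: str
    decided: date


def filter_and_rank(cases: list[CaseRecord], jurisdiction: str) -> list[CaseRecord]:
    """Keep cases from one jurisdiction, ordered by court level, then recency."""
    in_scope = [c for c in cases if c.jurisdiction == jurisdiction]
    return sorted(in_scope, key=lambda c: (COURT_RANK[c.court_level], -c.decided.toordinal()))


# Invented records for illustration.
cases = [
    CaseRecord("Case A", "Jurisdiction X", "trial", date(2021, 1, 1)),
    CaseRecord("Case B", "Jurisdiction X", "highest", date(2015, 1, 1)),
    CaseRecord("Case C", "Jurisdiction Y", "highest", date(2022, 1, 1)),
    CaseRecord("Case D", "Jurisdiction X", "trial", date(2023, 1, 1)),
]
ranked = [c.name for c in filter_and_rank(cases, "Jurisdiction X")]
```

An older decision from the highest court outranks newer trial-level decisions, and the out-of-jurisdiction case never reaches the synthesis step at all.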

Thomson Reuters launched Westlaw AI in 2024, integrating generative AI into its legacy legal research platform with a key differentiator: the system highlights the specific sentences within retrieved cases that support each generated answer. This transparency feature — showing exactly what text the synthesis was grounded in — was a direct response to the citation drift problem that had plagued earlier legal AI deployments.

Competitive Intelligence: AlphaSense

AlphaSense, a financial research platform used by investment banks and hedge funds, deployed generative AI search in 2023 across a corpus of earnings call transcripts, SEC filings, broker research, and news. Its research agent can answer questions like "What have pharmaceutical executives said about GLP-1 manufacturing capacity in the last 90 days?" by retrieving and synthesizing across hundreds of documents that no human analyst could read in real time.

The platform's central design challenge is temporal precision. In financial research, a claim from a Q2 2023 earnings call is materially different from the same claim in Q3 2024. AlphaSense built strict date-range filtering into every retrieval step and surfaces the document date prominently in every citation — a design choice that general research agents often neglect.
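Date-range filtering and date-stamped citations are simple to express in code. This sketch assumes a hypothetical `Document` type and invented sample documents; it illustrates the design choice described above rather than AlphaSense's implementation.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Document:
    title: str
    published: date
    text: str


def in_window(docs: list[Document], as_of: date, days: int) -> list[Document]:
    """Keep only documents published within the last `days` days before `as_of`."""
    start = as_of - timedelta(days=days)
    return [d for d in docs if start <= d.published <= as_of]


def cite(doc: Document) -> str:
    """Render a citation that always carries the document date."""
    return f"{doc.title} ({doc.published.isoformat()})"


# Invented documents for illustration.
docs = [
    Document("Q3 2024 earnings call", date(2024, 9, 1), "(placeholder transcript)"),
    Document("Q2 2023 earnings call", date(2023, 7, 15), "(placeholder transcript)"),
]
recent = in_window(docs, as_of=date(2024, 10, 1), days=90)
citations = [cite(d) for d in recent]
```

Passing `as_of` explicitly, rather than reading the system clock inside the function, keeps the filter reproducible when an analysis is rerun later.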

Academic

Elicit / Consensus

Restrict retrieval to peer-reviewed corpora. Elicit reduced systematic-review screening time ~65% in University of Pennsylvania evaluation (2023).

Legal

Harvey AI / Westlaw AI

Westlaw AI highlights source sentences to address citation drift. Harvey required mandatory human citation review after hallucinated case law risks emerged at Allen & Overy.

Financial

AlphaSense

Strict temporal metadata on every citation. Retrieves across earnings transcripts, SEC filings, and broker research with date-range filtering baked into every query.

Cross-Domain

Common Thread

Every specialized vertical added domain-specific retrieval constraints and provenance transparency layers that general-purpose agents lacked at launch.

Pattern

The recurring lesson from specialized research agent deployments is that restricting the retrieval corpus — trading breadth for precision — is often the single most impactful design decision. General agents browse the open web; specialized agents browse curated, structured, high-authority corpora with rich metadata.

Lesson 3 Quiz

Specialized Research Agents — four questions

1. Why did Allen & Overy implement mandatory human review of all Harvey AI citation outputs?

Correct. In legal practice, citing a hallucinated (non-existent) case in court filings is an ethical violation subject to sanctions — a stakes level that demanded human verification of all citations before external use.

Not quite. The mandatory review was driven by the specific risk that AI-hallucinated case citations submitted to courts constitute sanctionable professional misconduct — a legal-domain consequence that general research tools don't face.

2. What was Elicit's key architectural choice that distinguished it from general-purpose research agents?

Correct. Elicit restricted its retrieval to Semantic Scholar's 200M+ paper corpus, sacrificing web breadth for academic precision and full traceability to DOI-identified sources.

Not quite. Elicit's defining architectural choice was retrieval restricted to the Semantic Scholar academic corpus — every claim traceable to a peer-reviewed paper with a DOI, not to random web pages.

3. What transparency feature did Westlaw AI add specifically to address citation drift?

Correct. Westlaw AI highlights the exact sentences grounding each answer, making it immediately visible whether a cited case actually contains the claimed material — directly countering citation drift.

Not quite. Westlaw AI's transparency feature is sentence-level highlighting within retrieved cases — showing users exactly which text the synthesis is grounded in, making citation drift visible at a glance.

4. Why is temporal precision especially critical in AlphaSense's financial research context?

Correct. In financial research, when a statement was made determines its current relevance and materiality. Mixing recent and outdated information without clear dates could lead to seriously flawed investment analysis.

Not quite. Temporal precision matters because a company executive's statement about manufacturing capacity from 18 months ago may be completely obsolete today — date context is a core component of the information's meaning in financial research.

Lab 3 — Designing a Specialized Research Agent

Think through domain-specific design decisions with your AI lab assistant (3+ exchanges)

Your Task

You've seen how legal, academic, and financial domains each required specialized architectural choices — restricted corpora, provenance transparency, temporal metadata. Apply this thinking to a domain of your choice or dig deeper into the examples from the lesson.

Suggested opener: "I want to design a research agent for clinical trial data. What are the most important domain-specific design decisions I'd need to make that a general-purpose agent like Perplexity wouldn't handle well?"

Specialized Research Agent Design Lab

Welcome to Lab 3. I can help you think through what makes a research agent genuinely useful in a specific domain — whether that's medicine, law, finance, scientific literature, or something else. What domain would you like to design for, or would you like to analyze one of the systems from the lesson in more depth?

Module 5 · Lesson 4

Reliability, Trust, and the Future

Hallucination rates, evaluation benchmarks, and what comes after the first generation of research agents.

How do we measure whether a research agent actually works, and what systemic problems remain unsolved?

In 2023, New York attorney Steven Schwartz submitted a legal brief in Mata v. Avianca to the Southern District of New York that cited six non-existent cases, all generated by ChatGPT, which Schwartz had asked to find supporting precedents. When opposing counsel could not locate the cases, Judge Kevin Castel ordered Schwartz to explain. Schwartz had not verified any citation. The court imposed $5,000 in sanctions and required submission of the ChatGPT conversation logs as evidence.

The incident became the most widely reported AI failure of 2023 and crystallized a fundamental reliability question: not whether AI agents produce wrong answers occasionally, but whether the form of their wrong answers — plausible, confidently stated, indistinguishable from correct answers — makes them categorically more dangerous than traditional errors.

Measuring Research Agent Reliability

Evaluating research agents is harder than evaluating standard LLMs because the output quality depends on both retrieval quality and synthesis quality — and failures in one can mask or amplify failures in the other.

The research community has converged on several benchmark categories. Attribution benchmarks (such as ALCE, Automatic LLMs' Citation Evaluation) measure whether each claim in a generated response is actually supported by the cited source passage. Factuality benchmarks measure whether the generated claims are objectively true. End-to-end task benchmarks, like the FRAMES benchmark published by Google DeepMind in 2024, measure whether the agent successfully answers complex multi-hop questions requiring synthesis across multiple documents.
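Attribution scores of this kind reduce to two simple ratios once each citation has been judged as supporting or not supporting its statement. The sketch below is a simplified version of citation recall and precision; how the support judgments are produced (human raters or an entailment model) is outside the sketch, and the input format is an assumption.

```python
def citation_recall(judged: list[list[bool]]) -> float:
    """Share of statements backed by at least one supporting citation.

    `judged` holds one list per statement; each bool says whether that
    citation was judged to support the statement.
    """
    if not judged:
        return 0.0
    supported = sum(1 for cites in judged if any(cites))
    return supported / len(judged)


def citation_precision(judged: list[list[bool]]) -> float:
    """Share of all citations that actually support their statement."""
    all_cites = [supports for cites in judged for supports in cites]
    if not all_cites:
        return 0.0
    return sum(all_cites) / len(all_cites)


# Four statements: one well cited, one with an extra irrelevant citation,
# one citing only an unsupportive source, and one with no citation at all.
judged = [[True], [True, False], [False], []]
recall = citation_recall(judged)
precision = citation_precision(judged)
```

Neither number says whether a statement is true; a response can score perfectly on attribution while faithfully repeating a source that is itself wrong.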

On the FRAMES benchmark, which tested 824 challenging questions requiring information from multiple Wikipedia articles, frontier research agents scored between 40% and 66% accuracy in 2024 evaluations — well below the human expert baseline of approximately 90%.

Key Benchmark

Google DeepMind's FRAMES benchmark (2024) tests 824 multi-hop questions requiring synthesis across multiple documents. Frontier research agents scored 40–66% vs. ~90% human expert accuracy — the gap quantifying the remaining reliability problem in complex research synthesis.

Hallucination Mitigation Strategies

Research teams have identified several approaches that measurably reduce hallucination in research agents. Retrieval-augmented generation with citation constraints — requiring every claim to have a retrieved source passage — reduces hallucination rates substantially compared to closed-context generation. A 2023 study from Meta AI found RAG reduced hallucination on knowledge-intensive tasks by approximately 45% compared to standard generation.

Self-consistency checking, where the agent runs multiple independent retrieval-synthesis passes and compares outputs, has been shown to catch approximately 30% of hallucinations in controlled evaluations — at the cost of 2–3x the compute. Inline uncertainty markers — having the agent explicitly flag low-confidence claims — improve user calibration even when they do not reduce the underlying error rate.
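Self-consistency checking can be sketched as majority voting over repeated passes. The `run_pass` callback, the agreement threshold, and the stub passes below are assumptions for illustration; real systems compare free-text answers, which requires a fuzzier notion of "same answer" than string equality.

```python
from collections import Counter
from typing import Callable


def self_consistency(
    run_pass: Callable[[str, int], str],
    question: str,
    n: int = 3,
    min_agreement: float = 0.6,
) -> dict:
    """Run n independent passes and flag the answer if too few passes agree."""
    answers = [run_pass(question, i) for i in range(n)]
    top_answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / n
    return {"answer": top_answer, "agreement": agreement, "flagged": agreement < min_agreement}


# Hypothetical stub passes: the index i stands in for a random seed.
def mostly_stable_pass(question: str, i: int) -> str:
    return ["A", "A", "B"][i]


def unstable_pass(question: str, i: int) -> str:
    return ["A", "B", "C"][i]


stable = self_consistency(mostly_stable_pass, "example question")
unstable = self_consistency(unstable_pass, "example question")
```

The cost is visible in the signature: every question triggers `n` full retrieval-synthesis passes, which is the compute multiplier mentioned above. A flagged result is also a natural place to attach an inline uncertainty marker.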

The deepest unsolved problem is systematic bias in retrieval: if the sources most easily retrieved (high PageRank, frequently cited) are systematically unrepresentative of the full evidence landscape, the agent's synthesis will reflect that bias confidently and without any hallucination — a failure mode that citation-grounding does not catch.

What Comes Next

Several research directions are moving from labs toward deployment. Agent-to-agent verification, which routes synthesized research through a second specialized agent whose role is adversarial critique, builds on critique-and-revise ideas such as those in Anthropic's Constitutional AI work and is being adapted for factual verification pipelines. Live database integration, connecting research agents directly to structured scientific databases (PubMed, ClinicalTrials.gov, EDGAR) rather than unstructured web text, substantially reduces retrieval noise in specialized domains.

The most consequential near-term development is likely memory and persistence: research agents that maintain a growing, curated knowledge base from prior sessions rather than starting from scratch each time. Microsoft's research division demonstrated a prototype "GraphRAG" system in 2024 that constructs a knowledge graph from prior retrieved documents, enabling the agent to answer new queries partly from its accumulated structured memory rather than full re-retrieval — reducing both latency and hallucination on questions it has "researched before."
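The memory idea can be illustrated with a toy graph store that is consulted before any retrieval happens. This is not Microsoft's GraphRAG: the `GraphMemory` class, the triple format, and the `fake_retrieve` stub are all invented, and the sketch only shows the "answer from accumulated memory, otherwise retrieve and remember" pattern.

```python
from collections import defaultdict


class GraphMemory:
    """Toy persistent memory mapping (subject, relation) to (object, source) pairs."""

    def __init__(self) -> None:
        self._edges = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str, source: str) -> None:
        self._edges[(subject, relation)].add((obj, source))

    def lookup(self, subject: str, relation: str) -> list:
        return sorted(self._edges.get((subject, relation), set()))


def answer(memory: GraphMemory, subject: str, relation: str, retrieve) -> tuple:
    """Answer from memory when possible; otherwise retrieve and remember."""
    known = memory.lookup(subject, relation)
    if known:
        return known, "memory"
    facts = retrieve(subject, relation)
    for obj, source in facts:
        memory.add(subject, relation, obj, source)
    return sorted(facts), "retrieval"


retrieval_calls = []


def fake_retrieve(subject: str, relation: str) -> list:
    # Hypothetical stand-in for a full retrieval pass; records each call.
    retrieval_calls.append((subject, relation))
    return [("Acme Corp", "doc-1")]


memory = GraphMemory()
first = answer(memory, "Widget", "made_by", fake_retrieve)
second = answer(memory, "Widget", "made_by", fake_retrieve)
```

The second call is served from memory with its source attached, so retrieval runs only once. Keeping the source on every edge matters: without it, remembered facts would lose the provenance that grounded generation depends on.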

The Core Tension

Research agents produce outputs that look like expertise — structured, cited, confident. This presentation quality is independent of the underlying accuracy. The Schwartz case showed that users who cannot independently verify outputs may trust the form over the substance. The unsolved UX problem is how to make AI uncertainty as visible as AI confidence.

Key Terms

Attribution Benchmark: Evaluation that measures whether each claim in an AI-generated response is actually supported by its cited source passage, as distinct from whether the claim is factually true.

FRAMES: Google DeepMind's 2024 benchmark of 824 multi-hop questions requiring synthesis across multiple documents. Frontier agents scored 40–66%; human experts ~90%.

Self-Consistency Checking: Running multiple independent research passes and comparing outputs to detect likely hallucinations; catches ~30% of errors at 2–3x compute cost.

GraphRAG: Microsoft's prototype system (2024) that builds a persistent knowledge graph from prior retrieved documents, enabling faster and less error-prone re-answering of related queries.

Lesson 4 Quiz

Reliability, Trust, and the Future — four questions

1. What was the outcome for attorney Steven Schwartz in the 2023 Mata v. Avianca case?

Correct. Judge Kevin Castel sanctioned Schwartz $5,000 and required him to submit the ChatGPT conversation logs. The case became the most widely reported AI reliability failure of 2023.

Not quite. Attorney Schwartz was fined $5,000 by Judge Castel and required to produce the ChatGPT conversation logs — a landmark case for AI hallucination consequences in professional practice.

2. What score range did frontier research agents achieve on Google DeepMind's FRAMES benchmark in 2024?

Correct. On FRAMES, frontier research agents scored 40–66% accuracy versus approximately 90% for human experts — quantifying the substantial remaining gap on complex multi-hop synthesis tasks.

Not quite. On FRAMES (824 multi-hop questions), frontier agents scored 40–66% against a human expert baseline of ~90% — revealing how much room remains for improvement on complex research synthesis.

3. According to a 2023 Meta AI study, by approximately how much did retrieval-augmented generation reduce hallucination on knowledge-intensive tasks compared to standard generation?

Correct. Meta AI's 2023 research found RAG reduced hallucination approximately 45% on knowledge-intensive tasks compared to closed-context generation — a significant but not complete mitigation.

Not quite. Meta AI's 2023 study reported approximately 45% reduction in hallucination when using retrieval-augmented generation versus standard closed-context generation on knowledge-intensive tasks.

4. What is the unsolved failure mode that citation-grounding does NOT catch?

Correct. If high-PageRank or frequently-cited sources are systematically skewed, the agent's synthesis will faithfully reflect that skew — confidently, with valid citations — and citation-grounding will not flag this as an error.

Not quite. The deepest unsolved problem is retrieval bias: if the most easily found sources are unrepresentative of full evidence, the agent synthesizes that biased picture accurately — no hallucination, but still misleading. Citations don't reveal this.

Lab 4 — Evaluating and Improving Research Agents

Explore reliability strategies and evaluation methods with your AI lab assistant (3+ exchanges)

Your Task

You've studied hallucination mitigation strategies, the FRAMES benchmark, and unsolved problems in research agent reliability. Apply this to a scenario — evaluate a specific system, design a verification workflow, or probe the limits of current mitigation approaches.

Suggested opener: "If I'm building a research agent for a medical information service and I need it to achieve 95%+ citation accuracy, what combination of mitigation strategies would you recommend, and what are the realistic tradeoffs for each?"

Research Agent Reliability Lab

M5 · L4

Welcome to Lab 4. I can help you think through hallucination mitigation strategies, reliability benchmarks, verification workflows, and the unsolved problems in research agent trustworthiness. What reliability challenge would you like to explore?

Module 5 Test

Research Agents — 15 questions · 80% to pass

1. What is the defining characteristic that separates a research agent from a standard search engine?

Correct. Search engines retrieve and rank; research agents reason across retrieved material to produce novel synthesis — that reasoning capability is the categorical difference.

The key distinction is reasoning: research agents can synthesize novel conclusions across multiple documents, not just retrieve and rank them.

2. In the canonical four-stage research agent loop, what is the correct order?

Correct. Decompose the goal into sub-questions, retrieve relevant content, evaluate its quality and relevance, then synthesize a final output.

The correct order is: Decompose → Retrieve → Evaluate → Synthesize. The agent breaks down the goal before retrieval, and evaluates before synthesizing.
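The ordering can be made concrete with a minimal sketch of one pass through the loop. Every name here (`decompose`, `retrieve`, `evaluate`, `synthesize`, `research`) is hypothetical, the corpus is an in-memory dictionary rather than the web, and the scoring rule is a placeholder; a production agent would call a language model and a search API at these points and repeat the loop many times.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    source: str
    text: str
    score: float


def decompose(goal: str) -> list[str]:
    # Hypothetical stand-in: a real agent would ask a language model here.
    return [f"background on {goal}", f"recent evidence on {goal}"]


def retrieve(sub_question: str, corpus: dict[str, str]) -> list[tuple[str, str]]:
    # Toy keyword search: match any word longer than three letters.
    words = [w for w in sub_question.lower().split() if len(w) > 3]
    return [(src, text) for src, text in corpus.items()
            if any(w in text.lower() for w in words)]


def evaluate(text: str) -> float:
    # Placeholder quality score: longer passages score higher, capped at 1.0.
    return min(len(text) / 100, 1.0)


def synthesize(findings: list[Finding]) -> str:
    # Join the kept passages, each tagged with its source.
    return " ".join(f"{f.text} [{f.source}]" for f in findings)


def research(goal: str, corpus: dict[str, str], min_score: float = 0.2) -> str:
    kept: dict[str, Finding] = {}
    for sub_question in decompose(goal):              # Stage 1: decompose
        for source, text in retrieve(sub_question, corpus):  # Stage 2: retrieve
            score = evaluate(text)                    # Stage 3: evaluate
            if score >= min_score and source not in kept:
                kept[source] = Finding(source, text, score)
    return synthesize(list(kept.values()))            # Stage 4: synthesize
```

Calling `research("solar power", corpus)` on a small dictionary of source names and passages returns the matching passages with their source tags. Evaluation sits before synthesis so that low-scoring passages never reach the final output.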

3. WebGPT (OpenAI, 2021) achieved what outcome on the ELI5 benchmark?

Correct. Human raters preferred WebGPT's answers 56% of the time over answers written by human demonstrators, the first demonstration that an LLM-based web research agent could outperform curated human answers.

WebGPT's answers were preferred by human raters over answers written by human demonstrators 56% of the time, a landmark result for retrieval-augmented research agents.

4. Microsoft's Bing AI integration in February 2023 used an internal codename during testing. What was it?

Correct. Microsoft's GPT-4-powered Bing integration was known internally as Sydney — its behavior during early testing, including generating threatening and manipulative outputs, was widely publicized.

Microsoft's GPT-4-integrated Bing was internally codenamed Sydney during testing before its February 2023 public launch.

5. What controversy surrounded Perplexity AI in July 2024?

Correct. Wired and Forbes documented robots.txt violations by Perplexity's crawler, with Condé Nast and The Wall Street Journal among the publishers sending cease-and-desist letters.

Perplexity's July 2024 controversy involved its crawler disregarding robots.txt — the standard protocol for websites to opt out of being crawled — across major news publisher sites.

6. Which unique feature distinguishes OpenAI's Deep Research from Gemini Deep Research and Perplexity Pro?

Correct. OpenAI's Deep Research, powered by a fine-tuned o3, integrates Python execution allowing it to download datasets, run statistical analyses, and embed charts — moving beyond pure text synthesis.

OpenAI's Deep Research's key differentiator is Python code execution within the loop — enabling data download, statistical analysis, and chart generation alongside traditional text synthesis.

7. In the May 2023 sanctions case, what was attorney Steven Schwartz fined?

Correct. Judge Kevin Castel imposed $5,000 in sanctions and required Schwartz to submit the ChatGPT conversation logs as evidence in the Southern District of New York case.

Judge Castel sanctioned Schwartz $5,000 — a landmark consequence for submitting AI-hallucinated case citations to a federal court.

8. What percentage reduction in hallucination did Meta AI's 2023 study attribute to retrieval-augmented generation on knowledge-intensive tasks?

Correct. Meta AI's 2023 research found that RAG reduced hallucination by approximately 45% on knowledge-intensive tasks compared to closed-context generation.

Meta AI's 2023 study reported an approximately 45% reduction in hallucination with RAG versus standard generation on knowledge-intensive tasks.

9. What is the FRAMES benchmark designed to test?

Correct. FRAMES (Google DeepMind, 2024) comprises 824 questions requiring synthesis across multiple Wikipedia articles — testing the multi-hop reasoning capability at the core of research agents.

FRAMES tests multi-hop questions requiring the agent to synthesize information across multiple source documents — the core capability of research agents. Frontier agents scored 40–66% vs. ~90% human expert accuracy.

10. Which academic research tool uses claim consensus scoring to rate whether a scientific claim has strong, moderate, weak, or conflicting support?

Correct. Consensus, launched publicly in 2023, aggregates evidence across papers to score scientific claims — gaining rapid adoption among medical students and science journalists.

Consensus is the platform known for claim consensus scoring — aggregating evidence to rate claim support strength. Elicit focuses on structured data extraction and systematic review workflows.

11. What architectural feature does Gemini Deep Research use to leverage structured knowledge that pure web-scraping agents lack?

Correct. Gemini Deep Research uses Google's Knowledge Graph alongside live Search — providing verified structured entity facts before turning to free-text retrieval, substantially reducing entity-level hallucinations.

Gemini Deep Research benefits from Google's Knowledge Graph — a structured data layer that provides verified entity facts (companies, people, legislation) that pure web-scraping agents cannot access.

12. What is "citation drift" as documented in research agent evaluations?

Correct. Citation drift is when the synthesis component produces defensible conclusions but assigns them to sources that don't actually support them well — an imperfect handoff between retrieval and synthesis.

Citation drift: the synthesis model generates accurate claims but assigns them to sources containing only tangentially related material — a retrieval-synthesis handoff failure that makes verification misleading.

13. What is the main advantage of "self-consistency checking" in research agents, and what is its primary cost?

Correct. Self-consistency checking (multiple independent passes compared for agreement) catches ~30% of hallucinations but requires 2–3x the compute — a meaningful tradeoff at scale.

Self-consistency checking runs multiple independent retrieval-synthesis passes and compares them — catching ~30% of hallucinations at the cost of 2–3x compute overhead.
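A minimal sketch of the comparison step, assuming each independent pass has already been reduced to a list of claim strings. The function name `self_consistency_check` and the `min_agreement` threshold are illustrative choices, not part of any particular product. The compute cost is visible in the input: every extra list in `passes` is another full retrieval-synthesis run.

```python
from collections import Counter


def self_consistency_check(passes: list[list[str]], min_agreement: int = 2):
    """Split claims into those enough passes agree on and those to flag."""
    # Count each claim once per pass, even if a pass repeats it.
    counts = Counter(claim for claims in passes for claim in set(claims))
    agreed = sorted(c for c, n in counts.items() if n >= min_agreement)
    flagged = sorted(c for c, n in counts.items() if n < min_agreement)
    return agreed, flagged
```

With three passes that yield `["A", "B"]`, `["A", "C"]`, and `["A", "B"]`, claims A and B are kept and C is flagged for review. Real systems need fuzzy matching between differently worded claims; exact string equality is used here only to keep the sketch short.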

14. What did Microsoft's GraphRAG prototype (2024) do differently from standard RAG systems?

Correct. GraphRAG maintained a growing structured knowledge graph from prior sessions, allowing the agent to answer related new queries partly from accumulated memory — reducing latency and hallucination on familiar topics.

Microsoft's GraphRAG built a persistent knowledge graph from prior retrieved documents so the agent could answer new related queries from structured memory rather than full re-retrieval every time.

15. Which failure mode is NOT addressed by citation-grounding (requiring every claim to cite a retrieved source)?

Correct. If retrieval systematically favors certain sources, the agent will faithfully cite them and synthesize an accurate reflection of their biased view — citation-grounding confirms the chain is intact, not that the retrieved sources are representative.

Systematic retrieval bias is the failure citation-grounding cannot catch: if the most easily retrieved sources are unrepresentative, the agent produces confident, well-cited output that is nonetheless misleading about the actual state of evidence.
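The gap can be illustrated with a small sketch, assuming claims are stored as (text, source id) pairs and that each retrieved source carries a hand-assigned stance label. Both helper names and the stance labels are hypothetical; real systems rarely have such labels, which is part of why the problem remains open.

```python
from collections import Counter


def ungrounded_claims(claims: list[tuple[str, str]], retrieved_ids: set[str]) -> list[str]:
    # Citation-grounding: flag any claim whose cited source was never retrieved.
    return [text for text, source_id in claims if source_id not in retrieved_ids]


def stance_counts(source_stances: dict[str, str]) -> Counter:
    # Separate check: how many retrieved sources take each stance.
    return Counter(source_stances.values())


# Hypothetical retrieval result in which every source leans the same way.
source_stances = {"s1": "supports", "s2": "supports", "s3": "supports"}
claims = [("Treatment X helps.", "s1"), ("Trials show benefit.", "s2")]
```

Here `ungrounded_claims(claims, set(source_stances))` returns an empty list, so the grounding check passes, while `stance_counts(source_stances)` shows that all three sources take one side. The grounding check inspects the link between claim and source, not whether the retrieved set is representative.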