When Microsoft integrated GPT-4 into Bing in February 2023, it didn't simply attach a chat window to a search index. The resulting system — internally called Sydney during testing — could issue follow-up search queries based on its own intermediate conclusions, synthesize results across multiple pages, and present a single coherent answer rather than a list of links. That architecture was a public, mass-scale deployment of what researchers had been calling an agentic loop: perceive, reason, act, observe, repeat.
Within weeks, rival teams at Google, Perplexity AI, and a dozen startups were racing to ship similar systems. The era of the research agent had arrived not in a laboratory but in a browser tab.
A research agent is an AI system that accepts an open-ended informational goal and autonomously plans and executes a sequence of information-gathering and synthesis steps to satisfy it. Unlike a single-shot language model query — where you ask once and receive one answer — a research agent operates in a loop.
The loop has four canonical stages: decompose (break the goal into sub-questions), retrieve (search web, databases, files, APIs), evaluate (judge source quality and relevance), and synthesize (combine findings into a coherent output). The agent may iterate this loop dozens of times before returning a result, adjusting its plan based on what it finds.
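The four stages above can be sketched as a small control loop. This is a minimal illustration, not the implementation of any product named in this lesson: `run_research_loop`, `Finding`, the toy `CORPUS`, and the word-overlap `evaluate` function are all hypothetical stand-ins for what would be LLM calls and live search in a real agent.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    question: str
    source: str
    text: str
    score: float


def run_research_loop(goal, decompose, retrieve, evaluate, synthesize,
                      max_iterations=3, min_score=0.5):
    """Run decompose -> retrieve -> evaluate -> synthesize with a bounded retry loop."""
    pending = decompose(goal)              # stage 1: break the goal into sub-questions
    findings = []
    for _ in range(max_iterations):
        if not pending:
            break
        unanswered = []
        for question in pending:
            # stage 2: retrieve candidate (source, text) pairs
            hits = [Finding(question, src, text, evaluate(question, text))
                    for src, text in retrieve(question)]
            # stage 3: keep only passages judged relevant enough
            good = [h for h in hits if h.score >= min_score]
            if good:
                findings.extend(good)
            else:
                unanswered.append(question)
        # A real agent would rephrase unanswered questions here; this sketch retries them as-is.
        pending = unanswered
    # stage 4: combine the findings, and report what stayed unanswered
    return synthesize(goal, findings), pending


# Toy stand-ins (assumptions for illustration only).
CORPUS = {
    "doc-a": "WebGPT browsed the web with a text-based browser.",
    "doc-b": "Watson combined retrieval with passage scoring.",
}


def decompose(goal):
    return [part.strip() for part in goal.split(" and ")]


def retrieve(question):
    return list(CORPUS.items())


def evaluate(question, text):
    q_words = set(question.lower().split())
    t_words = set(text.lower().rstrip(".").split())
    return len(q_words & t_words) / len(q_words)


def synthesize(goal, findings):
    return "; ".join(f"{f.text} [{f.source}]" for f in findings)
```

Returning the unanswered sub-questions alongside the report is a deliberate choice in this sketch: it lets the caller see what the agent failed to ground instead of hiding the gap.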
The concept predates large language models. IBM's Watson defeated Jeopardy champions in 2011 by combining information retrieval with passage scoring — a primitive but recognizable research pipeline. What changed with transformer-based LLMs was the ability to reason flexibly about retrieved text rather than just pattern-match it.
In December 2021, OpenAI published WebGPT, demonstrating a fine-tuned GPT-3 variant that could browse the web using a text-based browser and answer long-form questions with citations. Evaluated on questions from the ELI5 dataset, the best WebGPT model's answers were preferred over answers written by human demonstrators 56% of the time, a landmark result that directly influenced the design of subsequent commercial research agents.
By 2024, the category had fragmented into specialist variants: academic search agents (Elicit, Consensus), legal research agents (Harvey AI, Westlaw AI), competitive intelligence agents (Crayon, AlphaSense AI), and general-purpose deep research agents (Perplexity Pro, Gemini Deep Research, OpenAI Deep Research).
Search engines retrieve and rank documents. Research agents reason across documents. The difference is not speed or coverage — it is whether the system can form novel conclusions that appear in none of the individual source documents.
You've learned the four stages of a research agent loop: decompose, retrieve, evaluate, synthesize. In this lab, explore how those stages interact in practice. Ask about real systems, edge cases, failure modes, or design tradeoffs.
On February 2, 2025, OpenAI released Deep Research as a feature within ChatGPT Pro. In its launch blog post, the company reported that the agent scored 26.6% on Humanity's Last Exam — a benchmark designed to stump expert humans — compared to 3–4% for standard GPT-4o. The agent was described as capable of spending five to thirty minutes autonomously browsing the web, reading academic papers, and producing multi-thousand-word reports with inline citations.
The reaction from the research community was immediate. Independent testers praised the depth of outputs but documented cases where the agent cited paywalled papers it had clearly not read in full, instead inferring their content from abstracts and surrounding commentary. The quality gap between what it appeared to know and what it had actually retrieved became a central thread in AI-safety discussions throughout early 2025.
Perplexity AI, founded in 2022 and valued at approximately $9 billion by late 2024, built its product on a retrieval-augmented generation (RAG) architecture but with a critical addition: an online retrieval step that issues live web searches rather than querying a static pre-indexed corpus. Each user query triggers a query fan-out, parallel Bing API calls, result scraping, passage extraction, and finally synthesis via a fine-tuned LLM.
Perplexity's Pro Search mode, released in 2023, added multi-hop retrieval — where intermediate answers generate new queries. A question about "the effect of the 2023 Silicon Valley Bank collapse on venture capital dry powder" would trigger sub-queries about SVB's failure, about LP capital calls, and about Q2/Q3 2023 VC funding data, with each strand informing the final synthesis.
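Multi-hop retrieval of this kind can be sketched as a breadth-first traversal in which each retrieved passage may propose new queries. The sketch below is an assumption about the general pattern, not Perplexity's actual pipeline: `multi_hop`, `search`, `propose_followups`, and the `FOLLOWUPS` table are hypothetical, and in a real agent the follow-ups would come from an LLM reading the passage.

```python
from collections import deque


def multi_hop(start_query, search, propose_followups, max_hops=2):
    """Breadth-first multi-hop retrieval with de-duplication and a hop limit."""
    seen = {start_query}
    queue = deque([(start_query, 0)])
    evidence = []
    while queue:
        query, hop = queue.popleft()
        passage = search(query)
        evidence.append((query, passage))
        if hop == max_hops:
            continue  # do not expand beyond the hop budget
        for follow_up in propose_followups(query, passage):
            if follow_up not in seen:
                seen.add(follow_up)
                queue.append((follow_up, hop + 1))
    return evidence


# Hypothetical follow-up table standing in for an LLM's query proposals.
FOLLOWUPS = {
    "SVB collapse effect on VC dry powder": ["why did SVB fail", "LP capital calls 2023"],
    "why did SVB fail": ["LP capital calls 2023"],
}


def search(query):
    return f"passage for: {query}"


def propose_followups(query, passage):
    return FOLLOWUPS.get(query, [])
```

The `seen` set matters more than it looks: without it, two strands that converge on the same sub-question would each pay for the same retrieval.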
The company faced significant controversy in June 2024 when Wired and Forbes reported that Perplexity's web crawler was ignoring robots.txt directives, the industry-standard protocol for sites to opt out of crawling. Several news publishers, including Condé Nast and The Wall Street Journal, sent cease-and-desist letters. The episode illustrated a core tension in research agent design: comprehensive retrieval conflicts directly with publisher consent frameworks.
In June 2024, Wired and Forbes published investigations showing Perplexity's crawler disregarding robots.txt on major news sites. This sparked an industry debate about whether research agents require a new consent standard beyond existing crawl protocols.
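For readers unfamiliar with how a crawler is expected to honor robots.txt, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `ExampleResearchBot` user agent, and the `news.example` URLs are made up for illustration; the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is blocked entirely,
# every other bot is blocked only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleResearchBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls):
    """Return only the URLs this user agent is permitted to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

A retrieval step that filters its candidate URLs through a check like `allowed_urls` before fetching trades some coverage for compliance, which is exactly the tension described above.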
Google launched Gemini Deep Research in December 2024, integrated into Gemini Advanced (the $20/month tier). Unlike Perplexity's streaming approach, Deep Research first presents a research plan — a structured outline of the sub-questions it intends to investigate — and asks the user to approve or modify it before execution. This human-in-the-loop checkpoint at the planning stage was a deliberate design choice to increase user trust and reduce wasted compute on misdirected research plans.
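A planning-stage checkpoint can be expressed in a few lines. The sketch below is an assumed shape for the pattern, not Google's implementation: `ResearchPlan`, `execute_with_checkpoint`, and the `reviewer` callback (which returns the approved sub-questions, or `None` to reject the plan) are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ResearchPlan:
    goal: str
    sub_questions: List[str]


def execute_with_checkpoint(
    plan: ResearchPlan,
    reviewer: Callable[[ResearchPlan], Optional[List[str]]],
    run_step: Callable[[str], str],
) -> List[str]:
    """Ask a human (or policy) to approve or edit the plan before any step runs."""
    approved = reviewer(plan)
    if approved is None:
        return []  # plan rejected: no retrieval or compute is spent
    return [run_step(question) for question in approved]
```

Because the reviewer runs before `run_step` is ever called, a rejected or trimmed plan costs nothing downstream, which is the economic argument for placing the checkpoint at the planning stage.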
Google's version leverages its proprietary Knowledge Graph alongside live Search, giving it a structured-data layer that pure web-scraping agents lack. When researching entities (companies, people, legislation), Gemini can pull verified structured facts before turning to free-text retrieval — substantially reducing entity-level hallucinations.
OpenAI's Deep Research, powered by a fine-tuned version of o3, was notable for its integration of Python code execution within the research loop. The agent can download datasets, run statistical analyses, generate charts, and embed the results in its final report — moving beyond pure text synthesis into what OpenAI called agentic data analysis. In early benchmark comparisons, it substantially outperformed Gemini Deep Research on tasks requiring numerical reasoning over retrieved tables.
The system's main limitation, documented by multiple independent reviewers in February–March 2025, was citation drift: the tendency to generate accurate high-level conclusions while assigning those conclusions to sources that contained only tangentially related material. The underlying cause — that the synthesis LLM and the retrieval system share an imperfect handoff — remains an open research problem.
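One crude way to surface possible citation drift is to check how much of a claim's vocabulary actually appears in the passage it cites. The sketch below is a deliberately simple lexical proxy, offered as an assumption about how a first-pass filter might look; production systems would more plausibly use an entailment model. The names `support_score` and `flag_drift`, the stopword list, and the 0.6 threshold are all illustrative choices.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to",
             "is", "was", "by", "for", "that", "with"}


def content_words(text):
    """Lowercased alphanumeric tokens minus a small stopword list."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def support_score(claim, passage):
    """Fraction of the claim's content words that also occur in the passage."""
    claim_words = content_words(claim)
    if not claim_words:
        return 0.0
    return len(claim_words & content_words(passage)) / len(claim_words)


def flag_drift(cited_claims, sources, threshold=0.6):
    """Return (claim, source_id) pairs whose cited passage looks only loosely related."""
    flagged = []
    for claim, source_id in cited_claims:
        if support_score(claim, sources.get(source_id, "")) < threshold:
            flagged.append((claim, source_id))
    return flagged
```

A lexical check like this catches only the most obvious mismatches; a passage can share every keyword with a claim and still contradict it, so the flagged list is a starting point for human review, not a verdict.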
| System | Plan Review | Code Execution | Primary Retrieval | Known Weakness |
|---|---|---|---|---|
| Perplexity Pro | No | No | Live web (Bing API + scraper) | robots.txt violations; thin citations |
| Gemini Deep Research | Yes (approve plan before execution) | No | Google Search + Knowledge Graph | Slower; plan approval adds friction |
| OpenAI Deep Research | Partial (asks clarifying questions before starting) | Yes (Python) | Live web browser + file uploads | Citation drift on complex synthesis tasks |
The human-in-the-loop checkpoint that Gemini Deep Research places at the planning stage — rather than at the output stage — is architecturally significant. It costs almost nothing (a ten-second approval) but eliminates entire branches of wasted research caused by misunderstood intent.
You've studied three deployed research agent systems with distinct architectures and documented failure modes. Use this lab to explore their tradeoffs, ask about real incidents, or think through how you would design around their weaknesses.
In February 2023, the legal research firm Harvey AI, backed by the OpenAI Startup Fund, announced a partnership with Allen & Overy, one of the world's largest law firms. The deployment gave approximately 3,500 lawyers access to a research agent trained on legal corpora and integrated with PLC, Practical Law, and internal case databases. Partners reported that associates were using Harvey to draft research memos in hours rather than days.
But the deployment also surfaced a problem familiar to legal professionals: hallucinated citations. Unlike a general research task where a plausible but invented source is merely embarrassing, in legal work a non-existent case citation submitted to a court is a sanctionable ethical violation. Allen & Overy implemented mandatory human review of all Harvey-generated citations before any external use — a workflow layer that general-purpose research agents had never needed to design for.
Elicit, originally built by the nonprofit Ought and later spun out as an independent company, launched its research assistant in 2022 targeting systematic-review workflows in academic research. Its key architectural choice was restricting retrieval to Semantic Scholar's corpus of 200+ million academic papers rather than the open web. This sacrificed breadth for precision and traceability: every claim could be traced to a specific paper with a DOI.
Elicit's agent can extract structured data from papers (sample sizes, effect sizes, study designs), populate comparison tables across dozens of papers, and flag methodological inconsistencies. In 2023, a team at the University of Pennsylvania published an evaluation showing Elicit's paper-screening recall on systematic reviews was competitive with trained research assistants, while reducing screening time by roughly 65%.
Consensus took a similar academic focus but emphasized claim consensus scoring — aggregating evidence across papers to rate whether a given scientific claim has strong, moderate, weak, or conflicting support. Launched publicly in 2023, it gained rapid adoption among medical students and science journalists seeking quick literature overviews.
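To make the idea of claim consensus scoring concrete, here is a minimal sketch that maps per-paper stances to one of the four labels mentioned above. It is not Consensus's actual scoring method, which is not described here: the function name, the stance strings, the minimum paper count, and every threshold are assumptions chosen for illustration.

```python
from collections import Counter


def consensus_label(stances, min_papers=3):
    """Map per-paper stances ('supports' / 'contradicts') to a coarse consensus label."""
    counts = Counter(stances)
    supporting = counts["supports"]
    opposing = counts["contradicts"]
    total = supporting + opposing
    if total < min_papers:
        return "weak"          # too little evidence to say much
    share = supporting / total
    if share >= 0.8:
        return "strong"
    if share >= 0.6:
        return "moderate"
    if share > 0.4:
        return "conflicting"   # evidence roughly split
    return "weak"              # most papers do not support the claim
```

Note that a vote count like this treats every paper equally; weighting by study design or sample size would be a natural refinement, at the cost of needing the structured extraction described for Elicit.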
Harvey AI's architecture is notable for its retrieval-augmented approach using structured legal databases rather than web crawling. Case law, statutes, and regulatory filings are indexed with structured metadata (jurisdiction, date, court level, citing relationships), enabling the agent to apply jurisdictional filters and precedent hierarchies that a general web search cannot replicate.
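The value of structured metadata is easiest to see in code: once each case carries a jurisdiction and court level, filtering and precedent ordering become a simple sort. The sketch below is a hypothetical illustration, not Harvey's schema; `CaseRecord`, the integer `court_level` encoding (1 = highest court), and the placeholder case names are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CaseRecord:
    name: str
    jurisdiction: str
    court_level: int   # assumption: 1 = highest court, larger = lower court
    decided: date


def rank_precedents(cases, jurisdiction):
    """Keep cases from one jurisdiction; order by court level, then most recent first."""
    in_scope = [c for c in cases if c.jurisdiction == jurisdiction]
    return sorted(in_scope, key=lambda c: (c.court_level, -c.decided.toordinal()))
```

A general web search has no equivalent of `court_level`, so it cannot express "prefer the higher court" at retrieval time; it can only hope the ranking signal correlates with authority.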
Thomson Reuters launched Westlaw AI in 2024, integrating generative AI into its legacy legal research platform with a key differentiator: the system highlights the specific sentences within retrieved cases that support each generated answer. This transparency feature — showing exactly what text the synthesis was grounded in — was a direct response to the citation drift problem that had plagued earlier legal AI deployments.
AlphaSense, a financial research platform used by investment banks and hedge funds, deployed generative AI search in 2023 across a corpus of earnings call transcripts, SEC filings, broker research, and news. Its research agent can answer questions like "What have pharmaceutical executives said about GLP-1 manufacturing capacity in the last 90 days?" by retrieving and synthesizing across hundreds of documents that no human analyst could read in real time.
The platform's central design challenge is temporal precision. In financial research, a claim from a Q2 2023 earnings call is materially different from the same claim in Q3 2024. AlphaSense built strict date-range filtering into every retrieval step and surfaces the document date prominently in every citation — a design choice that general research agents often neglect.
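Date-range filtering at retrieval time, with the document date carried into every citation, might look like the sketch below. This is an assumed illustration of the design principle, not AlphaSense's code: `Passage`, `within_window`, `cite`, and the sample document IDs and text are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Passage:
    doc_id: str
    published: date
    text: str


def within_window(passages, as_of, days):
    """Keep passages published in the `days` days up to and including `as_of`."""
    start = as_of - timedelta(days=days)
    return [p for p in passages if start <= p.published <= as_of]


def cite(passage):
    """Render a citation that always shows the document date."""
    return f"{passage.text} [{passage.doc_id}, {passage.published.isoformat()}]"
```

Applying the window before synthesis, not after, is the important part: a stale passage that never reaches the model cannot be blended into a present-tense claim.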
The recurring lesson from specialized research agent deployments is that restricting the retrieval corpus — trading breadth for precision — is often the single most impactful design decision. General agents browse the open web; specialized agents browse curated, structured, high-authority corpora with rich metadata.
You've seen how legal, academic, and financial domains each required specialized architectural choices — restricted corpora, provenance transparency, temporal metadata. Apply this thinking to a domain of your choice or dig deeper into the examples from the lesson.
In March 2023, New York attorney Steven Schwartz submitted a legal brief to the Southern District of New York, in Mata v. Avianca, that cited six non-existent cases, all generated by ChatGPT, which Schwartz had asked to find supporting precedents. When opposing counsel could not locate the cases, Judge P. Kevin Castel ordered Schwartz to explain. Schwartz had not verified any citation. In June 2023, the court imposed $5,000 in sanctions and required submission of the ChatGPT conversation logs as evidence.
The incident became one of the most widely reported AI failures of 2023 and crystallized a fundamental reliability question: not whether AI agents occasionally produce wrong answers, but whether the form of their wrong answers (plausible, confidently stated, indistinguishable from correct answers) makes them categorically more dangerous than traditional errors.
Evaluating research agents is harder than evaluating standard LLMs because the output quality depends on both retrieval quality and synthesis quality — and failures in one can mask or amplify failures in the other.
The research community has converged on several benchmark categories. Attribution benchmarks (such as ALCE, Automatic LLMs' Citation Evaluation) measure whether each claim in a generated response is actually supported by the cited source passage. Factuality benchmarks measure whether the generated claims are objectively true. End-to-end task benchmarks, like the FRAMES benchmark published by Google DeepMind in 2024, measure whether the agent successfully answers complex multi-hop questions requiring synthesis across multiple documents.
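Attribution metrics are typically reported as citation recall and citation precision. The sketch below is a simplified version written for illustration, not the exact ALCE procedure: it assumes support judgments have already been made (by annotators or an entailment model) and are supplied as booleans, and it counts a statement as supported if any one of its citations supports it.

```python
def citation_metrics(statements):
    """Compute simplified citation recall and precision.

    `statements` is a list with one entry per generated statement; each entry
    is a list of booleans, one per citation, saying whether that cited passage
    supports the statement. An empty list means the statement has no citation.
    """
    total_citations = sum(len(judgments) for judgments in statements)
    supporting_citations = sum(sum(judgments) for judgments in statements)
    supported_statements = sum(1 for judgments in statements if any(judgments))

    # Recall: share of statements backed by at least one supporting citation.
    recall = supported_statements / len(statements) if statements else 0.0
    # Precision: share of citations that actually support their statement.
    precision = supporting_citations / total_citations if total_citations else 0.0
    return recall, precision
```

Reporting the two numbers separately is useful because they fail differently: an agent that cites nothing has poor recall, while an agent that attaches many loosely related sources to every sentence has poor precision.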
On the FRAMES benchmark, which tested 824 challenging questions requiring information from multiple Wikipedia articles, frontier research agents scored between 40% and 66% accuracy in 2024 evaluations — well below the human expert baseline of approximately 90%.
Google DeepMind's FRAMES benchmark (2024) tests 824 multi-hop questions requiring synthesis across multiple documents. Frontier research agents scored 40–66% vs. ~90% human expert accuracy — the gap quantifying the remaining reliability problem in complex research synthesis.
Research teams have identified several approaches that measurably reduce hallucination in research agents. Retrieval-augmented generation with citation constraints — requiring every claim to have a retrieved source passage — reduces hallucination rates substantially compared to closed-context generation. A 2023 study from Meta AI found RAG reduced hallucination on knowledge-intensive tasks by approximately 45% compared to standard generation.
Self-consistency checking, where the agent runs multiple independent retrieval-synthesis passes and compares outputs, has been shown to catch approximately 30% of hallucinations in controlled evaluations — at the cost of 2–3x the compute. Inline uncertainty markers — having the agent explicitly flag low-confidence claims — improve user calibration even when they do not reduce the underlying error rate.
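Self-consistency checking reduces to a vote across passes once each pass's output has been broken into comparable claims. The sketch below assumes that claim extraction and normalization have already happened upstream (in practice the hard part); `self_consistency` and the `min_agreement` threshold are hypothetical names and values.

```python
from collections import Counter


def self_consistency(passes, min_agreement=0.5):
    """Split claims into stable and suspect sets based on agreement across passes.

    `passes` is a list of collections of normalized claim strings, one per
    independent retrieval-synthesis pass. A claim is stable if it appears in
    more than `min_agreement` of the passes.
    """
    if not passes:
        return set(), set()
    counts = Counter(claim for claims in passes for claim in set(claims))
    n = len(passes)
    stable = {claim for claim, k in counts.items() if k / n > min_agreement}
    suspect = set(counts) - stable
    return stable, suspect
```

Each additional pass multiplies retrieval and generation cost, which is where the 2–3x compute figure comes from; claims that land in the suspect set are candidates for re-retrieval or an inline uncertainty marker.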
The deepest unsolved problem is systematic bias in retrieval: if the sources most easily retrieved (high PageRank, frequently cited) are systematically unrepresentative of the full evidence landscape, the agent's synthesis will reflect that bias confidently and without any hallucination — a failure mode that citation-grounding does not catch.
Several research directions are moving from labs toward deployment. Agent-to-agent verification, which routes synthesized research through a second specialized agent whose role is adversarial critique, builds on the critique-and-revision pattern popularized by Anthropic's Constitutional AI work and is being adapted for factual verification pipelines. Live database integration, connecting research agents directly to structured scientific databases (PubMed, ClinicalTrials.gov, EDGAR) rather than unstructured web text, substantially reduces retrieval noise in specialized domains.
The most consequential near-term development is likely memory and persistence: research agents that maintain a growing, curated knowledge base from prior sessions rather than starting from scratch each time. Microsoft's research division demonstrated a prototype "GraphRAG" system in 2024 that constructs a knowledge graph from prior retrieved documents, enabling the agent to answer new queries partly from its accumulated structured memory rather than full re-retrieval — reducing both latency and hallucination on questions it has "researched before."
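The memory-first idea can be illustrated with a tiny triple store that is consulted before any retrieval happens. This sketch is not Microsoft's GraphRAG, which builds its graph with LLM-based extraction over a corpus; `GraphMemory`, `answer`, and the `retrieve` callback are hypothetical, and facts are assumed to arrive already structured as (object, source) pairs.

```python
from collections import defaultdict


class GraphMemory:
    """Stores (subject, relation) -> {(object, source)} edges across sessions."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj, source):
        self._edges[(subject, relation)].add((obj, source))

    def lookup(self, subject, relation):
        return sorted(self._edges.get((subject, relation), set()))


def answer(memory, subject, relation, retrieve):
    """Answer from accumulated memory when possible, else retrieve and remember."""
    known = memory.lookup(subject, relation)
    if known:
        return known, "memory"
    for obj, source in retrieve(subject, relation):
        memory.add(subject, relation, obj, source)
    return memory.lookup(subject, relation), "retrieval"
```

Keeping the source on every edge means an answer served from memory can still be cited, but it also means stale facts persist until something invalidates them, so a real system would need an expiry or re-verification policy.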
Research agents produce outputs that look like expertise — structured, cited, confident. This presentation quality is independent of the underlying accuracy. The Schwartz case showed that users who cannot independently verify outputs may trust the form over the substance. The unsolved UX problem is how to make AI uncertainty as visible as AI confidence.
You've studied hallucination mitigation strategies, the FRAMES benchmark, and unsolved problems in research agent reliability. Apply this to a scenario — evaluate a specific system, design a verification workflow, or probe the limits of current mitigation approaches.