In March 2023, Perplexity AI processed over 500 million queries in a single month by building a retrieval pipeline on top of Bing's Web Search API combined with their own crawling infrastructure. Rather than sending users to a list of blue links, their system issued structured API calls, extracted relevant snippets, re-ranked results by semantic similarity, and synthesized a grounded answer — all within roughly 1.5 seconds per query. The engineering insight was treating search not as a destination but as a data source: a structured feed of URLs, titles, and snippets that an LLM could reason over.
A search API translates a natural-language or keyword query into a structured JSON response. The major providers — Bing Web Search API, Google Custom Search JSON API, SerpAPI, Brave Search API, and Tavily — each expose slightly different schemas, rate limits, and data freshness guarantees. Choosing the right one for an agent is an architectural decision, not a UI choice.
Bing's API returns webPages.value[] objects with fields including name, url, snippet, dateLastCrawled, and cachedPageUrl. Google's Custom Search JSON API returns items[] with title, link, snippet, and a pagemap sub-object containing structured metadata. SerpAPI abstracts over both Google and Bing and normalizes the schema. Tavily, built specifically for LLM agents, returns pre-scored relevance alongside raw snippets and supports a search_depth: advanced mode that triggers full-page content extraction.
The critical technical difference for agent design is snippet quality vs. full content access. Standard search APIs return 150–300 character snippets optimized for human reading. Agents that need to cite sources accurately or synthesize detailed answers must follow URLs and fetch full page content — a separate operation with its own latency and token cost.
Tavily's include_raw_content: true flag retrieves full extracted text alongside search results in a single API call, collapsing what would otherwise be a two-step fetch-and-parse pipeline. For latency-sensitive agents, this is a meaningful design tradeoff worth benchmarking.
Raw user intent rarely maps cleanly to an effective search query. The agent must perform query transformation — converting a conversational request like "what's the latest on OpenAI's valuation?" into a structured query like "OpenAI valuation" site:reuters.com OR site:bloomberg.com after:2024-01-01. This is a non-trivial step. Production systems at companies like Perplexity and You.com use a dedicated query-rewriting LLM call before the API request.
Result processing involves three layers: deduplication (removing near-identical snippets from syndicated sources), relevance re-ranking (scoring results against the original user intent, not just the transformed query), and provenance tracking (maintaining the URL-to-snippet mapping so citations survive the LLM's summarization step). Losing provenance is a common failure mode — the agent produces a confident answer but cannot tell the user where the information came from.
In November 2023, early versions of Google's Search Generative Experience (SGE) cited a satirical Onion article as factual health advice because the snippet-level relevance score was high but the source domain credibility check was absent. Production agents need explicit domain-trust filtering as a post-retrieval step, not just retrieval quality optimization.
Bing Web Search API's S1 tier allows 3 queries per second and 1,000 queries per month at $7. Tavily's free tier allows 1,000 searches per month; production usage is priced at $0.004 per search at scale. At those prices, an agent that issues 5 search calls per user session costs $0.02 per session in search costs alone — before LLM inference costs. Multiply by 100,000 daily active users and search API costs become a significant line item.
Production teams address this through semantic caching: storing the embedding of each query alongside its results, then returning cached results for new queries within a cosine similarity threshold (typically 0.92–0.95). Redis with vector search extensions or Qdrant are commonly used for this layer. The cache hit rate for a mature consumer product typically reaches 30–40%, substantially reducing API spend.
search_depth: advanced mode from standard search API calls?You are designing the search retrieval layer for a financial news agent. The agent needs to answer real-time questions like "What are analysts saying about Nvidia's Q3 earnings?" with cited, accurate answers.
In January 2024, Cognition AI's Devin software engineering agent made headlines by autonomously navigating to GitHub, reading a repository's issue tracker, cloning code, running tests, and submitting a pull request — all through a headless browser and terminal integration. Devin used Playwright under the hood to control a Chromium instance, taking screenshots at each step for visual state verification. The architecture exposed a key insight: for tasks requiring multi-step web interaction, a headless browser is not a convenience feature but a fundamental capability layer.
Playwright (Microsoft, 2020) and Puppeteer (Google, 2018) both expose a programmatic API for controlling a headless browser. Playwright supports Chromium, Firefox, and WebKit with a unified API; Puppeteer targets Chromium/Chrome. Both operate by connecting to the browser's DevTools Protocol (CDP), which allows JavaScript execution, network interception, DOM inspection, screenshot capture, and user input simulation.
For agent use, the most important primitives are: page.goto(url) for navigation, page.waitForSelector(selector) for element readiness, page.fill(selector, text) and page.click(selector) for form interaction, page.screenshot() for visual state capture, and page.evaluate(fn) for arbitrary JS execution in page context. The challenge is that modern web apps render content dynamically — a naive goto + innerText extraction misses content loaded by JavaScript after initial page load.
Playwright's page.waitForLoadState('networkidle') waits until no network requests have fired for 500ms — a much more reliable signal for SPA content readiness than DOMContentLoaded, which fires before JavaScript frameworks render their content.
For content extraction, page.locator() with semantic selectors (ARIA roles, text content) is more resilient to site redesigns than XPath or CSS class selectors, which break whenever a developer renames a class. Production browser agents use a layered extraction strategy: try semantic selectors first, fall back to readability-based extraction (Mozilla's Readability library), and finally fall back to raw innerText with noise filtering.
In October 2024, Anthropic released the Claude 3.5 Sonnet Computer Use API, which allows Claude to observe screenshots and emit low-level actions: mouse clicks at pixel coordinates, keyboard input, and scroll commands. This represents a fundamental shift from selector-based automation (which requires knowing the DOM structure) to vision-based automation (which works on any rendered interface). The agent sees the page as a bitmap and decides where to click by understanding the visual layout.
Google's Project Mariner, demonstrated in December 2024, took a similar approach using Gemini to control Chrome via screenshot observation. OpenAI's Operator product, announced January 2025, uses GPT-4o to control a browser through a cloud-hosted Chromium instance, with a human-in-the-loop confirmation step for actions involving payments or account changes.
Anthropic's Computer Use API documentation notes that vision-based control averages 3–8 seconds per action step due to screenshot capture, encoding, and LLM inference latency. A 10-step task therefore takes 30–80 seconds — acceptable for autonomous research tasks but too slow for real-time user-facing interactions.
Production browser agents face significant anti-bot infrastructure. Cloudflare's Bot Management product (used by over 30% of Fortune 1000 websites as of 2024) uses TLS fingerprinting, JavaScript challenge evaluation, mouse movement heuristics, and behavioral analysis to distinguish automated browsers from human users. Playwright and Puppeteer emit detectable signals by default — navigator.webdriver is set to true, CDP socket connections are observable, and default viewport sizes differ from typical user profiles.
Mitigations include: playwright-extra with the stealth plugin (patches the most common detection vectors), residential proxy rotation, and browser profile persistence (reusing cookies and localStorage to appear as a returning user). However, the arms race between automation detection and evasion is ongoing. Ethical agent design means respecting robots.txt, rate-limiting requests to avoid server overload, and not using automation to circumvent access controls or paywalls.
page.waitForLoadState('networkidle') more reliable than DOMContentLoaded for SPA content extraction?You are building an agent that monitors competitor pricing on three e-commerce sites: one static HTML site, one React SPA, and one behind a login wall. For each, decide:
In September 2023, OpenAI re-enabled web browsing in ChatGPT after disabling it in July due to a vulnerability where the Bing-powered browsing tool could be manipulated through prompt injection in web pages — adversarial content that caused the agent to exfiltrate conversation context by encoding it in a URL the agent was instructed to navigate to. The fix required adding an output filter that blocked navigation to URLs containing patterns resembling encoded conversation data. This real incident illustrates that web tools dramatically expand an agent's attack surface: every web page it reads is potentially adversarial input.
In the OpenAI function calling / Anthropic tool use paradigm, web access is implemented as one or more named tools that the LLM can invoke by emitting a structured JSON block. A minimal web-capable agent typically exposes three tools: web_search(query: str) → list[SearchResult], fetch_url(url: str) → str, and optionally browser_action(action: BrowserAction) → Screenshot. The LLM decides when to call these based on whether its parametric knowledge is sufficient for the task.
The tool calling loop runs as follows: the LLM receives user message → determines whether web access is needed → emits a tool call → the orchestrator executes the tool and appends the result to context → the LLM reasons over the result and either calls another tool or generates a final response. The number of tool call cycles per query is bounded by a max_iterations parameter to prevent infinite loops.
The ReAct (Reasoning + Acting) pattern, published by Yao et al. at Princeton in 2022, formalizes this loop by interleaving explicit reasoning traces ("Thought: I need to find the current price...") with tool calls ("Action: web_search(...)") and observations ("Observation: ..."). Production implementations at LangChain and LlamaIndex directly implement ReAct as the default agent executor pattern.
Web retrieval creates a context management problem. A single webpage can contain 50,000–200,000 tokens of text. Even with 200K context windows, injecting multiple full pages alongside the conversation history and system prompt can saturate the context window, increase inference latency, and degrade model attention on the most relevant content.
Production systems address this through retrieved content compression: after fetching a URL, a lightweight LLM call (or a non-LLM extractive summarization model like DistilBERT-based MRC) extracts only the passages relevant to the current query. This "query-focused summarization" step reduces a full webpage to 300–800 tokens of relevant content. LangChain's MapReduce chain and LlamaIndex's SentenceWindowNodeParser implement variants of this approach.
Stanford and UC Berkeley research (Liu et al., 2023) demonstrated that LLMs perform significantly worse when the relevant information appears in the middle of a long context window versus at the beginning or end. For agents injecting multiple web sources, ordering matters: put the most relevant retrieved content last, immediately before the generation prompt.
Complex research queries require multiple sequential web interactions. GPT-Researcher (open source, 2023) implements a multi-agent pattern: a "Director" agent breaks the research question into sub-queries, dispatches them to parallel "Researcher" agents each running their own search-and-fetch loops, then a "Writer" agent synthesizes the results. This pattern reduced end-to-end latency by 3x compared to sequential single-agent research by parallelizing the search calls.
The key orchestration decision is whether sub-tasks run in parallel (faster but requires more context aggregation logic) or sequentially with information passing between steps (slower but allows each step to refine based on prior findings). For fact-finding tasks, parallel is preferred. For tasks with logical dependencies — "first find the CEO, then research their background" — sequential chaining is necessary.
Design a web research agent for a law firm that needs to monitor regulatory changes across three jurisdictions daily. Your design must address:
This lesson explores reliability & safety — examining the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.
Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to reliability & safety.