🎯 Advanced · Lesson 1 of 4

Search APIs: Architecture and Integration

How agents query the web programmatically — from raw HTTP calls to structured results pipelines.

In March 2023, Perplexity AI processed over 500 million queries in a single month by building a retrieval pipeline on top of Bing's Web Search API combined with their own crawling infrastructure. Rather than sending users to a list of blue links, their system issued structured API calls, extracted relevant snippets, re-ranked results by semantic similarity, and synthesized a grounded answer — all within roughly 1.5 seconds per query. The engineering insight was treating search not as a destination but as a data source: a structured feed of URLs, titles, and snippets that an LLM could reason over.

Search API Landscape

A search API translates a natural-language or keyword query into a structured JSON response. The major providers — Bing Web Search API, Google Custom Search JSON API, SerpAPI, Brave Search API, and Tavily — each expose slightly different schemas, rate limits, and data freshness guarantees. Choosing the right one for an agent is an architectural decision, not a UI choice.

Bing's API returns webPages.value[] objects with fields including name, url, snippet, dateLastCrawled, and cachedPageUrl. Google's Custom Search JSON API returns items[] with title, link, snippet, and a pagemap sub-object containing structured metadata. SerpAPI abstracts over both Google and Bing and normalizes the schema. Tavily, built specifically for LLM agents, returns pre-scored relevance alongside raw snippets and supports a search_depth: advanced mode that triggers full-page content extraction.

The critical technical difference for agent design is snippet quality vs. full content access. Standard search APIs return 150–300 character snippets optimized for human reading. Agents that need to cite sources accurately or synthesize detailed answers must follow URLs and fetch full page content — a separate operation with its own latency and token cost.

Architecture Note

Tavily's include_raw_content: true flag retrieves full extracted text alongside search results in a single API call, collapsing what would otherwise be a two-step fetch-and-parse pipeline. For latency-sensitive agents, this is a meaningful design tradeoff worth benchmarking.

Query Construction and Result Processing

Raw user intent rarely maps cleanly to an effective search query. The agent must perform query transformation — converting a conversational request like "what's the latest on OpenAI's valuation?" into a structured query like "OpenAI valuation" site:reuters.com OR site:bloomberg.com after:2024-01-01. This is a non-trivial step. Production systems at companies like Perplexity and You.com use a dedicated query-rewriting LLM call before the API request.

Result processing involves three layers: deduplication (removing near-identical snippets from syndicated sources), relevance re-ranking (scoring results against the original user intent, not just the transformed query), and provenance tracking (maintaining the URL-to-snippet mapping so citations survive the LLM's summarization step). Losing provenance is a common failure mode — the agent produces a confident answer but cannot tell the user where the information came from.

Query transformation: convert intent to optimized keyword/operator string
API call with pagination awareness (page 1 may not have the best result)
Snippet extraction: pull the most relevant 2–4 snippets, not all 10
Optional full-page fetch for sources requiring deeper content
Re-rank by semantic similarity to original user query
Inject top-k results into context with source URLs preserved

Real Failure Case

In November 2023, early versions of Google's Search Generative Experience (SGE) cited a satirical Onion article as factual health advice because the snippet-level relevance score was high but the source domain credibility check was absent. Production agents need explicit domain-trust filtering as a post-retrieval step, not just retrieval quality optimization.

Rate Limits, Caching, and Cost Management

Bing Web Search API's S1 tier allows 3 queries per second and 1,000 queries per month at $7. Tavily's free tier allows 1,000 searches per month; production usage is priced at $0.004 per search at scale. At those prices, an agent that issues 5 search calls per user session costs $0.02 per session in search costs alone — before LLM inference costs. Multiply by 100,000 daily active users and search API costs become a significant line item.

Production teams address this through semantic caching: storing the embedding of each query alongside its results, then returning cached results for new queries within a cosine similarity threshold (typically 0.92–0.95). Redis with vector search extensions or Qdrant are commonly used for this layer. The cache hit rate for a mature consumer product typically reaches 30–40%, substantially reducing API spend.

→ Lesson 1 Quiz

🎯 Advanced · Quiz 1

Quiz: Search APIs

3 questions — free, untracked, retake anytime.

1. What distinguishes Tavily's search_depth: advanced mode from standard search API calls?

✓ Correct — ✓ Correct. Tavily's advanced mode fetches and extracts full page text alongside snippets, collapsing the typical two-step fetch-and-parse workflow into one API call.

Not quite. Tavily's distinguishing feature is full-page content extraction in a single call — which collapses the separate URL-fetch step agents otherwise need.

2. What is "provenance tracking" in a search result processing pipeline?

✓ Correct — ✓ Correct. Provenance tracking ensures the original source URL stays attached to each piece of information so the agent can cite it accurately even after synthesizing an answer.

Incorrect. Provenance tracking specifically means keeping source URLs linked to their content snippets so citations are not lost during LLM summarization.

3. Why do production search-augmented agents use semantic caching?

✓ Correct — ✓ Correct. Semantic caching stores query embeddings and their results, returning cached responses when a new query exceeds a cosine similarity threshold — cutting API costs by 30–40% at scale.

Incorrect. Semantic caching's primary benefit is cost reduction: similar queries return cached results instead of triggering new API calls, cutting spend significantly at scale.

← Back to Lesson → Lab 1

🎯 Advanced · Lab 1

Lab: Search API Pipeline Design

Design a production-grade search retrieval pipeline for an LLM agent.

Your Challenge

You are designing the search retrieval layer for a financial news agent. The agent needs to answer real-time questions like "What are analysts saying about Nvidia's Q3 earnings?" with cited, accurate answers.

Decide which search API(s) to use and justify the choice
Design the query transformation step
Explain how you will handle provenance and citations
Describe your caching strategy

Walk me through your architecture decisions for a financial news search pipeline — which API, how you transform queries, how you preserve citations, and how you cache results.

🔬 Search API Lab AI Tutor Active

← Back to Quiz → Lesson 2

🎯 Advanced · Lesson 2 of 4

Browser Automation: Playwright, Puppeteer, and Agent Control

When search snippets are not enough — programmatic browser control for agents that need to interact with live web pages.

In January 2024, Cognition AI's Devin software engineering agent made headlines by autonomously navigating to GitHub, reading a repository's issue tracker, cloning code, running tests, and submitting a pull request — all through a headless browser and terminal integration. Devin used Playwright under the hood to control a Chromium instance, taking screenshots at each step for visual state verification. The architecture exposed a key insight: for tasks requiring multi-step web interaction, a headless browser is not a convenience feature but a fundamental capability layer.

Playwright and Puppeteer: Core Mechanics

Playwright (Microsoft, 2020) and Puppeteer (Google, 2018) both expose a programmatic API for controlling a headless browser. Playwright supports Chromium, Firefox, and WebKit with a unified API; Puppeteer targets Chromium/Chrome. Both operate by connecting to the browser's DevTools Protocol (CDP), which allows JavaScript execution, network interception, DOM inspection, screenshot capture, and user input simulation.

For agent use, the most important primitives are: page.goto(url) for navigation, page.waitForSelector(selector) for element readiness, page.fill(selector, text) and page.click(selector) for form interaction, page.screenshot() for visual state capture, and page.evaluate(fn) for arbitrary JS execution in page context. The challenge is that modern web apps render content dynamically — a naive goto + innerText extraction misses content loaded by JavaScript after initial page load.

Key Difference

Playwright's page.waitForLoadState('networkidle') waits until no network requests have fired for 500ms — a much more reliable signal for SPA content readiness than DOMContentLoaded, which fires before JavaScript frameworks render their content.

For content extraction, page.locator() with semantic selectors (ARIA roles, text content) is more resilient to site redesigns than XPath or CSS class selectors, which break whenever a developer renames a class. Production browser agents use a layered extraction strategy: try semantic selectors first, fall back to readability-based extraction (Mozilla's Readability library), and finally fall back to raw innerText with noise filtering.

Computer Use and Multimodal Browser Control

In October 2024, Anthropic released the Claude 3.5 Sonnet Computer Use API, which allows Claude to observe screenshots and emit low-level actions: mouse clicks at pixel coordinates, keyboard input, and scroll commands. This represents a fundamental shift from selector-based automation (which requires knowing the DOM structure) to vision-based automation (which works on any rendered interface). The agent sees the page as a bitmap and decides where to click by understanding the visual layout.

Google's Project Mariner, demonstrated in December 2024, took a similar approach using Gemini to control Chrome via screenshot observation. OpenAI's Operator product, announced January 2025, uses GPT-4o to control a browser through a cloud-hosted Chromium instance, with a human-in-the-loop confirmation step for actions involving payments or account changes.

Selector-based automation (Playwright/Puppeteer): fast, brittle, requires DOM knowledge
Vision-based automation (Claude Computer Use, Operator): slower, robust, works on any rendered UI
Hybrid: use selectors where reliable, fall back to vision for login/CAPTCHA/dynamic content
Screenshot loops: capture state → reason about next action → execute → repeat

Production Reality

Anthropic's Computer Use API documentation notes that vision-based control averages 3–8 seconds per action step due to screenshot capture, encoding, and LLM inference latency. A 10-step task therefore takes 30–80 seconds — acceptable for autonomous research tasks but too slow for real-time user-facing interactions.

Session Management and Anti-Bot Measures

Production browser agents face significant anti-bot infrastructure. Cloudflare's Bot Management product (used by over 30% of Fortune 1000 websites as of 2024) uses TLS fingerprinting, JavaScript challenge evaluation, mouse movement heuristics, and behavioral analysis to distinguish automated browsers from human users. Playwright and Puppeteer emit detectable signals by default — navigator.webdriver is set to true, CDP socket connections are observable, and default viewport sizes differ from typical user profiles.

Mitigations include: playwright-extra with the stealth plugin (patches the most common detection vectors), residential proxy rotation, and browser profile persistence (reusing cookies and localStorage to appear as a returning user). However, the arms race between automation detection and evasion is ongoing. Ethical agent design means respecting robots.txt, rate-limiting requests to avoid server overload, and not using automation to circumvent access controls or paywalls.

← Lab 1 → Lesson 2 Quiz

🎯 Advanced · Quiz 2

Quiz: Browser Automation

3 questions — free, untracked, retake anytime.

1. Why is page.waitForLoadState('networkidle') more reliable than DOMContentLoaded for SPA content extraction?

✓ Correct — ✓ Correct. DOMContentLoaded fires when HTML is parsed, but SPA frameworks render content via JavaScript after that point. networkidle waits for JS rendering to complete.

Incorrect. The key reason is that DOMContentLoaded fires before JavaScript frameworks render their content, so you'd miss dynamically loaded data. networkidle waits for rendering to finish.

2. What is the primary advantage of vision-based browser control (like Claude Computer Use) over selector-based automation?

✓ Correct — ✓ Correct. Vision-based control observes screenshots and clicks by visual understanding, so it works regardless of DOM structure — making it robust to site redesigns that break selector-based automation.

Incorrect. Vision-based control is actually slower. Its advantage is robustness: it works on any rendered interface without needing DOM knowledge, making it resilient to site changes.

3. Which signal makes default Playwright/Puppeteer instances detectable by anti-bot systems?

✓ Correct — ✓ Correct. navigator.webdriver=true is a standard WebDriver flag exposed to page JavaScript, and CDP connections have observable network signatures — both are standard detection vectors.

Incorrect. The primary giveaways are navigator.webdriver being set to true (visible to page JavaScript) and observable CDP socket connections that anti-bot systems can fingerprint.

← Back to Lesson → Lab 2

🎯 Advanced · Lab 2

Lab: Browser Automation Strategy

Choose the right automation approach for complex real-world agent tasks.

Your Challenge

You are building an agent that monitors competitor pricing on three e-commerce sites: one static HTML site, one React SPA, and one behind a login wall. For each, decide:

Which automation approach to use (selector-based, vision-based, or hybrid)
How to handle session persistence for the login-walled site
How to avoid triggering anti-bot defenses
What ethical constraints apply

Walk me through your automation approach for each of the three sites — and explain what ethical constraints shape your design choices.

🤖 Browser Automation Lab AI Tutor Active

← Back to Quiz → Lesson 3

🎯 Advanced · Lesson 3 of 4

Web-Capable Agent Architecture

How search and browsing tools fit into the broader agent loop — tool calling, memory, and orchestration patterns.

In September 2023, OpenAI re-enabled web browsing in ChatGPT after disabling it in July due to a vulnerability where the Bing-powered browsing tool could be manipulated through prompt injection in web pages — adversarial content that caused the agent to exfiltrate conversation context by encoding it in a URL the agent was instructed to navigate to. The fix required adding an output filter that blocked navigation to URLs containing patterns resembling encoded conversation data. This real incident illustrates that web tools dramatically expand an agent's attack surface: every web page it reads is potentially adversarial input.

Tool Calling Architecture for Web Access

In the OpenAI function calling / Anthropic tool use paradigm, web access is implemented as one or more named tools that the LLM can invoke by emitting a structured JSON block. A minimal web-capable agent typically exposes three tools: web_search(query: str) → list[SearchResult], fetch_url(url: str) → str, and optionally browser_action(action: BrowserAction) → Screenshot. The LLM decides when to call these based on whether its parametric knowledge is sufficient for the task.

The tool calling loop runs as follows: the LLM receives user message → determines whether web access is needed → emits a tool call → the orchestrator executes the tool and appends the result to context → the LLM reasons over the result and either calls another tool or generates a final response. The number of tool call cycles per query is bounded by a max_iterations parameter to prevent infinite loops.

ReAct Pattern

The ReAct (Reasoning + Acting) pattern, published by Yao et al. at Princeton in 2022, formalizes this loop by interleaving explicit reasoning traces ("Thought: I need to find the current price...") with tool calls ("Action: web_search(...)") and observations ("Observation: ..."). Production implementations at LangChain and LlamaIndex directly implement ReAct as the default agent executor pattern.

Context Window Management for Retrieved Content

Web retrieval creates a context management problem. A single webpage can contain 50,000–200,000 tokens of text. Even with 200K context windows, injecting multiple full pages alongside the conversation history and system prompt can saturate the context window, increase inference latency, and degrade model attention on the most relevant content.

Production systems address this through retrieved content compression: after fetching a URL, a lightweight LLM call (or a non-LLM extractive summarization model like DistilBERT-based MRC) extracts only the passages relevant to the current query. This "query-focused summarization" step reduces a full webpage to 300–800 tokens of relevant content. LangChain's MapReduce chain and LlamaIndex's SentenceWindowNodeParser implement variants of this approach.

Fetch → extract → compress → inject: never inject raw full-page HTML
Chunk with overlap: 512-token chunks with 64-token overlap for dense retrieval
Top-k injection: select only the top 3–5 most relevant chunks per source
Recency weighting: for news/financial queries, prefer results within 30 days
Source diversity: enforce that injected context spans at least 2 distinct domains

Lost in the Middle

Stanford and UC Berkeley research (Liu et al., 2023) demonstrated that LLMs perform significantly worse when the relevant information appears in the middle of a long context window versus at the beginning or end. For agents injecting multiple web sources, ordering matters: put the most relevant retrieved content last, immediately before the generation prompt.

Multi-Step Web Research Patterns

Complex research queries require multiple sequential web interactions. GPT-Researcher (open source, 2023) implements a multi-agent pattern: a "Director" agent breaks the research question into sub-queries, dispatches them to parallel "Researcher" agents each running their own search-and-fetch loops, then a "Writer" agent synthesizes the results. This pattern reduced end-to-end latency by 3x compared to sequential single-agent research by parallelizing the search calls.

The key orchestration decision is whether sub-tasks run in parallel (faster but requires more context aggregation logic) or sequentially with information passing between steps (slower but allows each step to refine based on prior findings). For fact-finding tasks, parallel is preferred. For tasks with logical dependencies — "first find the CEO, then research their background" — sequential chaining is necessary.

← Lab 2 → Lesson 3 Quiz

🎯 Advanced · Quiz 3

Quiz: Web-Capable Agent Architecture

3 questions — free, untracked, retake anytime.

1. What security vulnerability caused OpenAI to temporarily disable ChatGPT's web browsing in July 2023?

✓ Correct — ✓ Correct. Adversarial content on web pages could inject instructions causing the agent to encode conversation context into navigation URLs — a classic prompt injection + data exfiltration chain.

Incorrect. The vulnerability was prompt injection: malicious web pages contained instructions that caused the agent to encode and exfiltrate conversation data through URLs it was told to navigate to.

2. According to the "Lost in the Middle" research finding, where should the most relevant retrieved content be placed in the context window?

✓ Correct — ✓ Correct. Liu et al. (2023) showed LLMs attend better to content at the beginning and end of the context window. Placing the most relevant content last — just before the prompt — maximizes attention on it.

Incorrect. Research shows LLMs have a "lost in the middle" problem — content buried in the middle of a long context gets less attention. The most relevant content should go last, right before the generation prompt.

3. When is sequential (rather than parallel) sub-task execution preferred in a multi-step web research agent?

✓ Correct — ✓ Correct. Sequential execution is necessary when tasks depend on each other — like "find the CEO, then research their background" — where the second step needs the output of the first.

Incorrect. Sequential chaining is specifically needed when tasks have logical dependencies — each step uses information from the prior step. For independent fact-finding tasks, parallel is preferred for speed.

← Back to Lesson → Lab 3

🎯 Advanced · Lab 3

Lab: Agent Architecture Design

Design a web-capable agent with proper context management and security controls.

Your Challenge

Design a web research agent for a law firm that needs to monitor regulatory changes across three jurisdictions daily. Your design must address:

Which tools to expose and their exact function signatures
How you manage context window budget across multiple sources
How you defend against prompt injection in retrieved legal documents
Whether sub-tasks run in parallel or sequentially, and why

Design the architecture for a multi-jurisdiction regulatory monitoring agent — cover your tool set, context management strategy, security controls, and orchestration pattern.

🔗 Agent Architecture Lab AI Tutor Active

← Back to Quiz → Lesson 4

Building AI Agents III — Tools · Module 2 · Lesson 4

Reliability & Safety

Advanced concepts, real-world applications, and practical implications

Core Concepts

This lesson explores reliability & safety — examining the key principles, real-world applications, and implications for practitioners working in this domain.

Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.

Practical Applications

The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.

Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.

Looking Forward

As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.

Lesson 4 Quiz

Reliability & Safety

What is the primary focus of Reliability & Safety?

✓ Correct — Correct. This lesson bridges theory and practice, focusing on real-world implementation.

Review the lesson — the focus is on connecting frameworks to practical reality.

Why does real-world deployment introduce challenges that pure theory doesn't capture?

✓ Correct — Correct. Real deployment requires judgment, not just framework application.

Practice doesn't invalidate theory — it reveals complexities that require nuanced application of theoretical principles.

What separates effective practitioners from those who merely follow checklists?

✓ Correct — Correct. Critical thinking and adaptability matter more than memorized procedures.

The key differentiator is critical thinking ability, not experience or resources alone.

🎯 Advanced · Lesson 4 Lab

Lab: Apply What You've Learned

Synthesize concepts from Reliability & Safety through guided AI conversation

Your Task

Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to reliability & safety.

Try: "How would the concepts from this lesson apply to a real-world scenario in this field?"

🤖 AESOP Lab Assistant Lesson 4 Lab

Module 2 Test

Web Search and Browsing Tools · 15 Questions · 70% to Pass

Score: 0/15

1. What is the core objective of Web Search and Browsing Tools?

2. How should practitioners approach applying concepts from this module?

3. Which best describes the relationship between theory and practice in Building AI Agents III — Tools?

4. What distinguishes expert practitioners from novices in this field?

5. How does Web Search and Browsing Tools build on previous modules?

6. What role do constraints play in practical implementation?

7. When applying frameworks from this module, what is most important?

8. How should practitioners handle conflicting perspectives in this field?

9. What makes the concepts in Web Search and Browsing Tools relevant beyond their immediate context?

10. How should practitioners continue developing expertise after completing this module?

11. What is the relationship between understanding Building AI Agents III — Tools concepts and making decisions?

12. How do the lessons from this module apply to novel situations?

13. What is the value of understanding multiple perspectives on {course_title}?

14. How should practitioners evaluate new information or developments in this field?

15. What is the ultimate goal of learning Web Search and Browsing Tools?