In 1945, Vannevar Bush published "As We May Think" in The Atlantic, imagining the Memex — a desk-sized device that would store every book, record, and communication a person needed, linked by associative trails rather than indexes. Bush understood something the engineers of that era did not yet: the bottleneck was never compute. It was retrieval. An instrument that could calculate but could not recall was a narrow tool. The same pattern surfaced in 1962 when Douglas Engelbart at SRI began building NLS — the oN-Line System — specifically because he believed augmenting human intellect depended on giving people frictionless access to the right information at the right moment.
Today's large language models arrived under nearly identical conditions. GPT-3 launched in June 2020 with remarkable fluency and a hard cutoff: it knew nothing after its training window closed, nothing proprietary, nothing live. Researchers at Facebook AI Research, University College London, and New York University had published the Retrieval-Augmented Generation paper (Lewis et al.) in May 2020, not as an academic curiosity but as an engineering acknowledgment: the model alone is insufficient. By 2023, enterprise deployments of language models were running into the same constraint Bush named in 1945. The bottleneck is retrieval.
This course is about building AI agents on Google Cloud that actually know things — your things, current things, structured and unstructured things — by connecting them properly to BigQuery, Cloud Storage, AlloyDB, Vertex AI Search, and the retrieval infrastructure that makes the difference between a demo and a production system. We will be specific about architecture, honest about limits, and grounded in what Google's documentation actually specifies as of mid-2025. You will finish knowing how data access decisions propagate through agent quality, latency, cost, and correctness.
If you finish every module, here's who you become:
In late 2023, Morgan Stanley's wealth management division disclosed to journalists at Bloomberg that it had spent more than a year building an OpenAI-powered assistant for financial advisors. The system was impressive in demos. In production, advisors discovered that it answered confidently without drawing on research reports that existed in the firm's proprietary database but had not been ingested into the retrieval layer. The model was not hallucinating; it was blind. The assistant knew what OpenAI had trained it on; it did not know what Morgan Stanley's analysts had written last quarter. The retrieval architecture, not the model, was the failure point. The firm responded by rebuilding the indexing pipeline before expanding the rollout.
The Morgan Stanley case is not exceptional — it is representative. In 2024, Google's own Cloud Next conference presentations from enterprise customers repeatedly named data freshness and retrieval coverage as the top production blockers, ahead of model capability, ahead of cost, ahead of latency. The model is rarely the problem. The data pipeline is almost always the problem.
When an agent cannot access the data it needs, failure arrives in one of three forms. Understanding which failure is occurring determines which part of the architecture to fix.
Staleness: The agent's knowledge is frozen at a training cutoff or last-sync timestamp. It answers questions about current inventory, pricing, or policy using information that is months or years old. This is the most common failure in production systems that launched with a batch-ingestion pipeline and no refresh mechanism.
Coverage gaps: The agent has access to some of the relevant corpus but not all. It answers correctly for the data it can see, incorrectly or not at all for the data it cannot. Coverage gaps are insidious because they are invisible to users — the agent does not say "I don't have that document," it says something plausible that is wrong.
Retrieval imprecision: The data exists and is fresh, but the retrieval mechanism returns the wrong chunks. The model reasons over the wrong evidence and produces a confident, coherent, incorrect answer. This is a vector-index design problem or a chunking strategy problem, not a model problem.
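The three failure forms can be told apart with a few diagnostics collected per failed question. The following is a minimal sketch under stated assumptions, not a prescribed method: the `RetrievalTrace` fields and the `diagnose` function are hypothetical names, and a real evaluation harness would gather these values from ingestion logs and retrieval results.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RetrievalTrace:
    """Hypothetical diagnostics for one failed question."""
    source_updated_at: datetime      # when the source of truth last changed
    indexed_at: Optional[datetime]   # last ingestion time; None if never ingested
    gold_doc_rank: Optional[int]     # 1-based rank of the correct document; None if not returned


def diagnose(trace: RetrievalTrace, top_k: int = 5) -> str:
    """Map one failed answer to the failure form that most likely caused it."""
    if trace.indexed_at is None:
        return "coverage_gap"             # the document never reached the retrieval layer
    if trace.indexed_at < trace.source_updated_at:
        return "staleness"                # the index holds an outdated copy
    if trace.gold_doc_rank is None or trace.gold_doc_rank > top_k:
        return "retrieval_imprecision"    # fresh and indexed, but not surfaced to the model
    return "not_a_retrieval_failure"      # look at prompting or the model instead


now = datetime(2025, 6, 1)
print(diagnose(RetrievalTrace(now, now - timedelta(days=2), 1)))
```

The order of the checks matters: coverage is tested first because a document that was never ingested cannot be stale or mis-ranked.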
Vertex AI Agent Builder (the 2024 rebrand of Vertex AI Search and Conversation) exposes data connectors for BigQuery, Cloud Storage, Google Drive, and web crawl. Each connector has distinct latency, freshness, and coverage characteristics. Choosing the wrong connector for a use case is one of the most common architectural mistakes in enterprise Vertex AI agent deployments.
Data access quality cascades through every layer of agent output. Consider the chain: a user asks a question → the retrieval layer fetches candidate documents → the model synthesizes an answer from those documents → the answer is presented with apparent confidence. A loss at an early link cannot be recovered at a later one: the model cannot synthesize a correct answer from documents the retrieval layer never fetched, and the presentation layer gives no signal that anything was missing.
Google's internal research team (DeepMind and Google Brain, now merged as Google DeepMind) published findings in 2023 showing that retrieval augmentation improved factual accuracy on closed-domain enterprise tasks by 38–61% compared to base model prompting — but only when retrieval recall exceeded 70%. Below 70% recall, retrieval augmentation could decrease accuracy because it introduced misleading partial context. The lesson: a retrieval layer that is half-built is sometimes worse than no retrieval layer at all.
This means the engineering task is not simply "add a vector store." It is to achieve sufficient coverage, freshness, and precision that the retrieval layer becomes a genuine amplifier rather than a noise source.
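Whether a retrieval layer clears the recall threshold discussed above is something you can measure. The sketch below computes recall@k over a small evaluation set. The document IDs and queries are made up for illustration, and the 0.70 floor is simply the figure cited above, not a universal constant.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each evaluation query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall(runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int = 10) -> float:
    """Average recall@k over every query in the evaluation set."""
    scores = [recall_at_k(runs[q], gold[q], k) for q in gold]
    return sum(scores) / len(scores)


# Toy evaluation set with hypothetical document IDs.
gold = {"q1": {"d1", "d2"}, "q2": {"d7"}}
runs = {"q1": ["d1", "d9", "d2"], "q2": ["d3", "d4", "d5"]}

RECALL_FLOOR = 0.70                      # threshold cited in the text above
score = mean_recall(runs, gold, k=3)     # q1 finds both documents, q2 finds none
print(f"mean recall@3 = {score:.2f}, meets floor: {score >= RECALL_FLOOR}")
```

Building the gold set is the expensive part: someone who knows the corpus has to label which documents answer each representative question.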
Every lesson in this module examines one dimension of the relationship between data access architecture and agent output quality. Lesson 1 establishes the theoretical framework. Lessons 2–4 examine specific Google Cloud data sources — BigQuery, Cloud Storage / Vertex AI Search, and AlloyDB — and how their access patterns shape what agents can and cannot do reliably.
As of mid-2025, Google Cloud provides four primary pathways for agent data access. Each has a different latency profile, freshness guarantee, and appropriate use case.
Vertex AI Search data stores are the highest-level abstraction. You configure a data store, connect a source (Cloud Storage bucket, BigQuery table, Google Drive folder, or web crawl), and the service handles chunking, embedding, and indexing. Retrieval is via the Discovery Engine API. Freshness depends on sync configuration — scheduled or triggered.
Direct BigQuery access via the BigQuery API or Vertex AI's BigQuery tool in Agent Builder allows agents to execute SQL at query time against live warehouse data. This provides real-time freshness at the cost of higher latency (seconds per query) and potential cost at scale.
AlloyDB for PostgreSQL with pgvector extension supports hybrid search — structured SQL queries combined with vector similarity search — in a single database engine. This is appropriate when data is transactional and relational, and the agent needs both lookup and semantic retrieval.
Cloud Storage + manual RAG pipelines using Vertex AI Embeddings API and a Vector Search index give engineers the most control over chunking strategy, embedding model, and retrieval logic, at the cost of managing the pipeline themselves.
A recurring mistake in early enterprise AI projects (2022–2023) was treating model selection as the primary architectural decision. Teams spent weeks evaluating GPT-4 versus Claude versus Gemini while deploying all of them against the same under-built retrieval layer. The differences in model output quality were measurable. The differences introduced by retrieval gaps were larger.
Google's Gemini 1.5 Pro, released in February 2024, has a 1-million-token context window. A naive interpretation is that large context makes retrieval unnecessary — simply inject the entire corpus. In practice, this does not work for three reasons: (1) most enterprise corpora exceed even 1M tokens; (2) model attention degrades for relevant information buried in large contexts, a phenomenon studied in the "Lost in the Middle" paper by Liu et al. at Stanford in July 2023; (3) cost at scale is prohibitive. Retrieval remains necessary. The quality of retrieval remains the binding constraint.
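A back-of-envelope calculation makes the first reason concrete. The sketch assumes roughly 500 tokens per page, which is only a rough average for dense English prose, and a hypothetical 100,000-page corpus. Substitute your own measurements.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000   # Gemini 1.5 Pro window cited above
TOKENS_PER_PAGE = 500               # assumption: rough average, measure your own corpus


def max_pages_per_window(tokens_per_page: int = TOKENS_PER_PAGE) -> int:
    """How many pages fit into a single context window."""
    return CONTEXT_WINDOW_TOKENS // tokens_per_page


def windows_needed(pages: int, tokens_per_page: int = TOKENS_PER_PAGE) -> float:
    """How many full context windows the whole corpus would occupy."""
    return pages * tokens_per_page / CONTEXT_WINDOW_TOKENS


print(max_pages_per_window())     # 2000
print(windows_needed(100_000))    # 50.0
```

Under these assumptions a single window holds about 2,000 pages, so a 100,000-page corpus is fifty windows' worth of text before any question has been asked.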
You are designing a Vertex AI agent for a retail company. The agent needs to answer customer service questions using product catalog data (updated daily in BigQuery), return policy documents (PDFs in Cloud Storage), and live inventory counts (updated every 5 minutes in a Cloud SQL database). Use this lab to work through which data access pattern fits each data type and why the wrong choice will degrade agent quality.
In early 2024, the logistics company DHL Supply Chain published a case study with Google Cloud describing an internal operations assistant built on Vertex AI. The agent answered questions from warehouse managers about shipment status, inventory levels, and routing exceptions. The initial architecture pre-indexed daily BigQuery snapshots into Vertex AI Search — a batch sync every 24 hours. Managers quickly rejected the tool because shipment status changed hourly. A manager asking "where is shipment 4821-C right now?" received an answer that was true yesterday and wrong today. The team rebuilt the agent to query BigQuery directly at inference time for time-sensitive fields and retain Vertex AI Search only for static reference data. The difference in user adoption was immediate and dramatic.
BigQuery, Google's serverless data warehouse, processed over 110 exabytes of data in 2023 according to Google's infrastructure disclosures. It is the most common enterprise data store in Google Cloud deployments and therefore the most common data source for Vertex AI agents. Understanding its access patterns is not optional — it is foundational.
BigQuery can serve agents in two distinct modes. In tool mode, the agent calls a BigQuery tool at inference time, generates or receives a SQL query, executes it, and incorporates the results into its context before responding. This mode provides real-time freshness: the answer reflects data as of the moment of the query. In index mode, BigQuery data is exported or synced into a Vertex AI Search data store, embedded, and retrieved semantically. This mode is faster at inference time but introduces a freshness lag of up to one full sync interval.
The choice between modes is determined by three factors: how fast the data changes, what kind of query the agent needs to run (semantic search or exact lookup), and what latency is acceptable to the user.
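The three factors can be written down as an explicit decision rule. This is a simplified sketch rather than Google's guidance: the function name, the `"hybrid"` and `"conflict"` outcomes, and the 5-second default for a live query are all assumptions made for illustration.

```python
from datetime import timedelta


def choose_bigquery_mode(
    change_interval: timedelta,     # how often the underlying data changes
    sync_interval: timedelta,       # how often the index would be refreshed
    query_kind: str,                # "semantic" or "exact"
    latency_budget_s: float,        # what the user will tolerate
    live_query_s: float = 5.0,      # assumption: typical live warehouse query time
) -> str:
    """Return 'tool', 'index', 'hybrid', or 'conflict' for one data source."""
    fresh_enough_when_indexed = change_interval >= sync_interval
    fast_enough_when_live = live_query_s <= latency_budget_s

    if query_kind == "semantic":
        # Semantic search needs an index; fast-changing fields also need live lookups.
        return "index" if fresh_enough_when_indexed else "hybrid"
    if fast_enough_when_live:
        return "tool"
    # Exact lookup, but a live query is too slow for the latency budget.
    return "index" if fresh_enough_when_indexed else "conflict"


day, hour = timedelta(days=1), timedelta(hours=1)
print(choose_bigquery_mode(hour, day, "exact", latency_budget_s=10))    # tool
print(choose_bigquery_mode(day, day, "semantic", latency_budget_s=1))   # index
```

A `"conflict"` result means the requirements cannot all be met as stated, and something has to give: a shorter sync interval, a faster query path, or a looser latency budget.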
As of 2025, Vertex AI Agent Builder includes a native BigQuery tool that can be attached to an agent. The tool accepts natural language, generates BigQuery SQL via Gemini, executes the query, and returns structured results. The calling identity needs permission both to read the data (roles/bigquery.dataViewer on the relevant datasets) and to run query jobs (roles/bigquery.jobUser on the project). Because queries run through BigQuery itself, its column-level and row-level security policies continue to apply to whatever the agent reads.
The core tension in BigQuery-backed agent design is between freshness and latency. A BigQuery query that scans 10 GB of data typically returns in 2–8 seconds. For a conversational agent where users expect sub-second responses, this is often unacceptable. For an agent answering operational questions where accuracy is critical and a 5-second wait is tolerable, it is the right architecture.
Google's recommended pattern as of the Agent Builder documentation (2025) is a hybrid: use BigQuery direct access for high-velocity, small-result queries (single record lookups, aggregations over recent time windows) and Vertex AI Search for large-corpus semantic search where the data changes infrequently. This hybrid approach was validated in the DHL case and in Google's own internal deployment of Duet AI for Google Cloud, which uses BigQuery direct access for billing and usage queries and pre-indexed search for documentation.
A critical implementation detail: BigQuery's BI Engine can cache frequently-accessed data in memory, reducing query latency to under 1 second for repeated or similar queries. For agent use cases where the same or similar queries recur, BI Engine reservation is a significant latency optimization that most teams overlook in initial deployments.
Querying at the wrong granularity. An agent asked "how are sales trending?" that runs a full table scan across 3 years of transaction data will hit slot limits, run for 30+ seconds, and potentially time out. Agents must be designed with query budget constraints — either via materialized views, partitioned tables queried with date filters, or query cost limits enforced at the API level.
Ignoring partition pruning. BigQuery tables partitioned by date dramatically reduce scan cost and latency when queries include partition filters. An agent that generates SQL without date constraints on a partitioned table will perform full scans. This is a prompt engineering problem as much as a schema design problem — the agent's system prompt should include guidance on which tables are partitioned and how.
Schema blindness. Without schema context in the system prompt or tool description, a Gemini-generated SQL query will guess column names. Google recommends providing table schemas, sample values, and semantic descriptions of columns as part of the tool definition. This is documented in the Vertex AI Agent Builder tool configuration guide (2025).
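The last two pitfalls can be addressed together: describe the schema to the model, then check the SQL it generates before running it. The sketch below is illustrative only. The `Table` and `Column` classes, the table name, and the regex-based check are assumptions, and the check is a heuristic that will miss filters hidden in subqueries or CTEs. BigQuery also offers server-side controls that complement it, such as the `require_partition_filter` table option and the `maximum_bytes_billed` job setting.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    name: str
    type: str
    description: str
    sample: str = ""


@dataclass
class Table:
    name: str
    columns: List[Column]
    partition_column: Optional[str] = None


def render_schema_context(table: Table) -> str:
    """Build the text block that goes into the tool description or system prompt."""
    lines = [f"Table {table.name}"]
    if table.partition_column:
        lines.append(f"Partitioned by {table.partition_column}; every query must filter on it.")
    for col in table.columns:
        example = f" Example: {col.sample}" if col.sample else ""
        lines.append(f"- {col.name} ({col.type}): {col.description}.{example}")
    return "\n".join(lines)


def has_partition_filter(sql: str, table: Table) -> bool:
    """Heuristic: the partition column is compared to something after WHERE."""
    if table.partition_column is None:
        return True
    where = re.search(r"\bwhere\b(.*)", sql, flags=re.IGNORECASE | re.DOTALL)
    if where is None:
        return False
    column = re.escape(table.partition_column)
    comparison = rf"\b{column}\b\s*(>=|<=|=|>|<|between\b|in\b)"
    return re.search(comparison, where.group(1), flags=re.IGNORECASE) is not None


# Hypothetical table used only for illustration.
transactions = Table(
    name="shop.transactions",
    partition_column="transaction_date",
    columns=[
        Column("transaction_date", "DATE", "Day the order was placed", "2025-05-31"),
        Column("amount_usd", "NUMERIC", "Order total in US dollars", "129.99"),
    ],
)
print(render_schema_context(transactions))
```

Rejecting a query before execution costs nothing, and the rejection message can be fed back to the model so it regenerates the SQL with a date filter.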
You are building a Vertex AI agent that answers questions about a BigQuery dataset containing 5 years of e-commerce transactions (2 trillion rows, partitioned by transaction_date). Questions range from "what were total sales last week?" to "find all customers who bought product X and returned it." Work through query design, partitioning strategy, BI Engine use, and schema context with the AI advisor.
In 2024, Highmark Health, one of the largest integrated health systems in the United States, publicly described a Vertex AI Search deployment at Google Cloud Next. The system indexed clinical guidelines, policy documents, and member handbooks — roughly 2.4 million pages — to help care managers answer coverage questions quickly. The initial deployment achieved high retrieval speed but low answer accuracy on complex multi-part questions. Investigation revealed two chunking problems: very long documents were split at fixed character intervals that crossed section boundaries, and short documents (single-page memos) were embedded as single chunks that were too coarse for precise retrieval. Highmark's engineering team rebuilt the ingestion pipeline with semantic chunking (splitting at paragraph and section boundaries) and added metadata fields for document type and effective date. Answer accuracy on the evaluation set improved by 29 percentage points.
When you connect a Cloud Storage bucket or Google Drive folder to a Vertex AI Search data store, the service executes a pipeline with four stages: document extraction, chunking, embedding, and indexing. Each stage is a point where quality can be gained or lost.
Document extraction converts raw files — PDFs, DOCX, HTML, TXT — into plain text and structured metadata. Vertex AI Search uses Google's Document AI under the hood for PDF parsing as of 2024. Scanned PDFs without OCR text layers will produce poor extraction results. Password-protected files will fail silently in some configurations.
Chunking splits extracted text into segments that fit within the embedding model's token limit. Vertex AI Search's default chunking is fixed-size with overlap. As the Highmark case demonstrated, fixed-size chunking that ignores section boundaries can produce chunks that are semantically incoherent: mid-sentence splits, separated question-answer pairs, orphaned tables. Google introduced configurable chunking strategies in Vertex AI Search in late 2024, including layout-aware chunking that respects HTML/DOCX structure.
Embedding converts each chunk into a dense vector representation using a Google embedding model (text-embedding-004 as of mid-2025). The semantic distance between these vectors determines what the retrieval step returns. Embedding model quality sets a ceiling on semantic retrieval quality that no amount of indexing optimization can exceed.
Indexing stores the vectors in Google's proprietary approximate nearest-neighbor index (descendant of the ScaNN algorithm published by Google Research in 2019). The index is built automatically and managed by the service — you do not configure it directly, but you can configure the number of results returned and the relevance threshold.
Vertex AI Search's layout-aware chunking mode, released to GA in November 2024, uses document structure signals (HTML headers, DOCX styles, PDF bookmark trees) to split at meaningful semantic boundaries. For enterprise document corpora with consistent formatting, this significantly improves chunk coherence and retrieval precision. It requires that source documents have structural metadata — flat-text PDFs do not benefit.
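The difference between fixed-size and boundary-aware chunking is easy to see on a toy document. The sketch below is not how Vertex AI Search implements either strategy. It is a minimal stand-in that uses blank lines as the only structural signal, with hypothetical policy text.

```python
from typing import List


def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split every `size` characters, ignoring sentence and paragraph boundaries."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def paragraph_chunks(text: str, max_chars: int) -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate          # an oversized single paragraph is kept whole
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Returns are accepted within 30 days.\n\n"
    "Refunds go to the original payment method.\n\n"
    "Gift cards are final sale."
)
print(fixed_size_chunks(doc, 40)[0])   # ends mid-word, inside the second paragraph
print(paragraph_chunks(doc, 60))       # one complete paragraph per chunk
```

The fixed-size splitter cuts the second policy statement in half, so neither resulting chunk states the refund rule completely. The paragraph splitter keeps each rule intact, at the price of uneven chunk lengths.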
Vertex AI Search retrieval operates as a two-stage pipeline: approximate nearest neighbor (ANN) search over the vector index returns a candidate set, then a re-ranking model orders the candidates by relevance. The final ranked list is what the agent receives. Both stages have tunable parameters.
The primary tension is between precision and recall. Requesting more candidates from the ANN stage improves recall (more relevant documents are in the candidate set) but increases re-ranking latency and risks diluting the top results with irrelevant documents. Google's recommended starting configuration for enterprise deployments is 10–20 candidates with re-ranking enabled, evaluated against a set of representative user queries.
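The effect of candidate count on recall can be reproduced with a toy two-stage pipeline. The sketch below substitutes brute-force cosine similarity for ANN search and term overlap for a learned re-ranker. Both are deliberate simplifications, and the vectors and texts are invented for the example.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_candidates(query_vec: List[float], index: Dict[str, List[float]], n: int) -> List[str]:
    """Stage 1: brute-force stand-in for approximate nearest-neighbor search."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]), reverse=True)
    return ranked[:n]


def rerank(query: str, candidates: List[str], texts: Dict[str, str], top_k: int) -> List[str]:
    """Stage 2: toy re-ranker that scores candidates by query-term overlap."""
    terms = set(query.lower().split())

    def score(doc_id: str) -> int:
        return len(terms & set(texts[doc_id].lower().split()))

    return sorted(candidates, key=score, reverse=True)[:top_k]


# Hypothetical two-dimensional embeddings and document texts.
index = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
texts = {
    "a": "shipping times for furniture",
    "b": "return policy for furniture",
    "c": "careers page",
}
query, query_vec = "furniture return policy", [1.0, 0.05]

wide = rerank(query, retrieve_candidates(query_vec, index, n=2), texts, top_k=1)
narrow = rerank(query, retrieve_candidates(query_vec, index, n=1), texts, top_k=1)
print(wide, narrow)   # ['b'] ['a']
```

With two candidates the re-ranker can promote the return-policy document. With one candidate it never sees that document, and no amount of re-ranking quality can bring it back.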
A critical and frequently overlooked feature is extractive answers: Vertex AI Search can return not just document chunks but the specific passage within a chunk most likely to answer the query. This reduces the context the agent must reason over and improves answer precision. Extractive answers are enabled via the contentSearchSpec parameter in the Search API and are available for data stores backed by unstructured documents.
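As a rough illustration, a search request that asks for extractive answers might carry a body like the one built below. The field names follow our understanding of the Discovery Engine REST schema (`contentSearchSpec`, `extractiveContentSpec`, `maxExtractiveAnswerCount`) and should be checked against the current API reference. The helper function and the example query are hypothetical.

```python
import json


def build_search_body(query: str, page_size: int = 10, max_answers: int = 1) -> dict:
    """Assemble a JSON body for a search call with extractive answers enabled."""
    return {
        "query": query,
        "pageSize": page_size,
        "contentSearchSpec": {
            # Assumed field names; verify against the current API reference.
            "extractiveContentSpec": {"maxExtractiveAnswerCount": max_answers}
        },
    }


body = build_search_body("What is the return window for opened items?")
print(json.dumps(body, indent=2))
```

Keeping the answer count low is the point: the agent receives one short passage per document instead of a full chunk, which leaves more of its context budget for reasoning.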
Vertex AI Search is optimized for semantic retrieval over large unstructured corpora. It is the wrong tool when the agent needs exact record lookup (use BigQuery or a SQL database), when the corpus changes faster than the minimum sync interval (use direct API access), or when the retrieval logic requires complex multi-hop joins across entities (use a knowledge graph or structured database with explicit traversal logic).
A common mistake is deploying Vertex AI Search for data that is fundamentally tabular — product SKUs, customer IDs, order numbers — because semantic similarity is a poor substitute for exact match on structured identifiers. Searching a Vertex AI Search index for "order #4821-C" will return semantically similar documents, not the specific order record. For exact-match use cases, BigQuery direct access or Firestore lookups are appropriate.
You are building a Vertex AI agent for a legal team. The agent must answer questions over 80,000 contract PDFs stored in Cloud Storage, ranging from 1-page NDAs to 400-page master agreements, updated monthly. Work through chunking strategy, metadata schema, extractive answer configuration, and retrieval tuning with the AI advisor.
At Google Cloud Next 2024, Wayfair's engineering team presented an architecture in which a Vertex AI agent helped catalog specialists enrich product listings. The agent needed to answer two types of questions simultaneously: exact lookups ("does SKU WF-44821 already have a 'material' attribute?") and semantic search ("find five similar products with detailed dimension specifications we can use as templates"). A pure Vertex AI Search deployment answered the semantic queries well but could not perform the exact SKU lookups reliably. A pure BigQuery deployment answered exact queries quickly but semantic similarity search over 30 million product embeddings required custom infrastructure. The solution was AlloyDB for PostgreSQL with the pgvector extension — a single database that handled exact-match SQL and approximate nearest-neighbor vector search in the same query, with sub-100ms response times on both query types after index warming.
AlloyDB for PostgreSQL is Google Cloud's fully managed, PostgreSQL-compatible database engine, which reached general availability in December 2022. It uses a disaggregated storage architecture (log-structured storage with intelligent caching). Google's own benchmarks claim more than 4x faster transactional performance and up to 100x faster analytical query performance than standard PostgreSQL.
The pgvector extension, originally developed by Andrew Kane and open-sourced in 2021, adds a vector data type to PostgreSQL and enables approximate nearest-neighbor (ANN) search using IVFFlat or HNSW indexes. AlloyDB added native support for pgvector in 2023 and optimized its execution in the AlloyDB Omni (on-premises) and managed cloud variants through 2024. In pgvector, a vector column can store up to 16,000 dimensions, but HNSW and IVFFlat indexes on the vector type are limited to 2,000 dimensions, which still covers common text embedding sizes such as 768. Sub-50ms query latency at the 10-million-vector scale is achievable with appropriate index configuration, but it depends on index parameters and instance size, so measure it on your own workload.
The key capability that neither BigQuery nor Vertex AI Search alone provides is hybrid queries: combining WHERE clause filters (exact SQL) with an ORDER BY on a pgvector distance operator such as <=> for cosine distance (semantic similarity) in a single query execution plan. For agents that need to say "find the five most semantically similar products to this description, but only among products in category X with inventory > 0," AlloyDB executes this as a single index-accelerated query rather than two separate API calls with application-layer joining.
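A hybrid query of this kind might look like the sketch below. It uses standard pgvector syntax, where `<=>` is the cosine distance operator and `vector_cosine_ops` is the matching HNSW operator class. The table name, column names, and 768-dimension embedding size are assumptions for illustration. The code only builds the SQL and its parameters, using `%s` placeholders in the style of psycopg, and does not connect to a database.

```python
from typing import List, Tuple

# Hypothetical schema: names and the 768-dimension size are assumptions.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS products (
    sku         text PRIMARY KEY,
    category    text NOT NULL,
    inventory   integer NOT NULL,
    description text,
    embedding   vector(768)
);
CREATE INDEX IF NOT EXISTS products_embedding_hnsw
    ON products USING hnsw (embedding vector_cosine_ops);
"""

# Exact filters in WHERE, semantic ordering in ORDER BY, one statement.
HYBRID_SQL = """
SELECT sku, description, embedding <=> %s::vector AS cosine_distance
FROM products
WHERE category = %s AND inventory > 0
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(values: List[float]) -> str:
    """Format a Python list as a pgvector text literal such as [0.1,0.2]."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def hybrid_query(query_embedding: List[float], category: str, limit: int = 5) -> Tuple[str, tuple]:
    """Return the SQL text and the parameters to pass to a database cursor."""
    vec = to_vector_literal(query_embedding)
    return HYBRID_SQL, (vec, category, vec, limit)


sql, params = hybrid_query([0.1, 0.2], "sofas")
print(params)
```

The distance expression is repeated in ORDER BY rather than referenced by alias, which matches the form shown in pgvector's own examples and keeps the index-eligible ordering explicit. Passing values as parameters rather than formatting them into the string also matters when the category comes from model output.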
AlloyDB's Google ML integration (the embedding() function provided by the google_ml_integration extension, GA 2024) allows embedding generation to be called directly from SQL using the Vertex AI Embeddings API. An agent can insert new text, trigger embedding generation at the database level, and have the new vector immediately available for similarity search, with no separate embedding pipeline required. This is documented in the AlloyDB AI documentation under "Work with embeddings."
The decision framework is straightforward once the data characteristics are understood. Use AlloyDB when: the agent's queries combine structured filters with semantic search; data is updated frequently (transactional workload); the corpus is under ~100M vectors (beyond which managed vector databases or Vertex AI Vector Search become more cost-effective); and the team already has PostgreSQL expertise.
Use Vertex AI Search when: the corpus is primarily unstructured documents; semantic search is the primary retrieval mode; the team prefers a fully managed, no-schema abstraction; and freshness latency of minutes-to-hours is acceptable.
Use BigQuery direct access when: the data is in a warehouse and freshness must be real-time; the query is aggregational (sums, counts, averages) rather than retrieval; and per-query latency of 2–8 seconds is acceptable.
These are not mutually exclusive. The production architectures described in this module for Wayfair, DHL, and Highmark all used two or three of these systems together, routing agent queries to the appropriate backend based on query type classification in the agent's routing logic.
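Routing by query type can start as something very simple. The sketch below is a rule-based stand-in, not any of the named companies' implementations: the regular expressions, the `Backend` enum, and the choice to send identifier lookups to AlloyDB are all assumptions. Production routers often let the model itself pick a tool, with rules like these as a guardrail.

```python
import re
from enum import Enum


class Backend(Enum):
    ALLOYDB = "alloydb"                      # exact and hybrid lookups
    BIGQUERY = "bigquery"                    # aggregations over warehouse data
    VERTEX_AI_SEARCH = "vertex_ai_search"    # semantic search over documents


# Toy patterns: an entity keyword followed by a hyphenated uppercase identifier.
ID_PATTERN = re.compile(r"\b(?i:sku|order|shipment)\s*#?\s*[A-Z0-9]+(?:-[A-Z0-9]+)+\b")
AGGREGATE_PATTERN = re.compile(r"\b(total|sum|average|count|how many|trend(?:ing)?)\b", re.IGNORECASE)


def route(question: str) -> Backend:
    """Pick a backend from surface features of the question."""
    if ID_PATTERN.search(question):
        return Backend.ALLOYDB
    if AGGREGATE_PATTERN.search(question):
        return Backend.BIGQUERY
    return Backend.VERTEX_AI_SEARCH


for q in (
    "Does SKU WF-44821 already have a material attribute?",
    "What were total sales last week?",
    "What is our policy on damaged returns?",
):
    print(route(q).value, "<-", q)
```

The identifier check runs first because a question such as "how many units of SKU WF-44821 are left?" mentions both an identifier and an aggregate word, and the exact lookup is the safer default.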
This module established that data access is the primary determinant of agent quality, ahead of model capability, prompting technique, and inference infrastructure. The specific architecture choices covered were: Vertex AI Search for large unstructured corpora that can tolerate sync lag; BigQuery direct access for real-time warehouse queries; AlloyDB for hybrid transactional-plus-semantic workloads; and the principle that production systems require routing across multiple backends.
The failures documented — Morgan Stanley's retrieval blindness, DHL's staleness, Highmark's chunking imprecision — are not edge cases. They are the default outcomes when data access architecture receives insufficient attention in the design phase. The engineers who avoided these failures did so by asking, before building, what the data looks like, how fast it changes, and what kind of query the agent actually needs to run. That sequence of questions is the through-line of this entire course.
You are building a Vertex AI agent for a pharmaceutical company. The agent helps researchers find clinical trial records. It needs to: (1) perform exact lookup by trial ID (NCT number), (2) find semantically similar trials by description, and (3) filter results to only trials in a specific therapeutic area with active enrollment status. Work through the AlloyDB schema design, pgvector index configuration, and query structure with the AI advisor.