Module 4 · Lesson 1

Embeddings and the Geometry of Meaning

How numbers encode semantics — and why proximity in high-dimensional space equals relevance.

What actually happens when a model "understands" that a customer complaint about a billing error is similar to a chargeback dispute?

In 2023, Spotify's engineering team published details of their "Taste Profile" system — a recommendation engine that represents every song, podcast, and user preference as a dense vector in a shared embedding space. When a user skips death metal but lingers on ambient electronica, the system updates a vector. When that vector drifts near the neighborhood of "focus music," the agent surfacing study playlists needs no explicit rule. The geometry is the rule. The same principle now underlies enterprise data retrieval on Vertex AI.

What Is an Embedding?

An embedding is a fixed-length array of floating-point numbers — a vector — produced by a neural encoder model when given text, image, audio, or structured data as input. The encoder is trained so that semantically similar inputs produce vectors that are geometrically close to one another, while dissimilar inputs produce vectors that are far apart.

Google's text-embedding-004 model (available via Vertex AI) produces 768-dimensional vectors for text. That means a sentence like "invoice discrepancy on account" and "billing error charge" each become a point in a 768-dimensional space — and those two points sit very close together.

Technical Detail

Cosine similarity is the standard similarity measure for text embeddings. It is the cosine of the angle between two vectors, so it ignores magnitude. A cosine similarity of 1.0 means identical direction (the strongest possible semantic match); 0.0 means orthogonal (unrelated); negative values indicate opposing directions.
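As a minimal sketch of the definition above, the following computes cosine similarity with the standard library only. The three-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and the variable names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values, not real embeddings).
billing = [1.0, 2.0, 0.0]
billing_scaled = [2.0, 4.0, 0.0]   # same direction, twice the length
unrelated = [0.0, 0.0, 5.0]        # orthogonal to billing
opposite = [-1.0, -2.0, 0.0]       # points the other way

print(cosine_similarity(billing, billing_scaled))  # close to 1.0
print(cosine_similarity(billing, unrelated))       # close to 0.0
print(cosine_similarity(billing, opposite))        # close to -1.0
```

Scaling a vector changes its length but not its direction, which is why the first comparison scores as a near-perfect match.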

Embedding Models on Vertex AI

Vertex AI exposes embedding models through the Vertex AI Model Garden and the Embeddings API. As of 2024, the primary production models are:

text-embedding-004

768 dimensions
Optimized for retrieval and clustering
Supports task types: RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION
Max 2,048 tokens per input

multimodalembedding@001

1,408 dimensions by default for both image and text embeddings (lower-dimension outputs such as 128 are selectable)
Projects images and text into a shared space
Used for cross-modal retrieval
Powers Google's Visual Search features

Why Task Type Matters

A subtlety that catches engineers off-guard: the task_type parameter shifts the embedding toward a specific geometric objective. When you embed a user query with RETRIEVAL_QUERY and documents with RETRIEVAL_DOCUMENT, the model applies asymmetric optimizations — the query vector is shaped to match against document vectors more accurately than if both used the same task type.

Google's internal benchmarks on BEIR (Benchmarking IR) showed that correct task_type assignment improved NDCG@10 by 3–7 percentage points over task-agnostic embedding — a meaningful difference for enterprise retrieval pipelines where precision matters.

Production Insight

Always embed your queries with RETRIEVAL_QUERY and your knowledge-base documents with RETRIEVAL_DOCUMENT. Mismatching them (for example, embedding queries with RETRIEVAL_DOCUMENT, or using one task type for both sides) is one of the most common embedding pipeline bugs in agentic systems. It silently degrades retrieval quality without throwing errors.
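One way to make the mismatch hard to commit is to derive the task type from the role of the text rather than passing it by hand. The sketch below assumes the Vertex AI Python SDK (vertexai.language_models) is installed and that a project and credentials are already initialized. The role names "query" and "document" and both helper functions are hypothetical, not part of the SDK.

```python
# Hypothetical role names mapped to the task types named in this lesson.
TASK_TYPE_BY_ROLE = {
    "query": "RETRIEVAL_QUERY",
    "document": "RETRIEVAL_DOCUMENT",
}

def task_type_for(role):
    """Return the task type for a role, rejecting anything unexpected."""
    try:
        return TASK_TYPE_BY_ROLE[role]
    except KeyError:
        raise ValueError(f"unknown role {role!r}; expected 'query' or 'document'") from None

def embed(texts, role, model_name="text-embedding-004"):
    """Embed texts with the task type that matches their role.

    The SDK import is deferred so that this module loads without it;
    calling embed() requires the SDK, a project, and credentials.
    """
    from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

    model = TextEmbeddingModel.from_pretrained(model_name)
    inputs = [TextEmbeddingInput(text, task_type_for(role)) for text in texts]
    return [embedding.values for embedding in model.get_embeddings(inputs)]

print(task_type_for("query"))
print(task_type_for("document"))
```

With this shape, indexing code calls embed(chunks, "document") and the agent calls embed([question], "query"), so a typo fails loudly instead of quietly lowering retrieval quality.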

From Embeddings to Agent Memory

In an agentic workflow, embeddings serve as the bridge between natural-language questions and structured data stores. When an agent receives a user query, it embeds it, searches for the nearest vectors in a prebuilt index, retrieves the associated documents or records, and injects them into its context window. This pattern — embed → retrieve → inject → generate — is called Retrieval-Augmented Generation (RAG) and is the dominant architecture for grounding LLM agents in real enterprise data.

The quality of the embedding model directly determines what the agent "knows." A weak embedder produces noisy retrievals; the agent hallucinates. A strong, task-typed embedder retrieves precisely relevant documents; the agent reasons from ground truth.
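The embed, retrieve, inject, generate loop described above can be sketched with an in-memory index. Everything here is a toy assumption: the three-dimensional vectors are hand-written rather than produced by an embedding model, the generate step is left out, and the function names are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vector, index, k=2):
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vector, item["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question, documents):
    """Inject retrieved text into the prompt ahead of the question."""
    context = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in documents)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# Hand-written vectors stand in for real embeddings of each document.
index = [
    {"id": "doc-1", "text": "Refunds for billing errors post within 5 days.", "vector": [0.9, 0.1, 0.0]},
    {"id": "doc-2", "text": "Chargeback disputes require a case number.", "vector": [0.8, 0.3, 0.1]},
    {"id": "doc-3", "text": "Office hours are 9 to 5 on weekdays.", "vector": [0.0, 0.1, 0.9]},
]

question = "How do I dispute a billing error?"
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of the question

retrieved = retrieve(query_vector, index, k=2)
prompt = build_prompt(question, retrieved)
print(prompt)  # this string would be sent to the LLM in the generate step
```

In a real deployment the sorted() scan is replaced by a vector index query, but the data flow stays the same.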

Embedding: A dense vector representation of input data, produced by a neural encoder, where geometric proximity encodes semantic similarity.

task_type: A Vertex AI parameter that optimizes the embedding for a specific use, such as retrieval query, retrieval document, semantic similarity, or classification.

RAG: Retrieval-Augmented Generation, the pattern of embedding a query, retrieving relevant documents, and injecting them into an LLM's context before generation.

Lesson 1 Quiz

Embeddings and the Geometry of Meaning

What does cosine similarity measure between two embedding vectors?

Correct. Cosine similarity captures directional similarity — two vectors pointing in the same direction score 1.0 regardless of their length. This makes it robust to input-length variation, which matters when comparing short queries to long documents.

Not quite. Cosine similarity specifically measures the angle (direction) between vectors, not their spatial distance or token overlap.

Which task_type should you use when embedding a user query intended for document retrieval?

Correct. RETRIEVAL_QUERY is used for queries, while RETRIEVAL_DOCUMENT is used for the documents in your index. This asymmetry is intentional: the model optimizes query vectors specifically to match against document vectors, which improved NDCG@10 by 3–7 percentage points in the benchmarks cited in this lesson.

Incorrect. Queries should use RETRIEVAL_QUERY, not RETRIEVAL_DOCUMENT or other types. The asymmetry between query and document task types is what improves retrieval accuracy.

How many dimensions does Vertex AI's text-embedding-004 model produce?

Correct. text-embedding-004 produces 768-dimensional vectors. The multimodalembedding@001 model defaults to 1,408 dimensions for both image and text, which is why selecting the right model for your modality matters.

Incorrect. text-embedding-004 produces 768-dimensional vectors. 1,408 dimensions is the default output size of multimodalembedding@001.

Lab 1: Designing an Embedding Strategy

Practice selecting models, task types, and distance metrics for real enterprise scenarios.

Your Challenge

You are architecting a RAG system for a financial services firm. They have 2 million support ticket documents, 50,000 compliance policy PDFs, and product images for 8,000 financial instruments. Users ask natural-language questions; an agent must retrieve relevant documents before generating responses.

Ask the assistant how to choose embedding models, set task_type parameters, and structure embedding pipelines for this scenario. Explore at least 3 exchanges — covering model selection, task_type assignment, and a tradeoff you're uncertain about.

Embedding Strategy Assistant

Vertex AI · text-embedding-004

Hello! I'm your embedding strategy assistant for this module. Tell me about your retrieval scenario — what types of content are you indexing, and what kinds of questions will users be asking? We'll work through the right embedding models, task types, and distance metrics together.

Module 4 · Lesson 2

Vertex AI Vector Search: Architecture and Indexing

ScaNN at planetary scale — how Google's approximate nearest-neighbor engine handles billions of vectors with sub-100ms latency.

How do you search a billion vectors in under 100 milliseconds without comparing every single one?

In 2020, Google Research published benchmark results for ScaNN (Scalable Nearest Neighbors), showing it outperformed FAISS, HNSW, and Annoy on the glove-100-angular benchmark by up to 2× in queries per second at equivalent recall. Google subsequently productized ScaNN as the core of Vertex AI Vector Search (formerly Matching Engine), making the same approximate nearest-neighbor infrastructure that powers Google Search's semantic features available to enterprise developers, with SLAs, VPC peering, and Terraform support.

What Vertex AI Vector Search Is

Vertex AI Vector Search is a fully managed, high-scale ANN (approximate nearest-neighbor) service. You upload your embeddings, configure an Index, deploy that index to an IndexEndpoint, and query it via gRPC or REST. Google handles the infrastructure — sharding, replication, serving latency, and incremental updates.

Key architectural properties: the service supports both streaming updates (insert/delete in near-real-time) and batch updates (full index rebuilds from Cloud Storage). Batch updates suit static knowledge bases; streaming updates suit inventory systems or live event feeds where vectors change continuously.

Vectors per Index: 1B+
Query Latency (p99): <100ms
Availability SLA: 99.9%
Private Network Peering: VPC

Index Types and Approximate Search

Vector Search offers two index update modes:

Stream Update (TreeAH): Uses a tree + asymmetric hashing structure that supports insertions without full rebuilds. Best for data that changes frequently — support tickets, live product catalogs, event streams.

Batch Update (Brute Force or TreeAH): For static datasets, you can request brute-force exact search on small indexes (<10M vectors) or TreeAH, the ScaNN-based approximate algorithm, for larger ones. Brute force is exact but does not scale economically beyond ~10M vectors.

Approximate vs. Exact Search

ANN algorithms trade a small amount of recall for enormous speed gains. At 90% recall and 1 billion vectors, ScaNN can return the 10 nearest neighbors in ~50ms. Exact search would require comparing every vector — computationally infeasible at scale. For most enterprise RAG applications, 90–95% recall is sufficient; the LLM that receives the retrieved documents can compensate for minor retrieval imprecision.
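Recall in this context is the fraction of the true nearest neighbors that the approximate search actually returned. The sketch below computes the exact top 10 by brute force over random toy vectors and then scores a simulated approximate result that misses one of them. The simulated miss is an assumption standing in for a real ANN index.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query, vectors, k):
    """Brute force: score every vector, then keep the ids of the k best."""
    ranked = sorted(range(len(vectors)), key=lambda i: dot(query, vectors[i]), reverse=True)
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors present in the approximate result."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(16)] for _ in range(1000)]
query = [random.gauss(0, 1) for _ in range(16)]

truth = exact_top_k(query, vectors, k=10)

# Simulate an approximate result: nine true neighbors plus one wrong id.
wrong_id = next(i for i in range(len(vectors)) if i not in truth)
approx = truth[:9] + [wrong_id]

print(recall_at_k(approx, truth))  # 0.9
```

The brute-force pass touches all 1,000 vectors for a single query, which is the cost that grows linearly with index size and that ANN structures avoid.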

Building an Index: The Core Workflow

1. Generate embeddings: Use text-embedding-004 (RETRIEVAL_DOCUMENT task type) to embed all documents. Store vectors as JSON Lines in Cloud Storage.

2. Create an Index: In the Python SDK, call aiplatform.MatchingEngineIndex.create_tree_ah_index() (or the lower-level IndexServiceClient.create_index()), specifying dimensions (768), distance measure (DOT_PRODUCT_DISTANCE or COSINE_DISTANCE), and approximate neighbors count.

3. Deploy to IndexEndpoint: Deploy the index to a dedicated machine type (e2-standard-2 for dev, n1-standard-16 for production). This creates a gRPC endpoint for queries.

4. Query from your agent: Embed the user query (RETRIEVAL_QUERY), call find_neighbors() with num_neighbors=10, retrieve document IDs, and fetch source content from BigQuery or Cloud Storage.

5. Inject and generate: Pass retrieved documents as context to Gemini or Claude. The agent now reasons from actual retrieved data, not from training memory.
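The first two steps of the workflow above can be sketched as follows. The JSON Lines layout uses one object per line with an "id" and an "embedding" field. The create_index helper is a hypothetical wrapper around the SDK's MatchingEngineIndex.create_tree_ah_index call; it assumes google-cloud-aiplatform is installed and aiplatform.init() has already been called, and its approximate_neighbors_count value is purely illustrative.

```python
import json

def to_jsonl(records):
    """Serialize (id, vector) pairs as JSON Lines, one datapoint per line."""
    lines = []
    for doc_id, vector in records:
        lines.append(json.dumps({"id": str(doc_id), "embedding": [float(x) for x in vector]}))
    return "\n".join(lines) + "\n"

def create_index(display_name, gcs_uri, dimensions=768):
    """Create a TreeAH index from the embedding files under gcs_uri.

    The SDK import is deferred; calling this requires credentials and a
    prior aiplatform.init(project=..., location=...) call.
    """
    from google.cloud import aiplatform

    return aiplatform.MatchingEngineIndex.create_tree_ah_index(
        display_name=display_name,
        contents_delta_uri=gcs_uri,
        dimensions=dimensions,
        approximate_neighbors_count=150,  # illustrative, not a recommendation
        distance_measure_type="DOT_PRODUCT_DISTANCE",
    )

# Toy three-dimensional vectors; real text-embedding-004 output has 768 values.
sample = to_jsonl([("ticket-001", [0.1, 0.2, 0.3]), ("ticket-002", [0.4, 0.5, 0.6])])
print(sample, end="")
```

The serialized file is what gets uploaded to Cloud Storage; the index creation call then points at that location.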

Distance Measures and When to Use Each

DOT_PRODUCT_DISTANCE

Highest performance for text-embedding-004, which produces unit-normalized vectors. When vectors are normalized, dot product equals cosine similarity. Google recommends this for all text retrieval.

COSINE_DISTANCE

Explicitly normalizes during comparison. Use when your embedding model does not guarantee unit-length outputs, or when combining embeddings from multiple models with different magnitude ranges.
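The equivalence between dot product and cosine similarity for unit-length vectors can be checked numerically. This is a small sketch with made-up vectors; it shows that the raw dot product differs from the cosine until both vectors are normalized.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def normalize(v):
    """Scale a vector to unit length."""
    length = norm(v)
    return [x / length for x in v]

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

a = [3.0, 4.0, 0.0]   # length 5
b = [1.0, 2.0, 2.0]   # length 3

raw_dot = dot(a, b)                            # depends on vector lengths
cos = cosine(a, b)                             # depends only on direction
unit_dot = dot(normalize(a), normalize(b))     # matches cos

print(raw_dot, cos, unit_dot)
```

When every stored vector already has length 1, the division inside cosine() is redundant work, which is the reason given above for preferring the dot product with normalized embeddings.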

Engineering Note

Vector Search pricing is based on machine type and hours deployed, plus a per-query charge above certain volumes. For development, deploy a small index to e2-standard-2 and undeploy when not testing. In production, use n1-standard-16 with autoscaling enabled. The index resource persists regardless of deployment state, so undeploying stops serving charges without deleting the index.

ANN: Approximate Nearest Neighbor, an algorithm that finds vectors very close to a query vector without exhaustive comparison, trading small recall loss for massive speed gains.

ScaNN: Google's Scalable Nearest Neighbors library, the core engine of Vertex AI Vector Search, capable of querying 1B+ vectors in under 100ms.

IndexEndpoint: The deployed serving resource in Vector Search, a managed endpoint that receives gRPC or REST queries and returns nearest-neighbor results.

Lesson 2 Quiz

Vertex AI Vector Search: Architecture and Indexing

What is the primary reason to use Approximate Nearest Neighbor (ANN) search rather than exact search for large-scale vector retrieval?

Correct. ANN sacrifices a small percentage of recall (e.g., 5–10%) to achieve query times of ~50ms versus minutes for exact search at billion-vector scale. For RAG applications, this tradeoff is almost always worthwhile since LLMs can compensate for minor retrieval imprecision.

Incorrect. ANN is not more accurate than exact search — it's faster at the cost of a small, controlled recall reduction. The speed gain is the entire point at scale.

Which distance measure does Google recommend for use with text-embedding-004 in Vertex AI Vector Search, and why?

Correct. text-embedding-004 outputs unit-normalized vectors. For unit vectors, dot product and cosine similarity are mathematically equivalent, but DOT_PRODUCT_DISTANCE is computationally cheaper since it skips the normalization step during comparison.

Incorrect. Google's recommendation for text-embedding-004 is DOT_PRODUCT_DISTANCE, because the model produces unit-normalized vectors where dot product is equivalent to cosine similarity — and faster to compute.

When should you use "stream update" mode for a Vertex AI Vector Search index?

Correct. Stream update mode (TreeAH) supports incremental insertions and deletions — ideal for live data like support tickets, inventory systems, or event streams. Batch update is better suited for static knowledge bases where a periodic full rebuild is acceptable.

Incorrect. Stream update is specifically for dynamic data that changes frequently. For static datasets, batch update is more cost-effective.

Lab 2: Vector Search Architecture Decisions

Work through real indexing tradeoffs: update modes, machine types, and recall vs. latency.

Your Challenge

Your team is building a Vector Search index for a logistics company's 500-million-document shipment status and regulatory compliance knowledge base. New shipment status records arrive every 30 seconds. Compliance PDFs are updated quarterly. Query latency SLA is 200ms at p99. You have a mixed machine budget.

Ask the assistant how to architect this index: which update modes to use for different data types, which machine types to provision, and how to tune approximate neighbor count to hit your latency SLA. Push into at least 3 substantive exchanges.

Vector Search Architecture Assistant

Vertex AI · Vector Search

Ready to help you architect your Vector Search deployment. Tell me about your data volume, update frequency requirements, and latency constraints — we'll work out the right index configuration, machine types, and ANN parameters together.

Module 4 · Lesson 3

Dataplex and the Knowledge Catalog

Metadata as the backbone of agentic intelligence — how Dataplex gives agents a map of every data asset in the enterprise.

If an agent can retrieve any document in your data lake, how does it know which documents are trustworthy, current, and governed?

In 2022, Google rebranded its unified data governance platform as Dataplex, integrating the Data Catalog (originally launched in 2020) with lake management, data quality, and lineage tracking. By 2024, Dataplex managed metadata for over 100 petabytes of assets across enterprise customers including Deutsche Telekom and Renault. When Renault's analytics teams needed to find which of their 40,000+ BigQuery tables contained reliable vehicle telemetry data — versus prototype tables and deprecated schemas — Dataplex's catalog became the search layer. The same principle applies when an AI agent needs to know which data to trust.

What Dataplex Is

Dataplex is Google Cloud's unified data management platform. Its three core capabilities relevant to agentic workflows are:

1. Unified Metadata Catalog: Dataplex automatically discovers and catalogs assets across BigQuery, Cloud Storage, Spanner, Cloud SQL, and Looker. Every table, bucket, model, and dashboard gets a catalog entry with schema, ownership, data type, and access policy.

2. Data Quality Rules: You define quality checks (column-level completeness, validity ranges, referential integrity) that run on a schedule. Pass/fail scores are attached to catalog entries — an agent can query whether a table's quality score is above a threshold before using it.

3. Data Lineage: Dataplex tracks which jobs produced which tables, which tables feed which dashboards, and which transformations touched which columns. An agent reasoning about whether a metric is reliable can trace its full provenance.

The Catalog as an Agent Tool

In a Vertex AI Agent Builder workflow, the Dataplex catalog functions as a tool — a callable API the agent can invoke to resolve questions about data assets. A well-architected agent workflow typically includes:

Catalog search — The agent calls the Data Catalog search API with natural-language terms. Dataplex returns matching assets with metadata (schema, owner, quality score, last updated).

Quality gate — Before querying a BigQuery table, the agent checks its Dataplex quality score. Tables below threshold are excluded or flagged in the response.

Lineage check — For critical financial or compliance queries, the agent verifies the table's lineage — was it produced by a certified pipeline or an ad-hoc notebook?

Policy enforcement — Dataplex's tag templates carry data sensitivity labels (PII, HIPAA, restricted). The agent uses these to redact or route responses appropriately.
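The quality gate, lineage check, and policy enforcement steps above amount to filtering candidate sources on their catalog metadata. The sketch below uses a hypothetical CatalogEntry record and made-up thresholds; in practice the fields would be read from the catalog API rather than constructed by hand.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    """Hypothetical, simplified view of the metadata an agent reads for one table."""
    name: str
    quality_score: int        # 0-100
    sensitivity: str          # e.g. "public", "internal", "PII"
    certified_pipeline: bool  # lineage: produced by a certified pipeline?

def gate(entries, min_quality=80, blocked_sensitivity=("PII",), require_certified=False):
    """Split entries into usable table names and flagged (name, reasons) pairs."""
    allowed, flagged = [], []
    for entry in entries:
        reasons = []
        if entry.quality_score < min_quality:
            reasons.append(f"quality {entry.quality_score} < {min_quality}")
        if entry.sensitivity in blocked_sensitivity:
            reasons.append(f"sensitivity {entry.sensitivity}")
        if require_certified and not entry.certified_pipeline:
            reasons.append("not from a certified pipeline")
        if reasons:
            flagged.append((entry.name, reasons))
        else:
            allowed.append(entry.name)
    return allowed, flagged

entries = [
    CatalogEntry("sales_q3_certified", 95, "internal", True),
    CatalogEntry("sales_q3_scratch", 40, "internal", False),
    CatalogEntry("customer_contacts", 90, "PII", True),
]
allowed, flagged = gate(entries)
print(allowed)
print(flagged)
```

Returning the reasons alongside each flagged table lets the agent explain in its response why a source was excluded instead of silently dropping it.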

Real Implementation Note

Google's Cloud Next 2024 demonstrations showed Gemini-powered agents using Dataplex search APIs as native tools. The agent would receive a user question like "which tables contain verified sales data for Q3?" and resolve it by calling the catalog search API (search_catalog() in the Data Catalog client library) rather than hallucinating table names. This pattern eliminates an entire class of agent errors where LLMs confidently fabricate schema names that don't exist.

Tag Templates and Business Context

Dataplex tag templates allow your team to attach structured metadata beyond what's auto-discovered. Common enterprise tag fields include:

Governance Tags

Data Owner (person/team)
Sensitivity Level (public/internal/PII/confidential)
Retention Policy (days until deletion)
Regulatory Scope (GDPR, HIPAA, SOX)

Quality Tags

Certification Status (silver/gold/platinum tier)
Last Quality Run (timestamp)
Quality Score (0–100)
Known Issues (free text field)

Combining Catalog and Vector Search

The most powerful agentic architectures combine both services: Vector Search retrieves semantically relevant document chunks, while Dataplex validates that the sources containing those chunks are governed, current, and trusted. An agent that retrieves the right information from an untrusted table is still a liability. An agent that retrieves from certified, quality-scored sources builds organizational trust in AI-generated answers.

Architecture Principle

Treat the Dataplex catalog as your agent's epistemology layer — the mechanism by which it distinguishes what it knows reliably from what it merely retrieved. Vector Search handles relevance; the catalog handles trustworthiness. Together they produce grounded, auditable responses.

Dataplex: Google Cloud's unified data governance platform providing metadata cataloging, data quality scoring, and lineage tracking across all GCP data services.

Tag Template: A structured schema for attaching custom metadata (governance, quality, sensitivity labels) to catalog entries, readable by agents at query time.

Lineage: The tracked provenance of a data asset, covering which pipelines produced it, which sources fed it, and which outputs depend on it.

Lesson 3 Quiz

Dataplex and the Knowledge Catalog

What role does Dataplex's catalog serve in an agentic workflow?

Correct. The catalog is the trustworthiness layer — agents use it to check quality scores, ownership, lineage, and sensitivity tags before incorporating retrieved data into responses. This is what distinguishes a governed AI system from one that retrieves indiscriminately.

Incorrect. The catalog does not store embeddings — Vector Search handles that. The catalog stores metadata about data assets that agents use to evaluate source reliability.

What is a Dataplex tag template used for?

Correct. Tag templates let teams define structured metadata schemas — fields like certification status, data owner, sensitivity level, regulatory scope — that are attached to catalog entries and readable by agents via API. This is how governance intent becomes machine-readable agent context.

Incorrect. Tag templates are for attaching custom structured metadata to catalog entries — they are not related to BigQuery partitioning or Vector Search configuration.

Why does combining Dataplex with Vector Search produce better enterprise agent outputs than either service alone?

Correct. Relevance and trustworthiness are distinct problems. Vector Search solves relevance — returning semantically similar content. Dataplex solves trustworthiness — confirming the sources are certified, quality-scored, and appropriately governed. Together they produce responses that are both relevant and auditable.

Incorrect. The two services solve complementary problems: relevance (Vector Search) and trustworthiness (Dataplex). Combining them does not reduce storage costs or filter vectors before search.

Lab 3: Designing a Governed RAG Pipeline

Integrate Dataplex quality gates and lineage checks into a Vector Search retrieval workflow.

Your Challenge

A healthcare analytics company is building an agent that answers clinician questions by retrieving from a mix of certified EHR summary tables, research literature chunks, and experimental ML model outputs — all in BigQuery and Cloud Storage. The agent must not expose unvalidated experimental data to clinicians as if it were certified.

Ask the assistant how to design a pipeline that uses Dataplex tags (certification tier, sensitivity label, last quality run) as pre- and post-retrieval filters. Explore how the agent should handle a scenario where Vector Search returns highly relevant chunks from a low-quality-scored source.

Governed RAG Pipeline Assistant

Dataplex · Vector Search · Healthcare

I'm here to help you design a governed RAG pipeline for healthcare data. This is one of the most important architectural challenges in enterprise AI — ensuring that retrieval quality and source trustworthiness are both enforced. What's your first question about integrating Dataplex governance into the retrieval flow?

Module 4 · Lesson 4

RAG Optimization: Chunking, Re-ranking, and Hybrid Search

Beyond basic retrieval — the engineering decisions that determine whether your RAG pipeline is good enough for production.

Your agent retrieves the top 10 chunks by vector similarity. Three are irrelevant. One critical chunk is ranked 12th. What went wrong, and how do you fix it?

When Anthropic published their "Long Context vs. RAG" analysis in late 2023, one finding stood out: simple top-k vector retrieval systematically missed relevant passages that used different vocabulary than the query. A question about "revenue recognition timing" failed to retrieve a document discussing "when to book sales" — semantically identical, lexically divergent. Their recommendation: hybrid retrieval combining dense vector search with sparse keyword (BM25) retrieval, followed by a re-ranking model. This approach, now supported natively in Vertex AI Search, is the current production standard for enterprise RAG.

The Chunking Problem

Before any retrieval happens, your source documents must be split into chunks — the units that get embedded and indexed. Chunking strategy is the most underestimated variable in RAG quality. The wrong chunk size produces retrieval artifacts regardless of how good your vector search is.

Fixed-Size Chunking

Split every N tokens (e.g., 512 tokens with a 50-token overlap between consecutive chunks). Simple and fast, but it frequently breaks mid-sentence and loses contextual coherence at chunk boundaries. Acceptable for dense reference documents; poor for narratives or legal text.
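A minimal sketch of fixed-size chunking with overlap follows. It treats the input as an already tokenized list, which is an assumption; a real pipeline would use the tokenizer that matches its embedding model.

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks

tokens = [f"t{i}" for i in range(1200)]  # placeholder tokens
chunks = chunk_fixed(tokens)
print([len(chunk) for chunk in chunks])  # [512, 512, 276]
```

The window boundaries depend only on token counts, so nothing prevents a boundary from landing in the middle of a sentence or a table row.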

Semantic Chunking

Split at semantic boundaries — paragraph breaks, heading changes, topic shifts detected by an embedding model comparing consecutive sentence embeddings. Preserves contextual coherence. More expensive to compute but produces significantly better retrieval results.
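The idea of comparing consecutive sentence embeddings can be sketched as below. The embedder is passed in as a function; the toy_embed word-count embedder, its vocabulary, and the 0.3 threshold are all illustrative assumptions standing in for a real embedding model and a tuned cutoff.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences, embed, threshold=0.3):
    """Start a new chunk whenever two consecutive sentences are less similar than threshold."""
    if not sentences:
        return []
    vectors = [embed(sentence) for sentence in sentences]
    chunks = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(previous, current) < threshold:
            chunks.append([sentence])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return chunks

# Toy embedder: word counts over a tiny vocabulary.
VOCAB = ["refund", "billing", "invoice", "shipping", "delivery", "courier"]

def toy_embed(sentence):
    words = sentence.lower().replace(".", "").split()
    return [words.count(term) for term in VOCAB]

sentences = [
    "Billing errors trigger a refund.",
    "The refund appears on the next invoice.",
    "Shipping uses a courier.",
    "The courier confirms delivery.",
]
result = semantic_chunks(sentences, toy_embed)
print(result)
```

The extra cost mentioned above comes from the embed() call per sentence, which happens at indexing time rather than at query time.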

Google's Document AI Layout Parser provides a third option: structure-aware chunking that splits by detected layout elements — headers, tables, list items, figure captions. For enterprise PDFs (financial reports, compliance manuals), structure-aware chunking dramatically outperforms fixed-size splitting because it keeps table rows together and separates unrelated sections cleanly.

Hybrid Search: Dense + Sparse

Dense retrieval (vector search) excels at semantic similarity — finding documents that mean the same thing even with different words. Sparse retrieval (BM25, TF-IDF) excels at exact match — finding documents containing specific product codes, person names, technical identifiers, or regulatory article numbers.

In practice, enterprise queries often need both. "Show me GDPR Article 17 violations in customer data tables" requires BM25 to match "Article 17" exactly and dense search to understand "customer data tables" semantically. Vertex AI Search (the search-focused managed service) provides hybrid retrieval natively. For custom Vector Search deployments, you implement BM25 via Elasticsearch or Cloud Search, then merge ranked lists using Reciprocal Rank Fusion (RRF).

Reciprocal Rank Fusion (RRF)

RRF merges ranked lists from multiple retrieval systems. Each document's score is the sum, over every list in which it appears, of 1/(k + rank), where rank is the document's position in that list and k is typically 60. Documents appearing high in both lists accumulate high scores. RRF is parameter-light, robust to score-scale differences, and consistently outperforms simple score averaging in multi-retrieval experiments.
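A short sketch of RRF follows, using ranks that start at 1 and k = 60. The document ids and the two input rankings are made up for illustration; ties are broken alphabetically, which is a choice made here and not part of RRF itself.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked id lists; each appearance adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense = ["doc-A", "doc-B", "doc-C"]    # hypothetical vector search ranking
sparse = ["doc-C", "doc-A", "doc-D"]   # hypothetical BM25 ranking

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['doc-A', 'doc-C', 'doc-B', 'doc-D']
```

Only positions enter the formula, so the dense similarity scores and the BM25 scores never need to be put on a common scale.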

Re-ranking with Cross-Encoders

After retrieving a candidate set (typically 20–100 documents), a re-ranker scores each document in the context of the specific query. Unlike bi-encoder embeddings (query and document embedded separately), cross-encoders jointly encode the query and document together — this is much more computationally expensive per candidate but produces significantly more accurate relevance scores.

Vertex AI's Ranking API (launched 2024) exposes Google's production re-ranking model, the same one used internally for Google Search's featured snippets. You pass up to 200 candidates and receive re-ranked scores in a single call. In evaluations on enterprise datasets, Vertex Ranking improved NDCG@5 by 8–12 percentage points over vector similarity alone.

Combined, these stages form a four-step retrieval pipeline:

1. Retrieve broadly: Use hybrid search (dense + BM25) to retrieve 50–100 candidates with high recall. At this stage, precision is less important than not missing relevant documents.

2. Re-rank precisely: Pass candidates to Vertex Ranking API. The cross-encoder jointly evaluates query + each document, producing calibrated relevance scores.

3. Apply quality gate: Check Dataplex quality and sensitivity tags on the top-k sources. Exclude below-threshold sources or flag them in the response.

4. Inject and generate: Pass the final top 5–10 chunks (post-rerank, post-quality-gate) to the LLM with source citations. The model generates from verified, relevance-scored context.

Production Benchmark

In a 2024 Google Cloud reference architecture for financial services RAG, the full pipeline — semantic chunking, hybrid search, Vertex Ranking re-ranking, Dataplex quality gate — achieved 91% precision@5 on an internal benchmark of 10,000 analyst queries. Basic top-k vector search alone achieved 67%. The 24-point gap represents the difference between a useful production tool and an unreliable prototype.
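For readers who want to reproduce this kind of comparison on their own data, precision@k and NDCG@k (the two metrics cited in this lesson) can be computed as sketched below. The relevance lists are invented examples, not the benchmark's data, and NDCG is shown in its common form with a log2 position discount.

```python
import math

def precision_at_k(relevance, k):
    """Share of the top-k results that are relevant (any relevance above 0)."""
    return sum(1 for rel in relevance[:k] if rel > 0) / k

def dcg_at_k(relevance, k):
    """Discounted cumulative gain: later positions contribute less."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevance[:k], start=1))

def ndcg_at_k(relevance, k):
    """DCG divided by the DCG of the best possible ordering of the same results."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# 1 = relevant, 0 = not relevant, in the order each system returned results.
vector_only = [1, 0, 1, 0, 0, 1]
reranked = [1, 1, 1, 0, 0, 0]

print(precision_at_k(vector_only, 5), precision_at_k(reranked, 5))  # 0.4 0.6
print(ndcg_at_k(vector_only, 5), ndcg_at_k(reranked, 5))
```

Precision@k only counts how many relevant results made the cut, while NDCG@k also rewards placing them near the top, which is why re-ranking tends to move NDCG more than precision.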

Semantic Chunking: Splitting documents at semantic topic boundaries rather than fixed token counts, preserving contextual coherence within each retrieved chunk.

Reciprocal Rank Fusion: A method for merging ranked lists from multiple retrieval systems using position-based scoring, robust to differences in score scales between systems.

Vertex Ranking API: Google's managed cross-encoder re-ranking service that jointly scores query-document pairs for higher retrieval precision than bi-encoder similarity alone.

Lesson 4 Quiz

RAG Optimization: Chunking, Re-ranking, and Hybrid Search

Why does hybrid search (dense + BM25) outperform dense-only vector search for many enterprise queries?

Correct. Dense retrieval finds semantically similar content; BM25 finds exact keyword matches like product codes, article numbers, or person names. Queries like "GDPR Article 17 violations in customer tables" need both capabilities simultaneously — semantic understanding for "customer tables" and exact matching for "Article 17."

Incorrect. The advantage of hybrid search is complementarity — dense search handles semantic similarity while BM25 handles exact lexical match. Together they cover cases that neither handles alone.

What distinguishes a cross-encoder re-ranker from the bi-encoder embeddings used in vector search?

Correct. Bi-encoders embed query and document independently (enabling fast pre-computed index search). Cross-encoders process the query and document jointly — allowing richer relevance modeling at the cost of per-pair computation. This is why re-ranking runs on a small candidate set (20–100) after initial retrieval, not on millions of documents.

Incorrect. Cross-encoders jointly process query + document pairs, producing higher-quality relevance scores than bi-encoders. They're used for re-ranking because their joint encoding is too expensive to run on an entire index.

According to Google Cloud's 2024 financial services RAG benchmark, what was the precision@5 improvement from using the full optimized pipeline (semantic chunking + hybrid search + re-ranking + quality gate) versus basic top-k vector search?

Correct. The full optimized pipeline achieved 91% precision@5 versus 67% for basic vector search — a 24-point improvement. This benchmark underscores why chunking strategy, hybrid retrieval, re-ranking, and quality gating each contribute meaningful, cumulative gains. No single optimization accounts for the full gap; all four components matter.

Incorrect. The benchmark showed 91% versus 67% — a 24-percentage-point improvement. This represents a meaningful difference between a prototype-quality and production-quality RAG system.

Lab 4: Optimizing a RAG Pipeline

Debug retrieval failures and design chunking, hybrid search, and re-ranking solutions.

Your Challenge

An insurance company's RAG agent is underperforming. Analysis shows: (1) fixed-size 512-token chunks are splitting policy clause tables mid-row, (2) queries for specific policy numbers return semantically similar clauses but miss the exact policy by rank, (3) a key coverage exclusion clause ranks 15th even though it's the most relevant for the query. You need to redesign the pipeline.

Walk the assistant through each failure, ask for the specific fix, and explore how to implement hybrid search with RRF and re-ranking via the Vertex Ranking API for this insurance use case. Aim for at least 3 substantive exchanges covering different failure modes.

RAG Optimization Assistant

Chunking · Hybrid · Re-ranking

Let's debug your insurance RAG pipeline. The three failure modes you described — table fragmentation, exact-ID miss, and critical clause ranking — are each solvable with different techniques. Tell me more about the first one: how are your policy documents structured, and what's happening when the fixed-size chunker hits a table?

Module 4 Test

Vector Search and the Knowledge Catalog — 15 questions · 80% to pass

1. Which Vertex AI embedding model produces 768-dimensional vectors optimized for text retrieval?

Correct. text-embedding-004 is the current production text embedding model on Vertex AI, producing 768-dimensional unit-normalized vectors optimized for retrieval.

Incorrect. text-embedding-004 is the current 768-dimensional text retrieval model on Vertex AI.

2. A cosine similarity of 0.0 between two embedding vectors indicates:

Correct. Cosine similarity of 0.0 means the vectors are perpendicular — pointing in unrelated directions — indicating no semantic relationship. 1.0 means identical direction; negative values indicate opposition.

Incorrect. 0.0 cosine similarity means orthogonal vectors — semantically unrelated. Identical vectors have cosine similarity of 1.0.

3. The task_type parameter RETRIEVAL_DOCUMENT should be assigned to:

Correct. RETRIEVAL_DOCUMENT is for indexed content; RETRIEVAL_QUERY is for user queries. The asymmetry is intentional — it optimizes the embedding space for cross-type matching.

Incorrect. Documents use RETRIEVAL_DOCUMENT; queries use RETRIEVAL_QUERY. Mismatching these is a common bug that silently degrades retrieval quality.

4. Vertex AI Vector Search is powered by which underlying Google library?

Correct. Google's ScaNN library is the core engine of Vertex AI Vector Search. On the glove-100-angular benchmark, ScaNN outperformed FAISS, HNSW, and Annoy by up to 2× in queries per second at equivalent recall.

Incorrect. Vertex AI Vector Search uses Google's own ScaNN (Scalable Nearest Neighbors) library, not FAISS or HNSW.

5. Which distance measure should you use with text-embedding-004 for maximum performance in Vector Search?

Correct. Since text-embedding-004 produces unit-normalized vectors, dot product and cosine similarity are mathematically equivalent — but DOT_PRODUCT_DISTANCE is faster because it skips the per-comparison normalization step.

Incorrect. Google recommends DOT_PRODUCT_DISTANCE for text-embedding-004: the model produces unit-normalized vectors, so dot product equals cosine similarity and is computationally cheaper.
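
The equivalence behind questions 2 and 5 can be checked in a few lines. This is a plain-Python sketch with small made-up vectors, not 768-dimensional model output: once vectors are scaled to unit length, the dot product and cosine similarity agree, and orthogonal vectors score 0.0.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


# Illustrative vectors only; real embeddings have hundreds of dimensions.
u = normalize([3.0, 4.0, 0.0])
v = normalize([4.0, 3.0, 0.0])

# For unit vectors the denominator is 1, so both measures agree.
print(math.isclose(dot(u, v), cosine_similarity(u, v)))

# Orthogonal vectors have cosine similarity 0.0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Skipping the division by the two norms on every comparison is where the saving from dot product comes from.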

6. "Stream update" mode in Vector Search is best suited for:

Correct. Stream update mode supports incremental insertions and deletions without rebuilding the full index, which is essential for dynamic data sources. Static knowledge bases use batch update, which rebuilds the full index from Cloud Storage on a schedule.

Incorrect. Stream update is for dynamic data requiring continuous updates. Static documents should use batch update mode.

7. Dataplex's three core capabilities relevant to agentic workflows are:

Correct. Dataplex's agent-relevant capabilities are: (1) metadata catalog — discovering and describing all data assets; (2) data quality rules — scoring asset reliability; (3) lineage tracking — tracing data provenance from source to output.

Incorrect. Dataplex's three agent-relevant capabilities are: unified metadata catalog, data quality rules (with scored checks), and data lineage tracking. It does not handle vector indexing or embedding generation.

8. In an agentic workflow, what is the primary purpose of checking Dataplex data quality scores before using a retrieved source?

Correct. Quality gate checks prevent agents from using sources with failed data quality checks — incomplete, corrupted, or schema-drifted tables. An agent that retrieves relevant but low-quality data produces answers that appear authoritative but are factually unreliable.

Incorrect. Quality score checks validate source reliability before agents use retrieved content — preventing factually unreliable outputs from low-quality or corrupted data assets.

9. Dataplex tag templates allow teams to:

Correct. Tag templates define structured metadata schemas — fields like certification status, sensitivity level, known issues — that are attached to catalog entries. Agents read these tags via API to make governance-aware retrieval decisions.

Incorrect. Tag templates attach structured custom metadata to catalog entries for agent-readable governance information. They don't control chunking, set ACLs directly, or schedule quality runs.

10. Why does semantic chunking outperform fixed-size chunking for legal and policy documents?

Correct. Legal and policy documents have meaningful semantic units — clauses, provisions, definitions — that fixed-size chunking breaks arbitrarily. Semantic chunking preserves these units by detecting topic shifts, producing chunks whose embeddings are coherent representations of complete concepts.

Incorrect. Semantic chunking's advantage is preserving contextual coherence — splitting at topic/semantic boundaries rather than arbitrary token counts, which is especially important for structured legal and policy text.

11. What does Reciprocal Rank Fusion (RRF) do in a hybrid search pipeline?

Correct. RRF scores each document as 1/(k + rank) in each retrieval system's list, then sums those scores. Documents that rank high in both the dense and BM25 lists accumulate the highest totals. Because it uses only rank positions, RRF is robust to score-scale differences between retrievers, and with the commonly used constant k=60 it typically outperforms simple score averaging.

Incorrect. RRF merges ranked lists using position-based scoring — 1/(k+rank) from each retrieval system. It doesn't re-embed, average similarity scores directly, or filter by quality.
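
The 1/(k + rank) scoring described above can be sketched as follows. The function name and the document IDs are hypothetical; the two input lists stand in for the output of a dense retriever and a BM25 retriever.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs using position-based scoring.

    Each document earns 1 / (k + rank) from every list it appears in,
    with rank starting at 1. Returns (doc_id, score) pairs, best first.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results for a query that names a specific policy.
dense_results = ["clause_12", "clause_07", "policy_4471"]
bm25_results = ["policy_4471", "clause_12", "clause_30"]

fused = reciprocal_rank_fusion([dense_results, bm25_results])
print([doc_id for doc_id, _ in fused])
# ['clause_12', 'policy_4471', 'clause_07', 'clause_30']
```

Note how policy_4471, ranked last by the dense retriever, moves to second place because the keyword retriever ranked it first.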

12. The Vertex AI Ranking API uses which retrieval architecture?

Correct. The Vertex Ranking API uses a cross-encoder that processes query and document together — richer and more accurate than bi-encoder approaches, but too expensive to run on a full index. It's applied after initial retrieval to re-rank a candidate set of 20–200 documents.

Incorrect. The Vertex Ranking API is a cross-encoder — it processes query and document jointly. Bi-encoders are used for initial retrieval (faster, pre-computable); cross-encoders are used for re-ranking (more accurate, but per-pair computation).

13. In a well-architected RAG pipeline, why do you retrieve 50–100 candidates before re-ranking to select the final top 5–10?

Correct. The two-stage design separates recall from precision. Stage 1 (broad retrieval) ensures relevant documents are in the candidate set. Stage 2 (cross-encoder re-ranking) precisely orders them. Running cross-encoder inference on 50–100 candidates is feasible; running it on 1 million index entries is not.

Incorrect. The two-stage design separates recall (broad initial retrieval) from precision (accurate re-ranking). Cross-encoders are computationally expensive per pair — they can only practically re-rank small candidate sets.
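
A toy version of the two-stage design is sketched below. Both scoring functions are stand-ins chosen for illustration: word overlap plays the role of fast first-stage retrieval, and a phrase-aware scorer plays the role of the cross-encoder. Neither calls Vector Search or the Vertex Ranking API, and the corpus is invented.

```python
def two_stage_search(query, corpus, cheap_score, expensive_score,
                     n_candidates=50, top_k=5):
    """Retrieve broadly with a cheap scorer, then re-rank with a costly one."""
    # Stage 1 (recall): score every document cheaply, keep the best candidates.
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc),
                        reverse=True)[:n_candidates]
    # Stage 2 (precision): run the expensive scorer on the candidates only.
    reranked = sorted(candidates, key=lambda doc: expensive_score(query, doc),
                      reverse=True)
    return reranked[:top_k]


def word_overlap(query, doc):
    """Stand-in for first-stage retrieval: count of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


calls = {"expensive": 0}  # counts how often the costly scorer runs


def toy_reranker(query, doc):
    """Stand-in for a cross-encoder: rewards an exact phrase match."""
    calls["expensive"] += 1
    bonus = 2 if query.lower() in doc.lower() else 0
    return word_overlap(query, doc) + bonus


corpus = [
    "flood damage is excluded from coverage",
    "basement flood and wind damage rules",
    "premium payment schedule",
    "flood maps by region",
    "claims contact information",
    "damage assessment process",
]

best = two_stage_search("flood damage", corpus, word_overlap, toy_reranker,
                        n_candidates=3, top_k=1)
print(best)                # ['flood damage is excluded from coverage']
print(calls["expensive"])  # 3: only the candidates were re-ranked
```

The call counter makes the cost argument concrete: the expensive scorer runs once per candidate, not once per document in the corpus.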

14. Google's Document AI Layout Parser provides which chunking advantage over fixed-size methods?

Correct. Layout Parser understands PDF document structure — detecting tables, headers, figure captions, and list items. For enterprise PDFs, this means table rows stay together in a single chunk rather than being split mid-row, and headers correctly separate unrelated topics.

Incorrect. Document AI Layout Parser's advantage is structure-aware splitting — keeping tables, headers, and lists intact rather than splitting at arbitrary token boundaries.

15. An agent retrieves highly relevant chunks from a BigQuery table that has a Dataplex quality score of 42/100 and a certification tag of "experimental." What should the agent do?

Correct. Governance-aware agents apply policy at retrieval time. If organizational policy requires certified sources for the query type, low-quality sources are excluded. If the agent may use experimental sources, it must disclose their status — "this answer is based partly on experimental data with a quality score of 42/100" — so users can calibrate their trust accordingly.

Incorrect. A governance-aware agent must either exclude low-quality/experimental sources (per policy) or explicitly disclose their quality status. Silently using unvalidated data in a response undermines organizational trust in AI outputs.
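
The exclude-or-disclose logic can be sketched as a small policy function. Everything here is hypothetical: the SourceMetadata fields, the "certified" label, and the minimum quality threshold of 80 are assumptions for illustration, and a real agent would read these values from the catalog rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class SourceMetadata:
    """Hypothetical governance metadata attached to a retrieved source."""
    name: str
    quality_score: int   # 0-100
    certification: str   # e.g. "certified" or "experimental"


def governance_decision(source, require_certified, min_quality=80):
    """Return (use_source, disclosure_text_or_None) for a retrieved source.

    min_quality=80 is an assumed threshold, not a documented default.
    """
    trusted = (source.certification == "certified"
               and source.quality_score >= min_quality)
    if trusted:
        return True, None
    if require_certified:
        # Policy demands certified sources: exclude this one.
        return False, None
    # Policy allows it, but the answer must disclose the source's status.
    disclosure = (
        f"This answer is based partly on {source.certification} data "
        f"with a quality score of {source.quality_score}/100."
    )
    return True, disclosure


table = SourceMetadata("claims_staging", quality_score=42,
                       certification="experimental")

print(governance_decision(table, require_certified=True))
print(governance_decision(table, require_certified=False))
```

Returning the disclosure text alongside the decision keeps the policy check and the user-facing caveat in one place, so the agent cannot use the source without also carrying the caveat.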