In 2023, Spotify's engineering team published details of their "Taste Profile" system — a recommendation engine that represents every song, podcast, and user preference as a dense vector in a shared embedding space. When a user skips death metal but lingers on ambient electronica, the system updates a vector. When that vector drifts near the neighborhood of "focus music," the agent surfacing study playlists needs no explicit rule. The geometry is the rule. The same principle now underlies enterprise data retrieval on Vertex AI.
An embedding is a fixed-length array of floating-point numbers — a vector — produced by a neural encoder model when given text, image, audio, or structured data as input. The encoder is trained so that semantically similar inputs produce vectors that are geometrically close to one another, while dissimilar inputs produce vectors that are far apart.
Google's text-embedding-004 model (available via Vertex AI) produces 768-dimensional vectors for text. Phrases such as "invoice discrepancy on account" and "billing error charge" each become a point in a 768-dimensional space, and those two points sit very close together.
Cosine similarity is the standard similarity measure for text embeddings. It is the cosine of the angle between two vectors, so it ignores magnitude. A cosine similarity of 1.0 means identical direction (treated as semantically the same); 0.0 means orthogonal (unrelated); negative values indicate opposing directions.
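The three cases above can be checked with a few lines of plain Python. This is a minimal sketch using hand-written two-dimensional vectors rather than real 768-dimensional embeddings; the function name is illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Same direction, different magnitude: similarity is (approximately) 1.0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity is 0.0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite directions: similarity is -1.0.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Doubling the second vector in the first case leaves the result unchanged, which is what "ignoring magnitude" means in practice.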
Vertex AI exposes embedding models through the Vertex AI Model Garden and the Embeddings API. As of 2024, the primary production models are:
A subtlety that catches engineers off-guard: the task_type parameter shifts the embedding toward a specific geometric objective. When you embed a user query with RETRIEVAL_QUERY and documents with RETRIEVAL_DOCUMENT, the model applies asymmetric optimizations — the query vector is shaped to match against document vectors more accurately than if both used the same task type.
Google's internal benchmarks on BEIR (Benchmarking IR) showed that correct task_type assignment improved NDCG@10 by 3–7 percentage points over task-agnostic embedding — a meaningful difference for enterprise retrieval pipelines where precision matters.
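For readers unfamiliar with the metric, NDCG@k can be sketched as follows. This is one common formulation (linear gain, log2 discount), and it assumes the ideal ranking is obtained by sorting the same list of graded relevances; it is not necessarily the exact variant used in the benchmark quoted above.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades of the returned documents, in ranked order.
perfect = ndcg_at_k([3, 2, 1, 0], k=10)   # best documents first
inverted = ndcg_at_k([0, 1, 2, 3], k=10)  # best documents last
```

A ranking that places the most relevant documents first scores 1.0; pushing them down the list lowers the score even though the same documents were retrieved.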
Always embed your queries with RETRIEVAL_QUERY and your knowledge-base documents with RETRIEVAL_DOCUMENT. Assigning the wrong task type to either side is one of the most common embedding pipeline bugs in agentic systems: it silently degrades retrieval quality without throwing errors.
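One way to keep this bug out of a pipeline is to derive the task type from the role of the text instead of passing it by hand. The sketch below is a plain-Python guard: the `role` labels and the `build_instances` helper are hypothetical, and the `{"content", "task_type"}` instance shape is an assumption about the request payload that should be checked against the current Embeddings API reference.

```python
QUERY_TASK = "RETRIEVAL_QUERY"
DOCUMENT_TASK = "RETRIEVAL_DOCUMENT"

def build_instances(texts, role):
    """Attach the task type that matches the role of the text.

    `role` is a hypothetical label ("query" or "document") used only by this
    sketch; the instance shape is an assumption, not a confirmed schema.
    """
    task_by_role = {"query": QUERY_TASK, "document": DOCUMENT_TASK}
    if role not in task_by_role:
        raise ValueError(f"unknown role: {role!r}")
    return [{"content": text, "task_type": task_by_role[role]} for text in texts]

query_instances = build_instances(["billing error charge"], "query")
doc_instances = build_instances(["Invoice discrepancy on account"], "document")
```

Because an unknown role raises immediately, a mislabeled call fails loudly at build time rather than degrading retrieval quality later.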
In an agentic workflow, embeddings serve as the bridge between natural-language questions and structured data stores. When an agent receives a user query, it embeds it, searches for the nearest vectors in a prebuilt index, retrieves the associated documents or records, and injects them into its context window. This pattern — embed → retrieve → inject → generate — is called Retrieval-Augmented Generation (RAG) and is the dominant architecture for grounding LLM agents in real enterprise data.
The quality of the embedding model directly determines what the agent "knows." A weak embedder produces noisy retrievals; the agent hallucinates. A strong, task-typed embedder retrieves precisely relevant documents; the agent reasons from ground truth.
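The embed, retrieve, inject, generate loop can be illustrated end to end without any cloud service. In this sketch the `embed` function is a toy bag-of-words stand-in for a real encoder, the index is a plain list, and the generation step is left out: `build_prompt` only shows what would be handed to the model. All names are illustrative.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an encoder: a sparse bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, index, top_k=2):
    """Return the top_k documents whose vectors are closest to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def build_prompt(query, documents):
    """Inject the retrieved documents into the context for the generator."""
    context = "\n".join(f"- {d}" for d in documents)
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "billing error on customer invoice",
    "password reset instructions",
    "invoice discrepancy escalation policy",
]
index = [(d, embed(d)) for d in docs]  # documents are embedded once, offline
query = "invoice billing error"
prompt = build_prompt(query, retrieve(query, index))
```

Swapping the toy `embed` for a real encoder and the list for a vector index changes the quality of `retrieve`, not the shape of the loop.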
You are architecting a RAG system for a financial services firm. They have 2 million support ticket documents, 50,000 compliance policy PDFs, and product images for 8,000 financial instruments. Users ask natural-language questions; an agent must retrieve relevant documents before generating responses.
In 2020, Google Research published benchmark results for ScaNN (Scalable Nearest Neighbors), showing it outperformed FAISS, HNSW, and Annoy on the glove-100-angular benchmark by up to 2× in queries per second at equivalent recall. Google subsequently productized ScaNN as the core of Vertex AI Vector Search (formerly Matching Engine), making the same approximate nearest-neighbor infrastructure that powers Google Search's semantic features available to enterprise developers, with SLAs, VPC peering, and Terraform support.
Vertex AI Vector Search is a fully managed, high-scale ANN (approximate nearest-neighbor) service. You upload your embeddings, configure an Index, deploy that index to an IndexEndpoint, and query it via gRPC or REST. Google handles the infrastructure — sharding, replication, serving latency, and incremental updates.
Key architectural properties: the service supports both streaming updates (insert/delete in near-real-time) and batch updates (full index rebuilds from Cloud Storage). Batch updates suit static knowledge bases; streaming updates suit inventory systems or live event feeds where vectors change continuously.
Vector Search offers two index update modes:
Stream Update (TreeAH): Uses a tree + asymmetric hashing structure that supports insertions without full rebuilds. Best for data that changes frequently — support tickets, live product catalogs, event streams.
Batch Update (Brute Force or TreeAH): For static datasets, you can request brute-force exact search on small indexes (<10M vectors) or the full ScaNN approximation for larger ones. Brute force is exact but does not scale beyond ~10M vectors economically.
ANN algorithms trade a small amount of recall for enormous speed gains. At 90% recall and 1 billion vectors, ScaNN can return the 10 nearest neighbors in ~50ms. Exact search would have to compare the query against every vector, which is too slow and too costly at that scale. For most enterprise RAG applications, 90–95% recall is sufficient; the LLM that receives the retrieved documents can compensate for minor retrieval imprecision.
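Recall here means the fraction of the true nearest neighbors that the approximate search actually returned. A minimal sketch of how it could be measured on a small sample, using brute-force search as ground truth; the `approx` list is a hypothetical ANN result, not output from ScaNN.

```python
def exact_top_k(query, vectors, k):
    """Brute-force search: score every vector, keep the k best by dot product."""
    scores = [(sum(q * v for q, v in zip(query, vec)), idx)
              for idx, vec in enumerate(vectors)]
    scores.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties by index
    return [idx for _, idx in scores[:k]]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors present in the approximate result."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
truth = exact_top_k([1.0, 0.0], vectors, k=3)
approx = [0, 1, 2]  # hypothetical ANN output that missed one true neighbor
recall = recall_at_k(approx, truth)
```

In practice the brute-force pass is run only on a sample of queries, since it is exactly the cost the ANN index is meant to avoid.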
Dot product distance: Highest performance for text-embedding-004, which produces unit-normalized vectors. When vectors are normalized, dot product equals cosine similarity. Google recommends this for all text retrieval.
Cosine distance: Explicitly normalizes during comparison. Use when your embedding model does not guarantee unit-length outputs, or when combining embeddings from multiple models with different magnitude ranges.
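The equivalence between dot product and cosine similarity for unit-length vectors is easy to verify numerically. A small sketch with hand-picked vectors, assuming nothing about any particular embedding model:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [8.0, 6.0]
raw_dot = dot(a, b)                          # depends on vector magnitudes
unit_dot = dot(normalize(a), normalize(b))   # magnitudes removed first
cos = cosine(a, b)
```

`unit_dot` and `cos` agree up to floating-point error, while `raw_dot` does not, which is why the dot product is only a safe substitute when the vectors are already normalized.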
Vector Search pricing is based on machine type and hours deployed, plus a per-query charge above certain volumes. For development, deploy a small index to e2-standard-2 and undeploy when not testing. In production, use n1-standard-16 with autoscaling enabled. The index persists in Cloud Storage regardless of deployment state — you only pay for deployed endpoints.
Your team is building a Vector Search index for a logistics company's 500-million-document shipment status and regulatory compliance knowledge base. New shipment status records arrive every 30 seconds. Compliance PDFs are updated quarterly. Query latency SLA is 200ms at p99. You have a mixed machine budget.
In 2022, Google rebranded its unified data governance platform as Dataplex, integrating the Data Catalog (originally launched in 2020) with lake management, data quality, and lineage tracking. By 2024, Dataplex managed metadata for over 100 petabytes of assets across enterprise customers including Deutsche Telekom and Renault. When Renault's analytics teams needed to find which of their 40,000+ BigQuery tables contained reliable vehicle telemetry data — versus prototype tables and deprecated schemas — Dataplex's catalog became the search layer. The same principle applies when an AI agent needs to know which data to trust.
Dataplex is Google Cloud's unified data management platform. Its three core capabilities relevant to agentic workflows are:
1. Unified Metadata Catalog: Dataplex automatically discovers and catalogs assets across BigQuery, Cloud Storage, Spanner, Cloud SQL, and Looker. Every table, bucket, model, and dashboard gets a catalog entry with schema, ownership, data type, and access policy.
2. Data Quality Rules: You define quality checks (column-level completeness, validity ranges, referential integrity) that run on a schedule. Pass/fail scores are attached to catalog entries — an agent can query whether a table's quality score is above a threshold before using it.
3. Data Lineage: Dataplex tracks which jobs produced which tables, which tables feed which dashboards, and which transformations touched which columns. An agent reasoning about whether a metric is reliable can trace its full provenance.
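Tracing provenance amounts to walking a dependency graph upstream. The sketch below assumes lineage has already been fetched into a plain dictionary mapping each asset to the assets it was built from; that dictionary and the table names are hypothetical stand-ins, not a Dataplex API response.

```python
from collections import deque

def upstream_sources(table, lineage):
    """Collect every asset that feeds `table`, directly or transitively.

    `lineage` maps an asset name to the assets it was built from; it is a
    hypothetical stand-in for lineage metadata, not an API response.
    """
    seen = set()
    queue = deque(lineage.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also protects against cycles
        seen.add(current)
        queue.extend(lineage.get(current, []))
    return seen

lineage = {
    "dash.revenue_kpi": ["mart.revenue_daily"],
    "mart.revenue_daily": ["raw.orders", "raw.refunds"],
    "raw.orders": [],
}
provenance = upstream_sources("dash.revenue_kpi", lineage)
```

An agent could then check each asset in `provenance` against its quality score before deciding whether the downstream metric is reliable.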
In a Vertex AI Agent Builder workflow, the Dataplex catalog functions as a tool — a callable API the agent can invoke to resolve questions about data assets. A well-architected agent workflow typically includes:
Google's Cloud Next 2024 demonstrations showed Gemini-powered agents using Dataplex search APIs as native tools — the agent would receive a user question like "which tables contain verified sales data for Q3?" and resolve it by calling catalog.search() rather than hallucinating table names. This pattern eliminates an entire class of agent errors where LLMs confidently fabricate schema names that don't exist.
Dataplex tag templates allow your team to attach structured metadata beyond what's auto-discovered. Common enterprise tag fields include:
The most powerful agentic architectures combine both services: Vector Search retrieves semantically relevant document chunks, while Dataplex validates that the sources containing those chunks are governed, current, and trusted. An agent that retrieves the right information from an untrusted table is still a liability. An agent that retrieves from certified, quality-scored sources builds organizational trust in AI-generated answers.
Treat the Dataplex catalog as your agent's epistemology layer — the mechanism by which it distinguishes what it knows reliably from what it merely retrieved. Vector Search handles relevance; the catalog handles trustworthiness. Together they produce grounded, auditable responses.
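The division of labor between relevance and trust can be expressed as a post-retrieval gate. This is a sketch under stated assumptions: `retrieved` stands for ranked chunks returned by a vector search, `catalog` stands for metadata previously looked up per source, and the `certified` and `quality_score` fields are illustrative names rather than the Dataplex schema.

```python
def gate_by_catalog(retrieved, catalog, min_quality=0.9):
    """Split ranked chunks into (kept, rejected) using catalog metadata.

    Sources missing from the catalog are rejected, so unknown data is never
    treated as trusted by default.
    """
    kept, rejected = [], []
    for chunk in retrieved:
        meta = catalog.get(chunk["source"], {})
        trusted = (meta.get("certified", False)
                   and meta.get("quality_score", 0.0) >= min_quality)
        (kept if trusted else rejected).append(chunk)
    return kept, rejected

retrieved = [
    {"text": "Q3 revenue summary", "source": "finance.certified", "score": 0.91},
    {"text": "Q3 revenue forecast", "source": "ml.experimental", "score": 0.89},
    {"text": "Q3 revenue notes", "source": "finance.legacy", "score": 0.80},
]
catalog = {
    "finance.certified": {"certified": True, "quality_score": 0.98},
    "ml.experimental": {"certified": False, "quality_score": 0.99},
    "finance.legacy": {"certified": True, "quality_score": 0.70},
}
kept, rejected = gate_by_catalog(retrieved, catalog)
```

Returning the rejected chunks alongside the kept ones preserves an audit trail of what the agent saw but declined to use.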
A healthcare analytics company is building an agent that answers clinician questions by retrieving from a mix of certified EHR summary tables, research literature chunks, and experimental ML model outputs — all in BigQuery and Cloud Storage. The agent must not expose unvalidated experimental data to clinicians as if it were certified.
When Anthropic published their "Long Context vs. RAG" analysis in late 2023, one finding stood out: simple top-k vector retrieval systematically missed relevant passages that used different vocabulary than the query. A question about "revenue recognition timing" failed to retrieve a document discussing "when to book sales" — semantically identical, lexically divergent. Their recommendation: hybrid retrieval combining dense vector search with sparse keyword (BM25) retrieval, followed by a re-ranking model. This approach, now supported natively in Vertex AI Search, is the current production standard for enterprise RAG.
Before any retrieval happens, your source documents must be split into chunks — the units that get embedded and indexed. Chunking strategy is the most underestimated variable in RAG quality. The wrong chunk size produces retrieval artifacts regardless of how good your vector search is.
Fixed-size chunking: Split every N tokens (e.g., 512 tokens with a 50-token overlap between consecutive chunks). Simple and fast, but it frequently breaks mid-sentence and loses contextual coherence at chunk boundaries. Acceptable for dense reference documents; poor for narratives or legal text.
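A minimal sketch of fixed-size splitting with overlap. Whitespace splitting stands in for a real tokenizer here, and the small `size` and `overlap` values are chosen only to make the behavior visible.

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Whitespace split is an assumption standing in for a model tokenizer.
tokens = "the insurer shall not be liable for losses caused by flood".split()
chunks = fixed_size_chunks(tokens, size=5, overlap=2)
```

The second chunk begins with the last two tokens of the first, and the sentence is cut wherever the window happens to end, which is the boundary problem described above.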
Semantic chunking: Split at semantic boundaries such as paragraph breaks, heading changes, and topic shifts detected by an embedding model comparing consecutive sentence embeddings. Preserves contextual coherence. More expensive to compute but produces significantly better retrieval results.
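The topic-shift variant can be sketched by comparing each sentence with the one before it and starting a new chunk when similarity drops. The `embed` function below is a toy bag-of-words stand-in for a sentence embedding model, and the 0.2 threshold is an arbitrary illustrative value that would need tuning on real data.

```python
import math
from collections import Counter

def embed(sentence):
    """Toy bag-of-words stand-in for a sentence embedding model."""
    return Counter(sentence.lower().replace(".", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk when consecutive sentences fall below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])      # topic shift: open a new chunk
        else:
            chunks[-1].append(curr)    # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "The policy covers water damage.",
    "Water damage from floods is excluded by the policy.",
    "Claims must be filed online.",
    "Online claims are reviewed within ten days.",
]
chunks = semantic_chunks(sentences)
```

Here the coverage sentences and the claims sentences end up in separate chunks because the second and third sentences share no vocabulary.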
Google's Document AI Layout Parser provides a third option: structure-aware chunking that splits by detected layout elements — headers, tables, list items, figure captions. For enterprise PDFs (financial reports, compliance manuals), structure-aware chunking dramatically outperforms fixed-size splitting because it keeps table rows together and separates unrelated sections cleanly.
Dense retrieval (vector search) excels at semantic similarity — finding documents that mean the same thing even with different words. Sparse retrieval (BM25, TF-IDF) excels at exact match — finding documents containing specific product codes, person names, technical identifiers, or regulatory article numbers.
In practice, enterprise queries often need both. "Show me GDPR Article 17 violations in customer data tables" requires BM25 to match "Article 17" exactly and dense search to understand "customer data tables" semantically. Vertex AI Search (the search-focused managed service) provides hybrid retrieval natively. For custom Vector Search deployments, you implement BM25 via Elasticsearch or Cloud Search, then merge ranked lists using Reciprocal Rank Fusion (RRF).
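RRF merges ranked lists from multiple retrieval systems. Each list contributes 1/(k + rank) to a document's score, where rank is the document's position in that list and k is a smoothing constant, typically 60; the document's final score is the sum of these contributions across all lists. Documents appearing high in both lists accumulate high scores. RRF is parameter-light, robust to score-scale differences, and consistently outperforms simple score averaging in multi-retrieval experiments.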
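RRF is short enough to write out in full. A sketch that fuses a dense ranking with a sparse one; the document IDs are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists by summing 1 / (k + rank) for each document."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["doc_a", "doc_b", "doc_c"]    # e.g. vector search order
sparse = ["doc_b", "doc_d", "doc_a"]   # e.g. BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
```

`doc_b` ends up first because it ranks near the top of both lists, ahead of `doc_a`, which is first in one list but third in the other. Only ranks are used, so the raw scores of the two retrievers never need to be put on a common scale.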
After retrieving a candidate set (typically 20–100 documents), a re-ranker scores each document in the context of the specific query. Unlike bi-encoder embeddings (query and document embedded separately), cross-encoders jointly encode the query and document together — this is much more computationally expensive per candidate but produces significantly more accurate relevance scores.
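The control flow of the second stage is simple: score every (query, candidate) pair jointly, then re-sort. In the sketch below the scorer is pluggable; `overlap_score` is a toy term-overlap function used only so the example runs, and it is not a cross-encoder. In a real pipeline it would be replaced by a call to a re-ranking model.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Score each (query, document) pair jointly and return the best top_n."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)  # stable: ties keep input order
    return [doc for _, doc in scored[:top_n]]

def overlap_score(query, doc):
    """Toy joint scorer: fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

# Candidates in first-stage order; the most relevant clause is ranked last.
candidates = [
    "coverage applies to fire damage",
    "premium payment schedule",
    "flood damage exclusion applies to basement coverage",
]
top = rerank("flood damage exclusion", candidates, overlap_score, top_n=2)
```

Because the scorer sees the query and the document together, it can promote a candidate that the first stage ranked low, at the cost of one scoring call per candidate.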
Vertex AI's Ranking API (launched 2024) exposes Google's production re-ranking model, the same one used internally for Google Search's featured snippets. You pass up to 200 candidates and receive re-ranked scores in a single call. In evaluations on enterprise datasets, Vertex Ranking improved NDCG@5 by 8–12 percentage points over vector similarity alone.
In a 2024 Google Cloud reference architecture for financial services RAG, the full pipeline — semantic chunking, hybrid search, Vertex Ranking re-ranking, Dataplex quality gate — achieved 91% precision@5 on an internal benchmark of 10,000 analyst queries. Basic top-k vector search alone achieved 67%. The 24-point gap represents the difference between a useful production tool and an unreliable prototype.
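For reference, precision@5 is the share of the top five results that are relevant, averaged over queries. A minimal sketch with two made-up queries; the document IDs and relevance judgments are illustrative and unrelated to the benchmark above.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k

def mean_precision_at_k(runs, k=5):
    """Average precision@k over (retrieved, relevant) pairs, one per query."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(precision_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4", "d5"], {"d1", "d2", "d3", "d4"}),  # 4 of 5 relevant
    (["d6", "d7", "d8", "d9", "d10"], {"d6", "d9"}),             # 2 of 5 relevant
]
score = mean_precision_at_k(runs)
```

Running the same query set through each pipeline variant and comparing this one number is how a gap like the one reported above would be measured.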
An insurance company's RAG agent is underperforming. Analysis shows: (1) fixed-size 512-token chunks are splitting policy clause tables mid-row, (2) queries for specific policy numbers return semantically similar clauses but miss the exact policy by rank, (3) a key coverage exclusion clause ranks 15th even though it's the most relevant for the query. You need to redesign the pipeline.