In 2022, researchers at the Allen Institute for AI benchmarked retrieval on PubMed's 35 million abstracts. A BM25 keyword index, the same kind that powers most SQL full-text search, missed 38% of relevant papers that used synonymous phrasing. A vector index on the same corpus missed only 9%. The difference was geometry: vectors encode meaning, not characters.
Relational databases answer exact questions: give me all rows where status = 'active'. Full-text search indexes improve on this by matching individual words and their stems; PostgreSQL's tsvector, Elasticsearch's inverted index, and SQLite's FTS5 all work this way. Even so, a query for "automobile" does not retrieve documents containing only "car" unless you explicitly add synonyms.
RAG systems face a harder problem. Users ask "what are our climate-related financial exposures?" while the source document says "carbon transition risk in our equity portfolio." No keyword overlap exists. This is the vocabulary mismatch problem, and it is the primary reason vector databases exist.
Vector databases store documents as high-dimensional numerical vectors: the embeddings you learned about in Module 2. Instead of matching tokens, they measure geometric distance between vectors. With a good embedding model, two sentences that mean the same thing but share no words will have vectors that are close together, while two sentences that share words but mean different things will tend to be farther apart.
In every RAG pipeline, retrieval quality is the ceiling on answer quality. A language model cannot cite what the retriever never returns. Vector databases are the infrastructure that makes semantic retrieval possible at scale, from thousands to billions of documents.
A vector database stores three things per document chunk: the vector (an array of 384–3072 floats depending on the embedding model), the payload (the original text and any metadata like source URL, timestamp, author), and a unique ID. At query time, the user's query is embedded into the same vector space, and the database returns the k nearest neighbors by distance.
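The record layout and query path described above can be sketched in a few lines. This is a minimal illustration, not how any particular database is implemented: the `Record` class, the `knn` function, and the three-dimensional toy vectors are all hypothetical, and the search is exact (brute-force) cosine similarity.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    id: str                 # unique ID
    vector: list            # embedding (toy 3-dim vectors here)
    payload: dict = field(default_factory=dict)  # original text + metadata


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(records, query_vector, k):
    """Return the k (similarity, record) pairs most similar to the query."""
    scored = [(cosine_similarity(r.vector, query_vector), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


records = [
    Record("doc1-0", [0.9, 0.1, 0.0], {"text": "carbon transition risk", "source": "report.pdf"}),
    Record("doc2-0", [0.0, 0.2, 0.9], {"text": "office lease terms", "source": "lease.pdf"}),
    Record("doc3-0", [0.8, 0.3, 0.1], {"text": "climate exposure summary", "source": "memo.pdf"}),
]

# In a real system the query vector comes from the same embedding model.
top = knn(records, [1.0, 0.2, 0.0], k=2)
top_ids = [record.id for _, record in top]
```

Every query here touches every record, which is exactly the cost an index structure is meant to avoid.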
What separates a vector database from simply storing vectors in a NumPy array is the index structure. A brute-force search over 10 million 1536-dimensional vectors can take seconds per query, which is unusable in production. Purpose-built databases use approximate nearest neighbor (ANN) algorithms to return results in milliseconds, trading a small accuracy loss for massive speed gains.
Not every RAG deployment needs a dedicated vector database. Several approaches exist on a spectrum of complexity and capability:
| Approach | How It Works | Best For | Limitations |
|---|---|---|---|
| Flat file (NumPy) | Brute-force cosine over all vectors in memory | Prototypes, <50K docs | O(n) latency, no persistence, no filtering |
| pgvector | Vector extension for PostgreSQL; IVFFlat or HNSW indexes | Existing Postgres stack, <1M docs | Slower than purpose-built at scale; memory pressure |
| Dedicated vector DB (Pinecone, Weaviate, Qdrant, Chroma) | ANN indexes (HNSW, IVF), metadata filtering, distributed sharding | Production RAG at any scale | Operational overhead; cost at large scale |
| Hybrid search (Weaviate BM25+vector) | Combines keyword and semantic scores (RRF or weighted) | Mixed retrieval needs, legal/medical domains | More tuning required; two index maintenance paths |
Notion's engineering blog (published May 2023) described their switch from a keyword-only search backend to a vector search layer using Pinecone. Their internal evaluation showed a 2.4× improvement in "relevant block retrieved in top-3" rate on their benchmark queries. They retained BM25 for exact phrase matching and combined both signals using reciprocal rank fusion, a hybrid approach covered in Module 5.
You are an AI engineer advising three different teams. Each team has different data volume, existing infrastructure, and latency requirements. The lab assistant will describe each scenario in detail. Your job is to recommend a vector storage approach (flat file, pgvector, dedicated vector DB, or hybrid) and explain your reasoning.
Have at least 3 exchanges to complete this lab. Ask questions, challenge assumptions, and explore trade-offs.
In 2019, Meta's FAISS team published benchmarks showing their IVF index searching 1 billion vectors in under 5 milliseconds on a single GPU while achieving 90% recall compared to exact search. Their paper popularized the combination now used by virtually every production vector database: quantization to compress vectors, combined with an inverted file structure to narrow the search space.
To find the k nearest neighbors of a query vector exactly, you must compute the distance between the query and every stored vector. For 10 million 1536-dimensional vectors, each distance requires 1536 multiplications and additions, or about 15 billion multiply-add operations per query. At 1 billion multiply-adds per second, that is roughly 15 seconds per query, which is completely unacceptable. ANN indexes solve this by making a deliberate trade: accept a ~1–5% miss rate on the true nearest neighbors in exchange for a speedup on the order of 1000×.
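The arithmetic above is easy to check. The helper below is a hypothetical back-of-envelope calculator; it counts one multiply-add per dimension per vector and ignores memory bandwidth, SIMD, and parallelism, all of which change real-world numbers considerably.

```python
def brute_force_seconds(num_vectors, dim, multiply_adds_per_second):
    """Estimated time for one exact query: one multiply-add per dimension per vector."""
    total_multiply_adds = num_vectors * dim
    return total_multiply_adds / multiply_adds_per_second


seconds = brute_force_seconds(10_000_000, 1536, 1e9)

# The text's ~1000x ANN speedup would bring this down to roughly 15 ms.
ann_seconds = seconds / 1000
```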
HNSW (Hierarchical Navigable Small World) is the dominant index structure in production vector databases: Qdrant and Weaviate use it by default, and pgvector supports it alongside IVFFlat. The core idea: build a multi-layer graph where vectors are nodes. Higher layers have fewer nodes connected by long-range edges; the bottom layer contains all nodes connected by short-range edges. Search starts at the top layer, greedily following the edge to the closest neighbor, then descends through layers, refining until the bottom layer gives the final result.
HNSW's key parameters are M (maximum edges per node, typically 16–64) and ef_construction (size of the dynamic candidate list during index build, typically 100–500). Higher M and ef_construction increase recall but cost a longer build time, and a higher M also costs more RAM because every node stores more edges. At query time, ef_search controls the candidate list size: raising it increases recall at the cost of latency.
HNSW stores the full float32 vectors plus the graph edges. With M=16, expect total memory of roughly 1.1× the raw vector storage once the graph structure is included. For 1M 1536-dim vectors: ~6 GB raw + ~660 MB graph ≈ 6.7 GB total in RAM. This is why quantization matters at scale.
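A rough sizing helper makes the estimate reproducible. This is a sketch under stated assumptions: the function name is hypothetical, and the 660 bytes of graph data per vector is simply the text's ~660 MB figure divided by 1M vectors, not a measured value for any specific database. Because the text rounds 6.144 GB of raw storage down to ~6 GB, the total here comes out at about 6.8 GB.

```python
def hnsw_ram_bytes(num_vectors, dim, bytes_per_value=4, graph_bytes_per_vector=660):
    """Return (raw vector bytes, graph bytes, total bytes) for an in-RAM HNSW index."""
    raw = num_vectors * dim * bytes_per_value
    graph = num_vectors * graph_bytes_per_vector  # assumed figure, taken from the text
    return raw, graph, raw + graph


raw_f32, graph_f32, total_f32 = hnsw_ram_bytes(1_000_000, 1536)

# Same collection with 1-byte scalar-quantized values; the graph does not shrink.
raw_sq8, graph_sq8, total_sq8 = hnsw_ram_bytes(1_000_000, 1536, bytes_per_value=1)

print(f"float32: {total_f32 / 1e9:.2f} GB, SQ8: {total_sq8 / 1e9:.2f} GB")
```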
IVF (inverted file index) partitions the vector space into nlist Voronoi cells using k-means clustering. Each vector is assigned to its nearest centroid. At query time, the query is compared against the centroids, and only the nprobe nearest cells are searched exhaustively. With nlist=1024 and nprobe=32, only about 3% of vectors (32/1024) are evaluated per query, assuming roughly balanced cells.
IVF requires a training step (running k-means on a representative sample of your data) before building the index. This makes it less flexible than HNSW for collections that change frequently. However, IVF pairs naturally with product quantization (PQ), which compresses each vector into a compact code, enabling billion-scale search that fits entirely in RAM.
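The train, assign, and probe steps described above can be sketched in pure Python. This is a toy illustration under loud assumptions: the function names are hypothetical, training is a few iterations of plain Lloyd's k-means on random data, and no quantization is applied. It is not how FAISS implements IVF.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(v, centroids):
    return min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))


def train_centroids(vectors, nlist, iterations=5, seed=0):
    """Training step: a few rounds of k-means starting from sampled vectors."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iterations):
        members = [[] for _ in range(nlist)]
        for v in vectors:
            members[nearest_centroid(v, centroids)].append(v)
        for c, cell in enumerate(members):
            if cell:  # keep the previous centroid if a cell ends up empty
                centroids[c] = [sum(col) / len(cell) for col in zip(*cell)]
    return centroids


def build_cells(vectors, centroids):
    """Inverted lists: cell number -> indices of the vectors assigned to it."""
    cells = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        cells[nearest_centroid(v, centroids)].append(idx)
    return cells


def ivf_search(query, vectors, centroids, cells, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in cells[c]]
    candidates.sort(key=lambda idx: sq_dist(query, vectors[idx]))
    return candidates[:k], len(candidates)


rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
centroids = train_centroids(vectors, nlist=32)
cells = build_cells(vectors, centroids)
query = [rng.gauss(0, 1) for _ in range(8)]

exact = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:10]
approx, scanned = ivf_search(query, vectors, centroids, cells, k=10, nprobe=4)
full, scanned_all = ivf_search(query, vectors, centroids, cells, k=10, nprobe=32)
recall = len(set(approx) & set(exact)) / 10
```

Raising nprobe scans more cells and moves recall toward 100%; with nprobe equal to nlist the search degenerates into an exact scan of every vector.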
FAISS, Meta's open-source library, is the canonical IVF implementation. Qdrant and Weaviate use HNSW. Pinecone uses a proprietary variant. pgvector offers both IVFFlat and HNSW.
Scalar quantization (SQ8) converts each float32 dimension (4 bytes) to an int8 (1 byte), reducing vector size by 4×. Distance computation on int8 values uses integer arithmetic, which is faster on most CPUs. Recall loss is typically under 1% for well-tuned SQ8.
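A minimal sketch of the idea follows. It makes two simplifying assumptions that are not from the text: the value range is taken per vector (real systems often calibrate ranges per dimension or over the whole collection), and codes are stored as unsigned bytes (0 to 255) instead of signed int8. The function names are hypothetical.

```python
def sq8_encode(vector):
    """Map each float to one byte using the vector's own min/max range."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant vectors
    codes = bytes(round((x - lo) / scale) for x in vector)
    return codes, lo, scale


def sq8_decode(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + c * scale for c in codes]


vector = [0.12, -0.5, 0.33, 0.9]
codes, lo, scale = sq8_encode(vector)
restored = sq8_decode(codes, lo, scale)

# Rounding to the nearest code bounds the per-dimension error by half a step.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```

The stored size drops from 4 bytes to 1 byte per dimension, plus the two floats (`lo`, `scale`) needed to decode.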
Product quantization (PQ) is more aggressive: it divides the vector into m sub-vectors and quantizes each sub-vector to one of 256 centroids (1 byte). A 1536-dim float32 vector (6144 bytes) becomes 96 bytes with PQ-96, a 64× compression. Distance lookups use precomputed lookup tables rather than full arithmetic. Recall drops more substantially than SQ8 but enables billion-scale indexes on consumer hardware.
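The compression figures above follow directly from the parameters. The helper below is a hypothetical calculator that only reproduces the size arithmetic; it does not train codebooks or encode anything.

```python
def pq_sizes(dim, m, bytes_per_float=4, bits_per_code=8):
    """Return (raw bytes, PQ code bytes, compression ratio, dims per sub-vector)."""
    if dim % m != 0:
        raise ValueError("dim must be divisible by the number of sub-vectors m")
    raw_bytes = dim * bytes_per_float
    code_bytes = m * bits_per_code // 8  # one code per sub-vector
    return raw_bytes, code_bytes, raw_bytes / code_bytes, dim // m


raw_bytes, code_bytes, ratio, sub_dim = pq_sizes(dim=1536, m=96)

# With 8-bit codes, each 16-dim sub-vector is replaced by the index
# of its nearest centroid among 2**8 = 256 candidates.
```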
| Index Type | Build Speed | Query Speed | RAM Usage | Recall@10 | Best Use Case |
|---|---|---|---|---|---|
| Flat (exact) | Fast | Very slow at scale | Low (vectors only) | 100% | <100K vectors, offline evals |
| HNSW | Slow (graph build) | Very fast | High (graph + vectors) | 95–99% | Production, <100M, frequent updates |
| IVFFlat | Medium (k-means) | Fast | Medium | 90–97% | Large static collections |
| IVFPQ | Slow (k-means + PQ) | Very fast | Very low | 80–93% | Billion-scale, RAM-constrained |
| HNSW + SQ8 | Slow | Very fast | Medium | 94–98% | Production balance of speed and recall |
For most RAG applications, HNSW with default M=16, ef_construction=128, and ef_search=64 is a safe starting point. Measure recall against a ground-truth dataset of representative queries. If recall is below 95%, increase ef_search first (cheapest), then M at rebuild. If latency exceeds your SLA at your target ef_search, consider SQ8 quantization to reduce the per-vector comparison cost.
Never tune index parameters without a recall benchmark. Weaviate, Qdrant, and Pinecone all provide recall evaluation utilities. The ANN-benchmarks project (ann-benchmarks.com) maintains standardized comparisons across datasets including SIFT-1M, GIST-960, and text-based datasets derived from MS-MARCO.
As of 2024, on the SIFT-1M dataset (1 million 128-dimensional float vectors), HNSW implementations (hnswlib, Qdrant) achieve 99% recall@10 at under 1ms query time using a single CPU core. IVF-PQ achieves 85% recall at 0.3ms, a recall/speed trade-off that matters when cost is the constraint.
Your RAG system is exhibiting two symptoms: poor retrieval recall (the right documents are not appearing in results) and high query latency. The lab assistant plays the role of your on-call infrastructure engineer who has run diagnostics. Work through the index configuration together: identify which parameters are misconfigured and propose remediation steps.
Have at least 3 exchanges. Dig into specific parameter values, expected recall ranges, and the rebuild vs. query-time trade-offs.
Elastic's 2023 engineering post on their internal legal document system described a failure mode: pure vector search returned highly relevant precedents from the wrong jurisdiction. A query about California employment law surfaced federal cases because the embedding model weighted legal language more heavily than geographic tags. The fix was pre-filtering by a jurisdiction metadata field before computing vector similarity, which reduced the candidate set from 2 million documents to 40,000 before ANN search ran.
In real deployments, retrieval correctness is not just about semantic similarity. A RAG system for a financial institution must not surface documents from the wrong regulatory regime. A healthcare RAG must not retrieve records from the wrong patient or time period. A multi-tenant SaaS RAG must not leak one customer's documents to another. These constraints are hard requirements, not preferences, and they cannot be solved by embedding alone.
Metadata filtering solves this by attaching structured fields to each vector at ingestion time: tenant_id, created_at, document_type, jurisdiction, clearance_level, language. At query time, a filter expression runs before or during ANN search, ensuring that only compliant documents are considered.
Post-filtering runs ANN search first, retrieves k results, then discards those that fail the metadata condition. Simple to implement but dangerous: if the filter removes many results, you may return fewer than k documents, or return nothing. This approach fails when the filtered subset is a small fraction of the collection.
Pre-filtering (also called filtered search) applies the metadata condition before ANN search. Only vectors satisfying the filter are candidates. Most production vector databases (Qdrant, Weaviate, Pinecone) implement this. The challenge: if the filtered subset is very small, HNSW's graph structure is sparse in that region and recall degrades. Qdrant's documentation recommends falling back to brute-force over the filtered subset when it contains fewer than ~10,000 vectors.
In-filter (payload-aware indexing) is Qdrant's approach: it maintains separate HNSW sub-graphs for common filter values, enabling efficient filtered search even on small subsets. This is the state of the art for mixed workloads with diverse metadata filters.
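The difference between the first two strategies is easiest to see on a toy collection. In this sketch, exact nearest-neighbor search stands in for the ANN step, and the record layout and function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def post_filter_search(records, query, k, predicate):
    """Search first, then drop results that fail the metadata condition."""
    nearest = sorted(records, key=lambda r: sq_dist(r["vector"], query))[:k]
    return [r for r in nearest if predicate(r["payload"])]


def pre_filter_search(records, query, k, predicate):
    """Apply the metadata condition first, then search only the survivors."""
    allowed = [r for r in records if predicate(r["payload"])]
    return sorted(allowed, key=lambda r: sq_dist(r["vector"], query))[:k]


records = [
    {"id": "fed-1", "vector": [0.0, 0.0], "payload": {"jurisdiction": "federal"}},
    {"id": "fed-2", "vector": [0.1, 0.0], "payload": {"jurisdiction": "federal"}},
    {"id": "fed-3", "vector": [0.0, 0.1], "payload": {"jurisdiction": "federal"}},
    {"id": "fed-4", "vector": [0.1, 0.1], "payload": {"jurisdiction": "federal"}},
    {"id": "ca-1", "vector": [0.5, 0.5], "payload": {"jurisdiction": "california"}},
    {"id": "ca-2", "vector": [0.6, 0.4], "payload": {"jurisdiction": "california"}},
]


def is_california(payload):
    return payload["jurisdiction"] == "california"


query = [0.0, 0.0]
post_hits = [r["id"] for r in post_filter_search(records, query, 2, is_california)]
pre_hits = [r["id"] for r in pre_filter_search(records, query, 2, is_california)]
```

Here the two nearest vectors are both federal, so post-filtering returns nothing, while pre-filtering returns both California documents.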
A frequent production bug: applying a date filter that is too narrow (e.g., last 7 days) when the relevant document is slightly older. Always design your filter logic with domain experts. Consider a tiered approach: strict filter first, then relaxed filter if results are insufficient.
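The tiered approach can be sketched as a loop over progressively looser filters. The function name, the `age_days` field, and the tier values below are illustrative assumptions; the candidate list is assumed to be already ranked by relevance.

```python
def search_with_fallback(ranked_docs, k, max_age_tiers):
    """Try each age limit in order (strictest first); stop once k hits are found."""
    used_tier, hits = None, []
    for max_age in max_age_tiers:
        used_tier = max_age
        hits = [d for d in ranked_docs if d["age_days"] <= max_age][:k]
        if len(hits) >= k:
            break
    return used_tier, hits


# Candidates ordered by relevance, most relevant first.
ranked_docs = [
    {"id": "policy-v3", "age_days": 9},
    {"id": "memo-12", "age_days": 3},
    {"id": "archive-7", "age_days": 40},
    {"id": "faq-2", "age_days": 12},
]

tier, hits = search_with_fallback(ranked_docs, k=2, max_age_tiers=[7, 30, 365])
hit_ids = [d["id"] for d in hits]
```

The 7-day filter finds only one document, so the search relaxes to 30 days and recovers the slightly older but more relevant policy document. Logging which tier was used makes overly strict filters visible.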
Hybrid search combines keyword (BM25) scores with vector similarity scores to produce a unified ranked list. It is particularly effective in domains where exact terminology matters: medical diagnosis codes, legal citations, product SKUs, regulatory references. A query for "ICD-10 code E11.9" benefits from keyword precision; a query for "type 2 diabetes management" benefits from semantic recall.
Reciprocal Rank Fusion (RRF) is the dominant fusion algorithm in production systems. It needs essentially no tuning (it uses only result ranks, not raw scores, plus a single smoothing constant conventionally set to 60), is robust to score scale differences between BM25 and cosine similarity, and performs comparably to or better than learned weighting on most benchmarks. The formula: for each document, sum 1/(rank + 60) across all retrieval lists, then re-rank by descending score.
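The formula translates almost directly into code. This is a minimal sketch with a hypothetical function name and made-up document IDs; ranks start at 1, and a document missing from a list simply contributes nothing for that list.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs by summing 1 / (rank + k)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank + k)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["d3", "d1", "d7"]     # keyword results, best first
vector_ranking = ["d1", "d4", "d3"]   # semantic results, best first

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents that appear in both lists (d1 and d3) rise to the top, ahead of documents that rank well in only one list.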
Weaviate's BM25+vector hybrid was benchmarked in their 2023 BEIR evaluation across 18 retrieval datasets. Hybrid search improved mean NDCG@10 by 6.2 percentage points over pure vector search and by 11.4 points over pure BM25, particularly on datasets with technical terminology (TREC-COVID, FiQA, SciFact).
Good payload design at ingestion time prevents painful schema migrations later. At minimum, index these fields on every vector collection: source_id (document identifier), chunk_index (position within document), created_at (Unix timestamp for range filters), doc_type (enum for document class), and tenant_id if multi-tenant. Use low-cardinality fields for indexed payloads; high-cardinality string fields (like free-text author names) are better stored but not indexed as filter targets.
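One way to keep the payload consistent across ingestion jobs is to define it once in code. The dataclass below is a hypothetical sketch using the field names from the text; the `text` and `author` fields and the `INDEXED_FIELDS` tuple are illustrative additions, not part of any database's API.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkPayload:
    source_id: str                   # document identifier
    chunk_index: int                 # position within the document
    created_at: int                  # Unix timestamp, usable in range filters
    doc_type: str                    # small fixed set of document classes
    tenant_id: Optional[str] = None  # only needed for multi-tenant collections
    text: str = ""                   # stored for retrieval, not a filter target
    author: str = ""                 # free text: stored but left unindexed


# Fields that would get a payload index for filtering.
INDEXED_FIELDS = ("source_id", "chunk_index", "created_at", "doc_type", "tenant_id")

payload = asdict(ChunkPayload(
    source_id="contract-0042",
    chunk_index=3,
    created_at=1700000000,
    doc_type="contract",
    tenant_id="firm-a",
    text="The lessee shall...",
))
```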
In Qdrant, payload indexes are created explicitly with create_payload_index calls specifying the field name and type (keyword, integer, float, datetime, bool, geo). Unindexed payload fields can still be retrieved but cannot be used in filtered searches efficiently.
By pre-filtering on the jurisdiction field before ANN search, Elastic's legal team reduced their false-jurisdiction retrieval rate from 23% to under 2% of returned results. Query latency actually decreased slightly because the candidate pool shrank: the HNSW search over 40,000 jurisdiction-matched vectors ran faster than over 2 million.
You are architecting the vector database layer for a legal technology company building a multi-tenant RAG product. Their users are law firms across three jurisdictions (federal, California, New York). Each firm must only see their own documents. Users also want to filter by case type, date range, and document class. Design the payload schema, filtering strategy, and hybrid search configuration.
Work through at least 3 exchanges. Justify your schema choices, handle edge cases (very small filtered subsets, tenant isolation failures), and propose a fusion strategy.
Qdrant's engineering blog described a customer incident where a batch ingestion job inserted 500,000 vectors in a single transaction without pausing for index optimization. The HNSW graph became fragmented: new vectors were connected only to recently added neighbors rather than being globally integrated into the graph. Recall dropped from 97% to 71% without any error messages. The fix required a manual index optimization call and a policy change to run optimization after every 50,000 inserts.
A production ingestion pipeline for a vector database has five stages: chunking (splitting source documents into appropriately sized segments), embedding (calling the embedding API or local model to generate vectors), batching (grouping vectors for efficient upsert), upsert (inserting or updating vectors in the database), and index optimization (triggering or waiting for the index to absorb new vectors into its graph structure).
Most vector databases accept batch upserts of 100–1000 vectors per API call. Smaller batches increase API overhead; larger batches can cause memory spikes and, in the case of HNSW, graph fragmentation as the Qdrant incident illustrates. The sweet spot for most collections is 256–512 vectors per batch with an optimization trigger every 10,000–50,000 inserts.
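The batching and optimization cadence described above might look like the loop below. The `upsert` and `optimize` callables are placeholders for whatever your database client provides; they are assumptions for illustration, not real client methods.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest(points, upsert, optimize, batch_size=256, optimize_every=10_000):
    """Upsert in fixed-size batches and trigger optimization periodically."""
    since_optimize = 0
    for batch in batched(points, batch_size):
        upsert(batch)
        since_optimize += len(batch)
        if since_optimize >= optimize_every:
            optimize()
            since_optimize = 0
    if since_optimize:  # absorb the final partial window
        optimize()


# Stand-ins that only record what was called.
upsert_sizes, optimize_calls = [], []
ingest(
    range(25_000),
    upsert=lambda batch: upsert_sizes.append(len(batch)),
    optimize=lambda: optimize_calls.append(1),
)
```

With 25,000 points this issues 98 upsert calls and triggers optimization three times: twice during ingestion and once at the end.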
Vector databases handle updates differently from relational databases. Most use an upsert model: provide the same ID as an existing vector, and the old record is replaced. The old vector's position in the HNSW graph is marked as deleted (a "tombstone") and the new vector is re-inserted. Heavy update workloads accumulate tombstones that degrade recall until the index is compacted.
Deletions are similarly lazy in most implementations. Pinecone, Qdrant, and Weaviate all use tombstone-based deletion. Qdrant exposes a vectors_count vs. indexed_vectors_count metric in its collection info β a large gap indicates pending optimization work. Always monitor this ratio in production.
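A simple health check on that gap could look like this. The function names and the 10% alert threshold are assumptions for illustration; the two counts are the values named in the text, however you obtain them from your database.

```python
def optimization_lag(vectors_count, indexed_vectors_count):
    """Fraction of vectors that the index has not yet absorbed."""
    if vectors_count == 0:
        return 0.0
    return max(0, vectors_count - indexed_vectors_count) / vectors_count


def needs_attention(vectors_count, indexed_vectors_count, threshold=0.10):
    """True when the unindexed fraction exceeds the (assumed) alert threshold."""
    return optimization_lag(vectors_count, indexed_vectors_count) > threshold


lag = optimization_lag(1_000_000, 850_000)
alert = needs_attention(1_000_000, 850_000)
```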
For RAG systems that track versioned documents (source updates, regulatory amendments), a best practice is to include a version or updated_at field in the payload and delete old version chunks explicitly during ingestion rather than relying on upsert; this avoids the tombstone accumulation pattern.
If your RAG system ingests updated documents frequently (daily news, changing knowledge bases), schedule index optimization jobs at off-peak hours. Qdrant's optimizer runs automatically but can be triggered manually. Weaviate uses a similar background compaction process. Unchecked tombstone buildup degrades recall silently: the system returns results but misses recently updated content.
When a single-node vector database runs out of RAM or its query latency exceeds SLA at peak load, you need to distribute the collection across multiple nodes. Two approaches: sharding (distributing different vectors across nodes) and replication (copying the same vectors to multiple nodes).
Sharding increases total capacity: a collection sharded across 4 nodes can hold 4× as many vectors. But each query must fan out to all shards and aggregate results, which increases latency and adds network overhead. Qdrant uses a consistent hashing scheme for shard assignment; Pinecone abstracts sharding entirely behind their managed service. Weaviate supports both explicit shard configuration and auto-sharding based on collection size.
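The aggregation step of a fan-out query is a k-way merge. In this sketch each shard is assumed to return its own top results as (distance, id) pairs sorted by ascending distance; the function name and the sample data are hypothetical.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, k):
    """Merge per-shard sorted (distance, id) lists into one global top-k."""
    return list(islice(heapq.merge(*shard_results), k))


shard_a = [(0.10, "a1"), (0.40, "a2")]
shard_b = [(0.05, "b1"), (0.30, "b2")]
shard_c = [(0.20, "c1")]

top3 = merge_shard_results([shard_a, shard_b, shard_c], k=3)
top3_ids = [doc_id for _, doc_id in top3]
```

Each shard must return at least k candidates (when it has them) for the merged top-k to be correct, which is part of why fan-out adds latency: the query is only as fast as the slowest shard.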
Replication improves query throughput and availability. With 3 replicas, read queries can be load-balanced across all three, tripling read capacity. Qdrant's write consistency setting controls how many replicas must acknowledge a write before it returns success, configurable as One, Quorum, or All.
The four metrics every production vector database deployment should track: Recall@k (run periodic ground-truth evaluation with held-out query sets), p99 query latency (set alerts at 2× your baseline), indexed vs. total vector count (monitor optimization lag), and collection size vs. RAM headroom (HNSW needs all vectors in RAM; alert at 80% usage).
For recall monitoring specifically, maintain a golden dataset: 200–500 representative queries with human-labeled relevant documents. Run this evaluation daily or after each major ingestion batch. A recall drop of more than 3 percentage points from baseline warrants investigation: it may indicate tombstone buildup, an index fragmentation issue, or an embedding model version mismatch.
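A golden-dataset check can be a short script. Everything below is an illustrative sketch: the function names, the two-query golden set, and the canned search results are hypothetical, and `search_fn` stands in for a call to your retrieval stack. Recall is computed as the share of labeled relevant documents found in the top k.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of the labeled relevant documents that appear in the top k results."""
    found = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)


def mean_recall(golden_set, search_fn, k):
    """Average recall@k over all golden queries."""
    scores = [recall_at_k(search_fn(q["query"]), q["relevant"], k) for q in golden_set]
    return sum(scores) / len(scores)


def recall_regressed(current, baseline, max_drop=0.03):
    """True when recall fell more than max_drop (3 percentage points by default)."""
    return (baseline - current) > max_drop


golden_set = [
    {"query": "climate exposure", "relevant": ["d1", "d2"]},
    {"query": "lease renewal terms", "relevant": ["d5"]},
]

# Canned results standing in for a live retriever.
canned_results = {
    "climate exposure": ["d1", "d9", "d2"],
    "lease renewal terms": ["d7", "d8", "d6"],
}

current = mean_recall(golden_set, canned_results.get, k=3)
alert = recall_regressed(current, baseline=0.94)
```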
One of the most disruptive production events in a vector database deployment is an embedding model upgrade. If you upgrade from OpenAI's text-embedding-ada-002 to text-embedding-3-large mid-deployment, the new query vectors are in a different vector space than the stored document vectors. Results become garbage: high cosine similarity scores for semantically unrelated content.
The correct procedure: provision a new collection, re-embed all documents with the new model, backfill metadata, validate recall on your golden dataset, then cut over query traffic to the new collection. Do not mix embedding model versions within a single collection. Always store the embedding model name and version as a collection-level metadata field so future engineers know exactly how vectors were generated.
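Storing the model name pays off when it is actually checked at query time. The guard below is a hypothetical sketch: the metadata dictionary, its keys, and the function name are assumptions about how you might record this information, not a feature of any specific database.

```python
def assert_same_embedding_model(collection_metadata, query_model, query_dim):
    """Refuse to query a collection built with a different embedding model."""
    stored_model = collection_metadata.get("embedding_model")
    stored_dim = collection_metadata.get("embedding_dim")
    if stored_model != query_model or stored_dim != query_dim:
        raise ValueError(
            f"collection was built with {stored_model!r} ({stored_dim} dims), "
            f"but the query uses {query_model!r} ({query_dim} dims)"
        )


collection_metadata = {
    "embedding_model": "text-embedding-ada-002",
    "embedding_dim": 1536,
}

# Matching model: passes silently.
assert_same_embedding_model(collection_metadata, "text-embedding-ada-002", 1536)

# Mismatched model: fails loudly instead of returning garbage results.
try:
    assert_same_embedding_model(collection_metadata, "text-embedding-3-large", 3072)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```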
Pinecone's documentation and several engineering blog posts (including from Cohere's integration team, 2023) describe using Pinecone namespaces to implement blue/green embedding model upgrades: the "green" namespace contains re-embedded documents with the new model, query traffic is gradually shifted using a feature flag, and the "blue" namespace is deleted once confidence is high. This pattern avoids any service interruption during model migration.
You are the on-call engineer. Your RAG system's recall metric has dropped from 94% to 67% over the past 48 hours. No alerts fired. No exceptions in the logs. Users are complaining that the assistant "doesn't know things it used to know." You have access to Qdrant's collection info API and your application logs. Work with the lab assistant (playing your SRE partner) to diagnose root cause and write a remediation runbook.
Complete at least 3 exchanges. Check every possible cause systematically: tombstones, model version mismatch, shard rebalancing, optimizer lag. Propose a fix and a monitoring improvement to prevent recurrence.