Module 4 · Lesson 1

What Is a Vector Database?

From flat files to geometric search — how similarity replaced exact match

Why can't a traditional database find "documents about climate risk" when the text says "carbon exposure"?

In 2022, researchers at the Allen Institute for AI benchmarked retrieval on PubMed's 35 million abstracts. A BM25 keyword index — the same kind powering most SQL full-text search — missed 38% of relevant papers that used synonymous phrasing. A vector index on the same corpus missed only 9%. The difference was geometry: vectors encode meaning, not characters.

The Core Problem with Traditional Search

Relational databases answer exact questions: give me all rows where status = 'active'. Full-text search indexes improve on this by matching individual words and their stems — PostgreSQL's tsvector, Elasticsearch's inverted index, SQLite's FTS5 all work this way. A query for "automobile" does not retrieve documents containing only "car" unless you explicitly add synonyms.

RAG systems face a harder problem. Users ask "what are our climate-related financial exposures?" while the source document says "carbon transition risk in our equity portfolio." No keyword overlap exists. This is the vocabulary mismatch problem, and it is the primary reason vector databases exist.

Vector databases store documents as high-dimensional numerical vectors — the embeddings you learned about in Module 2. Instead of matching tokens, they measure geometric distance between vectors. Two sentences that mean the same thing but share no words will have vectors that are close together. Two sentences that share words but mean opposite things will be far apart.

Why This Matters for RAG

In every RAG pipeline, retrieval quality is the ceiling on answer quality. A language model cannot cite what the retriever never returns. Vector databases are the infrastructure that makes semantic retrieval possible at scale — from thousands to billions of documents.

Anatomy of a Vector Database

A vector database stores three things per document chunk: the vector (an array of 384–3072 floats depending on the embedding model), the payload (the original text and any metadata like source URL, timestamp, author), and a unique ID. At query time, the user's query is embedded into the same vector space, and the database returns the k nearest neighbors by distance.
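The record layout and query flow described above can be sketched in a few lines. This is a minimal illustration that assumes toy 3-dimensional vectors and a hypothetical `Record` class. It uses exact brute-force cosine search rather than any particular database's API.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    """One stored chunk: unique ID, embedding vector, and payload."""
    id: str
    vector: list[float]
    payload: dict = field(default_factory=dict)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(records: list[Record], query: list[float], k: int) -> list[tuple[str, float]]:
    """Exact search: score every record, keep the k most similar."""
    scored = [(r.id, cosine_similarity(r.vector, query)) for r in records]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real models emit 384-3072 floats.
records = [
    Record("doc-1", [0.9, 0.1, 0.0], {"text": "carbon transition risk", "source": "report.pdf"}),
    Record("doc-2", [0.0, 0.2, 0.9], {"text": "quarterly hiring plan", "source": "hr.docx"}),
    Record("doc-3", [0.8, 0.3, 0.1], {"text": "climate-related exposure", "source": "memo.txt"}),
]
results = top_k(records, query=[1.0, 0.2, 0.0], k=2)
```

A real deployment would replace `top_k` with an ANN index. The three-part record shape stays the same.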

What separates a vector database from simply storing vectors in a NumPy array is the index structure. A brute-force search over 10 million 1536-dimensional vectors takes seconds per query — unusable in production. Purpose-built databases use approximate nearest neighbor (ANN) algorithms to return results in milliseconds, trading a small accuracy loss for massive speed gains.

Typical Dimensions

384 – 3,072

Floats per vector. Ada-002 uses 1536; text-embedding-3-large supports up to 3072.

Query Latency (ANN)

< 10 ms

Typical p99 for HNSW indexes on 10M vectors with 1 GPU shard.

Recall vs. Exact Search

95–99%

Well-tuned ANN parameters recover this fraction of exact nearest neighbors.

Storage Overhead

~6 MB/1K docs

1536-dim float32 vectors. Quantization can reduce this 4–8×.

Vector Databases vs. Alternatives

Not every RAG deployment needs a dedicated vector database. Several approaches exist on a spectrum of complexity and capability:

Approach	How It Works	Best For	Limitations
Flat file (NumPy)	Brute-force cosine over all vectors in memory	Prototypes, <50K docs	O(n) latency, no persistence, no filtering
pgvector	Vector extension for PostgreSQL; IVFFlat or HNSW indexes	Existing Postgres stack, <1M docs	Slower than purpose-built at scale; memory pressure
Dedicated vector DB (Pinecone, Weaviate, Qdrant, Chroma)	ANN indexes (HNSW, IVF), metadata filtering, distributed sharding	Production RAG at any scale	Operational overhead; cost at large scale
Hybrid search (Weaviate BM25+vector)	Combines keyword and semantic scores (RRF or weighted)	Mixed retrieval needs, legal/medical domains	More tuning required; two index maintenance paths

Real Deployment — Notion AI (2023)

Notion's engineering blog (published May 2023) described their switch from a keyword-only search backend to a vector search layer using Pinecone. Their internal evaluation showed a 2.4× improvement in "relevant block retrieved in top-3" rate on their benchmark queries. They retained BM25 for exact phrase matching and combined both signals using reciprocal rank fusion — a hybrid approach covered in Module 5.

Key Terms

Vector space: A mathematical space where each dimension corresponds to a learned feature; semantically similar items cluster geometrically close.

ANN (Approximate Nearest Neighbor): Algorithms that find near-exact nearest neighbors in sub-linear time by trading perfect recall for speed.

Payload / metadata: Structured fields stored alongside each vector: source URL, date, document type, section, etc. Used for filtering.

Recall@k: The fraction of true nearest neighbors found in the top-k results returned by an ANN index. Higher = better retrieval.

pgvector: Open-source PostgreSQL extension that adds vector column types and IVFFlat/HNSW index support directly in SQL.

Lesson 1 Quiz

What Is a Vector Database?

1. What fundamental limitation of traditional full-text search does a vector database overcome?

Correct. The vocabulary mismatch problem — where "automobile" and "car" share no tokens — is what keyword indexes cannot handle. Vectors encode meaning, so semantically similar text clusters nearby regardless of surface wording.

Not quite. Traditional databases can store large documents and support Boolean queries. The core limitation is semantic: they match tokens, not meaning.

2. In a vector database, what does ANN stand for and why is it necessary?

Correct. ANN algorithms like HNSW and IVF return near-exact results in milliseconds by avoiding exhaustive comparison across all stored vectors.

ANN stands for Approximate Nearest Neighbor. Its necessity comes from the O(n) cost of brute-force search: scanning millions of high-dimensional vectors per query is far too slow for production use.

3. According to Notion's 2023 engineering blog, what metric improved 2.4× after adding vector search?

Correct. Notion measured "relevant block retrieved in top-3" on their internal benchmark and saw a 2.4× improvement over their prior keyword-only backend.

Notion measured retrieval quality, not latency or cost. Their benchmark tracked how often a relevant block appeared in the top-3 results — this rate improved 2.4×.

4. What is stored alongside each vector in a vector database?

Correct. Each record in a vector database has three parts: the vector itself, a unique ID, and a payload with the source text and metadata (source URL, date, section, etc.).

Vector databases co-locate the vector with its payload — the original text and structured metadata — so the retriever can return meaningful context without a separate lookup.

Lab 1 — Choosing a Vector Store

Practice selecting the right vector storage strategy for a given scenario

Your Task

You are an AI engineer advising three different teams. Each team has different data volume, existing infrastructure, and latency requirements. The lab assistant will describe each scenario in detail. Your job is to recommend a vector storage approach (flat file, pgvector, dedicated vector DB, or hybrid) and explain your reasoning.

Have at least 3 exchanges to complete this lab. Ask questions, challenge assumptions, and explore trade-offs.

Starter: "Walk me through the first scenario — what does Team A look like?"


Module 4 · Lesson 2

Index Structures: HNSW, IVF, and Scalar Quantization

The algorithms that make billion-vector search possible in milliseconds

How does a database find the 5 closest vectors out of 100 million in under 10 milliseconds?

In 2019, Meta's FAISS team published benchmarks showing their IVF index searching 1 billion vectors in under 5 milliseconds on a single GPU — achieving 90% recall compared to exact search. Their paper introduced the combination now used by virtually every production vector database: quantization to compress vectors followed by inverted file structure to narrow the search space.

Why Brute Force Doesn't Scale

To find the k nearest neighbors of a query vector exactly, you must compute the distance between the query and every stored vector. For 10 million 1536-dimensional vectors, each distance requires 1,536 multiply-add operations, or about 1.5 × 10^10 in total. At 1 billion multiply-adds per second, that is roughly 15 seconds per query, which is completely unacceptable. ANN indexes solve this by making a deliberate trade: accept a ~1–5% miss rate on the true nearest neighbors in exchange for a 100–1000× speed improvement.
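The arithmetic behind that estimate is easy to reproduce. The helper below is an illustrative sketch; the function name and the throughput figure are assumptions chosen to match the paragraph, and each multiply-add is counted as one operation.

```python
def brute_force_seconds(num_vectors: int, dims: int, madds_per_second: float) -> float:
    """Time to compare one query against every stored vector."""
    multiply_adds = num_vectors * dims  # one multiply-add per dimension per vector
    return multiply_adds / madds_per_second


# 10 million 1536-dimensional vectors at 1 billion multiply-adds per second.
seconds_per_query = brute_force_seconds(10_000_000, 1536, 1e9)
print(f"{seconds_per_query:.1f} seconds per query")
```

Real hardware with SIMD or GPUs is much faster than this assumed rate, but the cost still grows linearly with collection size.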

HNSW: Hierarchical Navigable Small World

HNSW is the dominant index structure in production vector databases (Qdrant and Weaviate use it by default; pgvector supports it alongside IVFFlat). The core idea: build a multi-layer graph where vectors are nodes. Higher layers have fewer nodes connected by long-range edges; the bottom layer contains all nodes connected by short-range edges. Search starts at the top layer, greedily following the edge to the closest neighbor, then descends through layers, refining until the bottom layer gives the final result.
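The layered descent can be sketched with a tiny hand-built graph. This is a simplified illustration, not a full HNSW implementation: it covers only the search phase, keeps a single candidate (equivalent to a candidate list of size 1), and assumes the layers were already constructed. All names and the toy data are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_walk(graph, vectors, query, start):
    """Move to the closest neighbor repeatedly until no neighbor is closer."""
    current = start
    current_dist = distance(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph.get(current, []):
            d = distance(vectors[neighbor], query)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current


def layered_search(layers, vectors, query, entry_point):
    """Run a greedy walk on each layer, top layer first, reusing the result as the next start."""
    current = entry_point
    for graph in layers:
        current = greedy_walk(graph, vectors, query, current)
    return current


# Eight points on a line. The top layer holds a few nodes with long edges;
# the bottom layer holds every node with short edges to its neighbors.
vectors = {i: [float(i)] for i in range(8)}
top_layer = {0: [4], 4: [0, 7], 7: [4]}
bottom_layer = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

found = layered_search([top_layer, bottom_layer], vectors, query=[5.2], entry_point=0)
```

Here the top layer jumps from node 0 to node 4 in one hop, and the bottom layer refines the result to node 5. Production implementations track many candidates at once, which is what ef_search controls.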

HNSW's key parameters are M (maximum edges per node, typically 16–64) and ef_construction (size of the dynamic candidate list during index build, typically 100–500). Higher M and ef_construction improve graph quality and therefore recall, but they lengthen build time, and higher M also costs more RAM. At query time, ef_search controls the candidate list size: raising it increases recall at the cost of latency.

HNSW Memory Footprint

HNSW stores the full float32 vectors plus the graph edges. With M=16, expect the graph structure to add roughly 10% on top of the raw vector storage, for about 1.1× in total. For 1M 1536-dim vectors: ~6 GB raw + ~660 MB graph ≈ 6.7 GB total in RAM. This is why quantization matters at scale.

IVF: Inverted File Index

IVF partitions the vector space into nlist Voronoi cells using k-means clustering. Each vector is assigned to its nearest centroid. At query time, the query is compared against centroids, and only the nprobe nearest cells are searched exhaustively. With nlist=1024 and nprobe=32, each query scans 32 of 1,024 cells, which is about 3% of the vectors when cells are evenly populated.

IVF requires a training step (running k-means on a representative sample of your data) before building the index. This makes it less flexible than HNSW for collections that change frequently. However, IVF pairs naturally with product quantization (PQ), which compresses each vector into a compact code, enabling billion-scale search that fits entirely in RAM.
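A minimal sketch of the IVF idea follows. To stay short it assumes the centroids are already given (standing in for the k-means training step) and uses toy 2-dimensional data; the function names are hypothetical and do not mirror the FAISS API.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (ordering matches true distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector index to the cell of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        cells[nearest].append(idx)
    return cells


def ivf_search(vectors, centroids, cells, query, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    ranked_cells = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [idx for c in ranked_cells[:nprobe] for idx in cells[c]]
    candidates.sort(key=lambda idx: sq_dist(vectors[idx], query))
    return candidates[:k]


# Assumed, pre-trained centroids (nlist = 4) and a handful of vectors.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]
vectors = [[1.0, 1.0], [0.5, 2.0], [9.0, 1.0], [11.0, 0.5], [1.0, 9.0], [9.5, 9.5], [4.0, 1.0]]
cells = build_ivf(vectors, centroids)

query = [5.5, 0.5]
approx = ivf_search(vectors, centroids, cells, query, k=2, nprobe=1)
wider = ivf_search(vectors, centroids, cells, query, k=2, nprobe=2)
exact = ivf_search(vectors, centroids, cells, query, k=2, nprobe=len(centroids))
```

The query sits near a cell boundary, so `nprobe=1` misses the true nearest vector (index 6, stored in the neighboring cell), while `nprobe=2` recovers the exact answer. That is the recall-versus-work trade-off nprobe controls.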

FAISS, Meta's open-source library, is the canonical IVF implementation. Qdrant and Weaviate use HNSW. Pinecone uses a proprietary variant. pgvector offers both IVFFlat and HNSW.

Scalar Quantization and Product Quantization

Scalar quantization (SQ8) converts each float32 dimension (4 bytes) to an int8 (1 byte), reducing vector size by 4×. Distance computation on int8 values uses integer arithmetic — faster on most CPUs. Recall loss is typically under 1% for well-tuned SQ8.
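One simple way to realize scalar quantization is a per-dimension min/max mapping onto 256 levels. The sketch below assumes that scheme and stores unsigned bytes (0–255) rather than signed int8; production libraries may calibrate ranges differently, and the function names are illustrative.

```python
def sq8_train(vectors):
    """Per-dimension minimum and maximum, used to map floats onto 0..255."""
    dims = len(vectors[0])
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]
    return lows, highs


def sq8_encode(vector, lows, highs):
    """Quantize each float to one byte."""
    codes = []
    for x, lo, hi in zip(vector, lows, highs):
        scale = (hi - lo) or 1.0  # avoid division by zero on constant dimensions
        level = round((x - lo) / scale * 255)
        codes.append(max(0, min(255, level)))  # clamp values outside the trained range
    return bytes(codes)


def sq8_decode(codes, lows, highs):
    """Map bytes back to approximate float values."""
    return [lo + (c / 255) * ((hi - lo) or 1.0) for c, lo, hi in zip(codes, lows, highs)]


training = [[0.0, -1.0, 5.0], [1.0, 1.0, 5.0], [0.25, 0.3, 5.0]]
lows, highs = sq8_train(training)
original = [0.25, 0.3, 5.0]
code = sq8_encode(original, lows, highs)
restored = sq8_decode(code, lows, highs)
max_error = max(abs(a - b) for a, b in zip(original, restored))
```

Each dimension now occupies 1 byte instead of 4, and the reconstruction error is bounded by half a quantization step per dimension.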

Product quantization (PQ) is more aggressive: it divides the vector into m sub-vectors and quantizes each sub-vector to one of 256 centroids (1 byte). A 1536-dim float32 vector (6144 bytes) becomes 96 bytes with PQ-96 — a 64× compression. Distance lookups use precomputed lookup tables rather than full arithmetic. Recall drops more substantially than SQ8 but enables billion-scale indexes on consumer hardware.
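The encoding step can be sketched as follows. The codebooks here are tiny and hand-written purely for illustration; in practice each codebook would hold up to 256 centroids learned by k-means on training data. Function names are hypothetical.

```python
def pq_encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest codebook centroid."""
    m = len(codebooks)
    sub_len = len(vector) // m
    codes = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_len:(i + 1) * sub_len]
        nearest = min(
            range(len(codebook)),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(sub, codebook[c])),
        )
        codes.append(nearest)
    return bytes(codes)  # one byte per sub-vector, so at most 256 centroids each


def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    out = []
    for code, codebook in zip(codes, codebooks):
        out.extend(codebook[code])
    return out


def compression_ratio(dims, m, bytes_per_float=4):
    """Raw float32 size divided by the PQ code size (m bytes)."""
    return (dims * bytes_per_float) / m


# Toy setup: a 4-dim vector split into m = 2 sub-vectors of length 2.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0], [5.0, 5.0]],
]
codes = pq_encode([0.9, 1.2, 4.0, 6.0], codebooks)
approximation = pq_decode(codes, codebooks)
ratio = compression_ratio(dims=1536, m=96)
```

The last line reproduces the 64× figure from the text: 6,144 bytes of float32 data become 96 one-byte codes.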

Index Type	Build Speed	Query Speed	RAM Usage	Recall@10	Best Use Case
Flat (exact)	Fast	Very slow at scale	Low (vectors only)	100%	<100K vectors, offline evals
HNSW	Slow (graph build)	Very fast	High (graph + vectors)	95–99%	Production, <100M, frequent updates
IVFFlat	Medium (k-means)	Fast	Medium	90–97%	Large static collections
IVFPQ	Slow (k-means + PQ)	Very fast	Very low	80–93%	Billion-scale, RAM-constrained
HNSW + SQ8	Slow	Very fast	Medium	94–98%	Production balance of speed and recall

Choosing Index Parameters in Practice

For most RAG applications, HNSW with default M=16, ef_construction=128, and ef_search=64 is a safe starting point. Measure recall against a ground-truth dataset of representative queries. If recall is below 95%, increase ef_search first (cheapest), then M at rebuild. If latency exceeds your SLA at your target ef_search, consider SQ8 quantization to reduce the per-vector comparison cost.

Never tune index parameters without a recall benchmark. Weaviate, Qdrant, and Pinecone all provide recall evaluation utilities. The ANN-benchmarks project (ann-benchmarks.com) maintains standardized comparisons across datasets including SIFT-1M, GIST-960, and text-based datasets derived from MS-MARCO.
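A recall benchmark needs little more than the two result lists per query. The following sketch assumes you already have ANN results and exact (brute-force) results as lists of IDs; the helper names are illustrative and not taken from any vendor's tooling.

```python
def recall_at_k(ann_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the ANN index returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth.intersection(ann_ids[:k])
    return len(found) / len(truth)


def mean_recall(ann_results, exact_results, k):
    """Average recall@k over a set of benchmark queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(ann_results, exact_results)]
    return sum(scores) / len(scores)


# One query where the index missed one of four true neighbors.
single = recall_at_k(ann_ids=[1, 2, 3, 9], exact_ids=[1, 2, 3, 4], k=4)

# Two queries: the first is perfect, the second recovers half.
overall = mean_recall(
    ann_results=[[1, 2], [5, 6]],
    exact_results=[[1, 2], [6, 7]],
    k=2,
)
```

Re-running this after each change to ef_search or M shows whether the change actually bought recall.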

Real Benchmark — ann-benchmarks.com

As of 2024, on the SIFT-1M dataset (1 million 128-dimensional float vectors), HNSW implementations (hnswlib, Qdrant) achieve 99% recall@10 at under 1ms query time using a single CPU core. IVF-PQ achieves 85% recall at 0.3ms — a recall/speed trade-off that matters when cost is the constraint.

HNSW: Hierarchical Navigable Small World, a multi-layer graph index offering high recall and fast query time; standard in most production vector databases.

IVF: Inverted File Index, which partitions vectors into Voronoi cells and searches only the nprobe nearest cells at query time.

ef_search: HNSW query-time parameter controlling candidate list size. Higher values increase recall and latency; tunable without rebuilding the index.

Product Quantization: Lossy vector compression that encodes sub-vectors as centroid IDs; enables 64× memory reduction at the cost of ~7–15% recall loss.

Recall@k: Fraction of true nearest neighbors recovered in the top-k ANN results. The primary metric for index quality.

Lesson 2 Quiz

Index Structures

1. What is the primary trade-off that ANN index structures make compared to exact nearest neighbor search?

Correct. ANN indexes accept a ~1–5% miss rate on true nearest neighbors to achieve 100–1000× speed improvements over exhaustive search.

The trade-off is recall versus speed. ANN indexes may miss a small fraction of the true nearest neighbors but return results in milliseconds instead of seconds.

2. In HNSW, what does increasing ef_search accomplish — and what does it cost?

Correct. ef_search is a query-time parameter — no rebuild needed. A larger candidate list explores more of the graph, recovering more true nearest neighbors at the cost of slower queries.

ef_search controls the dynamic candidate list size during graph traversal at query time. Raising it improves recall but increases per-query latency. No rebuild is required.

3. Which compression technique divides a vector into sub-vectors and encodes each as a centroid ID, achieving up to 64× compression?

Correct. PQ splits each vector into m sub-vectors, trains a codebook of 256 centroids per sub-vector, and stores only the centroid index. A 6144-byte float32 vector becomes 96 bytes with PQ-96.

Product Quantization (PQ) is the technique. SQ8 only reduces float32 to int8 (4× compression). PQ's aggressive sub-vector encoding achieves 64× or more.

4. According to the FAISS team's 2019 research, what recall did their IVF index achieve searching 1 billion vectors in under 5ms on a single GPU?

Correct. Meta AI's FAISS benchmarks showed 90% recall at under 5ms on 1 billion vectors — a demonstration that practical billion-scale ANN search was achievable.

The FAISS 2019 paper reported 90% recall at sub-5ms latency on 1 billion vectors. This result validated IVF+quantization as the foundation for production-scale vector search.

Lab 2 — Tuning Index Parameters

Practice diagnosing and fixing recall and latency problems in vector indexes

Your Task

Your RAG system is exhibiting two symptoms: poor retrieval recall (the right documents are not appearing in results) and high query latency. The lab assistant plays the role of your on-call infrastructure engineer who has run diagnostics. Work through the index configuration together — identify which parameters are misconfigured and propose remediation steps.

Have at least 3 exchanges. Dig into specific parameter values, expected recall ranges, and the rebuild vs. query-time trade-offs.

Starter: "What diagnostics have you run and what did they show?"


Module 4 · Lesson 3

Metadata Filtering and Hybrid Search

Combining structured constraints with semantic similarity — the retrieval pattern that powers enterprise RAG

How do you retrieve documents that are both semantically relevant and from the correct department, date range, or access tier?

Elastic's 2023 engineering post on their internal legal document system described a failure mode: pure vector search returned highly relevant precedents from the wrong jurisdiction. A query about California employment law surfaced federal cases because the embedding model weighted legal language more heavily than geographic tags. The fix was pre-filtering by a jurisdiction metadata field before computing vector similarity — reducing the candidate set from 2 million documents to 40,000 before ANN search ran.

Why Metadata Matters in Production RAG

In real deployments, retrieval correctness is not just about semantic similarity. A RAG system for a financial institution must not surface documents from the wrong regulatory regime. A healthcare RAG must not retrieve records from the wrong patient or time period. A multi-tenant SaaS RAG must not leak one customer's documents to another. These constraints are hard requirements, not preferences — they cannot be solved by embedding alone.

Metadata filtering solves this by attaching structured fields to each vector at ingestion time: tenant_id, created_at, document_type, jurisdiction, clearance_level, language. At query time, a filter expression runs before or during ANN search, ensuring that only compliant documents are considered.

Pre-filter vs. Post-filter vs. In-filter

Post-filtering runs ANN search first, retrieves k results, then discards those that fail the metadata condition. Simple to implement but dangerous: if the filter removes many results, you may return fewer than k documents, or return nothing. This approach fails when the filtered subset is a small fraction of the collection.

Pre-filtering (also called filtered search) applies the metadata condition before ANN search. Only vectors satisfying the filter are candidates. Most production vector databases (Qdrant, Weaviate, Pinecone) implement this. The challenge: if the filtered subset is very small, HNSW's graph structure is sparse in that region and recall degrades. Qdrant's documentation recommends falling back to brute-force over the filtered subset when it contains fewer than ~10,000 vectors.

In-filter (payload-aware indexing) is Qdrant's approach: it maintains separate HNSW sub-graphs for common filter values, enabling efficient filtered search even on small subsets. This is the state of the art for mixed workloads with diverse metadata filters.
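The difference between post-filtering and pre-filtering is easy to see on toy data. In this sketch an exact nearest-neighbor function stands in for the ANN index, and the tuple layout and field names are assumptions made for illustration.

```python
def nearest(points, query, k):
    """Exact k-NN over (id, vector, payload) tuples, standing in for an ANN index."""
    return sorted(points, key=lambda p: sum((x - y) ** 2 for x, y in zip(p[1], query)))[:k]


def post_filter_search(points, query, k, predicate):
    """Search first, then drop hits whose payload fails the filter."""
    hits = nearest(points, query, k)
    return [p for p in hits if predicate(p[2])]


def pre_filter_search(points, query, k, predicate):
    """Restrict the candidate set first, then search within it."""
    allowed = [p for p in points if predicate(p[2])]
    return nearest(allowed, query, k)


points = [
    ("a", [0.0], {"jurisdiction": "federal"}),
    ("b", [0.1], {"jurisdiction": "federal"}),
    ("c", [0.2], {"jurisdiction": "federal"}),
    ("d", [0.9], {"jurisdiction": "california"}),
    ("e", [1.5], {"jurisdiction": "california"}),
]


def is_california(payload):
    return payload["jurisdiction"] == "california"


post_hits = post_filter_search(points, [0.0], k=2, predicate=is_california)
pre_hits = pre_filter_search(points, [0.0], k=2, predicate=is_california)
```

Post-filtering returns nothing here because both of the top-2 neighbors are federal, while pre-filtering returns the two California documents.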

Common Mistake — Over-Filtering

A frequent production bug: applying a date filter that is too narrow (e.g., last 7 days) when the relevant document is slightly older. Always design your filter logic with domain experts. Consider a tiered approach: strict filter first, then relaxed filter if results are insufficient.

Hybrid Search: BM25 + Vector

Hybrid search combines keyword (BM25) scores with vector similarity scores to produce a unified ranked list. It is particularly effective in domains where exact terminology matters: medical diagnosis codes, legal citations, product SKUs, regulatory references. A query for "ICD-10 code E11.9" benefits from keyword precision; a query for "type 2 diabetes management" benefits from semantic recall.

Reciprocal Rank Fusion (RRF) is the dominant fusion algorithm in production systems. It needs almost no tuning (it uses only result ranks, not raw scores, plus a single constant k conventionally set to 60), is robust to score scale differences between BM25 and cosine similarity, and performs comparably to or better than learned weighting on most benchmarks. The formula: for each document, sum 1/(k + rank) across all retrieval lists, with rank starting at 1, then re-rank by descending score.
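RRF is short enough to write out in full. The sketch below assumes each retriever returns an ordered list of document IDs with ranks starting at 1; the document IDs are made up for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents that appear in both lists (d1 and d3) rise to the top, and d1 wins because its two ranks (1 and 2) beat d3's (3 and 1).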

Weaviate's BM25+vector hybrid was benchmarked in their 2023 BEIR evaluation across 18 retrieval datasets. Hybrid search improved mean NDCG@10 by 6.2 percentage points over pure vector search and by 11.4 points over pure BM25 — particularly on datasets with technical terminology (TREC-COVID, FiQA, SciFact).

RRF Constant

k = 60

Standard constant in the RRF formula. Prevents top-ranked documents from dominating the fused score. Empirically robust across datasets.

BEIR Improvement (Weaviate)

+6.2 NDCG pts

Hybrid over pure vector, averaged across 18 BEIR benchmark datasets (2023).

Brute-force Threshold (Qdrant)

~10,000 vectors

Below this filtered subset size, HNSW degrades; Qdrant falls back to exact search automatically.

Common Filter Fields

6–12 fields

Typical production schemas: tenant_id, doc_type, created_at, language, clearance_level, region.

Designing Your Payload Schema

Good payload design at ingestion time prevents painful schema migrations later. At minimum, index these fields on every vector collection: source_id (document identifier), chunk_index (position within document), created_at (Unix timestamp for range filters), doc_type (enum for document class), and tenant_id if multi-tenant. Use low-cardinality fields for indexed payloads — high-cardinality string fields (like free-text author names) are better stored but not indexed as filter targets.
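One way to keep that minimum schema consistent across ingestion jobs is to define it once in code. The dataclass below is a sketch: the field names follow the list above, while the `DocType` values and the `text` field are hypothetical placeholders.

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class DocType(str, Enum):
    """Low-cardinality document class (example values only)."""
    CONTRACT = "contract"
    MEMO = "memo"
    FILING = "filing"


@dataclass(frozen=True)
class ChunkPayload:
    source_id: str                    # document identifier
    chunk_index: int                  # position within the document
    created_at: int                   # Unix timestamp in seconds, for range filters
    doc_type: DocType                 # enum keeps cardinality low
    tenant_id: Optional[str] = None   # set only in multi-tenant deployments
    text: str = ""                    # stored for retrieval, not meant as a filter target

    def to_dict(self) -> dict:
        """Plain dict suitable for sending as a payload."""
        data = asdict(self)
        data["doc_type"] = self.doc_type.value
        return data


payload = ChunkPayload(
    source_id="doc-42",
    chunk_index=0,
    created_at=1_700_000_000,
    doc_type=DocType.MEMO,
    tenant_id="firm-a",
    text="carbon transition risk in our equity portfolio",
).to_dict()
```

Validating payloads through one type at ingestion time makes later filter expressions predictable, since every chunk carries the same keys and value types.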

In Qdrant, payload indexes are created explicitly with create_payload_index calls specifying the field name and type (keyword, integer, float, datetime, bool, geo). Unindexed payload fields can still be retrieved but cannot be used in filtered searches efficiently.

Real Deployment — Elastic Legal RAG 2023

By pre-filtering on the jurisdiction field before ANN search, Elastic's legal team reduced their false-jurisdiction retrieval rate from 23% to under 2% of returned results. Query latency actually decreased slightly because the candidate pool shrank — the HNSW search over 40,000 jurisdiction-matched vectors ran faster than over 2 million.

Pre-filtering: Applying metadata conditions before ANN graph traversal, so only compliant vectors are considered as candidates.

Post-filtering: Running ANN first, then discarding results that fail metadata conditions. Risky when the filter removes many results.

Reciprocal Rank Fusion (RRF): A rank-based fusion algorithm that combines multiple retrieval lists without requiring score normalization. Standard formula uses constant k=60.

Payload index: A structured index on a metadata field that enables efficient filtered vector search. Must be created explicitly in most vector databases.

BEIR: Benchmarking IR, a benchmark suite of 18 heterogeneous retrieval datasets. Standard evaluation framework for RAG retrieval components.

Lesson 3 Quiz

Metadata Filtering and Hybrid Search

1. Why did Elastic's legal document system switch from pure vector search to pre-filtered vector search?

Correct. The embedding model weighted legal terminology more heavily than geographic metadata, so California law queries returned federal cases. Pre-filtering on the jurisdiction field fixed the compliance failure.

The problem was a compliance failure, not performance. Semantically similar documents from the wrong jurisdiction were returned because embedding alone cannot enforce hard categorical constraints.

2. What is the key risk of post-filtering compared to pre-filtering?

Correct. If the filter condition is highly selective, most ANN results may be discarded, leaving you with fewer than the requested k documents — or none at all.

The core risk is result sparsity: when many ANN results fail the filter, the final result set shrinks below k. This is a silent failure — the system returns results but far fewer than expected.

3. What constant appears in the standard Reciprocal Rank Fusion formula, and what is its purpose?

Correct. The RRF formula is 1/(rank + 60). The constant 60 softens the scoring so that rank-1 documents do not receive a disproportionately large boost over rank-2 and rank-3 documents.

The constant in RRF is 60. It appears in the denominator: score = 1/(rank + 60). This dampens the advantage of very top ranks, making fusion more robust to noise in either retrieval list.

4. According to Weaviate's 2023 BEIR evaluation, by how many NDCG@10 points did hybrid search outperform pure vector search on average?

Correct. Weaviate's evaluation across 18 BEIR datasets showed hybrid BM25+vector search improving mean NDCG@10 by 6.2 percentage points over pure vector search.

Weaviate's 2023 BEIR evaluation found a 6.2 NDCG@10 improvement for hybrid over pure vector search. The benefit was especially large on technical terminology datasets like TREC-COVID and SciFact.

Lab 3 — Designing Payload Schemas and Hybrid Search

Practice building the metadata structure and search logic for a multi-tenant RAG system

Your Task

You are architecting the vector database layer for a legal technology company building a multi-tenant RAG product. Their users are law firms across three jurisdictions (federal, California, New York). Each firm must only see their own documents. Users also want to filter by case type, date range, and document class. Design the payload schema, filtering strategy, and hybrid search configuration.

Work through at least 3 exchanges. Justify your schema choices, handle edge cases (very small filtered subsets, tenant isolation failures), and propose a fusion strategy.

Starter: "What are the non-negotiable constraints for this system before we touch any schema?"


Module 4 · Lesson 4

Production Operations: Ingestion, Updates, and Scale

What nobody tells you about running a vector database at production scale

What breaks first when your vector collection grows from 100,000 to 10 million documents — and how do you catch it before users do?

Qdrant's engineering blog described a customer incident where a batch ingestion job inserted 500,000 vectors in a single transaction without pausing for index optimization. The HNSW graph became fragmented: new vectors were connected only to recently added neighbors rather than being globally integrated into the graph. Recall dropped from 97% to 71% without any error messages. The fix required a manual index optimization call and a policy change to run optimization after every 50,000 inserts.

The Ingestion Pipeline

A production ingestion pipeline for a vector database has five stages: chunking (splitting source documents into appropriately sized segments), embedding (calling the embedding API or local model to generate vectors), batching (grouping vectors for efficient upsert), upsert (inserting or updating vectors in the database), and index optimization (triggering or waiting for the index to absorb new vectors into its graph structure).

Most vector databases accept batch upserts of 100–1000 vectors per API call. Smaller batches increase API overhead; larger batches can cause memory spikes and, in the case of HNSW, graph fragmentation as the Qdrant incident illustrates. The sweet spot for most collections is 256–512 vectors per batch with an optimization trigger every 10,000–50,000 inserts.
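The batching and optimization cadence can be sketched independently of any particular client library. In the example below, `upsert` and `optimize` are hypothetical callables that you would wire to your database client; the defaults mirror the ranges suggested above.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest(points, upsert, optimize, batch_size=256, optimize_every=10_000):
    """Upsert in fixed-size batches and trigger optimization at a regular cadence."""
    inserted = 0
    since_optimize = 0
    for batch in batched(points, batch_size):
        upsert(batch)
        inserted += len(batch)
        since_optimize += len(batch)
        if since_optimize >= optimize_every:
            optimize()
            since_optimize = 0
    if since_optimize:
        optimize()  # absorb the final partial window
    return inserted


# Record calls instead of talking to a real database.
calls = []
total = ingest(
    range(1000),
    upsert=lambda batch: calls.append(("upsert", len(batch))),
    optimize=lambda: calls.append(("optimize", 0)),
    batch_size=256,
    optimize_every=512,
)
```

With 1,000 points, this produces four upserts (256, 256, 256, 232) and two optimization triggers: one after the second batch and one at the end.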

Handling Updates and Deletions

Vector databases handle updates differently from relational databases. Most use an upsert model: provide the same ID as an existing vector, and the old record is replaced. The old vector's position in the HNSW graph is marked as deleted (a "tombstone") and the new vector is re-inserted. Heavy update workloads accumulate tombstones that degrade recall until the index is compacted.

Deletions are similarly lazy in most implementations. Pinecone, Qdrant, and Weaviate all use tombstone-based deletion. Qdrant exposes a vectors_count vs. indexed_vectors_count metric in its collection info — a large gap indicates pending optimization work. Always monitor this ratio in production.

For RAG systems that track versioned documents (source updates, regulatory amendments), a best practice is to include a version or updated_at field in the payload and delete old version chunks explicitly during ingestion rather than relying on upsert alone. Upsert only replaces chunks whose IDs are reused, so stale chunks linger when a new version produces fewer chunks. Explicit deletes still leave tombstones, so compaction remains necessary.

Tombstone Accumulation Warning

If your RAG system ingests updated documents frequently (daily news, changing knowledge bases), schedule index optimization jobs at off-peak hours. Qdrant's optimizer runs automatically but can be triggered manually. Weaviate uses a similar background compaction process. Unchecked tombstone buildup degrades recall silently — the system returns results but misses recently updated content.

Sharding and Replication

When a single-node vector database runs out of RAM or its query latency exceeds SLA at peak load, you need to distribute the collection across multiple nodes. Two approaches: sharding (distributing different vectors across nodes) and replication (copying the same vectors to multiple nodes).

Sharding increases total capacity: a collection sharded across 4 nodes can hold 4× as many vectors. But each query must fan out to all shards and aggregate results — increasing latency and adding network overhead. Qdrant uses a consistent hashing scheme for shard assignment; Pinecone abstracts sharding entirely behind their managed service. Weaviate supports both explicit shard configuration and auto-sharding based on collection size.

Replication improves query throughput and availability. With 3 replicas, read queries can be load-balanced across all three, tripling read capacity. Qdrant's write_consistency_factor setting defines how many replicas must acknowledge a write before it returns success.

Monitoring a Production Vector Database

The four metrics every production vector database deployment should track: Recall@k (run periodic ground-truth evaluation with held-out query sets), p99 query latency (set alerts at 2× your baseline), indexed vs. total vector count (monitor optimization lag), and collection size vs. RAM headroom (HNSW needs all vectors in RAM; alert at 80% usage).

For recall monitoring specifically, maintain a golden dataset: 200–500 representative queries with human-labeled relevant documents. Run this evaluation daily or after each major ingestion batch. A recall drop of more than 3 percentage points from baseline warrants investigation — it may indicate tombstone buildup, an index fragmentation issue, or an embedding model version mismatch.
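These thresholds translate directly into an automated check. The sketch below assumes metrics are collected elsewhere and passed in as snapshots; the `Snapshot` fields and the 5% optimization-lag threshold are illustrative assumptions, while the other thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    recall_at_k: float        # 0.0-1.0, measured on the golden dataset
    p99_latency_ms: float
    indexed_vectors: int
    total_vectors: int
    ram_used_fraction: float  # 0.0-1.0


def check_health(current, baseline, max_index_lag=0.05):
    """Return a list of alert messages; an empty list means healthy."""
    alerts = []
    if baseline.recall_at_k - current.recall_at_k > 0.03:
        alerts.append("recall dropped more than 3 points")
    if current.p99_latency_ms > 2 * baseline.p99_latency_ms:
        alerts.append("p99 latency above 2x baseline")
    if current.total_vectors:
        lag = 1 - current.indexed_vectors / current.total_vectors
        if lag > max_index_lag:  # assumed threshold, tune for your workload
            alerts.append("index optimization lagging")
    if current.ram_used_fraction >= 0.80:
        alerts.append("RAM usage at or above 80%")
    return alerts


baseline = Snapshot(0.97, 8.0, 1_000_000, 1_000_000, 0.60)
degraded = Snapshot(0.71, 9.0, 500_000, 1_000_000, 0.85)
alerts = check_health(degraded, baseline)
```

The degraded snapshot mirrors the kind of silent failure described in this lesson: latency looks normal, but recall, index lag, and memory all trip alerts.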

Optimal Batch Size

256–512 vectors

Per upsert call. Smaller = API overhead. Larger = memory spikes and graph fragmentation.

Optimization Trigger

Every 10K–50K inserts

Prevents HNSW graph fragmentation and tombstone accumulation from degrading recall.

RAM Alert Threshold

80% usage

HNSW indexes must fit in RAM. Alert before the node OOMs and the service crashes.

Golden Dataset Size

200–500 queries

Minimum for statistically meaningful recall monitoring. Run daily or post-ingestion.

Embedding Model Versioning

One of the most disruptive production events in a vector database deployment is an embedding model upgrade. If you upgrade from OpenAI's text-embedding-ada-002 to text-embedding-3-large mid-deployment, the new query vectors are in a different vector space than the stored document vectors. Results become garbage — high cosine similarity scores for semantically unrelated content.

The correct procedure: provision a new collection, re-embed all documents with the new model, backfill metadata, validate recall on your golden dataset, then cut over query traffic to the new collection. Do not mix embedding model versions within a single collection. Always store the embedding model name and version as a collection-level metadata field so future engineers know exactly how vectors were generated.
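One way to enforce the "no mixed versions" rule is a guard that runs before every query. This is a minimal sketch: CollectionMeta, ModelMismatchError, and assert_compatible are hypothetical names, and how the metadata is stored and loaded depends on your database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionMeta:
    name: str
    embedding_model: str   # model name and version recorded at ingestion time
    dimensions: int


class ModelMismatchError(RuntimeError):
    """Raised when a query vector was not produced by the collection's model."""


def assert_compatible(meta, query_model, query_vector):
    # Compare the model name first: equal dimensions do not imply a shared
    # vector space (ada-002 and text-embedding-3-small are both 1536-d).
    if query_model != meta.embedding_model:
        raise ModelMismatchError(
            f"collection {meta.name!r} was built with {meta.embedding_model!r}, "
            f"but the query was embedded with {query_model!r}"
        )
    if len(query_vector) != meta.dimensions:
        raise ModelMismatchError(
            f"expected {meta.dimensions} dimensions, got {len(query_vector)}"
        )


meta = CollectionMeta("docs_v1", "text-embedding-ada-002", 1536)
assert_compatible(meta, "text-embedding-ada-002", [0.0] * 1536)  # passes silently
```

Failing loudly here turns a silent recall collapse into an exception that shows up in logs and alerts.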

Real Pattern — Pinecone Namespaces for Blue/Green

Pinecone's documentation and several engineering blog posts (including from Cohere's integration team, 2023) describe using Pinecone namespaces to implement blue/green embedding model upgrades: the "green" namespace contains re-embedded documents with the new model, query traffic is gradually shifted using a feature flag, and the "blue" namespace is deleted once confidence is high. This pattern avoids any service interruption during model migration.
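The gradual traffic shift can be driven by a deterministic percentage rollout. The sketch below only chooses a namespace name; it makes no Pinecone calls, and pick_namespace, the "blue" and "green" labels, and the user-ID bucketing scheme are all assumptions for illustration.

```python
import hashlib


def pick_namespace(user_id, green_percent, blue="blue", green="green"):
    """Route a stable share of users to the green (new-model) namespace."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return green if bucket < green_percent else blue


# The same user always lands in the same bucket, so raising green_percent
# only ever moves users from blue to green, never back.
for percent in (0, 10, 50, 100):
    print(percent, pick_namespace("user-42", percent))
```

The returned name would be passed as the namespace argument of the query, and the query itself must be embedded with the model that matches the chosen namespace.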

Tombstone: A soft-delete marker on a vector record. Tombstoned vectors are excluded from results but still occupy graph edges until compaction runs.

Index optimization: A background (or manually triggered) process that integrates new/updated vectors into the HNSW graph and removes tombstones.

Sharding: Distributing different vectors across multiple nodes to increase total collection capacity beyond a single machine's RAM.

Golden dataset: A curated set of representative queries with human-labeled relevant documents; used for ongoing recall monitoring.

Blue/green deployment: Running two parallel collection versions (old and new embedding model) and cutting over traffic after validating the new version.

Lesson 4 Quiz

Production Operations

1. In the incident described on Qdrant's engineering blog, what caused recall to drop from 97% to 71% with no error messages?

Correct. Mass insertion without optimization left new vectors poorly integrated into the HNSW graph, so they were rarely reached during traversal — a silent recall collapse.

The incident was caused by HNSW graph fragmentation from inserting 500,000 vectors without triggering index optimization. New nodes were only locally connected, so graph traversal rarely reached them.

2. Why should you never mix two different embedding model versions within a single vector collection?

Correct. Each embedding model learns its own vector space. A query embedded with model B can score high cosine similarity against unrelated documents embedded with model A, because the geometric relationship is meaningless across models.

The problem is the vector space itself. Two different embedding models learn two different spaces. A query vector from model B bears no meaningful geometric relationship to document vectors from model A.

3. What Qdrant collection metric indicates that optimization work is pending (tombstone cleanup and graph integration)?

Correct. Qdrant's collection info exposes both counts. When indexed_vectors_count lags vectors_count, the optimizer has not yet integrated all new vectors into the HNSW graph.

Qdrant exposes vectors_count (total upserted) and indexed_vectors_count (fully integrated into HNSW). A growing gap between these means the optimizer is behind — recall may be degraded for the unindexed fraction.

4. What is the recommended size for a "golden dataset" used for ongoing recall monitoring, and how often should it be evaluated?

Correct. 200–500 representative queries with ground-truth labels provides statistically meaningful recall estimates without being prohibitively expensive to maintain.

The recommended golden dataset is 200–500 queries with human-labeled relevant documents. Running it daily or post-ingestion ensures you catch recall degradation before users notice it.

Lab 4 — Production Incident: Silent Recall Collapse

Diagnose and remediate a real-world vector database production failure

Your Task

You are the on-call engineer. Your RAG system's recall metric has dropped from 94% to 67% over the past 48 hours. No alerts fired. No exceptions in the logs. Users are complaining that the assistant "doesn't know things it used to know." You have access to Qdrant's collection info API and your application logs. Work with the lab assistant (playing your SRE partner) to diagnose root cause and write a remediation runbook.

Complete at least 3 exchanges. Check every possible cause systematically — tombstones, model version mismatch, shard rebalancing, optimizer lag. Propose a fix and a monitoring improvement to prevent recurrence.

Starter: "Pull the collection info and tell me exactly what the metrics show."

Production Incident Lab

I've just pulled the collection info. Here are the raw numbers: vectors_count: 4,182,000 — indexed_vectors_count: 2,891,000 — segments: 47 — optimizer_status: indexing — last_optimization_completed: 61 hours ago. The embedding model field in our config reads "text-embedding-ada-002" but our ingestion logs from 52 hours ago show a switch to "text-embedding-3-small" with no collection migration. Where do you want to start?

Module 4 Test — Vector Databases

15 questions · Score 80% or above to pass

1. A RAG system fails to retrieve documents about "carbon transition risk" when the user asks about "climate financial exposure." What is the root cause?

Correct. This is the classic vocabulary mismatch problem that vector search solves — semantic equivalence with zero token overlap.

The problem is vocabulary mismatch. A keyword index cannot connect "carbon transition risk" and "climate financial exposure" because they share no tokens. Vector search encodes meaning, not characters.

2. What three components does a vector database store per document chunk?

Correct. Every record has: a unique ID, the float vector, and a payload with source text and structured metadata for filtering.

Each record stores: a unique ID, the vector (float array), and a payload containing the original text plus structured metadata fields.

3. Notion's 2023 engineering blog reported a 2.4× improvement in what metric after adding Pinecone vector search?

Correct. Notion benchmarked "relevant block in top-3" on internal query sets and saw a 2.4× improvement over their prior keyword-only backend.

Notion measured retrieval quality. Their "relevant block retrieved in top-3" rate improved 2.4× after adding vector search via Pinecone.

4. HNSW builds a multi-layer graph. What do higher layers contain compared to lower layers?

Correct. HNSW's hierarchy: top layers have few nodes with long-range connections for coarse navigation; bottom layer has all nodes with short-range connections for fine search.

Higher HNSW layers have fewer nodes connected by long-range edges — they enable fast, coarse navigation. The bottom layer contains all nodes with dense local connections for precise search.

5. What are the two HNSW parameters set at build time that most affect recall, and how do they differ from ef_search?

Correct. M and ef_construction are fixed at build time; changing them requires rebuilding the index. ef_search is a query-time parameter that can be tuned without a rebuild.

HNSW's build-time parameters are M (max edges per node) and ef_construction (candidate list during build). Unlike ef_search (query-time, no rebuild needed), these require a full index rebuild to change.

6. Meta's FAISS team published 2019 benchmarks showing 90% recall searching 1 billion vectors in under 5ms. Which index type achieved this?

Correct. FAISS combined IVF (Voronoi cell partitioning) with product quantization (compressed sub-vectors) on GPU to achieve billion-scale search.

The FAISS result used IVF+PQ (Inverted File with Product Quantization) on a GPU. This combination narrows the search to a small fraction of cells and uses compressed vectors for fast distance computation.

7. Scalar Quantization (SQ8) converts each float32 dimension to int8. What is the compression ratio and typical recall impact?

Correct. float32 (4 bytes) → int8 (1 byte) = 4× compression. Distance arithmetic on int8 is faster, and recall loss is typically under 1% for well-tuned SQ8.

SQ8 compresses 4× (float32 to int8). Recall loss is typically under 1%. Product Quantization achieves 64× compression but with larger (~7–15%) recall loss.

8. Elastic's legal RAG system pre-filtered by jurisdiction field before ANN search. What was the false-jurisdiction retrieval rate before and after?

Correct. The false-jurisdiction rate dropped from 23% to under 2% after pre-filtering on the jurisdiction field was implemented.

Elastic's blog reported the false-jurisdiction rate dropped from 23% to under 2% after implementing pre-filtering — and query latency actually improved because the candidate pool shrank.

9. What is the standard constant k in the Reciprocal Rank Fusion formula, and what is its function?

Correct. RRF formula: 1/(rank + 60). The 60 softens rank-1 dominance and makes fusion robust to noise in either the BM25 or vector ranked list.

In RRF, the score formula is 1/(rank + k) where k=60. The constant prevents items at rank 1 from having a disproportionately large score advantage over rank 2 or 3.

10. When is post-filtering dangerous in a vector database query?

Correct. Post-filtering discards ANN results that fail the condition. If the filter is very selective, the post-filter step may eliminate almost all ANN results, returning nearly nothing.

Post-filtering is dangerous when the filter is selective — it removes many of the ANN results, leaving fewer than k valid documents. This is a silent failure: no error, just sparse results.

11. Weaviate's 2023 BEIR evaluation found hybrid BM25+vector search improved NDCG@10 by 6.2 points over pure vector search. On what types of datasets was the improvement largest?

Correct. Technical domains where exact terminology matters (medical, financial, scientific) benefit most from the BM25 component in hybrid search.

The hybrid advantage was largest on technical terminology datasets (TREC-COVID, FiQA, SciFact) where exact terms like disease names or financial product codes matter alongside semantic meaning.

12. What is the recommended batch size per upsert call in a production vector database ingestion pipeline?

Correct. 256–512 per batch avoids excessive API call overhead (too small) and memory spikes or HNSW graph fragmentation (too large).

The practical sweet spot is 256–512 vectors per batch. Too small wastes API calls; too large causes memory spikes and HNSW graph fragmentation, as the Qdrant incident demonstrated.

13. What does the gap between vectors_count and indexed_vectors_count in Qdrant's collection info indicate?

Correct. This gap represents optimizer lag — vectors exist in the collection but are not yet reachable through normal HNSW traversal, degrading recall for recently ingested content.

The gap = optimizer lag. Vectors have been upserted but the background optimizer hasn't integrated them into the HNSW graph yet. These vectors may be missed during queries, silently degrading recall.

14. What is the correct procedure when upgrading from one embedding model to another (e.g., ada-002 to text-embedding-3-large)?

Correct. Mixed embedding model versions produce meaningless cosine similarities. Always fully re-embed into a new collection and validate before cutting over traffic.

Never mix embedding model versions in one collection. Provision a new collection, re-embed everything, validate recall on your golden dataset, then perform a traffic cutover (in Pinecone, namespaces let you run this as a blue/green deployment).

15. Qdrant's payload-aware indexing maintains separate HNSW sub-graphs for common filter values. What problem does this solve?

Correct. Standard HNSW has poor recall on very small filtered subsets because the graph is sparse in those regions. Payload-aware sub-graphs maintain dense local connectivity within each filter partition.

Payload-aware indexing solves the small-subset problem. Standard pre-filtering on a very small subset (e.g., a single tenant with few documents) yields poor HNSW recall because the graph edges don't densely cover that region. Separate sub-graphs fix this.