In September 2023, Databricks published benchmarks showing their RAG-enabled DBRX model answering questions about proprietary internal documentation with 78% accuracy — compared to 31% accuracy from the same base model without retrieval. The difference wasn't a smarter model. It was the pipeline feeding the model the right chunks of text at the right moment.
That same year, Klarna deployed a RAG system over 200+ product policy documents. Customer service resolution time dropped from 11 minutes to under 2 minutes. The AI wasn't hallucinating policies — it was reading them.
Large language models are trained on a snapshot of the world. That snapshot has a cutoff date. It contains no knowledge of your internal documents, your company's pricing, your users' history, or anything that happened after training ended. When you ask a model about these things without RAG, it either refuses or — far more dangerously — confabulates a plausible-sounding answer.
Retrieval-Augmented Generation (RAG) solves this by separating the knowledge store from the reasoning engine. At inference time, a retrieval system fetches relevant document chunks, injects them into the model's context, and the model reasons over real evidence rather than learned weights. The model becomes a reader and reasoner; your database becomes the source of truth.
Fine-tuning bakes knowledge into weights. Updating it requires retraining — expensive, slow, and error-prone. RAG updates instantly: change a document in your store, and the next query reflects it. Fine-tuning is for teaching style and capability; RAG is for teaching facts.
Every production RAG system, from LlamaIndex deployments to enterprise Vertex AI Search, shares the same fundamental pipeline in two phases: indexing time and query time.
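The two phases can be sketched in a few lines. This is a minimal illustration, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the paragraph-based chunking is an assumption, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: term counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Indexing time: chunk each document, embed each chunk, store vector + text.
def build_index(documents: list[str]) -> list[tuple[Counter, str]]:
    index = []
    for doc in documents:
        for chunk in doc.split("\n\n"):  # naive paragraph chunking (assumption)
            if chunk.strip():
                index.append((embed(chunk), chunk.strip()))
    return index

# Query time: embed the query, rank chunks by similarity, build the prompt.
def retrieve(index: list[tuple[Counter, str]], query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are issued within 14 days of purchase.\n\nShipping takes 3 to 5 business days.",
    "Enterprise plans include priority support.",
]
index = build_index(docs)
question = "How many days do refunds take?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```

The prompt would then be sent to whichever LLM you use; that call is omitted here because it depends on the provider.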
The simplest RAG implementation — embed everything, retrieve top-5, stuff into prompt — works for demos. Production systems that handle thousands of queries per day require more sophistication. Advanced RAG introduces query rewriting, hybrid search (dense + sparse/BM25), re-ranking models, and multi-step retrieval.
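Hybrid search needs some way to merge the dense and sparse result lists. The text does not say which fusion method to use; one common option is reciprocal rank fusion (RRF), sketched below. The constant `k = 60` is a conventional default, and the chunk IDs are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one fused ranking.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks that
    rank well in several lists rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

dense_hits = ["c3", "c1", "c7"]   # hypothetical output of a vector search
sparse_hits = ["c1", "c9", "c3"]  # hypothetical output of a BM25 search
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF uses only ranks, so it avoids having to normalize BM25 scores and cosine similarities onto a common scale.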
Cohere's 2024 RAG Evaluation Report found that adding a cross-encoder reranker on top of baseline retrieval improved answer accuracy by 19 percentage points on enterprise document QA benchmarks, with no change to the underlying LLM.
Morgan Stanley built a RAG system over 100,000+ research reports and internal documents using GPT-4 and custom embeddings. Advisors query the system for client-specific market intelligence. The bank publicly cited it as a key AI productivity initiative at their 2023 investor day. The system retrieves and cites source reports rather than generating financial claims from model weights — a critical compliance distinction.
Poor chunking is the most common reason RAG systems fail in production. If chunks are too large, retrieval precision drops and the context window fills with irrelevant content. If they are too small, retrieved chunks lack the surrounding context needed to answer the question.
Fixed-size chunking (simple, fast) splits every N tokens regardless of semantic boundaries; it works for uniform documents. Sentence/paragraph chunking respects natural language boundaries. Semantic chunking (used by LlamaIndex's SemanticSplitterNodeParser) embeds sentences and splits when cosine similarity between adjacent sentences drops below a threshold. Hierarchical chunking stores both summary-level and detail-level chunks, enabling coarse-to-fine retrieval.
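Two of these strategies can be sketched briefly. The sketch below makes several assumptions: whitespace-separated words stand in for tokens, the overlap between fixed-size chunks is an addition not mentioned above, and a bag-of-words cosine replaces a real sentence-embedding model. It is not LlamaIndex's implementation.

```python
import math
import re
from collections import Counter

def fixed_size_chunks(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split every `size` words, repeating `overlap` words between neighbors."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last window already covers the tail
            break
    return chunks

def _bow(sentence: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if _cosine(_bow(prev), _bow(cur)) < threshold:
            chunks.append([cur])
        else:
            chunks[-1].append(cur)
    return chunks

windows = fixed_size_chunks("a b c d e f g h i j", size=4, overlap=1)
groups = semantic_chunks([
    "The contract term is two years.",
    "The contract term renews automatically.",
    "Invoices are due monthly.",
])
```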
The startup has 50,000 contract documents (NDAs, MSAs, SOWs) in PDF format. Users need to query these with natural language questions like "What are the termination clauses in our enterprise agreements?" Your AI lab partner will help you design the full pipeline — from chunking strategy to vector store choice to query architecture.
Work through the design decisions step by step. Ask about chunking, embedding choices, vector DB selection, hybrid search, and how you'd handle metadata filtering for date ranges or contract types.
In October 2023, the ANN-Benchmarks project published results showing Qdrant achieving 1.2 million queries per second on a 10M-vector SIFT dataset with 95% recall at 1ms p99 latency on a single machine. Pinecone's managed service processed over 1 billion vector queries per day by Q4 2023 across all customers. These are not edge cases — they define what production looks like.
By 2024, the vector database market had fragmented into four categories: dedicated vector stores, vector-capable traditional databases, cloud-managed services, and in-process libraries. Each has genuine trade-offs that matter for RAG deployments.
Exact nearest neighbor search across millions of vectors is prohibitively slow: it costs O(n·d) per query, where n is the number of stored vectors and d is the embedding dimensionality. All production vector databases use Approximate Nearest Neighbor (ANN) algorithms that trade a small accuracy loss for orders-of-magnitude speedup.
HNSW (Hierarchical Navigable Small World) is the dominant algorithm in 2024. It builds a multi-layer graph where upper layers provide coarse navigation and lower layers refine the search within local neighborhoods. Qdrant and Weaviate use HNSW by default, and pgvector supports it as an index type alongside IVFFlat. Key parameters: ef_construction (graph quality at index time, higher = slower build, better recall) and ef, often exposed as ef_search (search quality at query time, higher = slower query, better recall).
IVF-PQ (Inverted File with Product Quantization) is used by Faiss and Milvus for billion-scale indexes. It compresses vectors to reduce memory footprint, which is critical when 100M 1536-dim float32 vectors would require roughly 614GB of RAM uncompressed (100M × 1536 × 4 bytes).
The MTEB (Massive Text Embedding Benchmark) leaderboard, hosted on Hugging Face, is the authoritative source for embedding model comparison. As of early 2024, the top performers on the retrieval subtask included text-embedding-3-large (OpenAI), embed-english-v3.0 (Cohere), and open-source models like bge-large-en-v1.5 (BAAI) and e5-mistral-7b-instruct (Microsoft).
Critical: embedding models must match at indexing and query time. You cannot index with OpenAI embeddings and query with Cohere. If you switch models, you must re-embed your entire corpus — plan for this from day one.
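One inexpensive safeguard is to record the embedding model name and dimensionality alongside the index and reject any query vector that does not match. The sketch below is an illustration only: `VectorIndex`, `check_query`, and the model names are hypothetical, not part of any vector database's API.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Hypothetical index wrapper that remembers which model produced its vectors."""
    embedding_model: str
    dims: int
    vectors: list[list[float]] = field(default_factory=list)

    def check_query(self, model: str, vector: list[float]) -> None:
        """Raise if the query was embedded with a different model or dimensionality."""
        if model != self.embedding_model:
            raise ValueError(
                f"index built with {self.embedding_model!r}, query embedded with {model!r}"
            )
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dims, got {len(vector)}")

index = VectorIndex(embedding_model="model-a", dims=3)
index.check_query("model-a", [0.1, 0.2, 0.3])  # passes silently
```

Note that matching dimensionality alone is not enough: two different models can emit vectors of the same length in unrelated vector spaces, which is why the model name is checked first.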
Pure semantic search is rarely enough in production. Users need "contracts signed after Jan 2023" or "NDAs involving counterparty Acme Corp." Vector databases support metadata filtering — attached structured fields that constrain the ANN search space before or during vector comparison.
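Implementation strategy matters. Pre-filtering (filter, then search only the matching subset) respects the filter exactly, but a very selective filter can fragment the HNSW graph or force a slow brute-force scan. Post-filtering (search, then filter) is simple, but can return fewer than k results when most of the top candidates fail the filter. In-filtering (Qdrant's approach) interleaves filtering with HNSW graph traversal for the best accuracy/speed balance.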
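The effect of filter ordering on result counts can be shown with a toy example. This sketch works over an exact similarity ranking with no ANN index involved, so it illustrates only how many results each ordering returns, not the speed or graph-traversal behavior of a real vector database. The function names and metadata are hypothetical.

```python
from typing import Callable

Metadata = dict[str, dict[str, str]]

def post_filter(ranked_ids: list[str], metadata: Metadata,
                predicate: Callable[[dict[str, str]], bool], k: int) -> list[str]:
    """Search first (take top-k by similarity), then drop non-matching results."""
    return [cid for cid in ranked_ids[:k] if predicate(metadata[cid])]

def pre_filter(ranked_ids: list[str], metadata: Metadata,
               predicate: Callable[[dict[str, str]], bool], k: int) -> list[str]:
    """Filter first, then take the top-k among the chunks that match."""
    return [cid for cid in ranked_ids if predicate(metadata[cid])][:k]

# Chunks ordered by similarity to some query, best first.
ranked = ["a", "b", "c", "d", "e"]
meta = {
    "a": {"type": "MSA"}, "b": {"type": "MSA"}, "c": {"type": "NDA"},
    "d": {"type": "NDA"}, "e": {"type": "MSA"},
}

def is_nda(fields: dict[str, str]) -> bool:
    return fields["type"] == "NDA"

after = post_filter(ranked, meta, is_nda, k=2)   # top-2 are both MSAs
before = pre_filter(ranked, meta, is_nda, k=2)
```

Here the post-filter returns nothing because both top-2 candidates are MSAs, while the pre-filter returns the two best NDAs.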
When you update your embedding model (new version, better accuracy), your index becomes stale, because the new model's vector space may not align with the old one. Production systems need a re-indexing strategy: run a background job to re-embed all chunks into a new index, then swap the indexes atomically when it completes. The Pinterest engineering blog (2023) described this as the #1 operational pain point in their recommendation RAG system.
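A common way to make the swap atomic is to have readers query through an alias and repoint the alias only after the new index is fully built. The sketch below uses an in-memory stand-in; `IndexRegistry`, `reindex`, and the index and model names are all hypothetical, and a real vector store would expose its own alias or collection-swap mechanism.

```python
from typing import Callable

class IndexRegistry:
    """In-memory stand-in for a vector store with named indexes and aliases."""

    def __init__(self) -> None:
        self._indexes: dict[str, dict] = {}
        self._aliases: dict[str, str] = {}

    def create(self, name: str, model: str) -> None:
        self._indexes[name] = {"model": model, "vectors": {}}

    def upsert(self, name: str, chunk_id: str, vector: list[float]) -> None:
        self._indexes[name]["vectors"][chunk_id] = vector

    def point_alias(self, alias: str, name: str) -> None:
        if name not in self._indexes:
            raise KeyError(name)
        # Single assignment: readers see the old index or the new one, never a mix.
        self._aliases[alias] = name

    def resolve(self, alias: str) -> dict:
        return self._indexes[self._aliases[alias]]

def reindex(registry: IndexRegistry, alias: str, new_name: str, new_model: str,
            chunks: dict[str, str], embed: Callable[[str], list[float]]) -> None:
    """Build a complete new index in the background, then swap the alias."""
    registry.create(new_name, new_model)
    for chunk_id, text in chunks.items():
        registry.upsert(new_name, chunk_id, embed(text))
    registry.point_alias(alias, new_name)  # the only step readers can observe

registry = IndexRegistry()
chunks = {"c1": "refund policy", "c2": "shipping policy"}
reindex(registry, "prod", "idx_v1", "model-v1", chunks, lambda t: [float(len(t))])
reindex(registry, "prod", "idx_v2", "model-v2", chunks, lambda t: [float(len(t)), 1.0])
```

Keeping the old index around until the new one has served traffic for a while also gives you a quick rollback path.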
At 1M vectors with 1536 dims in float32, you need about 6GB of RAM just for the vectors (1M × 1536 × 4 bytes), before index overhead. HNSW typically adds 1.2–2x memory overhead for the graph structure. At this scale, Qdrant or Pinecone handle it fine. At 100M+ vectors, you need quantization (scalar quantization to int8 cuts each million vectors from about 6GB to about 1.5GB, and product quantization compresses further) or distributed deployment (Milvus sharding).
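The arithmetic behind these figures is easy to reproduce. In the sketch below, gigabytes are decimal (1e9 bytes), scalar quantization is assumed to mean one byte per dimension (int8), and the HNSW factor is read as a multiplier on the raw vector size; the last point is an interpretation of the 1.2–2x range, not a measured figure.

```python
def vector_memory_gb(num_vectors: int, dims: int, bytes_per_dim: float = 4.0) -> float:
    """Raw storage for the vectors alone, in decimal gigabytes."""
    return num_vectors * dims * bytes_per_dim / 1e9

def with_index_overhead(raw_gb: float, factor: float) -> float:
    """Apply an index overhead multiplier (assumed range for HNSW: 1.2 to 2.0)."""
    return raw_gb * factor

raw_1m = vector_memory_gb(1_000_000, 1536)          # float32: 4 bytes per dim
int8_1m = vector_memory_gb(1_000_000, 1536, 1.0)    # scalar quantization to int8
raw_100m = vector_memory_gb(100_000_000, 1536)      # float32 at 100M vectors

hnsw_low = with_index_overhead(raw_1m, 1.2)
hnsw_high = with_index_overhead(raw_1m, 2.0)

print(f"1M float32:   {raw_1m:.2f} GB")
print(f"1M int8:      {int8_1m:.2f} GB")
print(f"100M float32: {raw_100m:.1f} GB")
print(f"1M + HNSW:    {hnsw_low:.2f} to {hnsw_high:.2f} GB")
```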
Your startup's RAG prototype works great at 10,000 documents using Chroma + OpenAI embeddings. You've just closed a Series A and need to scale to 10 million documents within 6 months. Your AI partner will help you architect the production vector infrastructure — covering database selection, HNSW tuning, quantization, and embedding model trade-offs at scale.
In 2023, Notion's AI team published a technical blog detailing their RAG system for the Notion AI product. Their biggest accuracy gain came not from a better LLM but from query rewriting: before hitting the vector store, a small LLM rewrote each user question into 3 alternative phrasings, each phrasing was run against the vector store, and the results were merged. This HyDE-adjacent, multi-query technique improved answer relevance on internal evaluations by approximately 22% compared to single-query retrieval.
Standard RAG retrieves the top-k most semantically similar chunks to a query. This works well when the user's phrasing closely matches document language. It fails when:
Vocabulary mismatch: The user asks "how do I cancel my subscription" but documents say "termination procedures for recurring billing." The meaning is the same, but the embedding distance may still be high.
Multi-hop reasoning: The answer requires facts from two different documents, and neither alone is sufficient.
Ambiguous queries: "What's our policy on this?" has no grounding unless the query is expanded.
Long documents with sparse relevance: The relevant sentence is buried in a 10,000-word report, and the chunk containing it may not rank in the top-k.
Before retrieval, a small LLM transforms the user query to improve retrieval performance. Several strategies exist, each with different trade-offs.
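One such strategy is multi-query rewriting: generate alternative phrasings, retrieve for each, and merge the results. The sketch below assumes two hypothetical callables, `rewrite` (the small LLM that returns alternative phrasings) and `search` (the vector store lookup returning scored chunk IDs), and merges by keeping each chunk's best score. Other merge rules are equally plausible.

```python
from typing import Callable

def multi_query_retrieve(
    query: str,
    rewrite: Callable[[str], list[str]],
    search: Callable[[str], list[tuple[str, float]]],
    k: int = 5,
) -> list[str]:
    """Retrieve for the original query plus its rewrites, then merge.

    Each chunk keeps the best score it received from any phrasing.
    """
    variants = [query] + rewrite(query)
    best: dict[str, float] = {}
    for variant in variants:
        for chunk_id, score in search(variant):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Stubbed dependencies so the sketch runs without an LLM or a vector store.
fake_results = {
    "how do I cancel my subscription": [("faq-12", 0.41)],
    "terminate recurring billing": [("policy-7", 0.88), ("faq-12", 0.35)],
}

def fake_rewrite(q: str) -> list[str]:
    return ["terminate recurring billing"]

def fake_search(q: str) -> list[tuple[str, float]]:
    return fake_results.get(q, [])

merged = multi_query_retrieve("how do I cancel my subscription", fake_rewrite, fake_search, k=2)
```

In this stubbed example the rewritten phrasing surfaces the policy document that the original wording ranked poorly, which is the vocabulary-mismatch case the technique targets.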
After first-pass retrieval returns candidate chunks, a cross-encoder reranker re-scores each candidate by processing the query and candidate together in a single forward pass. This is far more accurate than bi-encoder cosine similarity (which processes them independently) but too slow for first-pass retrieval across millions of vectors — hence the two-stage approach.
Cohere Rerank API (2023): the most widely used managed reranking service. BGE-Reranker (BAAI): state-of-the-art open-source cross-encoder. ms-marco-MiniLM: lightweight cross-encoder trained on the MS MARCO passage-ranking dataset from Microsoft, fast enough for real-time reranking of top-20 candidates.
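The second stage of a two-stage pipeline has a simple shape: score every (query, candidate) pair jointly, then keep the best few. In the sketch below, `overlap_score` is a deliberately crude word-overlap stand-in for a real cross-encoder, used only so the example runs without a model; in practice you would pass in a function that calls one of the rerankers listed above.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Re-score first-pass candidates with a pairwise scorer and keep the top_n."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [passage for _, passage in scored[:top_n]]

def overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0

first_pass = [
    "shipping times vary",
    "cancel your subscription in settings",
    "subscription pricing tiers",
]
reranked = rerank("how do i cancel my subscription", first_pass, overlap_score, top_n=2)
```

Because the scorer sees the query and passage together, swapping in a real cross-encoder changes only the `score_pair` argument.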
Complex queries sometimes cannot be answered with a single retrieval step. Iterative RAG uses the LLM's output to generate follow-up retrieval queries. FLARE (Forward-Looking Active REtrieval, Jiang et al. 2023) generates the answer one sentence at a time: it drafts a tentative next sentence and, if any token in that draft falls below a confidence threshold, uses the draft as a retrieval query and regenerates the sentence with the retrieved evidence. In effect, it retrieves only when the model needs it.
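A simplified loop in the spirit of FLARE is sketched below. It is not the authors' implementation: `generate_sentence` is a hypothetical callable that returns the next sentence together with per-token probabilities (or `None` when the answer is finished), `retrieve` is a hypothetical search function, and the threshold value is arbitrary.

```python
from typing import Callable, Optional

Generator = Callable[[str, list[str], list[str]], tuple[Optional[str], list[float]]]

def active_retrieval_generate(
    question: str,
    generate_sentence: Generator,
    retrieve: Callable[[str], list[str]],
    threshold: float = 0.5,
    max_sentences: int = 5,
) -> str:
    """Generate sentence by sentence, retrieving only when confidence is low."""
    answer: list[str] = []
    context: list[str] = []
    for _ in range(max_sentences):
        sentence, probs = generate_sentence(question, answer, context)
        if sentence is None:
            break
        if probs and min(probs) < threshold:
            # Low confidence: use the draft sentence as the query, then regenerate.
            context = retrieve(sentence)
            sentence, probs = generate_sentence(question, answer, context)
            if sentence is None:
                break
        answer.append(sentence)
    return " ".join(answer)

# Stubs: the model is unsure about the fee until it sees retrieved evidence.
retrieval_queries: list[str] = []

def fake_generate(question: str, answer: list[str], context: list[str]):
    if answer:
        return None, []
    if context:
        return "The fee is 30 USD.", [0.9, 0.9]
    return "The fee is 20 USD.", [0.9, 0.2]

def fake_retrieve(query: str) -> list[str]:
    retrieval_queries.append(query)
    return ["Fee schedule: 30 USD"]

result = active_retrieval_generate("What is the fee?", fake_generate, fake_retrieve)
```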
LlamaIndex's Sub-Question Query Engine decomposes complex questions into sub-questions, dispatches retrieval for each, then synthesizes. This is the architecture behind many enterprise "research assistant" RAG deployments documented in 2023–2024.
You cannot improve what you don't measure. The RAGAs framework (open-source, 2023) provides automated, LLM-based evaluation across four core metrics (faithfulness, answer relevancy, context precision, and context recall), several of which can be computed without ground-truth labels. ARES (Stanford, 2023) uses a trained LLM judge to evaluate context relevance and answer faithfulness against a small labeled dataset. Both are widely used in production RAG evaluation pipelines.
You have a working RAG system for a healthcare benefits platform. Users query a corpus of insurance policy documents. Your baseline (simple top-5 retrieval, no reranking) scores 68% on your internal QA evaluation set. Your CTO has set an 85% target before launch. Work with your AI partner to diagnose the failure modes and implement a multi-technique improvement strategy.
In July 2024, Microsoft Research published GraphRAG, an open-source system combining knowledge graph extraction with traditional vector retrieval. On benchmark "global reasoning" queries (questions requiring synthesis across many documents), GraphRAG achieved a 72% win rate against naive RAG on community-level sensemaking tasks. The improvement came from extracted entity relationships, not raw text retrieval.
That same quarter, LlamaIndex released LlamaCloud citing a recurring pattern from enterprise customers: RAG systems that scored 85%+ on launch-day evaluations degraded to 70% within 90 days as underlying documents changed and query distributions shifted. Continuous evaluation pipelines had become non-negotiable.
Production RAG evaluation operates at three layers. Offline evaluation runs a curated QA benchmark before every deployment — similar to a test suite. Online evaluation scores live queries using LLM judges and tracks metrics over time. Human evaluation samples a percentage of production queries weekly for expert review — the ground truth check.
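The offline layer can be wired into a deployment pipeline much like a unit test. The sketch below measures only retrieval hit rate on a golden set, which is one narrow slice of quality; the golden-set schema, the `retrieve` callable, and the 0.85 gate are illustrative assumptions rather than a standard.

```python
from typing import Callable

def hit_rate_at_k(golden: list[dict[str, str]],
                  retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """Fraction of golden questions whose expected chunk appears in the top-k."""
    if not golden:
        return 0.0
    hits = 0
    for item in golden:
        if item["expected_chunk"] in retrieve(item["question"])[:k]:
            hits += 1
    return hits / len(golden)

def deployment_gate(score: float, minimum: float = 0.85) -> bool:
    """Block the release when the benchmark score falls below the minimum."""
    return score >= minimum

golden_set = [
    {"question": "What is the refund window?", "expected_chunk": "policy-3"},
    {"question": "How long does shipping take?", "expected_chunk": "policy-9"},
]

# Stub retriever standing in for the real pipeline under test.
def fake_retrieve(question: str) -> list[str]:
    return ["policy-3", "faq-1"] if "refund" in question else ["faq-2", "faq-4"]

score = hit_rate_at_k(golden_set, fake_retrieve, k=2)
can_deploy = deployment_gate(score)
```

Retrieval hit rate says nothing about whether the generated answer is faithful to the retrieved chunk, so answer-level scoring still needs an LLM judge or human review.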
RAG systems fail silently. Unlike a 500 error, a hallucinated answer looks identical to a correct one from the outside. Production observability requires instrumenting the full trace — not just the final answer.
LangSmith (LangChain's observability platform, launched 2023) logs the full LLM call tree including retrieval steps, latencies, and prompt/response content. Phoenix by Arize provides embedding visualization and retrieval quality metrics. Langfuse (open-source alternative) supports cost tracking and custom evaluation scoring via API.
Retrieval latency p95/p99: degradation here often precedes accuracy issues (index fragmentation, DB under load).
Context length utilization: are retrieved chunks using your full context window or wasting it?
Reranking score distribution: a narrowing score gap between the top-1 and top-5 reranked candidates often signals that retrieval is returning irrelevant results.
Query refusal rate: the model saying "I don't have enough information" is an informative signal, not a failure.
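Two of these signals are simple to compute from logged traces. The sketch below uses the nearest-rank definition of a percentile, which is one of several conventions, and measures the reranking gap as the top score minus the fifth-best score; both function names and the sample numbers are illustrative.

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def rerank_score_gap(scores: list[float], depth: int = 5) -> float:
    """Gap between the best reranker score and the depth-th best (or last available)."""
    ordered = sorted(scores, reverse=True)[:depth]
    return ordered[0] - ordered[-1] if ordered else 0.0

latencies_ms = [10.0, 20.0, 30.0, 40.0, 50.0]
p95 = percentile(latencies_ms, 95)
p50 = percentile(latencies_ms, 50)

healthy_gap = rerank_score_gap([0.9, 0.4, 0.3, 0.2, 0.1, 0.05])
suspect_gap = rerank_score_gap([0.31, 0.30, 0.30, 0.29, 0.29])
```

Tracking these per day, rather than per query, makes a drift in either value much easier to spot.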
Standard RAG retrieves text chunks. GraphRAG extracts a knowledge graph from documents — entities, relationships, communities — and adds graph-structured retrieval alongside vector retrieval. This fundamentally changes what kinds of questions can be answered.
Vector RAG excels at: "What does document X say about topic Y?" — local, specific retrieval. GraphRAG excels at: "What is the overall theme connecting these 50 documents?" or "Which entities appear most frequently in the context of risk?" — global, relational queries.
Microsoft's GraphRAG open-source release (July 2024) uses a two-phase approach: offline graph construction (entity extraction → relationship extraction → community detection) and online query routing (local search for specific facts, global search for thematic synthesis). The community detection step (using the Leiden algorithm) creates hierarchical summaries at multiple granularities.
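The shape of the offline phase can be illustrated with a toy graph. This sketch is not Microsoft's implementation: entities are assumed to be already extracted per chunk (GraphRAG uses an LLM for that), edges are plain co-occurrence, and connected components stand in for Leiden community detection, which would find finer-grained, hierarchical communities.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(chunk_entities: list[set[str]]) -> dict[str, set[str]]:
    """Link entities that co-occur in the same chunk."""
    graph: dict[str, set[str]] = defaultdict(set)
    for entities in chunk_entities:
        for entity in entities:
            graph[entity]  # touch the key so isolated entities still become nodes
        for a, b in combinations(sorted(entities), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

def communities(graph: dict[str, set[str]]) -> list[set[str]]:
    """Group entities into connected components (a crude stand-in for Leiden)."""
    seen: set[str] = set()
    groups: list[set[str]] = []
    for node in sorted(graph):
        if node in seen:
            continue
        group: set[str] = set()
        stack = [node]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            group.add(current)
            stack.extend(graph[current] - seen)
        groups.append(group)
    return groups

# Hypothetical entities extracted from three chunks.
extracted = [{"Acme", "Globex"}, {"Globex", "Initech"}, {"Umbrella"}]
graph = build_graph(extracted)
groups = communities(graph)
```

In a fuller pipeline, each community would then be summarized by an LLM, and global queries would be answered from those summaries rather than from raw chunks.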
Production knowledge systems must handle the full document lifecycle: ingestion (new documents added), updates (existing documents revised — requires deleting old chunks and re-embedding), and deletion (documents removed — requires removing associated vectors and metadata). Vector stores that lack delete-by-metadata capability create technical debt that compounds over time.
The most robust pattern (used by Notion AI, Morgan Stanley, and Glean according to their respective engineering blogs) is document-level versioning: each document has a version ID as metadata, chunks are tagged with their parent document ID and version, and updates delete all chunks with the old document ID before re-indexing.
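The versioning pattern can be sketched with an in-memory store. `ChunkStore` and its methods are hypothetical stand-ins; a real vector store would perform the delete step through its own delete-by-metadata or delete-by-filter call, and embedding is omitted to keep the example focused on lifecycle handling.

```python
class ChunkStore:
    """In-memory stand-in for a vector store with delete-by-metadata support."""

    def __init__(self) -> None:
        # chunk_id -> {"doc_id": ..., "version": ..., "text": ...}
        self.chunks: dict[str, dict] = {}

    def delete_by_doc(self, doc_id: str) -> int:
        """Remove every chunk belonging to a document; return how many were removed."""
        stale = [cid for cid, chunk in self.chunks.items() if chunk["doc_id"] == doc_id]
        for cid in stale:
            del self.chunks[cid]
        return len(stale)

    def upsert_document(self, doc_id: str, version: int, texts: list[str]) -> None:
        """Replace all chunks of a document with the chunks of its new version."""
        self.delete_by_doc(doc_id)  # drop old-version chunks before re-indexing
        for position, text in enumerate(texts):
            chunk_id = f"{doc_id}:v{version}:{position}"
            self.chunks[chunk_id] = {"doc_id": doc_id, "version": version, "text": text}

store = ChunkStore()
store.upsert_document("doc1", 1, ["old intro", "old terms", "old appendix"])
store.upsert_document("doc2", 1, ["unrelated"])
store.upsert_document("doc1", 2, ["new terms"])  # revision shrinks the document
removed = store.delete_by_doc("doc2")            # document deleted at the source
```

Deleting by document ID, not by chunk position, matters when a revision has fewer chunks than its predecessor: overwriting position by position would leave the old tail chunks searchable.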
Glean, the enterprise search company valued at $4.6B in 2024, published engineering posts describing their RAG infrastructure. Key features: per-user and per-department access control enforced at the vector search layer via metadata filters (not post-retrieval); real-time document ingestion pipelines targeting under 60-second latency from document update to searchable chunk; and hybrid retrieval (BM25 + dense) across 100+ enterprise SaaS connectors. Document lifecycle management — handling version updates across Confluence, Salesforce, Slack, etc. — is described as their primary engineering complexity.
A mature production RAG system layers all the techniques from this module: hybrid retrieval as the first pass, cross-encoder reranking as the second pass, query rewriting to handle vocabulary gaps, metadata filtering for access control and relevance scoping, continuous evaluation via RAGAs on a golden dataset, full trace observability in LangSmith or Phoenix, and document lifecycle management with versioned chunks. GraphRAG adds an optional third retrieval path for global/relational queries.
You're the ML engineer at a hedge fund that has deployed a RAG system over 10 years of analyst reports, earnings transcripts, and SEC filings. The system is live with 50 analyst users. Your task: design the full evaluation and observability stack that will catch regressions before users do, handle quarterly document updates cleanly, and decide whether to add GraphRAG for cross-corpus thematic queries.