In November 2023, Klarna publicly described its AI assistant β built on a retrieval-augmented stack β handling over 2.3 million conversations in its first month. The prototype that preceded it had been tested at roughly 500 queries per day. The gap between those two numbers is where production RAG architecture lives.
Most RAG prototypes share a single-process architecture: one embedding model, one vector store connection, one LLM call per request β all synchronous, all in sequence. At low volume this is fine. Under load, three failure modes emerge simultaneously.
Embedding bottleneck. Embedding a user query requires a model inference call. At 10 queries per second, a 40ms embedding latency consumes 400ms of CPU time per second β already near saturation on a single core. At 100 QPS the queue grows unbounded.
Vector store connection exhaustion. Pinecone, Weaviate, and Qdrant all enforce per-account connection limits. A naive implementation opens a new connection per request; under load, connection setup latency (10β50ms) compounds with query latency and the connection pool exhausts.
LLM rate limits. OpenAI's GPT-4 Turbo has a default tokens-per-minute (TPM) limit of 450,000 for new tier accounts. A RAG response averaging 1,200 tokens means roughly 375 requests per minute before rate-limit errors begin. No queue means dropped requests.
When Notion AI launched in November 2022, users reported multi-second latencies and partial outages. Notion's engineering team later attributed the instability partly to synchronous embedding calls blocking response threads β a classic single-process bottleneck at scale.
A production-grade RAG system separates concerns into at least four independently scalable tiers: an ingestion pipeline, a retrieval service, a generation service, and an orchestration layer. Each can scale horizontally without affecting the others.
The core shift in production RAG is moving from synchronous to asynchronous processing. A message queue β Redis Streams, Amazon SQS, or Apache Kafka β decouples request acceptance from request processing. The API gateway acknowledges the request immediately; a pool of workers picks it up for processing.
For the embedding service specifically, batching transforms economics dramatically. Sending 32 queries to OpenAI's text-embedding-3-small in a single API call costs the same as sending 1 query but produces 32 embeddings. With a queue, a worker can collect 32 pending queries before issuing the batch call β achieving near-linear throughput improvement at no additional cost.
Worker pool sizing follows Little's Law: L = Ξ»W, where L is the average number of items in the system, Ξ» is throughput, and W is average processing time. If embedding takes 80ms and you target 500 QPS, you need at least 40 concurrent embedding workers just for that tier.
Production RAG stacks typically implement two distinct caches. The semantic cache stores (query-embedding β retrieved-chunks) mappings. If a new query's embedding is within cosine distance 0.05 of a cached query, the retrieval step is skipped entirely. This is the approach used by GPTCache, open-sourced by Zilliz in 2023.
The response cache stores (query-hash β final-LLM-response) mappings for exact or near-exact repeated queries. In enterprise deployments where many users ask the same FAQ-style questions, response cache hit rates of 30β60% are common, eliminating LLM calls entirely for those requests.
Each tier in production RAG should have its own health check, circuit breaker, and independent scaling policy. A spike in LLM latency should never cause the embedding service to queue-starve or the vector store to timeout. Isolation is the foundation of resilience.
You are an engineer at a mid-size SaaS company. Your RAG-powered support assistant prototype works well at 50 queries per day. The product team has committed to a public launch expected to drive 5,000 queries per hour at peak. You need to redesign the architecture.
Discuss infrastructure choices, bottleneck analysis, caching strategies, and worker pool sizing with the lab assistant. Complete at least 3 exchanges to finish the lab.
In March 2023, Air Canada's RAG-based support chatbot told a passenger that a bereavement discount could be claimed retroactively β a policy that did not exist. The system had retrieved an outdated policy document and the LLM generated a confident, incorrect response. The case went to the British Columbia Civil Resolution Tribunal, which ruled against Air Canada. The airline was ordered to pay damages.
The failure was not in the LLM's reasoning. It was in the absence of any monitoring that would have flagged which documents were being retrieved for policy queries β and whether those documents were current.
Production RAG observability operates at three distinct levels that must be monitored independently. Infrastructure metrics tell you whether the system is running. Pipeline metrics tell you how well individual components are performing. Quality metrics tell you whether the answers are actually good.
Most teams instrument infrastructure metrics β latency, error rates, CPU usage β well from day one. Pipeline and quality metrics are where production RAG systems consistently fail silently.
Every production RAG request should emit a structured trace containing: the original query, the query embedding (or a hash of it), the top-k retrieved chunk IDs and their similarity scores, the assembled context window, the LLM prompt, and the final response. This trace is the foundation of all downstream debugging and quality analysis.
Frameworks built for this purpose include LangSmith (LangChain's observability platform, released publicly in 2023), Arize Phoenix, and Weights & Biases Weave. Each captures span-level traces across the full pipeline. Without traces, diagnosing a quality regression reduces to guesswork.
The critical trace field that most teams initially omit is chunk provenance β specifically which document, which version of that document, and which ingestion timestamp each retrieved chunk came from. The Air Canada failure was precisely a provenance failure: a stale document version was in the index with no mechanism to detect or flag it.
Elastic's internal RAG deployment (described at their 2023 engineering summit) stores a last_modified timestamp and a source_hash alongside each vector in their Elasticsearch store. A daily job flags chunks older than 90 days for human review. Any query that retrieves a flagged chunk automatically appends a confidence warning to the response.
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework published by Explorazure in 2023 that computes four automated metrics without requiring human-labeled ground truth for every query.
Faithfulness measures whether every factual claim in the generated answer can be traced to the retrieved context β catching hallucinations introduced by the LLM despite having correct source material. Answer Relevancy measures whether the answer actually addresses the question. Context Precision measures whether retrieved chunks are signal or noise. Context Recall measures whether all information needed to answer the question was actually retrieved.
Running RAGAS on a random sample of 100β500 queries per day provides a continuous quality signal that can alert engineers before users surface problems. Databricks reported in 2024 that automated RAGAS monitoring caught a 12-percentage-point drop in faithfulness scores caused by an inadvertent index update β 18 hours before any user complaints arrived.
Effective RAG observability requires setting alert thresholds at each layer. A reasonable baseline for a customer-facing system: P99 latency above 4 seconds triggers a page; MRR below 0.6 triggers a Slack alert; RAGAS faithfulness below 0.75 triggers an engineering review; daily CSAT correlation below 0.5 triggers a model/pipeline audit.
These thresholds are starting points, not universal standards. They should be calibrated against your baseline during the first 30 days of production operation, then tightened as you understand your system's normal variance.
Your RAG-powered internal knowledge assistant just went live for 800 employees. Within two weeks, the HR team reports that answers about benefits policies seem outdated. You have no current monitoring in place β just basic server uptime checks.
Work with the lab assistant to design a comprehensive observability strategy covering infrastructure, pipeline quality, and automated alerting. Complete at least 3 exchanges to finish the lab.
In February 2024, Samsung's internal ChatGPT deployment β used by engineers to query internal documentation β became the subject of a data leak investigation. Engineers had pasted confidential source code into prompts and internal meeting notes into the system, which then stored that content in external model training pipelines. Samsung subsequently banned the use of generative AI tools on company networks.
A self-hosted RAG system with proper data isolation would have prevented the external exposure. But self-hosted RAG introduces its own security surface: who can retrieve what. Without row-level security on the vector store, every user can potentially retrieve every document.
Standard vector similarity search is inherently authorization-blind. When you query a vector database for the top-k most similar chunks to a given embedding, the database returns whichever chunks score highest β regardless of who originally had access to the source document. In an enterprise context where the index contains documents from Legal, HR, Finance, and Engineering, this means every user potentially reaches every document.
This is not a theoretical risk. When Salesforce deployed an internal RAG assistant in 2022, engineers discovered in internal testing that queries about compensation packages were retrieving confidential HR salary band documents that had been inadvertently ingested into the shared index β accessible to all employees.
All major vector databases β Pinecone, Weaviate, Qdrant, Chroma, and pgvector β support metadata filters applied at query time. Every chunk, at ingestion, receives metadata tags including owner_department, clearance_level, allowed_user_ids, or allowed_groups depending on the access model.
At query time, the retrieval service injects a metadata filter based on the authenticated user's attributes. The vector similarity search only scores chunks whose metadata satisfies the filter. This approach is called pre-retrieval filtering and is the most reliable pattern because unauthorized documents are never scored β they are excluded before the similarity computation even runs.
The alternative β post-retrieval filtering, where results are retrieved and then filtered by authorization β is significantly weaker because it reduces effective k (you may retrieve k=10 but return only k=3 after filtering) and introduces information leakage risk if filter logic has bugs.
For hard multi-tenant isolation β where different organizations or business units must never share any vector space β namespace isolation is the appropriate pattern. Pinecone's namespaces and Qdrant's collection partitioning allow completely separate vector spaces within the same physical infrastructure, with no cross-namespace query possible at the database level.
Namespace isolation eliminates the metadata filter injection requirement β there is simply no mechanism to query across namespaces β at the cost of higher operational complexity (each tenant's namespace must be independently managed for ingestion, indexing, and deletion).
A practical rule: use metadata filtering for role-based access within a single organization; use namespace isolation for true multi-tenant deployments serving different legal entities.
Production RAG systems face a novel attack surface: adversarial content embedded in documents that, when retrieved, manipulates the LLM's behavior. This is called indirect prompt injection. In 2023, researchers at ETH Zurich demonstrated that injecting hidden instructions into documents indexed by a RAG system could cause the LLM to leak other retrieved documents to the attacker through the generated response.
Defenses include: context sanitization (stripping instruction-like patterns from retrieved text before insertion into the prompt), privilege separation (the LLM layer should never have write access to the vector store or any persistent storage), and response auditing (flagging responses that contain structural patterns resembling data exfiltration β e.g., JSON blobs or base64 strings in answers to natural language questions).
The OWASP Top 10 for LLM Applications, published in 2023, lists "Insecure Output Handling" and "Sensitive Information Disclosure" among the top vulnerabilities. Both are directly applicable to RAG systems: outputs that include retrieved confidential material and systems that retrieve more data than the user is authorized to see.
You are the security engineer for a 2,000-person company deploying a unified RAG assistant over documents from Legal, HR, Finance, Engineering, and Sales. All documents are currently in a single unfiltered index. A routine audit has flagged that any employee can potentially retrieve any document.
Work through the security architecture with the lab assistant: access control model, namespace vs. metadata filtering decisions, prompt injection defenses, and audit logging. Complete at least 3 exchanges to finish the lab.
In January 2024, Morgan Stanley disclosed that its OpenAI-powered financial advisor assistant β which had been trained on over 100,000 proprietary research documents β required a dedicated team of four ML engineers to manage index freshness, handle document retirement, and run continuous evaluation. The system that launched in 2023 shared little more than architecture with the system running twelve months later.
What had changed was not the model. It was the corpus, the evaluation benchmarks, and the retrieval strategy β all evolved through a structured feedback loop that treated production traffic as a continuous source of training signal.
The most valuable signal for improving a production RAG system is structured user feedback. A minimal implementation captures binary thumbs-up/thumbs-down on each response. More valuable is a correction capture β when a user indicates an answer is wrong and provides the correct information, that pair (query, correct answer) becomes a labeled evaluation example.
Over 90 days of production operation, even a system handling 1,000 queries per day with a 5% explicit feedback rate generates 4,500 labeled examples. This corpus enables offline evaluation: replay historical queries through updated pipeline configurations and measure quality improvement before deploying changes.
Bing's AI team described this approach at the 2023 ACL workshop on industrial NLP: they maintained a "golden set" of queries with human-verified correct answers, growing it continuously from user feedback, and used it to gate any change to the retrieval pipeline before deployment.
Document indices drift from reality along three axes: deletion drift (source documents are removed or superseded but their chunks remain indexed), update drift (documents are revised but old chunks persist alongside new ones), and growth drift (the index grows so large that relevant documents are diluted by noise, lowering precision).
Managing deletion drift requires a delete-propagation pipeline: when a document is removed from the source system, all chunks derived from it must be deleted from the vector index. This requires maintaining a document_id β [chunk_ids] mapping at ingestion time β a step that is frequently omitted in prototype RAG systems and then painful to retrofit.
Update drift is handled by treating document updates as delete-then-reinsert operations. When a document is modified, all its old chunks are deleted and replaced with chunks from the new version. This maintains chunk-level consistency without requiring complex differential update logic.
Glean, the enterprise search startup, built their production RAG ingestion pipeline to listen to webhook events from Confluence and Notion. Any page update or deletion triggers a real-time reindexing job. Their 2023 engineering blog post reported that without webhook-driven reindexing, their index had 23% stale content within two weeks of initial ingestion at a 500-employee company.
Production RAG systems typically evolve through three retrieval strategy phases. Dense-only retrieval (pure vector similarity) is where most systems start. Hybrid retrieval (dense + BM25 sparse) is added when users report that exact-match queries β product names, error codes, specific dates β are failing because dense embeddings don't preserve lexical identity well. Reranking is added when retrieval precision is acceptable but ordering is poor β the right documents are in the top-20 but not consistently in the top-3.
The decision to add each layer should be data-driven: a drop in MRR below 0.65 for a class of queries is the signal to investigate hybrid retrieval; a gap between context recall and faithfulness (chunks are retrieved but answers are unfaithful) suggests reranker addition. These thresholds are not guesses β they are derived from the labeled evaluation set built through user feedback.
Any change to a production RAG pipeline β new embedding model, updated chunking strategy, additional reranker β should be validated through A/B testing before full deployment. The traffic split sends a percentage of queries through the new configuration; automated RAGAS scoring and user feedback rates are compared between control and treatment.
LangSmith's production deployment dashboard, introduced in late 2023, supports this pattern natively: you define two pipeline configurations, route traffic fractions to each, and view side-by-side quality metric comparisons. Cloudflare's AI team described using a similar pattern when migrating their internal documentation RAG system from Ada-002 to text-embedding-3-large in early 2024 β the A/B test ran for 72 hours over 18,000 queries before they committed to the migration.
Production RAG maintenance should follow a structured weekly cadence: daily automated RAGAS scores reviewed for threshold violations; weekly review of low-rated user feedback examples (bottom 50 by score); monthly retrieval strategy review against the golden evaluation set; quarterly full corpus freshness audit and embedding model evaluation against newer alternatives.
This cadence is not optional overhead β it is the mechanism by which a RAG system remains useful rather than drifting into confident incorrectness, which is the failure mode that erodes user trust faster than any single outage.
Every user feedback example you capture, every evaluation run you conduct, every index freshness audit you complete compounds. Teams that establish these practices in the first 90 days of production operation consistently outperform teams that retrofit them at month six β not because the technology is different, but because they have accumulated ground truth that cannot be reconstructed retrospectively.
You launched a RAG-powered product documentation assistant six months ago. At launch it answered 78% of queries correctly (by RAGAS faithfulness). Today that number is 61%. Support tickets have increased 40%. Leadership wants a plan to recover quality and prevent future degradation.
Work with the lab assistant to diagnose the degradation, design a feedback capture system, establish an index maintenance strategy, and build an operational cadence. Complete at least 3 exchanges to finish the lab.