In 2023, Elastic's engineering team published a post-mortem on a customer's internal legal search system that had switched entirely to dense vector retrieval. Lawyers searching for the precise clause number "§ 12.4(b)" received documents about broadly similar liability concepts — semantically close, but not the right clause. The system had no mechanism to match the exact string. Cases were prepared using the wrong contract versions. The fix required re-introducing BM25 alongside the embeddings.
Dense vector retrieval — embedding documents and queries into high-dimensional space, then retrieving by cosine similarity — has genuine power. It captures paraphrase, synonymy, and cross-lingual meaning in ways that keyword search cannot. But it has a systematic blind spot: exact lexical identity.
When a user queries for a product SKU like "B08N5WRWNW", a model identifier like "GPT-4-turbo-preview", or a legal citation like "17 U.S.C. § 512(c)", the embedding of that string encodes whatever distributional context the model saw during training. It does not guarantee that the token sequence itself is prioritized over semantically similar but textually different strings. In high-stakes retrieval, this distinction matters enormously.
The failure mode can be described as vocabulary mismatch in reverse. Traditional keyword search fails when the user uses different words than the document. Vector search can fail when the user uses exactly the same words as the document, but the embedding space places that text near other documents that do not contain them.
Pinecone's 2024 benchmark study on retrieval quality found that dense-only retrieval scored significantly lower than hybrid retrieval on queries containing named entities, product codes, and technical identifiers — categories where BM25 has a natural advantage due to exact token matching.
There are three recurring situations where dense-only retrieval consistently underperforms: exact strings (SKUs, model identifiers, legal citations), rare terms, and negation.
Best Match 25 (BM25), developed by Stephen Robertson and colleagues in the 1990s on top of the probabilistic relevance framework of Robertson and Spärck Jones, remains one of the most robust lexical retrieval algorithms ever developed. It extends TF-IDF with two key improvements: term frequency saturation (diminishing returns for repeated terms) and document length normalization (so that long documents do not score highly merely because they contain more words and only incidentally match the query terms).
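The two improvements described above can be seen in a few lines of code. The following is a minimal sketch of Okapi BM25 scoring, not a production implementation: the function name `bm25_scores`, the whitespace tokenization, the toy documents, and the parameter values `k1=1.2` and `b=0.75` (common defaults) are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query (illustrative sketch)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # no exact token match, no contribution
            # Smoothed IDF variant that is always non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents get a larger denominator.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            # Saturating term frequency: bounded above by (k1 + 1).
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized on whitespace.
docs = [
    "liability under section 12.4(b) is capped".split(),
    "general liability concepts and indemnity".split(),
    "liability liability liability liability".split(),
]
exact = bm25_scores("12.4(b)".split(), docs)
repeated = bm25_scores(["liability"], docs)
print(exact)
print(repeated)
```

Only the first document contains the token `12.4(b)`, so it is the only one with a non-zero score for that query. For the query `liability`, the third document repeats the term four times but scores well under four times as much as a single occurrence, which is the saturation effect.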
BM25 does not need a GPU. It requires no embedding model. It is deterministic, interpretable, and extremely fast over inverted indexes. Elasticsearch and OpenSearch have shipped BM25 as their default similarity function for years. The BEIR benchmark (Thakur et al., 2021), a widely used zero-shot evaluation suite for information retrieval, found BM25 to be a strong baseline that outperformed many embedding models on specific retrieval tasks, including some involving technical and biomedical documents.
The conclusion from both academic benchmarks and production postmortems is consistent: no single retrieval method dominates across all query types. Hybrid search — combining BM25 scores with dense vector similarity scores — consistently outperforms either method alone on diverse query sets.
Microsoft's Azure Cognitive Search team published results in 2023 showing that hybrid retrieval with Reciprocal Rank Fusion improved NDCG@10 by 8–12 percentage points over dense-only retrieval on enterprise document corpora. Weaviate's 2024 benchmarks showed similar gains for e-commerce and legal retrieval use cases.
The intuition is simple: let BM25 anchor on exact terms, let the embedding model handle paraphrase and intent, and combine the evidence from both channels into a single ranked list.
Hybrid search is not a compromise — it is a principled ensemble. The rest of this module covers how to implement it: from score fusion mathematics, to sparse vector representations that bridge the two worlds, to re-rankers that operate as a final precision layer.
You are a retrieval engineer auditing a dense-only RAG system that keeps failing on specific query types. Use this session to explore the failure taxonomy — exact strings, rare terms, negation — and discuss when BM25 would outperform embeddings.
When Elasticsearch shipped native hybrid search in version 8.9 (August 2023), the engineering team faced a core question: how do you merge a BM25 score (which might range from 0 to 25 for a typical corpus) with a cosine similarity score (which ranges from −1 to 1) into a single ranked list? Simply summing them would let whichever scale is larger dominate. Their solution — and the one that has become an industry default — was Reciprocal Rank Fusion, a rank-based method that ignores raw scores entirely and works only with positions in each ranked list.
BM25 and cosine similarity operate on fundamentally different scales with different statistical properties. BM25 scores are non-negative and unbounded above; they grow with the number and rarity of matched query terms and depend heavily on corpus statistics (IDF values change as documents are added or removed). Cosine similarity is bounded to [−1, 1], but in practice the scores of retrieved documents often cluster in a narrow band, such as 0.7 to 0.99 for many embedding models.
Normalizing both to [0, 1] before combining them is one approach, but it introduces sensitivity to outliers — a single very high BM25 score can compress all others near zero after normalization. Min-max normalization requires knowing the score distribution, which changes with every new query.
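The outlier problem is easy to reproduce. The sketch below applies min-max normalization to two made-up BM25 result sets; the helper name `min_max` and the score values are assumptions chosen only to show the compression effect.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1] within a single result set."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores tie; returning 0.0 for each is an arbitrary convention.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical BM25 scores for one query.
typical = [6.0, 5.0, 4.0, 3.0]
# The same result set with one extreme score added at the top.
with_outlier = [60.0, 6.0, 5.0, 4.0, 3.0]

typical_norm = min_max(typical)
outlier_norm = min_max(with_outlier)
print([round(s, 3) for s in typical_norm])
print([round(s, 3) for s in outlier_norm])
```

In the first set the normalized scores are spread evenly across the interval. In the second, every document except the outlier lands below 0.06, so differences among them carry almost no weight once the scores are combined with another channel.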
This is why rank-based fusion methods, which discard raw scores and work only with ordinal positions, have become preferred in production systems.
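RRF was introduced by Cormack, Clarke, and Buettcher at SIGIR 2009. The core formula is remarkably simple: each document receives a score from each ranked list equal to 1/(k + rank), where rank is the document's 1-based position in that list and k is a constant (typically 60) that dampens the advantage of the very top positions. A document's fused score is the sum of these contributions over every list in which it appears.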
The k=60 constant was chosen empirically by Cormack et al. to balance two competing pressures: making top-ranked documents in any single list matter (k too large dilutes everything) while ensuring that rank 1 in one list doesn't dominate over rank 2 in both lists. The constant has proven remarkably robust across different corpora and query types, which is why it became the default in Elasticsearch, Weaviate, and Qdrant.
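The formula translates almost directly into code. This is a minimal sketch of RRF over lists of document IDs; the function name `rrf_fuse` and the toy rankings are assumptions for illustration, and ranks are taken to be 1-based.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs (best first) with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); raw retrieval scores are never used.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings from the two channels.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_d", "doc_b", "doc_a"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Here `doc_d` is ranked first by the dense channel but is absent from the BM25 list, so it ends up below `doc_b`, which is ranked second in both lists. Because the function accepts any number of lists, the same code covers fusion of three or more channels.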
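An alternative to RRF is direct weighted combination of normalized scores. This requires normalizing both score distributions first (typically with min-max normalization within a query's result set), then computing a weighted sum, for example score(d) = α · bm25_norm(d) + (1 − α) · dense_norm(d), where the weight α ∈ [0, 1] sets the balance between the lexical and vector channels.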
The practical tradeoff: weighted combination is more tunable and can reflect domain-specific knowledge (an e-commerce system might want more BM25 weight for SKU searches; a Q&A system might want more vector weight for intent matching). But it requires careful normalization and can behave unpredictably when score distributions shift — for example, when adding a large batch of new documents changes the IDF landscape.
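To make the tunability concrete, here is a minimal sketch of weighted score fusion. It assumes that α weights the BM25 channel and (1 − α) the vector channel, that a document missing from one channel contributes 0 for it, and that scores are min-max normalized per query. The function names, document IDs, and scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1] within one result set."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: ((s - lo) / span if span else 0.0) for d, s in scores.items()}

def weighted_fuse(bm25, dense, alpha=0.5):
    """Combine normalized scores: alpha * BM25 + (1 - alpha) * dense."""
    bm25_n, dense_n = min_max(bm25), min_max(dense)
    doc_ids = set(bm25_n) | set(dense_n)
    fused = {
        d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical raw scores for one query.
bm25 = {"sku_page": 14.2, "blog_post": 3.1, "faq": 1.0}
dense = {"blog_post": 0.91, "faq": 0.88, "sku_page": 0.80}

lexical_heavy = weighted_fuse(bm25, dense, alpha=0.8)
semantic_heavy = weighted_fuse(bm25, dense, alpha=0.2)
print(lexical_heavy[0][0], semantic_heavy[0][0])
```

With α = 0.8 the exact-match page wins; with α = 0.2 the semantically closest page wins. That sensitivity is the reason α is usually tuned against a labeled validation set rather than guessed.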
| Property | RRF | Weighted Score Fusion |
|---|---|---|
| Score normalization needed? | No — rank-based only | Yes — min-max or z-score |
| Sensitivity to outliers | Low — ranks are robust | High — one outlier compresses others |
| Tunability | k parameter (rarely needs changing) | α requires validation set to tune |
| Interpretability | Moderate — rank arithmetic | High — weighted sum is intuitive |
| Best for | General-purpose, cold-start systems | Domain-specific with labeled data |
| Used in production by | Elasticsearch, Weaviate, Qdrant | Azure Cognitive Search, custom pipelines |
Both RRF and weighted fusion generalize to more than two ranked lists. A production hybrid system might combine BM25, dense embeddings, and a sparse learned representation (like SPLADE). RRF simply sums 1/(k + rank) across all participating lists. Weighted fusion adds a third or fourth term with its own α coefficient, subject to the constraint that all weights sum to 1.
Cohere's ReRank API and Jina AI's reranker sit downstream of this fusion step — they receive the fused top-N results and perform a more expensive cross-attention comparison between the query and each candidate. This separation of concerns (cheap fusion, expensive reranking) is the architecture we'll build toward in Lesson 4.
In Elasticsearch 8.9+, hybrid search with RRF is a first-class API feature: set rank: {rrf: {window_size: 100, rank_constant: 60}} in your search request alongside both a knn clause and a standard query clause. The engine handles fusion internally without requiring two separate API calls.
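As a rough illustration, the request body described above might be assembled like this. The helper `build_hybrid_request`, the field names `body` and `body_embedding`, the three-dimensional query vector, and the `k`, `num_candidates`, and `size` values are all assumptions. The `rank` and `window_size` names follow the 8.9-era request shape; later releases may have revised this API, so check the reference for your version.

```python
import json

def build_hybrid_request(query_text, query_vector,
                         text_field="body", vector_field="body_embedding"):
    """Build a search body with a lexical query, a knn clause, and RRF ranking."""
    return {
        # Lexical channel: a standard BM25-scored match query.
        "query": {"match": {text_field: query_text}},
        # Dense channel: approximate kNN over a dense_vector field.
        "knn": {
            "field": vector_field,
            "query_vector": query_vector,
            "k": 50,
            "num_candidates": 200,
        },
        # Fusion: rank-based, so no score normalization is configured.
        "rank": {"rrf": {"window_size": 100, "rank_constant": 60}},
        "size": 10,
    }

# A real query vector would come from your embedding model; this one is a placeholder.
body = build_hybrid_request("section 12.4(b) liability", [0.12, -0.03, 0.57])
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the JSON body of a single `_search` request, which is what lets the engine fuse both channels without a second round trip.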
You are advising an engineering team that needs to fuse BM25 and vector search results for a legal document retrieval system. They have no labeled relevance data yet. Work through the decision: RRF or weighted fusion? What k value? How would you validate the choice once data becomes available?
In 2021, researchers at Naver Labs Europe and Sorbonne Université published SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. The paper introduced a model that produces sparse vectors over the full vocabulary, like a weighted inverted index entry, but uses a transformer encoder to decide which vocabulary terms to activate and with what weight. A document about "heart attacks" would activate not just those tokens but also "myocardial infarction," "cardiac," and "chest pain," terms that appear in the vocabulary but not necessarily in the document itself. This is query and document expansion baked into the representation.
SPLADE builds on a BERT-class encoder. For each token position in the input, the model outputs a vector of logits over the full vocabulary (30,522 entries for BERT's WordPiece vocabulary). A ReLU activation zeroes out negative logits, and a log(1 + x) transformation compresses the scale. Max pooling over all positions then aggregates these per-token vocabulary vectors into a single sparse document representation.
The result is a vector in vocabulary-dimensional space (dim ≈ 30,000) where most values are exactly zero (true sparsity) but non-zero values reflect both the terms actually present and semantically related terms the encoder has learned to associate.
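The pooling step described above can be sketched without a real model. The example below applies log(1 + ReLU(x)) and max pooling to hand-written logits over a four-word vocabulary; the function name `splade_pool`, the vocabulary, and the logit values are invented for illustration, and a real encoder would produce one row of about 30,000 logits per input token.

```python
import math

def splade_pool(logits, vocab):
    """Turn per-position vocabulary logits into a sparse {term: weight} vector."""
    weights = {}
    for position in logits:
        for term, logit in zip(vocab, position):
            # log(1 + ReLU(logit)): negative logits become exactly zero.
            w = math.log1p(max(logit, 0.0))
            # Max pooling over positions; zero weights are never stored.
            if w > weights.get(term, 0.0):
                weights[term] = w
    return weights

# Hypothetical toy vocabulary and logits for the two-token input "heart attack".
vocab = ["heart", "attack", "cardiac", "banana"]
logits = [
    [3.0, 0.5, 1.2, -4.0],  # logits produced at the position of "heart"
    [0.8, 2.5, 0.9, -3.0],  # logits produced at the position of "attack"
]

sparse = splade_pool(logits, vocab)
print({term: round(w, 3) for term, w in sparse.items()})
```

The term `cardiac` receives a non-zero weight even though it does not occur in the input, which is the expansion effect, while `banana` has negative logits at every position and is dropped from the vector entirely.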
Dense vectors require approximate nearest neighbor (ANN) search — algorithms like HNSW or IVF-PQ that trade some recall for speed. ANN indexes are fast but not exact, and they scale poorly with very high dimensionality if density is uniform.
Sparse vectors, despite living in a 30,000-dimensional space, are stored and queried using inverted indexes, the same data structure underlying all traditional search engines. A document's SPLADE vector might have only 50–200 non-zero dimensions. Retrieval touches only the posting lists of the query's non-zero terms, which keeps it efficient at scale and lets it run on existing search-engine infrastructure rather than a separate ANN index.
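A small sketch shows why only a handful of posting lists are touched at query time. The functions `build_inverted_index` and `sparse_search`, and the document and query vectors, are hypothetical; scoring here is a plain dot product between sparse vectors.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def sparse_search(index, query_vec, top_k=10):
    """Score documents by dot product, visiting only the query terms' posting lists."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical sparse vectors, e.g. produced by a sparse encoder at index time.
doc_vectors = {
    "d1": {"heart": 1.4, "attack": 1.3, "cardiac": 0.8},
    "d2": {"cardiac": 1.1, "surgery": 1.5},
    "d3": {"banana": 1.2, "bread": 0.9},
}
index = build_inverted_index(doc_vectors)
results = sparse_search(index, {"cardiac": 1.0, "infarction": 0.7})
print(results)
```

The query has two non-zero terms, so the search reads at most two posting lists. Document `d3` is never visited, and the term `infarction` has no posting list at all, so it costs nothing.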
In 2022 benchmarks on the MSMARCO passage retrieval dataset, SPLADE-v2 achieved higher MRR@10 than BM25 while remaining fully compatible with inverted index infrastructure. Pinecone integrated SPLADE-style sparse vectors in their "hybrid indexes" feature launched in 2023, allowing users to store a dense vector and a sparse vector per document in the same index.
DocT5Query (Nogueira & Lin, 2019) takes a different approach: rather than learning sparse representations end-to-end, it uses a T5 model to generate synthetic queries that each document might answer, then appends those queries to the document before indexing with BM25. This is document expansion without a new index format — the output is still a plain inverted index.
DeepImpact (Mallia et al., 2021) learns term importance weights using a BERT model but stores them in a standard inverted index, replacing raw term frequency with learned impact scores. This achieves better ranking than BM25 without the full complexity of SPLADE's expansion.
Pinecone's sparse-dense hybrid index (2023) natively supports these patterns: each document stores both a dense embedding and a sparse vector (which can be generated by BM25, SPLADE, or any sparse encoder). A single query returns fused results from both index types.
SPLADE requires running inference at index time — every document must pass through the encoder to generate its sparse vector. For corpora of millions of documents, this is a non-trivial compute cost. Many teams pre-compute and cache SPLADE vectors, storing them alongside BM25 statistics. The Hugging Face model hub hosts several SPLADE variants under the naver/splade-* namespace with benchmarked performance on MSMARCO and BEIR.
For most production RAG systems, the practical question is whether to use SPLADE or standard BM25 as the sparse retrieval channel in a hybrid pipeline. SPLADE typically ranks more accurately than BM25 on general retrieval benchmarks, but it requires model inference at both index time and query time. BM25 needs no training, requires no GPU, and is easier to maintain.
A pragmatic approach adopted by several teams at major tech companies: start with BM25 as the sparse channel, measure recall on your actual query distribution, and introduce SPLADE only for query types where BM25 measurably underperforms. The cost of SPLADE inference must be justified by measurable retrieval quality gains on your specific data.
Your team is building a hybrid RAG system for a biomedical literature search engine. The corpus has 5 million PubMed abstracts. You need to choose between BM25 and SPLADE for the sparse retrieval channel. Consider: inference cost, domain vocabulary, query types your users submit, and how you would measure whether SPLADE is worth the overhead.
When Cohere launched their Rerank API in 2023, they documented a case study with a financial services firm whose RAG system was returning factually correct but poorly ranked passages: the most relevant excerpt was often at position 4 or 5 in a 5-result context window, where GPT-4 tended to weight it less heavily. Adding Cohere Rerank as a post-retrieval step reduced this issue substantially. The API performs cross-attention between the query and each retrieved passage and reorders the passages by relevance, so the LLM now consistently saw the most relevant passage first. Answer accuracy on the firm's internal benchmark improved by 17% without changing the retrieval index or the generation model.
Bi-encoders (the architecture behind dense retrieval) encode the query and each document independently, then compare their embeddings via cosine similarity. This independence is their speed advantage: you can pre-compute all document embeddings and run ANN search at query time without re-encoding documents. But independence is also their limitation — the model never sees the query and document together, so subtle interactions between query terms and document content are invisible.
Cross-encoders receive the query and a candidate document concatenated as a single input. The transformer's self-attention mechanism operates across both, enabling every query token to attend to every document token and vice versa. This produces a much more accurate relevance estimate — but it cannot be pre-computed. Every query-document pair requires a fresh forward pass.
The standard production pattern is a two-stage pipeline where speed and precision are separated into distinct components:
Stage 1, retrieval: BM25 and dense vector search, fused via RRF, retrieve the top 100 to 200 candidates from the full corpus. This stage is fast (ANN plus inverted index) and optimized for high recall: the truly relevant documents need to be somewhere in this set.
Stage 2, reranking: The cross-encoder scores each of the top-100 candidates against the query with full attention. This is slower (100 forward passes) but operates on a small set, making it tractable. Output is a reordered list of top-5 or top-10 candidates for the LLM context.
Context assembly: The top-K re-ranked passages are assembled into the LLM prompt in ranked order. Because the LLM sees the most relevant passage first, generation quality and faithfulness improve compared to randomly ordered or poorly ranked context.
Several production-ready reranker options are available as of 2024. The choice depends on latency budget, data privacy requirements, and the domain of your documents.
| Option | Type | Latency (50 docs) | Best For |
|---|---|---|---|
| Cohere Rerank v3 | API (managed) | ~200–400ms | General-purpose, enterprise RAG |
| Jina Reranker v2 | API + self-hostable | ~150–300ms | Long-context documents (8K tokens) |
| ms-marco-MiniLM-L-6-v2 | Open-source (HF) | ~80–150ms (GPU) | Self-hosted, cost-sensitive deployments |
| BGE-Reranker-Large | Open-source (HF) | ~200–350ms (GPU) | Multilingual, high-accuracy needs |
| FlashRank (CPU) | Open-source | ~20–50ms (CPU) | CPU-only deployments, edge inference |
The standard evaluation metric for reranking quality is NDCG@K (Normalized Discounted Cumulative Gain at K) — a measure that rewards placing highly relevant documents at higher positions with a logarithmic discount. A reranker that moves a highly relevant document from position 5 to position 1 produces a significant NDCG gain.
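The position-5-to-position-1 case can be checked numerically. This sketch uses the exponential-gain form of DCG, (2^rel − 1) / log2(position + 1), which is one common variant; the function names and the graded relevance labels are assumptions for illustration.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (exponential gain)."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 2)  # i is 0-based, so position = i + 1
        for i, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels (0 = irrelevant, 3 = highly relevant), in ranked order.
before_rerank = [0, 1, 0, 1, 3]  # the best passage sits at position 5
after_rerank = [3, 1, 0, 1, 0]   # the reranker moved it to position 1

ndcg_before = ndcg_at_k(before_rerank, 5)
ndcg_after = ndcg_at_k(after_rerank, 5)
print(round(ndcg_before, 3), round(ndcg_after, 3))
```

The set of retrieved passages is identical in both lists; only the order differs, which isolates the effect that a reranker has on the metric.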
In practice, many teams also measure faithfulness and answer accuracy on their RAG system — metrics that are downstream of retrieval. Cohere's 2023 case study cited a 17% answer accuracy improvement. A 2024 study by Arize AI on enterprise RAG systems found that adding a reranker improved faithfulness scores by 12–22% across tested domains, with legal and technical documentation showing the largest gains.
A 200ms re-ranking call on 100 candidates is acceptable in most enterprise search contexts (total query latency might be 400–600ms end-to-end including retrieval and generation). But for real-time applications, limit re-ranking to 25–50 candidates and use a smaller cross-encoder model. FlashRank on CPU can score 50 candidates in under 50ms, making it viable for latency-sensitive deployments.
A complete hybrid search pipeline with reranking integrates all components covered in Module 5:
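One way to picture how the pieces fit together is the sketch below. It is deliberately a toy: `toy_sparse`, `toy_dense`, and `toy_rerank` are stand-ins for a BM25 or SPLADE index, an ANN index, and a cross-encoder, and the corpus, function names, and candidate sizes are all hypothetical. Only the control flow (two retrieval channels, RRF, rerank, truncate) reflects the architecture described in this module.

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion over lists of doc IDs; returns fused IDs, best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return [d for d, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]

def hybrid_pipeline(query, sparse_search, dense_search, rerank_score,
                    n_candidates=100, n_context=5):
    """Retrieve from both channels, fuse with RRF, rerank, and keep the top passages."""
    sparse_hits = sparse_search(query, n_candidates)
    dense_hits = dense_search(query, n_candidates)
    candidates = rrf([sparse_hits, dense_hits])[:n_candidates]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:n_context]

corpus = {
    "c1": "clause 12.4(b) limits liability for indirect damages",
    "c2": "liability and indemnity are discussed in general terms",
    "c3": "termination notice periods for either party",
}

def toy_sparse(query, n):
    # Stand-in for the sparse channel: rank by number of shared tokens.
    q = set(query.split())
    scored = [(len(q & set(text.split())), d) for d, text in corpus.items()]
    return [d for s, d in sorted(scored, reverse=True) if s > 0][:n]

def toy_dense(query, n):
    # Stand-in for ANN search: a fixed ranking that favors the semantic neighbor.
    return ["c2", "c1", "c3"][:n]

def toy_rerank(query, doc_id):
    # Stand-in for a cross-encoder: fraction of query tokens found in the passage.
    q = query.split()
    return sum(t in corpus[doc_id].split() for t in q) / len(q)

context = hybrid_pipeline("12.4(b) liability", toy_sparse, toy_dense, toy_rerank,
                          n_context=2)
print(context)
```

Passing the three components in as functions keeps the orchestration independent of any particular search engine or reranker, so each stage can be swapped or benchmarked on its own.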
Hybrid search is the current state of the art for production RAG retrieval. BM25 handles exact terms. Dense embeddings handle semantic intent. SPLADE bridges both via learned sparse representations. RRF fuses the ranked lists without requiring score normalization. A cross-encoder reranker adds a precision layer that dramatically improves the quality of what the LLM sees. Each component has documented production evidence behind it — this is not theory.
You are the lead engineer designing a RAG system for an enterprise knowledge base with 500,000 internal documents (policies, technical specs, meeting notes). The system must return answers in under 1 second end-to-end. Design the complete retrieval pipeline: sparse channel, dense channel, fusion method, reranker choice, and candidate set sizes at each stage.