Module 3 · Lesson 1

Why Chunking Decides Everything

The unit of retrieval determines the quality of every answer your system will ever produce.

What actually happens inside a RAG pipeline when you slice a document into chunks — and why does the cut size matter so much?

When Notion launched its AI assistant in February 2023, internal engineering blog posts described an early embarrassment: the system would answer questions about meeting notes with confident but truncated responses — a decision made in one paragraph ignored because it had been split from its rationale in the next. The culprit was a naive 512-token fixed chunk with no overlap. The boundary fell mid-reasoning. The team rebuilt the chunker before GA launch.

The Retrieval Unit Problem

A RAG pipeline has three sequential operations: embed, retrieve, and generate. Embedding converts text into a vector. Retrieval finds the nearest vectors to a query vector. Generation feeds the retrieved text to a language model. The embedding and retrieval steps operate entirely on chunks — whatever unit of text you decided to split your documents into.

This creates a fundamental tension. Embedding models have a fixed input limit (256 word pieces by default for all-MiniLM-L6-v2, 512 tokens for many older BERT-based models, and roughly 8K tokens for newer ones like text-embedding-3-large). If your chunk exceeds that limit, the tail is truncated, often silently. If your chunk is far smaller than that limit, you waste representational capacity and force the retriever to return many fragments to cover a topic.

The generation model also has a context window, but that window is shared: system prompt, retrieved chunks, chat history, and the answer itself must all fit. Every token given to a chunk is a token taken from something else. Chunking is therefore not a preprocessing detail: it is a resource allocation decision made at indexing time that cannot be undone without re-indexing.
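The shared-window budget can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and all the token figures are hypothetical, chosen to show how chunk size caps the number of retrieved chunks that fit.

```python
def chunks_that_fit(context_window: int, system_prompt: int, chat_history: int,
                    answer_reserve: int, chunk_size: int) -> int:
    """Return how many retrieved chunks of `chunk_size` tokens fit in the window."""
    remaining = context_window - system_prompt - chat_history - answer_reserve
    if remaining <= 0:
        return 0
    return remaining // chunk_size


# Hypothetical budget: 8,192-token window, 500-token system prompt,
# 1,500 tokens of chat history, 1,000 tokens reserved for the answer.
budget = dict(context_window=8192, system_prompt=500,
              chat_history=1500, answer_reserve=1000)

print(chunks_that_fit(chunk_size=512, **budget))  # 5192 // 512 = 10
print(chunks_that_fit(chunk_size=256, **budget))  # 5192 // 256 = 20
```

Halving the chunk size doubles the number of slots, but each slot carries half as much text, so the total retrieved content stays fixed by the budget.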

Why This Matters in Production

LlamaIndex's 2023 benchmarking on the SQuAD and HotpotQA datasets showed that switching from 512-token fixed chunks to 256-token chunks with 64-token overlap improved exact-match retrieval recall by roughly 12 percentage points on multi-hop questions, simply because the relevant sentence was less likely to straddle a boundary.

The Three Failure Modes of Bad Chunking

Semantic fracture occurs when a logical unit of meaning — a cause-and-effect pair, a definition and its example, a question and its answer — is split across two chunks. The embedding of each fragment is weaker than the embedding of the whole because the model cannot attend to the completing context. A query about "why did X happen" will retrieve the effect chunk but not the cause chunk.

Context starvation occurs when chunks are so small that the generation model receives fragments without enough surrounding text to understand them. A 50-token chunk containing a pronoun reference ("it achieved 94% accuracy") is meaningless without knowing what "it" refers to.

Context flooding occurs when chunks are so large that a single retrieved chunk contains the answer to the query buried among thousands of tokens of unrelated material. The generation model may miss or misweight the relevant passage, especially as context length increases and attention becomes more diffuse.

Key Vocabulary

Chunk: A contiguous segment of text that is embedded and stored as a single unit in the vector index. The atomic unit of retrieval.

Chunk size: The maximum number of tokens (or characters, depending on implementation) in a single chunk. Controls the granularity of retrieval.

Overlap: The number of tokens shared between adjacent chunks. Prevents semantic fracture at boundaries by duplicating context.

Top-k: The number of chunks returned by the retriever for each query. Interacts directly with chunk size: larger chunks mean fewer slots needed, smaller chunks mean more slots needed.
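To see how chunk size and overlap interact, here is a minimal sliding-window splitter. It is a sketch under simplifying assumptions: the function name is hypothetical, and the "tokens" are just list items rather than the output of a real tokenizer.

```python
def split_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = split_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
# ['t0', 't1', 't2', 't3']
# ['t3', 't4', 't5', 't6']
# ['t6', 't7', 't8', 't9']
```

Each chunk begins with the last token of the previous one, so a thought that ends at a boundary still appears whole in at least one chunk, provided it is no longer than the overlap.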

The Practical Tradeoff

There is no universally optimal chunk size. The right size depends on: (1) the embedding model's native window, (2) the average length of a self-contained thought in your document corpus, (3) the generation model's context budget, and (4) the query type your users typically issue. These four factors must be measured — not guessed — using an evaluation harness you will build later in this course.

Lesson 1 Quiz

Why Chunking Decides Everything · 4 questions

What is the primary reason chunking decisions cannot easily be reversed once a RAG system is deployed?

Correct. Re-chunking requires running every document through the splitter again, regenerating embeddings, and repopulating the vector index — a full pipeline re-run. This is why the chunk size decision deserves careful upfront analysis.

Not quite. The core issue is that the vector index is built from the chosen chunks, so any change requires rebuilding it entirely. Documents remain intact in the source store.

Which failure mode occurs when a cause-and-effect pair is split across two adjacent chunks?

Correct. Semantic fracture is when a logically connected unit of meaning is split across a chunk boundary, weakening the embedding of each fragment and reducing the chance the retriever finds both halves together.

Not quite. Semantic fracture is the term for splitting a logical unit across a boundary. Context flooding is too-large chunks; context starvation is too-small ones.

What was the specific chunking flaw described in early testing of Notion AI's assistant?

Correct. The naive 512-token fixed chunk with no overlap meant that a decision stated in one paragraph was separated from its reasoning in the next, causing incomplete answers. Overlap was added before GA.

Not quite. The described problem was specifically a 512-token fixed chunk with zero overlap, causing reasoning to be split from its conclusion across chunk boundaries.

Why does chunk size interact with the "top-k" retrieval parameter?

Correct. If you use very small chunks (e.g., 128 tokens), you need more of them (higher top-k) to give the LLM enough context to answer well. But higher top-k means more tokens in the generation context — a direct tradeoff.

Not quite. The relationship is: smaller chunks hold less information per unit, so you need more of them (higher top-k) to cover a topic — but that eats more of the generation model's context window.

Lab 1 · Chunking Fundamentals

Explore the retrieval unit problem with your AI lab assistant

Lab Brief

In this lab you will work through chunking fundamentals with an AI assistant specialized in RAG systems. Practice analyzing chunk size tradeoffs, diagnosing the three failure modes, and reasoning about how chunk decisions affect retrieval quality.

Suggested opening: "I have a corpus of long-form research papers (avg 8,000 words). My embedding model supports 512 tokens. Walk me through how I should think about choosing a chunk size."

Module 3 · Lesson 2

Fixed-Size and Character Splitting

The simplest strategies — and exactly when they fail.

When is a naive character or token splitter actually the right tool, and what are its hidden costs?

When LangChain first published its CharacterTextSplitter in late 2022, it set a default chunk size of 1,000 characters with 200-character overlap. The defaults were chosen empirically across a small test set of Wikipedia articles and PDF reports. Within weeks, community threads on the LangChain Discord documented that these defaults produced poor results on legal contracts (where clause boundaries rarely align with character counts), scientific papers (where equations span unpredictable lengths), and conversational transcripts (where speaker turns vary wildly). The defaults became a starting point — not a recommendation.

Character Splitting

Character splitting divides text every N characters, optionally with M characters of overlap. It is the cheapest possible chunking strategy: O(n) in both time and memory, no tokenization required, no model calls. LangChain's CharacterTextSplitter and LlamaIndex's SimpleNodeParser both support it.

The fundamental problem is that characters and meaning are decoupled. English text has roughly 4-5 characters per token on average, but that average masks extreme variance. A chunk of 1,000 characters might contain 200 tokens (dense prose) or 350 tokens (code with short identifiers). For embedding models with hard token limits, character-based chunking provides no guarantee against silent truncation.
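The variance is easy to observe even without a real tokenizer. The sketch below uses a crude regex tokenizer (words and single punctuation marks) purely as a stand-in; it is an assumption for illustration and will not reproduce the counts of tiktoken or any other BPE tokenizer. Only the direction of the effect carries over.

```python
import re


def crude_tokens(text: str) -> list[str]:
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def chars_per_token(text: str) -> float:
    """Average number of characters per crude token (0.0 for empty text)."""
    tokens = crude_tokens(text)
    return len(text) / len(tokens) if tokens else 0.0


prose = "The retriever returns the nearest chunks for every incoming question."
code = "x=a[i]+b[j]*(c-d)"

print(round(chars_per_token(prose), 2))  # prose: several characters per token
print(round(chars_per_token(code), 2))   # code: close to one character per token
```

The same character budget therefore holds far more tokens of operator-heavy code than of prose, which is why a character limit cannot stand in for a token limit.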

# LangChain CharacterTextSplitter — basic usage
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n\n",   # split on paragraph breaks only; an oversized paragraph is not split further
    chunk_size=1000,     # characters, not tokens
    chunk_overlap=200,   # character overlap
    length_function=len  # swap for tiktoken to get token counts
)
chunks = splitter.split_text(document_text)
      

Token Splitting

Token splitting uses a tokenizer — typically tiktoken for OpenAI models, or the HuggingFace tokenizer matching your embedding model — to count actual tokens. This eliminates the character-to-token variance problem. LangChain's TokenTextSplitter wraps tiktoken; LlamaIndex's SentenceSplitter can also enforce token limits while preferring sentence boundaries.

Token splitting is more expensive than character splitting (tokenization is O(n) but with a larger constant), but it produces chunks that reliably fit within model windows. It is the right default for any production system whose embedding model enforces a hard token limit, such as OpenAI's embedding models.

# LangChain TokenTextSplitter — tiktoken-based
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # GPT-4 / text-embedding-3 tokenizer
    chunk_size=512,               # actual tokens
    chunk_overlap=64               # token overlap
)
chunks = splitter.split_text(document_text)
      

Overlap: The Overlap Tax

Overlap means every chunk boundary is duplicated. If you have 100 chunks of 512 tokens with 64-token overlap, you are storing 51,200 tokens to represent about 44,864 tokens of unique content (512 + 99 × 448), an overhead of roughly 14%. The tokens in each boundary region are stored twice, and at embedding time they are embedded twice at full inference cost.
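The overlap arithmetic for any configuration can be checked directly. This is a minimal sketch with a hypothetical function name, assuming every chunk is full and each adjacent pair shares exactly `overlap` tokens.

```python
def overlap_cost(n_chunks: int, chunk_size: int, overlap: int) -> dict:
    """Compare tokens stored in the index with the unique tokens they cover."""
    if n_chunks < 1 or not 0 <= overlap < chunk_size:
        raise ValueError("need at least one chunk and 0 <= overlap < chunk_size")
    stored = n_chunks * chunk_size
    # The first chunk is entirely new; each later chunk adds (chunk_size - overlap).
    unique = chunk_size + (n_chunks - 1) * (chunk_size - overlap)
    return {"stored": stored, "unique": unique, "overhead": stored / unique - 1}


cost = overlap_cost(n_chunks=100, chunk_size=512, overlap=64)
print(cost["stored"], cost["unique"], f"{cost['overhead']:.1%}")
# 51200 44864 14.1%
```

For long documents the overhead approaches overlap / (chunk_size - overlap), here 64 / 448, or about 14%.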

The practical guidance from Pinecone's 2023 developer documentation recommends overlap between 10–20% of chunk size for general-purpose corpora. Below 10%, semantic fracture risk rises sharply. Above 20%, the index becomes noisy because adjacent chunks are nearly identical, causing the retriever to return near-duplicate content for many queries.

Character Splitting — Use When

Prototyping speed matters more than precision
Corpus is uniform prose with consistent density
Embedding model has generous token limits (8k+)
You need zero external dependencies

Token Splitting — Use When

Embedding model has tight token window (≤512)
Corpus mixes code, math, and natural language
Silent truncation is unacceptable in production
You can afford tokenizer initialization overhead

The Metadata Gap

Neither character nor token splitting preserves document structure. A chunk has no inherent knowledge of which section, heading, or chapter it came from. Metadata injection — adding document title, section heading, and page number as a prepended string or as a separate metadata field — must be done explicitly at chunk creation time. This is critical for filtered retrieval and for citation generation.
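A minimal sketch of metadata injection at chunk creation time follows. The `Chunk` class, the `attach_metadata` helper, and the bracketed header format are all hypothetical choices for illustration, not the API of any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str                                     # the text that will be embedded
    metadata: dict = field(default_factory=dict)  # used for filtering and citations


def attach_metadata(text: str, title: str, section: str, page: int,
                    prepend: bool = True) -> Chunk:
    """Build a chunk that carries its provenance, optionally inside the text."""
    header = f"[{title} > {section}, p. {page}]"
    body = f"{header}\n{text}" if prepend else text
    return Chunk(text=body,
                 metadata={"title": title, "section": section, "page": page})


chunk = attach_metadata("It achieved 94% accuracy.",
                        title="Model Report", section="Results", page=7)
print(chunk.text)
print(chunk.metadata)
```

Prepending the header puts the provenance into the embedding itself, while the separate metadata field supports filtered retrieval and citations without altering the embedded text; which of the two to use is a design choice to evaluate on your corpus.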

Lesson 2 Quiz

Fixed-Size and Character Splitting · 4 questions

Why does character-based chunking provide no guarantee against silent truncation in embedding models?

Correct. English prose averages about 4–5 characters per token, but code, math, or heavily punctuated text can yield very different token densities. A 1,000-character chunk might be 200 tokens or 350 tokens — so character limits don't guarantee you stay under a model's token window.

Not quite. The variance in characters-per-token across text types (prose vs. code vs. math) means a fixed character budget doesn't translate to a reliable token budget. Token splitting solves this directly.

According to Pinecone's 2023 documentation, what overlap-to-chunk-size ratio range is recommended for general-purpose corpora?

Correct. Below 10%, semantic fracture risk rises sharply. Above 20%, adjacent chunks become near-duplicates, introducing noise into retrieval results.

Not quite. The recommended range is 10–20%. Below 10% risks semantic fracture; above 20% produces near-duplicate chunks that confuse the retriever.

Which tokenizer and encoding should LangChain's TokenTextSplitter be configured with to count tokens for OpenAI embedding models?

Correct. tiktoken with cl100k_base is the tokenizer used by GPT-4 and the text-embedding-3 model family. Using the matching tokenizer ensures your chunk token count matches what the embedding model actually sees.

Not quite. LangChain's TokenTextSplitter wraps tiktoken, and for OpenAI models the cl100k_base encoding (used by GPT-4 and text-embedding-3-large) is the correct choice.

What critical information is NOT preserved when using naive character or token splitting?

Correct. Neither character nor token splitting has any awareness of document structure. Section headings, chapter boundaries, and page numbers must be injected explicitly as metadata at chunk creation time — they are not automatically preserved.

Not quite. The key gap is document structure: section headings, chapter context, and page numbers are discarded by naive splitting. Metadata must be added explicitly during the chunking step.

Lab 2 · Character vs. Token Splitting

Diagnose splitting problems and configure splitters correctly

Lab Brief

Practice diagnosing problems with character and token splitting configurations. Your assistant understands LangChain and LlamaIndex splitter APIs, overlap math, and when to switch strategies. Work through a real configuration challenge.

Suggested opening: "My corpus includes Python code files and markdown documentation. Some code files have very short lines, others have long ones. I'm using CharacterTextSplitter with 1000-char chunks. What problems should I expect and how do I fix them?"

Module 3 · Lesson 3

Semantic and Structural Chunking

Splitting on meaning instead of character count — and using the document's own structure as your guide.

How do you make chunk boundaries align with where ideas actually end rather than where a token counter happens to stop?

Anthropic's Claude team noted in their 2023 model card and associated technical discussions that retrieval-augmented generation performance on long legal documents improved substantially when chunking followed section-level structure rather than token budgets. Legal documents have a natural hierarchy — article → section → subsection → paragraph — and chunk boundaries that violated that hierarchy forced the model to reason across incomplete contractual clauses. The improvement was achieved not through a new model but through a better document parser that read PDF section markers.

Structural Chunking

Structural chunking uses the document's own organization as the splitting signal. For HTML, this means splitting on <h2> or <h3> tags. For Markdown, on ## headings. For PDFs with tagged structure, on section markers. For code, on function or class definitions. LangChain's HTMLHeaderTextSplitter and MarkdownHeaderTextSplitter implement this approach, attaching the header hierarchy as metadata to each chunk.

The key advantage is that structural chunks tend to be semantically coherent by design — the document author already decided that this heading covers this material. The disadvantage is variance: a section headed "Introduction" might be 200 words or 3,000 words. You typically combine structural splitting with a max-size fallback — split on headings first, then apply token splitting within any section that exceeds your budget.
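The "split on headings first, then fall back to a size limit" pattern can be sketched in plain Python. This is an illustrative sketch, not how any library implements it: the function name is hypothetical, word counts stand in for token counts, and headings inside code fences are not handled.

```python
import re


def split_markdown(markdown: str, max_words: int) -> list[dict]:
    """Split on #, ##, ### headings, then cap each section at `max_words` words."""
    sections = []            # list of (heading, lines) pairs
    heading, lines = None, []
    for line in markdown.splitlines():
        if re.match(r"#{1,3} ", line):
            if heading is not None or lines:
                sections.append((heading, lines))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    sections.append((heading, lines))

    chunks = []
    for heading, lines in sections:
        words = " ".join(lines).split()
        # Fallback: a section larger than the budget is cut into fixed-size pieces.
        for start in range(0, len(words), max_words):
            chunks.append({"heading": heading,
                           "text": " ".join(words[start:start + max_words])})
    return chunks


doc = "# Intro\nshort text here\n## Details\n" + " ".join(f"w{i}" for i in range(7))
for chunk in split_markdown(doc, max_words=3):
    print(chunk)
```

The short "Intro" section stays whole, while the seven-word "Details" section is cut into three pieces that all keep the same heading as metadata.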

# LangChain MarkdownHeaderTextSplitter
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "H1"),
    ("##", "H2"),
    ("###", "H3"),
]
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False  # keep heading in chunk for context
)
md_splits = splitter.split_text(markdown_document)
# each split includes metadata: {'H1': 'Title', 'H2': 'Section Name'}
      

Semantic Chunking

Semantic chunking detects meaning shifts in the text and places chunk boundaries where topic transitions occur, rather than at fixed intervals. The approach, popularized in part by Greg Kamradt's 2023 "5 Levels of Text Splitting" notebook (which received wide circulation in the LangChain community), uses sentence embeddings to measure cosine similarity between adjacent sentences. When similarity drops below a threshold, a boundary is placed.

The algorithm: (1) split text into sentences using a sentence boundary detector, (2) embed each sentence, (3) compute rolling cosine similarity between sentence n and sentence n+1, (4) identify breakpoints where similarity drops sharply (a "valley" in the similarity signal), (5) merge consecutive sentences between breakpoints into a chunk.
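The steps above can be sketched end to end with toy components. Everything here is a simplifying assumption: the embedding is a bag-of-words count vector rather than a learned sentence embedding, the sentence splitter is a regex, and breakpoints use a fixed similarity threshold instead of valley detection.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real sentence embedding)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, threshold: float) -> list[str]:
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append(" ".join(current))  # similarity dropped: close the chunk
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


text = ("The cache stores embeddings. The cache evicts old embeddings. "
        "Bananas are rich in potassium. Bananas ripen quickly.")
for chunk in semantic_chunks(text, threshold=0.2):
    print(chunk)
```

With this text the only boundary falls between the second and third sentences, where the two sentences share no words and similarity is zero.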

LangChain added SemanticChunker (in the langchain_experimental package) in early 2024. It takes an embedding model as an argument (OpenAI embeddings in most examples) and uses percentile-based breakpoint detection by default. The critical hyperparameter is the breakpoint threshold type: percentile (break where the similarity drop between adjacent sentences is above the given percentile of all drops), standard deviation (break where similarity drops more than N standard deviations below the mean), or interquartile (break at the IQR boundary).

# LangChain SemanticChunker (2024+)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "std_dev", "interquartile"
    breakpoint_threshold_amount=95         # top 5% largest drops = boundaries
)
chunks = chunker.create_documents([document_text])
      

Tradeoffs: A Comparison

Strategy	Coherence	Cost	Chunk Size Control	Best Corpus Type
Character Split	Low	Negligible	Approximate	Uniform prose, prototyping
Token Split	Low–Medium	Low	Exact	Mixed content, production default
Structural Split	High	Low–Medium	Variable (needs fallback)	Docs, wikis, legal, academic papers
Semantic Split	Highest	High (embedding calls)	Emergent (unpredictable)	Narrative, transcripts, unstructured prose

The Hybrid Rule

Most production systems at scale use a hierarchy of splitting: structural first (headings, sections), token-limited fallback within sections, and metadata injection at each level. Pure semantic chunking is expensive at indexing time — every sentence requires an embedding call — and produces variable-length chunks that complicate batch processing. Reserve it for corpora that lack structure and where retrieval quality is highly sensitive to topic coherence.

Lesson 3 Quiz

Semantic and Structural Chunking · 4 questions

What signal does semantic chunking use to detect where to place chunk boundaries?

Correct. Semantic chunking embeds each sentence, computes rolling cosine similarity between adjacent sentences, and places chunk boundaries at "valleys" — points where similarity drops sharply, indicating a topic transition.

Not quite. Semantic chunking measures cosine similarity between adjacent sentence embeddings. When similarity drops below a threshold (a "valley"), it signals a topic shift and a chunk boundary is placed there.

What is the main disadvantage of pure semantic chunking for large-scale production indexing?

Correct. The embedding-per-sentence cost is the dominant issue. For a corpus with millions of sentences, this means millions of embedding API calls at indexing time. The variable output sizes also complicate parallelization and batch sizing.

Not quite. The core problem is the per-sentence embedding cost at indexing time and the unpredictable output chunk sizes. For large corpora, this makes pure semantic chunking very expensive compared to structural or token splitting.

When using LangChain's MarkdownHeaderTextSplitter with strip_headers=False, what advantage does keeping the header provide?

Correct. Including the heading in the chunk means the embedding captures both the section title and the content. A query like "explain the authentication section" can match the heading token directly, improving recall for section-level queries.

Not quite. Keeping the header means it becomes part of the embedded text. This improves retrieval for queries that reference section names, since the heading tokens are present in the chunk's embedding.

Which chunking strategy is most appropriate for a corpus of 5,000 podcast transcripts with no structural markup?

Correct. Transcripts are unstructured prose with natural topic flows — exactly the use case where semantic chunking earns its cost. Without headings or explicit structure, embedding-based boundary detection is the most principled approach to finding coherent chunk boundaries.

Not quite. Transcripts lack document structure, so structural splitting has nothing to work with. Semantic chunking, despite its cost, is the best fit because topic shifts in conversation are detectable through embedding similarity drops.

Lab 3 · Semantic & Structural Chunking

Choose the right strategy for real corpus types

Lab Brief

Work with your assistant to reason through structural and semantic chunking decisions for different corpus types. Practice configuring LangChain's MarkdownHeaderTextSplitter and SemanticChunker, and learn when to hybridize strategies.

Suggested opening: "I have a knowledge base of Confluence pages with nested headers (H1, H2, H3) and some pages have tables and code blocks. What chunking strategy should I use and how do I handle the tables?"

Module 3 · Lesson 4

Advanced Techniques: Propositions, Parents, and Evaluation

From chunk retrieval to proposition retrieval — and how to actually measure whether your strategy is working.

What happens when you push beyond basic chunking — and how do you know if any of it is actually improving your answers?

The paper "Dense X Retrieval: What Retrieval Granularity Should We Use?" (Chen et al., 2023, University of Washington) introduced proposition-based retrieval as an alternative to chunk-based retrieval. Rather than indexing fixed text segments, the approach decomposes documents into atomic factual statements — propositions — each expressing a single verifiable claim. A document about climate change would yield hundreds of propositions like "Global average temperatures rose 1.1°C above pre-industrial levels between 1850 and 2020." Each proposition is embedded independently. On the FEVER fact verification and Natural Questions benchmarks, proposition retrieval outperformed 100-token chunk retrieval by 11–15 percentage points in recall@5.

Proposition-Based Chunking

A proposition is an atomic, self-contained factual statement extracted from a source document using an LLM. The extraction prompt instructs the model to decompose each paragraph into its constituent facts, rephrasing each as a complete, standalone sentence that requires no surrounding context to interpret.

The advantage is retrieval precision: a query about a specific fact will find the proposition that directly states that fact, rather than a 512-token chunk that contains it somewhere in the middle. The disadvantage is cost: generating propositions requires an LLM call for every paragraph in your corpus, and the resulting index contains far more documents than a chunk-based index (more small documents = more retrieval candidates = potentially noisier results at the top-k boundary).

LlamaIndex added a PropositionNodeParser in 2024, using a configurable LLM to extract propositions and store both the proposition and a reference back to its parent chunk for context retrieval.

Parent-Child Chunking (Small-to-Big)

Parent-child chunking, sometimes called "small-to-big retrieval," addresses a core tension: you want small chunks for precise retrieval but large chunks for rich generation context. The solution: index small child chunks for retrieval, but when a child chunk is retrieved, return its parent chunk to the generation model.

LangChain's ParentDocumentRetriever implements this pattern. At indexing time, each document is split into large parent chunks (e.g., 1,024 tokens), and each parent chunk is split further into small child chunks (e.g., 256 tokens). Only the child embeddings are stored in the vector index. At query time, the top-k child chunks are retrieved, their parent IDs are looked up in a docstore, and the full parent chunks are passed to the generation model.
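The mechanism itself can be sketched without any library. In this illustrative sketch the function names are hypothetical, a plain dict plays the role of the docstore, and word overlap with the query stands in for vector similarity.

```python
def build_index(parents: list[str], child_words: int):
    """Store parents by ID and split each into small children that point back."""
    docstore = dict(enumerate(parents))  # parent_id -> full parent text
    children = []                        # (parent_id, set of words in the child)
    for pid, parent in docstore.items():
        words = parent.split()
        for start in range(0, len(words), child_words):
            piece = words[start:start + child_words]
            children.append((pid, {w.lower().strip(".,") for w in piece}))
    return docstore, children


def retrieve_parents(query: str, docstore: dict, children: list, k: int = 2) -> list[str]:
    """Rank children against the query, then return their de-duplicated parents."""
    q = set(query.lower().split())
    ranked = sorted(children, key=lambda c: len(q & c[1]), reverse=True)[:k]
    parent_ids = []
    for pid, _ in ranked:
        if pid not in parent_ids:
            parent_ids.append(pid)
    return [docstore[pid] for pid in parent_ids]


parents = [
    "The team chose Postgres. The reason was strong transactional guarantees.",
    "Lunch is served at noon. The menu changes weekly.",
]
docstore, children = build_index(parents, child_words=4)
result = retrieve_parents("why postgres", docstore, children, k=1)
print(result[0])
```

The query matches only the small child containing "Postgres", but the returned parent also includes the sentence giving the reason, which the child alone would have lost.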

# LangChain ParentDocumentRetriever pattern
# Assumes `vectorstore` (any LangChain vector store) and `docs` (a list of Documents) already exist.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size here is measured in characters, not tokens
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter  = RecursiveCharacterTextSplitter(chunk_size=400)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,    # searches child embeddings
    docstore=InMemoryStore(),   # stores parent chunks
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)
retriever.add_documents(docs)  # indexes children, stores parents
# retriever.get_relevant_documents(query) returns parent chunks
# (newer LangChain releases use retriever.invoke(query) instead)
      

Evaluating Chunking Strategies

The only reliable way to choose between chunking strategies is empirical evaluation using a question-answer test set derived from your actual corpus. The standard approach: generate 50–200 question-answer pairs from your documents (using an LLM or manually), run your RAG pipeline on each question, and measure:

Retrieval recall@k — for each ground-truth answer, does the relevant chunk appear in the top-k retrieved chunks? This measures whether the right material is being found at all.

Answer faithfulness — using an LLM judge or a framework like RAGAS (open-source, released by the Exploding Gradients team in 2023), measure whether the generated answer is fully supported by the retrieved chunks, with no hallucinated additions.

Answer relevancy — does the generated answer actually address the question asked? RAGAS computes this by regenerating the question from the answer and measuring semantic similarity to the original query.
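Of the three metrics, retrieval recall@k is simple enough to compute by hand. The sketch below assumes a simplified test set with exactly one relevant chunk ID per question; the function name and the chunk IDs are hypothetical.

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of questions whose relevant chunk appears in the top-k results."""
    if not relevant or len(retrieved) != len(relevant):
        raise ValueError("need one ranked result list per question")
    hits = sum(1 for ids, gold in zip(retrieved, relevant) if gold in ids[:k])
    return hits / len(relevant)


# Ranked chunk IDs returned for three test questions (best match first).
retrieved = [["c3", "c1", "c9"], ["c2", "c7", "c4"], ["c5", "c6", "c8"]]
# The chunk that actually contains each answer.
relevant = ["c1", "c4", "c0"]

print(recall_at_k(retrieved, relevant, k=2))  # 1 of 3 questions hit
print(recall_at_k(retrieved, relevant, k=3))  # 2 of 3 questions hit
```

Running the same test set against two chunking configurations and comparing recall at a fixed k gives a direct, like-for-like measure of which one surfaces the right material more often.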

RAGAS: Real-World Baseline Numbers

In the RAGAS paper (Es et al., 2023), testing on a sample of 800 questions across Wikipedia, financial reports, and ArXiv abstracts, switching from 512-token chunks to parent-child retrieval (256-token children, 1024-token parents) improved answer faithfulness from 0.71 to 0.84 and answer relevancy from 0.74 to 0.81 on their scoring scale. These are not universal — they depend heavily on corpus and query type — but they establish that the choice of chunking strategy has measurable, significant impact on end-to-end answer quality.

Choosing Your Strategy: A Decision Framework

Start with token splitting (512 tokens, 10–15% overlap) as your baseline. This is cheap, deterministic, and gives you a measurement floor.

If your corpus has structure (headers, sections, chapters), add structural splitting as the primary split with token splitting as a fallback within large sections. Measure retrieval recall@5 improvement over baseline.

If your corpus is unstructured prose (transcripts, narratives, support tickets), test semantic chunking on a representative sample. If recall@5 improves by more than 5 percentage points over token splitting, the embedding cost is likely justified.

If answer faithfulness is the primary concern and your generation model is returning hallucinations, switch to parent-child retrieval. The larger parent context typically reduces hallucination by giving the model more complete information.

If your queries are highly specific factual lookups and your corpus is small enough to afford LLM-based indexing, proposition extraction will give you the highest precision retrieval of any method.

Lesson 4 Quiz

Advanced Chunking Techniques · 4 questions

In proposition-based chunking, what is a "proposition"?

Correct. A proposition is an atomic factual statement — a complete, standalone sentence that expresses exactly one verifiable claim and requires no external context to interpret. They are generated by prompting an LLM to decompose paragraphs into their constituent facts.

Not quite. Propositions are LLM-generated, atomic factual statements. Each expresses a single complete claim that is self-contained — it doesn't need the surrounding text to make sense. This is what makes them so precise for retrieval.

In parent-child chunking, what is stored in the vector index versus what is returned to the generation model?

Correct. This is the core insight of small-to-big retrieval: small child chunks are embedded and searched because small = precise retrieval. But the model needs more context to generate a good answer, so the larger parent chunk is returned instead of the child fragment.

Not quite. The pattern is: child embeddings in the vector index (for precise retrieval), parent chunks in a docstore (for rich generation context). The query finds the right child; the system upgrades to the parent before sending to the LLM.

What does RAGAS's "answer faithfulness" metric specifically measure?

Correct. Faithfulness measures grounding — every claim in the answer must be supported by the retrieved context. A high faithfulness score means the model is not fabricating information beyond what the chunks provide.

Not quite. Faithfulness is about grounding: does every statement in the generated answer have support in the retrieved chunks? It's a hallucination measure, not a fluency or recall measure.

According to the decision framework in Lesson 4, when is proposition extraction most justified despite its high indexing cost?

Correct. Proposition extraction shines for precise factual retrieval on manageable-size corpora. The LLM indexing cost is only justifiable when the retrieval precision gain matters more than cost, which is true for specific factual QA on bounded datasets.

Not quite. The framework recommends propositions specifically for highly specific factual lookup queries on corpora small enough that the LLM-per-paragraph indexing cost is affordable. It's a precision-over-cost tradeoff.

Lab 4 · Advanced Chunking & Evaluation

Design proposition retrieval, parent-child patterns, and evaluation harnesses

Lab Brief

Work through advanced chunking design challenges with your assistant. Practice reasoning about proposition extraction tradeoffs, parent-child retriever configuration, and setting up a RAGAS evaluation harness for your chunking strategy.

Suggested opening: "I'm building a RAG system for a medical reference database — about 50,000 clinical guideline paragraphs. My users ask very specific factual questions like 'What is the first-line treatment for X?' Should I use proposition extraction or parent-child chunking, and how do I evaluate which is better?"

RAG Lab Assistant

Welcome to Lab 4. I'm here to help you design advanced chunking architectures and evaluation pipelines. Ask me about proposition extraction prompts, ParentDocumentRetriever configuration, RAGAS setup, or how to build a QA test set for your corpus. What are you building?

Module 3 · Test

Chunking Strategies — 15 questions · 80% to pass

1. Which of the following best explains why chunking is called a "resource allocation decision"?

Correct.

The key insight is that the generation model has a fixed context window shared by all inputs; chunk size determines how many tokens of that window are consumed by retrieved content.
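The allocation can be made concrete with a little arithmetic. The sketch below is only illustrative: the window size and the amounts reserved for the system prompt, chat history, and answer are assumed values, not figures from the lesson.

```python
def chunks_that_fit(context_window, system_tokens, history_tokens,
                    answer_reserve, chunk_tokens):
    """Return how many retrieved chunks fit in the remaining token budget."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    # Everything that is not a retrieved chunk is subtracted first.
    budget = context_window - system_tokens - history_tokens - answer_reserve
    return max(budget, 0) // chunk_tokens


# Hypothetical 8,192-token window: 500 system, 1,500 history, 1,000 answer.
print(chunks_that_fit(8192, 500, 1500, 1000, 512))  # 10
print(chunks_that_fit(8192, 500, 1500, 1000, 256))  # 20
```

Halving the chunk size doubles the number of chunks that fit, but each one carries less surrounding context, which is the tradeoff the question is probing.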

2. What does "context starvation" mean as a RAG chunking failure mode?

Correct.

Context starvation is the too-small-chunk failure mode: fragments retrieved without enough surrounding context to be interpretable (e.g., a pronoun with no referent).

3. LlamaIndex benchmarking showed that switching from 512-token fixed chunks to 256-token chunks with 64-token overlap improved exact-match retrieval recall on multi-hop questions by approximately how much?

Correct.

LlamaIndex benchmarking on SQuAD and HotpotQA showed an improvement of roughly 12 percentage points from this relatively simple change: meaningful, but not dramatic.

4. What is the approximate character-to-token ratio for average English prose?

Correct.

English prose averages approximately 4–5 characters per token with tiktoken's cl100k_base encoding, but this varies significantly with text type.
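For quick sizing before a tokenizer is wired in, the ratio can be turned into a rough estimator. This is a sketch under the assumption of 4.5 characters per token (the midpoint of the range above); the function names are hypothetical, and exact counts still require running the real tokenizer.

```python
import math


def estimate_tokens(text, chars_per_token=4.5):
    """Rough token estimate from character count; not a tokenizer."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


def max_chars_for_budget(token_budget, chars_per_token=4.5):
    """Approximate character length that fits a given token budget."""
    return int(token_budget * chars_per_token)


print(estimate_tokens("a" * 450))   # 100
print(max_chars_for_budget(512))    # 2304
```

Because the ratio shifts with text type, an estimate like this is best treated as a sanity check rather than a limit to index against.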

5. Which splitter in LangChain specifically attaches header hierarchy as metadata to each chunk?

Correct.

MarkdownHeaderTextSplitter splits on Markdown headers and attaches the header hierarchy (H1, H2, H3) as metadata to each resulting chunk.

6. What storage structure does LangChain's ParentDocumentRetriever use to hold the large parent chunks?

Correct.

ParentDocumentRetriever uses a docstore (a simple key-value store like InMemoryStore) for parent chunks, keeping them separate from the vector index which only holds child embeddings.
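The split between the two stores can be sketched in plain Python. This is not LangChain's implementation: the helper names are hypothetical, whitespace-separated words stand in for tokens, and a word-overlap score stands in for embedding similarity.

```python
def split_words(text, size):
    """Split text into consecutive pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(documents, parent_size=40, child_size=10):
    """Return (child_index, docstore).

    child_index: list of (child_text, parent_id), standing in for the vector index.
    docstore: dict of parent_id -> parent_text, standing in for the key-value store.
    """
    child_index, docstore = [], {}
    for doc_id, doc in enumerate(documents):
        for p_num, parent in enumerate(split_words(doc, parent_size)):
            parent_id = f"{doc_id}-{p_num}"
            docstore[parent_id] = parent
            for child in split_words(parent, child_size):
                child_index.append((child, parent_id))
    return child_index, docstore


def toy_score(query, text):
    """Word-overlap score; a placeholder for embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, child_index, docstore, k=2):
    """Rank children by score, then return their de-duplicated parents."""
    ranked = sorted(child_index, key=lambda pair: toy_score(query, pair[0]),
                    reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]
```

Only the small child texts are ever scored against the query; the parent text is looked up by ID afterwards, which is why the parents never need embeddings of their own.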

7. The paper "Dense X Retrieval" (Chen et al., 2023) reported that proposition retrieval outperformed 100-token chunk retrieval by how much in recall@5 on FEVER and Natural Questions?

Correct.

The Dense X Retrieval paper reported 11–15 percentage point improvements in recall@5 over 100-token chunks on these benchmarks.

8. What is the recommended overlap-to-chunk-size ratio range to avoid both semantic fracture and near-duplicate chunk noise?

Correct.

10–20% is the Pinecone-recommended range. Below 10% risks boundary fracture; above 20% produces near-duplicate chunks that add retrieval noise.
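A sliding-window splitter shows how the ratio translates into a step size. The sketch below works on any pre-tokenized list and assumes a 15% default overlap; the function name is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size, overlap_ratio=0.15):
    """Split a token list into fixed-size chunks that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


chunks = chunk_with_overlap(list(range(20)), chunk_size=10, overlap_ratio=0.2)
print(len(chunks))  # 3
```

With a 512-token chunk, a 15% ratio gives a 76-token overlap; raising the ratio shrinks the step, so the chunk count and the share of duplicated text grow together.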

9. Semantic chunking using the percentile breakpoint method places boundaries where:

Correct.

Percentile-based semantic chunking identifies the largest similarity drops between adjacent sentences (those in the bottom X% of similarity scores) as boundary points — these represent the most significant topic shifts.
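The boundary selection itself is simple once the adjacent-sentence similarities exist. In this sketch the similarity scores are passed in directly (in practice they would come from embedding each sentence), the percentile uses a nearest-rank rule, and the function names and the 20% default are assumptions for illustration.

```python
import math


def percentile_breakpoints(similarities, bottom_percent=20):
    """Return indices i where a boundary falls between sentence i and i + 1."""
    if not similarities:
        return []
    if not 0 < bottom_percent <= 100:
        raise ValueError("bottom_percent must be in (0, 100]")
    ranked = sorted(similarities)
    # Nearest-rank percentile: the largest score still inside the bottom X%.
    rank = max(1, math.ceil(len(ranked) * bottom_percent / 100))
    threshold = ranked[rank - 1]
    return [i for i, score in enumerate(similarities) if score <= threshold]


def group_sentences(sentences, similarities, bottom_percent=20):
    """Join sentences into chunks, cutting at the low-similarity breakpoints."""
    if len(similarities) != len(sentences) - 1:
        raise ValueError("need one similarity per adjacent sentence pair")
    breaks = set(percentile_breakpoints(similarities, bottom_percent))
    chunks, current = [], []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        if i in breaks:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, with similarities `[0.9, 0.85, 0.2, 0.88, 0.91]` across six sentences, the single score in the bottom 20% is 0.2, so the only boundary falls after the third sentence.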

10. What RAGAS metric measures whether the generated answer addresses the question that was actually asked?

Correct.

Answer relevancy measures whether the answer addresses the original question. It is computed by regenerating candidate questions from the answer and checking their semantic similarity to the original. Faithfulness, by contrast, measures whether the answer's claims are grounded in the retrieved context.

11. Why does structural chunking typically require a token-based fallback splitter?

Correct.

Section size variance is the key issue — an "Introduction" might be 200 words or 3,000. Structural splitting gets coherent boundaries but needs token-based splitting within oversized sections to stay under model limits.
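A minimal sketch of the two-stage approach follows. It assumes Markdown input where every line starting with `#` is a header, uses whitespace-separated words as a stand-in for tokens, and uses hypothetical function names; a production version would count real tokens and ignore `#` lines inside code fences.

```python
def split_markdown_sections(text):
    """Split on lines starting with '#'; return (header, body) pairs."""
    sections, header, body = [], "", []
    for line in text.splitlines():
        if line.startswith("#"):
            if header or body:
                sections.append((header, "\n".join(body).strip()))
            header, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if header or body:
        sections.append((header, "\n".join(body).strip()))
    return sections


def structural_chunks(text, max_tokens=512):
    """Keep each section whole unless it exceeds max_tokens, then split by size."""
    chunks = []
    for header, body in split_markdown_sections(text):
        words = body.split()  # whitespace words stand in for tokens
        # Fallback: oversized sections are cut into max_tokens-sized pieces.
        for start in range(0, len(words), max_tokens):
            piece = " ".join(words[start:start + max_tokens])
            chunks.append({"header": header, "text": piece})
    return chunks
```

Each piece keeps its section header as metadata, so a fragment produced by the fallback split still records which section it came from.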

12. In the RAGAS paper evaluation, switching to parent-child retrieval from 512-token chunks improved answer faithfulness from approximately 0.71 to:

Correct.

The RAGAS paper reported an improvement in faithfulness from 0.71 to 0.84 when switching to parent-child retrieval (256-token children, 1024-token parents) on its mixed-corpus test set.

13. What is the primary cost disadvantage of proposition-based indexing compared to token-based chunking?

Correct.

The indexing-time LLM cost is the core issue. Every paragraph needs an LLM call to decompose it into propositions — at scale, this is orders of magnitude more expensive than pure embedding-based chunking.

14. Since its 2022 release, which string has LangChain's CharacterTextSplitter used as its default separator?

Correct.

LangChain's CharacterTextSplitter defaults to a double newline (\n\n) as its separator, so it splits on paragraph boundaries. It uses only that one separator; falling back through progressively finer separators is the behavior of RecursiveCharacterTextSplitter.

15. According to the Lesson 4 decision framework, which chunking strategy should you begin with as a measurement baseline?

Correct.

The framework recommends starting with token splitting (512 tokens, 10–15% overlap) as a cheap, deterministic baseline. You then measure alternatives against this floor before investing in more expensive strategies.