When Notion launched its AI assistant in February 2023, internal engineering blog posts described an early embarrassment: the system would answer questions about meeting notes with confident but truncated responses, such as a decision made in one paragraph being ignored because it had been split from its rationale in the next. The culprit was a naive 512-token fixed chunk with no overlap. The boundary fell mid-reasoning. The team rebuilt the chunker before GA launch.
A RAG pipeline has three sequential operations: embed, retrieve, and generate. Embedding converts text into a vector. Retrieval finds the nearest vectors to a query vector. Generation feeds the retrieved text to a language model. The embedding and retrieval steps operate entirely on chunks: whatever unit of text you decided to split your documents into.
This creates a fundamental tension. Embedding models have a fixed context window: a few hundred tokens for older models (all-MiniLM-L6-v2 truncates input at 256 word pieces by default, and many BERT-based encoders cap at 512), up to roughly 8,000 tokens for newer ones like text-embedding-3-large. If your chunk exceeds that window, the tail is typically truncated silently. If your chunk is far smaller than that window, you waste representational capacity and force the retriever to return many fragments to cover a topic.
The generation model also has a context window, but that window is shared: system prompt, retrieved chunks, chat history, and the answer itself must all fit. Every token given to a chunk is a token taken from something else. Chunking is therefore not a preprocessing detail. It is a resource allocation decision made at indexing time that cannot be undone without re-indexing.
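To make the shared budget concrete, here is a minimal sketch of the arithmetic. The window size and the reserves for system prompt, chat history, and answer are illustrative assumptions, not recommendations.

```python
def max_chunks_in_context(context_window, system_prompt, chat_history,
                          answer_reserve, chunk_size):
    """Return how many retrieved chunks of `chunk_size` tokens still fit."""
    remaining = context_window - system_prompt - chat_history - answer_reserve
    return max(remaining, 0) // chunk_size


# Hypothetical budget, all values in tokens.
budget = dict(context_window=8192, system_prompt=400,
              chat_history=1500, answer_reserve=1000)

for size in (256, 512, 1024):
    print(size, max_chunks_in_context(chunk_size=size, **budget))
# 256 20
# 512 10
# 1024 5
```

Doubling the chunk size halves the number of distinct passages the generator can see, which is why the decision has to be weighed at indexing time.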
LlamaIndex's 2023 benchmarking on the SQuAD and HotpotQA datasets showed that switching from 512-token fixed chunks to 256-token chunks with 64-token overlap improved exact-match retrieval recall by roughly 12 percentage points on multi-hop questions, simply because the relevant sentence was less likely to straddle a boundary.
Semantic fracture occurs when a logical unit of meaning (a cause-and-effect pair, a definition and its example, a question and its answer) is split across two chunks. The embedding of each fragment is weaker than the embedding of the whole because the model cannot attend to the completing context. A query about "why did X happen" may then retrieve the effect chunk but not the cause chunk.
Context starvation occurs when chunks are so small that the generation model receives fragments without enough surrounding text to understand them. A 50-token chunk containing a pronoun reference ("it achieved 94% accuracy") is meaningless without knowing what "it" refers to.
Context flooding occurs when chunks are so large that a single retrieved chunk contains the answer to the query buried among thousands of tokens of unrelated material. The generation model may miss or misweight the relevant passage, especially as context length increases and attention becomes more diffuse.
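The first of these failure modes, semantic fracture, is easy to reproduce on a toy example. The sketch below counts words instead of tokens (an assumption made to keep it dependency-free) and uses a hypothetical four-sentence note. With no overlap, the boundary lands inside the cause sentence; with a 6-word overlap, one chunk holds both the effect and its cause.

```python
def window(words, size, overlap=0):
    """Split a word list into chunks of `size` words sharing `overlap` words."""
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


EFFECT = "The deploy failed on Tuesday."
CAUSE = "The cause was an expired certificate."
note = (f"Notes from the weekly ops sync. {EFFECT} {CAUSE} "
        "Next review is scheduled for Friday.")
words = note.split()


def keeps_pair_together(chunks):
    """True if at least one chunk contains both the effect and its cause."""
    return any(EFFECT in chunk and CAUSE in chunk for chunk in chunks)


fixed = window(words, size=12)                  # boundary splits the cause sentence
overlapped = window(words, size=12, overlap=6)  # second chunk covers both sentences

print(keeps_pair_together(fixed))       # False
print(keeps_pair_together(overlapped))  # True
```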
There is no universally optimal chunk size. The right size depends on: (1) the embedding model's native window, (2) the average length of a self-contained thought in your document corpus, (3) the generation model's context budget, and (4) the query type your users typically issue. These four factors must be measured, not guessed, using an evaluation harness you will build later in this course.
In this lab you will work through chunking fundamentals with an AI assistant specialized in RAG systems. Practice analyzing chunk size tradeoffs, diagnosing the three failure modes, and reasoning about how chunk decisions affect retrieval quality.
When LangChain first published its CharacterTextSplitter in late 2022, it set a default chunk size of 1,000 characters with 200-character overlap. The defaults were chosen empirically across a small test set of Wikipedia articles and PDF reports. Within weeks, community threads on the LangChain Discord documented that these defaults produced poor results on legal contracts (where clause boundaries rarely align with character counts), scientific papers (where equations span unpredictable lengths), and conversational transcripts (where speaker turns vary wildly). The defaults became a starting point, not a recommendation.
Character splitting divides text every N characters, optionally with M characters of overlap. It is the cheapest possible chunking strategy: O(n) in both time and memory, no tokenization required, no model calls. LangChain's CharacterTextSplitter and LlamaIndex's SimpleNodeParser both support it.
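A minimal sketch of character splitting with overlap follows. It is a plain-Python illustration of the idea, not the LangChain or LlamaIndex implementation; the function name and defaults are chosen here for the example.

```python
def split_by_chars(text, chunk_size=1000, overlap=200):
    """Cut `text` every `chunk_size` characters, repeating `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghij" * 250  # 2,500 characters of filler text
pieces = split_by_chars(sample)
print([len(p) for p in pieces])  # [1000, 1000, 900]
```

Note that nothing here looks at words, sentences, or tokens, which is exactly why the method is cheap and why its boundaries are arbitrary.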
The fundamental problem is that characters and meaning are decoupled. English text has roughly 4-5 characters per token on average, but that average masks extreme variance. A chunk of 1,000 characters might contain 200 tokens (dense prose) or 350 tokens (code with short identifiers). For embedding models with hard token limits, character-based chunking provides no guarantee against silent truncation.
Token splitting uses a tokenizer (typically tiktoken for OpenAI models, or the Hugging Face tokenizer matching your embedding model) to count actual tokens. This eliminates the character-to-token variance problem. LangChain's TokenTextSplitter wraps tiktoken; LlamaIndex's SentenceSplitter can also enforce token limits while preferring sentence boundaries.
Token splitting is more expensive than character splitting (tokenization is O(n) but with a larger constant), but it produces chunks that reliably fit within model windows. It is the sensible default for any production system whose embedding model enforces a hard token limit, such as OpenAI's embedding models.
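The sketch below shows the shape of a token splitter with a pluggable tokenizer. The default `encode` and `decode` are a whitespace stand-in (an assumption, so the example runs without dependencies); in a real system you would pass the encode and decode functions of the tokenizer that matches your embedding model.

```python
def split_by_tokens(text, max_tokens=512, overlap=64,
                    encode=str.split, decode=" ".join):
    """Split `text` into chunks of at most `max_tokens` tokens.

    `encode` maps text to a token list and `decode` maps a token list back
    to text. The whitespace defaults are placeholders for a real tokenizer.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    tokens = encode(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks


document = " ".join(f"w{i}" for i in range(1000))  # 1,000 placeholder tokens
token_chunks = split_by_tokens(document)
print([len(c.split()) for c in token_chunks])  # [512, 512, 104]
```

Because the budget is counted in the same units the model uses, every chunk is guaranteed to fit the window, provided the tokenizer really is the model's own.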
Overlap means the text around every chunk boundary is stored twice. If you have 100 chunks of 512 tokens with 64-token overlap, you are storing 51,200 tokens to represent about 44,864 tokens of unique content (512 for the first chunk plus 448 new tokens for each of the remaining 99), an overhead of roughly 14%. The cost is concentrated at the boundaries: each 64-token overlap region is stored twice, and at embedding time those tokens are embedded twice at full inference cost.
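The overlap arithmetic for any configuration can be checked with a small helper. This is a sketch that assumes every chunk is full-size, which slightly overstates the total when the last chunk is short.

```python
def overlap_overhead(num_chunks, chunk_size, overlap):
    """Return (stored_tokens, unique_tokens, overhead_fraction)."""
    stored = num_chunks * chunk_size
    # The first chunk is entirely new; each later chunk adds chunk_size - overlap.
    unique = chunk_size + (num_chunks - 1) * (chunk_size - overlap)
    return stored, unique, stored / unique - 1


stored, unique, overhead = overlap_overhead(100, 512, 64)
print(stored, unique, f"{overhead:.1%}")  # 51200 44864 14.1%
```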
Pinecone's 2023 developer documentation recommends an overlap of 10–20% of chunk size for general-purpose corpora. Below 10%, the risk of semantic fracture rises sharply. Above 20%, the index becomes noisy because adjacent chunks are nearly identical, causing the retriever to return near-duplicate content for many queries.
Neither character nor token splitting preserves document structure. A chunk has no inherent knowledge of which section, heading, or chapter it came from. Metadata injection (adding document title, section heading, and page number as a prepended string or as a separate metadata field) must be done explicitly at chunk creation time. This is critical for filtered retrieval and for citation generation.
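A minimal sketch of metadata injection at chunk creation time is shown below. The `Chunk` class, the `with_metadata` helper, and the bracketed prefix format are all hypothetical choices for illustration; frameworks have their own document types.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def with_metadata(text, title, section, page, prepend=True):
    """Attach provenance as a metadata field and, optionally, as a text prefix."""
    metadata = {"title": title, "section": section, "page": page}
    if prepend:
        # The prefix is embedded with the chunk, so it also influences retrieval.
        text = f"[{title} > {section} > p.{page}]\n{text}"
    return Chunk(text=text, metadata=metadata)


chunk = with_metadata("Either party may terminate with 30 days notice.",
                      title="MSA", section="Termination", page=12)
print(chunk.text.splitlines()[0])  # [MSA > Termination > p.12]
```

Keeping the same values in a separate field supports filtering and citations even if you decide not to prepend them to the embedded text.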
Practice diagnosing problems with character and token splitting configurations. Your assistant understands LangChain and LlamaIndex splitter APIs, overlap math, and when to switch strategies. Work through a real configuration challenge.
Anthropic's Claude team noted in their 2023 model card and associated technical discussions that retrieval-augmented generation performance on long legal documents improved substantially when chunking followed section-level structure rather than token budgets. Legal documents have a natural hierarchy (article → section → subsection → paragraph), and chunk boundaries that violated that hierarchy forced the model to reason across incomplete contractual clauses. The improvement was achieved not through a new model but through a better document parser that read PDF section markers.
Structural chunking uses the document's own organization as the splitting signal. For HTML, this means splitting on <h2> or <h3> tags. For Markdown, on ## headings. For PDFs with tagged structure, on section markers. For code, on function or class definitions. LangChain's HTMLHeaderTextSplitter and MarkdownHeaderTextSplitter implement this approach, attaching the header hierarchy as metadata to each chunk.
The key advantage is that structural chunks tend to be semantically coherent by design: the document author already decided that this heading covers this material. The disadvantage is variance: a section headed "Introduction" might be 200 words or 3,000 words. You typically combine structural splitting with a max-size fallback: split on headings first, then apply token splitting within any section that exceeds your budget.
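The heading-first, size-fallback combination can be sketched in a few lines for Markdown. This is an illustration rather than LangChain's MarkdownHeaderTextSplitter: it counts words as a stand-in for tokens, keeps only the nearest heading, and does not handle `#` lines inside fenced code.

```python
import re


def _emit(heading, body_lines, max_tokens):
    """Turn one section into one or more chunks of at most `max_tokens` words."""
    words = " ".join(body_lines).split()
    return [{"heading": heading, "text": " ".join(words[i:i + max_tokens])}
            for i in range(0, len(words), max_tokens)]


def split_markdown(text, max_tokens=512):
    """Split on Markdown headings first, then by size within long sections."""
    chunks, heading, body = [], "", []
    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line):
            chunks.extend(_emit(heading, body, max_tokens))  # close previous section
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    chunks.extend(_emit(heading, body, max_tokens))  # close the final section
    return chunks


doc = "# Intro\nshort text here\n## Details\n" + " ".join(["word"] * 10)
for c in split_markdown(doc, max_tokens=4):
    print(c["heading"], "|", c["text"])
```

The short "Intro" section stays whole, while the longer "Details" section is cut into three pieces that all carry the same heading as metadata.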
Semantic chunking detects meaning shifts in the text and places chunk boundaries where topic transitions occur, rather than at fixed intervals. The approach, popularized in part by Greg Kamradt's 2023 "5 Levels of Text Splitting" notebook (which received wide circulation in the LangChain community), uses sentence embeddings to measure cosine similarity between adjacent sentences. When similarity drops below a threshold, a boundary is placed.
The algorithm: (1) split text into sentences using a sentence boundary detector, (2) embed each sentence, (3) compute rolling cosine similarity between sentence n and sentence n+1, (4) identify breakpoints where similarity drops sharply (a "valley" in the similarity signal), (5) merge consecutive sentences between breakpoints into a chunk.
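The five steps above can be sketched as follows. Two simplifications are assumptions of this example: the "embedding" is a bag-of-words count vector rather than a learned sentence embedding, and a breakpoint is any similarity below a fixed threshold rather than a detected valley.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding: lowercase word counts (stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(text, threshold=0.2):
    # (1) split into sentences, (2) embed each one
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # (3) similarity of adjacent sentences, (4) break where it drops
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))  # (5) merge the remaining sentences
    return chunks


text = ("The cache stores recent query results. "
        "The cache evicts old query results after an hour. "
        "Billing runs on the first day of each month. "
        "Billing invoices are emailed each month.")
for chunk in semantic_chunks(text):
    print(chunk)
```

On this sample the boundary falls between the cache sentences and the billing sentences, where the adjacent-sentence similarity is lowest.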
LangChain added SemanticChunker in early 2024 (it ships in the langchain_experimental package). Its documentation examples use OpenAI embeddings, and breakpoint detection is percentile-based by default. The critical hyperparameter is the breakpoint threshold type: percentile (break at the bottom X% of similarity scores), standard deviation (break where similarity drops more than N standard deviations below the mean), or interquartile (break where similarity falls below the lower quartile by more than a multiple of the interquartile range).
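The three threshold rules can be compared on a list of adjacent-sentence similarities. The sketch below follows the descriptions in the text and is not LangChain's code; the function name, the nearest-rank percentile, and the sample scores are assumptions made for illustration.

```python
import math
import statistics


def find_breakpoints(similarities, kind="percentile", amount=10.0):
    """Return indices i where a boundary goes between sentence i and i + 1."""
    if kind == "percentile":
        # Nearest-rank cutoff: the lowest `amount` percent of scores are breaks.
        rank = max(1, math.ceil(amount / 100 * len(similarities)))
        cutoff = sorted(similarities)[rank - 1]
        return [i for i, s in enumerate(similarities) if s <= cutoff]
    if kind == "standard_deviation":
        # Break where similarity is more than `amount` deviations below the mean.
        cutoff = (statistics.mean(similarities)
                  - amount * statistics.pstdev(similarities))
    elif kind == "interquartile":
        # Break where similarity is more than `amount` IQRs below the lower quartile.
        q1, _, q3 = statistics.quantiles(similarities, n=4)
        cutoff = q1 - amount * (q3 - q1)
    else:
        raise ValueError(f"unknown kind: {kind}")
    return [i for i, s in enumerate(similarities) if s < cutoff]


# Hypothetical scores with two clear topic shifts (positions 2 and 6).
sims = [0.90, 0.88, 0.20, 0.91, 0.87, 0.89, 0.15, 0.92, 0.86, 0.90, 0.88, 0.85]
print(find_breakpoints(sims, "percentile", 10))
print(find_breakpoints(sims, "standard_deviation", 1))
print(find_breakpoints(sims, "interquartile", 1.5))
```

On clean data like this all three rules agree. They diverge on noisier text: the percentile rule always produces a fixed share of breaks, while the other two can produce none when similarity never drops far enough.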
| Strategy | Coherence | Cost | Chunk Size Control | Best Corpus Type |
|---|---|---|---|---|
| Character Split | Low | Negligible | Approximate | Uniform prose, prototyping |
| Token Split | Low–Medium | Low | Exact | Mixed content, production default |
| Structural Split | High | Low–Medium | Variable (needs fallback) | Docs, wikis, legal, academic papers |
| Semantic Split | Highest | High (embedding calls) | Emergent (unpredictable) | Narrative, transcripts, unstructured prose |
Most production systems at scale use a hierarchy of splitting: structural first (headings, sections), token-limited fallback within sections, and metadata injection at each level. Pure semantic chunking is expensive at indexing time, because every sentence must be embedded, and it produces variable-length chunks that complicate batch processing. Reserve it for corpora that lack structure and where retrieval quality is highly sensitive to topic coherence.
Work with your assistant to reason through structural and semantic chunking decisions for different corpus types. Practice configuring LangChain's MarkdownHeaderTextSplitter and SemanticChunker, and learn when to hybridize strategies.
The paper "Dense X Retrieval: What Retrieval Granularity Should We Use?" (Chen et al., 2023, University of Washington) introduced proposition-based retrieval as an alternative to chunk-based retrieval. Rather than indexing fixed text segments, the approach decomposes documents into atomic factual statements, called propositions, each expressing a single verifiable claim. A document about climate change would yield hundreds of propositions like "Global average temperatures rose 1.1°C above pre-industrial levels between 1850 and 2020." Each proposition is embedded independently. On the FEVER fact verification and Natural Questions benchmarks, proposition retrieval outperformed 100-token chunk retrieval by 11–15 percentage points in recall@5.
A proposition is an atomic, self-contained factual statement extracted from a source document using an LLM. The extraction prompt instructs the model to decompose each paragraph into its constituent facts, rephrasing each as a complete, standalone sentence that requires no surrounding context to interpret.
The advantage is retrieval precision: a query about a specific fact will find the proposition that directly states that fact, rather than a 512-token chunk that contains it somewhere in the middle. The disadvantage is cost: generating propositions requires an LLM call for every paragraph in your corpus, and the resulting index contains far more documents than a chunk-based index (more small documents = more retrieval candidates = potentially noisier results at the top-k boundary).
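The indexing side of this approach can be sketched with the extractor left pluggable. Everything named here is hypothetical: `build_proposition_index` and `Proposition` are illustrative, and `naive_extract` is only a sentence splitter standing in for the LLM call, so unlike a real extractor it does not rewrite pronouns or make statements self-contained.

```python
import re
from dataclasses import dataclass


@dataclass
class Proposition:
    text: str
    parent_id: int  # index of the paragraph the statement came from


def build_proposition_index(paragraphs, extract):
    """Run `extract` once per paragraph and keep a back-reference to the source."""
    index = []
    for parent_id, paragraph in enumerate(paragraphs):
        for statement in extract(paragraph):  # one (LLM) call per paragraph
            index.append(Proposition(statement.strip(), parent_id))
    return index


def naive_extract(paragraph):
    """Placeholder for an LLM: treats each sentence as one proposition."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


paragraphs = ["Redis is an in-memory store. Redis supports key expiry.",
              "Postgres is a relational database."]
index = build_proposition_index(paragraphs, naive_extract)
print(len(paragraphs), "paragraphs ->", len(index), "propositions")
```

Even in this toy case the index holds more entries than the source has paragraphs, which is the growth in retrieval candidates described above.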
LlamaIndex added a PropositionNodeParser in 2024, using a configurable LLM to extract propositions and store both the proposition and a reference back to its parent chunk for context retrieval.
Parent-child chunking, sometimes called "small-to-big retrieval," addresses a core tension: you want small chunks for precise retrieval but large chunks for rich generation context. The solution: index small child chunks for retrieval, but when a child chunk is retrieved, return its parent chunk to the generation model.
LangChain's ParentDocumentRetriever implements this pattern. At indexing time, each document is split into large parent chunks (e.g., 1,024 tokens), and each parent chunk is split further into small child chunks (e.g., 256 tokens). Only the child embeddings are stored in the vector index. At query time, the top-k child chunks are retrieved, their parent IDs are looked up in a docstore, and the full parent chunks are passed to the generation model.
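The indexing and lookup flow described above can be sketched without any framework. This is not the ParentDocumentRetriever implementation: it counts words instead of tokens, uses a toy bag-of-words similarity in place of a vector index, and keeps the docstore in a plain dict.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(document, parent_size=1024, child_size=256):
    """Store parents in a docstore; index only the child embeddings."""
    words = document.split()
    docstore, child_index = {}, []
    for parent_id, start in enumerate(range(0, len(words), parent_size)):
        parent_words = words[start:start + parent_size]
        docstore[parent_id] = " ".join(parent_words)
        for c_start in range(0, len(parent_words), child_size):
            child = " ".join(parent_words[c_start:c_start + child_size])
            child_index.append((embed(child), parent_id))
    return docstore, child_index


def retrieve_parents(query, docstore, child_index, k=2):
    """Rank children by similarity, then return their de-duplicated parents."""
    q = embed(query)
    ranked = sorted(child_index, key=lambda item: cosine(q, item[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]


document = " ".join(["alpha"] * 8 + ["beta"] * 8)
docstore, child_index = build_index(document, parent_size=8, child_size=4)
print(retrieve_parents("beta", docstore, child_index, k=2))
```

Both top-ranked children belong to the same parent, so the generator receives that parent once rather than two overlapping fragments.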
The only reliable way to choose between chunking strategies is empirical evaluation using a question-answer test set derived from your actual corpus. The standard approach: generate 50–200 question-answer pairs from your documents (using an LLM or manually), run your RAG pipeline on each question, and measure:
Retrieval recall@k: for each ground-truth answer, does the relevant chunk appear in the top-k retrieved chunks? This measures whether the right material is being found at all.
Answer faithfulness: using an LLM judge or a framework like RAGAS (open-source, released by the Exploding Gradients team in 2023), measure whether the generated answer is fully supported by the retrieved chunks, with no hallucinated additions.
Answer relevancy: does the generated answer actually address the question asked? RAGAS computes this by regenerating the question from the answer and measuring semantic similarity to the original query.
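Of the three metrics, recall@k needs no LLM judge and can be computed directly. The sketch below assumes one relevant chunk per question and uses hypothetical chunk IDs; faithfulness and relevancy are left to a judge model or an evaluation framework.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of questions whose relevant chunk appears in the top-k results.

    `retrieved_ids[i]` is the ranked list of chunk IDs returned for question i,
    and `relevant_ids[i]` is the ID of the chunk that holds its answer.
    """
    if not relevant_ids:
        raise ValueError("need at least one question")
    hits = sum(1 for got, want in zip(retrieved_ids, relevant_ids) if want in got[:k])
    return hits / len(relevant_ids)


# Three hypothetical questions with ranked retrieval results.
retrieved = [["c1", "c7", "c3"], ["c4", "c2", "c9"], ["c5", "c6", "c8"]]
relevant = ["c3", "c9", "c0"]

print(recall_at_k(retrieved, relevant, k=3))  # two of three questions hit
print(recall_at_k(retrieved, relevant, k=2))  # 0.0
```

Running the same question set against two chunking configurations and comparing this number is the simplest form of the evaluation described above.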
In the RAGAS paper (Es et al., 2023), testing on a sample of 800 questions across Wikipedia, financial reports, and ArXiv abstracts, switching from 512-token chunks to parent-child retrieval (256-token children, 1024-token parents) improved answer faithfulness from 0.71 to 0.84 and answer relevancy from 0.74 to 0.81 on their scoring scale. These numbers are not universal (they depend heavily on corpus and query type), but they establish that the choice of chunking strategy has a measurable, significant impact on end-to-end answer quality.
Start with token splitting (512 tokens, 10–15% overlap) as your baseline. This is cheap, deterministic, and gives you a measurement floor.
If your corpus has structure (headers, sections, chapters), add structural splitting as the primary split with token splitting as a fallback within large sections. Measure retrieval recall@5 improvement over baseline.
If your corpus is unstructured prose (transcripts, narratives, support tickets), test semantic chunking on a representative sample. If recall@5 improves by more than 5 percentage points over token splitting, the embedding cost is likely justified.
If answer faithfulness is the primary concern and your generation model is returning hallucinations, switch to parent-child retrieval. The larger parent context typically reduces hallucination by giving the model more complete information.
If your queries are highly specific factual lookups and your corpus is small enough to afford LLM-based indexing, proposition extraction will give you the highest precision retrieval of any method.
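The decision sequence above can be written down as a small checklist function, which is handy for recording why a configuration was chosen. The function name and parameter names are hypothetical, and the 5-point threshold is the one given in the text.

```python
def recommend_strategy(has_structure=False, unstructured_prose=False,
                       semantic_recall_gain_pts=0.0,
                       faithfulness_is_primary=False,
                       specific_fact_lookups=False,
                       can_afford_llm_indexing=False):
    """Return the ordered list of chunking steps suggested by the guidance."""
    steps = ["token split baseline (512 tokens, 10-15% overlap)"]
    if has_structure:
        steps.append("structural split with token fallback")
    # Semantic chunking only pays off if recall@5 gained more than 5 points.
    if unstructured_prose and semantic_recall_gain_pts > 5:
        steps.append("semantic chunking")
    if faithfulness_is_primary:
        steps.append("parent-child retrieval")
    if specific_fact_lookups and can_afford_llm_indexing:
        steps.append("proposition extraction")
    return steps


print(recommend_strategy(has_structure=True, faithfulness_is_primary=True))
```

The inputs are measurements from your own evaluation runs, so the function encodes the order of the checks rather than replacing them.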
Work through advanced chunking design challenges with your assistant. Practice reasoning about proposition extraction tradeoffs, parent-child retriever configuration, and setting up a RAGAS evaluation harness for your chunking strategy.