RAG chunking strategies explained
A support bot returns the pricing page footer instead of the refund policy because your chunker split mid-paragraph. A legal assistant cites section 4.2 without the definitions from section 1 that make it intelligible. A code-search RAG retrieves a function body but not the import that explains the type error. These failures rarely trace back to the embedding model or the LLM — they start at chunking, the step that decides what text becomes a retrievable unit. This guide explains why RAG chunking strategies matter more than model choice for many production systems; compares fixed-size, semantic, and structure-aware approaches; covers overlap, metadata, and parent-child indexes; walks through contextual retrieval enrichment; sizes chunks against embedding limits; applies the patterns to a Harbor Support knowledge base; compares strategies in a decision table; lists common pitfalls; and ends with a production checklist. For the full RAG pipeline, see our RAG overview; for measuring whether chunks help, see RAG evaluation and semantic search.
What chunking does in a RAG pipeline
Retrieval-augmented generation indexes documents as chunks — strings small enough to embed, store in a vector database, and fit into an LLM context window alongside the user question. Chunking is the boundary-drawing step between raw files (PDFs, HTML, Markdown, tickets, code) and those index rows.
Each chunk should be:
- Self-contained enough to answer a narrow question when retrieved alone.
- Focused enough that its embedding vector represents one topic, not five.
- Small enough to leave room for other chunks, system prompts, and the answer.
- Large enough to preserve definitions, qualifiers, and negations that change meaning.
There is no universal optimal size. Chunking is a retrieval design problem: you are choosing the granularity at which similarity search operates. Change the granularity and recall@k, faithfulness, and latency all move — often more than swapping from one embedding model to another.
Fixed-size chunking
The default in most tutorials: split text every N tokens or characters, optionally with overlap. Libraries like LangChain’s RecursiveCharacterTextSplitter try split points in order (paragraph, newline, space) before hard-cutting.
When it works
Homogeneous prose — blog posts, news articles, uniform help articles — where sections are roughly similar length and topics do not span pages. Fast to implement, deterministic, easy to reproduce in eval pipelines.
Typical parameters
- Chunk size: 256–512 tokens for narrow factual lookup; 512–1,024 for explanatory content; up to 2,048 when the generator has a large context window and questions need surrounding paragraphs.
- Overlap: 10–20% of chunk size so sentences split at boundaries still appear whole in at least one chunk. More overlap increases index size and duplicate hits; less loses context at seams.
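A minimal sketch of the sliding-window idea behind those parameters. The function name `chunk_fixed` is hypothetical, and whitespace splitting stands in for a real tokenizer, so treat the counts as illustrative rather than true token counts.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window repeats the last token of the previous one, which is how a sentence cut at a seam can still appear whole in a neighbouring chunk.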
Failure modes
Tables split across chunks lose headers. Bullet lists become orphaned items. Code blocks break mid-function. Legal and policy docs separate “shall not” conditions from the nouns they modify. Fixed-size chunking is a baseline, not a finish line.
Structure-aware chunking
Respect document structure before token counts. The splitter follows the outline the author already wrote.
Markdown and HTML
Split on heading levels (h1–h3), then sub-split oversized sections with a fixed-size fallback. Prepend the heading trail to each chunk: Refund Policy > Eligibility > Digital goods. That breadcrumb becomes a critical embedding signal when many sections share vocabulary (“eligible,” “within 30 days”).
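The heading-trail idea can be sketched for Markdown as follows. This is a simplified illustration, not a full parser: `split_markdown` is a hypothetical helper, it only recognizes ATX headings (`#` to `###`), it would misread `#` lines inside fenced code blocks, and it leaves out the fixed-size fallback for oversized sections.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def split_markdown(text: str) -> list[dict]:
    """Split on h1-h3 headings and prefix each chunk with its heading trail."""
    trail: list[str] = []   # current heading path, e.g. [h1, h2, h3]
    body: list[str] = []    # lines collected since the last heading
    chunks: list[dict] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            breadcrumb = " > ".join(trail)
            text_out = f"{breadcrumb}\n\n{content}" if breadcrumb else content
            chunks.append({"breadcrumb": breadcrumb, "text": text_out})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                      # close the section that just ended
            level = len(match.group(1))
            del trail[level - 1:]        # drop headings at this level or deeper
            trail.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Refund Policy
Intro.
## Eligibility
### Digital goods
Refunds within 30 days.
## Exceptions
Gift cards are final sale.
"""
sections = split_markdown(doc)
for section in sections:
    print(section["breadcrumb"])
```

Because the breadcrumb is part of the chunk text, it is embedded along with the body, which is what separates two sections that both say “within 30 days”.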
PDFs and office documents
Use layout-aware parsers (Unstructured, Docling, Adobe Extract) that preserve headings, lists, and table boundaries. Do not index text in raw PDF extraction order when columns or footers interleave, because the reading order is scrambled before chunking even starts. For tables, store the table as Markdown or HTML inside a single chunk, or index row summaries separately with a pointer to the full table chunk.
Code repositories
Chunk by function, class, or module — not by line count. Include imports, docstrings, and the signature in the same unit. For large files, parent-child patterns (below) let you retrieve a tight function while expanding to surrounding context at generation time.
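For Python sources, one way to sketch this uses the standard-library `ast` module. The helper `chunk_python_source` is hypothetical, and it makes simplifying assumptions: it handles only top-level definitions, attaches every module-level import to every chunk, and (because a function node starts at its `def` line) leaves decorators out.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with module imports attached."""
    tree = ast.parse(source)
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    header = "\n".join(imports)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # The segment covers signature, docstring, and body of the definition.
            code = ast.get_source_segment(source, node)
            chunks.append({
                "name": node.name,
                "text": f"{header}\n\n{code}" if header else code,
            })
    return chunks


sample = '''import math
from typing import List

def area(r: float) -> float:
    """Circle area."""
    return math.pi * r ** 2

class Shape:
    pass
'''
code_chunks = chunk_python_source(sample)
print([c["name"] for c in code_chunks])
```

Repeating the imports in each chunk costs a few tokens per unit but keeps the type and module context next to the function that depends on it.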
Tickets and chat logs
One ticket = one chunk, or one chunk per agent–customer turn pair with ticket metadata (product, severity, resolution code). Thread boundaries matter more than token length.
Semantic chunking
Instead of a fixed ruler, detect topic shifts. Embed sentences or paragraphs, measure cosine similarity between consecutive units, and split when similarity drops below a threshold. Tools like LlamaIndex’s semantic splitter automate this loop.
Semantic chunking produces variable-size segments that align with conceptual boundaries — useful for long reports, research PDFs, and transcripts where section headers are missing or unreliable. Cost: extra embedding calls at index time and sensitivity to threshold tuning per corpus.
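The embed, compare, and split loop described above can be sketched like this. The `embed` function here is a toy bag-of-words stand-in so the example runs without a model; a real pipeline would call an embedding model, and the threshold of 0.2 is an arbitrary value chosen for this toy data.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: replaces a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(units: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever consecutive units fall below `threshold` similarity."""
    if not units:
        return []
    chunks = [[units[0]]]
    prev = embed(units[0])
    for unit in units[1:]:
        vec = embed(unit)
        if cosine(prev, vec) < threshold:
            chunks.append([])  # similarity dropped: treat it as a topic shift
        chunks[-1].append(unit)
        prev = vec
    return chunks


sentences = [
    "Refunds are available within 30 days of purchase.",
    "Refunds for digital goods are available within 14 days.",
    "Shipping takes five business days.",
    "Express shipping takes two business days.",
]
groups = semantic_chunks(sentences, threshold=0.2)
for group in groups:
    print(group)
```

With these four sentences the split lands between the refund pair and the shipping pair. On a real corpus the threshold needs tuning, which is the sensitivity noted above.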
Hybrid approach many teams use in production: structure-aware first pass (split on headings), semantic second pass only for sections exceeding a max token budget.
Parent-child and small-to-big retrieval
Small chunks embed well — tight vectors, high precision. Large chunks read well — the LLM sees enough context to answer. Parent-child indexing resolves the trade-off:
- Index child chunks (128–256 tokens) for search.
- Store each child’s parent chunk (full section, 1–2k tokens).
- On retrieval, match children, then pass the parent text (or parent + siblings) to the LLM.
Variants include sentence-window retrieval (expand each hit by ±N sentences) and auto-merging retrievers, which replace a group of child hits with their parent once enough children of the same parent are retrieved. The pattern appears in LlamaIndex, LangChain’s parent-document retriever, and bespoke support stacks.
Watch index size: you store two representations per logical section. Deduplicate parent text at generation time so the prompt does not repeat the same paragraph three times when multiple children hit.
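A sketch of the match-children-then-return-parents step, including the deduplication mentioned above. The `Child` class, the `expand_to_parents` helper, and the scores are all hypothetical; the vector search itself is assumed to have already produced the ranked child hits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Child:
    child_id: str
    parent_id: str
    text: str


def expand_to_parents(hits: list[tuple[Child, float]],
                      parents: dict[str, str],
                      max_parents: int = 3) -> list[str]:
    """Map ranked child hits to unique parent texts, best-scoring child first."""
    seen: set[str] = set()
    context: list[str] = []
    for child, _score in sorted(hits, key=lambda h: h[1], reverse=True):
        if child.parent_id in seen:
            continue  # another child already pulled in this parent's text
        seen.add(child.parent_id)
        context.append(parents[child.parent_id])
        if len(context) == max_parents:
            break
    return context


# Hypothetical data: two children of the same section both match the query.
parents = {
    "refunds": "Full refund policy section text",
    "billing": "Full billing section text",
}
hits = [
    (Child("c2", "refunds", "exceptions list"), 0.88),
    (Child("c1", "refunds", "eligibility conditions"), 0.91),
    (Child("c3", "billing", "prorated credits"), 0.80),
]
context = expand_to_parents(hits, parents)
print(context)
```

Three child hits collapse to two parent texts, so the prompt carries each section once regardless of how many of its children matched.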
Contextual retrieval and metadata enrichment
Anthropic’s contextual retrieval prepends a short LLM-generated description to each chunk before embedding, situating the chunk within its source document: “This chunk is from Harbor’s refund policy, section on digital goods, explaining the 14-day window for subscriptions.” The extra context lands in the vector, improving recall when chunks alone are ambiguous.
Cheaper alternatives that work well:
- Heading breadcrumbs prepended to chunk text (no LLM cost).
- Structured metadata filters: product, locale, doc version, effective date.
- Keywords and entities extracted once at index time for hybrid BM25 + vector search (see hybrid search).
- Source URI and anchor for citation links in the final answer.
Store metadata in the vector DB payload, not only in the embedded string, so you can filter (“only 2026 policy docs”) without re-embedding.
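To illustrate the split between embedded text and filterable payload, here is a store-agnostic sketch. `IndexedChunk`, `filter_chunks`, the field names, and the `kb://` URIs are hypothetical and do not correspond to any particular vector database API; real stores expose their own filter syntax.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedChunk:
    embedded_text: str                             # breadcrumb + body: what gets embedded
    payload: dict = field(default_factory=dict)    # filterable metadata, not embedded


def filter_chunks(chunks: list[IndexedChunk], **conditions) -> list[IndexedChunk]:
    """Keep chunks whose payload matches every key=value condition exactly."""
    return [
        c for c in chunks
        if all(c.payload.get(key) == value for key, value in conditions.items())
    ]


index = [
    IndexedChunk(
        "Refund Policy > Eligibility\n\nRefunds are available within 30 days.",
        {"product": "billing", "version": "2026", "source": "kb://refund-policy#eligibility"},
    ),
    IndexedChunk(
        "Refund Policy > Eligibility\n\nRefunds are available within 14 days.",
        {"product": "billing", "version": "2024", "source": "kb://refund-policy-old#eligibility"},
    ),
]
current = filter_chunks(index, version="2026")
print([c.payload["source"] for c in current])
```

Changing the filter from one version to another touches only the payload lookup; the embedded strings and their vectors stay as they are.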
Choosing chunk size: practical rules
Match chunk size to question type and embedding model limits:
- Fact lookup (“What is the withdrawal fee?”) — smaller chunks (200–400 tokens), high precision.
- How-to and troubleshooting — medium chunks (400–800 tokens) or parent-child so steps stay together.
- Synthesis across sections (“Compare plan A vs plan B”) — larger parents, multi-query retrieval, or agentic RAG that fetches several chunks.
- Embedding model context — stay under the model’s max input (often 512 or 8,192 tokens); leave headroom for prepended context strings.
- Generator context budget — if top-k = 5 and each chunk is 2,000 tokens, you may exhaust the window before the question fits.
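The budget arithmetic in the last rule is worth writing down. This sketch assumes an 8,192-token generator window and made-up allowances for the system prompt, question, and answer; substitute your own model's figures.

```python
def remaining_tokens(window: int, top_k: int, chunk_tokens: int,
                     system_tokens: int = 300,
                     question_tokens: int = 100,
                     answer_tokens: int = 500) -> int:
    """Tokens left in the generator window after chunks, prompts, and answer budget."""
    used = top_k * chunk_tokens + system_tokens + question_tokens + answer_tokens
    return window - used


# Assumption: an 8,192-token window; allowances above are illustrative defaults.
over_budget = remaining_tokens(window=8192, top_k=5, chunk_tokens=2000)
fits = remaining_tokens(window=8192, top_k=5, chunk_tokens=800)
print(over_budget, fits)
```

A negative result means the retrieved chunks alone overflow the window, so either k or the chunk size has to come down.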
Tune with data: build a golden set of 50–100 question–document pairs, sweep chunk size and overlap, measure context recall and answer faithfulness. The peak is rarely at the extremes.
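A simplified retrieval-stage metric for such a sweep might look like the following. It measures hit rate (whether the expected source appears in the top k), which is a coarser proxy than the context recall defined by evaluation frameworks. `recall_at_k`, the stub retriever, and the section IDs are hypothetical.

```python
from typing import Callable


def recall_at_k(golden: list[tuple[str, str]],
                retrieve: Callable[[str, int], list[str]],
                k: int = 5) -> float:
    """Fraction of questions whose expected source ID appears in the top-k results."""
    if not golden:
        return 0.0
    hits = sum(1 for question, expected_id in golden
               if expected_id in retrieve(question, k))
    return hits / len(golden)


# Stub retriever standing in for a real index built with one chunking config.
fake_results = {
    "What is the refund window?": ["refund-policy#eligibility", "billing#credits"],
    "How are credits prorated?": ["shipping#rates", "refund-policy#exceptions"],
}


def fake_retrieve(question: str, k: int) -> list[str]:
    return fake_results.get(question, [])[:k]


golden_set = [
    ("What is the refund window?", "refund-policy#eligibility"),
    ("How are credits prorated?", "billing#credits"),
]
score = recall_at_k(golden_set, fake_retrieve, k=2)
print(score)
```

In a sweep, you would rebuild the index for each chunk size and overlap pair, rerun this function with the same golden set, and compare scores across configurations.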
Worked example: Harbor Support knowledge base
Harbor Support indexes 4,200 articles across billing, shipping, and account security. Initial launch used 512-token fixed chunks with 64-token overlap. Context recall on their eval set was 71% — acceptable for FAQs, failing on policy questions that spanned sections.
Audit findings:
- Refund eligibility conditions split from the exception list in 18% of policy articles.
- Billing tables lost column headers, causing wrong answers on prorated credits.
- Duplicate near-identical chunks from overlap inflated top-k with redundant hits.
Revised pipeline:
- Parse HTML with heading-aware splitter; max section size 900 tokens before sub-split.
- Prepend Article title > H2 > H3 breadcrumb to every chunk.
- Index 256-token children; store 900-token parents keyed by section ID.
- Tables rendered to Markdown in a single non-split chunk under 1,500 tokens.
- Retrieve k=8 children, dedupe to 3 parents, pass parents to GPT-4o-mini for answering.
Context recall rose from 71% to 89%, and LLM-judged faithfulness rose from 82% to 91%. Index storage grew by about 40%, an acceptable cost compared with retraining or tolerating wrong refund guidance.
Strategy decision table
| Strategy | Best for | Trade-off |
|---|---|---|
| Fixed-size + overlap | Uniform articles, quick prototypes, eval baselines | Breaks tables, code, and cross-paragraph dependencies |
| Structure-aware (headings) | Docs, wikis, help centers, Markdown repos | Needs reliable structure; weak on scanned PDFs |
| Semantic splits | Long unstructured prose, transcripts, research | Extra embed cost; threshold tuning per corpus |
| Parent-child | Precision-critical search + context-heavy answers | Larger index; dedupe logic at generation time |
| Contextual enrichment | Ambiguous chunks, similar wording across sections | LLM cost at index time (or manual breadcrumbs) |
| One-doc-one-chunk | Very short pages (< 300 tokens total) | Does not scale; poor granularity for long pages |
Common pitfalls
- Chunking after aggressive cleaning that removes headings, list markers, or table structure.
- Same parameters for every doc type — legal, code, and chat need different splitters.
- Ignoring negation boundaries — “except when” clauses stranded in the next chunk.
- Embedding boilerplate — nav bars, copyright footers, and “Was this helpful?” pollute vectors.
- No version metadata — retrieving superseded policy alongside current policy.
- Evaluating only end-to-end answers without measuring retrieval-stage context recall.
- Re-chunking without re-embedding when chunk boundaries change — stale index rows.
- Massive overlap — top-k filled with duplicates, wasting context window.
Production checklist
- Inventory document types (HTML, PDF, code, tickets) and assign a splitter per type.
- Strip boilerplate and normalize encoding before splitting.
- Prepend heading breadcrumbs or contextual summaries to chunk text.
- Set chunk size and overlap from eval sweeps, not defaults copied from a tutorial.
- Store source URI, version, and section ID in vector payload metadata.
- Consider parent-child when precision and context both matter.
- Handle tables and code as atomic units where possible.
- Version the chunking config alongside embedding model in your index manifest.
- Measure context recall and faithfulness when chunk params change.
- Re-index atomically on chunking migrations; keep rollback snapshots.
Key takeaways
- Chunking defines retrieval granularity — it often beats embedding model swaps for measurable gains.
- Structure-aware splits outperform naive fixed-size on real docs with headings, tables, and lists.
- Parent-child indexing separates search precision from generation context.
- Metadata and breadcrumbs provide cheap contextual enrichment without extra LLM calls.
- Tune with eval data — chunk size is a hyperparameter, not a constant.
Related reading
- RAG explained — full retrieve-and-generate pipeline
- RAG evaluation explained — context recall, faithfulness, and golden sets
- Semantic search explained — embeddings, ANN indexes, and hybrid fusion
- Agentic RAG explained — multi-step retrieval when one chunk is not enough