Most RAG Failures Start in the Documents

Chunking, titles, metadata, parent-child structure, reranking, and corpus QA for RAG systems.

6 min read · By Alex Chernysh
RAG · Retrieval · Data Quality · AI Engineering

When a RAG system fails, the model usually gets the blame. The documents usually had a head start.

Flat corpus

Documents are dumped into the index with minimal structure.

  • weak titles
  • generic chunking
  • missing metadata
  • duplicates and stale versions survive ingestion

Prepared corpus

The corpus preserves identity, structure, and retrieval hints.

  • chunks know what document they belong to
  • titles carry real signal
  • metadata and filters narrow the search space
  • stale and low-quality inputs are caught early

Ingestion path

The useful RAG pipeline starts before embeddings.
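The difference between the two corpora can be made concrete with a minimal chunk record that keeps its identity through the pipeline. This is a sketch, not a library type; the field names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """One retrievable unit that keeps its identity through the pipeline."""
    text: str
    doc_id: str                  # which document it belongs to
    title: str                   # human-meaningful, not "Section 4"
    section_path: tuple = ()     # e.g. ("Termination", "Notice periods")
    metadata: dict = field(default_factory=dict)  # version, source, dates

chunk = Chunk(
    text="Notice must be given 90 days before renewal.",
    doc_id="msa-2024-v3",
    title="Notice periods for termination in enterprise plans",
    section_path=("Termination", "Notice periods"),
    metadata={"effective_date": "2024-01-01", "source": "legal"},
)
```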

1. Chunking is not a clerical step

Teams still talk about chunking as if it were a preprocessing chore. It is usually one of the main retrieval decisions.

Good chunks do three jobs at once:

  • they are small enough to match the question precisely
  • they preserve enough context to remain intelligible
  • they carry enough identity to be trusted later

That balance is why fixed-size chunking ages badly.

The better starting point is structural chunking:

  • Markdown by headings
  • policies by sections
  • contracts by clauses and definitions
  • case law by facts, reasoning, and holdings
  • docs with tables or figures by layout-aware extraction where possible

You can still backstop this with token limits. But the structure should speak first.
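A sketch of that order of operations, structure first with a size backstop, assuming Markdown input. The `chunk_markdown` name and the character-based limit are illustrative; a real pipeline would count tokens:

```python
import re

def chunk_markdown(text, max_chars=1200):
    """Split on headings first; hard-split only sections that exceed the limit."""
    sections, current = [], []
    for line in text.splitlines():
        # a new heading closes the current section
        if re.match(r"^#{1,6}\s", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    # backstop: structure speaks first, the size limit only catches outliers
    chunks = []
    for s in sections:
        while len(s) > max_chars:
            chunks.append(s[:max_chars])
            s = s[max_chars:]
        chunks.append(s)
    return chunks
```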

2. Strong titles quietly improve retrieval

One of the simplest upgrades is also one of the most neglected: give chunks useful titles.

A chunk called "Section 4" tells the retriever almost nothing.

A chunk called "Notice periods for termination in enterprise plans" gives both the retriever and the later generator a better chance.

This is not glamorous. It is effective.

A surprising amount of retrieval quality comes from small signals like:

  • document title
  • section path
  • subsection name
  • version or effective date
  • source type

If those fields are noisy, the vector index has to work harder than it should.
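One low-effort way to put these signals to work is to prefix each chunk's body with its title path before embedding, so the vector carries the identity too. A sketch; the function name and separator are illustrative:

```python
def embeddable_text(doc_title, section_path, body):
    """Prefix the chunk body with its title path so the embedding sees it."""
    title = " / ".join([doc_title, *section_path])
    return f"{title}\n\n{body}"

text = embeddable_text(
    "Enterprise MSA 2024",
    ("Termination", "Notice periods"),
    "Notice must be given 90 days before renewal.",
)
```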

3. Parent-child structure is still one of the best trade-offs

Small chunks retrieve better. Larger chunks are easier for the model to read.

That tension never went away.

Parent-child retrieval remains one of the cleanest compromises:

  • index smaller child chunks for matching
  • return the larger parent section for reading
  • preserve the link between them all the way to generation

This gives you better recall without forcing the model to answer from a pile of disconnected sentence fragments.

It also keeps provenance cleaner, because the retrieved evidence still belongs to a recognizable section rather than an orphaned paragraph with good embeddings and no life.
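A minimal sketch of the parent-child link, assuming simple character-based child splitting for brevity (names are illustrative; in practice the children would come from sentence or token boundaries):

```python
def build_parent_child(sections, child_size=300):
    """Index small children for matching; keep a map back to the parent."""
    children, parent_of = [], {}
    for pid, section in enumerate(sections):
        for i in range(0, len(section), child_size):
            cid = len(children)
            children.append(section[i:i + child_size])  # what gets embedded
            parent_of[cid] = pid                        # the link we must not lose
    return children, parent_of

def retrieve_parent(hit_child_id, sections, parent_of):
    """Match on the child, read from the parent."""
    return sections[parent_of[hit_child_id]]
```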

4. Metadata narrows the search space before the model has to clean up the mess

RAG systems often look smarter when they are simply allowed to search a smaller, more relevant space.

Useful metadata usually includes:

  • source or repository
  • document type
  • effective date or version date
  • language
  • team, product, or policy domain
  • confidentiality level where relevant

Once you have this, natural-language questions can be paired with structured filtering. That reduces the burden on reranking and generation.

Without this layer, the model spends expensive tokens sorting out mistakes the ingestion pipeline should have prevented.
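The filtering step can be as simple as exact-match metadata checks applied before any vector search runs. A sketch over plain dicts; real vector stores expose the same idea as filter expressions:

```python
def prefilter(chunks, **filters):
    """Apply structured metadata filters before any vector search runs."""
    def matches(meta):
        return all(meta.get(k) == v for k, v in filters.items())
    return [c for c in chunks if matches(c["metadata"])]
```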

5. Sentence-window retrieval is useful, but only after the corpus is sane

Sentence-window and other fine-grained retrieval patterns can improve precision when a small factual span matters.

They help most when:

  • the corpus has already been cleaned
  • chunk identity is preserved
  • the pipeline can expand from the hit sentence to its local context

They help less when the underlying corpus is duplicated, stale, or structurally broken. In those cases the system becomes very precise about the wrong thing.

That is not progress.
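The expansion step itself is small, assuming the corpus is already split into sentences (the `expand_window` name is illustrative):

```python
def expand_window(sentences, hit_index, window=1):
    """Return the hit sentence plus its local neighbors for generation."""
    lo = max(0, hit_index - window)
    hi = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[lo:hi])
```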

6. Reranking is usually the second retrieval stage, not the first miracle

A reranker can materially improve quality. It cannot redeem a bad corpus.

The healthy order is:

  1. clean and structure the data
  2. retrieve a broad but plausible top-k
  3. rerank for final relevance
  4. pass only the smallest defensible set to generation

If you skip the first step, the reranker ends up choosing from junk with excellent confidence.
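Steps 2 and 3 of that order can be sketched with two scoring functions, one cheap and one expensive, both passed in by the caller (the names are illustrative; in practice the fast score is the vector similarity and the slow score is a cross-encoder):

```python
def retrieve_then_rerank(query, chunks, score_fast, score_slow,
                         k_broad=50, k_final=5):
    """Stage 1: broad, cheap candidate set. Stage 2: expensive rerank on it."""
    candidates = sorted(chunks, key=lambda c: score_fast(query, c),
                        reverse=True)[:k_broad]
    reranked = sorted(candidates, key=lambda c: score_slow(query, c),
                      reverse=True)
    return reranked[:k_final]
```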

7. Corpus QA deserves its own checklist

Every serious RAG system should have a corpus-quality pass that is separate from answer evals.

I want explicit checks for:

  • duplicate documents
  • stale superseded versions
  • broken OCR or parse failures
  • missing titles or headings
  • chunks with no document identity
  • malformed tables or invisible text
  • missing effective dates where the domain depends on them

This is tedious work. It is also cheaper than endlessly tuning prompts around a bad index.
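A few of these checks can run automatically at ingestion. A sketch, assuming documents arrive as dicts; the exact fields and checks will vary by corpus:

```python
import hashlib

def corpus_qa(docs):
    """Flag duplicates, missing titles, and missing effective dates."""
    issues, seen = [], {}
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode()).hexdigest()
        if digest in seen:
            issues.append((doc["id"], f"duplicate of {seen[digest]}"))
        else:
            seen[digest] = doc["id"]
        if not doc.get("title"):
            issues.append((doc["id"], "missing title"))
        if not doc.get("effective_date"):
            issues.append((doc["id"], "missing effective date"))
    return issues
```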

8. Most retrieval problems are ingestion problems wearing nicer clothes

When people say:

  • the retriever is inconsistent
  • the reranker is unstable
  • the model is missing the point

what they often mean is:

  • the documents were chunked badly
  • the titles were weak
  • duplicates survived
  • metadata was absent
  • document identity got lost

The pipeline then looks mysterious. It is not mysterious. It is underprepared.

9. The right question is not “what chunk size?”

The better question is:

What document unit should still make sense when retrieved on its own?

That answer changes by domain. The point is to make the decision consciously.

For technical docs, a heading-bounded section may be right. For contracts, it may be a clause plus local definitions. For regulations, it may be a section path with effective-date metadata.

There is no universal chunk size. There is only a better or worse fit for the corpus you actually have.

What I would fix first

If a RAG system felt shaky and I had one morning, I would start here:

  1. remove duplicates and stale versions
  2. improve chunk titles and section paths
  3. preserve parent-child identity
  4. add metadata filters for the main corpus dimensions
  5. inspect retrieval failures before touching the prompt

The model may still need work. The documents usually go first.
