Most RAG Failures Start in the Documents

Chunking, titles, metadata, parent-child structure, reranking, and corpus QA for RAG systems.

6 min read · By Alex Chernysh
RAG · Retrieval · Data Quality · AI Engineering

When a RAG system fails, the model usually gets the blame. The documents usually had a head start.

Flat corpus

Documents are dumped into the index with minimal structure.

  • weak titles
  • generic chunking
  • missing metadata
  • duplicates and stale versions survive ingestion

Prepared corpus

The corpus preserves identity, structure, and retrieval hints.

  • chunks know what document they belong to
  • titles carry real signal
  • metadata and filters narrow the search space
  • stale and low-quality inputs are caught early

Ingestion path

The useful RAG pipeline starts before embeddings.
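The difference between the two corpora can be made concrete with a minimal chunk record that keeps its identity through the pipeline. This is a sketch, not a library type; the field names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """One retrievable unit that keeps its identity through the pipeline."""
    text: str
    doc_id: str                  # which document it belongs to
    title: str                   # human-meaningful, not "Section 4"
    section_path: tuple = ()     # e.g. ("Termination", "Notice periods")
    metadata: dict = field(default_factory=dict)  # version, source, dates

chunk = Chunk(
    text="Notice must be given 90 days before renewal.",
    doc_id="msa-2024-v3",
    title="Notice periods for termination in enterprise plans",
    section_path=("Termination", "Notice periods"),
    metadata={"effective_date": "2024-01-01", "source": "legal"},
)
```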

1. Chunking is not a clerical step

Teams still talk about chunking as if it were a preprocessing chore. It is usually one of the main retrieval decisions.

Good chunks do three jobs at once:

  • they are small enough to match the question precisely
  • they preserve enough context to remain intelligible
  • they carry enough identity to be trusted later

That balance is why fixed-size chunking ages badly.

The better starting point is structural chunking:

  • Markdown by headings
  • policies by sections
  • contracts by clauses and definitions
  • case law by facts, reasoning, and holdings
  • docs with tables or figures by layout-aware extraction where possible

You can still backstop this with token limits. But the structure should speak first.
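A sketch of that order of operations, structure first with a size backstop, assuming Markdown input. The `chunk_markdown` name and the character-based limit are illustrative; a real pipeline would count tokens:

```python
import re

def chunk_markdown(text, max_chars=1200):
    """Split on headings first; hard-split only sections that exceed the limit."""
    sections, current = [], []
    for line in text.splitlines():
        # a new heading closes the current section
        if re.match(r"^#{1,6}\s", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    # backstop: structure speaks first, the size limit only catches outliers
    chunks = []
    for s in sections:
        while len(s) > max_chars:
            chunks.append(s[:max_chars])
            s = s[max_chars:]
        chunks.append(s)
    return chunks
```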

2. Strong titles quietly improve retrieval

One of the simplest upgrades is also one of the most neglected: give chunks useful titles.

A chunk called "Section 4" tells the retriever almost nothing.

A chunk called "Notice periods for termination in enterprise plans" gives both the retriever and the later generator a better chance.

This is not glamorous. It is effective.

A surprising amount of retrieval quality comes from small signals like:

  • document title
  • section path
  • subsection name
  • version or effective date
  • source type

If those fields are noisy, the vector index has to work harder than it should.
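One low-effort way to put these signals to work is to prefix each chunk's body with its title path before embedding, so the vector carries the identity too. A sketch; the function name and separator are illustrative:

```python
def embeddable_text(doc_title, section_path, body):
    """Prefix the chunk body with its title path so the embedding sees it."""
    title = " / ".join([doc_title, *section_path])
    return f"{title}\n\n{body}"

text = embeddable_text(
    "Enterprise MSA 2024",
    ("Termination", "Notice periods"),
    "Notice must be given 90 days before renewal.",
)
```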

3. Parent-child structure is still one of the best trade-offs

Small chunks retrieve better. Larger chunks are easier for the model to read.

That tension never went away.

Parent-child retrieval remains one of the cleanest compromises:

  • index smaller child chunks for matching
  • return the larger parent section for reading
  • preserve the link between them all the way to generation

This gives you better recall without forcing the model to answer from a pile of disconnected sentence fragments.

It also keeps provenance cleaner, because the retrieved evidence still belongs to a recognizable section rather than an orphaned paragraph with good embeddings and no life.
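A minimal sketch of the parent-child link, assuming simple character-based child splitting for brevity (names are illustrative; in practice the children would come from sentence or token boundaries):

```python
def build_parent_child(sections, child_size=300):
    """Index small children for matching; keep a map back to the parent."""
    children, parent_of = [], {}
    for pid, section in enumerate(sections):
        for i in range(0, len(section), child_size):
            cid = len(children)
            children.append(section[i:i + child_size])  # what gets embedded
            parent_of[cid] = pid                        # the link we must not lose
    return children, parent_of

def retrieve_parent(hit_child_id, sections, parent_of):
    """Match on the child, read from the parent."""
    return sections[parent_of[hit_child_id]]
```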

4. Metadata narrows the search space before the model has to clean up the mess

RAG systems often look smarter when they are simply allowed to search a smaller, more relevant space.

Useful metadata usually includes:

  • source or repository
  • document type
  • effective date or version date
  • language
  • team, product, or policy domain
  • confidentiality level where relevant

Once you have this, natural-language questions can be paired with structured filtering. That reduces the burden on reranking and generation.

Without this layer, the model spends expensive tokens sorting out mistakes the ingestion pipeline should have prevented.
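The filtering step can be as simple as exact-match metadata checks applied before any vector search runs. A sketch over plain dicts; real vector stores expose the same idea as filter expressions:

```python
def prefilter(chunks, **filters):
    """Apply structured metadata filters before any vector search runs."""
    def matches(meta):
        return all(meta.get(k) == v for k, v in filters.items())
    return [c for c in chunks if matches(c["metadata"])]
```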

5. Sentence-window retrieval is useful, but only after the corpus is sane

Sentence-window and other fine-grained retrieval patterns can improve precision when a small factual span matters.

They help most when:

  • the corpus has already been cleaned
  • chunk identity is preserved
  • the pipeline can expand from the hit sentence to its local context

They help less when the underlying corpus is duplicated, stale, or structurally broken. In those cases the system becomes very precise about the wrong thing.

That is not progress.
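The expansion step itself is small, assuming the corpus is already split into sentences (the `expand_window` name is illustrative):

```python
def expand_window(sentences, hit_index, window=1):
    """Return the hit sentence plus its local neighbors for generation."""
    lo = max(0, hit_index - window)
    hi = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[lo:hi])
```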

6. Reranking is usually the second retrieval stage, not the first miracle

A reranker can materially improve quality. It cannot redeem a bad corpus.

The healthy order is:

  1. clean and structure the data
  2. retrieve a broad but plausible top-k
  3. rerank for final relevance
  4. pass only the smallest defensible set to generation

If you skip the first step, the reranker ends up choosing from junk with excellent confidence.
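Steps 2 and 3 of that order can be sketched with two scoring functions, one cheap and one expensive, both passed in by the caller (the names are illustrative; in practice the fast score is the vector similarity and the slow score is a cross-encoder):

```python
def retrieve_then_rerank(query, chunks, score_fast, score_slow,
                         k_broad=50, k_final=5):
    """Stage 1: broad, cheap candidate set. Stage 2: expensive rerank on it."""
    candidates = sorted(chunks, key=lambda c: score_fast(query, c),
                        reverse=True)[:k_broad]
    reranked = sorted(candidates, key=lambda c: score_slow(query, c),
                      reverse=True)
    return reranked[:k_final]
```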

7. Corpus QA deserves its own checklist

Every serious RAG system should have a corpus-quality pass that is separate from answer evals.

I want explicit checks for:

  • duplicate documents
  • stale superseded versions
  • broken OCR or parse failures
  • missing titles or headings
  • chunks with no document identity
  • malformed tables or invisible text
  • missing effective dates where the domain depends on them

This is tedious work. It is also cheaper than endlessly tuning prompts around a bad index.
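A few of these checks can run automatically at ingestion. A sketch, assuming documents arrive as dicts; the exact fields and checks will vary by corpus:

```python
import hashlib

def corpus_qa(docs):
    """Flag duplicates, missing titles, and missing effective dates."""
    issues, seen = [], {}
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode()).hexdigest()
        if digest in seen:
            issues.append((doc["id"], f"duplicate of {seen[digest]}"))
        else:
            seen[digest] = doc["id"]
        if not doc.get("title"):
            issues.append((doc["id"], "missing title"))
        if not doc.get("effective_date"):
            issues.append((doc["id"], "missing effective date"))
    return issues
```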

8. Most retrieval problems are ingestion problems wearing nicer clothes

When people say:

  • the retriever is inconsistent
  • the reranker is unstable
  • the model is missing the point

what they often mean is:

  • the documents were chunked badly
  • the titles were weak
  • duplicates survived
  • metadata was absent
  • document identity got lost

The pipeline then looks mysterious. It is not mysterious. It is underprepared.

9. The right question is not “what chunk size?”

The better question is:

What document unit should still make sense when retrieved on its own?

That answer changes by domain. The point is to make the decision consciously.

For technical docs, a heading-bounded section may be right. For contracts, it may be a clause plus local definitions. For regulations, it may be a section path with effective-date metadata.

There is no universal chunk size. There is only a better or worse fit for the corpus you actually have.

What I would fix first

If a RAG system felt shaky and I had one morning, I would start here:

  1. remove duplicates and stale versions
  2. improve chunk titles and section paths
  3. preserve parent-child identity
  4. add metadata filters for the main corpus dimensions
  5. inspect retrieval failures before touching the prompt

The model may still need work. The documents usually go first.
