Which Query Transformation Techniques Actually Help RAG?

Query rewrite, decomposition, step-back prompting, HyDE, fusion, and when each one is worth the extra latency.

7 min read · By Alex Chernysh
RAG · Retrieval · Prompting · Architecture

Query transformation is useful when it fixes a specific retrieval failure. It becomes expensive theatre the moment it is added because the architecture diagram looked lonely.

Targeted transformation

The query is reshaped to solve a known retrieval problem.

  • better recall on underspecified questions
  • better routing to the right corpus slice
  • measurable gain in top-k quality

Transformation by habit

The system adds more steps because more steps look advanced.

  • latency goes up
  • failure analysis gets murkier
  • the retriever still misses for the old reasons

1. Query transformation is not one technique

People often talk about query transformation as if it were a single pattern. It is not.

The common families do different jobs:

  • rewrite the query into a clearer version
  • decompose one question into several smaller ones
  • create a more abstract step-back question
  • generate a hypothetical answer or document, as in HyDE
  • run several retrieval variants and fuse the results

If you treat them as interchangeable, you end up comparing methods that solve different problems and then drawing very confident nonsense from the result.

2. Rewrite first if the user query is the actual problem

The simplest case is still common: the user asks a vague, shorthand, or context-dependent question.

Examples:

  • "What changed after the last one?"
  • "Can we do that under the policy?"
  • "How long is it now?"

These are hard to retrieve against directly. A rewrite can help by restoring missing nouns, narrowing time references, or making the target object explicit.

This is usually the cheapest transformation. It is also the easiest to overuse.

If the original query is already specific, a rewrite often adds latency without adding signal.
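A minimal sketch of this gate, assuming a placeholder `llm` callable for whatever completion function your stack provides; the specificity heuristic (word count plus a question mark) is illustrative, not a recommendation:

```python
def rewrite_query(query: str, recent_turns: list[str], llm) -> str:
    """Restore missing nouns and time references before retrieval."""
    context = "\n".join(recent_turns[-3:])  # last few conversation turns only
    prompt = (
        "Rewrite the user question so it is self-contained and specific.\n"
        f"Conversation so far:\n{context}\n"
        f"Question: {query}\n"
        "Rewritten question:"
    )
    return llm(prompt).strip()

def maybe_rewrite(query: str, recent_turns: list[str], llm) -> str:
    """Skip the rewrite call when the query already looks specific."""
    if len(query.split()) >= 8 and "?" in query:
        return query  # likely explicit already; save the latency
    return rewrite_query(query, recent_turns, llm)
```

The point of the guard is the last sentence above: a rewrite should only run when the original phrasing is the actual retrieval problem.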

3. Decomposition helps when one answer needs several retrieval passes

Decomposition is useful when the user thinks they asked one question but the corpus requires several lookup moves.

Typical cases:

  • compare two policies
  • answer a question with both definition and exception paths
  • compute a result from several retrieved facts

In these cases, a single retrieval pass may underperform because each sub-question has a different evidence locus.

The catch is obvious: more retrieval passes mean more latency, more fusion logic, and more ways to contaminate the final context with irrelevant material.

I use decomposition when the task genuinely needs several evidence pulls. I avoid it when the real issue is simply poor corpus preparation.
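A decompose-then-retrieve pass can be sketched as follows; `decompose` and `retrieve` are stand-ins for your own LLM and retriever calls (both assumptions here), and the merge simply dedupes while preserving order so each sub-question's evidence locus gets covered:

```python
def decomposed_context(question: str, decompose, retrieve, k: int = 4) -> list[str]:
    """One retrieval pass per sub-question, merged with order-preserving dedup."""
    sub_questions = decompose(question) or [question]  # fall back to the original
    seen, merged = set(), []
    for sub in sub_questions:
        for chunk in retrieve(sub, k=k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

Even this small sketch shows where the cost lives: every extra sub-question is another retrieval pass and another chance to pull irrelevant material into the final context.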

4. Step-back prompting is for concept-level retrieval, not for drama

Step-back prompting works by first asking a broader or more abstract question, then retrieving against that abstraction as well as the original query.

This is useful when the direct query is too concrete and misses the concept that actually governs the answer.

For example, a narrow operational question may retrieve better once the system also asks a broader question about the policy principle or legal category involved.

The gain is usually conceptual recall. The cost is another model call and another retrieval branch.

If the corpus is already well-structured and the original query is good, step-back often does very little. If the user is asking around a concept they cannot quite name, it can help a lot.
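In sketch form, step-back is just a second retrieval branch against an abstraction of the query; `llm` and `retrieve` are again placeholders for your own calls, and the prompt wording is an assumption:

```python
def step_back_retrieve(query: str, llm, retrieve, k: int = 4) -> list[str]:
    """Retrieve against the original query and a more abstract version of it."""
    abstract = llm(
        "Rephrase this question at the level of the general principle "
        f"or category it falls under: {query}"
    )
    seen, merged = set(), []
    for q in (query, abstract):  # direct branch first, abstraction second
        for chunk in retrieve(q, k=k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

The extra model call and retrieval branch are visible in the code, which is exactly why it should earn its place with measured conceptual recall.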

5. HyDE is a retrieval trick, not a truth trick

HyDE works by generating a hypothetical answer or document, embedding that synthetic text, and retrieving based on it.

Its use case is straightforward: sometimes the user query is too short or awkward to anchor good semantic retrieval, but a plausible synthetic answer produces a better embedding target.

This can improve recall. It can also retrieve beautifully around the wrong idea if the hypothetical answer drifts.

That is why I treat HyDE as a retrieval aid, not as a smartness multiplier. It should be measured against top-k quality, not admired in the abstract.
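A toy sketch of the mechanism, with a bag-of-words embedding standing in for a real dense embedding model and `llm` as a placeholder completion function (both assumptions):

```python
import math

def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words embedding; a real system would use a dense model."""
    counts: dict[str, float] = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0.0) + 1.0
    return counts

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a.get(t, 0.0) * w for t, w in b.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_retrieve(query: str, corpus: list[str], llm, k: int = 3) -> list[str]:
    hypothetical = llm(f"Write a short passage answering: {query}")
    target = embed(hypothetical)  # embed the synthetic answer, not the query
    return sorted(corpus, key=lambda d: cosine(embed(d), target), reverse=True)[:k]
```

Note what the code makes explicit: retrieval is anchored entirely on the hypothetical text, so if that text drifts, the retrieval drifts with it.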

6. Fusion is valuable when several weak retrieval views become one strong set

Fusion methods combine several retrieval branches and merge results, often with reciprocal-rank-style logic.

This is attractive when different query variants surface different relevant chunks.

It is less attractive when:

  • all branches mostly retrieve the same material
  • the corpus is small enough that one good retrieval pass already covers the space
  • reranking is strong enough that fusion adds little except cost

Fusion can absolutely work. It just has a habit of looking useful in architecture diagrams long before it proves useful in production.
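The reciprocal-rank variant is small enough to show whole. This is the standard RRF scoring, score(d) = Σ 1 / (k + rank) over branches, with the conventional k = 60 as a default:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked retrieval branches by reciprocal-rank scoring."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

If your branches mostly agree, the fused order will look almost identical to any single branch, which is the cheap empirical check for the first bullet above.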

7. Measure the right thing: retrieval gain per unit of latency

The practical metric here is not “did a clever transformation run?”

It is something closer to:

How much top-k evidence quality did we buy per additional millisecond and per extra failure mode?

For each transformation, I want to know:

  • top-k recall or hit quality before and after
  • reranker lift before and after
  • latency added
  • failure classes improved
  • failure classes introduced

Without that, teams end up shipping query pipelines that are verbose, slow, and only spiritually better.
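The first two bullets reduce to a before/after comparison that is easy to automate. A minimal sketch, assuming you already have an evaluation set of queries with labeled relevant chunk ids:

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Mean fraction of relevant chunks found in the top-k, per query."""
    per_query = [
        len(set(r[:k]) & rel) / max(len(rel), 1)
        for r, rel in zip(retrieved, relevant)
    ]
    return sum(per_query) / len(per_query)

def gain_per_ms(recall_before: float, recall_after: float, added_latency_ms: float) -> float:
    """Evidence-quality gain bought per extra millisecond of latency."""
    return (recall_after - recall_before) / added_latency_ms
```

A transformation whose `gain_per_ms` rounds to zero on your own miss set is the "spiritually better" case described above.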

8. Most systems should use fewer techniques than they think

If the corpus is well prepared and the query is usually decent, the default stack can stay small:

  • direct retrieval
  • optional rewrite for low-quality user phrasing
  • rerank
  • answer

Only add more when a specific class of misses persists.

The usual order I trust is:

  1. improve corpus quality
  2. improve direct retrieval
  3. add reranking
  4. only then test transformations selectively

This is less exciting than a diagram with five branches. It is also easier to debug.

9. A useful first matrix

If I had to choose quickly:

Symptom → better first move:

  • query is vague or elliptical → rewrite
  • one answer depends on several distinct facts → decomposition
  • direct question misses governing concept → step-back
  • semantic recall is weak on short or awkward queries → HyDE
  • several query variants each surface useful evidence → fusion
  • retrieval misses because the corpus is messy → fix ingestion first

That last row is doing a lot of work. It deserves to.

What I would implement first

I would not build all five techniques and pray.

I would:

  1. collect real retrieval misses
  2. label them by failure mode
  3. test one transformation per failure class
  4. keep only the transformations that improve evidence quality enough to justify the delay
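Steps 1–3 above can be sketched as a simple tally over labeled misses; the failure-mode labels and the candidate map are illustrative assumptions, not a fixed taxonomy:

```python
from collections import Counter

# One candidate transformation per failure class (illustrative labels).
CANDIDATE = {
    "vague_query": "rewrite",
    "multi_fact": "decomposition",
    "missing_concept": "step_back",
}

def plan_experiments(labeled_misses: list[str], min_count: int = 5) -> dict[str, str]:
    """Only test a transformation where a failure class actually recurs."""
    counts = Counter(labeled_misses)
    return {
        label: CANDIDATE[label]
        for label, n in counts.items()
        if n >= min_count and label in CANDIDATE
    }
```

The threshold is the discipline: a failure mode seen twice does not justify a permanent pipeline stage.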

The system does not need a richer theory of prompts. It needs a better reason for every extra step.
