Operator memo
Preventing Hallucinations in LLM Systems
A March 2026 playbook for groundedness: retrieval discipline, abstention, claim checks, evals, and guardrails.
Hallucination prevention is no longer one trick. It is a stack: better prompts, better retrieval, tighter output contracts, explicit abstention, and post-generation checks that know their own limits.
1. Stop treating hallucination as a model-only problem
Many hallucinations are system design failures.
The model is often blamed for facts it was never given, formats it was never shown, or policies it was never allowed to refuse. In production, hallucinations tend to come from one of five causes:
- missing or weak context
- ambiguous tasks
- unconstrained output shape
- weak refusal rules
- no downstream checks
If you only swap models, you might reduce the symptoms. You probably will not fix the disease.
2. Retrieval discipline beats retrieval volume
Groundedness improves when the system retrieves less but better.
The healthy pattern is:
- retrieve only documents relevant to the specific question
- preserve source identity through ranking and answer generation
- force the model to answer from the retrieved set or abstain
- inspect retrieval failures separately from generation failures
The common anti-pattern is to stuff the prompt with everything remotely related and hope the model becomes wiser through saturation. It usually becomes noisier instead.
3. Prompts should make uncertainty legal
Prompting matters most when it sets boundaries.
Good prompts for high-stakes answers do at least four things:
- define the task precisely
- define the expected output format
- define what counts as enough evidence
- explicitly allow the model to say the answer is unsupported
If the prompt implies that an answer must always be produced, the system will often produce one. That is not obedience. That is leakage from your incentive structure.
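A template that does all four things might look like this. The field names, thresholds, and wording are illustrative, not taken from any vendor's guidance:

```python
# Every boundary the section lists, stated explicitly in the prompt:
# task, output format, evidence threshold, and legal abstention.
GROUNDED_PROMPT = """\
Task: answer the user's question using ONLY the evidence below.

Output format: JSON with exactly two keys, "answer" (string) and
"citations" (list of evidence ids).

Evidence threshold: a claim is supported only if at least one evidence
item states it directly; do not combine items to infer facts none states.

If the evidence is insufficient, set "answer" to "unsupported" and
"citations" to []. An unsupported verdict is a valid, correct output.

Evidence:
{evidence}

Question: {question}
"""

prompt = GROUNDED_PROMPT.format(
    evidence="[e1] The SLA is 99.9%.",
    question="What is the uptime SLA?",
)
```

The last paragraph of the template is the one most systems omit, and it is the one that makes uncertainty legal.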
4. Use examples to shape outputs, not just style
Google's current Gemini prompt guidance remains refreshingly direct: use clear and specific instructions, and rely on few-shot examples to show the model the pattern you want. Their guidance also recommends consistent formatting across examples and positive patterns over anti-patterns.
That matters for hallucination prevention because examples do more than change tone. They narrow the response manifold.
A few good examples can teach the model to:
- cite only when evidence exists
- choose concise answers when support is thin
- preserve a strict JSON or markdown schema
- refuse when a field cannot be justified
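A minimal few-shot pair can teach most of that list at once: one grounded answer and one refusal over the same evidence, in the same strict format. These examples are invented for illustration:

```python
# Hypothetical few-shot pair: identical evidence, one supported answer,
# one refusal, both in the same JSON schema.
FEW_SHOT = [
    {
        "question": "What year was the retention policy last updated?",
        "evidence": "[doc-1] The retention policy was last updated in 2024.",
        "answer": '{"answer": "2024", "citations": ["doc-1"]}',
    },
    {
        "question": "Who approved the update?",
        "evidence": "[doc-1] The retention policy was last updated in 2024.",
        "answer": '{"answer": "unsupported", "citations": []}',
    },
]

def render_examples(shots: list[dict]) -> str:
    # Consistent formatting across examples, as the guidance recommends.
    return "\n\n".join(
        f"Evidence: {s['evidence']}\nQuestion: {s['question']}\nAnswer: {s['answer']}"
        for s in shots
    )
```

Note that both examples are positive patterns: the refusal is shown as a correct answer, not as an anti-pattern to avoid.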
5. Guardrails need to match their actual scope
Guardrails are useful when they are honest about what they can and cannot prove.
Amazon Bedrock's Automated Reasoning checks are a good example of a powerful but bounded tool. The current documentation is explicit: they are useful when you need to demonstrate the factual basis for a response, but they do not protect against prompt injection on their own, do not support streaming, and only validate the parts of the response captured by the policy variables.
That is exactly the right mental model for guardrails in general.
Use them for:
- formal policy checks
- structured domain rules
- controlled post-answer validation
Do not use them as a magical claim that the whole response has become truthful.
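A bounded guardrail can be made honest in code. This sketch is generic, not Bedrock's API; the rules are invented. The key design choice is that anything the policy does not name is reported as unvalidated, never as safe:

```python
# Invented policy rules for illustration; a real policy would be authored
# and versioned, not hard-coded.
POLICY_RULES = {
    "refund_window_days": lambda v: isinstance(v, int) and 0 <= v <= 90,
    "currency": lambda v: v in {"USD", "EUR", "GBP"},
}

def check_policy(response: dict) -> dict:
    # Validate only the fields captured by the policy; everything else
    # is out of scope, which is not the same as correct.
    checked, out_of_scope = {}, []
    for field, value in response.items():
        rule = POLICY_RULES.get(field)
        if rule is None:
            out_of_scope.append(field)   # the guardrail proves nothing here
        else:
            checked[field] = rule(value)
    return {"checked": checked, "out_of_scope": out_of_scope}
```

Surfacing `out_of_scope` downstream is what keeps the guardrail from becoming a magical claim of whole-response truthfulness.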
6. Claim-level verification is stronger than answer-level vibes
The best production systems now break answers into checks that can be evaluated independently.
Instead of asking, "Does this answer seem fine?", ask:
- which claims depend on retrieved evidence?
- which claims are policy-bound?
- which claims are numerical or date-sensitive?
- which claims should trigger abstention if unsupported?
This lets the system rewrite, trim, or block only the unsafe parts instead of throwing away the entire answer every time something looks suspicious.
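The decomposition can be sketched as claims that each carry their own independent check. How claims are extracted is out of scope here; the structure is the point:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Claim:
    text: str
    kind: str                        # "evidence", "policy", "numeric", "date"
    check: Callable[["Claim"], bool] # independent verifier for this claim

def review(claims: list[Claim]) -> dict:
    # Trim or block only the failing claims; keep the rest of the answer.
    verdict = {"keep": [], "block": []}
    for c in claims:
        verdict["keep" if c.check(c) else "block"].append(c.text)
    return verdict
```

A trivial evidence check makes the behavior concrete: supported claims survive, unsupported ones are blocked, and nothing else is thrown away.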
7. Evals catch regressions that prompt reviews will miss
OpenAI's current eval guidance is still the right operational lens: if you care about accuracy in production, build evals into the release process.
For hallucination prevention, I like a layered eval pack:
- grounded answer checks
- unsupported-claim refusal checks
- citation integrity checks
- structured-output checks
- risky-domain red-team cases
The important part is not just the dataset. It is the habit: run the same checks after prompt changes, retrieval changes, model swaps, and ranking tweaks.
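A harness for that habit can be very small. This is a generic sketch, not any vendor's eval framework; `system` is whatever callable maps a question to an answer, and the layer names mirror the pack above:

```python
def run_eval_pack(system, cases):
    # Run the same layered checks after every prompt change, retrieval
    # change, model swap, and ranking tweak; return what regressed.
    failures = []
    for case in cases:
        answer = system(case["question"])
        if not case["check"](answer):
            failures.append((case["layer"], case["question"]))
    return failures
```

Wiring this into the release process, rather than running it ad hoc, is what catches the regressions a prompt review will miss.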
8. Streaming creates a special problem
Streaming makes users happier. It also shortens your time to regret.
Because streaming emits partial text, you cannot rely on a final answer check alone. If sensitive or unsupported content is not allowed to appear in public, you need one of these approaches:
- scoped generation that makes the bad answer unlikely
- buffered or delayed streaming for guarded fields
- chunk-level sanitization with withheld tails
- post-checks only for non-streaming, high-stakes paths
This is one reason formal post-generation validation tools often sit behind non-streaming or semi-buffered flows.
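The withheld-tail approach is the easiest of these to sketch. This is a simplified illustration that only handles exact banned phrases; real sanitizers need normalization and smarter matching:

```python
def guarded_stream(chunks, banned, tail=32):
    # Chunk-level sanitization with a withheld tail: hold back the last
    # `tail` characters so a banned phrase split across chunk boundaries
    # cannot leak. `tail` must be at least the longest banned phrase.
    buf = ""
    for chunk in chunks:
        buf += chunk
        for phrase in banned:
            buf = buf.replace(phrase, "[withheld]")
        if len(buf) > tail:
            yield buf[:-tail]
            buf = buf[-tail:]
    for phrase in banned:
        buf = buf.replace(phrase, "[withheld]")
    yield buf
```

The cost is visible: the user's perceived latency grows by the tail length, which is exactly the buffering trade-off that pushes formal validation toward non-streaming paths.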
9. The best fallback is a useful refusal
A good refusal is not a generic apology. It is a precise boundary.
Examples of useful fallbacks:
- "The retrieved material does not support a reliable answer."
- "I can summarize the available facts, but I should not infer beyond them."
- "This claim needs a source-backed check before I answer directly."
The refusal should preserve trust and keep the next step obvious.
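One way to enforce that discipline is to make refusals data, not free-form generation. The table below reuses the three fallbacks above; the reason codes and next steps are hypothetical:

```python
# Hypothetical fallback table: each refusal pairs a precise boundary
# with an obvious next step, instead of a generic apology.
FALLBACKS = {
    "no_evidence": (
        "The retrieved material does not support a reliable answer.",
        "Broaden the search or ask a narrower question.",
    ),
    "thin_evidence": (
        "I can summarize the available facts, but I should not infer beyond them.",
        "Ask for the summary, or supply more sources.",
    ),
    "unverified_claim": (
        "This claim needs a source-backed check before I answer directly.",
        "Run the claim through verification first.",
    ),
}

def refuse(reason: str) -> str:
    boundary, next_step = FALLBACKS[reason]
    return f"{boundary} Next step: {next_step}"
```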
A production checklist
Before I trust an LLM system in public, I want these answers:
- Can it abstain?
- Can it show what it used?
- Can it prove the answer shape?
- Can it survive a model swap without changing its truthfulness posture?
- Can I explain a failure without reenacting a séance?
If the answer is no, the system is not done. It is merely fluent.