Note

LLM Product Safety Without Theater

A practical guide to LLM product safety: prompt injection, excessive agency, unsafe outputs, evals, and sober boundaries.

March 9, 20265 min readBy Alex Chernysh

LLMSafetySecurityProduct

Jump to section

Most products do not fail because nobody mentioned safety. They fail because safety stayed a slide deck while the real system kept shipping around it.

Layered safety path

The healthy posture is layered: control what the model sees, what it can do, what can leave the system, and what gets reviewed later.

1. Safety is a product behavior, not a compliance mood

The word “safety” makes some teams think of policy binders and other teams think of censorship.

Neither reaction is especially useful.

In product terms, safety is simpler:

what the system may see
what it may do
what it may claim
what it must refuse
how failures are observed and contained

That is why the best safety work usually looks boring in code:

narrower permissions
clearer approvals
safer defaults
auditable traces
release gates around high-risk behavior

Boring is underrated here.

2. Prompt injection belongs in the normal threat model

OWASP's current LLM Top 10 still starts where it should: prompt injection.

That is not because prompt injection is fashionable. It is because too many systems still trust model-consumed text far more than they should.

The practical rule is plain:

Untrusted content should not be allowed to redefine the system's instructions or its permissions.

That means treating retrieved documents, emails, web pages, and third-party data as hostile by default where it matters.

A model that can read a document is not automatically allowed to obey the document.

3. Excessive agency is a design bug, not an aspirational feature

OWASP now explicitly calls out excessive agency, which is overdue.

The problem is not agency itself. The problem is broad permissions plus vague boundaries plus insufficient review.

The healthy pattern is narrower:

limited tool scopes
typed tool contracts
explicit approvals for durable side effects
reversible operations where possible
telemetry for every external action

If the system can email, purchase, delete, deploy, or mutate records, the permission model needs to be treated like product infrastructure, not prompt decoration.

4. Output validation matters because downstream systems are literal

Unsafe output is not only about offensive text.

It is also about:

malformed JSON entering a workflow
unvalidated SQL or code suggestions reaching execution paths
unsupported legal or medical claims being presented as confident answers
links, commands, or instructions that inherit too much trust from the interface

This is why OWASP's categories around insecure output handling and sensitive information disclosure stay practical. The output is often where a fuzzy model meets a literal system.

That meeting needs supervision.

5. Safety checks should sit where they can actually help

Not every defense belongs in the critical path.

A useful split is:

Critical path

Keep the checks that prevent immediate damage:

permission boundaries
output-schema validation
approval gates for dangerous actions
high-confidence blocks for known forbidden behavior

Monitoring and review

Keep the slower or noisier work here:

deeper red-team analysis
trend monitoring
judge-model grading
broad anomaly review

Teams often get this backward. They either overstuff the critical path with expensive checks or leave dangerous behavior to postmortems.

Neither is elegant.

6. Evals make safety work harder to fake

A good safety story should survive contact with an eval suite.

I want test cases for things like:

prompt injection attempts
unsupported-claim scenarios
unsafe tool-call proposals
data-exfiltration attempts
policy-bound refusal cases
escalation boundaries

Anthropic's recent writing on agent evals is useful here because it keeps returning to one simple discipline: define the task, define the grading logic, and measure repeatedly. Safety work gets better when it stops sounding like posture and starts sounding like test design.

7. Monitoring should explain decisions, not just count incidents

A security dashboard that tells you something bad happened is better than nothing.

A safer system tells you:

what input triggered the behavior
what context was present
which tool the system attempted to call
what policy or approval boundary fired
what finally reached the user or downstream system

Without that, the incident review becomes archaeology with worse morale.

8. The mature safety stack is sober

The systems I trust most do not feel paranoid. They feel disciplined.

They do not promise perfection. They do not claim the model is now safe in some mystical global sense. They simply reduce the number of ways the system can cause expensive trouble.

That is enough. It is also most of the job.

What I would implement first

If I were hardening an LLM product this month, I would do these in order:

map the real side effects and data exposures
narrow tool permissions and approval boundaries
add high-value safety evals for the top risky behaviors
validate outputs before they hit literal downstream systems
improve telemetry until incident review stops feeling speculative

The ceremony can wait. The controls should not.

1. Safety is a product behavior, not a compliance mood

2. Prompt injection belongs in the normal threat model

3. Excessive agency is a design bug, not an aspirational feature

4. Output validation matters because downstream systems are literal

5. Safety checks should sit where they can actually help

Critical path

Monitoring and review

6. Evals make safety work harder to fake

7. Monitoring should explain decisions, not just count incidents

8. The mature safety stack is sober

What I would implement first

Related reading

Further reading