Back to notes

Note

LLM Product Safety Without Theater

A practical guide to LLM product safety: prompt injection, excessive agency, unsafe outputs, evals, and sober boundaries.

5 min readBy Alex Chernysh
LLMSafetySecurityProduct

Most products do not fail because nobody mentioned safety. They fail because safety stayed a slide deck while the real system kept shipping around it.

Layered safety path
The healthy posture is layered: control what the model sees, what it can do, what can leave the system, and what gets reviewed later.

1. Safety is a product behavior, not a compliance mood

The word “safety” makes some teams think of policy binders and other teams think of censorship.

Neither reaction is especially useful.

In product terms, safety is simpler:

  • what the system may see
  • what it may do
  • what it may claim
  • what it must refuse
  • how failures are observed and contained

That is why the best safety work usually looks boring in code:

  • narrower permissions
  • clearer approvals
  • safer defaults
  • auditable traces
  • release gates around high-risk behavior

Boring is underrated here.

2. Prompt injection belongs in the normal threat model

OWASP's current LLM Top 10 still starts where it should: prompt injection.

That is not because prompt injection is fashionable. It is because too many systems still trust model-consumed text far more than they should.

The practical rule is plain:

Untrusted content should not be allowed to redefine the system's instructions or its permissions.

That means treating retrieved documents, emails, web pages, and third-party data as hostile by default where it matters.

A model that can read a document is not automatically allowed to obey the document.

3. Excessive agency is a design bug, not an aspirational feature

OWASP now explicitly calls out excessive agency, which is overdue.

The problem is not agency itself. The problem is broad permissions plus vague boundaries plus insufficient review.

The healthy pattern is narrower:

  • limited tool scopes
  • typed tool contracts
  • explicit approvals for durable side effects
  • reversible operations where possible
  • telemetry for every external action

If the system can email, purchase, delete, deploy, or mutate records, the permission model needs to be treated like product infrastructure, not prompt decoration.

4. Output validation matters because downstream systems are literal

Unsafe output is not only about offensive text.

It is also about:

  • malformed JSON entering a workflow
  • unvalidated SQL or code suggestions reaching execution paths
  • unsupported legal or medical claims being presented as confident answers
  • links, commands, or instructions that inherit too much trust from the interface

This is why OWASP's categories around insecure output handling and sensitive information disclosure stay practical. The output is often where a fuzzy model meets a literal system.

That meeting needs supervision.

5. Safety checks should sit where they can actually help

Not every defense belongs in the critical path.

A useful split is:

Critical path

Keep the checks that prevent immediate damage:

  • permission boundaries
  • output-schema validation
  • approval gates for dangerous actions
  • high-confidence blocks for known forbidden behavior

Monitoring and review

Keep the slower or noisier work here:

  • deeper red-team analysis
  • trend monitoring
  • judge-model grading
  • broad anomaly review

Teams often get this backward. They either overstuff the critical path with expensive checks or leave dangerous behavior to postmortems.

Neither is elegant.

6. Evals make safety work harder to fake

A good safety story should survive contact with an eval suite.

I want test cases for things like:

  • prompt injection attempts
  • unsupported-claim scenarios
  • unsafe tool-call proposals
  • data-exfiltration attempts
  • policy-bound refusal cases
  • escalation boundaries

Anthropic's recent writing on agent evals is useful here because it keeps returning to one simple discipline: define the task, define the grading logic, and measure repeatedly. Safety work gets better when it stops sounding like posture and starts sounding like test design.

7. Monitoring should explain decisions, not just count incidents

A security dashboard that tells you something bad happened is better than nothing.

A safer system tells you:

  • what input triggered the behavior
  • what context was present
  • which tool the system attempted to call
  • what policy or approval boundary fired
  • what finally reached the user or downstream system

Without that, the incident review becomes archaeology with worse morale.

8. The mature safety stack is sober

The systems I trust most do not feel paranoid. They feel disciplined.

They do not promise perfection. They do not claim the model is now safe in some mystical global sense. They simply reduce the number of ways the system can cause expensive trouble.

That is enough. It is also most of the job.

What I would implement first

If I were hardening an LLM product this month, I would do these in order:

  1. map the real side effects and data exposures
  2. narrow tool permissions and approval boundaries
  3. add high-value safety evals for the top risky behaviors
  4. validate outputs before they hit literal downstream systems
  5. improve telemetry until incident review stops feeling speculative

The ceremony can wait. The controls should not.

Further reading