
Operator memo

Best Practices for Building Agentic AI Systems in 2026

A practical March 2026 playbook for agent systems: tool contracts, approvals, retrieval discipline, evals, and telemetry.

6 min read · By Alex Chernysh
Tags: AI · Agents · Architecture · Evals

Agent systems are no longer impressive because they can call tools. The useful question in 2026 is whether they can do that predictably, leave evidence behind, and stop when they should.

Small-loop default
The useful pattern is usually narrower than the first architecture draft.

1. Prefer workflows until you have earned agents

The fastest way to build a fragile system is to begin with a roaming planner because it feels advanced. Anthropic's March 2026 guidance still makes the clean distinction: workflows are better when the steps are known, and agents are better when model-driven adaptation is actually required at runtime.

That sounds obvious, yet teams still skip the boring question: is there a fixed sequence here?

If the task is deterministic enough, a workflow gives you three things for free:

  • easier debugging
  • clearer latency and cost expectations
  • smaller blast radius when the model misreads the room

Use an agent only after the workflow version has become too rigid to cover the real cases.

2. Treat tool use as an API contract, not a personality trait

OpenAI's current Agents guidance is explicit about tools as first-class building blocks: web search, file search, retrieval, MCP/connectors, shell, computer use, and other callable interfaces are all just execution surfaces. That means the quality bar is the same as for any API integration.

Good tool contracts have four properties:

  1. narrow input schema
  2. obvious failure modes
  3. explicit permissions
  4. deterministic post-processing

A tool is not "smart" because an LLM can invoke it. It is useful because the contract is narrow enough that the model has fewer ways to be clever at the wrong moment.
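The four properties above can be sketched in code. This is a minimal, illustrative contract for one hypothetical tool (the names `LookupRequest`, `ALLOWED_REGIONS`, and the backend payload shape are assumptions, not a real API): the input schema is narrow, failures are an enumerable type, and post-processing is deterministic.

```python
from dataclasses import dataclass

ALLOWED_REGIONS = {"us", "eu"}  # explicit permissions: the model cannot pick others

@dataclass(frozen=True)
class LookupRequest:
    customer_id: str
    region: str

class ToolError(Exception):
    """Explicit, enumerable failure mode -- the caller can branch on it."""

def validate(req: LookupRequest) -> LookupRequest:
    # Narrow input schema: reject anything outside the contract
    # before model-chosen arguments touch a real system.
    if not req.customer_id.isalnum() or len(req.customer_id) > 32:
        raise ToolError("invalid customer_id")
    if req.region not in ALLOWED_REGIONS:
        raise ToolError(f"region must be one of {sorted(ALLOWED_REGIONS)}")
    return req

def postprocess(raw: dict) -> dict:
    # Deterministic post-processing: the model always sees a fixed
    # shape, never the raw backend payload.
    return {"customer_id": raw["id"], "status": raw.get("status", "unknown")}
```

The point is not the validation library; it is that every way the tool can be misused is rejected or normalized before execution.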

3. Retrieval is part of control, not just context

A surprising number of agent failures are retrieval failures wearing a reasoning costume.

In practice, grounded systems behave better when retrieval is treated as a control layer:

  • retrieve only what the step needs
  • rerank or filter before generation
  • preserve document identity through the whole run
  • let the model decline when support is thin instead of asking it to guess politely

This matters even more in agentic flows, because one weak retrieval step can poison every later action. The agent then looks busy, but it is merely wrong in several stages at once.
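Treated as a control layer, a retrieval step looks roughly like this sketch. The score threshold, support count, and hit fields (`doc_id`, `score`, `text`) are illustrative assumptions; the behavior that matters is filtering before generation, preserving document identity, and declining when support is thin.

```python
MIN_SCORE = 0.6     # below this, a chunk does not count as support
MIN_SUPPORT = 2     # fewer supporting chunks than this -> decline

def select_context(hits: list[dict]) -> dict:
    # Filter and rerank before generation; never pass raw hits through.
    supported = [h for h in hits if h["score"] >= MIN_SCORE]
    supported.sort(key=lambda h: h["score"], reverse=True)
    if len(supported) < MIN_SUPPORT:
        # Decline rather than ask the model to guess politely.
        return {"action": "decline", "context": [], "sources": []}
    top = supported[:4]  # retrieve only what the step needs
    return {
        "action": "generate",
        "context": [h["text"] for h in top],
        "sources": [h["doc_id"] for h in top],  # document identity survives the run
    }
```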

4. Human approval should sit on the expensive edge

Approval gates should not appear everywhere. They should appear where the system crosses a boundary that a human would care about later.

Typical approval points:

  • sending or deleting something
  • changing financial or legal state
  • writing code or infrastructure with real side effects
  • answering with confidence in a high-stakes domain

Everything else should be automated, logged, and reversible. The goal is not to make the human do more work. The goal is to reserve human attention for the moves that create durable consequences.
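A gate on the expensive edge can be as small as this sketch. The action names in `IRREVERSIBLE` are placeholders; the shape to copy is that approval is a property of the boundary being crossed, not of every step.

```python
# Actions that create durable consequences (illustrative names).
IRREVERSIBLE = {"send_email", "delete_record", "transfer_funds", "apply_migration"}

def needs_approval(action: str, *, high_stakes: bool = False) -> bool:
    # Irreversible side effects and high-stakes answers get a human;
    # everything else stays automated, logged, and reversible.
    return action in IRREVERSIBLE or high_stakes

def execute(action: str, log: list, high_stakes: bool = False) -> str:
    if needs_approval(action, high_stakes=high_stakes):
        log.append(("approval_requested", action))
        return "pending_approval"
    log.append(("executed", action))
    return "done"
```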

5. Memory is useful only when it is scoped and inspectable

Teams often say they want memory when what they really want is continuity.

Those are not the same thing.

Useful agent memory tends to fall into three buckets:

  • run state: what happened in this thread
  • durable preferences: user or system defaults that are stable enough to reuse
  • retrievable artifacts: receipts, decisions, summaries, or outputs that can be reloaded later

What you do not want is a sentimental blob of previous text that no one can audit. If the memory cannot be inspected, expired, or replayed from source artifacts, it will become mythology.
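The three buckets can be made concrete with a small store in which every entry carries its source artifact and an expiry, so nothing becomes mythology. The API here is an assumption, not a real framework; the properties that matter are scoping, inspection, and expiry.

```python
import time

class AgentMemory:
    # Scoped buckets mirroring the three kinds above.
    BUCKETS = ("run_state", "preferences", "artifacts")

    def __init__(self):
        self._store = {b: [] for b in self.BUCKETS}

    def write(self, bucket, key, value, source, ttl_seconds):
        assert bucket in self.BUCKETS
        self._store[bucket].append({
            "key": key,
            "value": value,
            "source": source,  # replayable: points back at the artifact
            "expires_at": time.time() + ttl_seconds,
        })

    def read(self, bucket, key):
        # Expired entries are invisible; memory cannot outlive its TTL.
        now = time.time()
        live = [e for e in self._store[bucket]
                if e["key"] == key and e["expires_at"] > now]
        return live[-1]["value"] if live else None

    def inspect(self, bucket):
        # Everything is auditable: dump the bucket as-is.
        return list(self._store[bucket])
```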

6. Evals are the operating system

OpenAI's current eval guidance is the right mental shift: go-live work is not just about latency or cost, but production best practices, safety, and accuracy optimization. In agent systems, evals stop being a research accessory and become the thing that lets you sleep.

The strongest eval setups in 2026 usually combine:

  • task success checks
  • tool-call correctness checks
  • groundedness or citation checks
  • refusal and escalation checks
  • latency and cost budgets

An agent without evals is just a workflow you have chosen not to measure yet.
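The five check types combine naturally into one eval function per recorded run. The run-record fields below are assumptions about what your harness logs, not any particular framework's schema.

```python
def eval_run(run: dict, budget_ms: int = 5000, budget_usd: float = 0.05) -> dict:
    checks = {
        # Task success: did the run produce the expected answer?
        "task_success": run["answer"] == run["expected"],
        # Tool-call correctness: right tools, right order.
        "tool_calls_ok": run["tool_calls"] == run["expected_tool_calls"],
        # Groundedness: every citation points at a retrieved document.
        "grounded": all(c in run["retrieved_docs"] for c in run["citations"]),
        # Refusal/escalation: if escalation was required, it happened.
        "escalated_when_required": (not run["should_escalate"]) or run["escalated"],
        # Latency and cost budgets.
        "within_budget": run["latency_ms"] <= budget_ms
                         and run["cost_usd"] <= budget_usd,
    }
    return {"passed": all(checks.values()), "checks": checks}
```

Each check is cheap on its own; the value comes from running all of them on every change, so a regression in one dimension cannot hide behind improvement in another.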

7. Telemetry should explain decisions, not just errors

Most teams now log failures. Fewer teams log reasoning boundaries, tool choices, approval branches, retrieval snapshots, and policy triggers.

That missing context is what makes agent incidents annoying.

At minimum, you want telemetry for:

  • tool selected
  • arguments used
  • documents retrieved
  • policy or guardrail events
  • human approval requests
  • final answer shape and confidence posture

The ideal trace lets another engineer answer: why did this system believe it was allowed to do that?
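A decision-level trace event, as opposed to an error log, can be as simple as this sketch. The field names follow the list above; the schema itself is an assumption.

```python
import json
import time

def trace_event(tool, args, doc_ids, policy_events, approval, answer_shape):
    event = {
        "ts": time.time(),
        "tool": tool,                   # tool selected
        "args": args,                   # arguments as the model chose them
        "retrieved": doc_ids,           # snapshot of document identities
        "policy_events": policy_events, # guardrail or policy triggers, if any
        "approval": approval,           # e.g. "requested", "granted", "not_needed"
        "answer_shape": answer_shape,   # e.g. "cited_answer", "refusal"
    }
    # One JSON line per decision keeps traces grep-able and replayable.
    return json.dumps(event, sort_keys=True)
```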

8. The winning pattern is smaller than people expect

The production pattern I trust most still looks modest:

  1. classify the task
  2. retrieve only relevant context
  3. choose from a constrained set of tools
  4. execute with receipts
  5. run checks
  6. answer or escalate

That is not glamorous. It is also why it works.
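The six steps above fit in one legible function. Every callable here is a placeholder the reader would supply; the structure, not the implementations, is the point of the sketch.

```python
def run_agent(task, classify, retrieve, choose_tool, execute, checks, escalate):
    kind = classify(task)                    # 1. classify the task
    context = retrieve(task, kind)           # 2. retrieve only relevant context
    tool = choose_tool(kind)                 # 3. choose from a constrained tool set
    receipt = execute(tool, task, context)   # 4. execute with a receipt
    if all(check(receipt) for check in checks):  # 5. run checks
        return {"status": "answered", "receipt": receipt}
    return escalate(task, receipt)           # 6. answer or escalate
```

Every branch is visible in seven lines, which is exactly the legibility the orchestration layer should preserve as it grows.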

A calm default
Use the smallest loop that preserves evidence, boundaries, and eval coverage.

What I would implement first

If I were tightening an agent stack tomorrow morning, I would do these in order:

  1. narrow the tool contracts
  2. add explicit approval boundaries
  3. add receipt logging for every external action
  4. create a small eval pack for the top ten failure modes
  5. simplify the orchestration until every branch is legible

The usual mistake is to start with more autonomy. The healthier move is almost always more legibility.

Further reading