The dangerous version of AI-assisted development is not the one that fails loudly. It is the one that gets to green by quietly lowering the meaning of green.
Test until green
The system keeps changing files until the suite stops complaining.
- tests are weakened casually
- design intent drifts
- the diff gets harder to review each round
Disciplined repair loop
The team diagnoses the mismatch before editing more files.
- code, tests, and docs are reconciled deliberately
- changes stay reversible
- green still means something
1. Green state is a trust condition
A repository is green when:
- the code builds
- the tests are honest
- the contracts still mean what the team thinks they mean
- the diff is understandable enough to own later
That is why "all checks pass" is necessary and still not sufficient.
An AI tool can get a suite to pass in many ways. Some of them are useful. Some of them are a form of polite vandalism.
2. The first diagnosis matters more than the fourth patch
When a change breaks tests, there are usually three possibilities:
- the production code is wrong
- the tests are wrong
- both drifted away from the intended behavior
This sounds banal. It is also where most AI-assisted repair loops go wrong.
A model is excellent at proposing edits. It is less reliable at deciding which layer deserves to move unless the design intent is visible.
That is why the repair loop should begin with explicit diagnosis:
- what behavior was intended?
- which file expresses that intention most credibly?
- is the failure about logic, contract, environment, or stale test assumptions?
Without that step, the agent starts bargaining with the suite.
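The diagnosis questions above can be made mechanical before any edit happens. A minimal sketch, assuming a made-up `Diagnosis` record (all names and categories here are illustrative, not from any tool):

```python
from dataclasses import dataclass

# Illustrative diagnosis record: each field is one of the questions above,
# and editing is blocked until every question has a real answer.
FAILURE_KINDS = {"logic", "contract", "environment", "stale-test"}

@dataclass
class Diagnosis:
    intended_behavior: str   # what behavior was intended?
    credible_source: str     # which file expresses that intent most credibly?
    failure_kind: str        # one of FAILURE_KINDS

    def ready_to_edit(self) -> bool:
        return (
            bool(self.intended_behavior.strip())
            and bool(self.credible_source.strip())
            and self.failure_kind in FAILURE_KINDS
        )
```

A blank or half-filled diagnosis blocks the edit; the point is that the bargaining never starts because the questions come first.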
3. Small diffs are not aesthetic. They are a safety mechanism.
If a fix touches too many files at once, review quality drops and diagnosis gets worse.
In AI-assisted development, small diffs matter even more because the model can produce plausible bulk edits faster than a human can audit them.
I prefer a sequence like this:
- reproduce failure
- isolate cause
- patch one layer
- rerun checks immediately
- continue only if the failure class is actually resolved
That sounds slower. It is often faster because you do not spend the afternoon untangling a 14-file patch that solved two symptoms and introduced five others.
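The sequence above can be sketched as a loop. `reproduce`, `patch_one_layer`, and `run_checks` are hypothetical hooks standing in for your own tooling; the round limit is the "stop and split" rule made explicit:

```python
# Sketch of the repair sequence: patch one layer, rerun checks immediately,
# and stop (rather than broaden the patch) when the rounds run out.
def repair_loop(reproduce, patch_one_layer, run_checks, max_rounds=5):
    failure = reproduce()            # reproduce the failure first
    rounds = 0
    while failure is not None:
        if rounds == max_rounds:
            return "stop-and-split"  # split the work; no more broad patches
        patch_one_layer(failure)     # smallest plausible layer only
        failure = run_checks()       # rerun checks immediately, not at the end
        rounds += 1
    return "green"
```

The design choice worth noticing is that `run_checks` sits inside the loop body, not after it: every patch pays for its verification before the next one is allowed.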
4. Tests are not sacred, but they are not disposable
There is nothing noble about preserving a broken test forever.
There is also nothing disciplined about deleting or weakening a test because it is blocking momentum.
A healthier standard is:
- change the test if the test encodes behavior the system should no longer have
- change the code if the test correctly describes intended behavior
- change both only when the design evolved and neither file fully reflects it anymore
The point is not to defend tests emotionally. The point is to keep the suite as a credible contract.
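That standard reduces to a small decision table. A sketch, where the two booleans are outputs of the diagnosis step in section 2 (the function name and labels are mine, for illustration):

```python
# Which layer moves, given whether each layer still matches the intended
# behavior. "investigate" covers the case where both layers match intent
# and something else (environment, contract) is actually failing.
def layer_to_change(test_matches_intent: bool, code_matches_intent: bool) -> str:
    if test_matches_intent and not code_matches_intent:
        return "change the code"   # the test correctly describes intent
    if code_matches_intent and not test_matches_intent:
        return "change the test"   # the test encodes behavior the system dropped
    if not test_matches_intent and not code_matches_intent:
        return "change both"       # the design evolved past both files
    return "investigate"
```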
5. AI helps most where structure exists
AI tools are strongest when the work has enough local structure to constrain the move:
- boilerplate
- repetitive refactors
- test scaffolding
- migration mechanics
- obvious consistency fixes
They are weaker when the task is mostly judgment:
- deciding the contract
- choosing the trade-off
- defining the rollback plan
- deciding what counts as "done"
That does not mean the tool is useless there. It means the engineer remains responsible for the frame.
6. Fast feedback beats one heroic repair session
The external research on software delivery (the DORA reports, for example) has been saying roughly the same thing for years: smaller changes and faster feedback loops are healthier.
AI does not repeal this. It intensifies it.
When generation is cheap, the temptation is to defer judgment. The better move is the opposite:
- run checks sooner
- fail sooner
- narrow sooner
- revert sooner when necessary
The agent can help move faster. It should not convince you that verification has become optional.
7. Reversibility is part of the design
A good AI-assisted workflow makes rollback easy.
That means:
- additive changes before invasive ones
- clear file ownership
- obvious commit boundaries
- avoiding mixed-purpose diffs
- preserving the ability to back out one move without losing the whole session
The codebase should not need a séance to understand what happened.
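One of these rules, avoiding mixed-purpose diffs, is easy to flag mechanically. An illustrative sketch with a made-up path-to-purpose mapping; it flags rather than bans, since a commit that legitimately touches two layers should simply say why:

```python
# Flag commits that mix purposes. The prefix map is an example layout,
# not a convention from any particular tool.
PURPOSES = {"src/": "code", "tests/": "tests", "docs/": "docs"}

def purposes(paths):
    """Set of purposes a commit touches, based on path prefixes."""
    return {
        purpose
        for path in paths
        for prefix, purpose in PURPOSES.items()
        if path.startswith(prefix)
    }

def single_purpose(paths):
    return len(purposes(paths)) <= 1
```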
8. AI-generated code still inherits operational responsibility
The system does not care whether a regression came from a human, a code model, or an enthusiastic afternoon.
What matters later is:
- who owns the failure mode
- whether the logging was good enough
- whether the contract remained legible
- whether the rollback path exists
That is why I still like a human-in-the-loop framing for engineering work. Not because the model is weak. Because responsibility still needs a home.
9. A disciplined green path
If I wanted a reliable default for AI-assisted repair work, I would keep it close to this:
- reproduce the failure
- diagnose code vs test vs design drift
- patch the smallest plausible layer
- rerun type-check and tests immediately
- stop when the diff is no longer easy to audit
- split the remaining work instead of gambling on one more broad patch
That is not the most cinematic workflow. It is the one I trust.
What I would forbid
A few anti-patterns deserve a plain ban:
- deleting tests without explicit rationale
- merging broad AI-generated refactors that no one can explain
- changing code and tests together without stating which layer was wrong
- using green CI as proof that the design is now healthier
A passing suite is good news. It is not a philosophy.