Evidence Before Done: Why Your AI Agent's "All Green" Is a Claim, Not a Fact

An agent reporting "done" is a claim, not a fact. We learned that the expensive way: a subagent closed a ticket green, the suite passed, and the one behavior the ticket actually asked for had never run. The tests it ran were adjacent to the change, not on it. Green suite, broken feature, confident summary.

This post is the rule we run now and the gate that enforces it: an agent does not get to say "done" until a check that could have failed actually observed the required behavior working.

Why "all green" is not evidence

Coding agents are good at producing a plausible completion message. That message is generated from the agent's belief about what it did, not from observing the result. Three failure shapes show up over and over:

Adjacent-green. The change touches module A, the green tests cover module B. Nothing exercised the new path. A passing suite that skips your diff proves the rest of the repo still works, nothing more.
Unfalsifiable check. The "test" cannot fail. An assertion against a mock that returns the value being asserted, a snapshot regenerated in the same run, a smoke test that swallows its own exit code. If there is no input that turns it red, it is decoration.
Wrong environment. The agent ran it locally against a stub and reported it working in production. The claim and the evidence came from different worlds.

None of these are model failures exactly. They are verification-design failures, and they happen with any agent capable enough to write a convincing summary.

The rule: evidence before done

One sentence: "done" means the specific behavior the task required was observed working, by a check that could have gone red, in the environment the claim is about.

Four properties make a check count as evidence. We bake these into the agent's instructions and into a gate hook:

On-path. A named check exercises the changed code path. If one does not exist, the agent writes it before claiming done. A green run that never touched the diff is not on-path.
Falsifiable. The check has a reachable failing state. If you cannot describe the input that turns it red, it is not a check.
Right environment. Evidence comes from the environment the claim names. A production claim carries production evidence: the prod URL responding, the prod error tracker clean after real traffic, the container reporting healthy.
First-hand. For anything outward-facing or destructive, the supervisor re-runs the check itself. A subagent's "all green" is the subagent's intent, not proof the supervisor has seen.

The gate that enforces it

Instructions drift. A rule the agent is supposed to remember is a rule it forgets under context pressure. So we stopped trusting the prose and added a gate that refuses the completion until evidence is attached.

The mechanism is boring on purpose: completing a task requires a small evidence record, and a check rejects the completion if the record is missing or stale. A minimal version of what we attach:

{
  "task": "add idempotency key to /charge",
  "claim": "endpoint rejects duplicate key within 24h",
  "check": "tests/charge_idempotency_test.py::test_duplicate_key_409",
  "ran_at": "2026-06-17T14:02:11Z",
  "result": "pass",
  "env": "staging",
  "command": "pytest tests/charge_idempotency_test.py -q"
}

The gate asserts three cheap things before it lets "done" through:

The named check exists and ran in this session (the ran_at is inside the work window, not copied from an old log).
result is pass and the command actually produced that result, re-run by the supervisor for outward-facing work.
env matches the claim. A claim that says "in prod" with env: local is rejected.

You can wire this as a stop-style hook in your agent runtime, a CI step that parses the evidence block out of the PR body, or a wrapper around your task runner. The shape matters more than the tooling: the completion path has to pass through something that can say no.

The gotcha that cost us a day: settle windows

Our first version of the "production is clean" check read the error tracker about 60 seconds after deploy, saw nothing, and reported clean. It was reading silence, not health. No real traffic had hit the new code yet, and on one deploy the release tag had not propagated to the events, so even the errors that did arrive were filed under the old version.

Two fixes. First, the clean read has to happen after real traffic, not after the deploy returns. Second, the check confirms the new release identifier is actually present on incoming events before it trusts a zero-error count. A quiet window right after deploy is the absence of evidence, and we were treating it as evidence of absence.

What this costs and what it buys

It is not free. Writing an on-path falsifiable check for work that did not have one is real time, and re-running checks first-hand instead of trusting the subagent adds a step. On throwaway scripts we skip most of it: "it ran and produced the output" is the whole bar for a one-off.

What it buys is that "done" stops being a coin flip. The class of bug where a confident agent hands you a broken feature with a green checkmark goes away, because the checkmark now requires a check that could have caught it. For anything that reaches a customer, moves money, or deletes data, that trade is not close.

The checklist we actually run

Path: a named check exercises the changed path; write it if it is missing.
Falsifiable: the check has a reachable red state.
Env: evidence comes from the environment the claim names.
First-hand: outward or destructive work: the supervisor re-ran it and read the result.
Settled: time-based reads waited for real signal (traffic, propagated tags), not just for the command to return.

If a task cannot clear all five, it is not done. It is in progress with a confident narrator. Make the completion path go through a gate, and the narrator stops being the one who decides.