Adversarial Verification for AI Agents: Make the Checker Try to Fail

The failure mode: an agent that grades its own homework

The single most common reason an agent loop ships broken work is that the same context that produced the work also signs off on it. You ask the agent "did you finish?" and of course it says yes. It just spent its whole window convincing itself the plan was good. Asking it to check that plan is asking a witness to cross-examine themselves.

We hit this hard early on. A build agent would report "all tests passing, feature complete," and the actual behavior was wrong in a way a green test suite never touched. The test ran. It just tested the wrong thing, and the agent that wrote both the code and the test had no incentive to notice. Related-tests-passing is adjacent to done. It is not proof of done.

Adversarial verification is the pattern we use to close that gap. The short version: the thing that checks the work is a separate agent, with a separate context, and its job is not to confirm the work. Its job is to break it.

What "adversarial" actually means here

Most "LLM-as-judge" setups are cooperative by default. You hand a model the output and ask "is this good?" and it pattern-matches toward yes, because that is the agreeable completion. A cooperative judge catches typos and obvious nonsense. It does not catch the subtle wrong thing, which is the thing you actually needed caught.

Adversarial means you flip the incentive in the prompt. The verifier is told, plainly, that the work is probably wrong and its job is to find where. Same model, very different behavior. Here is the system prompt shape we use:

You are a verifier. The work below claims to satisfy the task spec.
Assume it does NOT until you have specific evidence otherwise.

For each acceptance criterion:
  1. State the exact observation that would prove it met.
  2. Find that observation in the artifact, or report it missing.
  3. A criterion with no checkable observation is FAILED, not passed.

Output: per-criterion PASS / FAIL / UNVERIFIABLE, with the
quoted evidence for every PASS. No evidence, no PASS.

The load-bearing line is the last one. "No evidence, no PASS" is what stops the verifier from doing the cooperative thing and waving work through on vibes. It forces the check into a shape where the default is failure and a pass has to be earned with a specific, quotable observation.

The rubric has to be able to go red

This is the part people skip, and skipping it makes the whole pattern theater. A verification step that cannot fail is decoration. If your rubric is "is the output high quality and helpful," every run passes, because that criterion has no red state. You have added a token cost and a warm feeling and zero signal.

Write criteria that name an observation that could be absent:

Bad: "The function handles errors gracefully." (No way to fail this on inspection. Graceful is a vibe.)
Good: "On an empty input list, the function returns [] rather than raising. Show the code path, or the test that exercises it."
Bad: "The summary is accurate."
Good: "Every numeric claim in the summary appears verbatim in the source. List each number and its source line, or mark it unverifiable."

The test for a rubric is simple: take a deliberately broken artifact and run it through. If it does not come back red, the rubric is decoration and you should rewrite it before trusting a single green.

The loop, wired so feedback actually changes the next attempt

Verification on its own is a report. To make it a harness pattern, the failed criteria have to feed back into another build attempt, and you need a stopping rule so you do not loop forever. The shape we run:

build → verify (adversarial) → 
  if all PASS with evidence: done
  if any FAIL/UNVERIFIABLE: 
    feed the specific failed criteria + evidence gap back to builder
    re-build, addressing only those
  cap at N iterations, then escalate to a human with the open criteria

A few things that bit us, so they do not bite you:

Pass the specific gap, not "try again." The builder needs the exact criterion that failed and the missing observation. "Criterion 3 failed: no test exercises the empty-input path" produces a fix. "Verification failed, retry" produces a reroll, and rerolls regress as often as they improve.

Keep the verifier's context separate. If the verifier shares the builder's transcript, it inherits the builder's rationalizations. A fresh window that sees only the task spec and the artifact is the whole point. This is also why a single agent told to "now check your work" is weaker than a second call: same window, same blind spots.

Cap the iterations. Three rebuild attempts on one failing criterion usually means the criterion is wrong, the task is underspecified, or the model genuinely cannot do it. Looping a fourth time mostly burns tokens. Escalate with the open criteria attached so the human starts from the actual blocker.

Cost, and when it is worth it

This roughly doubles the model spend on any task you wrap in it, plus the rebuild iterations. That is real, and it is not free, so do not wrap everything. We reserve adversarial verification for work where a wrong result is expensive to catch downstream: anything that writes to a real system, sends something to a customer, moves money, or feeds a decision we will not manually re-check. For a throwaway script or a draft a human reads next anyway, the cheaper move is to just let the human be the verifier.

One efficiency note from running it: a smaller, faster model can often hold the verifier role, because checking a specific criterion against an artifact is a narrower task than producing the work. We have had the builder on a stronger model and the verifier on a cheaper one, and the cheaper verifier still catches the failures the builder rationalized past, as long as the rubric is concrete. Test that split on your own task before trusting it; the right verifier model depends on how subtle the failures you are hunting are.

If you take one thing

The pattern is two ideas, and both have to be present. First, the checker is adversarial and context-isolated, so it is not invested in the answer. Second, the rubric names observations that can be absent, so a green is earned rather than assumed. Drop either one and you are back to an agent grading its own homework, just with more steps. Build a deliberately broken artifact, send it through your verifier, and watch it go red before you trust the next green.