Verification Throughput Is the New Bottleneck in AI-Assisted Coding
Source: martinfowler
Chris Parsons quietly updated his guide to using AI for software development for the third time, and Martin Fowler flagged it in his April 29 fragments. The guide is worth reading on its own terms. What I want to pull on is one specific line Fowler highlighted, because it captures the real shift that has happened in the last twelve months of AI-assisted coding:
Verified used to mean read by you. With modern agent throughput, it has to mean checked by tests, by type checkers, by automated gates, or by you where your judgement matters.
That sentence is doing a lot of work. It is the difference between using Copilot in 2023 and running Claude Code or Codex CLI in 2026. And it reframes the bottleneck of software engineering in a way that most teams have not yet adjusted to.
The throughput problem nobody planned for
When autocomplete-style AI showed up, the human review loop scaled fine. You looked at a four-line suggestion, accepted or rejected it, and moved on. The cognitive overhead per accepted token was low because the volume was low.
Agentic harnesses broke that. A single prompt to Claude Code or OpenAI’s Codex CLI can set off a run that produces a multi-file diff touching hundreds of lines, runs the test suite, fixes its own type errors, and comes back with a summary. Simon Willison has been tracking this evolution for a couple of years now and draws the same line Parsons does. There is vibe coding, where you do not look at the output, and there is agentic engineering, where you do. The catch with the second is that the volume of output has outpaced your eyeballs.
If you try to read every line, you become the bottleneck. Your team’s effective throughput collapses to whatever you can review in a day, which is roughly what it was before the agents arrived. You bought a Ferrari and parked it in traffic.
If you stop reading, you ship hallucinated APIs, subtle logic errors, and security holes. The recent supply chain incidents involving prompt-injected agents are a warning shot, not the whole problem.
The resolution Parsons and Fowler land on is the only one that scales: move verification off the human and onto machinery, except for the parts that genuinely need human judgement.
What machine-side verification actually looks like
This is where the abstract advice gets concrete. “Build guardrails” means something specific when the producer is an agent running in a loop.
Type systems are doing more work than ever. A strict TypeScript or Rust compiler catches a remarkable fraction of the failure modes that LLMs produce: invented method names, wrong argument shapes, nullability confusion. If you are writing Python, pyright in strict mode is no longer optional. Turning the dial up was an annoying tax when humans wrote all the code; when an agent writes it, strictness is cheap insurance.
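To make the nullability case concrete, here is a minimal sketch. The names `User`, `find_user`, and the in-memory `_USERS` table are hypothetical, invented for illustration. The point is the commented line: dereferencing an optional value without a check is the kind of thing a type checker such as pyright reports before a human ever looks at the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    name: str
    email: str


# Hypothetical in-memory lookup table, standing in for a real data store.
_USERS: dict[int, User] = {1: User(name="Ada", email="ada@example.com")}


def find_user(user_id: int) -> Optional[User]:
    """Return the user, or None when the id is unknown."""
    return _USERS.get(user_id)


def email_domain(user_id: int) -> str:
    """Return the domain part of the user's email, or "" for unknown users."""
    user = find_user(user_id)
    # An agent might write `return user.email.split("@")[1]` directly.
    # A type checker should flag that, since `user` can be None here.
    if user is None:
        return ""
    return user.email.split("@")[1]
```

With the `None` check in place the function type-checks and behaves sensibly for unknown ids; without it, the failure would only surface at runtime on the unhappy path.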
Tests as oracles, not documentation. The traditional argument for tests was regression protection and design pressure. With agents in the loop, tests become the ground truth the agent iterates against. Claude Code’s harness will run your test command repeatedly and feed failures back into context until they pass. If your test suite is slow or flaky, the agent’s loop becomes useless. This pushes hard toward fast unit tests, deterministic fixtures, and parallel execution. The economic value of a one-second test run versus a thirty-second test run is now measured in agent iterations per hour.
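The iterations-per-hour point is simple arithmetic, sketched below. The 20 seconds of agent time per cycle is an assumed figure for illustration, not a measurement of any particular tool.

```python
def iterations_per_hour(test_seconds: float, agent_seconds: float) -> float:
    """Edit-test cycles per hour, given test runtime and agent time per cycle."""
    cycle = test_seconds + agent_seconds
    if cycle <= 0:
        raise ValueError("cycle time must be positive")
    return 3600.0 / cycle


# Assumed: the agent spends 20 seconds per cycle reading failures and editing.
fast = iterations_per_hour(test_seconds=1, agent_seconds=20)   # 3600 / 21
slow = iterations_per_hour(test_seconds=30, agent_seconds=20)  # 3600 / 50
```

Under that assumption the one-second suite allows roughly 171 cycles per hour and the thirty-second suite allows 72. The gap widens as the agent’s own time per cycle shrinks, because test runtime then dominates the loop.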
Pre-commit hooks as gates. Tools like lefthook or pre-commit are no longer just developer hygiene. They are the last automated checkpoint before agent output enters your repository. Linters, formatters, secret scanners, and lightweight static analysis all belong here.
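As a rough sketch of what such a gate does, here is a toy secret scanner in Python. It checks for a single pattern, the common `AKIA` prefix of AWS access key IDs, and the function names are my own. A real scanner covers far more patterns, so treat this as an illustration of the mechanism rather than something to deploy.

```python
import re
from pathlib import Path

# AWS access key IDs are 20 characters and commonly start with "AKIA".
# A real secret scanner checks many more patterns than this one.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_secrets(text: str) -> list[int]:
    """Return 1-based line numbers that contain a suspected key ID."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if AWS_KEY_ID.search(line)
    ]


def main(paths: list[str]) -> int:
    """Scan the given files; return a non-zero exit code if anything matches."""
    failed = False
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        for lineno in find_secrets(text):
            print(f"{path}:{lineno}: possible AWS access key ID")
            failed = True
    return 1 if failed else 0
```

Wired into a hook, the script would be invoked with the staged file names and would exit with the value `main` returns, so a non-zero result blocks the commit.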
Property-based and fuzz testing for higher-stakes code. Hypothesis for Python, proptest for Rust, and fast-check for TypeScript can probe the corners that example-based tests miss. The cost-benefit shifted: writing a property test takes about as long as writing five examples, and review effort then grows more slowly than the agent’s output, because a single property covers inputs you would otherwise have to enumerate by hand.
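Here is the shape of a property test, hand-rolled with only the standard library so it runs anywhere. This is a stand-in, not Hypothesis itself: a real library adds smarter input generation and shrinking of failing cases. The function under test, `dedupe_keep_order`, is hypothetical.

```python
import random


def dedupe_keep_order(items: list[int]) -> list[int]:
    """Hypothetical function under test: drop duplicates, keep first occurrences."""
    seen: set[int] = set()
    result: list[int] = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def check_properties(trials: int = 500, seed: int = 0) -> None:
    """Assert invariants on many random inputs; a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    for _ in range(trials):
        items = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        out = dedupe_keep_order(items)
        assert len(out) == len(set(out))      # no duplicates remain
        assert set(out) == set(items)         # nothing lost, nothing invented
        assert dedupe_keep_order(out) == out  # applying it twice changes nothing
        # Elements appear in the order of their first occurrence in the input.
        assert out == sorted(set(items), key=items.index)


check_properties()
```

Four assertions stand in for hundreds of hand-written examples, and they keep holding no matter how the agent rewrites the function body.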
Sandboxed execution. Codex CLI and Claude Code can both run commands inside OS-level sandboxes: Seatbelt on macOS, and Landlock or bubblewrap on Linux. Whether the sandbox is on by default depends on the tool and its configuration, so confirm it rather than assume it. The sandbox is part of verification too, in the sense that it limits the blast radius of a confidently wrong agent.
The five-approaches claim
Fowler quotes Parsons:
A team that can generate five approaches and verify all five in an afternoon will outpace a team that generates one and waits a week for feedback.
This is the part I think most teams underestimate. It is not just that agents write code faster. It is that the cost of exploring alternatives has dropped sharply, but only if your verification pipeline can keep up.
Think about the implicit math. If verifying an approach takes a week, you generate one approach and commit to it. If verification takes an afternoon, you generate five and pick the best. The quality of the final design is bounded by how many alternatives you can afford to evaluate, and that ceiling just rose.
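One way to put numbers on that intuition is a toy model, and it is only that. Assume each candidate approach gets an independent quality score drawn uniformly from 0 to 1, and that verification reliably picks the best one. Neither assumption comes from Parsons or Fowler. Under them, the expected quality of the best of n candidates is n / (n + 1).

```python
import random


def expected_best(n: int) -> float:
    """Expected maximum of n independent Uniform(0, 1) quality scores."""
    return n / (n + 1)


def simulated_best(n: int, trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    total = sum(max(rng.random() for _ in range(n)) for _ in range(trials))
    return total / trials


one = expected_best(1)   # 0.5
five = expected_best(5)  # 5/6, about 0.83
```

In this model, going from one verified approach to five lifts the expected result from 0.5 to about 0.83, and the returns diminish after that. Real design quality is not a uniform random draw, but the shape of the curve is the argument: the first few extra alternatives are worth the most.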
This is the same dynamic Bret Victor described in 2012 when he argued for tighter feedback loops in creative work, except now the creative agent is an LLM and the feedback loop is your CI pipeline. Teams that invested in fast, reliable, machine-readable verification over the past decade are reaping a compounding advantage they did not specifically plan for.
Where human judgement still matters
Parsons’ qualifier, in the line Fowler highlighted, matters: “or by you where your judgement matters.” Not all verification can or should be automated. The places I keep seeing humans add irreplaceable value:
- API design and naming. Tests verify behavior, not whether the abstraction is the right one. Agents are happy to invent five plausible APIs; deciding which one your future maintainers will thank you for is a taste call.
- Security boundaries. Static analysis catches some things, but threat modeling against a new feature still requires a human who understands the system’s trust assumptions.
- Performance under realistic load. Microbenchmarks lie. Knowing whether a change matters in production usually requires understanding the workload, and the agent typically has no view of it.
- Cross-cutting consistency. An agent working in a single PR cannot tell you whether the pattern it just used contradicts a convention used in twelve other places in the codebase. Tools like ast-grep and architectural fitness functions help, but the judgement call about whether to enforce a convention is still yours.
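The fitness-function idea from the last bullet can be sketched in a few lines with Python’s `ast` module. The layering rule here (nothing in one layer may import the `web` package) and the function name are hypothetical. The human decides the rule; the check just enforces it on every diff.

```python
import ast


def forbidden_imports(source: str, banned_prefix: str) -> list[str]:
    """Return imported module names in `source` that fall under `banned_prefix`."""
    hits: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        hits.extend(
            name
            for name in names
            if name == banned_prefix or name.startswith(banned_prefix + ".")
        )
    return hits


# Hypothetical convention: code in the `domain` layer must not import `web`.
sample = "import json\nfrom web.handlers import render\nimport web.auth as auth\n"
violations = forbidden_imports(sample, "web")
```

Run over the relevant files in CI, a check like this turns a convention that lives in someone’s head into a gate the agent’s loop can hit and correct against.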
The practical move is to deliberately route work to the right verifier. Mechanical correctness goes to the type checker and the test suite. Local design choices go to a code review with reasonable scope. Cross-system architectural choices stay with the human, ideally before the agent starts work, in the form of a written spec or an ADR.
What this means for tooling investment
If I were starting a new project today, the items at the top of my setup checklist would look different from the ones I would have picked two years ago. Strict type checking on day one. A test runner that finishes in under ten seconds for unit tests. Pre-commit hooks that catch the obvious stuff before the agent’s diff hits review. A CI pipeline that runs in single-digit minutes. Maybe a mutation testing pass on the critical paths to make sure the tests actually test something.
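For readers who have not met mutation testing, here is the idea in miniature, with a hand-written mutant rather than a real tool. The function `is_adult` and both test suites are hypothetical. A mutation testing tool generates mutants like this automatically and reports which ones your tests fail to catch.

```python
from typing import Callable


def is_adult(age: int) -> bool:
    """Hypothetical function under test."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """The same function with one operator mutated (>= became >)."""
    return age > 18


def weak_suite(fn: Callable[[int], bool]) -> bool:
    """Passes if fn behaves correctly far from the boundary only."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn: Callable[[int], bool]) -> bool:
    """Also pins down the boundary values."""
    return weak_suite(fn) and fn(18) is True and fn(17) is False


# A mutant "survives" a suite when the suite still passes against it.
weak_survives = weak_suite(is_adult_mutant)      # True: the mutant goes unnoticed
strong_survives = strong_suite(is_adult_mutant)  # False: the boundary test kills it
```

Both suites pass against the original, so both look green. Only the mutant reveals that the weak suite would let an agent flip the comparison without anyone noticing.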
None of this is new advice. What is new is the economic argument. Before, fast verification was a productivity nice-to-have. Now it is the rate limiter on how much agent throughput you can convert into shipped software. A team with a slow test suite has effectively capped its AI-assisted output, regardless of which model it pays for.
Parsons’ guide is good because it is concrete and it has been updated three times against real practice. The deeper lesson is that the bottleneck has moved. The interesting engineering question of 2026 is not which model to use; it is how fast and how thoroughly your machinery can verify what the model produces.