· 6 min read ·

Linters as the First Sensor: What Static Analysis Buys You When an Agent Is Doing the Typing

Source: martinfowler

Birgitta Böckeler has started a series on martinfowler.com about keeping codebases maintainable when coding agents do most of the typing. The framing comes from her earlier piece on harness engineering, where she described an agent harness as a loop of guides (things that steer the model toward good output) and sensors (things that detect when the output went sideways). The first sensor she examines is the most boring one in the toolbox: a linter.

It is a good place to start, and worth dwelling on. Linters were already underused by humans. The moment an LLM is producing the diff, their economics change completely.

Why a linter is a sensor, not a style nag

The usual defence of a linter is that it enforces a house style so code reviews stop bikeshedding over semicolons. That framing undersells it. A modern linter is a cheap, fast, deterministic static analyzer that catches a specific class of defects before the code runs. ESLint’s no-unused-vars finds dead bindings. no-floating-promises in typescript-eslint flags an async call you forgot to await, which is one of the more common ways an LLM silently drops error handling. Ruff’s B008 catches the Python mutable-default-argument footgun. Go’s errcheck finds ignored error returns.

In an agent loop, each of those rules is a sensor that fires in milliseconds and produces a structured message the model can read. The whole point of Böckeler’s framing is that you want as many of those as you can get, because the alternative signal is a human reviewer reading the diff hours later, or a production incident reading it days later.

The economics flip when the model is typing

With a human author, the cost of a linter false positive is friction: the developer is interrupted, has to read the rule, and either fixes the code or adds a suppression comment. That friction is the reason teams disable rules, raise thresholds, and let warnings rot in CI output.

With an agent author, the cost of a false positive is a few extra tokens. The harness pipes the lint output back into the next turn, the model reads it, and the model either fixes the issue or justifies a suppression. The pain that humans feel from a noisy linter mostly disappears, which means the optimal lint configuration for an agent-driven codebase is stricter than the optimal configuration for a human-driven one.

This flips a long-running argument. The Go team has historically been conservative about adding vet checks because every false positive costs the entire ecosystem real time. Rust’s clippy ships hundreds of lints behind opt-in groups (clippy::pedantic, clippy::nursery, clippy::restriction) precisely because most of them are too noisy to enable by default for humans. Agents have a higher tolerance for that noise, so the pedantic groups become more attractive.

Picking sensors that catch agent failure modes

Not every lint rule is equally useful as a sensor. The ones with the highest signal are the ones that catch the failure modes LLMs actually exhibit. A few that come up over and over in agent diffs:

  • Unused imports and variables. Models love to add an import “just in case” or leave a variable around after refactoring. ESLint’s no-unused-vars, Ruff’s F401/F841, and Go’s built-in unused-import error catch these.
  • Unawaited promises. A model writing TypeScript will sometimes call await on the wrong line or forget it entirely. @typescript-eslint/no-floating-promises and @typescript-eslint/require-await are cheap insurance.
  • Shadowed identifiers. When a model rewrites a function it occasionally reintroduces a name that already exists in scope. no-shadow and shadow in staticcheck catch this.
  • Dead code after returns. Models sometimes leave the old implementation dangling. ESLint’s no-unreachable and staticcheck’s SA4006 catch the obvious cases.
  • Inconsistent return types. TypeScript’s noImplicitReturns and consistent-return flag the case where one branch returns a value and another returns undefined, which is a frequent symptom of a half-finished edit.

You can run all of these in well under a second on a typical file, which makes them ideal to wire into a pre-commit hook or, better, into the agent’s own inner loop so it sees the failures before it claims it is done. The pre-commit framework has had this pattern for humans for years; the only change is that the consumer of the output is now a model, not a developer.

Type checkers are linters with longer runtime

Böckeler’s first installment focuses on “basic code linting,” but the same argument extends to type checkers. tsc --noEmit, mypy, pyright, and flow are all sensors in the same sense: deterministic static analyzers that produce structured messages an agent can act on. They are slower, often by an order of magnitude, but they catch a different and complementary set of defects.

The interesting question is where to put them in the loop. Running tsc on every file edit is wasteful when you have a fifty-thousand-file project. Tools like tsc --build with project references, or pyright’s watch mode, exist precisely so you do not pay the full cost on every change. An agent harness should treat them the way an IDE does: incremental, in the background, with a feedback channel back to the model.

What the source does not cover, and what to watch for

The martinfowler.com piece is explicitly the first in a series, and it stops at linting. There are at least four more sensor categories worth thinking about now, before subsequent installments arrive:

  1. Test coverage as a sensor. Not coverage as a metric to optimize, but coverage as a way to detect that the agent added a function and forgot to test it. Tools like c8 and coverage.py can produce per-diff reports that a harness can read.
  2. Mutation testing for agent-written tests. Stryker and mutmut catch the case where the model writes a test that passes but does not actually assert anything meaningful. Slow, but cheap relative to a human review.
  3. Architectural sensors. Tools like dependency-cruiser, import-linter, or go-arch-lint enforce module boundaries. An agent that does not know about a layering rule will cheerfully violate it; a sensor that fails the build is more reliable than a sentence in CLAUDE.md.
  4. Semantic diff review. semgrep lets you write structural rules like “no console.log in production code” or “all HTTP handlers must call the auth middleware.” These are the closest thing to a linter for your specific codebase’s invariants.

Each of these has a different latency, a different false-positive rate, and a different cost. The harness designer’s job is to schedule them so the model gets the cheapest signals first and the expensive ones only when the cheap ones pass.

The under-discussed tradeoff

The risk in all this is not that the sensors are too strict. It is that the agent learns to satisfy the sensors without satisfying the underlying intent. A model that keeps getting no-floating-promises errors will eventually start sprinkling void operators in front of every promise to silence the rule, which is technically valid and semantically wrong. A model that keeps getting coverage warnings will write tests that exercise lines without asserting behavior.

This is Goodhart’s Law applied to coding agents, and it is the reason Böckeler’s framing of “sensors” is more useful than “checks” or “gates.” A sensor is supposed to be one of many. The harness needs enough of them, pointing in different directions, that gaming any single one does not get the agent past the others. A linter alone is not a maintainability strategy; a linter plus a type checker plus a test suite plus a mutation tester plus an architectural rule plus a human spot-check is closer.

The first installment of Böckeler’s series is a solid argument for the cheapest sensor in that stack. The interesting work is in deciding which of the more expensive ones earn their place next.

Was this interesting?