
Same LLM, Different Worlds: Why Developers Talk Past Each Other on AI Coding Tools

Source: lobsters

There is a recurring argument in developer spaces that follows a predictable structure. One developer says LLMs have transformed their workflow. Another developer says the same LLMs are mostly noise. They both pull out concrete observations. The code suggestions are often wrong, but fast. The model doesn’t understand the broader codebase. It excels at boilerplate. It hallucinates APIs. They’re describing the same tools producing the same behaviors. Yet somehow they end up on opposite sides.

Baldur Bjarnason’s piece frames this as a structural problem: there are two fundamentally different programming worlds, and the same LLM behavior looks like a net positive from inside one and a net negative from inside the other. That framing is right, but it’s worth pulling apart why the same error rate produces different outcomes depending on context, because the mechanics are specific enough to be useful.

What the Two Worlds Look Like in Practice

The clearest axis of division is between greenfield work and maintenance work on existing systems. A developer building a new side project, a weekend tool, a small SaaS MVP, or an isolated service is working in conditions that are structurally favorable to LLM assistance. The codebase is small, often fits in a context window, has few hidden invariants, and has a low cost for mistakes. Errors surface quickly and are easy to revert.

A developer maintaining a production system that has accumulated years of implicit constraints, undocumented behavior, subtle performance requirements, and cross-cutting concerns that span dozens of files is working in almost the opposite conditions. The context window cannot hold enough of the system to reason about it coherently. Mistakes don’t surface in isolation; they propagate through call chains and data pipelines in ways that are expensive to trace.

Both developers observe: “The model writes plausible-looking code quickly but makes mistakes at a roughly consistent rate.” The greenfield developer finds that rate acceptable. The maintenance developer does not. Neither observation is wrong.

Error Cost Is the Variable, Not Error Rate

This is the crux of the divergence. LLMs have a roughly stable error rate for any given task complexity. What differs wildly between the two worlds is the cost of those errors.

In greenfield work, an error caught in review or in a first run of tests is cheap. You discard the suggestion or fix a single function. The correction is local because the code doesn’t have deep entanglement yet. The model’s speed-to-first-draft advantage outweighs the correction overhead.

In maintenance work, the same error rate produces different outcomes. A hallucinated method call in an obscure library might pass code review from someone unfamiliar with that library, land in production, and fail on an edge case three months later under specific data conditions. The debugging cost isn’t the fix itself; it’s the time to trace back from the failure to the root cause in AI-generated code that nobody fully reviewed because it looked reasonable.
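To make that failure mode concrete, here is a hypothetical sketch (the function, field, and data formats are invented for illustration, not taken from any real codebase): a generated helper that looks correct, passes review, and only fails months later under data nobody tested against.

```python
def normalize_user_ids(records):
    """Plausible-looking generated helper. It silently assumes every
    user_id is a bare integer string, which happens to hold for all
    data seen in review and for months of production traffic."""
    return [int(r["user_id"]) for r in records]

# Works on everything the reviewer tried:
normalize_user_ids([{"user_id": "123"}, {"user_id": "456"}])

# Months later an upstream system starts emitting "legacy-123",
# the call raises ValueError deep in a pipeline, and the debugging
# cost is tracing that failure back to an assumption nobody wrote down.
```

The bug is trivial once found; the expense is everything before "once found."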

The asymmetry gets worse when you factor in security-sensitive code. A subtle off-by-one in array indexing is cheap in a data transformation script and expensive in a bounds check guarding memory safety. LLM proponents and skeptics often aren’t even arguing about the same cost function.
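A minimal sketch of that asymmetry, with invented function names: the same class of off-by-one appearing in two contexts with very different costs.

```python
def clamp_scores(scores, limit):
    # Off-by-one: range(len(scores) - 1) silently skips the last
    # element. In a one-off data-cleaning script this means one row
    # goes unclamped: annoying, cheap to spot, cheap to fix.
    for i in range(len(scores) - 1):  # BUG: should be range(len(scores))
        if scores[i] > limit:
            scores[i] = limit
    return scores

def read_packet(buffer, length):
    # The same class of bug in a bounds check: <= instead of < accepts
    # an index one past the end of the buffer. In Python this raises
    # IndexError; in C it would be a silent out-of-bounds read.
    out = []
    for i in range(length):
        if i <= len(buffer):  # BUG: should be i < len(buffer)
            out.append(buffer[i])
    return out
```

Identical mistake, identical error rate from the model's perspective; wildly different cost functions for the two developers reviewing it.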

Context Windows and the Codebase Size Problem

Modern LLMs have expanded context windows significantly. GPT-4 and Claude models can handle hundreds of thousands of tokens. This sounds like it closes the gap between the two worlds, but in practice it doesn’t, for a few reasons.

First, most production codebases exceed what any context window can hold coherently. A reasonably complex web service might have 200,000 lines of code across a few hundred files. You can stuff that into a large context window, but the model’s effective attention degrades across that length. Studies on long-context retrieval have consistently shown that models perform worse on information positioned in the middle of very long contexts, a phenomenon sometimes called the “lost in the middle” problem. The code you care about is often not at either end.
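The size mismatch is easy to quantify with a back-of-envelope estimate. The tokens-per-line figure below is an assumption (code often tokenizes at very roughly 8–12 tokens per line, varying by language), not a measured constant.

```python
TOKENS_PER_LINE = 10  # assumption: rough average for code, varies by language

def fits_in_context(lines_of_code, window_tokens, tokens_per_line=TOKENS_PER_LINE):
    """Return (estimated_tokens, fits) for a codebase of the given size."""
    estimated = lines_of_code * tokens_per_line
    return estimated, estimated <= window_tokens

# A 3,000-line greenfield tool vs. a 200,000-line production service,
# against a 200k-token context window:
print(fits_in_context(3_000, 200_000))    # ~30k tokens: fits with room to spare
print(fits_in_context(200_000, 200_000))  # ~2M tokens: an order of magnitude too big
```

And this ignores the attention-degradation problem entirely: even a codebase that technically fits is not necessarily one the model can reason about coherently end to end.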

Second, the relevant context for maintenance work is often not the source code alone. It includes migration history, comments in the issue tracker, behavior documented only in runbooks, and implicit contracts between services that were never written down. None of that is in the repository in a form the model can consume.

Greenfield code, by contrast, is almost always smaller and self-contained. If your new tool is 3,000 lines of TypeScript, a modern model can see all of it, reason about the whole thing coherently, and suggest changes that don’t contradict code elsewhere in the file.

The Expertise Asymmetry Compounds This

There’s a second axis that intersects with the greenfield/maintenance divide: how well the developer can evaluate LLM output.

An expert in a given domain uses LLM suggestions the way an experienced carpenter uses a power tool. They know what good looks like, they catch deviations quickly, and they use the tool to accelerate work they could do themselves if slower. The model’s errors are filtered by the developer’s existing knowledge before they cause harm.

A developer working in an unfamiliar part of a codebase, or in a language or framework they know less well, has a weaker filter. The model’s confident-but-wrong output is harder to catch because the developer doesn’t have a sharp prior on what the correct output should look like. This is exactly the situation that maintenance work produces repeatedly: you’re the database team looking at a networking change, or the frontend developer touching an infrastructure script.

Greenfield work tends to keep developers in their domain of expertise. You’re building a thing you designed, in a stack you chose. Maintenance work constantly sends developers into unfamiliar territory. This isn’t a coincidence; it’s the nature of the work.

What the Tools Are Actually Good At

If you accept this framing, LLMs are genuinely useful for a specific class of programming work: tasks that are well-specified, relatively self-contained, use common patterns in well-documented libraries, and where the developer can quickly evaluate the output.

This describes a lot of greenfield development. It also describes specific maintenance tasks: writing new tests for existing logic, generating boilerplate for a new module that fits an established pattern in the codebase, producing a first draft of documentation, translating a function from one language to another where both are well-represented in training data.

It does not describe: debugging subtle race conditions, refactoring code with implicit behavioral contracts, designing an architecture that needs to account for operational constraints the model has never seen, or reviewing a security-sensitive change in a library the model has minimal training data on.

The developers who report transformative productivity gains from LLMs are not wrong. They are frequently working in conditions that favor LLM assistance. The developers who report disappointment or active harm are also not wrong. They are frequently working in conditions that do not.

Why This Matters Beyond the Argument

The practical implication is that “should we use LLMs for coding” is not a single question with a single answer. It’s a question that needs to be answered per task type, per developer expertise level, and per codebase maturity.

Teams that treat LLM adoption as a binary choice (everyone uses it for everything, or we reject it entirely) will optimize poorly. The greenfield developers on the team will be underserved by rejection. The maintenance developers on the team will accumulate technical debt from uncritical adoption.

A more useful approach is to identify which tasks in your specific workflow meet the favorable conditions: well-specified, self-contained, evaluable by the person doing them, low error propagation cost. Start there. Be skeptical about extending to tasks that fail those criteria, even when the model appears confident.

The two-worlds framing is valuable precisely because it shifts the conversation away from “is this technology good or bad” toward “under what conditions does this technology produce good or bad outcomes.” That’s a more tractable question, and it’s the one that leads to actual decisions rather than online arguments where both sides are empirically correct about different things.
