
LLM Productivity Is a Training Data Problem in Disguise


The debate over LLM productivity in software development has produced a peculiar stalemate. Developers share roughly the same observations and reach opposite conclusions. Baldur Bjarnason’s piece on the two worlds of programming frames this as a genuine split between different developer populations living in different technical realities, which is correct. The question worth pushing further is: what creates those different realities, and why are they distributed the way they are?

The answer is structural. The divide maps almost exactly onto training data coverage.

How LLMs develop their coding intuitions

Large language models learn to write code by absorbing enormous quantities of existing code, documentation, Q&A threads, and blog posts. The training corpus is not a uniform sample of all software ever written. It is heavily weighted toward whatever appears in public repositories, developer forums, and documentation sites.

That means Python, JavaScript, TypeScript, and Go are massively over-represented relative to Ada or Fortran. Popular open-source frameworks with extensive tutorial ecosystems appear thousands of times per concept. Problem types that generate Stack Overflow questions are more thoroughly covered than problem types solved in closed office environments and never discussed publicly. Clean, pedagogical examples dominate over the messy, constraint-laden code of real production systems.

The result is uneven competence across the range of real programming work. The model has a strong prior about what correct code looks like in domains well covered by training data, and a much weaker prior in domains that are sparse or absent. This is a property of the training corpus, not a failure of the underlying architecture: the model is good at outputting the center of its training distribution. Work that sits at that center benefits enormously. Work at the periphery benefits much less, and often receives plausible-sounding wrong answers, which can be worse than no answer.
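The qualitative shape of this relationship can be sketched in a few lines. The coverage numbers and the saturating curve below are purely illustrative assumptions, not measurements of any real corpus or model:

```python
import math

# Hypothetical relative training-data coverage per domain (illustrative
# numbers, not measurements): popular web stacks vastly outweigh niche work.
coverage = {
    "python_web": 1_000_000,
    "typescript_frontend": 800_000,
    "go_services": 300_000,
    "embedded_firmware": 5_000,
    "proprietary_dsl": 50,
}

def simulated_accuracy(examples_seen: int) -> float:
    """Toy model: accuracy grows with the log of coverage and saturates.
    A sketch of the qualitative shape, not a fitted curve."""
    return min(0.98, 0.3 + 0.08 * math.log10(max(examples_seen, 1)))

for domain, n in coverage.items():
    print(f"{domain:22s} coverage={n:>9,d}  est. accuracy={simulated_accuracy(n):.2f}")
```

The point of the sketch is the gap between the ends of the table, not the absolute values: orders of magnitude of coverage separate the center of the distribution from its periphery, and competence tracks that gap.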

Why the same observations diverge

Consider the most common shared claim: “LLMs produce code quickly but it needs review.” The experience of that review differs radically depending on where your work sits in the training distribution.

A developer building a SaaS product with FastAPI, SQLAlchemy, and a React frontend will find the review fast. The generated code is usually close to correct. Errors are minor, obvious, and caught by the type checker or the test suite. The overhead of review is low relative to the time saved on generation. When I am scaffolding a new Discord bot cog or writing boilerplate for a command handler, the model produces something I would have written myself with roughly a fifth of the friction.
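The command-handler boilerplate mentioned above is exactly the shape of code that sits at the distribution's center. A generic sketch of that shape, written in plain Python rather than discord.py so it stands alone (all names are illustrative):

```python
from typing import Callable, Dict

# Minimal command registry of the kind LLMs scaffold reliably:
# a decorator registers handlers by name, and dispatch routes to them.
COMMANDS: Dict[str, Callable[[str], str]] = {}

def command(name: str):
    """Register a handler function under a command name."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        COMMANDS[name] = fn
        return fn
    return decorator

@command("ping")
def ping(_args: str) -> str:
    return "pong"

@command("echo")
def echo(args: str) -> str:
    return args

def dispatch(line: str) -> str:
    """Split '!command args' input and route it to a registered handler."""
    name, _, args = line.lstrip("!").partition(" ")
    handler = COMMANDS.get(name)
    return handler(args) if handler else f"unknown command: {name}"
```

This pattern appears in thousands of tutorials and public bot repositories, which is precisely why a model reproduces it with so little friction.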

A developer writing firmware for a proprietary sensor interface, or maintaining a legacy financial system with an internal domain-specific language, or doing low-level cryptography with specific side-channel constraints, will experience that same review as the bulk of the job. The generated code compiles and looks plausible. It fails in ways that require deep domain knowledge to catch, sometimes in ways that require hardware-in-the-loop testing or careful timing analysis to surface. The model’s confidence is high. Its accuracy in the domain is not.

Both developers are accurately reporting their experience. The divergence follows from where their work sits relative to the training distribution’s center of mass.

The same pattern holds for API hallucination. For Django’s ORM, the React hooks API, or discord.py’s event system, hallucinations are rare and caught immediately by the interpreter. For a proprietary internal SDK with thin documentation, or a specialized embedded library not well-represented in public repositories, hallucination is constant. The same model, the same general behavior, producing wildly different outcomes depending on domain coverage.
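The difference in how hallucinations surface can be made concrete. Against a strict object, a hallucinated method fails immediately; against a permissive dynamic wrapper of the kind some thin SDKs expose, it silently becomes a no-op and surfaces much later. A contrived sketch, with both classes invented for illustration:

```python
class StrictClient:
    """Stands in for a well-known library: unknown attributes fail fast."""
    def send(self, payload: str) -> str:
        return f"sent {payload}"

class PermissiveClient:
    """Stands in for a thin dynamic SDK wrapper: any attribute 'works',
    so a hallucinated call returns a no-op instead of an error."""
    def __getattr__(self, name: str):
        return lambda *args, **kwargs: None

strict = StrictClient()
loose = PermissiveClient()

try:
    strict.send_async("hello")      # hallucinated method: caught immediately
    caught_fast = False
except AttributeError:
    caught_fast = True

result = loose.send_async("hello")  # hallucinated method: silently a no-op
print(caught_fast, result)
```

The first failure costs seconds; the second costs whatever it takes to notice that nothing was ever sent.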

What the benchmarks actually measure

The standard coding benchmarks bake in this bias. HumanEval, the benchmark that first put LLM coding capability on the map, consists of 164 algorithmic problems of the type common in competitive programming and coding interviews: precisely the problem type that floods both training data and fine-tuning datasets. Top models now score above 90% pass@1 on HumanEval. That number has very limited predictive value for performance on a bug in a medical imaging pipeline or an optimization pass in a compiler backend.
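For readers who have not looked at the benchmark, this is the general shape of a HumanEval-style problem (the example below is written for illustration, not taken from the benchmark): a short self-contained function specified by a docstring, graded by whether sampled solutions pass hidden test cases.

```python
def digit_sum_until_single(n: int) -> int:
    """Repeatedly sum the decimal digits of n until one digit remains.
    >>> digit_sum_until_single(9875)
    2
    """
    # 9875 -> 9+8+7+5 = 29 -> 2+9 = 11 -> 1+1 = 2
    while n >= 10:
        n = sum(int(d) for d in str(n))
    return n

# Benchmark-style check: pass@1 means the first sampled solution passes.
assert digit_sum_until_single(9875) == 2
assert digit_sum_until_single(7) == 7
```

Problems of this shape are abundantly represented in public tutorials and interview-prep material, which is why high scores on them say little about the long tail.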

SWE-bench, which attempts to evaluate on real GitHub issues, is a genuine improvement but still heavily weighted toward popular Python projects: Django, sympy, scikit-learn, matplotlib. Projects with years of public history, extensive documentation, and strong representation in training data. Performance on SWE-bench Verified reached around 50% for leading models in 2024. Performance on equivalent tasks in unpublicized internal codebases would almost certainly be substantially lower, though this is difficult to measure without access to non-public code.

The benchmark ecosystem reflects what is easy to measure and what is well-represented in training data. Those two things correlate strongly, which means benchmark performance systematically overstates the technology’s utility in the long tail of real software work.

The METR complication

A 2025 study by METR adds a layer the simple distribution framing does not fully account for. They measured experienced developers working on tasks within their own domain and found they were about 19% slower with AI assistance than without, despite predicting they would be faster.

This points to a friction cost that is separate from output quality. Context management, prompt iteration, evaluating generated output, course-correcting when the model pursues the wrong approach: all of these take time. For developers who have a clear mental model of what they need to build and the skills to build it directly, this friction can dominate the time savings on routine generation.
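The accounting behind a net slowdown is simple to sketch. With illustrative per-task numbers (these are not METR's measurements), generation savings can be swamped by the surrounding overhead for an expert who would have done the task quickly by hand:

```python
# Illustrative per-task minutes, not measured values.
baseline_minutes = 30.0    # expert doing the task directly

prompting = 4.0            # writing and iterating on prompts
review = 8.0               # reading and evaluating generated output
correction = 6.0           # steering the model off wrong approaches
generation_saved = 12.0    # typing and boilerplate the model handles

assisted_minutes = (baseline_minutes - generation_saved
                    + prompting + review + correction)
slowdown = (assisted_minutes - baseline_minutes) / baseline_minutes

print(f"assisted: {assisted_minutes:.0f} min, slowdown: {slowdown:.0%}")
```

With these assumed numbers the workflow overhead (18 minutes) exceeds the generation savings (12 minutes), producing a 20% slowdown even though the model genuinely saved typing time. Shrink the overhead or grow the savings and the sign flips, which is exactly the sensitivity the sweet-spot argument below turns on.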

The productive sweet spot appears to be work that is both well covered in training data, so outputs are mostly correct and review is fast, and somewhat outside the developer's immediate fluency, so scaffolding value outweighs management overhead. That describes a senior developer working in an unfamiliar language or framework, or a mid-level developer working in a well-documented stack. It does not describe an expert developer doing deep work in their primary domain.

This also explains a pattern that puzzles people when they first encounter it: junior developers in popular web stacks often report moderate productivity gains, while senior developers in the same stack sometimes report frustration. The senior developer has a precise intention and a detailed mental model; the model’s output introduces noise relative to that intention. The junior developer has a vaguer goal and less capacity to generate the scaffold unaided; the model’s output provides genuine lift.

What better models change, and what they do not

Model capability continues to improve. Context windows have grown large enough to ingest substantial portions of a codebase. Retrieval-augmented generation can in principle compensate for training distribution gaps by providing domain-specific documentation at inference time. Fine-tuning on proprietary codebases can shift the model’s prior toward a specific organization’s patterns. These are real improvements.

The underlying constraint is more durable. The model’s intuitions about what correct code looks like are anchored to what it was trained on. Longer context and retrieval help in specific cases, but they require deliberate engineering investment. The experience a React developer describes, of typing a vague prompt and getting working code back, does not automatically transfer to domains with sparse training coverage, regardless of context window size.

For teams working primarily in well-represented domains with public-facing codebases, the gains are real and reproducible. For teams doing specialized systems work, domain-specific scientific computing, or heavily proprietary development, the investment required to close the training distribution gap is substantial and often not made visible in productivity reporting.

The developers who report the most consistent gains are not necessarily in the easiest domains. They are in domains well-covered by training data, and they have developed workflows that lean into what the model does well. That is a genuine and transferable skill. It is not uniformly available across all programming work, and treating individual reports of large gains as representative of the technology’s potential everywhere is what keeps the debate unresolved.
