
The Training Data Gradient Underneath the LLM Productivity Debate


The debate over LLM coding productivity has generated more heat than clarity because most participants are diagnosing the wrong variable. Baldur Bjarnason’s recent post correctly identifies that developers making identical observations reach opposite conclusions, and that this reflects genuine structural differences in their working conditions. Those differences include codebase size, project phase, and domain complexity. They also include something less often discussed: the training data density of the specific technologies in use, which may be the strongest single predictor of LLM utility.

The Gradient in Practice

LLMs are trained on text from the public internet, including code from GitHub, documentation sites, Stack Overflow, and tutorials. That corpus is not uniformly distributed across technologies. React has been the dominant frontend framework for over a decade; its patterns are documented across millions of repositories and articles. The model has seen so many variations of a useEffect hook, a React Router configuration, and a Next.js API route that it generates reasonable implementations with high reliability.

Vue has substantial coverage, but less. Svelte is smaller still; Solid.js, Astro, and Qwik smaller again. Gleam, which only reached version 1.0 in March 2024, has perhaps a few thousand public repositories as of early 2026. An LLM generating Gleam code is operating at the edge of its reliable knowledge. A developer using Gleam, Zig, or Bun’s newest APIs will have LLM experiences structurally similar to someone maintaining a proprietary enterprise system, not because the codebase is large or legacy, but because the model has not seen enough examples to build reliable heuristics.

This creates a gradient. At one end: TypeScript with Next.js and Prisma, Python with FastAPI and SQLAlchemy, Go with standard library networking. LLMs are competent at these because the training data is dense and redundant; errors in one training example get averaged away across thousands of others. At the other end: your company’s internal Go service wrapping a custom gRPC schema, or a Rust application using a crate with 300 stars. The model generates plausible-looking code that is wrong in ways that reflect the absence of reliable training signal.

Why This Cuts Across the Greenfield/Legacy Divide

The most common framing of the “two worlds” problem is greenfield versus maintenance work. That framing is real. But the training data gradient operates independently and can override it.

A greenfield developer who chooses a niche or emerging stack will encounter poor LLM performance even on brand-new code. The project is small, fits in context, and carries no legacy invariants, yet the model still produces unreliable output because it lacks the training data to do better. Conversely, a developer maintaining a legacy application built on Rails or Django with standard gems and packages will find relatively reliable LLM assistance, because those patterns are well represented even if the specific application’s business logic is not.

The training data gradient also explains something the greenfield/legacy framing struggles with: why some technologies produce consistently positive LLM reports while others produce consistently negative ones, independent of project phase. TypeScript developers report high productivity across project phases because TypeScript’s patterns are heavily represented in training data. Developers working in less-popular languages or on proprietary frameworks report lower utility across the board, whether they are starting fresh or maintaining existing code.

The Private Code Problem

The extreme case is private, proprietary code, which the model has never seen and never will see. An enterprise application with internal abstractions, custom ORM layers, proprietary service APIs, and years of accumulated conventions exists entirely outside the model’s training distribution. The model approaches it the way a competent generalist approaches an unfamiliar industry: it can apply generic patterns, but it doesn’t know the specifics, and, crucially, it doesn’t know what it doesn’t know.

This compounds an existing problem. For well-documented frameworks, hallucinated API calls are quickly caught because developers have reference points. If an LLM invents a nonexistent method on a widely-used library, any developer familiar with that library will catch it immediately. For internal systems, hallucinated behavior may be unrecognizable as wrong to the reviewing developer, particularly when they are not the domain expert for that subsystem.
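For public libraries, that kind of reference point can even be mechanized: given the names an LLM-generated snippet calls, a reviewer or a linter can check each one against the real module. Here is a minimal sketch in Python, using a hypothetical hallucinated name (`json.serialize`) alongside the real `json.dumps`; for an internal system there is often no equivalent ground truth to check against.

```python
import importlib


def check_module_attrs(module_name: str, attr_names: list[str]) -> dict[str, bool]:
    """Report which attribute names actually exist on a module.

    For well-known public libraries, a hallucinated call fails this
    check immediately; for proprietary internal APIs, no such shared
    reference exists.
    """
    module = importlib.import_module(module_name)
    return {name: hasattr(module, name) for name in attr_names}


# `json.dumps` is real; `json.serialize` is a plausible-sounding
# name a model might invent (it does not exist in the stdlib).
report = check_module_attrs("json", ["dumps", "serialize"])
print(report)  # {'dumps': True, 'serialize': False}
```

The same idea is what import-aware linters and type checkers already do in CI, which is part of why hallucinations against popular packages rarely survive review.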

The GitClear 2024 analysis of code quality trends with AI assistance found increasing code churn correlating with AI adoption. One plausible explanation is that AI-generated code passes initial review but requires revision after the fact. This pattern would be more pronounced in proprietary codebases where the model’s training distribution offers the least support, and where reviewers are least equipped to catch confident-but-wrong output.

Why Model Improvements Won’t Fully Close This

Scaling model capabilities and training data has produced real improvements in LLM coding performance. But the training data distribution problem is structural in a way that scaling does not resolve.

Publicly available code for popular frameworks grows as the developer community grows. Private enterprise code does not become public. The proprietary systems built by large organizations, the internal tooling, the domain-specific business logic: none of this enters the training corpus. As AI coding tools become more prevalent, the delta between what the model knows well and what it does not may actually widen: popular frameworks receive ever more training data while private codebases remain invisible.

Retrieval-augmented approaches and code indexing tools partially address this. Tools like Aider build compressed representations of a specific codebase and include them in context, giving the model navigational access to code it hasn’t seen in training. This reduces hallucination about local API shapes and function signatures. It doesn’t give the model the deep pattern knowledge that comes from having been trained on millions of similar examples. Knowing a function exists is different from knowing how it behaves under the conditions that matter.
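The idea behind such a compressed representation can be sketched in a few lines: walk a file's syntax tree and keep only the function and class signatures, a navigational skeleton small enough to fit in context. This is a simplified illustration of the general technique, not Aider's actual implementation:

```python
import ast


def repo_map(source: str) -> list[str]:
    """Extract top-level class and function signatures from Python source.

    Bodies are discarded; only the skeleton survives. This gives a
    model the *shape* of unseen code without teaching it how any of
    that code behaves.
    """
    tree = ast.parse(source)
    skeleton = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            skeleton.append(f"def {node.name}({args}): ...")
        elif isinstance(node, ast.ClassDef):
            skeleton.append(f"class {node.name}: ...")
    return skeleton


# A made-up internal module, standing in for code the model never saw.
source = """
class OrderService:
    def __init__(self, db): ...

def reconcile(orders, ledger):
    total = sum(o.amount for o in orders)
    return total - ledger.balance
"""
print("\n".join(repo_map(source)))
# class OrderService: ...
# def reconcile(orders, ledger): ...
```

A map like this tells the model that `reconcile` exists and what it takes, which curbs hallucinated signatures; it says nothing about edge cases, invariants, or failure modes, which is exactly the gap the paragraph above describes.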

What This Means for Technology Choices

One implication that doesn’t get much discussion: framework popularity now carries a secondary value it did not have before. A team choosing between two technically comparable frameworks should, all else equal, prefer the one with more public training data, because developers will have a materially better LLM-assisted experience. This is a new factor in the technology selection calculus, and it is likely already influencing adoption patterns in ways that are hard to disentangle from other network effects.

For teams working on existing infrastructure, the implication is more immediate. Be explicit about which parts of the codebase have reliable LLM support and which do not. The proprietary ORM wrapper warrants different skepticism than standard SQL query generation. The internal auth system warrants different scrutiny than a Next.js route handler. Treating LLM output uniformly across this gradient, either with blanket trust or blanket skepticism, misses the actual pattern of where errors concentrate.

The two-worlds framing is right that developers work in genuinely different conditions. The training data gradient is one of the clearest mechanisms creating those conditions, and unlike codebase maturity or organizational complexity, it maps directly to choices teams make about their technology stack. That makes it one of the more actionable parts of the picture.
