
The Slop Problem Has a Measurement Problem Inside It

Source: lobsters

The word “slop” entered the software vocabulary somewhere around 2024, borrowed from the discourse around AI-generated text and images. It describes code that is technically present, syntactically valid, and occasionally functional, but hollow in a specific way: verbose without being clear, structured without being designed, consistent only in the way a template is consistent. The question of whether we can measure it is not just academic. If you cannot measure it, you cannot track it, and if you cannot track it, you cannot make a principled argument about whether it is increasing in your codebase.

An experiment at pscanf.com recently tried to do exactly this: operationalize the concept of software slop and see whether a metric could be built around it. The results are instructive, and the difficulties the experiment runs into illuminate something important about the nature of the problem.

What Slop Actually Looks Like in Code

Before measuring something, you need to describe it precisely enough that two people looking at the same file would agree on whether it is present. Slop in prose is easier to identify intuitively: it is text that fills space without moving thought forward. In code, the analogous pattern has several recognizable forms.

Over-commenting is the most visible. AI models, trained partly on tutorials and documentation, tend to annotate obvious operations:

# Increment the counter by 1
counter += 1

# Return the result
return result

This kind of comment does not communicate anything a competent reader could not derive from the code itself. It increases file size while reducing signal density.

Defensive handling for impossible cases is another marker. LLM-generated code frequently guards against scenarios that the surrounding architecture makes unreachable:

function processItems(items: Item[]) {
  if (!items) {
    return [];
  }
  if (items === null) {
    return [];
  }
  if (items === undefined) {
    return [];
  }
  // actual logic
}

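With strictNullChecks enabled, the Item[] annotation already rules out null and undefined. Even without it, the first !items check covers both cases, so the two guards that follow can never fire. The guards are noise that a developer has to read past.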

Structural self-similarity, where every function is shaped the same way regardless of what it does, is harder to quantify but equally telling. AI models produce code with a template regularity: every async function has the same try/catch shape, every class has the same lifecycle methods, every utility is decomposed to the same depth. Human developers introduce variation through context-specific judgment. Models introduce variation mainly through the differences in the prompts they receive.

The Long History of Trying to Measure Code Quality

The ambition to reduce code quality to a number is old. Maurice Halstead published his complexity metrics in 1977, counting operators and operands to derive measures of program volume, difficulty, and effort. Thomas McCabe introduced cyclomatic complexity in 1976, counting linearly independent paths through code as a proxy for how hard it is to test and understand.

These metrics became embedded in tools like SonarQube, PMD, and Code Climate, which synthesize them into maintainability scores. The Maintainability Index used by Visual Studio combines Halstead volume, cyclomatic complexity, and lines of code into a single 0-100 score.
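As a rough illustration of how such a composite score is assembled, here is a minimal sketch of the rescaled 0-100 formula that Microsoft documents for Visual Studio. The function and parameter names are assumptions made for this example, and other tools use different coefficients and variants.

```python
import math


def maintainability_index(halstead_volume: float,
                          cyclomatic_complexity: int,
                          lines_of_code: int) -> float:
    """Rescaled 0-100 Maintainability Index.

    halstead_volume and lines_of_code must be positive, because the
    formula takes their natural logarithm.
    """
    raw = (171
           - 5.2 * math.log(halstead_volume)
           - 0.23 * cyclomatic_complexity
           - 16.2 * math.log(lines_of_code))
    # Clamp at zero, then rescale the 0-171 range to 0-100.
    return max(0.0, raw * 100 / 171)
```

Nothing in the formula looks at comments, naming, or redundancy, which is part of why a score like this says little about slop.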

These metrics capture real things. High cyclomatic complexity correlates with defect density. Studies going back to Basili and Selby (1987) showed the connection between branch density and fault rates, and later replications have broadly supported it, though part of the correlation tracks plain code size. But these metrics were designed for a world where the primary source of complexity was human developers writing too much branching logic. They measure density of decisions, not poverty of thought.

Slop often has low cyclomatic complexity. A function that adds twenty lines of redundant null checks and obvious comments before doing something simple scores fine, because comments contribute nothing to the count and each redundant guard adds only a single path. That leaves a short function well under the limit of 10 that McCabe originally suggested. You can write profoundly sloppy code without triggering a single existing linter warning.
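To make this concrete, here is a simplified complexity counter built on Python's ast module. It is a sketch, not a reimplementation of any particular tool: the set of node types counted as branches is an assumption, and comprehensions are deliberately left out. The two sample functions are hypothetical.

```python
import ast

# Node types that each add one decision point in this simplified count.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands and adds two paths.
            score += len(node.values) - 1
    return score


TERSE = '''
def total(items):
    return sum(item.price for item in items)
'''

SLOPPY = '''
def total(items):
    # Check whether items is None
    if items is None:
        return 0
    # Check whether items is empty
    if len(items) == 0:
        return 0
    # Initialize the result
    result = 0
    # Loop over the items
    for item in items:
        # Add the price to the result
        result += item.price
    # Return the result
    return result
'''
```

Under this count, TERSE scores 1 and SLOPPY scores 4. The comments are invisible to the parser, and two needless guards plus a loop still leave the function far below any common warning threshold.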

What Signals Actually Correlate with AI Generation

Research into detecting AI-generated text has produced a body of signals that transfer to code with modification.

DetectGPT, from Stanford in 2023, works by checking whether small perturbations to a piece of text decrease its probability under the generating model. AI-generated text tends to sit in local probability maxima, so perturbations lower its score more, on average, than they do for human-written text. Applied to code, this requires a code-specific language model to score perturbations, but the principle holds and has been demonstrated on Python and Java corpora.
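The core quantity can be sketched in a few lines. This is an unnormalized version of the perturbation discrepancy, and it assumes the caller supplies two hypothetical callables: log_prob, which scores a text under some language model, and perturb, which returns a lightly rewritten copy. Neither is provided here; in practice both would wrap real models.

```python
import random
from statistics import mean
from typing import Callable


def perturbation_discrepancy(
    text: str,
    log_prob: Callable[[str], float],
    perturb: Callable[[str, random.Random], str],
    n_samples: int = 20,
    seed: int = 0,
) -> float:
    """Original log-probability minus the mean log-probability of perturbed copies.

    A large positive value means perturbations tend to make the text
    less likely, which is the signature expected of model-generated text.
    """
    rng = random.Random(seed)
    perturbed_scores = [log_prob(perturb(text, rng)) for _ in range(n_samples)]
    return log_prob(text) - mean(perturbed_scores)
```

The threshold separating "generated" from "human" is not fixed by the method and would have to be calibrated on labeled samples from the codebase in question.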

Binoculars, published in 2024, uses two closely related language models, an observer and a performer. It takes the text's perplexity under one model and divides it by the cross-perplexity between the two, which measures how surprising one model's next-token predictions are to the other. AI-generated text tends to score low on this ratio, because it is barely more surprising than what a model would have written anyway, whereas human text scores higher. This ratio is more robust than raw perplexity alone, and it does not require access to the original generating model.

For code specifically, there are structural signals that do not require a language model at all. Comment-to-code ratio is measurable with any tokenizer. Identifier entropy, how much information is carried by variable and function names across a file, can be computed directly. Files where every identifier is a long descriptive noun phrase with consistent casing have lower identifier entropy than files with a natural mix of terse and verbose naming, because human developers vary naming style based on scope and importance in ways models do not.
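Both signals can be approximated with the Python standard library alone. The sketch below is one possible operationalization and not an established definition: comment density is taken as the share of non-blank lines that carry a comment, and "identifier entropy" is interpreted as the Shannon entropy of the distribution of identifier lengths, which is an assumption chosen to capture the mix of terse and verbose names.

```python
import ast
import io
import math
import tokenize
from collections import Counter


def comment_density(source: str) -> float:
    """Fraction of non-blank lines that carry a comment (inline ones included)."""
    comment_lines = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_lines.add(tok.start[0])
    non_blank = sum(1 for line in source.splitlines() if line.strip())
    return len(comment_lines) / non_blank if non_blank else 0.0


def identifier_length_entropy(source: str) -> float:
    """Shannon entropy, in bits, of the distribution of identifier lengths."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.append(node.id)
        elif isinstance(node, ast.arg):
            names.append(node.arg)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.append(node.name)
    counts = Counter(len(name) for name in names)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A file where every name has a similar length scores near zero on the second measure, while a file mixing loop indices like i with longer descriptive names scores higher. Whether that difference tracks slop in a given codebase is an empirical question.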

Repetition at the structural level can be measured by computing pairwise similarity between function bodies in the same file using tree-sitter ASTs or simpler token-level n-gram overlap. Human developers writing similar functions introduce more variation than LLMs, because the model has a strong prior toward its training-data templates even when the context calls for something different.
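A minimal sketch of this idea for Python files follows. It uses the standard ast module as a stand-in for tree-sitter and compares functions by the Jaccard overlap of n-grams over their AST node types. The helper names, the breadth-first node ordering, and the choice of n = 3 are assumptions made for illustration.

```python
import ast
from itertools import combinations


def _shape_ngrams(func: ast.AST, n: int = 3) -> set:
    """N-grams over the sequence of AST node type names in one function.

    Identifier names and literal values are ignored, so only structure counts.
    """
    kinds = [type(node).__name__ for node in ast.walk(func)]
    return {tuple(kinds[i:i + n]) for i in range(len(kinds) - n + 1)}


def mean_function_similarity(source: str, n: int = 3) -> float:
    """Mean pairwise Jaccard similarity of function shapes in one file."""
    funcs = [node for node in ast.walk(ast.parse(source))
             if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
    shapes = [_shape_ngrams(func, n) for func in funcs]
    pairs = list(combinations(shapes, 2))
    if not pairs:
        return 0.0  # fewer than two functions: nothing to compare
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```

A value close to 1.0 means the functions in the file are near-clones structurally. What counts as suspiciously high depends on the file: a module of thin wrappers will score high for legitimate reasons.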

Why Any Metric Degrades Over Time

Here is the core difficulty: the signals that currently distinguish slop from good code are artifacts of current model limitations, not permanent features of AI-generated code.

A model that learned to write tight, appropriately commented code with contextually justified defensive handling would produce output that scores well on every metric above. The best AI code and the best human code should be indistinguishable by definition, because the criteria for good code do not change based on who or what wrote it. So any reliable measurement today is measuring “what current models produce when operating below their best” rather than “what AI generation produces in principle.”

This is not a reason to abandon measurement. It is a reason to be precise about what is being measured. The slop problem is really two problems that are easy to conflate.

The first is a quality problem: code that is present but does not carry its weight. This exists in human-written codebases too, and the measurement challenge is similar to what it has always been, just with new failure modes added.

The second is a provenance problem: understanding what proportion of a codebase was generated versus authored, as a proxy for how much accumulated domain context the codebase carries. A codebase that is largely generated by models that do not understand the business domain is at structural risk in ways that are hard to quantify but real. When something breaks in an unexpected way, the generated code offers no trail of intent to follow.

These problems require different interventions. For the quality problem, existing static analysis tools can be extended with slop-specific rules: flag comment density above a threshold, flag redundant type guards that the type system already handles, flag structural homogeneity across functions in the same file. For the provenance problem, approaches like watermarking at generation time, as researched by Kirchenbauer et al. (2023) for text and now being explored for code, are more appropriate than post-hoc analysis.

The Experiment as a Calibration Tool

The value of an experiment like the pscanf one is not that it produces a deployable slop detector. It is that it calibrates intuition. When you try to operationalize a fuzzy concept, you learn which parts of the fuzziness are essential and which are incidental.

If a metric flags a lot of code you consider good, the metric is wrong. If a metric passes a lot of code you consider sloppy, either the metric is wrong or your intuitions about slop are inconsistent. Running the experiment forces you to decide.

The practical outcome from this kind of calibration work is usually a small set of targeted linting rules rather than a unified score. Comment density thresholds can be enforced in CI. Structural repetition warnings can be surfaced during code review. These interventions are coarse, but coarse and deployable beats precise and theoretical.

What the experiment probably cannot produce, and what no single experiment could produce, is a principled aggregate score for sloppiness across a whole codebase. The concept is too context-dependent. High comment density is appropriate in a library with a public API. Defensive null handling is appropriate at service boundaries. The same pattern that is slop in one file is good engineering in another.

What This Means for Codebases Under Pressure

The practical context for this work is that development teams using AI coding assistants are shipping code faster and reviewing it less carefully. In this environment, the slop question is really a leverage question: where in the development process does it matter most that code carries genuine design intent, and where is it acceptable for code to be generated from templates?

Boilerplate, test fixtures, and configuration files can tolerate higher slop levels because they are read less often and changed less often. Core domain logic, API contract code, and data migration code carry much higher risk when generated without careful review, because errors in those places compound in ways that are hard to debug and expensive to fix.

A practical measurement framework might therefore not be a single slop score but a set of targeted checks applied selectively based on where in the codebase a file sits. Tools like Semgrep already support path-scoped rule enforcement, which makes it feasible to apply stricter quality gates to high-stakes directories without burdening the whole codebase with rules that make no sense for generated scaffolding.
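The path-scoping idea itself is simple enough to sketch in plain Python. This is not Semgrep configuration. The directory layout, the thresholds, and the function names below are all hypothetical, and the density value is assumed to come from whatever comment-density measure the team has settled on.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: the first matching glob wins.
# A limit of None means the path is exempt from the check.
POLICY = [
    ("src/domain/*", 0.20),     # core domain logic: strict
    ("migrations/*", 0.20),     # data migrations: strict
    ("tests/fixtures/*", None), # generated fixtures: not checked
    ("*", 0.40),                # everything else: lenient default
]


def threshold_for(path: str):
    """Return the comment-density limit that applies to a file path."""
    for pattern, limit in POLICY:
        if fnmatchcase(path, pattern):
            return limit
    return None


def violates(path: str, comment_density: float) -> bool:
    """True when a file exceeds the limit for its location."""
    limit = threshold_for(path)
    return limit is not None and comment_density > limit
```

The same density of 0.3 fails under src/domain and passes in a scripts directory, which is the point: the bar follows the file's position in the system rather than a single global score.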

The measurement problem is real, but the engineering problem it serves is tractable: not “is this code slop?” in the abstract, but “does this code, in this location, meet the quality bar its position in the system requires?” That is a question existing tooling can approximate, even if it cannot answer it perfectly. The experiment at pscanf.com is a useful step toward knowing what “approximate” means in this context.
