Putting a Number on Software Slop

The idea behind this experiment on pscanf.com is simple to state: can we actually measure the degradation in code quality that developers have started calling “software slop”? The question gets more interesting the harder it proves to answer.

Software slop, as the term has settled into use, refers to code that passes superficial review but carries the hallmarks of low-effort AI generation: verbose boilerplate, generic identifiers, comments that restate what the code already says, error handling pasted in without thought. It is technically functional in many cases, but it lacks the qualities that make code maintainable over years.

The intuition is easy; the measurement is not.

What Makes Code “Sloppy”?

Before measuring anything, you have to define it, and this is where most attempts at this kind of analysis run into trouble. Sloppiness in code is partially aesthetic and partially structural, and the two categories do not map cleanly onto any single metric.

Structural sloppiness has some measurable proxies. Cyclomatic complexity captures one dimension: code with unnecessary branching, redundant conditionals, or edge cases handled downstream rather than at the right abstraction boundary tends to score high. AI-generated code has a known tendency toward defensive over-structure, adding null checks and try/catch blocks to code paths where the type system or calling convention already guarantees safety. You end up with something like:

def get_user_name(user):
    try:
        if user is not None:
            if hasattr(user, 'name'):
                if user.name is not None:
                    return user.name
        return None
    except Exception as e:
        return None

Where a developer who understood the contract would write:

def get_user_name(user):
    return user.name if user else None

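The first passes tests. The second communicates intent. Cyclomatic complexity does separate these two (roughly 5 against 2 by a simple count of branch points), but it would score genuinely necessary branching just as high, which tells you something about the limits of that particular signal.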
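A rough way to check how a complexity metric scores two versions of a function like the ones above is to count decision points directly. The sketch below is a simplified McCabe-style counter built on Python's ast module. The name approx_cyclomatic_complexity and the choice of which node types count as branches are assumptions made for illustration; established tools apply more detailed rules.

```python
import ast

# Node types counted as one extra decision point each. This is a simplified,
# McCabe-style approximation chosen for illustration, not a standard definition.
_BRANCH_NODES = (ast.If, ast.IfExp, ast.For, ast.While, ast.ExceptHandler)


def approx_cyclomatic_complexity(source):
    """Return 1 plus the number of decision points found in `source`."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


DEFENSIVE = '''
def get_user_name(user):
    try:
        if user is not None:
            if hasattr(user, 'name'):
                if user.name is not None:
                    return user.name
        return None
    except Exception as e:
        return None
'''

DIRECT = '''
def get_user_name(user):
    return user.name if user else None
'''

print(approx_cyclomatic_complexity(DEFENSIVE))
print(approx_cyclomatic_complexity(DIRECT))
```

The counter only sees branch structure. It has no way to tell whether a given branch guards against something that can actually happen, which is the judgment that separates defensive clutter from real edge-case handling.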

Comment density is another candidate. AI-generated code trends toward verbose inline commentary that adds no information: # Initialize the counter, # Loop through the items, # Return the result. Detecting an unusually high comment-to-code ratio is not difficult. The problem is that, depending on context, the metric penalizes the wrong things: it punishes thoughtful documentation in genuinely complex code and rewards terseness that might itself be a readability failure.
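As a minimal sketch of the ratio itself, the snippet below uses Python's standard tokenize module to count comment tokens against lines that contain code. The function name comment_ratio and the exact definition (comments per code line) are assumptions for illustration; other definitions, such as comment characters per code character, are equally defensible.

```python
import io
import tokenize

# Token types that do not count as "code" for this ratio.
_NON_CODE = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def comment_ratio(source):
    """Return the number of comment tokens divided by the number of lines containing code."""
    comments = 0
    code_lines = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments += 1
        elif tok.type not in _NON_CODE:
            # Multi-line tokens (such as triple-quoted strings) cover several lines.
            code_lines.update(range(tok.start[0], tok.end[0] + 1))
    return comments / len(code_lines) if code_lines else 0.0


SAMPLE = '''
# Initialize the counter
count = 0
# Loop through the items
for item in items:
    count += 1  # Increment the counter
'''

print(comment_ratio(SAMPLE))
```

Note that the ratio says nothing about whether a comment is informative. A restated line and a careful explanation of a subtle invariant each count as one comment.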

The Perplexity Approach

The most technically interesting approach is perplexity-based detection, borrowed from AI-generated text identification. Language models generate text and code by selecting statistically probable completions, producing sequences that are “expected” in a way that human-written text often is not. Tools like GPTZero apply this principle to natural language by measuring how surprised a reference model is by each token; researchers have applied similar ideas to code, using large code-trained models to score how “likely” a given snippet is. Low perplexity correlates with AI generation; high burstiness, meaning wide variance in per-token perplexity across a document, correlates with human writing.
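The two quantities can be sketched without committing to any particular model. The snippet below assumes you already have per-token natural-log probabilities from some reference model; how those are obtained is out of scope here, and the example values are invented purely for illustration. Burstiness is implemented as the standard deviation of per-token surprisal, which is one reasonable reading of “variance in per-token perplexity” rather than a fixed definition.

```python
import math
import statistics


def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def burstiness(token_logprobs):
    """Population standard deviation of per-token surprisal (-log p)."""
    return statistics.pstdev([-lp for lp in token_logprobs])


# Hypothetical log-probabilities, not output from any real model.
flat = [math.log(0.5)] * 4                     # every token equally expected
spiky = [math.log(0.9), math.log(0.9),
         math.log(0.01), math.log(0.9)]        # one very surprising token

for name, logprobs in (("flat", flat), ("spiky", spiky)):
    print(name, perplexity(logprobs), burstiness(logprobs))
```

With only four tokens these numbers mean little, since both statistics are averages that need a long sequence to stabilize.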

This approach works poorly at small granularity. A single utility function may have low perplexity regardless of its origin, because there are limited correct ways to implement it and both humans and language models converge on similar solutions. At file or module scale, the signal strengthens. At repository scale, you might have something actionable.

The challenge is that the approach requires a reference model, and the reference model embeds assumptions about what “normal” code looks like. A codebase that uses unusual domain-specific patterns will score as sloppy by this measure when it is actually just specialized. A model fine-tuned on payment processing code would behave very differently from a general-purpose code model, which means the detector’s calibration depends on choices that are often invisible to whoever runs it.

Identifier Entropy

One more signal worth examining: identifier vocabulary. Developers writing domain-specific code tend to use names drawn from the problem domain. A codebase about payment processing will have identifiers like ledger_entry, settlement_batch, chargeback_reason. AI-generated code in the same domain gravitates toward generic identifiers: data, result, item, process, value.

You can measure this as the entropy of the identifier vocabulary relative to the codebase’s domain language, or as the fraction of identifiers drawn from a generic stoplist. The signal is imperfect because junior developers and people writing throwaway scripts use generic names for entirely human reasons. But as part of a composite signal across a large body of code, it carries some discriminative value.
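The stoplist variant is the simpler of the two to sketch. The snippet below walks a Python syntax tree, collects identifiers, and reports the fraction that appear in a generic stoplist. The stoplist here contains only the five names mentioned above and would need tuning for any real codebase; the function names collect_identifiers and generic_fraction are assumptions for illustration.

```python
import ast

# Illustrative stoplist taken from the examples in the text; a real one would be longer.
GENERIC_NAMES = {"data", "result", "item", "process", "value"}


def collect_identifiers(source):
    """Collect variable, argument, function, and class names from Python source."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.append(node.id)
        elif isinstance(node, ast.arg):
            names.append(node.arg)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.append(node.name)
    return names


def generic_fraction(names):
    """Fraction of identifier occurrences that come from the generic stoplist."""
    if not names:
        return 0.0
    return sum(name in GENERIC_NAMES for name in names) / len(names)


GENERIC_SAMPLE = '''
def process(data):
    result = []
    for item in data:
        result.append(item)
    return result
'''

DOMAIN_SAMPLE = '''
def settle(settlement_batch):
    ledger_entries = []
    for chargeback in settlement_batch:
        ledger_entries.append(chargeback)
    return ledger_entries
'''

print(generic_fraction(collect_identifiers(GENERIC_SAMPLE)))
print(generic_fraction(collect_identifiers(DOMAIN_SAMPLE)))
```

The two samples have identical structure and differ only in naming, which is exactly the dimension this signal isolates. Counting occurrences rather than distinct names is a design choice; counting distinct names would weight a rarely used generic name as heavily as a pervasive one.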

Stylometric authorship attribution has a long history in computational linguistics, and some of that work has been applied to code. Individual programmers tend to have identifiable stylistic fingerprints: preference for certain loop structures, consistent use of early returns versus nested conditionals, characteristic alignment patterns. AI-generated code flattens these fingerprints toward a statistically average style. Detecting the absence of a fingerprint is harder than detecting its presence, but it is not impossible at sufficient scale.

The Ground Truth Problem

All of these approaches share a common bottleneck: labeled data. You need code you know was AI-generated and code you know was human-written, in sufficient quantity to validate any signal against.

This is genuinely hard to construct. Developers routinely edit AI-generated code before committing, sometimes heavily. The result is a spectrum of human-AI collaboration rather than a binary classification. A function generated by Copilot and then refactored by an experienced engineer may carry almost no detectable trace of its origin, while a function typed by a junior developer under deadline pressure might score badly on every slop metric through no AI involvement whatsoever.

The core conceptual problem is that “AI-generated” and “low quality” are not the same category, and a measurement tool designed to detect one will generate false positives against the other. Open source maintainers who want to filter AI-generated contributions are generally trying to filter low-quality contributions; the correlation is real but loose. A slop detector that primarily penalizes rushed human work while passing polished AI output would be worse than useless as a quality gate.

Research on Copilot’s effect on GitHub contributions has shown mixed results: some analyses find modest productivity improvement with flat or slightly degraded review quality, others find increases in bug density in domains where the model’s training data is thin. None of this is clean enough to serve as ground truth for a general-purpose measurement tool, and the pace of model improvement means any dataset you build today becomes stale within months.

What an Experiment Can Establish

An experiment in this space cannot fully solve detection, but it can do useful scoping work. It can establish which candidate signals have any discriminative power at all. It can identify the granularity at which a signal becomes meaningful. It can surface false positive rates and help determine whether a detector would be practical in a real review workflow.

A detector with a high false positive rate is counterproductive as an automated gate; it would flag legitimate contributions and train maintainers to dismiss its output entirely. A high-precision detector that surfaces candidates for human review, even at lower recall, has a plausible use case. The deployment framing matters as much as the metric itself, perhaps more: a tool that helps a maintainer prioritize review attention is a different product from a tool that blocks PRs automatically, and the acceptable error rates are completely different between those two use cases.
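The arithmetic behind that distinction is worth making concrete. The sketch below computes precision, recall, and false positive rate from a confusion matrix; the counts are invented for illustration and do not come from any real evaluation.

```python
def detector_rates(tp, fp, tn, fn):
    """Compute precision, recall, and false positive rate from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical scenario: 1000 pull requests, 50 of which are actually slop.
# The detector flags 60 of them: 30 correctly and 30 incorrectly.
rates = detector_rates(tp=30, fp=30, tn=920, fn=20)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

In this made-up scenario the false positive rate is about 3%, yet half of all flags are wrong because slop is rare relative to legitimate contributions. As an automated gate, that would block as many good PRs as bad ones; as a review-prioritization hint, the same numbers might be acceptable.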

There is also a subtler issue about incentive effects. If a slop detector becomes widely adopted, the implicit optimization pressure shifts. AI coding tools will eventually be fine-tuned or prompted in ways that evade the signals a detector relies on. Perplexity-based detectors are already susceptible to this: a model instructed to “write in an unconventional style” or to “vary its patterns” can shift its output distribution toward higher perplexity without improving the actual quality of the code. Detection becomes an arms race, and arms races in this domain tend to favor the generation side.

The experiment on pscanf.com is worth following regardless of where its specific findings land, because formalizing the question has value independent of answering it completely. Software slop is observable enough as a phenomenon that people keep coining terms for it and open source maintainers keep complaining about it in issue threads. Whether it is measurable well enough to act on systematically is what experiments like this one are positioned to answer, incrementally and imperfectly.