The Code That Works and Nobody Understands

There is a category of technical debt that does not show up in static analysis, does not trigger linters, and will not appear in code coverage reports. It lives in the gap between code that runs correctly and code that someone on your team genuinely understands. Addy Osmani recently gave it a name, comprehension debt, and the framing is worth taking seriously.

The term “technical debt” has been stretched so far from Ward Cunningham’s original 1992 metaphor that it now covers everything from missing tests to poorly named variables. Comprehension debt is more specific: the accumulated liability of code in your codebase that your team cannot confidently reason about, modify, or debug. AI code generation has dramatically accelerated the rate at which teams accumulate it.

How Technical Debt and Comprehension Debt Diverge

Technical debt is about code quality: the shortcuts taken under time pressure that make a codebase harder to work with. It is measurable, at least roughly, through metrics such as cyclomatic complexity, test coverage, and coupling. Comprehension debt is a knowledge problem, not a code quality problem. A function can be perfectly structured, well-tested, and clearly named while still being a liability if nobody on your team understands why it was written that way.

This distinction matters because the two debts require different remedies. Technical debt gets paid down with refactoring, tests, and better tooling. Comprehension debt requires human time and attention: someone has to read the code, trace through its behavior, understand the edge cases, and build a working mental model. That work has no shortcut.

Robert C. Martin famously observed in Clean Code that the ratio of time spent reading code to time spent writing it is well over ten to one. That ratio captures something real: the cost of code is mostly borne after it is written, by people trying to understand it, modify it, or debug it. Comprehension debt is a tax on that ratio, and it has been growing.

The AI Acceleration Problem

Before AI code generation, comprehension debt accumulated slowly. Someone wrote a clever piece of code, documented it poorly, and left the company. Or a developer copy-pasted from Stack Overflow without fully understanding the answer. These were localized problems, often caught during code review by someone who asked “wait, what does this do?”

AI code generation makes this worse in a specific way: it produces code at a rate that outpaces the team’s capacity to build understanding. When a developer prompts a coding assistant and accepts a 50-line function, they may have a rough sense of what it does. They understand the intent. But the intent is not the implementation, and the implementation is where the edge cases, the failure modes, and the implicit assumptions live.

I have run into this directly while building Discord bots. You prompt for an event handler, the code looks right, passes your happy-path tests, and gets merged. Six months later someone reports a weird edge case in production. You open the file and nobody on the team, possibly including you, can say with confidence what the code does when a specific combination of permissions and guild state lines up in an unexpected way. The code worked until it did not, and there is no mental model to fall back on.

Here is a representative example of the pattern:

```javascript
// accepted without deep review, works in tests
// (processWithRetry is defined elsewhere and not shown)
async function processEventBatch(events, opts = {}) {
  const { concurrency = 5, retryLimit = 3, backoffBase = 250 } = opts;
  const queue = [...events];
  const results = [];
  const active = new Set();

  while (queue.length > 0 || active.size > 0) {
    while (active.size < concurrency && queue.length > 0) {
      const event = queue.shift();
      const p = processWithRetry(event, retryLimit, backoffBase)
        .then(r => { results.push(r); active.delete(p); })
        .catch(e => { active.delete(p); throw e; });
      active.add(p);
    }
    if (active.size > 0) await Promise.race([...active]);
  }
  return results;
}
```

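This code works, in the sense that it passes its tests. It implements a bounded-concurrency async queue with retry. But the catch handler re-throws, so a single event whose processWithRetry call rejects makes the awaited Promise.race reject, and with it the whole processEventBatch call: events still in the queue are never started, and the in-flight events keep running but their results are discarded. There is a quieter surprise too: results are pushed in completion order, not input order, so a caller cannot match results to events by index. Whether the fail-fast behavior was an intentional design decision or an AI artifact is not obvious from the code. The person who merged it may not know. The person who wrote the prompt likely did not verify it, because they only ran the happy path. Six months later, a production incident reveals the answer.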
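The snippet calls processWithRetry but never defines it, and reading processEventBatch honestly requires some model of that helper. Here is one plausible shape: a minimal sketch, assuming exponential backoff that waits backoffBase * 2^attempt milliseconds after each failed attempt. handleEvent and attemptsById are hypothetical stand-ins for the real per-event work; none of these helpers come from the original code.

```javascript
// Hypothetical stand-in for the real per-event work (not from the original code):
// fails the first `failTimes` calls for a given event id, then succeeds.
const attemptsById = new Map();
async function handleEvent(event) {
  const n = (attemptsById.get(event.id) ?? 0) + 1;
  attemptsById.set(event.id, n);
  if (n <= (event.failTimes ?? 0)) {
    throw new Error(`event ${event.id} failed (attempt ${n})`);
  }
  return { id: event.id, attempts: n };
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// One plausible processWithRetry (an assumption, not the original): one initial
// attempt plus up to `retryLimit` retries, waiting backoffBase * 2^attempt ms
// after each failure. Rethrows the last error once retries are exhausted.
async function processWithRetry(event, retryLimit, backoffBase) {
  let lastError;
  for (let attempt = 0; attempt <= retryLimit; attempt++) {
    try {
      return await handleEvent(event);
    } catch (err) {
      lastError = err;
      if (attempt < retryLimit) await sleep(backoffBase * 2 ** attempt);
    }
  }
  throw lastError;
}
```

Under this reading, retryLimit = 3 allows four calls in total. An equally plausible reading caps total attempts at three, and nothing in processEventBatch says which was intended; even the helper hides a decision that someone on the team should be able to explain.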

Code Review as Pattern Matching

Part of what makes comprehension debt insidious is that code review does not reliably catch it. When reviewing AI-generated code, reviewers tend to pattern-match against known good structures. If the code uses familiar APIs and reasonable names, it gets approved.

Pattern matching is fast and often correct, but it is a different cognitive process from building a working model of the code across all its possible states. Comprehension debt survives review because review, especially under time pressure, optimizes for “is this probably fine” rather than “do I understand every meaningful behavior path here.”

A 2024 GitClear study reported that code churn, the share of lines reverted or updated less than two weeks after being written, was on track to double in 2024 relative to its 2021, pre-AI baseline. That period corresponds closely with the widespread adoption of AI coding assistants. Churn is an imperfect proxy, since code gets rewritten for many reasons, but it is consistent with code that was accepted without adequate comprehension: it passes review, ships, and then gets corrected when reality surfaces what the tests did not cover.
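The churn definition is simple enough to state as code. The sketch below is a simplified, hypothetical model, not GitClear's actual methodology: it assumes each authored line is recorded as { addedAt, changedAt } in epoch milliseconds, with changedAt set to null if the line has never been modified or deleted.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Share of authored lines that were modified or deleted within `windowDays`
// of being written. Simplified illustration, not GitClear's methodology.
function churnRate(lines, windowDays = 14) {
  if (lines.length === 0) return 0;
  const windowMs = windowDays * DAY_MS;
  const churned = lines.filter(
    ({ addedAt, changedAt }) => changedAt !== null && changedAt - addedAt < windowMs
  ).length;
  return churned / lines.length;
}

// Four lines written on the same day (hypothetical data).
const t0 = Date.UTC(2024, 0, 1);
const lines = [
  { addedAt: t0, changedAt: t0 + 3 * DAY_MS },  // rewritten after 3 days: churn
  { addedAt: t0, changedAt: t0 + 30 * DAY_MS }, // changed, but outside the window
  { addedAt: t0, changedAt: null },             // never touched
  { addedAt: t0, changedAt: t0 + 13 * DAY_MS }, // rewritten after 13 days: churn
];
console.log(churnRate(lines)); // 0.5
```

Even in this toy form the metric shows its limits: it records how soon code changed, not why, which is why churn works as a signal of comprehension problems rather than a direct measure of them.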

The Organizational Knowledge Layer

Comprehension debt also degrades team knowledge in ways that compound over time. When a team accepts AI-generated code without building understanding of it, the bus factor for that code drops to one at best, and arguably to zero, because nobody can claim deep ownership of code they did not write and did not read carefully.

This matters more in some contexts than others. In systems programming, code carries implicit contracts about memory layout, concurrency safety, and resource ownership. Safe Rust guarantees memory safety, but not correct logic: a generated function might satisfy the borrow checker while making assumptions about execution order or resource lifetime that only become visible under production load. The compiler cannot save you from logic errors, and comprehension debt is precisely the accumulation of logic that nobody has fully examined.

Security-sensitive code carries similar risk. A 2021 study by New York University researchers (Pearce et al., “Asleep at the Keyboard?”) found that roughly 40% of the programs GitHub Copilot generated for security-relevant scenarios contained vulnerabilities. Many of those vulnerabilities were not obvious; recognizing them required understanding what the code assumed about its inputs and callers, the kind of understanding that pattern-matching review does not produce.

The Context Gap

There is a deeper structural problem underneath the review failure. AI coding assistants have no organizational memory. They do not know that a certain retry pattern was abandoned after a production incident, or that a particular API has an undocumented rate limit, or that the team decided two months ago to standardize on a specific error propagation style. They generate code that is locally reasonable while being disconnected from the decisions and constraints that shape the rest of the codebase.

This is a gap that experienced developers fill through accumulated context. They know why the authentication middleware is structured strangely, because they were there when it was written. They know which parts of the codebase are load-bearing and deserve extra scrutiny. AI-generated code has none of that context embedded in it, and when teams accept it without building that context deliberately, the gap between the code and the team’s understanding of the code widens.

Paying It Down

The practical counter-pressure is a habit Osmani describes as “explain before you accept”: before merging AI-generated code, require that someone on the team can explain the non-obvious parts, specifically the implementation and not just the intent. Why does this loop terminate? What happens if the network call fails mid-batch? What is the memory lifecycle of this resource?
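Put to the processEventBatch example from earlier, the question “what happens if the network call fails mid-batch?” forces a decision the original code only implies. If the intended policy were to attempt every event and report failures individually, one possible rewrite is a small worker pool that records an outcome per event, in input order. This is a sketch under that assumption, not the only correct answer; for some callers, failing fast is exactly right. The { status, value } and { status, reason } records deliberately mirror the shape Promise.allSettled returns, and processWithRetry is a hypothetical stub with no real retry logic, included only so the block runs on its own.

```javascript
// Hypothetical stub with no real retry logic, so the sketch runs standalone.
async function processWithRetry(event) {
  if (event.fail) throw new Error(`event ${event.id} failed`);
  return event.id * 10;
}

/**
 * Processes events with bounded concurrency. Failure policy: every event is
 * attempted, and a failure is recorded rather than aborting the batch.
 * Results are returned in input order, not completion order.
 */
async function processEventBatch(events, opts = {}) {
  const { concurrency = 5, retryLimit = 3, backoffBase = 250 } = opts;
  const results = new Array(events.length);
  let next = 0; // index of the next event that no worker has claimed yet

  // Each worker claims one index at a time; the loop terminates because every
  // iteration advances `next` toward events.length.
  async function worker() {
    while (next < events.length) {
      const i = next++;
      try {
        const value = await processWithRetry(events[i], retryLimit, backoffBase);
        results[i] = { status: "fulfilled", value };
      } catch (reason) {
        results[i] = { status: "rejected", reason };
      }
    }
  }

  const workerCount = Math.min(Math.max(1, concurrency), events.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With this version, processEventBatch([{ id: 1 }, { id: 2, fail: true }], { concurrency: 2 }) resolves to a fulfilled record for event 1 and a rejected record for event 2 instead of rejecting outright, and the answer to “why does this loop terminate?” sits in a comment next to the loop.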

This is slower than accepting the diff and moving on, but it is the only way to convert AI output into team knowledge. Code review templates can help by adding explicit questions: “Describe the error handling behavior” or “What happens when the input collection is empty?” These are not adversarial; they are what distinguishes comprehension from pattern matching.

Documentation helps more than people expect, but the useful kind records why the code is structured the way it is, not what it does. AI-generated code often has structural choices that are reasonable but non-obvious, and capturing the reasoning prevents the next person from spending an hour determining whether a given behavior was intentional.

Some teams have started requiring that AI-generated code be accompanied by a short written explanation from the developer who prompted it, describing what they understood the code to do and what they verified. This is not about auditing the AI; it is about ensuring that the developer’s understanding is concrete enough to write down, because if it is not, the code should not ship.

The Wider Picture

More code is being produced than teams can absorb, and the gap between code in the repository and code the team understands is widening. The productivity gains from AI coding tools are real and continuing to grow. The compounding cost is also real, and it does not show up in the same dashboards.

The skill that becomes more valuable in this environment is reading AI-generated code carefully enough to genuinely own it. That takes time and deliberate attention, and it is worth treating as a first-class part of the development process rather than an optional step that gets skipped when the sprint is running long. The alternative is a codebase that works until something breaks in a way nobody predicted, because nobody built a model of how it could break.