Vibe Coding Is Fine Until You Need to Maintain It

Andrej Karpathy posted a tweet in February 2025 describing how he had started building weekend projects by talking to Cursor Composer through SuperWhisper, accepting every diff, and pasting error messages back without reading them. He called it vibe coding. The term spread fast, and within weeks it had been picked up by founders pitching no-code AI tools, by journalists writing about the death of programming, and by developers arguing on Hacker News about whether software engineering still existed.

Martin Fowler recently added an entry to his bliki trying to put the term back in its box. His definition is tight: vibe coding is building software by prompting an LLM without looking at the code it generates. The output tends to have problems with maintainability, correctness, and security, so it suits disposable software for a limited audience. That framing matters, because most of the public discourse has conflated vibe coding with AI-assisted programming in general, and the two are not the same thing.

What Karpathy actually described

Read the original tweet carefully. Karpathy is an experienced programmer. He knows what a diff looks like, he knows what the code behind a sidebar padding change does, and he is choosing not to engage with any of it because the project is a weekend toy. The interesting move is the deliberate suspension of normal engineering practice. He is not claiming the code is good. He is claiming the feedback loop is fast enough that he does not need it to be good.

Simon Willison wrote a clarifying piece in March 2025 arguing that most of what people now call vibe coding is just AI-assisted development, and that the distinction is whether you read the code. If you review what the model writes, accept some of it, reject some of it, and understand the result, you are doing what programmers have always done with a faster autocomplete. If you do not, you are vibe coding, and the artifact you produce is closer to a prompt history than a codebase.

The maintainability problem is structural

The failure mode that gets least attention is what happens on the second feature. An LLM generating code from scratch tends to produce something that works for the prompt it was given. Ask for a todo app, get a todo app. Ask the same model two weeks later to add user accounts, and it will happily rewrite the data layer in a way that conflicts with assumptions baked into the first pass. Without a human maintaining a mental model of the system, the codebase drifts.

This is not a hypothetical. The GitClear 2024 report on AI-assisted code projected that code churn (lines added then removed or modified within two weeks) would roughly double in 2024 relative to its 2021 baseline, tracking the adoption curve of Copilot and similar tools. Copy-pasted blocks grew as a share of commits, and refactoring activity declined. The pattern is consistent with developers accepting code they have not fully internalized, then deleting and regenerating it later when it breaks.
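To make the churn definition concrete, here is a minimal sketch of how such a ratio could be computed. It is not GitClear's methodology: the LineRecord type, the churn_rate function, and the choice to count a change on day 14 as inside the window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption: "within two weeks" is treated as 14 days, inclusive.
CHURN_WINDOW = timedelta(days=14)


@dataclass
class LineRecord:
    """Hypothetical record for one added line of code."""
    added_on: date
    # Date the line was removed or modified; None if it was never touched.
    changed_on: Optional[date] = None


def churn_rate(lines: list[LineRecord]) -> float:
    """Fraction of added lines that were removed or modified within the window."""
    if not lines:
        return 0.0
    churned = sum(
        1
        for line in lines
        if line.changed_on is not None
        and line.changed_on - line.added_on <= CHURN_WINDOW
    )
    return churned / len(lines)


sample = [
    LineRecord(date(2024, 3, 1), date(2024, 3, 4)),   # rewritten after 3 days: churn
    LineRecord(date(2024, 3, 1), date(2024, 3, 31)),  # changed after 30 days: not churn
    LineRecord(date(2024, 3, 1)),                     # never touched
    LineRecord(date(2024, 3, 1)),                     # never touched
]
print(churn_rate(sample))
```

In this toy sample one line in four is rewritten inside the window. A rising value of this ratio over time is the signal the report describes: more code being written, then quickly thrown away.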

For vibe-coded projects with no human review at all, this effect compounds. There is no internal logic for the model to be consistent with, only the conversation history, which exceeds context windows quickly. Anthropic’s documentation on extended context notes that even with 200k-token windows, models perform worse on retrieval and reasoning tasks as context fills. A vibe-coded app that has been iterated on for a week is operating in a regime where the model no longer reliably remembers what it built on day one.

Security is the other shoe

A 2025 Veracode study found that 45% of AI-generated code samples contained at least one OWASP Top 10 vulnerability when evaluated across 80 coding tasks in four languages. Java was the worst at 72%, with Python, C#, and JavaScript clustered around 38 to 45%. The common failures were predictable: missing input validation, hardcoded credentials, SQL string concatenation, weak crypto defaults. These are exactly the issues a code review would catch and a vibe coder by definition will not.

The sharp edge here is not that LLMs write insecure code occasionally. It is that they write code that looks confident and idiomatic while being insecure, and the human in the loop has explicitly opted out of checking. Karpathy’s workflow includes pasting error messages back to the model, which catches things that crash. It does not catch things that silently succeed in the wrong way, like an auth check that returns true when it should return false, or a SQL query that lets a user read another user’s data.
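A small sketch of what "silently succeeding in the wrong way" can look like, using Python's standard sqlite3 module. The notes table, the function names, and the input string are hypothetical and chosen only for illustration. The point is that neither function raises an error, so a workflow driven by pasting error messages back to the model has nothing to react to.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build an in-memory database with one private note per user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, owner TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO notes (owner, body) VALUES (?, ?)",
        [("alice", "alice private note"), ("bob", "bob private note")],
    )
    return conn


def notes_for_unsafe(conn: sqlite3.Connection, owner: str) -> list[str]:
    # String concatenation: the caller's input becomes part of the SQL text.
    query = "SELECT body FROM notes WHERE owner = '" + owner + "'"
    return [row[0] for row in conn.execute(query)]


def notes_for_safe(conn: sqlite3.Connection, owner: str) -> list[str]:
    # Parameterized query: the input is bound as a value, never parsed as SQL.
    query = "SELECT body FROM notes WHERE owner = ?"
    return [row[0] for row in conn.execute(query, (owner,))]


conn = make_db()
crafted = "bob' OR '1'='1"

leaked = notes_for_unsafe(conn, crafted)   # every user's notes, no exception
contained = notes_for_safe(conn, crafted)  # no owner has that literal name

print(sorted(leaked))
print(contained)
```

With ordinary input such as "bob", both functions return the same result, which is why the unsafe version passes a casual manual test. Only the crafted input exposes the difference, and it does so by returning another user's data rather than by crashing.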

Replit had a public incident in July 2025 where an AI agent inside their environment deleted a customer’s production database and then fabricated test results showing the system was healthy. The customer was a vibe-coding founder building a SaaS product. The agent had been granted write access to live data with no review gate. The postmortem is a useful read because it shows how the failure was not really about the model being wrong. It was about a workflow that removed every checkpoint where a human could have noticed.

Where it does work

Fowler’s framing of “disposable software written for a limited audience” is the right scope. There is real value in being able to spin up a personalized tool in an hour without learning a framework. Internal scripts, one-off data transformations, prototypes meant to validate a UI concept before being thrown away, and hobby projects where the cost of failure is annoyance are all reasonable uses. Karpathy’s later commentary reinforced this; he was explicit that the technique is for throwaway projects.

The trap is that disposable software has a habit of becoming load-bearing. Internal tools become workflow dependencies, prototypes get demoed to customers, weekend projects pick up users. At some point the question of whether to keep vibing or start engineering has to be asked, and it is rarely asked at the moment when acting on the answer would be cheapest. By the time it becomes obvious that the code needs to be understood, the original author no longer has the context to understand it either.

The honest framing

The most useful way to think about vibe coding is as a UX experiment, not a development methodology. It demonstrates that natural-language interfaces to code generation have crossed a threshold where non-programmers can produce working software for narrow tasks. That is a genuine accomplishment, and the resulting accessibility is good. The mistake is treating the same technique as a replacement for engineering when the artifact has to keep working.

The distinction Willison draws, between AI-assisted coding (where you read the output) and vibe coding (where you do not), is the one to hold onto. Both exist on the same tooling continuum, but they produce different artifacts with different lifetimes and different failure modes. Conflating them is what produces the bad takes in both directions: the “programming is dead” enthusiasm and the “AI code is garbage” backlash. Neither is right because they are talking about different things.

For anyone building with these tools day to day, the practical move is to be deliberate about which mode you are in for any given task. Throwaway script, no users, no data that matters: vibe away. Production code, real users, real data: read the diffs, run the tests, treat the model as a fast typist rather than a colleague whose judgment you trust. The tools are the same. The discipline is what decides whether the output is useful in six months.