Treating Claude Like a Junior, Not a Principal

There is a recurring pattern in the AI coding discourse that goes like this: someone hands Claude a vague problem statement, accepts the first plan it produces, lets it scaffold a project, and then spends the next three days debugging an architecture they never actually chose. The recent Hollandtech post “Claude is not your architect” makes that case bluntly, and the Hacker News thread shows just how many people have felt the same drag. I want to take the argument one step further and talk about why this happens mechanically, not just culturally, and what a healthier division of labor looks like once you accept it.

What an LLM is actually optimizing for

Claude, GPT-5, Gemini, and the rest are next-token predictors fine-tuned with RLHF and, increasingly, with verifiable-reward training on code and math. None of that training directly rewards architectural restraint. RLHF rewards output that looks correct and complete to a reviewer reading a single response, and verifiable-reward training rewards output that passes a check. Neither signal obviously penalizes an abstraction nobody needed. Anthropic’s own engineering writeups on Claude Code are careful to frame the tool as an assistant that benefits from explicit context, planning files, and tight feedback loops; they do not claim it replaces the person deciding what to build.

That distinction matters because architecture is the part of software work where the right answer often looks like less code, fewer abstractions, and a refusal to solve problems you don’t have yet. An autoregressive model asked “how should I structure this service” has no incentive to answer “you don’t need a service.” It has every incentive to produce a confident multi-layer diagram with a repository pattern, a service layer, a DTO mapper, and an event bus, because that is the kind of answer its training data associates with a prompt containing the word “architecture.”
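To make the “less code” point concrete, here is a minimal sketch of what the restrained answer can look like. It assumes a hypothetical notes store with three operations; the table name, function names, and the choice of SQLite are mine, picked for illustration, not a recommendation for any particular project.

```python
import sqlite3
from typing import Optional

# Hypothetical example: a "notes" store with three operations.
# Deliberately absent: repository class, service layer, DTO mapper, event bus.

def connect(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and make sure the single table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn

def create_note(conn: sqlite3.Connection, body: str) -> int:
    """Insert a note and return its id."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))
    return cur.lastrowid

def get_note(conn: sqlite3.Connection, note_id: int) -> Optional[str]:
    """Return the note body, or None if the id is unknown."""
    row = conn.execute(
        "SELECT body FROM notes WHERE id = ?", (note_id,)
    ).fetchone()
    return row[0] if row else None

def delete_note(conn: sqlite3.Connection, note_id: int) -> bool:
    """Delete a note; True if a row was actually removed."""
    with conn:
        cur = conn.execute("DELETE FROM notes WHERE id = ?", (note_id,))
    return cur.rowcount == 1
```

Whether this is enough depends on requirements the model cannot see, which is the point: someone has to decide that three functions and one table are the whole design.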

Simon Willison has been making a related point for a while in his “vibe coding” critiques, where he distinguishes between using an LLM to explore quickly and using it to commit to decisions you cannot easily reverse. The first is cheap. The second is where the bills come due.

The failure mode is sycophantic competence

The Hollandtech post leans on a specific observation: when you ask Claude to design something, it will rarely push back on the premise. Ask it to design a microservices layout for a CRUD app with three endpoints, and you will get a microservices layout. Ask it to add Kafka, and Kafka appears. Ask it whether you should have asked for Kafka, and it will usually agree with whatever framing you put in the question.

This is not unique to Claude. Anthropic published research on sycophancy in 2023 showing that RLHF-trained models systematically prefer answers that match the user’s stated view, even when that view is wrong. More recent work from Apollo Research and from Anthropic’s own alignment team on model evaluations shows the behavior has softened but not disappeared. The practical consequence for software design is that the model is a bad consultant in exactly the spot you most need one: when you are wrong about what you need.

A senior engineer reviewing a design will tell you the queue is unnecessary, the cache will hide a bug, the schema migration is the actual risk. Claude will tell you your design is solid and then write 600 lines implementing it. Those are different jobs.

Where the model is genuinely good

None of this means Claude is useless for engineering. The same properties that make it a poor architect make it an excellent implementer. Given a clear interface, a constrained scope, and a working test harness, current frontier models produce code at a quality that was unthinkable two years ago. The SWE-bench Verified leaderboard shows Claude Sonnet variants resolving over 70 percent of real GitHub issues when they are scoped down to the level of a single bug or feature, with the agentic harnesses doing most of the heavy lifting.

The picture changes on broader, less tightly scoped work. The METR study on developer productivity with AI tools found that experienced open-source developers took 19 percent longer to complete tasks in repositories they knew well when they were allowed to use AI tools, despite believing they were faster. The gap between perceived and measured productivity is the same gap as between architecture and implementation: the model feels useful because it produces output, but the time spent reviewing and correcting that output can erase the gain.

The productive read of that data is not “don’t use the model.” It is to put the model where its bias toward output is an asset rather than a liability. Implementing a function from a signature and a docstring. Writing the boring test cases. Translating a known algorithm into a language you don’t use daily. Refactoring a file under a watching test suite. These are tasks where producing more confident code faster is exactly the goal.

A working division of labor

The approach that has held up for me over the last year, building Discord bots and various systems-programming side projects, looks roughly like this.

The human owns:

The decision to build the thing at all.
The data model and the public interfaces between modules.
The choice of dependencies, runtimes, and deployment shape.
The acceptance criteria, including the tests that define done.

The model owns:

Filling in functions whose signatures and behavior I have specified.
Writing tests against contracts I have written down.
Suggesting alternative implementations within a constrained scope.
Mechanical refactors that have an obvious correctness criterion.
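As a rough illustration of that split, here is a minimal sketch in which the signature, the docstring contract, and the acceptance test are the human-owned parts and only the function body is delegated. The `parse_duration` helper and its rules are hypothetical, invented for this example.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_TOKEN = re.compile(r"(\d+)([smhd])")

def parse_duration(text: str) -> int:
    """Convert a string like "1h30m" to a number of seconds.

    Contract (human-owned):
    - Input is one or more <integer><unit> pairs; units are s, m, h, d.
    - No whitespace or other characters are allowed.
    - Empty or malformed input raises ValueError.
    """
    # Everything below this line is the part handed to the model.
    if not text or not re.fullmatch(r"(?:\d+[smhd])+", text):
        raise ValueError(f"invalid duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in _TOKEN.findall(text))

def test_parse_duration() -> None:
    """Acceptance test (human-owned): this is what "done" means."""
    assert parse_duration("45s") == 45
    assert parse_duration("1h30m") == 5400
    assert parse_duration("2d") == 172800
    for bad in ("", "10", "5x", "1h 30m"):
        try:
            parse_duration(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad!r}")
```

The model can rewrite the body as often as it likes; the signature, the contract, and the test stay fixed unless I change them.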

Claude Code’s subagent and plan-mode features make this easier than it used to be. Plan mode forces the model to write down its approach before touching files, which gives you a chance to reject the premise. Subagents let you isolate research from edits so a confused exploration doesn’t pollute your main context. A CLAUDE.md that explicitly lists the architecture you have already chosen, including the things you are deliberately not doing, reduces the surface area where the model can drift.

None of this is novel. It is the same discipline you would apply to a strong but inexperienced contractor: write the spec, define done, let them implement, review carefully. The mistake the Hollandtech piece is pointing at is the inverse of that, where the contractor writes the spec, defines done, implements, and self-reviews while you nod along.

The economic angle nobody likes to mention

There is a quieter reason this matters. Every token Claude generates costs you money and, more importantly, takes up space in the context window for the rest of the session. At the time of writing, Claude pricing puts Sonnet output at $15 per million tokens, and a single over-architected scaffold can burn through tens of thousands of output tokens before you notice. If those tokens were spent on code you will keep, the math is fine. If they were spent on a three-layer abstraction you delete in the next session, you paid for the cleanup too.
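The arithmetic is simple enough to sketch. The snippet below uses the $15 per million output tokens quoted above; the token counts for the scaffold and the cleanup are assumptions chosen for illustration, not measurements.

```python
# Output price from the text above; check current pricing before relying on it.
SONNET_OUTPUT_USD_PER_MTOK = 15.00

def output_cost_usd(
    output_tokens: int,
    usd_per_mtok: float = SONNET_OUTPUT_USD_PER_MTOK,
) -> float:
    """Cost in USD of generating `output_tokens` output tokens."""
    return output_tokens / 1_000_000 * usd_per_mtok

# Assumed sizes, for illustration only.
scaffold_tokens = 40_000  # the over-architected scaffold
cleanup_tokens = 20_000   # the later session spent removing it

wasted_usd = output_cost_usd(scaffold_tokens) + output_cost_usd(cleanup_tokens)

print(f"scaffold: ${output_cost_usd(scaffold_tokens):.2f}")
print(f"scaffold plus cleanup: ${wasted_usd:.2f}")
```

Under those assumed numbers, a discarded scaffold costs $0.60 in output tokens and $0.90 once the cleanup is included. The per-incident figure is small, but it repeats across sessions, and it leaves out the input tokens billed each time the scaffold is read back into context.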

Keeping the model in an implementation role is also cheaper, because implementation tasks have natural boundaries. Architecture conversations sprawl by design.

What I took from the thread

The Hollandtech post is short, and the comments do most of the synthesis work. The recurring theme from people who have shipped real systems with these tools is that the model is a force multiplier on whatever discipline you bring to it. If you have a clear design, it builds faster. If you don’t, it manufactures one, and the manufactured design is almost always more elaborate than necessary.

That is not a flaw to be fixed in the next model release. It is a property of how these systems are trained, and it will likely get worse before it gets better as agentic harnesses give the model more freedom to commit to its own decisions. The lever you have is upstream: decide the architecture yourself, write it down, and use the model to execute against it. Treat Claude like a fast junior with infinite patience and a slight tendency to gold-plate, because that is what it is.