What Agentic Coding Looks Like When the Codebase Fights Back

Most agentic coding showcases run against TypeScript monorepos or Python microservices. In those settings the feedback loop is fast, the context is manageable, and the build step takes seconds. ClickHouse offers none of that.

The ClickHouse blog post on agentic coding documents what their engineering team found when they started running AI agents against their actual codebase in a serious way. The codebase is several million lines of C++, built around a vectorized columnar query engine, multiple storage backends in the MergeTree family, a custom SQL dialect, and enough domain-specific knowledge that onboarding a new human engineer takes months before they can contribute meaningfully to the core engine. The post is worth reading on its own terms, but it raises questions that the article itself only partially answers.

The Compilation Wall

The most underappreciated constraint in applying agentic coding to systems C++ is the feedback loop. An AI agent operating on a Python or JavaScript project can verify a change in seconds: run the linter, run the test, get a result, iterate. In ClickHouse, a full build on a well-specced machine takes 30 to 60 minutes. Even with ccache or sccache and incremental compilation, a non-trivial change to a core header can ripple into a multi-minute rebuild before you know whether the change compiles at all.

This is not a model problem. It is an environment problem. The agent doing the reasoning might be capable enough; the bottleneck is that the verification step of its observe-plan-act-verify loop is slower by an order of magnitude or more than in the environments where current agentic tools were benchmarked and tuned.

The implication is that agent scaffolding for C++ projects needs to invest heavily in pre-verification: static analysis with clang-tidy, syntax and type checking of individual translation units (for example with clang’s -fsyntax-only mode), and targeted unit tests that exercise only the subsystem being modified rather than the full integration suite. A well-structured CLAUDE.md or equivalent project context file needs to tell the agent which subsystems are independently testable and how to invoke a fast path. Without that, the agent either burns time on full builds or ships changes it hasn’t verified.
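A minimal sketch of what such a fast path could look like in agent scaffolding. The directory names come from this article, but the check names are placeholders rather than real ClickHouse build or test targets, and the rule that any header change escalates to a full build is an assumption made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from source subtrees to fast, subsystem-level checks.
# The strings are placeholders, not real ClickHouse commands or targets.
FAST_PATHS = {
    "src/Functions": ["<lint changed files>", "<run Functions unit tests>"],
    "src/Storages": ["<lint changed files>", "<run Storages unit tests>"],
}
FULL_BUILD = ["<full build and integration suite>"]


def plan_verification(changed_files):
    """Return an ordered list of checks covering every changed file."""
    steps = []
    for path in changed_files:
        # "src/Functions/foo.cpp" -> "src/Functions"
        subsystem = "/".join(PurePosixPath(path).parts[:2])
        checks = FAST_PATHS.get(subsystem)
        if checks is None or path.endswith(".h"):
            # Unknown subsystem or a header change: assume the change can
            # ripple widely and fall back to the slow path.
            return FULL_BUILD
        for check in checks:
            if check not in steps:
                steps.append(check)
    return steps


print(plan_verification(["src/Functions/foo.cpp"]))
print(plan_verification(["src/Core/Block.h"]))
```

The point of the design is that the agent never has to guess: either a change maps to a known cheap check, or it is explicitly told that only the slow path is trustworthy.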

Context Windows Against a Multi-Million Line Codebase

Another constraint that scales poorly is context. The ClickHouse codebase has thousands of source files in the core src/ directory alone, with architecture spread across Storages/, Interpreters/, Processors/, and Functions/ among others. The storage layer alone, covering MergeTree variants like ReplacingMergeTree, SummingMergeTree, and AggregatingMergeTree, represents a large and heavily interdependent surface area.

No context window currently in production fits the whole thing. What this forces is a regime of deliberate context selection: the agent needs to navigate rather than absorb. Tools like ripgrep, clangd for LSP-based navigation, and structured documentation files become load-bearing infrastructure for the agent rather than developer conveniences.
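Deliberate context selection can be sketched as a budgeting problem. The example below is a simple greedy heuristic, not an optimal packing and not how any particular tool works; the relevance scores, token counts, and file paths are made-up inputs that a real setup would derive from code search or LSP queries.

```python
def select_context(candidates, budget_tokens):
    """Greedily pick the most relevant files that fit in the token budget.

    candidates: list of (path, relevance_score, token_count) tuples.
    Returns (chosen_paths, tokens_used).
    """
    chosen, used = [], 0
    # Highest relevance first; ties broken in favor of smaller files.
    for path, _score, tokens in sorted(candidates, key=lambda c: (-c[1], c[2])):
        if used + tokens <= budget_tokens:
            chosen.append(path)
            used += tokens
    return chosen, used


# Hypothetical candidates, as if ranked by a prior code search.
candidates = [
    ("src/Storages/a.cpp", 0.9, 6000),
    ("src/Storages/b.cpp", 0.8, 5000),
    ("docs/architecture.md", 0.5, 3000),
]
print(select_context(candidates, budget_tokens=10000))
```

In this example the second file is skipped because it would exceed the budget, and the smaller architecture note is included instead.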

The pattern that has emerged across several engineering teams working with tools like Claude Code and Cursor on large codebases is to treat context management as a first-class engineering problem. This means maintaining living documents that capture architecture decisions, subsystem boundaries, and non-obvious invariants. In a codebase like ClickHouse, those invariants are everywhere: the alignment and padding assumptions that SIMD code paths rely on, the ownership semantics around IColumn interfaces, the way query pipelines compose Processors. An agent working without that context will produce plausible-looking code that violates subtle contracts.

Domain Depth and the Limits of Pattern Matching

There is a category of change in a database engine where syntactic pattern matching is genuinely sufficient. Adding a new SQL function, extending an existing aggregate, wiring up a new setting through the existing configuration machinery: these are structurally repetitive tasks where the agent can look at how similar things were done and follow the pattern accurately. The ClickHouse codebase is large enough that finding prior art for almost any well-scoped task is feasible through code search.

Then there is the other category. Modifying the merge algorithm in MergeTree to handle a new edge case. Changing how the primary key index and data skipping indexes interact with the query planner. Adjusting data part lifecycle under concurrent writes. These are not tasks where pattern matching helps much, because the correct behavior depends on a causal model of the system that requires sustained reasoning about invariants, not just style imitation.

Agentic tools have gotten genuinely better at the first category. The second category remains deeply human work, and the ClickHouse team appears to have found a similar boundary. The interesting operational question is not whether AI can handle the hard cases but how quickly engineers can recognize which category a given task falls into before handing it off.
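One way to make that recognition step explicit is a small triage rule applied before a task is handed off. The sketch below is purely illustrative: the fields, the threshold, and the two routing labels are assumptions, not a process described in the ClickHouse post.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    prior_art_hits: int           # similar past changes found via code search
    touches_concurrency: bool     # e.g. part lifecycle under concurrent writes
    touches_core_algorithm: bool  # e.g. merge logic, index and planner interaction


def route(task, min_prior_art=3):
    """Return "agent" for pattern-following work, "human-led" otherwise."""
    if task.touches_concurrency or task.touches_core_algorithm:
        # Correctness depends on a causal model of the system.
        return "human-led"
    if task.prior_art_hits >= min_prior_art:
        # Structurally repetitive: enough examples exist to imitate.
        return "agent"
    return "human-led"


print(route(Task("add a new SQL function", 12, False, False)))
print(route(Task("handle a new merge edge case", 4, False, True)))
```

The value is less in the rule itself than in forcing the question to be asked before the handoff rather than after a failed attempt.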

Workflow Architecture Matters More Than Model Choice

A thread running through the ClickHouse post, and through similar write-ups from Shopify, Sourcegraph, and other engineering-heavy organizations, is that productivity gains from agentic coding are largely a function of workflow architecture rather than model quality. The difference between a team seeing a 10% productivity improvement and a team seeing 40% is usually not which foundation model they are using; it is whether they have:

A well-maintained project context document the agent can rely on
A fast verification path that does not require full builds
Clear task decomposition that gives the agent a bounded problem
A review process calibrated to the output format agents produce

For ClickHouse specifically, the review piece is interesting. C++ PR reviews for a database engine typically require reasoning about performance implications, memory safety, exception safety, and concurrency. These are not things reviewers can spot-check casually. An AI-assisted workflow that increases the volume of PRs without adjusting the review process simply shifts the bottleneck rather than removing it.
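The bottleneck shift can be shown with a toy queue model. All numbers here are invented for illustration; the only claim is structural: merged output is capped by review capacity, so extra PR volume beyond that capacity accumulates as backlog.

```python
def simulate(weeks, prs_opened_per_week, review_capacity_per_week):
    """Return (total_merged, remaining_backlog) after the given number of weeks."""
    backlog = 0
    merged = 0
    for _ in range(weeks):
        backlog += prs_opened_per_week
        # Reviewers can only process up to their weekly capacity.
        reviewed = min(backlog, review_capacity_per_week)
        backlog -= reviewed
        merged += reviewed
    return merged, backlog


# Before agents: authors open 10 PRs a week, reviewers can handle 12.
print(simulate(4, 10, 12))  # (40, 0)
# With agents: PR volume doubles, review capacity is unchanged.
print(simulate(4, 20, 12))  # (48, 32)
```

In the second scenario output rises by only 8 PRs over four weeks while 32 sit unreviewed, which is the shifted bottleneck in miniature.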

The teams that have made this work have generally moved toward a model where the agent does the mechanical scaffolding and a human reviewer focuses narrowly on the invariants the agent cannot be trusted to uphold. That requires reviewers to internalize what the agent is good at and where its failure modes concentrate, which is itself a non-trivial skill that takes time to develop.

What This Means for Systems Programming More Broadly

ClickHouse is not an outlier in systems programming. Codebases of comparable scale and complexity exist at most large infrastructure companies, in game engines, in compilers, in operating system kernels. The agentic coding story in those environments is well behind where it is in web development, partly because the tooling ecosystem was slower to develop and partly because the feedback loops are slower and the correctness requirements are higher.

The infrastructure investment required to make agents productive in these environments is substantial: fast build caching, well-maintained architecture documentation, LSP-based navigation tooling, and a test suite structured to allow subsystem-level verification. Most teams have not made that investment because it was not necessary before agents existed. Making it now, retroactively, while also shipping features, is the actual challenge that posts like ClickHouse’s are quietly documenting.

The honest summary is that agentic coding at this scale works, but it requires treating the agent as a new kind of junior engineer that needs the same onboarding infrastructure a human would need, just formalized into files and tooling rather than conversations and documentation wikis. The teams that build that infrastructure are going to see compounding returns. The teams that try to bolt agents onto an undocumented codebase and wonder why the output is mediocre are going to keep wondering.