The Code Is Not the Hard Part: Why AI Has a Structural Ceiling in Data Engineering
Source: lobsters
Robin Moffatt recently published a firsthand account of using Claude Code on real data engineering work, and the conclusion lands where most honest practitioners end up: the tool is genuinely impressive for certain tasks and genuinely blind to others. The “yet” in his title is doing real work there. But I think the structural gap between AI coding tools and data engineering practice is worth examining more precisely, because the ceiling is not about code quality.
What Claude Code Actually Gets Right
Code generation for data work has a clear ceiling on the upside and a clear floor on the downside, and Claude Code tends to operate near the ceiling. Given a well-specified transformation, it will write clean PySpark, dbt SQL, or Pandas code faster than most engineers. Given a source schema and a target schema, it will produce a reasonable mapping. Given a log format, it will write a parser. These are real wins.
The tasks where LLMs shine in data work share a common shape: the problem is fully specified within the prompt. If you can describe the input, the output, and the transformation rules completely, the model can close the loop without needing anything outside that context window. This is a meaningful chunk of the boilerplate that data engineers spend time on, particularly earlier in their careers.
For greenfield pipelines built on well-documented data sources, with clean schemas and clear business requirements handed to you in a ticket, Claude Code is a legitimate accelerant. You write less of the transformation scaffolding yourself. That is a real productivity gain.
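To make "fully specified within the prompt" concrete, here is a minimal sketch of the kind of task this describes. The table names, columns, and mapping rules are invented for illustration; the point is that every rule needed to produce the output is stated up front, so nothing outside the prompt is required.

```python
import pandas as pd

def normalize_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Map a raw orders extract to a target schema (hypothetical spec):
    - order_id: passthrough from `id`
    - amount_cents: dollars converted to integer cents
    - placed_at: ISO-8601 strings parsed to UTC timestamps
    """
    return pd.DataFrame({
        "order_id": raw["id"],
        "amount_cents": (raw["amount_usd"] * 100).round().astype(int),
        "placed_at": pd.to_datetime(raw["created"], utc=True),
    })

raw = pd.DataFrame({
    "id": [1, 2],
    "amount_usd": [19.99, 5.00],
    "created": ["2024-01-15T10:00:00Z", "2024-01-16T12:30:00Z"],
})
print(normalize_orders(raw))
```

A task shaped like this is closed: input, output, and rules are all in the specification, which is exactly the regime where an LLM performs well.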
The Institutional Knowledge Problem
Here is where the ceiling appears. Most data engineering work is not greenfield. It is maintenance, extension, and debugging of systems built over years, on top of data sources that evolved organically, with business logic encoded in comments (if you are lucky), tribal knowledge (if you are not), and occasional post-incident runbooks buried in Confluence.
Consider something like a revenue table with three different columns named revenue_net, revenue_adjusted, and revenue_recognized. A new engineer looking at this table needs to know that revenue_net excludes a specific category of refunds that the finance team decided to reclassify in Q3 2023 after an audit; that revenue_adjusted includes a correction factor applied to one regional market, fixed upstream in January 2024 but kept for backwards compatibility; and that revenue_recognized follows ASC 606 and diverges from revenue_net for multi-year contracts. None of this is in the schema. None of it is in the code. It lives in the heads of two people and a Jira ticket that was closed two years ago.
Claude Code cannot access any of that. When you ask it to write a pipeline that uses revenue, it will pick a column based on its name semantics, and it will be wrong in a way that is not immediately obvious. The pipeline will run. The numbers will look plausible. The data will be wrong.
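The closest a team can get is forcing the choice to be explicit rather than name-based. A hypothetical sketch, using the column names from the example above; the enum and its notes are illustrative, not a real codebase:

```python
from enum import Enum

class RevenueBasis(Enum):
    # Excludes the refund category finance reclassified in Q3 2023.
    NET = "revenue_net"
    # Carries a legacy correction factor for one regional market,
    # kept for backwards compatibility after the Jan 2024 upstream fix.
    ADJUSTED = "revenue_adjusted"
    # ASC 606 recognition; diverges from NET on multi-year contracts.
    RECOGNIZED = "revenue_recognized"

def revenue_column(basis: RevenueBasis) -> str:
    """Force callers to pick a basis explicitly instead of guessing by name."""
    return basis.value

print(revenue_column(RevenueBasis.RECOGNIZED))  # revenue_recognized
```

Even this only helps if someone with the institutional knowledge wrote those comments in the first place, which is precisely the work the model cannot do.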
This is not a problem that better prompting or larger context windows can solve. The information does not exist in any form that can be ingested as context. It requires organizational relationships, access to closed tickets, familiarity with past incidents, and sometimes direct conversations with stakeholders.
Schema Evolution and State
Data pipelines have state in a way that most software does not. A web API can be stateless; a data pipeline is consuming and transforming a stream of historical reality. Schema changes in upstream systems propagate through pipelines in ways that can be subtle and delayed. A column that switched from VARCHAR(50) to VARCHAR(255) six months ago might only cause problems today if a new consumer tries to index it in a downstream system that still has the old constraint.
Handling schema evolution in production requires understanding the full lineage of a field across every system it touches. Tools like Apache Atlas, DataHub, and OpenLineage exist specifically because this problem is hard enough that companies build dedicated infrastructure to track it. Even with those tools, the lineage is incomplete and requires human interpretation.
Asking Claude Code to modify a pipeline that participates in a complex lineage graph is asking it to make changes in a system it cannot see. It will write correct-looking code that handles the immediate transformation. It will not know that this field is read by three other pipelines maintained by different teams, that one of them has a hard-coded type cast that will silently truncate the new values, or that the downstream reporting system caches this data and will need to be invalidated.
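The truncation failure mode described above is worth making concrete, because it produces no error at all. A toy sketch: an upstream column widened from VARCHAR(50) to VARCHAR(255), while a downstream consumer still applies a hard-coded cast at the old width (the function here stands in for a CAST(... AS VARCHAR(50)) somewhere in another team's pipeline):

```python
def downstream_cast(value: str, width: int = 50) -> str:
    # Hard-coded legacy width; stands in for CAST(value AS VARCHAR(50))
    # in a pipeline maintained by a different team.
    return value[:width]

widened = "x" * 120          # legal upstream after the schema change
stored = downstream_cast(widened)

assert len(stored) == 50     # silently truncated to the old width
assert stored != widened     # no exception raised; the data is just wrong
```

Nothing in the immediate code being modified hints at this; the hazard lives entirely in a system the model cannot see.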
Data Quality Rules Are Business Logic
Data quality validation in production pipelines encodes business rules that are often not written down anywhere. Row counts should be within 5% of yesterday's. The ratio of null values in a certain column should never exceed 0.3%. Certain combinations of values are semantically impossible even if they are syntactically valid. These rules exist because someone got paged at 2am and wrote them in the aftermath.
Frameworks like Great Expectations and dbt tests give data engineers a place to encode these rules, but the rules themselves come from domain knowledge and incident history. Claude Code can write the Great Expectations suite if you tell it the rules. It cannot tell you what the rules should be. For a new pipeline, getting the data quality rules right often takes months of production operation before the edge cases reveal themselves.
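The checks themselves are trivial to write once the rules are known. A minimal sketch in plain Python rather than Great Expectations, using the illustrative thresholds from above (they are examples, not recommendations):

```python
def check_row_count(today: int, yesterday: int, tolerance: float = 0.05) -> bool:
    """Row count should be within 5% of yesterday's."""
    return abs(today - yesterday) <= tolerance * yesterday

def check_null_ratio(nulls: int, total: int, max_ratio: float = 0.003) -> bool:
    """Null ratio in a column should never exceed 0.3%."""
    return (nulls / total) <= max_ratio

assert check_row_count(10_300, 10_000)       # +3%  -> within tolerance
assert not check_row_count(11_000, 10_000)   # +10% -> alert
assert check_null_ratio(2, 1_000)            # 0.2% -> ok
assert not check_null_ratio(5, 1_000)        # 0.5% -> alert
```

The hard part is not these ten lines; it is knowing that 5% and 0.3% are the right numbers for this table, which only production history can tell you.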
The Production Deployment Gap
Even in domains where AI code generation is mature, production deployment is a separate discipline. For data pipelines, this gap is wider. Deploying a change to a running pipeline can mean reprocessing historical data, managing offset commits in Kafka, handling backfill windows in Airflow, or migrating state in Flink. The operational complexity of these tasks is not primarily about writing code; it is about sequencing changes correctly, understanding the failure modes of partial migrations, and having rollback plans that account for data already written downstream.
A model that generates a correct updated pipeline definition has done maybe 20% of the work required to safely ship that change in a production environment. The rest is operational knowledge that is specific to the organization’s infrastructure, the particular version of each tool in the stack, and the current state of data in the system.
What “Yet” Actually Requires
Moffatt’s hedging with “yet” is reasonable, but it is worth being specific about what closing the gap would require. Better code generation is not the answer; the code generation is already good. What would actually shift this equation:
Persistent organizational memory. A model that had read every closed Jira ticket, every Confluence page, every Slack thread, every incident postmortem for a given company would have access to much of the institutional knowledge that matters. This is technically possible as a RAG architecture but organizationally difficult: data is siloed, poorly tagged, and spread across systems. Building the corpus is as much of an engineering challenge as the model itself.
Schema graph awareness. If a model could ingest a complete, accurate lineage graph of every field in an organization’s data warehouse, it could reason about the downstream impact of changes. This requires the lineage tooling to be both comprehensive and accurate, which is a hard prerequisite. Most organizations do not have this.
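The reasoning such a model would need is itself simple once the graph exists. A toy sketch of field-level impact analysis; the graph contents are invented, and real lineage tools like DataHub or OpenLineage expose far richer metadata than this:

```python
from collections import deque

# Hypothetical field-level lineage: field -> fields that consume it.
LINEAGE = {
    "orders.amount": ["finance.revenue_net", "ops.daily_totals"],
    "finance.revenue_net": ["reporting.dashboard"],
    "ops.daily_totals": [],
    "reporting.dashboard": [],
}

def downstream(field: str) -> set[str]:
    """Breadth-first walk of every field affected by a change to `field`."""
    seen, queue = set(), deque([field])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(downstream("orders.amount"))
```

The traversal is trivial; the prerequisite is a graph that is complete and correct, and building that graph is the part most organizations have not done.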
Production environment access with guardrails. Generating a migration plan is different from executing one. An agent that could inspect the actual state of running pipelines, read monitoring dashboards, and propose sequenced changes with rollback conditions would be substantially more useful than a code generator. This is agentic infrastructure work, not a model capability question.
None of these are impossible, but they require infrastructure investment that sits outside the model itself. Claude Code is a code-generation tool working against a specification. Data engineering, at its core, is the work of producing that specification from ambiguous, incomplete, and sometimes contradictory sources. The model cannot replace the work upstream of itself.
The practical takeaway is that Claude Code is most useful to data engineers who already know what they want to build and can express it clearly. For experienced engineers, that is a genuine productivity multiplier on a specific class of tasks. For less experienced engineers working in unfamiliar domains, the danger is that the confident, well-formatted output will obscure how much of the important context was not in the prompt.