
The Part of Data Engineering That Isn't Code

Source: lobsters

Robin Moffatt, who has spent years working in the data streaming space with Kafka and ksqlDB, put Claude Code through its paces on real data engineering tasks and found the results unsurprising: competent at code generation, lost at everything else. His conclusion isn’t a hot take. It follows a pattern that keeps showing up whenever someone points a general-purpose AI coding assistant at data infrastructure.

The interesting part isn’t that Claude Code failed. It’s understanding precisely where it failed and why that failure is structural rather than a matter of model capability.

Code Is the Thin Layer

Data engineering has a code problem in the same way that cooking has a heat problem. The heat matters. You need to get it right. But knowing how to operate a stove doesn’t tell you what to make, when the dish is done, or whether the ingredients you bought are any good.

A typical data pipeline involves writing SQL transforms, orchestration logic in something like Airflow or Dagster, connector configuration for sources and sinks, and maybe some Python glue code. Claude Code can handle all of that reasonably well in isolation. Give it a schema and ask it to write a dbt model, and it will produce something that compiles. Give it an Airflow DAG pattern and ask it to add a new task, and it will do that too.

The problem is that none of those isolated tasks are the actual job. The actual job is understanding why the orders table has three different representations across your warehouse depending on which system wrote it, deciding which one is authoritative, knowing that the customer_id field was backfilled in 2023 with different logic than it uses today, and writing transforms that remain correct when the upstream team changes their schema next quarter without telling anyone.

None of that is in the code. It lives in Slack threads, data catalogs that are perpetually out of date, a wiki page someone wrote in 2022, and the memory of whoever was on-call when the pipeline broke last August.

What SWE-bench Actually Measures

A lot of the discourse around AI replacing software engineers rests on benchmarks like SWE-bench, which presents AI models with real GitHub issues and measures whether the model can produce a patch that makes the tests pass. Claude’s performance on SWE-bench Verified has improved significantly, and Anthropic has been transparent about methodology.

But SWE-bench is measuring something narrower than it appears. Each task is self-contained: the repository is there, the issue is described, the tests define what “correct” means. The model doesn’t need to understand why the feature was built the way it was, what other systems depend on the behavior being changed, or whether fixing the test actually fixes the user’s problem.

Data engineering work rarely has that shape. The closest analogue would be a SWE-bench task where the repository contains no tests, the issue is described in business terms by someone who doesn’t know SQL, half the relevant context is in a different repository that you need to read first, and the definition of “correct” changes after you submit your patch because the stakeholder misunderstood their own requirements.

Benchmarks that measure code synthesis are good at measuring code synthesis. They don’t transfer cleanly to work that is mostly about understanding systems and data.

The Environment Specificity Problem

Data engineering is also exceptionally environment-specific in ways that pure software development is not. A Python function that sorts a list behaves the same way everywhere. A Spark job’s behavior depends on cluster configuration, memory settings, the specific version of the connector being used, the partitioning of the source data, and the quirks of the particular Hive metastore or Delta Lake version in your environment.

Claude Code can read your files and understand your codebase, but it cannot observe what your data actually looks like at runtime. It cannot see that the Kafka topic you’re consuming has 90% of its messages in one partition because of a bad partitioning key choice from two years ago. It cannot see that the “small” lookup table you’re joining against has actually grown to 800 million rows and is now causing memory pressure in production that doesn’t appear in development.
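To make the partition-skew example concrete, here is a minimal sketch of the check an engineer would run. Everything here is invented for illustration: in practice you would pull per-partition end offsets from the broker with a Kafka client rather than hardcode counts.

```python
# Toy sketch (not from Moffatt's post): flag partitions holding a
# disproportionate share of a topic's messages. Counts and the
# threshold are illustrative placeholders.

def partition_skew(counts: dict[int, int], threshold: float = 0.5) -> list[int]:
    """Return partitions holding more than `threshold` of all messages."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [p for p, n in counts.items() if n / total > threshold]

# A topic where a bad partitioning key sent 90% of traffic to partition 0:
counts = {0: 9_000_000, 1: 400_000, 2: 300_000, 3: 300_000}
print(partition_skew(counts))  # → [0]
```

The point is not the ten lines of Python, which any model can write; it is knowing that this is the question to ask when consumers are lagging.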

Experienced data engineers develop intuition about these runtime characteristics. They know where to look when something is slow, what symptoms indicate which class of problem, and which optimizations are worth making given their specific infrastructure. That intuition comes from years of reading query plans, watching metrics dashboards, and debugging failures that happened for unexpected reasons.

A model that has never observed a slow query plan cannot reason about why yours is slow. It can suggest generic optimizations, and sometimes those suggestions will be right, but it’s operating without the information that actually matters.

Schema Evolution and Silent Data Corruption

One of the most dangerous failure modes in data engineering is silent incorrectness. A pipeline can run without errors while producing wrong numbers. The job succeeds, the tables get written, the dashboards update, and nobody notices that the revenue figures are off by 15% because an upstream team added a new order status that your transform wasn’t handling.
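The defensive pattern here is to make the transform fail loudly on inputs it has never seen instead of silently excluding them. A hedged sketch, with invented statuses and an invented revenue mapping:

```python
# Illustrative only: a revenue rollup that raises on an unknown order
# status rather than quietly dropping those rows. The statuses and the
# is-revenue mapping are made up for this example.

KNOWN_STATUSES = {"completed": True, "refunded": False, "cancelled": False}

def revenue(orders: list[dict]) -> float:
    total = 0.0
    for order in orders:
        status = order["status"]
        if status not in KNOWN_STATUSES:
            # An upstream team added a new status: fail the job rather
            # than publish numbers that are quietly wrong.
            raise ValueError(f"unhandled order status: {status!r}")
        if KNOWN_STATUSES[status]:
            total += order["amount"]
    return total
```

A failed job gets noticed; a dashboard that is off by 15% often does not.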

Dealing with schema evolution requires understanding what changes are backward-compatible, what downstream consumers exist and what they depend on, and what the business semantics of a schema change actually are. Tools like Apache Avro and Confluent Schema Registry formalize some of this, and frameworks like dbt let you encode expectations in code. But the judgment calls about what constitutes a breaking change, and whether a change is intentional or a bug, still require human understanding of context.
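One of those formalized rules can be sketched in a few lines. This is a deliberately simplified version of the backward-compatibility idea behind Avro-style schema checks (real Avro semantics are richer, e.g. null defaults and type promotion); the field specs are invented:

```python
# Simplified backward-compatibility rule: a new schema can read records
# written with the old schema only if every field it adds carries a
# default value. Loosely modelled on Avro; not a full implementation.

def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    added = set(new_fields) - set(old_fields)
    return all(new_fields[f].get("default") is not None for f in added)

old = {"customer_id": {"type": "long"}}
new_ok = {"customer_id": {"type": "long"},
          "segment": {"type": "string", "default": "unknown"}}
new_bad = {"customer_id": {"type": "long"},
           "segment": {"type": "string"}}
```

What the code cannot encode is the judgment call the paragraph above describes: whether the added field was intentional, and what it means for the business.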

Claude Code can generate a schema migration script. It cannot tell you whether the migration is safe to run on your production data given the specific state of that data today. It doesn’t know that 3% of your records have no value in the field you’re about to mark as NOT NULL, because it hasn’t looked at your database.
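The pre-flight check itself is trivial once you know to run it. A self-contained sketch using sqlite3 so it runs anywhere; the table and column names are placeholders:

```python
# Illustrative pre-flight check before a NOT NULL migration: count the
# rows the new constraint would reject. Table/column names are invented.
import sqlite3

def null_count(conn, table: str, column: str) -> int:
    # Identifiers are interpolated directly, so pass trusted names only.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(sql).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 101), (2, None), (3, 103)])
# One row would violate NOT NULL: the migration is not safe to run yet.
print(null_count(conn, "orders", "customer_id"))  # → 1
```

The model can write this query too, of course. The point is that it has no way to run it against your warehouse, so it cannot know the answer matters.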

Where AI Assistance Actually Helps

None of this means AI tools are useless for data engineering. They’re genuinely helpful in specific ways.

Boilerplate reduction is real. Writing the tenth connector configuration of the day is tedious, and Claude Code is good at generating that structure correctly. Documentation is another genuine strength, as is explaining what a complex SQL query does: data engineers often inherit queries with no comments and need to understand them quickly.
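The connector boilerplate in question tends to look something like this: a Kafka Connect JDBC sink configuration, shown here with placeholder names and connection details. The structure is the tedious part a model handles well; the values are what you still have to get right for your environment:

```json
{
  "name": "orders-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "orders",
    "connection.url": "jdbc:postgresql://db:5432/warehouse",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "auto.create": "true"
  }
}
```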

Translation between SQL dialects is a practical use case that comes up often during migrations. Moving from Hive to Spark SQL, or from Oracle to BigQuery, involves syntactic differences that are mechanical and well-defined. A model can handle most of that translation competently.
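As a toy illustration of how mechanical the translatable part is, consider mapping function names between dialects. Real migrations use a proper SQL parser (tools like sqlglot do exactly this); naive substitution like the sketch below would also rewrite matching text inside string literals, which is why parsers exist:

```python
# Toy dialect translation: two real Hive -> BigQuery function renames
# (NVL -> IFNULL, INSTR -> STRPOS) applied with word-boundary regexes.
# Illustrative only; production translation needs a parser.
import re

HIVE_TO_BIGQUERY = {"NVL": "IFNULL", "INSTR": "STRPOS"}

def translate(sql: str) -> str:
    for src, dst in HIVE_TO_BIGQUERY.items():
        sql = re.sub(rf"\b{src}\b", dst, sql, flags=re.IGNORECASE)
    return sql

print(translate("SELECT NVL(discount, 0) FROM orders"))
# → SELECT IFNULL(discount, 0) FROM orders
```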

Writing test cases for data transforms is another area where AI assistance adds value. Given a transform and some example input data, generating edge cases to test against is a task that maps well onto what these models can do.
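The shape of those generated edge cases is worth showing. The transform below (normalizing an email column) is invented for illustration; the interesting part is the case list, which is exactly the kind of thing a model drafts well:

```python
# Hypothetical transform plus the edge cases a model might generate
# for it. Both are illustrative, not from Moffatt's experiment.

def normalize_email(raw):
    """Trim and lower-case an email; treat blank/missing as None."""
    if raw is None:
        return None
    cleaned = raw.strip().lower()
    return cleaned or None  # empty or whitespace-only -> missing

cases = [
    ("  Alice@Example.COM ", "alice@example.com"),  # padding and case
    ("", None),                                     # empty string
    ("   ", None),                                  # whitespace only
    (None, None),                                   # missing value
]
for raw, expected in cases:
    assert normalize_email(raw) == expected
```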

What doesn’t work is treating Claude Code as an autonomous data engineer that can take a vague requirement and produce a correct, production-ready pipeline. The gap between “generates plausible SQL” and “builds reliable data infrastructure” is the entire profession.

The Underlying Shift

There’s a version of this conversation that’s worth taking seriously: not whether AI replaces data engineers now, but whether the trajectory matters. Models that can synthesize code competently are useful tools that change the time allocation of skilled engineers. Work that used to take two hours of writing boilerplate takes twenty minutes. That’s real, and it compounds over time.

What doesn’t compress is the diagnostic work, the stakeholder translation, the judgment calls about data quality and correctness, and the institutional knowledge about why your specific data looks the way it does. Those parts of the job are harder to benchmark and harder to automate because they require grounding in the particulars of an organization’s data that no general-purpose model has access to.

Moffatt’s experiment is useful precisely because it’s concrete. Showing where a tool fails is more informative than abstract arguments about capability. The lesson isn’t that AI is overhyped in general, but that the specific failure modes of these tools in data engineering contexts are predictable and follow from the nature of the work rather than from any particular model’s limitations. Better models will generate better code. The parts of the job that aren’t code will still be there.
