Robin Moffatt’s recent piece on Claude Code and data engineering carries a qualifier worth examining: “yet.” The observation is that Claude Code is already a capable accelerant for greenfield pipeline work but consistently fails at the parts of data engineering that aren’t code generation. What that piece points at, without fully unpacking it, is that data engineering has a knowledge layer that doesn’t exist in any file a language model can read.
Code generation from AI is genuinely good now. Given a clean schema, a clear task description, and a well-understood framework, Claude Code will produce working dbt models, Airflow DAGs, and Spark transforms that are hard to distinguish from human output. That capability matters for certain tasks. It does not matter for the class of problems where the bottleneck is not code production.
The Revenue Column Problem
Consider a dimension table with three columns: revenue_net, revenue_adjusted, and revenue_recognized. The names carry semantic signal but not history. revenue_adjusted exists because a finance team reclassification happened in Q3 2023 after an audit. revenue_recognized diverges from revenue_adjusted specifically for multi-year contracts under ASC 606. There is a correction factor for one regional market that was fixed upstream in January 2024 but kept in the transform for backward compatibility with a downstream report that nobody updated.
When an AI coding tool is asked to write a pipeline against this table, it will make a choice based on column name semantics. The pipeline will run without errors. The numbers will look plausible. The data will be wrong in ways that surface not in testing or CI but weeks later, when a finance analyst asks why Q4 revenue doesn’t reconcile with the actuals from the closed-book report.
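A sketch of what that wrong-but-plausible choice looks like. The model and column names are hypothetical, matching the example above; the point is that nothing in the schema flags that the closed-book report reconciles against revenue_recognized.

```sql
-- Hypothetical dbt model. A generator keying on column names alone may
-- reasonably pick revenue_net, but the closed-book report reconciles
-- against revenue_recognized (the ASC 606 treatment for multi-year
-- contracts). The schema carries no trace of that history.
SELECT
    date_trunc('quarter', closed_at) AS fiscal_quarter,
    sum(revenue_net) AS total_revenue   -- runs, looks plausible, wrong
FROM {{ ref('dim_revenue') }}
GROUP BY 1
```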
This is not a problem that better prompting solves. It is not a problem that a larger context window solves. The information required to make the right choice does not exist in any structured form. It lives in the heads of two people, a closed Jira ticket, and the institutional memory of why the original decision was made. No amount of schema introspection surfaces it.
Data Lineage vs. Decision Lineage
The data tooling ecosystem has invested heavily in lineage. DataHub, OpenLineage, and Apache Atlas all exist to track the flow of data from source to consumption. A mature lineage graph tells you that fct_revenue depends on stg_orders which depends on raw.orders. It tells you which columns flow into which transforms. It does not tell you why.
Take a filter in a dbt model:
```sql
WHERE
  account_type NOT IN ('test', 'internal', 'partner_demo')
  AND created_at >= '2021-06-01'
```
The lineage graph records that this filter exists. It says nothing about why partner_demo was added to the exclusion list 18 months ago, whether that decision is still valid given that the partner program has since changed, or why 2021-06-01 was chosen rather than the product launch date. An AI reading this code sees two filter conditions. A data engineer who was present sees a decision with a history that may or may not still apply.
This matters because data transformations accumulate decisions the way software accumulates design choices. The difference is that software design decisions at least tend to be captured in commit messages, architecture docs, or RFC threads. Data transformation decisions get made in Slack, in spreadsheets, in conversations before a dashboard goes live. The codified form of those decisions is the filter condition or the join key, stripped of the reasoning that justified them.
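One partial remedy this gap suggests is keeping the reasoning next to its codified form. A hypothetical annotated version of the filter above; the history in the comments is illustrative, not a real ticket trail:

```sql
-- Same filter, with the decision history inline rather than lost.
WHERE
  -- test/internal: never revenue-bearing.
  -- partner_demo: excluded ~18 months ago; worth revisiting, since the
  -- partner program has changed and these may now be real accounts.
  account_type NOT IN ('test', 'internal', 'partner_demo')
  -- 2021-06-01: a deliberate cutoff, not the product launch date;
  -- the reasoning lives only in a closed ticket.
  AND created_at >= '2021-06-01'
```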
The Silent Failure Mode
The most dangerous property of AI-generated data pipelines is not the error that throws an exception. When a pipeline throws exceptions, the problem is visible and actionable. The dangerous failure mode is a pipeline that runs successfully, writes data, and updates downstream systems with numbers that are plausible but wrong.
Monthly active users is a canonical example. An AI tool given a product events table can write syntactically correct SQL that computes MAU. What it cannot know: that your product defined “active” as a session of at least 30 seconds; that the form_submit event type was added in August 2024, so including it in a historical cohort creates an artificial step-change in the MAU trend; that your product counts MAU by the user’s local timezone rather than UTC, because someone did the math once and discovered the UTC approach was undercounting mobile users in Asia-Pacific by a meaningful percentage.
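A version of the MAU query that encodes those decisions might look like the following. Every identifier is hypothetical, and the exclusion of form_submit is one plausible way to keep the trend comparable; notice how much of the logic exists only as comments.

```sql
-- Hypothetical product events schema; names are illustrative.
SELECT
    -- Local timezone, not UTC: UTC was undercounting APAC mobile users.
    date_trunc('month', event_at AT TIME ZONE user_tz) AS activity_month,
    count(DISTINCT user_id) AS mau
FROM product_events
WHERE session_seconds >= 30              -- "active" = session of >= 30s
  AND event_type <> 'form_submit'        -- added Aug 2024; excluded so
                                         -- historical cohorts stay comparable
GROUP BY 1
```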
None of these specifications live in the schema. They exist in a product spec in Notion, in a Linear ticket, in a data dictionary maintained by someone who left the company. A generated MAU query that lacks this context passes all available validation, writes plausible numbers, and introduces a silent error that compounds over time as stakeholders make decisions against the metric.
Data quality frameworks like Great Expectations and dbt data tests give engineers a place to encode constraints: row counts should be within 5% of yesterday’s, the null rate for a given column should not exceed 0.3%, certain value combinations are semantically impossible. But those rules exist because a data engineer got paged at 2am and wrote them in the aftermath of an incident. Claude Code can write the Great Expectations suite if you supply it the rules. It cannot determine what the rules should be.
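The rules themselves are the hard part, not the suite. Sketched in plain Python rather than any framework’s API, with thresholds that are illustrative rather than drawn from any real incident:

```python
# Plain-Python sketch of the kind of rules a data quality suite encodes.
# Thresholds and names are illustrative.

def row_count_within_tolerance(today: int, yesterday: int,
                               tolerance: float = 0.05) -> bool:
    """Row counts should be within `tolerance` (5%) of yesterday's."""
    if yesterday == 0:
        return today == 0
    return abs(today - yesterday) / yesterday <= tolerance

def null_rate_acceptable(nulls: int, total: int,
                         max_rate: float = 0.003) -> bool:
    """The null rate for a column should not exceed `max_rate` (0.3%)."""
    return total > 0 and nulls / total <= max_rate

# A 4% day-over-day change passes; a 1% null rate does not.
print(row_count_within_tolerance(10_400, 10_000))  # True
print(null_rate_acceptable(100, 10_000))           # False
```

The values 0.05 and 0.003 are exactly the numbers that come out of a 2am page, which is why a code generator cannot supply them.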
What “Yet” Would Actually Require
Moffatt’s qualifier is honest about the nature of the gap. It is not a capability gap in the conventional code-generation sense; it is a structural problem about where knowledge lives and in what form. Closing it would require three things that are technically feasible but organizationally difficult.
The first is persistent organizational memory. A RAG architecture over closed Jira tickets, Confluence pages, Slack threads, incident postmortems, and data review documents would give an AI system access to decision history rather than just current state. The technical architecture for this exists. The organizational problem is that this information is siloed across systems, inconsistently tagged, and frequently incomplete. The ticket that explains the 2021-06-01 cutoff might be linked from the dbt model description, or it might exist in an archived project that nobody references anymore.
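The nearest existing hook for that link is the model description itself. A hypothetical schema.yml fragment that makes a decision record retrievable alongside the model; the names and meta keys are illustrative:

```yaml
# Hypothetical dbt schema.yml fragment.
models:
  - name: stg_orders
    description: >
      Excludes test, internal, and partner_demo accounts. The 2021-06-01
      cutoff is deliberate, not the launch date; see the linked decision
      record before changing it.
    meta:
      decision_records:
        - "closed ticket explaining the 2021-06-01 cutoff (link)"
```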
The second is a complete and accurate schema lineage graph. Most organizations have partial graphs, with gaps where data crosses system boundaries, enters through undocumented manual processes, or predates the adoption of the current data catalog tooling. Even organizations with mature dbt deployments and schema registries like Confluent’s have edges in their lineage graph that nobody owns and nobody fully understands.
The third is production environment access with appropriate guardrails. Deploying a change to a running pipeline is different from writing the pipeline code. It may require reprocessing historical data, managing Kafka offset commits, handling backfill windows in Airflow, or migrating state in a streaming processor. An agent that generates correct updated pipeline code has solved what is often the smallest part of safely shipping that change in production. The sequencing, the rollback conditions, the coordination with downstream consumers: these require both system access and the organizational knowledge of which downstream systems have hard SLA dependencies on this pipeline and which are tolerant of reprocessing delays.
Where the Tooling Is Working
The most successful applications of AI in data engineering have deliberately narrow scope. dbt’s Assist feature focuses on column-level documentation generation, test suggestions for existing models, and natural language to SQL for well-specified reporting queries. The narrowness is a design choice. These are tasks where the context required for a correct answer is bounded and derivable from the existing codebase and schema.
Natural language to SQL for simple aggregations over a single table with clearly named columns and no semantic ambiguity is a task where AI tools perform well and the failure modes are visible. Generating transform logic for a new pipeline feature given a clear spec and a schema that has not accumulated years of undocumented decisions is genuinely useful. These are real productivity gains.
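For contrast, the easy case: a request like “signups by country this year” over one clearly named table maps to SQL with no buried decisions. All identifiers here are hypothetical.

```sql
-- Single table, clearly named columns, no semantic ambiguity.
SELECT country, count(*) AS signups
FROM signups
WHERE created_at >= '2025-01-01'
GROUP BY country
ORDER BY signups DESC
```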
The ceiling is a function of the ratio between knowledge that exists in accessible form and knowledge that lives in organizational memory. Better models with larger context windows push that ceiling upward, but they do not eliminate it, because the information required to make correct data engineering decisions frequently does not exist in any form that can be ingested as context.
The “yet” in Moffatt’s title acknowledges that the gap is structural without treating it as permanent. What would close it is not a better code generator but a different relationship between AI systems and organizational knowledge, one where decision history is captured with the same rigor as data lineage. Few organizations have that infrastructure today. Building it is itself a data engineering problem, and one that requires human judgment to define what is worth capturing in the first place.