The Metadata Layer That Sits Between AI and Data Engineering Competence
Source: lobsters
Robin Moffatt’s hands-on account of using Claude Code for real data engineering work lands on a conclusion that’s easy to agree with and harder to explain: the tool is impressive, and it still misses. The gap is not a matter of the model being insufficiently capable of writing code. It’s a matter of what data engineering actually is, and how little of it lives in the files that an AI can read.
The SWE-bench Disconnect
When we talk about AI coding benchmarks, SWE-bench comes up constantly. The benchmark presents models with real GitHub issues from established open-source Python repositories and asks them to produce a patch that makes failing tests pass. Models like Claude 3.7 Sonnet have been scoring above 60% on the verified subset. These are genuinely hard software engineering tasks.
But look at what the benchmark structure assumes. There is a well-defined bug report. There is an existing test suite that encodes the correct behavior. There is a bounded, well-understood codebase with years of public documentation. The model just needs to find the right change.
Data engineering work almost never has this shape. When a senior data engineer joins a new company and is asked to “fix the revenue pipeline,” there is rarely a failing test that defines what correct revenue looks like. There is a Looker dashboard where a number has been wrong for two quarters, and a Slack thread from 2023 where someone noted that the orders table started double-counting refunds after a migration but “we’ll fix it properly later,” and a dbt model that has a WHERE status != 'voided' filter that no one can explain anymore. The code is legible. The context is not.
What Lives Outside the Repository
A typical dbt project gives an AI a lot to work with. The models/ directory contains SQL transformations. schema.yml files describe columns and can encode data tests like not_null and unique. The dbt_project.yml specifies materialization strategies. The compiled manifest.json and catalog.json expose the full dependency graph and column-level metadata.
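As a rough illustration, that dependency information is plain JSON that a few lines of Python can read. The manifest below is a pared-down stand-in for the real file, which carries far more metadata per node:

```python
import json

# A pared-down stand-in for dbt's manifest.json; real manifests carry
# far more per-node metadata (materializations, tests, column docs).
manifest = {
    "nodes": {
        "model.proj.stg_orders": {
            "depends_on": {"nodes": ["source.proj.raw.orders"]}
        },
        "model.proj.fct_revenue": {
            "depends_on": {"nodes": ["model.proj.stg_orders"]}
        },
    }
}

def parents(manifest, node_id):
    """Direct upstream dependencies of a node, as recorded in the manifest."""
    return manifest["nodes"][node_id]["depends_on"]["nodes"]

# A tool can answer "what does this model read from?" purely from this file.
print(parents(manifest, "model.proj.fct_revenue"))
```

This is exactly the kind of question the repository answers completely on its own, which is why lineage tooling built on the manifest works so well.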
This is a rich representation of structure. It tells you almost nothing about semantics.
Consider a dbt model with a filter like this:
```sql
WHERE
    account_type NOT IN ('test', 'internal', 'partner_demo')
    AND created_at >= '2021-06-01'
```
The schema.yml won’t tell you why partner_demo accounts were excluded in a patch commit eighteen months ago, or whether the 2021-06-01 cutoff was chosen because the data before that date was migrated from a legacy system with different semantics. An AI reading the repository sees these as constraints. A data engineer who knows the history sees them as decisions with specific reasons that may or may not still apply.
This is what you might call the decision lineage problem, as distinct from data lineage. Data lineage tools like OpenLineage and Marquez can track which tables a model reads from and writes to. They can show you that fct_revenue depends on stg_orders which depends on raw.orders. What no tool currently captures well is the lineage of decisions: why this join uses a LEFT JOIN instead of an INNER JOIN, and what business rule justified truncating to date_trunc('month', created_at) rather than using the actual event timestamp.
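As a sketch of the distinction, here is a hypothetical structure pairing ordinary lineage edges with the decision records that no current tool stores. Every field name, ticket reference, and model name below is invented for illustration:

```python
from dataclasses import dataclass, field

# Hypothetical sketch only: nothing like this exists in OpenLineage or the
# dbt manifest today. All names and the ticket ID are invented.

@dataclass
class Decision:
    what: str        # the code-level choice
    why: str         # the business reason, in prose
    source: str      # where the reason was recorded (ticket, Slack, spec)
    decided_on: str

@dataclass
class ModelNode:
    name: str
    depends_on: list                                # data lineage
    decisions: list = field(default_factory=list)   # decision lineage

fct_revenue = ModelNode(
    name="fct_revenue",
    depends_on=["stg_orders"],
    decisions=[
        Decision(
            what="LEFT JOIN stg_refunds instead of INNER",
            why="orders with no refund row must still count toward gross revenue",
            source="LINEAR-1234",
            decided_on="2023-04-02",
        )
    ],
)

# Data lineage answers "what feeds this model"; decision lineage would
# answer "why is it built this way".
print(fct_revenue.decisions[0].why)
```

The point is not this particular schema; it is that the second half of the structure is the half nobody currently writes down in machine-readable form.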
The Semantic Correctness Problem
A data pipeline can be syntactically valid, pass all its tests, and produce results that are wrong in ways that only someone with domain knowledge would catch. This is the most dangerous failure mode because it is invisible to the tooling.
Suppose you ask Claude Code to write a dbt model that calculates monthly active users from a product events table. Given a reasonable schema, it will produce working SQL, probably something like:
```sql
SELECT
    DATE_TRUNC('month', event_timestamp) AS month,
    COUNT(DISTINCT user_id) AS monthly_active_users
FROM {{ ref('stg_events') }}
WHERE event_type IN ('page_view', 'click', 'form_submit')
GROUP BY 1
```
This query runs. The numbers come out. But maybe your product counts MAU by calendar month in the user’s local timezone, not UTC. Maybe your definition of “active” was formalized in a product spec to mean at least one session with a minimum duration of 30 seconds. Maybe the form_submit event was added to the tracking plan in August 2024 and including it for historical months creates an artificial spike in your MAU chart that will confuse your board presentation.
None of these constraints are derivable from the table schema. They exist in a product spec in Notion, in a historical decision logged in a Linear ticket, in a data dictionary maintained by someone who left the company. Claude Code cannot read those sources unless you paste them in, and even then it can only act on what you thought to include.
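The timezone case is easy to demonstrate concretely. This minimal sketch, using only the standard library and assuming a user in UTC-8, shows a single event near a month boundary landing in different months depending on the bucketing convention:

```python
from datetime import datetime, timedelta, timezone

# Illustrative only: one event, bucketed by UTC month vs. the user's
# local month (UTC-8 assumed), lands in two different months.
event_utc = datetime(2024, 7, 1, 3, 0, tzinfo=timezone.utc)   # July 1, 03:00 UTC
local = event_utc.astimezone(timezone(timedelta(hours=-8)))    # June 30, 19:00 local

utc_month = (event_utc.year, event_utc.month)
local_month = (local.year, local.month)

# The same activity counts toward July MAU under one definition
# and June MAU under the other.
print(utc_month, local_month)
```

Both conventions produce a query that runs and a chart that looks reasonable; only the product spec says which one is right.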
Schema Evolution and the Blast Radius Problem
Where AI tools become genuinely risky in data engineering is schema evolution. When an upstream source system changes a column type, renames a field, or silently starts producing NULLs for records that previously had values, the impact radiates outward through every downstream model in the lineage graph.
A data engineer handling this change needs to understand: which downstream models actually depend on this column versus just selecting *; what the business impact is of the change (is this a breaking change or an additive one); whether backfilling historical data is feasible and what the state management looks like in the orchestrator; and whether any BI tools or reverse-ETL pipelines consume the affected models directly.
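The first of those questions, separating real column dependencies from SELECT * pass-throughs, is a graph traversal once you have column-level edges. The sketch below assumes those edges were already extracted by a SQL parser; all table and model names are invented:

```python
# Sketch of the "blast radius" question for a single column. The
# column-level edges below are assumed to come from a SQL parser;
# the dbt manifest alone does not provide them.
column_edges = {
    ("raw.orders", "status"): {"stg_orders"},
    ("stg_orders", "status"): {"fct_revenue"},  # selects the column explicitly
    # A model doing SELECT * from stg_orders would be invisible here:
    # the dependency never appears as a column-level edge.
}
model_columns = {"stg_orders": ["status"], "fct_revenue": ["status"]}

def blast_radius(table, column):
    """All models transitively affected by a change to (table, column)."""
    affected, frontier = set(), [(table, column)]
    while frontier:
        t, c = frontier.pop()
        for downstream in column_edges.get((t, c), set()):
            if downstream not in affected:
                affected.add(downstream)
                for col in model_columns.get(downstream, []):
                    frontier.append((downstream, col))
    return affected

print(blast_radius("raw.orders", "status"))
```

The traversal is the easy part. Everything else on the list, business impact, backfill feasibility, consumer inventory, is the part the graph cannot answer.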
Orchestration tools like Apache Airflow and Dagster encode the task graph and retry logic in code. But the acceptable blast radius of a given change, the SLAs that determine which pipelines are critical-path, and the alerting thresholds that separate a tolerable delay from a production incident are organizational agreements that live in runbooks, on-call handbooks, and institutional memory.
An AI can read a DAG definition. It cannot tell you that the nightly_revenue_load task at 2am UTC is load-bearing for a finance close process that has a hard deadline, while the ml_feature_backfill task running in parallel can safely be paused for a week without consequence.
What AI Handles Well
None of this means AI tools are useless in a data engineering context. The places where they genuinely accelerate work are the ones where the specification is complete and the validation criteria are clear.
Generating boilerplate dbt source definitions from a database schema dump is mechanical work that Claude Code handles well. Scaffolding a new Airflow operator that follows the pattern of existing ones in the project is exactly the kind of imitation task that large language models do reliably. Writing the SQL for a well-specified aggregation, given a detailed prose description and the relevant table schemas, produces useful first drafts. Catching common SQL anti-patterns and suggesting more efficient query structures are areas where the model’s breadth of exposure to code genuinely helps.
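The source-definition case in particular is almost purely mechanical. A rough sketch of the transformation, with an invented schema dump standing in for real information_schema output:

```python
# Toy sketch of schema-dump-to-dbt-sources boilerplate generation.
# The schema dump and its shape are invented for illustration; real
# information_schema output and dbt's sources spec carry more fields.
schema_dump = {
    "orders": ["order_id", "customer_id", "status", "created_at"],
    "customers": ["customer_id", "email", "created_at"],
}

def to_source_yaml(schema_name, tables):
    """Render a minimal dbt sources YAML entry from a table->columns map."""
    lines = ["version: 2", "sources:", f"  - name: {schema_name}", "    tables:"]
    for table, columns in tables.items():
        lines.append(f"      - name: {table}")
        lines.append("        columns:")
        lines.extend(f"          - name: {col}" for col in columns)
    return "\n".join(lines)

print(to_source_yaml("raw", schema_dump))
```

Every input the task needs is in the schema dump, and the output is trivially checkable, which is precisely why it sits on the safe side of the line the article draws.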
dbt Labs has been integrating AI features into its Cloud product, and the most successful applications are in this bounded territory: column-level documentation generation, test suggestion for existing models, and natural-language-to-SQL for straightforward reporting queries. The scope is deliberately narrow.
The Gap That Remains
The limitation is not that AI models lack intelligence. It is that intelligence applied to incomplete context produces confidently wrong answers, which in data engineering is often worse than no answer at all. A pipeline that fails loudly is fixable. A pipeline that silently produces plausible-looking wrong numbers can corrupt months of business decisions before anyone notices.
Data engineers carry a mental model of the system that includes the code, the data, the organizational agreements around the data, the history of decisions, the failure modes they have personally debugged, and the humans who own each piece of the stack. That composite understanding is what makes a data engineer dangerous in a good way. The code they write is an expression of that understanding, not the understanding itself.
Until AI tools have a way to ingest and reason over the full context surrounding a data system, not just its repository, the tooling will be a capable assistant for well-defined subtasks and a liability for the judgment calls that actually define the profession. The “yet” in Moffatt’s title is doing real work. The path to closing the gap probably runs through better tooling for capturing organizational knowledge in machine-readable form, not through making the models larger.