Robin Moffatt published a piece this week titled “Claude Code isn’t going to replace data engineers (yet)” that’s worth reading carefully. His argument is grounded and practical, as you’d expect from someone who spent years at Confluent and Lenses.io actually operating data infrastructure at scale. The conclusion is right. But I think the structural reason behind it deserves more unpacking.
The short version: the code in a data pipeline repository is not where the hard knowledge lives. It never was. The hard part of data engineering is a layer of institutional knowledge, business context, and metadata that isn’t in any file, and that’s precisely what language models cannot reach.
What the Repository Actually Contains
When you look at a mature data engineering codebase, you see dbt models, Apache Airflow DAGs, configuration files, maybe some Great Expectations suites, maybe some OpenLineage instrumentation wired into your orchestrator. A capable LLM can read all of that. It can reason about it. It can generate new models that are syntactically valid and structurally consistent with what’s already there.
Here’s a concrete example. Suppose your dbt project has a model called fct_revenue. An LLM can look at that model, understand the joins, follow the ref() dependencies back through stg_orders and stg_payments, and generate a new model fct_revenue_by_region that compiles cleanly and runs without error. The SQL is correct. The lineage graph looks sensible.
```sql
-- fct_revenue_by_region.sql (generated)
with revenue as (
    select * from {{ ref('fct_revenue') }}
),

regions as (
    select * from {{ ref('dim_regions') }}
)

select
    r.region_name,
    sum(rev.gross_revenue) as total_revenue
from revenue rev
join regions r on rev.region_id = r.region_id
group by 1
```
This compiles. It runs. It produces numbers. And it might be completely wrong.
The Semantic Layer That Isn’t Written Down
What the LLM doesn’t know, because it can’t know, is that gross_revenue in fct_revenue has been under active dispute for six months. The finance team and the growth team define it differently. There’s a Slack thread about it. There was a post-mortem after a board presentation where the numbers didn’t match. The current model uses the finance team’s definition, but only since a migration that happened in October, and the historical data before that cutover uses a different calculation.
None of that is in the repository. Some of it might be in a comment, if you’re lucky. Most of it lives in the heads of two or three people.
This isn’t an edge case. It’s the normal state of production data infrastructure at any organization that has been operating for more than a year. The code is the least ambiguous artifact in the system. The business rules, the exception handling, the tribal knowledge about which upstream sources are reliable and which ones silently drop records on the last day of the month: none of that has a canonical location.
Schema Evolution Is a Solved Problem Until It Isn’t
Apache Iceberg has good schema evolution semantics. You can add, rename, and reorder columns without breaking downstream readers. The tooling handles it. An LLM can tell you this.
What the LLM cannot tell you is that the customer_tier column you’re about to rename was added by a contractor two years ago and is being read directly by a legacy reporting system that bypasses the catalog entirely, a system that three people know about and only one of them still works here. Renaming it in Iceberg will succeed. The downstream break will show up two weeks later in a report that someone’s VP looks at on Monday mornings.
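The mechanics of why this failure mode is invisible at rename time can be sketched in a toy simulation. This is illustrative Python, not Iceberg's actual implementation; the real design point it mimics is that Iceberg tracks columns by stable field IDs, so catalog-aware readers survive a rename while anything that hard-codes column names does not:

```python
# Toy simulation (not real Iceberg code): columns carry stable field IDs,
# so a rename changes only the name, never the ID.
schema = {1: "customer_tier", 2: "customer_id"}  # field_id -> current name

def rename_column(schema, old_name, new_name):
    """Rename in place, preserving field IDs (mimics Iceberg schema evolution)."""
    for field_id, name in schema.items():
        if name == old_name:
            schema[field_id] = new_name
            return
    raise KeyError(old_name)

def catalog_aware_read(schema, field_id):
    """A reader that resolves columns through the catalog, by field ID."""
    return schema[field_id]

def legacy_read(row, column_name):
    """A reader that bypasses the catalog and hard-codes column names."""
    return row[column_name]

rename_column(schema, "customer_tier", "tier")

# The catalog-aware reader still finds the column under its new name:
assert catalog_aware_read(schema, 1) == "tier"

# The legacy reader breaks, but only when it next runs:
row = {"tier": "gold", "customer_id": 42}
try:
    legacy_read(row, "customer_tier")
except KeyError:
    print("legacy report broke")  # surfaces weeks later, not at rename time
```

The rename itself succeeds silently; the failure is deferred to whenever the name-coupled reader next executes, which is exactly why nobody connects the two events.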
The metadata layer that would capture this dependency doesn’t exist in most organizations. Marquez and OpenLineage are excellent projects that are trying to build it. But adoption is incomplete, the data is always somewhat stale, and the lineage graph only captures what the instrumented systems know about. The legacy reporting system isn’t instrumented.
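The coverage gap is easy to see if you sketch what an impact analysis over a lineage graph actually does. This is a minimal stdlib sketch with hypothetical dataset names, standing in for the kind of graph Marquez or OpenLineage would hold; the walk can only traverse edges that instrumented systems reported:

```python
from collections import deque

# Hypothetical lineage graph: parent dataset -> datasets that consume it.
# Only edges reported by instrumented systems are present.
lineage = {
    "warehouse.dim_customers": ["dbt.fct_revenue", "dbt.fct_churn"],
    "dbt.fct_revenue": ["dashboard.revenue_weekly"],
    "dbt.fct_churn": [],
    "dashboard.revenue_weekly": [],
}
# Missing entirely: the legacy reporting system that reads
# warehouse.dim_customers directly and was never instrumented.

def downstream(graph, node):
    """Breadth-first walk of everything the graph says depends on node."""
    seen, queue = set(), deque([node])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

impacted = downstream(lineage, "warehouse.dim_customers")
print(sorted(impacted))  # the legacy report is not in this set
```

The analysis is correct with respect to the graph it was given, and the graph is wrong with respect to reality. No amount of model capability fixes that; the edge simply isn't there to find.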
Why SWE-bench Doesn’t Transfer
SWE-bench is a reasonable benchmark for software engineering tasks on open-source repositories. The task is well-defined: given a repository and a bug report, produce a patch that passes the test suite. The knowledge required to solve the problem is, by construction, largely contained in the repository. That’s what makes it a tractable benchmark.
Data engineering doesn’t have that property. The equivalent benchmark would require encoding the business context, the upstream data quality history, the organizational politics around metric definitions, and the undocumented dependencies that exist outside the formal system. You can’t write a test suite for “the CFO’s definition of ARR as it evolved between Q2 and Q4 of last year.”
This is why SWE-bench scores, however impressive, tell you almost nothing about whether an LLM can operate as a data engineer. The tasks are structurally different. One is mostly a code problem. The other is mostly a knowledge management problem that happens to have some code in it.
What Data Quality Rules Look Like in Practice
Great Expectations lets you encode data quality rules as code. dbt has its own test framework. These are genuinely useful tools. But the rules you encode are the rules someone already thought to encode.
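For contrast, here is a minimal sketch of the kind of rule that does get encoded, using dbt's built-in generic tests; the model and column names are hypothetical:

```yaml
# schema.yml -- the rules someone already thought to write down
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'cancelled']
```

Everything in that file is a rule someone consciously formulated. The knowledge that never made it into the file looks quite different.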
In practice, a significant portion of data quality knowledge looks like this:
- “The `event_timestamp` field from the mobile SDK is in UTC but the web SDK sends local time; this was fixed in SDK version 4.2 but we still have historical data from before the fix”
- “Orders with `status = 'pending'` for more than 72 hours are almost always abandoned but we don’t mark them that way in the source system”
- “The daily feed from the third-party vendor skips Sundays and backdates them to Monday; this is intentional on their end, don’t file a support ticket”
None of this is in a YAML file. Some of it is in a runbook that someone wrote once and nobody updates. Most of it is in the memory of the person who got paged when the anomaly first appeared.
An LLM reading the repository will not find it. If you ask the LLM to write a data quality check for event_timestamp, it will write a reasonable check based on what it can infer from the schema and the existing tests. It will not know about the UTC/local time issue unless someone tells it, and if someone tells it, you’ve already done the hard work.
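To make that concrete, here is a hedged sketch of what the check looks like after a human has supplied the missing context. The field names, the `web` source tag, the logged UTC offset, and the 4.2 cutoff are hypothetical details standing in for the tribal knowledge; none of them are inferable from the repository:

```python
from datetime import datetime, timedelta, timezone

FIXED_IN = (4, 2)  # hypothetical: web SDK started sending UTC in version 4.2

def normalize_event_timestamp(event):
    """Return the event time as UTC, correcting pre-4.2 web SDK local times."""
    ts = event["event_timestamp"]  # naive datetime as sent by the SDK
    version = tuple(int(p) for p in event["sdk_version"].split("."))
    if event["source"] == "web" and version < FIXED_IN:
        # Pre-fix web events are in the device's local time; shift using the
        # UTC offset (in minutes) that the SDK happened to log alongside.
        ts -= timedelta(minutes=event["utc_offset_minutes"])
    return ts.replace(tzinfo=timezone.utc)

# A pre-fix web event logged at 09:00 local time, UTC-05:00:
event = {
    "source": "web",
    "sdk_version": "4.1",
    "event_timestamp": datetime(2024, 3, 1, 9, 0),
    "utc_offset_minutes": -300,
}
print(normalize_event_timestamp(event))  # 2024-03-01 14:00:00+00:00
```

The branch condition is the entire value of the function, and every clause in it came from an incident, not from the schema. Generating this code is trivial; knowing to generate it is the job.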
The Actual Ceiling
I’m not arguing that AI coding tools are useless for data engineering. They’re genuinely useful for the parts of the job that are well-defined code problems: writing boilerplate dbt models, generating Airflow task scaffolding, translating a SQL query from one dialect to another, writing the skeleton of a Great Expectations suite. These are real time savings.
But the ceiling is structural, not a matter of model capability. The knowledge that makes a data engineer valuable to their organization isn’t in the repository. It’s in the documentation that doesn’t get written, the post-mortems that don’t get filed, the Slack messages that scroll off the screen, and the accumulated experience of having been burned by the same upstream source twice.
Until that knowledge has a canonical, machine-readable form, LLMs are working from an incomplete picture of the problem. That’s not a criticism of the models. It’s a description of where the actual complexity of data engineering lives, and it’s mostly not in any file.