
When the Code Works and the Data Is Wrong

Source: lobsters

Robin Moffatt put Claude Code through real data engineering work and published an honest accounting of where it landed. The summary is unsurprising to anyone who has done this work: the tool handles code generation well and loses the thread when the work shifts from syntax to semantics. The gap he describes is real, but I want to go deeper into the specific mechanisms of that failure, because they follow a pattern that reveals something important about what data engineering actually is.

The central problem is not that AI cannot write SQL. It can write SQL. The problem is that SQL is the last 10% of the work, the part where you translate a decision that was already made into a form the database can execute. The other 90% is arriving at the decision: which definition of the thing you are measuring is authoritative, what assumptions about data shape are safe to encode in code, which failure modes should be loud and which should be handled quietly, and what happens to every downstream consumer when you change any of this. None of that is in the file system.

The Ambiguity Is Load-Bearing

Consider a column called created_at. This seems unambiguous. It does not mean the same thing in every system.

In a transactional database, created_at is typically the timestamp of record insertion, which usually approximates the time the event occurred. In a CDC-replicated copy of that database, created_at is still the original insertion timestamp, but there is now a separate metadata field for when the record arrived in the replica. In a Kafka topic that was consumed from that CDC stream and written to a data lake, the messages may carry three timestamps: the original event time, the Kafka producer timestamp, and the time the message was written to object storage. In a dbt model that joins these sources, someone made a choice about which timestamp to surface as created_at, and that choice is either documented or it is not.
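
To make the layering concrete, here is a hypothetical lake record carrying all of those timestamps at once. The field names are invented for illustration, and the helper stands in for the choice a dbt model makes when it decides which one to surface:

```python
# A hypothetical record as it lands in the lake after OLTP -> CDC -> Kafka
# -> object storage. All field names here are illustrative, not from any
# specific system.
lake_record = {
    "order_id": 42,
    "created_at": "2024-03-01T09:15:00Z",        # insertion time in the OLTP source
    "_cdc_captured_at": "2024-03-01T09:15:02Z",  # when CDC read the change
    "_kafka_timestamp": "2024-03-01T09:15:03Z",  # producer timestamp on the topic
    "_ingested_at": "2024-03-01T09:16:10Z",      # when the file hit object storage
}

def surface_created_at(record: dict) -> str:
    """The dbt-model decision in miniature: which timestamp becomes
    'created_at' for downstream consumers? The choice here (keep the
    source event time) is policy, not syntax, and deserves documentation."""
    return record["created_at"]
```

Whichever field this function returns, every downstream consumer inherits that decision without seeing it.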

When an analyst queries the model and sees created_at, they will assume it means what it looks like it means. When an AI coding assistant is asked to write a pipeline against that model, it will make the same assumption. Both may be wrong, and the error will not surface immediately. It will surface when the analyst runs a cohort analysis and the numbers do not match the transactional system, or when the pipeline produces a correct row count with a wrong temporal distribution.

This category of semantic ambiguity is not limited to timestamps. Status columns encode state machines whose transitions are often underdocumented. Amount fields may or may not include taxes, may or may not reflect returns, and may be denominated in different currencies across different records depending on when the row was written. Column names are identifiers, not specifications, and the distance between the identifier and the full specification is where real data engineering lives.

Schema Compatibility as Policy

Schema evolution is a forcing function for organizational decisions that nobody wants to have explicitly. When an upstream team adds a field to a Kafka message, or changes a field type, or removes a field, they are making a statement about their priorities relative to their consumers. Whether that statement is backward-compatible, forward-compatible, or neither is not just a technical question; it is a policy question about how the team is allowed to change its contract.

Avro and Protobuf encode this distinction mechanically. In Avro, adding a field with a default value is backward-compatible; adding one without a default is not, because a reader using the new schema has no value to fill in when it decodes old records. In Protobuf, field numbers are permanent identifiers; reusing a number after removing a field corrupts any consumer that still holds messages using the old schema. Confluent Schema Registry enforces compatibility rules at schema registration time, which means someone had to configure those rules and understand what they imply.
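
The added-field rule can be sketched as a simplified check in Python. This is an illustration of the rule only, not Avro's actual schema resolution algorithm, which also handles type promotion, unions, and aliases:

```python
def is_backward_compatible(old_fields: list[dict], new_fields: list[dict]) -> bool:
    """Simplified Avro-style check: can a reader using new_fields decode
    records written with old_fields? (Names and defaults only; real Avro
    resolution also checks types, unions, and aliases.)"""
    old_names = {f["name"] for f in old_fields}
    for field in new_fields:
        # A field added in the new schema must carry a default, otherwise
        # old records leave it with no value at read time.
        if field["name"] not in old_names and "default" not in field:
            return False
    # Fields deleted in the new schema are simply absent from the reader,
    # so old records containing them are still decodable.
    return True

old = [{"name": "id", "type": "long"}]
with_default = old + [{"name": "email", "type": ["null", "string"], "default": None}]
without_default = old + [{"name": "email", "type": "string"}]
```

Under this check, `with_default` is a safe evolution of `old`, while `without_default` is rejected, which is exactly the distinction a BACKWARD-configured registry enforces.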

The compatibility setting in Schema Registry is where business policy gets encoded as infrastructure configuration. A setting of BACKWARD means consumers can be updated before producers, which implies that the organization expects consumers to be more stable than producers. FORWARD means producers can be updated before consumers, which implies that producers move faster. FULL means both, which implies either that the team is disciplined about maintaining both or that they have not thought carefully about what they are committing to.

An AI coding assistant that writes a schema change cannot reason about these implications. It can generate a valid Avro schema. It cannot know that this particular producer is consumed by eleven downstream systems across four teams, that one of those teams releases quarterly and cannot be expected to deploy a consumer update quickly, and that the existing compatibility setting was chosen specifically to accommodate that constraint. Changing the schema without understanding these relationships produces code that passes validation and breaks production.

Silent Failure Is a Design Choice

Software engineers are trained to make systems fail loudly: raise an exception, return an error code, alert someone. Data systems have a failure mode that is structurally different and significantly more dangerous: they can produce wrong output without producing any error at all.

The most common version of this is a filter that silently drops rows that should have been included. A transform written when a status column had four possible values will handle all four correctly. When a fifth value is added upstream without notification, the transform will silently exclude all rows with that value. Row counts may or may not deviate enough to trigger an alert, depending on how common the new value is. The dashboard numbers will be slightly wrong. Nobody will know until someone reconciles them manually.

Great Expectations and dbt tests exist precisely to close this gap: you encode your assumptions about data shape as executable checks that run as part of the pipeline. A dbt accepted_values test asserting that status is one of ['pending', 'confirmed', 'shipped', 'delivered'] will fail loudly when that fifth value appears. But writing those tests requires knowing what the valid values are, why they are valid, and what should happen when they are violated. The test suite encodes knowledge. The knowledge has to come from somewhere.
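
A minimal Python sketch of both behaviors, with invented statuses, and a check that mirrors the spirit of dbt's accepted_values test rather than its implementation:

```python
KNOWN_STATUSES = {"pending", "confirmed", "shipped", "delivered"}

def transform(rows: list[dict]) -> list[dict]:
    """The silent version: rows with an unknown status simply vanish.
    No error, no log line, just a slightly wrong result set."""
    return [r for r in rows if r["status"] in KNOWN_STATUSES]

def check_accepted_values(rows: list[dict], column: str, accepted: set) -> None:
    """The loud version: an assertion in the spirit of dbt's
    accepted_values test. Unknown values stop the pipeline."""
    unexpected = {r[column] for r in rows} - accepted
    if unexpected:
        raise ValueError(f"unexpected values in {column}: {sorted(unexpected)}")

rows = [
    {"order_id": 1, "status": "shipped"},
    {"order_id": 2, "status": "returned"},  # fifth value, added upstream later
]
```

With this data, `transform(rows)` quietly returns one row; `check_accepted_values` raises instead, converting a silent exclusion into a visible failure.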

For a new pipeline, the right set of tests is often not knowable in advance. It emerges from production operation over weeks or months as edge cases surface. The first time a pipeline processes data from a new source region, it might encounter timezone handling that was absent in test data. The first time it processes a full month, it might encounter month-end adjustment rows with a different structure than normal rows. These are the incidents that produce the institutional knowledge that eventually makes a mature pipeline trustworthy.

AI cannot generate that test suite from the schema alone. It can generate plausible-looking assertions, and some of them will be correct, and the ones that are wrong will either be too strict (causing false alarms) or too permissive (missing real failures). The distinction requires understanding the actual data distribution, the upstream system behavior, and the business semantics of what the data represents.

The Cost of Wrong Queries

Data engineering mistakes have a financial dimension that pure software engineering rarely does. A poorly written SQL query in BigQuery or Snowflake does not just run slower; it runs more expensively. BigQuery prices on data scanned, not on rows returned. A query that could have pruned to 2GB with a proper partition filter but instead scans 200GB because the filter was applied after the scan costs roughly 100 times more than it should. At scale, this is not a performance metric; it is a budget line item.
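
A hypothetical illustration of the difference, assuming a BigQuery table `events` partitioned on a DATE column `event_date` (table and columns are invented for the example):

```sql
-- Scans every partition: the predicate is on a computed expression,
-- so the planner cannot prune on event_date.
SELECT user_id, amount
FROM events
WHERE CAST(event_date AS STRING) >= '2024-03-01';

-- Prunes to the partitions actually needed: a direct predicate on the
-- partition column lets the planner skip everything else.
SELECT user_id, amount
FROM events
WHERE event_date >= DATE '2024-03-01';
```

Both queries return identical rows. Only the bytes-scanned line on the bill distinguishes them.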

This matters for AI-assisted data engineering because a model generating SQL will produce syntactically correct queries that may be semantically expensive. It does not know your partition scheme. It does not know that event_date is the clustering key in Snowflake and that filtering on timestamp instead bypasses micro-partition pruning entirely. It does not know that joining two large tables without first aggregating one of them will cause a shuffle at intermediate scale that produces terabytes of data in Spark before the downstream filter reduces it.

These optimizations are not academic. They are the difference between a pipeline that costs $50/day and one that costs $5,000/day. They require understanding the physical layout of the data, the execution engine behavior, and the specific data distribution, all of which live in the production environment and not in any file.

Data Contracts and Where They Actually Come From

The term “data contract” has gotten traction as a way of formalizing the interface between data producers and consumers. Tools like dbt contracts, introduced in dbt Core 1.5, and schema registry configurations give teams a place to define and enforce these interfaces. The concept is sound.

But a data contract is not a technical artifact. It is an organizational artifact that happens to have a technical encoding. The questions a data contract answers (which fields are guaranteed to be present, what their types and cardinalities are, what SLAs apply to freshness, which consumers are authorized to depend on the data and under what terms) require organizational authority to resolve. Someone with the standing to make commitments on behalf of the producing team has to agree to them. Someone speaking for the consuming teams has to agree to what they are depending on.

Generating a data contract from schema files is straightforward. Generating the organizational agreement that gives the contract teeth is not a technical problem at all. An AI can scaffold the YAML. It cannot hold the meeting, identify the stakeholders, negotiate the SLA, or make the tradeoff when the producing team wants to change something that would break a consuming team’s pipeline.

Data lineage tooling like OpenLineage and Marquez makes the dependency graph visible, which is a precondition for contracts being enforceable. If you cannot see who depends on a dataset, you cannot notify them when it changes. But visibility does not create the governance structure; it just makes the absence of one more apparent.

SLAs, Backfills, and the Temporal Complexity That Never Shows Up in Code

Data pipelines have a time dimension that application software generally does not. A web service processes requests as they arrive. A data pipeline processes events that may have occurred hours or days before they were ingested, due to late-arriving data from mobile clients, batch uploads from partners, or delayed processing in upstream systems.

Handling late-arriving data in Flink requires configuring an allowed lateness window, after which late events are either dropped or routed to a side output. The right value for that window is not a technical decision; it is a business decision about how much lateness is acceptable versus how much memory pressure the watermark mechanism can sustain. Too small a window and you miss legitimate late events. Too large and you hold state indefinitely for events that are probably never coming.
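
The routing decision can be sketched as a toy function. This is the shape of the policy, not Flink's API; real Flink tracks lateness against window boundaries and watermarks, and the threshold here is the business decision the paragraph describes:

```python
ALLOWED_LATENESS = 300  # seconds; a business decision, not a technical one

def route(event_time: int, watermark: int,
          allowed_lateness: int = ALLOWED_LATENESS) -> str:
    """Toy version of the per-event decision a streaming engine makes
    (simplified: compares against the watermark directly, not against
    window end times as Flink actually does)."""
    if event_time >= watermark:
        return "on_time"            # normal processing
    if event_time >= watermark - allowed_lateness:
        return "late_but_accepted"  # state still held, result re-emitted
    return "side_output"            # too late: dropped or routed for audit
```

Widening `allowed_lateness` moves events from the third branch to the second, at the cost of holding state longer; shrinking it does the reverse. That trade is the whole decision.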

Backfill scenarios compound this. When a pipeline has been processing data incorrectly for three weeks due to a bug, the fix is not just deploying corrected code; it is reprocessing three weeks of history, managing the fact that some of that history has already been used in downstream aggregations loaded into a reporting database, deciding whether to restate those aggregations or annotate them, and handling the period during which new data and old data coexist. Airflow’s backfill command runs the DAG for historical execution dates, but it has no knowledge of downstream dependencies that also need to be invalidated.

This is coordination work, not code work. The right sequence of steps depends on the specific downstream systems, the volume of data involved, whether the downstream team has capacity to handle a re-delivery, and whether the SLA for the data requires a formal restatement notice to consumers. None of this is in the codebase.

What This Means in Practice

Moffatt’s conclusion is that AI coding tools are accelerants for engineers who already know what they are doing and liabilities for engineers who lean on the output to fill gaps in their own understanding. That is exactly right, and the mechanisms described here explain why the liability profile is so specific to data engineering.

Application code that is wrong usually fails visibly: the test fails, the HTTP request returns an error, the service crashes. Data code that is wrong usually fails invisibly: the pipeline succeeds, the row counts look reasonable, and the wrong numbers accumulate in tables that feed dashboards that inform decisions. The consequences of the invisible failure are often more severe, because they compound quietly before anyone notices.

The semantic layer that sits between business intent and technical implementation in data engineering is not a gap that better code generation closes. It is the substance of the job. An engineer who understands what created_at means in each system, which compatibility policy is appropriate for each schema, when silence is worse than failure, and what a backfill operation actually requires to be done safely is doing data engineering. An engineer who generates code that processes created_at as event time when it is actually ingestion time, and deploys it confidently because it compiled and the tests passed, has used a tool to produce a confident mistake.

The tool is good. The judgment has to come from somewhere else.
