
Between SQL and pandas: Why DuckDB Has Become the Data Layer for Coding Agents

Source: simonwillison

The question of whether to use SQL or Python for data analysis has been debated for a long time. Coding agents collapse that distinction in a useful way, and a tool called DuckDB is largely responsible.

Simon Willison’s recent piece on coding agents for data analysis covers the practical shape of these systems: the code-execute-observe loop, the sandboxing options, the failure modes. I want to zoom in on one piece that gets less attention in these discussions: the choice of query engine, and why it matters more for agent-based workflows than it does for human analysts.

Why the Query Engine Choice Is Different for Agents

When a human analyst switches between SQL and pandas, the cost is mostly cognitive. They understand the data, they pick the right tool for the task, and they move on. An agent does not have this luxury. The agent’s understanding of the data is mediated entirely by what it can discover through code execution: df.head(), df.dtypes, DESCRIBE table, SHOW TABLES.

The richer and more structured the introspection surface, the more accurately the agent can model what it is working with. SQL databases have a standardized schema representation. The model can issue:

SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_name = 'orders'

and get back a clean, structured description of the data. A CSV loaded into a pandas DataFrame offers df.dtypes and df.head(), which is usable but less rich, and does not naturally expose relationships between tables.

This is not an argument against pandas, but it is a reason to think carefully about what you hand to an agent at the start of a session.

DuckDB Specifically

DuckDB is an in-process SQL database that runs embedded in Python without a server. You import it like any other library:

import duckdb

conn = duckdb.connect()
result = conn.execute(
    "SELECT * FROM read_csv('sales.csv') LIMIT 5"
).fetchdf()

The read_csv call there does not load the file into memory. DuckDB streams the file and stops scanning as soon as the LIMIT is satisfied; WHERE clauses in a query are likewise applied during the scan. For an agent working with a large CSV, this means the first exploration query returns quickly without waiting for a full file load.

Several properties make DuckDB well-suited for coding agent workflows.

Direct file querying without loading. The agent can write SELECT COUNT(*) FROM 'data/*.parquet' and DuckDB scans all matching Parquet files. This fits the iterative explore-then-narrow pattern that effective data analysis follows.

Structured schema introspection. DESCRIBE SELECT * FROM 'file.csv' returns column names and inferred types in a clean tabular form. An agent can issue this as its first step and get a reliable schema to reason from, rather than parsing free-form df.info() output.

SQL coverage of awkward operations. DuckDB’s dialect includes PIVOT, UNPIVOT, list aggregates, and UNNEST for working with nested data. These operations are clumsy in pandas and natural in SQL. Agents tend to generate syntactically correct code more reliably for operations that map cleanly to SQL semantics.

Compatibility with the columnar ecosystem. Apache Arrow and Parquet have become the de facto formats for storing and sharing real datasets. DuckDB queries both natively; an agent working with partitioned Parquet files does not need a separate loading step.

The SQL vs. pandas Tradeoff for Agents

pandas has genuine advantages in specific workflows. Row-wise transformations, custom aggregation functions, tight integration with scikit-learn, complex string cleaning pipelines, and feature engineering that depends on per-row logic: these are pandas territory, and generating correct pandas code for them is something capable models do reasonably well.

But a large fraction of the questions that get asked of data analysis agents are aggregation and filtering questions: “what is the distribution of X?”, “which groups have the highest Y?”, “how does Z change over time?” For these, SQL is at least as expressive as pandas and often produces more readable, auditable code. An agent that generates:

SELECT
    region,
    product_category,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY region, product_category
ORDER BY total_revenue DESC

is producing something the user can read and verify more easily than an equivalent pandas method chain, and the SQL is more likely to be correct on the first attempt because the semantics map directly to what was asked.
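For comparison, here is a rough pandas equivalent of that query, with a small illustrative `orders` DataFrame standing in for real data. The logic is the same, but the intent is spread across indexing, `groupby`, named aggregation, and a sort:

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["EU", "EU", "US"],
    "product_category": ["books", "books", "toys"],
    "revenue": [10.0, 20.0, 5.0],
    "customer_id": [1, 2, 2],
    "order_date": pd.to_datetime(["2025-02-01", "2025-03-01", "2025-01-15"]),
})

result = (
    orders[orders["order_date"] >= "2025-01-01"]       # WHERE
    .groupby(["region", "product_category"], as_index=False)  # GROUP BY
    .agg(
        total_revenue=("revenue", "sum"),
        unique_customers=("customer_id", "nunique"),
    )
    .sort_values("total_revenue", ascending=False)     # ORDER BY ... DESC
)
print(result)
```

Both are correct; the SQL simply reads closer to the question that was asked.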

Datasette and Simon Willison’s llm CLI tool take this approach seriously. Datasette treats SQLite databases as a publication format and a query surface; the llm tool has integrations for generating SQL against local SQLite files. The philosophy is that a database is a better interface for an agent than raw files: the schema is explicit, the query language is well-understood by language models, and results are deterministic.

Schema Quality Is the Bottleneck

Whatever query engine you use, the agent’s effectiveness is limited by how well it understands the schema. A table called ev with columns c1 through c47 will produce far worse agent output than a table called customer_events with columns named event_type, user_id, session_start, and revenue_usd. The work of making data agent-friendly is often table naming, column naming, and documentation, not the agent configuration itself.

DuckDB makes it easy to annotate schemas using column comments:

COMMENT ON COLUMN orders.status IS
    'One of: pending, shipped, delivered, returned. Nulls indicate pre-2022 records.';

When the agent issues:

SELECT column_name, comment
FROM duckdb_columns()
WHERE table_name = 'orders'

it gets these annotations alongside type information. This is roughly the equivalent of docstrings for data, and it has a similar impact on output quality: the model has a richer description of the semantics of each field without needing to infer from examples alone.

Practical Setup

For anyone building a coding agent for data analysis, a minimal effective stack looks like this:

  1. DuckDB for the query layer, with data in Parquet or loaded from CSV using CREATE TABLE AS SELECT * FROM read_csv(...)
  2. Column comments and, optionally, a _docs table containing natural-language descriptions of what each table contains and any domain-specific caveats
  3. An initial agent step that issues SHOW TABLES and DESCRIBE on each table to ground the session before the first analysis query
  4. E2B for sandbox isolation if the data is sensitive, or a local DuckDB connection for development workflows

The agent also benefits from an explicit note in the system prompt about the available files and query surface:

You have access to a DuckDB connection. The following files are available:
- orders.parquet (sales orders, 2020 to present)
- customers.csv (customer master data)
- returns.parquet (return events)

Query files directly with SQL:
  SELECT * FROM 'orders.parquet' LIMIT 5

Start by running SHOW TABLES and DESCRIBE on tables of interest.

This anchors the agent’s first move and avoids the common early failure mode of an agent that tries to load large files into pandas when a SQL query would be faster and produce a more structured result.

Where This Leaves the SQL vs. Python Question

The SQL vs. pandas debate was always somewhat artificial, because skilled analysts use both and the right choice depends on the task. In the agent context, DuckDB resolves the tension from a different direction: it gives you SQL semantics, in-process execution, Python interoperability, and direct file querying in a single library. An agent can use SQL for aggregation and filtering, switch to Python for visualization or custom transformations using conn.execute(...).fetchdf() to bridge the two, and never change the underlying data layer.

For agent workflows in particular, the structured introspection surface that SQL databases provide is worth more than the flexibility advantage pandas holds in complex transformations. Most analysis questions, especially the exploratory ones that people reach for a coding agent to answer, are answerable with SQL. The ones that require pandas can use it as a downstream step from a DuckDB query.

The query engine is an infrastructure decision that most tutorials on coding agents gloss over in favor of the more visible pieces: the model, the tool loop, the sandbox. It turns out to have a larger effect on reliability than any of those, because it determines how well the agent can understand what it is working with before it writes a single line of analysis.
