SQL-First Data Analysis Agents: Why DuckDB Changed the Equation
Source: simonwillison
The choice between generating Python (pandas or polars) and SQL when building a coding agent for data analysis is often framed as a matter of existing infrastructure or team preference. The decision has measurable consequences for output reliability, and the reasons trace back to structural differences between declarative and imperative code.
Simon Willison’s recent exploration of coding agents for data analysis covers the core architecture well: describe the data, generate code, execute in a sandbox, observe output, repeat. The choice of what kind of code to generate sits beneath that architecture, and it is underspecified in most implementations.
Declarative vs. Imperative Failure Modes
SQL specifies the shape of the result you want. The query engine decides how to produce it, validates column references at parse time, and enforces grouping rules before execution begins. If a column name does not exist, the query fails immediately with a clear error identifying the missing reference. If a GROUP BY clause omits a non-aggregated column, the engine refuses to proceed.
Pandas and polars are imperative: you specify a sequence of operations on a mutable object, and validity is checked at runtime against the current state of that object. The failure surface is wider, and the failure modes more varied.
Consider the same task in each language: compute total revenue per product category, excluding refunded orders.
In SQL:
SELECT
    category,
    SUM(amount) AS total_revenue
FROM orders
WHERE status != 'refunded'
GROUP BY category
ORDER BY total_revenue DESC
The query either runs or it does not. Column references are validated against the schema before execution. The logic is complete in one expression.
In pandas, a correct approach:
result = (
    df[df['status'] != 'refunded']
    .groupby('category')['amount']
    .sum()
    .sort_values(ascending=False)
    .reset_index(name='total_revenue')
)
A slightly off-target LLM generation might produce:
result = df[df['status'] != 'refunded'].groupby('category').sum().sort_values('amount', ascending=False)
This produces a DataFrame with all numeric columns summed, not just amount. The result looks plausible, runs without error, and is wrong. The agent receives no signal that the computation diverged from intent. This failure mode, a correct execution that produces a wrong answer, is harder to recover from than a traceback. There is nothing to feed back to the model.
The pandas API has other traps for LLM generation. Confusion between .loc (label-based indexing) and .iloc (integer-based indexing) returns the wrong rows rather than raising an error. Method chaining in which an intermediate step returns a Series where the next method expects a DataFrame raises an error whose message points at NumPy internals rather than the offending operation. Index misalignment when combining DataFrames with arithmetic operators silently produces NaN-filled or empty results that look reasonable unless you know what the correct row count should be.
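The index-alignment trap is easy to reproduce. A minimal sketch with hypothetical data: two Series holding the same quantities under different index labels combine into all-NaN output, and no exception is raised.

```python
import pandas as pd

# Two hypothetical revenue series covering the same three orders,
# but indexed differently (e.g. one came from a reset, one did not).
q1 = pd.Series([100.0, 200.0, 300.0], index=[0, 1, 2], name="amount")
q2 = pd.Series([10.0, 20.0, 30.0], index=[3, 4, 5], name="amount")

# Arithmetic aligns on index labels, not position: no labels overlap,
# so every result is NaN -- and the operation "succeeds" silently.
total = q1 + q2
```

An agent that only checks whether the step executed sees a clean run; only a check on the values themselves would reveal the problem.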
The Training Data Gap
SQL has been a high-signal presence in training data since the 1970s. StackOverflow, database documentation, academic papers, and tutorials have accumulated decades of well-formed examples where a natural language description maps to a specific SQL pattern. The coverage is dense and relatively consistent; a broken SQL example in a tutorial is an obvious error that tends to get corrected.
The pandas API has evolved more disruptively. DataFrame.append() was deprecated in 1.4 and removed in 2.0. The introduction of pd.NA alongside np.nan created irregular null handling that even experienced users misapply. Training data drawn from a wide range of pandas versions mixes correct modern patterns with deprecated ones. An LLM generating a grouped rolling average may call .rolling().mean() on a GroupBy object and then hit an index alignment error when assigning the result back; the fix, a reset_index() call, is not obvious from the error message alone.
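The rolling-average case looks like this in practice. The data below is hypothetical; the reset_index(level=0, drop=True) call is the non-obvious step.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b", "b"],
    "amount":   [10.0, 20.0, 30.0, 5.0, 15.0, 25.0],
})

# groupby().rolling() returns a Series with a (category, row) MultiIndex.
rolled = df.groupby("category")["amount"].rolling(2).mean()

# Assigning it straight back to df raises an alignment error; the group
# level must be dropped first so the index matches the original rows.
df["rolling_avg"] = rolled.reset_index(level=0, drop=True)
```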
The BIRD benchmark for text-to-SQL systems shows leading models reaching the 70-80% range on complex queries against well-documented schemas. The equivalent systematic evaluation for pandas code generation does not exist at the same scale, but the directional evidence from practitioners is consistent: SQL generation degrades less on complex multi-step operations.
SQL error recovery also works better within an agent loop. When a SQL query fails, the error typically names the column, the offending value, or the specific syntax violation. When pandas raises ValueError: cannot reindex from a duplicate axis midway through a chained operation, the error identifies the symptom at the NumPy layer rather than the pandas operation that caused it. Feeding a structured SQL error back to an LLM produces more reliable correction behavior than feeding a pandas stack trace through C extension internals.
DuckDB Removes the Infrastructure Barrier
The historical limitation of SQL for ad-hoc data analysis was infrastructure: querying a CSV file required a database server, a schema creation step, a data import step, and a connection management layer. That overhead made Python the pragmatic default for agent-driven analysis on arbitrary files.
DuckDB removes that overhead. It is an in-process analytical database, embedded in Python like SQLite, that can query CSV, Parquet, Arrow, and JSON files directly without any import step. The schema is inferred from the file. The query runs in the same process as the agent. The result comes back as a pandas DataFrame or an Arrow table.
import duckdb

conn = duckdb.connect()
result = conn.execute("""
    SELECT
        category,
        COUNT(*) AS order_count,
        ROUND(AVG(amount), 2) AS avg_order_value,
        SUM(CASE WHEN status = 'refunded' THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS refund_rate
    FROM read_csv_auto('orders.csv')
    WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
    GROUP BY category
    ORDER BY avg_order_value DESC
""").df()
The read_csv_auto function handles type inference including dates, mixed-type columns, and quoted fields with embedded delimiters. No pd.to_datetime() calls, no dtype specification, no encoding argument required in the common case. The SQL runs against the file as if it were a database table.
For an agent, this means the first analytical step can be SQL rather than a pandas load followed by schema exploration followed by typed conversion code. DuckDB’s error messages are structured at the SQL layer: a type mismatch names the column and value; a missing column names the specific reference. The feedback loop for correction is cleaner than what pandas provides.
DuckDB also supports the full range of analytical SQL. Window functions handle running averages, cumulative sums, percentile rankings, and lag/lead calculations. UNNEST handles nested array data. LIST aggregations produce array outputs directly from SQL. The ASOF join handles time-series alignment without Python loops. Operations that traditionally required complex pandas apply() calls or explicit Python iteration have direct SQL equivalents.
When Python Is Necessary
SQL’s advantages are bounded to set-based operations. Several categories of data analysis genuinely require Python.
Machine learning is the clearest case. Feature engineering with scikit-learn transformers, model training, cross-validation, and prediction all require Python. An agent doing predictive analysis should use SQL for data extraction and feature selection, then hand off to Python for the model work. The data flow is natural: a DuckDB query produces the training set as a DataFrame, which passes to sklearn.
Text processing beyond SQL string functions requires Python. Tokenization, embedding generation, named entity recognition, and regex extraction with group references are all Python territory. An agent analyzing text columns for semantic content will need Python for the analysis steps, though SQL handles the filtering and aggregation around those steps efficiently.
Visualization requires Python. Matplotlib, seaborn, and plotly operate on Python data structures. Any analysis requiring charts will have a Python step that receives the results of SQL queries and renders them. This is a clean separation: SQL for data wrangling, Python for presentation.
Some iterative algorithms with per-row control flow are cleaner in Python. If a computation requires updating state across rows in a way that does not reduce to a window function, Python loops or Numba JIT compilation may be the practical path. These cases are less common than they appear in typical analytical work, because SQL window and aggregation functions cover more ground than practitioners often expect.
Routing Between the Two
The cleanest agent architecture for data analysis gives the model two explicitly distinct tools: one for SQL execution and one for Python execution, with clear descriptions of when to use each.
tools = [
    {
        "name": "sql_query",
        "description": (
            "Execute a DuckDB SQL query against available data files. "
            "Use for filtering, aggregation, joins, window functions, "
            "and any set-based data transformation. Returns results as JSON. "
            f"Available tables: {schema_description}"
        ),
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"]
        }
    },
    {
        "name": "python_exec",
        "description": (
            "Execute Python in a sandboxed environment. "
            "Use for visualization (matplotlib, seaborn), machine learning (sklearn), "
            "text processing, and operations requiring row-level control flow. "
            "pandas and numpy are available."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"]
        }
    }
]
When the model has two tools with explicit mandates, it routes more reliably than when it has a single Python tool and must decide whether to write SQL through a database connection or pure pandas code. The explicit separation also produces more readable audit trails: SQL steps are clearly data access and preparation; Python steps are clearly analysis or presentation.
This aligns with the transparency principle that Simon Willison has built into his tooling from Datasette to the llm CLI: the generated code communicates its intent to the human reviewer, and the reviewer can verify it. SQL for data access and Python for analysis is a structure that data practitioners recognize immediately, which makes review easier than a monolithic pandas pipeline.
Practical Considerations
A few details change the calculus for specific setups.
If your data lives in a relational database rather than files, SQL is already the natural interface, and DuckDB adds little. The agent writes queries against your existing schema, and the schema quality argument (annotated descriptions, documented foreign key relationships) applies regardless of which SQL engine is running.
If your team primarily knows Python, the generated SQL in a hybrid system may be harder for them to review and modify. The auditing advantage of SQL assumes reviewers can read SQL fluently.
Polars has also made progress on this problem from the Python side. Its SQLContext interface allows SQL queries against Polars LazyFrames, combining Polars’ efficient lazy execution with SQL’s cleaner expression syntax. For Python-first workflows that want SQL syntax without adding DuckDB, this is worth evaluating.
DuckDB’s Python library adds roughly 20MB to a dependency footprint and ships as a self-contained wheel with no external dependencies, which makes it straightforward to include in sandboxed environments. E2B sandboxes include it by default; a self-hosted Docker sandbox needs only pip install duckdb in the image build.
The recommendation is straightforward: if you are building a coding agent for data analysis on file-based or SQL-accessible data, make DuckDB available and instruct the agent to default to SQL for data wrangling. Add Python execution for visualization, ML, and the cases where SQL runs out. Provide an annotated schema that includes table descriptions, key column notes, and business logic the model needs to generate semantically correct queries. The model will generate better code, recover from errors more cleanly, and produce an audit trail that is easier for practitioners to review.