Getting Production Query Plans from an Empty Database

A few weeks back, Simon Willison published a note on a technique for reproducing production query plans in a local database that holds no production data at all. The technique is simple once you understand why it works, but the underlying mechanism is worth exploring carefully because it clarifies a class of performance debugging problem that trips people up repeatedly.

The short version: a database query planner does not make decisions based on your data. It makes decisions based on statistics about your data. Those are separate artifacts, stored separately, with completely different privacy and logistics implications. Most development workflows collapse the distinction, which is why EXPLAIN output in development frequently fails to predict what happens in production.

How the Query Planner Actually Works

SQLite’s Next Generation Query Planner, introduced in 3.8.0, is cost-based. When evaluating a query against multiple candidate execution strategies, it estimates the number of rows each strategy would need to examine and picks the cheapest option. The cost model depends almost entirely on per-table and per-index statistics stored in a set of ordinary tables: sqlite_stat1, and optionally sqlite_stat3 and sqlite_stat4.

Without those tables, the planner defaults to hard-coded heuristics. An equality predicate on an indexed column is assumed to match roughly one-tenth of the rows. A range predicate is assumed to match one-third. These are not unreasonable guesses for a tiny database, but they break down quickly at scale. A production table with 50 million rows and a status column where 80% of rows hold the value 'active' looks nothing like those heuristics suggest, and the planner will make entirely different decisions once it can see the actual distribution.

What sqlite_stat1 Contains

The sqlite_stat1 table has been part of SQLite since version 3.6.18. Its schema is:

CREATE TABLE sqlite_stat1(tbl, idx, stat);

Each row describes one index on one table. The stat column is a space-separated string of integers. The first integer is the total row count for the table. Each subsequent integer represents the average number of rows sharing the same value for the leftmost N columns of that index. Consider:

tbl     | idx                    | stat
orders  | idx_orders_user_id     | 10000000 50
orders  | idx_orders_status      | 10000000 1666666
orders  | idx_orders_created_at  | 10000000 1

From this, the planner derives selectivity estimates. Filtering by user_id returns ~50 rows on average (50 / 10,000,000 selectivity). Filtering by status returns ~1.67 million rows on average, so a full table scan may be cheaper than an index scan for common status values. Filtering by created_at matches roughly one row per distinct timestamp, making the index highly selective.

None of this information reveals anything about individual orders, customer identities, or business transactions. It is purely structural: row counts and cardinality estimates. That distinction matters considerably when production data cannot legally or practically be copied to development environments.

The sqlite_stat4 Extension

When SQLite is compiled with SQLITE_ENABLE_STAT4, a fourth statistics table is available that stores histogram samples. Rather than a single average selectivity per index column, sqlite_stat4 records actual sampled index key values at regular intervals, along with their frequencies. This allows the planner to reason about skewed distributions.

A column where 80% of rows have status = 'active' is fundamentally different from one with uniform distribution, and sqlite_stat1’s single average can mislead the planner for queries that filter on low-frequency values. With sqlite_stat4, the planner can correctly identify that status = 'cancelled' is rare and worth an index scan, while status = 'active' may be better served by a sequential scan.

Note that Python’s bundled sqlite3 module does not enable SQLITE_ENABLE_STAT4 by default, so you may not see this table in all environments. The SQLite amalgamation with a custom build, or distributions like the one packaged with macOS or many Linux distros, may include it.

Transferring Statistics Without Transferring Data

Because sqlite_stat1 is an ordinary table, the workflow for transferring production statistics to development is straightforward:

# Export from production (safe, no user data included)
sqlite3 /path/to/prod.db ".dump sqlite_stat1" > prod_stats.sql

# If sqlite_stat4 is populated, export it too
sqlite3 /path/to/prod.db ".dump sqlite_stat4" >> prod_stats.sql

# Import to development database (schema must already match)
sqlite3 /path/to/dev.db "DELETE FROM sqlite_stat1;"
sqlite3 /path/to/dev.db < prod_stats.sql

# Reload the planner's in-memory statistics cache
sqlite3 /path/to/dev.db "ANALYZE sqlite_master;"

The ANALYZE sqlite_master call at the end is critical. SQLite loads statistics into an in-memory structure at connection open time. Inserting rows into sqlite_stat1 does not automatically update that structure; you need to trigger a reload. Calling ANALYZE with the name of a virtual table or sqlite_master causes the reload without rescanning the actual data tables.

After this, your development database has zero rows, but EXPLAIN QUERY PLAN shows the same index choices, join orderings, and scan strategies that production uses.

Constructing Statistics Without Production Access

If you cannot read production statistics directly, you can construct plausible entries by hand. This is useful during schema design, when you want to evaluate whether a proposed index will be used at production scale, or in regulated environments where even statistics exports require review.

-- Simulate a 10M row orders table with approximate cardinalities
DELETE FROM sqlite_stat1;
INSERT INTO sqlite_stat1 VALUES
  ('orders', 'idx_orders_user_id',    '10000000 50'),
  ('orders', 'idx_orders_status',     '10000000 1666666'),
  ('orders', 'idx_orders_created_at', '10000000 1'),
  ('orders', NULL,                    '10000000'),
  ('users',  'idx_users_email',       '200000 1'),
  ('users',  NULL,                    '200000');

ANALYZE sqlite_master;

The numbers do not need to be exact. Getting the order of magnitude right, and the relative selectivity between indexes, is enough to produce meaningful query plan comparisons. A join between orders (10M rows) and users (200K rows) will be optimized very differently than a join between two 100-row development tables.

How PostgreSQL Handles the Same Problem

PostgreSQL’s statistics model is more expressive but harder to manipulate directly. Column-level statistics live in pg_statistic, exposed through the pg_stats view, and include most_common_vals, histogram_bounds, and correlation in addition to row counts. Table-level row counts live in pg_class.reltuples.

Manipulating these directly requires superuser access and knowledge of internal OID structures. The practical workarounds:

Row count injection is the highest-impact change and requires only superuser:

UPDATE pg_class SET reltuples = 50000000, relpages = 500000
WHERE relname = 'orders';

Column-level n_distinct overrides are available without touching system catalogs:

ALTER TABLE orders ALTER COLUMN status SET (n_distinct = 6);
ALTER TABLE orders ALTER COLUMN customer_id SET (n_distinct = -0.9);

A negative value for n_distinct means a fraction of total rows rather than an absolute count.

PostgreSQL 17 introduced pg_dump --section=statistics and the corresponding restore path, making this a first-class operation rather than a workaround. The SQLite sqlite_stat1 approach predates this by years; SQLite’s design of keeping statistics in plain user-accessible tables rather than internal catalog structures made the portability obvious from the start.

Oracle has offered DBMS_STATS.SET_TABLE_STATS and SET_INDEX_STATS as official, fully documented procedures since at least Oracle 9i. You specify numrows, numblks, and histogram data directly. MySQL exposes the equivalent through mysql.innodb_table_stats and mysql.innodb_index_stats, which are ordinary tables you can UPDATE directly.

SQLite lands closer to the Oracle and MySQL model than to PostgreSQL’s: the statistics tables are part of the public interface, not an internal catalog.

When Plans Diverge Silently

The scenarios where this matters most are not always obvious. A few common cases:

A query with multiple applicable indexes where the planner chooses the wrong one because the row count estimate is off by four orders of magnitude. The wrong index is still used, the query runs, and it appears correct in development. In production, it takes seconds instead of milliseconds.

A join order reversal: with accurate statistics, the planner will typically put the more selective table on the outer loop of a nested join. With no statistics, it may reverse this, producing a plan that makes sense for small tables but is catastrophic at scale.

A LIMIT query where the planner’s decision to use an index versus sort-and-scan depends on the estimated row count before the LIMIT. With too-small row count estimates, the planner may opt for a sequential scan and sort that would be entirely wrong for a production-sized table.

In all three cases, EXPLAIN QUERY PLAN in development gives you a plan that predicts production behavior, provided the statistics match. Without statistics transfer, the plan is essentially random with respect to production.

The technique Willison describes has been possible for years, but the framing is what makes it accessible: the problem is not that you lack production data, it is that you lack production statistics, and those are entirely separate things. Keeping a sqlite_stat1 export in your repository, updated after significant data growth milestones, costs almost nothing and eliminates a whole category of “works in dev, slow in production” surprises.