· 6 min read ·

The Statistics Gap That Makes Your Dev Database Lie About Query Plans

Source: simonwillison

Simon Willison wrote about a practical technique back in early March 2026 for getting accurate production-equivalent query plans in a development environment without copying any production data. It’s worth looking at in more depth, because the underlying mechanism explains a class of performance debugging failures that most developers hit repeatedly without fully understanding why.

The short version: your query planner doesn’t make decisions based on data. It makes decisions based on statistics about data. Those are different artifacts with different privacy implications and different logistics, and most development workflows treat them as identical when they shouldn’t.

Why Dev Plans Diverge From Production Plans

PostgreSQL’s optimizer is cost-based. It doesn’t apply fixed rules about when to use an index scan versus a sequential scan; it estimates the cost of each candidate plan and picks the cheapest one. Those cost estimates depend almost entirely on statistics stored in pg_statistic and surfaced through the pg_stats view:

SELECT attname, n_distinct, null_frac, most_common_vals, histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'orders';

The planner uses this data to answer questions like: how many rows will survive a WHERE status = 'pending' filter? Is the data physically sorted in a way that makes an index scan cheap for range queries? Without accurate answers, every cost estimate is wrong.

Table-level statistics live in pg_class:

SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'orders';

On a fresh development database, reltuples is -1 or 0. The planner falls back to defaults: assume a tiny table, assume uniform value distribution. A table with 50 million production rows looks like it has one. The consequences compound: the planner may choose a nested loop join where production uses a hash join, pick a sequential scan where production relies on an index, or reverse the join order entirely. You can spend an hour tuning a query in dev and make it slower in production.

This is the statistics gap, and it’s independent of whether your dev database has representative data.

What Statistics Actually Contain

The pg_stats view exposes several key columns per table column:

  • n_distinct: number of distinct values; a negative value means a fraction of total rows, so -0.02 means roughly 2% of rows are unique
  • most_common_vals / most_common_freqs: the most frequent values and their occurrence rates, stored as typed arrays
  • histogram_bounds: bucket boundaries representing the distribution of values outside the most-common-values list
  • correlation: how closely the physical row order matches the column’s sort order; values near 1.0 mean index scans are cheap because pages are physically sorted
  • null_frac: fraction of NULL entries

PostgreSQL populates all of this via ANALYZE, which samples the actual table data. The default sample size is controlled by default_statistics_target (default 100, yielding roughly 30,000 rows sampled). For columns with skewed distributions, increasing this target gives the planner more accurate bucket boundaries:

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;

For columns with cross-column dependencies, PostgreSQL 10+ supports extended statistics:

CREATE STATISTICS orders_status_region ON status, region FROM orders;
ANALYZE orders;

Without extended statistics, the planner assumes all columns are independent. If status and region are correlated in your data, the planner will compound its estimation errors when both appear in a WHERE clause.

Moving Statistics Without Moving Data

The core technique is to export statistics from production and import them into dev. No actual row data crosses the boundary.

The single highest-impact change is correcting row counts in pg_class. Generate the update script from production:

SELECT format(
  'UPDATE pg_class SET reltuples = %s, relpages = %s WHERE relname = %L AND relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = %L);',
  reltuples, relpages, relname, nspname
)
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND c.relkind = 'r';

Run the output on dev. If your orders table now looks like it has 50 million rows instead of zero, the planner immediately starts making fundamentally better join and scan decisions.

For column-level distribution data, PostgreSQL exposes a stable interface that doesn’t require touching pg_statistic directly:

ALTER TABLE orders ALTER COLUMN customer_id SET (n_distinct = -0.85);
ALTER TABLE orders ALTER COLUMN status SET (n_distinct = 6);

This sets n_distinct without running ANALYZE, and the change takes effect at the next plan. Combining row count corrections with n_distinct overrides handles the majority of plan divergences in practice.

For full statistics transfer including MCVs and histogram bounds, direct manipulation of pg_statistic is possible with superuser access but brittle, because it references internal type OIDs that may differ between databases. The safer approach is a script that serializes pg_stats to SQL, resolves type OIDs on the target side, and rebuilds the rows. Several community scripts exist for this pattern, and the PostgreSQL hackers list has discussed formalizing it as part of pg_dump.

The SQLite Version Is More Elegant

SQLite handles this more gracefully because its statistics live in ordinary user-accessible tables rather than internal catalog structures. The sqlite_stat1 table holds row counts and per-index statistics:

SELECT * FROM sqlite_stat1;
-- tbl    | idx                      | stat
-- orders | orders_customer_id_idx   | 50000000 5

The stat column is space-separated: first number is the table row count, subsequent numbers are the average rows per distinct value at each key column of the index. Lower values mean higher selectivity. To transplant these statistics from production to dev:

# Export from production
sqlite3 /path/to/prod.db ".dump sqlite_stat1" > prod_stats.sql
sqlite3 /path/to/prod.db ".dump sqlite_stat4" >> prod_stats.sql

# Import to dev
sqlite3 /path/to/dev.db < prod_stats.sql
sqlite3 /path/to/dev.db "ANALYZE sqlite_master;"

The ANALYZE sqlite_master call at the end forces SQLite to reload the statistics into its internal schema. Without it, plans don’t reflect the new numbers until the next connection.

Because sqlite_stat1 is a regular table, you can include it in a schema dump, check it into version control alongside migrations, or generate it from known production characteristics without ever connecting to the production database. sqlite_stat4 extends this with histogram samples for individual index entries, giving the planner better selectivity estimates for range queries.

PostgreSQL’s statistics model gives the planner more information, but SQLite’s simpler design makes statistics portability significantly easier. That’s a tradeoff that lands well for developer experience.

Tools That Extend This Pattern

hypopg lets you create hypothetical indexes that exist only for EXPLAIN purposes, without building them. Combined with statistics injection, you can answer: if I add an index on (status, created_at) given production-accurate distributions, what plan would the optimizer choose?

SELECT hypopg_create_index('CREATE INDEX ON orders(status, created_at)');
EXPLAIN SELECT * FROM orders WHERE status = 'pending' AND created_at > now() - interval '7 days';
SELECT hypopg_drop_index(indexrelid) FROM hypopg_list_indexes();

The complementary piece on the production side is auto_explain. With auto_explain.log_analyze = true and a reasonable log_min_duration, it logs actual execution plans for slow queries. This gives you the ground-truth production plan to compare against what dev produces after statistics injection.

Why This Framing Matters

The framing Willison offers is the useful part: the problem is not that you lack production data, it’s that you lack production statistics. Those are separate artifacts with different privacy and logistics implications.

A histogram of created_at values tells the planner how orders are distributed over time. It tells you nothing about any specific order. Exporting statistics is safe in many environments where exporting user data is not, which matters considerably in regulated industries where copying production databases to development requires security review or is outright prohibited.

A schema dump plus a statistics export script is a much lighter artifact than a sanitized data dump. It answers the specific question you usually care about when debugging a slow query: will this plan hold up at production scale, given production data distributions. Everything else, the actual rows, the sensitive fields, the volume you can’t legally replicate, is irrelevant to that question.

Was this interesting?