· 5 min read ·

The Statistics Layer Behind SQLite Query Plans

Source: lobsters

SQLite’s query planner is cost-based. Before executing a query, it enumerates candidate execution strategies, assigns cost estimates to each, and picks the minimum-cost plan. The cost estimates depend on statistics: row counts and index selectivity figures stored in sqlite_stat1 and populated by running ANALYZE. When those statistics are absent or stale, the planner is estimating by heuristic, and the plans it produces can be substantially worse than they should be.

syntaqlite describes query plan visualization as a core feature of its high-fidelity devtools approach. Most SQLite tools expose query plans as formatted renderings of raw EXPLAIN QUERY PLAN text. The deeper value comes from what query plan visualization can expose when it has access to both the plan itself and the statistics that shaped it. Few tools in the existing SQLite ecosystem do anything with that connection.

How sqlite_stat1 Works

SQLite’s query optimizer was rebuilt for version 3.8.0 in 2013 with a cost-based approach called the Next Generation Query Planner. The primary data source for its estimates is sqlite_stat1, whose schema is:

CREATE TABLE sqlite_stat1(tbl TEXT, idx TEXT, stat TEXT);

The stat column is a space-separated list of integers. The first value is the row count for the table. Subsequent values are the average number of rows per distinct value in each index key column, from left to right. For a table with 10,000 rows and a two-column index where the first column has 100 distinct values and the second 1,000, the entry reads 10000 100 10.

SQLite 3.8.1 added sqlite_stat4, which extends this with histogram samples of actual index key values. This matters for range queries: a filter like WHERE created_at > '2025-01-01' has very different costs depending on the value distribution. Without histogram data, the planner has to guess the fraction of rows that satisfy the predicate.

Both tables are populated by running ANALYZE:

-- Analyze all tables in the database
ANALYZE;

-- Analyze a specific table
ANALYZE users;

For large indexes, PRAGMA analysis_limit controls the number of samples collected for sqlite_stat4. The default is zero, meaning no limit, which is appropriate for development. Production databases with very large indexes may want to set a cap to keep analysis time bounded.

What Missing Statistics Look Like

When sqlite_stat1 has no entry for a table’s index, SQLite’s planner uses hard-coded fallback heuristics. The default assumption is a table with roughly one million rows and moderate index selectivity. For small tables this overestimates cost; for large tables with selective indexes, it can seriously undervalue the index.

The visible symptom is a query plan that scans a table instead of using an available index. Consider:

EXPLAIN QUERY PLAN
SELECT id, email FROM users WHERE email = 'user@example.com';

Without statistics, the planner may produce:

QUERY PLAN
`--SCAN users

With accurate statistics showing high selectivity on the email column, the same query produces:

QUERY PLAN
`--SEARCH users USING INDEX users_email_idx (email=?)

The query returns correct results in both cases, and the difference in execution time is proportional to the table size. PostgreSQL’s EXPLAIN ANALYZE makes stale statistics visible directly, showing both the planner’s estimated row count and the actual row count at each plan node. A large divergence between the two points immediately to stale or absent statistics. SQLite has no equivalent command: EXPLAIN QUERY PLAN shows the plan chosen, not the estimates that shaped it.

Most developers encounter this silently. An ORM runs a query, the query is slow, nothing in the output explains why. The index exists, the column is filtered, and the planner still scanned the table. Without a tool that cross-references the plan against sqlite_stat1, the missing ANALYZE call is invisible.

What a High-Fidelity Tool Can Do

A devtools layer with access to both EXPLAIN QUERY PLAN output and the statistics tables can make the connection explicit. When a plan node shows a table scan on a column with an available index, the tool can check sqlite_stat1 for that index:

SELECT stat FROM sqlite_stat1
WHERE tbl = 'users' AND idx = 'users_email_idx';
-- Empty result: no statistics. The planner had no selectivity estimate.

Combined with schema information from sqlite_schema, a tool can distinguish three cases: the index does not exist, the index exists but has no statistics, or the index exists and statistics suggest low selectivity. Only the third case explains a scan as a considered optimizer decision; the second points to a missing ANALYZE run as the cause.

PostgreSQL’s ecosystem has handled this kind of integration for years. pgBadger cross-references slow query logs with pg_stats to diagnose optimizer issues. PEV2 renders EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) output with per-node divergence between estimated and actual rows highlighted in the visualization. SQLite has had structured EXPLAIN QUERY PLAN output since 3.31.0 and the statistics tables since 3.8.0. The gap has been the absence of a tool that assembles both into a coherent view.

Syntaqlite’s decision to build on the actual SQLite C amalgamation, rather than reimplementing the dialect in a standalone parser, is what makes this level of integration feasible. A tool embedded in the SQLite library can inspect sqlite_stat1 through the same interface the query planner uses. A pure SQL parser sitting outside the library cannot do this with the same fidelity.

The Bytecode Layer

EXPLAIN QUERY PLAN gives the logical plan. SQLite also exposes EXPLAIN without the QUERY PLAN qualifier, which emits the bytecode the VDBE (Virtual DataBase Engine) will execute. Each row represents one instruction, with opcode, register operands, and a comment:

addr  opcode         p1    p2    p3    p4             comment
0     Init           0     9     0                    Start at 9
1     OpenRead       0     2     0     3              root=2 iDb=0; users
2     Integer        1     1     0                    r[1]=1
3     SeekRowid      0     8     0     1              intkey=r[1]
4     Column         0     1     2                    r[2]=users.name
5     ResultRow      2     1     0                    output=r[2]

This level is rarely needed day-to-day, but when the plan tree does not explain observed behavior, the bytecode often does. The difference between a SeekRowid lookup and a Next-based loop, the register layout for an aggregation, the structure of a correlated subquery: these appear in the bytecode and not in the high-level plan. A devtools layer that exposes both levels, letting developers start at the plan tree and drill into bytecode when needed, covers cases where the plan alone is insufficient.

The Runtime Complement

Static analysis works before execution. SQLite’s C API also provides sqlite3_trace_v2, introduced in 3.14.0, which fires callbacks on statement execution, profiling in nanoseconds elapsed per statement, individual row events, and database close. The profiling callback gives actual wall-clock cost per query without modifying application code.

Combined with EXPLAIN output, this gives you both the plan and the measured execution time for comparison. PostgreSQL provides this combination through a single EXPLAIN ANALYZE command. SQLite’s version requires assembling two separate mechanisms, but both building blocks are stable and have been available for years. A devtools tool that uses the real SQLite library is positioned to combine them in a way that an external SQL parser cannot.

The statistics problem is one instance of a broader pattern: performance issues that are invisible to tools treating SQLite as an opaque execution target, and visible to tools that understand the engine’s internals well enough to read back what the optimizer was thinking when it chose a plan. SQLite has had the infrastructure for this kind of devtools depth since the 3.8.x era. The tooling ecosystem is only now starting to use it.

Was this interesting?