· 7 min read ·

The -shm File Is the Real Problem with SQLite WAL Mode in Docker

Source: simonwillison

Simon Willison wrote last week about running SQLite in WAL mode across Docker containers that share a volume. The post is worth reading, but I want to go deeper on the mechanism of failure, because the standard advice, “WAL doesn’t work on network filesystems,” undersells the problem. The failure mode exists even on local volumes with a local filesystem, and understanding why requires understanding what the -shm file actually does.

What WAL Mode Is Doing

SQLite’s default durability mechanism is the rollback journal. Before modifying any page, the original page bytes are written to a -journal file. On crash, SQLite replays that file to restore the original state. Simple and correct, but it requires an exclusive lock on the database for the entire duration of a write. Readers block.

WAL mode flips the model. Instead of saving the old content aside, new content is appended to a -wal file. The main database file is not touched until a checkpoint occurs. Readers continue reading from the original database file, supplementing with relevant pages from the WAL. Writers append to the WAL without blocking readers. This is why WAL mode has better read concurrency: readers and writers do not contend for the same pages.

Checkpointing, the process of folding WAL frames back into the main database file, happens automatically when the WAL grows past a threshold (1,000 pages by default), or on demand via PRAGMA wal_checkpoint.

All of this works elegantly in the scenario SQLite was designed for: multiple threads or processes on the same machine.

The Shared Memory Index

Here is the part that matters for Docker. The WAL file can grow large, and reading it sequentially to find the current version of each page would be expensive. SQLite solves this with a third file: the -shm file.

The -shm file is a shared memory index. It contains a header (stored twice for crash safety) with the database size, a change counter, the last valid frame number, and salt values that match the WAL. Beyond the header, it contains hash tables that map page numbers to WAL frame numbers. When a reader needs page 42, it does not scan the WAL; it looks up page 42 in the hash tables and jumps directly to the most recent frame containing it.

Beyond indexing, the -shm file is the coordination point for locking. SQLite uses POSIX advisory byte-range locks on specific offsets within the -shm file itself:

  • Byte 120: write lock (exclusive for the current writer)
  • Byte 128: pending lock (signals that a writer is waiting)
  • Byte 136: shared lock (acquired by readers checking the index header)
  • Byte 144: checkpoint lock
  • Bytes 160 through 223: eight read-lock slots, one per concurrent reader, used to pin WAL frames in place so the checkpointer cannot overwrite frames a reader depends on

The entire scheme assumes that all processes mapping the -shm file see the same bytes at the same time. On a single machine, the OS virtual memory system enforces this. When two processes mmap() the same file, they get views into the same physical memory pages. A write by one process is immediately visible to the other. This is what makes the lock coordination work.

Why Docker Breaks This

Docker containers are Linux namespaces with separate process trees. When container A calls mmap() on the -shm file, it gets a mapping in container A’s page cache. When container B calls mmap() on the same file, it gets a mapping in container B’s page cache.

On the same host with a local filesystem, the kernel will usually back both mappings with the same physical pages, so writes from one container are visible in the other. But this is not guaranteed, and it is not the designed behavior for cross-namespace mappings. The coherency guarantees that SQLite relies on apply within a single process group, not across Linux namespace boundaries.

The result is subtle corruption. Container A updates the WAL index showing that frame 78 contains the latest version of page 14. Container B’s mapping of -shm reflects a stale index. Container B reads page 14 from the main database file, getting a version that a committed transaction has already superseded. The database appears consistent, but returns wrong data.

The locking also degrades. POSIX fcntl advisory locks are keyed on inode and PID within a single kernel. Across containers on the same host, the locks technically work: both containers share the same kernel. But POSIX lock semantics have a well-known flaw: closing any file descriptor to a file in a process releases all fcntl locks that process holds on that file. Connection pooling patterns inside containers can trigger this accidentally.

On network-backed Docker volumes, the situation is worse. NFS byte-range lock support requires lockd and the NLM protocol, which is unreliable and often disabled with nolock mount options. mmap() coherency across NFS hosts is not provided at all; each host maintains an independent page cache. AWS EFS, Google Filestore, and similar managed NFS products share these limitations. The SQLite documentation is explicit on this point:

WAL mode does not work on network filesystems. All processes that access the database must reside on the same machine.

But even for local Docker volumes, the mmap coherency issue is a real risk, not a theoretical one.

A Concrete Failure Scenario

Imagine a setup with one writer container and two reader containers, all mounting the same Docker volume at /data, with app.db in WAL mode.

Volume: /data/app.db
         /data/app.db-wal
         /data/app.db-shm

The writer inserts rows and commits. The WAL file gains new frames. The writer updates the -shm index to reflect the new frames. Container A’s reader, having recently mapped -shm, may or may not see the updated index depending on whether the kernel has flushed the page to the shared backing store and whether the reader’s mapping reflects it. If the reader has a stale index, it reads from the main database file and misses the committed data.

Worse: the writer triggers a checkpoint, which moves the WAL frames into the main database file and resets the WAL. If a reader was mid-read using a WAL frame that has now been checkpointed and overwritten, it may read partially overwritten data. The read-lock mechanism in -shm is supposed to prevent this, but that mechanism depends on coherent shared memory, which is not guaranteed across containers.

The Safe Configuration

For a Docker scenario with multiple containers accessing the same SQLite file, the correct journal mode is the default rollback journal:

PRAGMA journal_mode=DELETE;

The rollback journal uses a standard exclusive file lock on the main database file during writes. POSIX fcntl exclusive locks work correctly across processes and containers on the same kernel. Readers block during writes, and only one writer can proceed at a time, but the coordination is correct.

If you have an existing database in WAL mode, you can switch it:

PRAGMA journal_mode=DELETE;
PRAGMA journal_mode;  -- verify it changed

Delete the -shm and -wal files after switching (when no connections are open):

rm app.db-shm app.db-wal

For read-heavy workloads where WAL concurrency is genuinely needed, the correct architecture is a single container owning the SQLite file and exposing an HTTP or Unix socket API to other containers. The database stays local to one process; other services query it over the network. Datasette and similar tools are built for exactly this pattern.

Litestream and rqlite Are Different Problems

Litestream is frequently mentioned in SQLite-in-Docker discussions. It is a streaming replication tool that continuously ships WAL frames to S3 or other object storage, enabling point-in-time recovery and read replica restoration. Litestream does not solve the multi-container concurrent access problem. It is a disaster recovery and replication tool, and it is designed to run as a sidecar alongside a single SQLite writer.

rqlite wraps SQLite in a Raft consensus layer. Multiple rqlite nodes participate in leader election; only the leader accepts writes and replicates them to followers. This does solve the multi-container problem, but it adds significant operational complexity, changes the client interface, and is designed for distributed deployments rather than simple Docker setups.

For most cases where someone is reaching for multi-container SQLite, the actual requirement is either:

  1. A single container that owns the database and handles all reads and writes, with other containers calling it via an API.
  2. A proper client-server database like PostgreSQL, which is built from the ground up for networked concurrent access.

SQLite with rollback journal mode in a single container is fine. SQLite in WAL mode shared across containers is a trap.

The Detection Problem

What makes this particularly dangerous is that the failure is usually silent. Stale reads look like correct reads. Partial reads from mid-checkpoint pages do not always produce obvious errors. SQLite’s integrity checks (PRAGMA integrity_check) can detect structural corruption but may not catch logical inconsistency from stale index views.

The failure is also intermittent. Under light load, the kernel may flush page cache state frequently enough that containers see mostly-consistent data. Under heavy write load, the divergence accumulates. This means a setup can appear to work in testing and fail in production at elevated load, which is among the worst failure mode profiles a persistence layer can have.

If you have SQLite in WAL mode on a Docker volume and it has been running without obvious errors, the right response is not relief. Check whether more than one container has the database open, and if so, switch to rollback journal mode before the load increases enough to surface the problem.

The -shm file is doing something clever, and that cleverness has a well-defined scope. Staying inside that scope is straightforward once you understand where the boundary is.

Was this interesting?