Shared Volumes, Shared State: How SQLite WAL Actually Coordinates Across Docker Containers
Source: simonwillison
Simon Willison recently documented his experience running SQLite in WAL mode across Docker containers sharing a volume. It works, with caveats. The more interesting question is why it works at all, because the answer sits several layers below SQLite itself, in how Linux handles memory-mapped files across process boundaries.
What WAL Mode Actually Creates on Disk
When you enable WAL mode on a SQLite database, you get three files instead of one:
database.db # the main database file
database.db-wal # the write-ahead log
database.db-shm # the WAL index (shared memory index)
The -wal file is straightforward: writes go there first, sequentially, instead of directly modifying the main database file. This is what enables WAL’s headline feature, concurrent readers during a write. Readers consult the WAL file to find recent versions of pages before falling back to the main file.
The -shm file is where things get interesting. Its name is misleading. It is not a POSIX shared memory object (nothing like shm_open or /dev/shm). It is an ordinary file on disk that SQLite memory-maps into every connecting process’s address space. This file contains the WAL index: a hash table mapping database page numbers to their most recent positions within the -wal file. Without this index, every read would require scanning the entire WAL from the beginning.
Enabling WAL mode is a one-liner:
PRAGMA journal_mode=WAL;
This is persistent. Once set, the database stays in WAL mode across connections and restarts. You do not need to set it again each time you open the database, though checking the current mode and setting it conditionally is common practice.
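A minimal sketch of that persistence, using Python's standard sqlite3 module. The function name, table name, and temporary path are illustrative assumptions, not anything from the original write-up. One connection switches the database to WAL and closes; a second connection that never issues the pragma then reports the mode it finds.

```python
import os
import sqlite3
import tempfile


def journal_mode_after_reopen(db_path):
    """Enable WAL on one connection, then report the mode a fresh one sees."""
    first = sqlite3.connect(db_path)
    first.execute("PRAGMA journal_mode=WAL")
    first.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT)")
    first.commit()
    first.close()

    # A brand-new connection that never sets the journal mode itself.
    second = sqlite3.connect(db_path)
    try:
        return second.execute("PRAGMA journal_mode").fetchone()[0]
    finally:
        second.close()


with tempfile.TemporaryDirectory() as tmp:
    mode = journal_mode_after_reopen(os.path.join(tmp, "database.db"))
    print(mode)
```

The mode is stored in the database file header, which is why the second connection picks it up without being told.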
How Two Containers End Up Sharing Memory
Docker containers are ordinary Linux processes isolated by namespaces: each container gets its own mount namespace, PID namespace, network namespace, and so on. But a named volume or bind mount punches through the mount namespace. Both containers see the same underlying inode on the same host filesystem.
When container A opens the SQLite database and SQLite calls mmap() on the -shm file, the Linux kernel maps those file-backed pages into container A’s virtual address space. When container B does the same, the kernel maps those same physical pages into container B’s virtual address space. This is the Linux page cache doing what it always does: one copy of a file’s contents in RAM, shared across every process that has it open.
The consequence is that when container A writes an updated WAL index entry to its mapped view of -shm, container B’s mapped view reflects that change immediately. There is no serialization, no copy, no IPC message. It is the same physical memory.
This is functionally identical to two ordinary processes on the same host sharing a SQLite database in WAL mode, which SQLite has supported since version 3.7.0 released in July 2010. The Docker layer changes nothing about the underlying mechanics, as long as the volume is backed by a local filesystem.
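The shared-mapping behaviour can be seen without SQLite at all. The sketch below is an approximation under stated assumptions: it maps one scratch file twice through two independent file descriptors and writes through one mapping, then reads through the other. For brevity both mappings live in a single process; the file name and helper name are hypothetical, and the same kernel mechanism applies when the two mappings belong to processes in different containers on one host.

```python
import mmap
import os
import tempfile


def shared_mapping_demo(path, size=4096):
    """Map one file twice; a store through one view is visible in the other."""
    with open(path, "wb") as f:
        f.write(b"\x00" * size)

    # Two independent descriptors, standing in for two containers' processes.
    fd_a = os.open(path, os.O_RDWR)
    fd_b = os.open(path, os.O_RDWR)
    try:
        # On Unix, mmap.mmap defaults to a shared, read-write mapping.
        view_a = mmap.mmap(fd_a, size)
        view_b = mmap.mmap(fd_b, size)
        try:
            view_a[0:5] = b"hello"     # plain memory store, no write() call
            return bytes(view_b[0:5])  # read back through the other mapping
        finally:
            view_a.close()
            view_b.close()
    finally:
        os.close(fd_a)
        os.close(fd_b)


with tempfile.TemporaryDirectory() as tmp:
    seen = shared_mapping_demo(os.path.join(tmp, "scratch-shm"))
    print(seen)
```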
The Locking Layer
Shared memory coherency is only half the story. SQLite also needs to coordinate access so that two containers do not write to the WAL at the same time. WAL mode does this with POSIX fcntl() byte-range locks; in SQLite’s default Unix VFS, the WAL locks are taken on a small reserved region of the -shm file.
POSIX file locks are owned by a process and enforced by the host kernel, so they apply across containers on the same host. When container A holds a write lock, container B’s attempt to acquire the same lock will either block (F_SETLKW) or fail immediately with EAGAIN or EACCES (F_SETLK). SQLite uses the non-blocking F_SETLK internally and surfaces lock contention as SQLITE_BUSY.
This is why busy_timeout matters so much in multi-container setups:
PRAGMA busy_timeout=5000;
This tells SQLite to retry on SQLITE_BUSY for up to 5000 milliseconds before giving up. Without it, any write contention between containers immediately surfaces as an error to your application. With a reasonable timeout, transient contention is invisible. The SQLite documentation recommends setting this for any application where multiple processes or connections share a database.
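To make the contention visible, here is a small sketch in which two connections stand in for two containers. The table and function names are hypothetical. One connection takes the write lock with BEGIN IMMEDIATE and holds it; a second connection with busy_timeout=0 then tries to write. After the lock is released, the same attempt succeeds.

```python
import os
import sqlite3
import tempfile


def attempt_write(db_path, busy_timeout_ms):
    """Try one insert on a fresh connection; return "ok" or the error text."""
    # timeout=0 turns off Python's own default wait; the pragma sets the real one.
    conn = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("INSERT INTO jobs (name) VALUES ('from-second-writer')")
        conn.execute("COMMIT")
        return "ok"
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "database.db")
    holder = sqlite3.connect(path, isolation_level=None)
    holder.execute("PRAGMA journal_mode=WAL")
    holder.execute("CREATE TABLE jobs (name TEXT)")

    holder.execute("BEGIN IMMEDIATE")          # take the write lock and keep it
    while_locked = attempt_write(path, 0)      # no timeout: fails straight away
    holder.execute("COMMIT")                   # release the write lock
    after_release = attempt_write(path, 5000)  # uncontended now
    holder.close()

    print(while_locked, "|", after_release)
```

With a non-zero timeout the second writer would instead wait for the holder to commit, which is the behaviour you want when the contention is brief.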
A few other pragmas are worth setting alongside journal mode:
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;
PRAGMA busy_timeout=5000;
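synchronous=NORMAL is a reasonable choice in WAL mode, with one trade-off. Unlike the default FULL mode, it does not fsync the WAL after every transaction commit, only at checkpoints. The database stays consistent: a crash or power loss cannot corrupt it, and an application crash loses nothing that was committed. What NORMAL gives up is durability against power loss or an OS crash, where the most recently committed transactions may be rolled back. If that is unacceptable, keep synchronous=FULL.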
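One way to apply these consistently is a single connection helper that every container uses. The sketch below assumes Python's sqlite3 module and a hypothetical helper name, open_db. Note that journal_mode is stored in the database file, while synchronous and busy_timeout apply only to the connection that sets them, so the helper sets all three every time.

```python
import os
import sqlite3
import tempfile


def open_db(db_path):
    """Open a connection with the pragmas discussed above already applied."""
    conn = sqlite3.connect(db_path)
    # Set the timeout first so the journal-mode switch can wait out a lock.
    conn.execute("PRAGMA busy_timeout=5000")
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_db(os.path.join(tmp, "database.db"))
    settings = {
        name: conn.execute(f"PRAGMA {name}").fetchone()[0]
        for name in ("journal_mode", "synchronous", "busy_timeout")
    }
    conn.close()
    print(settings)
```

SQLite reports synchronous as an integer; NORMAL is 1.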
Checkpointing With Multiple Writers
The WAL file grows over time. SQLite automatically triggers a checkpoint when the WAL reaches 1000 pages by default, writing accumulated WAL contents back into the main database file. With multiple containers, any of them can trigger this checkpoint, which is fine. The checkpoint acquires appropriate locks before writing.
What can go wrong is unbounded WAL growth. A checkpoint cannot proceed past any active reader. If one container holds a long-running read transaction, the WAL file will grow until that reader finishes. The kernel releases fcntl() locks when a process exits, so a container that crashes outright does not leave a reader behind. The risk is a process that hangs, or that leaks an open read transaction, and so pins the WAL for as long as it stays alive.
You can observe WAL state and trigger manual checkpoints:
-- Run a non-blocking checkpoint; the result row is (busy, WAL frames, frames checkpointed)
PRAGMA wal_checkpoint(PASSIVE);
-- Force a full checkpoint with WAL truncation
PRAGMA wal_checkpoint(TRUNCATE);
-- Adjust automatic checkpoint threshold (default 1000 pages)
PRAGMA wal_autocheckpoint=500;
In practice, if your containers can crash without cleanly releasing connections, you may want a periodic checkpoint job, or at minimum a startup routine that runs a checkpoint when a container initializes.
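A startup checkpoint routine might look roughly like this. It is a sketch, not a hardened job: the function name, table name, and row counts are made up for illustration. It runs PRAGMA wal_checkpoint in a chosen mode and returns the three columns of the result row by name. The demo writes some rows, then compares the size of the -wal file before and after a TRUNCATE checkpoint.

```python
import os
import sqlite3
import tempfile

CHECKPOINT_MODES = {"PASSIVE", "FULL", "RESTART", "TRUNCATE"}


def checkpoint(conn, mode="PASSIVE"):
    """Run a WAL checkpoint and return its result row as a dict."""
    if mode not in CHECKPOINT_MODES:
        raise ValueError(f"unknown checkpoint mode: {mode}")
    busy, wal_frames, checkpointed = conn.execute(
        f"PRAGMA wal_checkpoint({mode})"
    ).fetchone()
    # busy is 1 when a blocking mode could not complete because of other
    # connections, for example a reader still holding an old snapshot.
    return {"busy": busy, "wal_frames": wal_frames, "checkpointed": checkpointed}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "database.db")
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE events (payload TEXT)")
    conn.executemany("INSERT INTO events VALUES (?)", [("x" * 100,)] * 200)
    conn.commit()

    wal_before = os.path.getsize(path + "-wal")
    result = checkpoint(conn, "TRUNCATE")
    wal_after = os.path.getsize(path + "-wal")
    conn.close()

    print(wal_before, wal_after, result)
```

A real job would check the busy column and log or retry when it is 1, since that usually means some connection is holding a read transaction open.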
Where This Breaks Down
The same-host local volume case works because the Linux page cache keeps shared mappings coherent. Remove that assumption and the whole thing collapses.
The first failure mode is network filesystems. NFS, CIFS, and most networked storage systems do not guarantee coherent mmap semantics across hosts. Two machines mapping the same NFS file can see divergent cached copies. SQLite explicitly documents this limitation: WAL mode should not be used on network filesystems. Rollback journal mode has similar problems, but its failures are more likely to produce visible errors than silent corruption.
The second failure mode is containers on different Docker hosts, even if they share a networked volume. From the kernel’s perspective, each host has its own page cache. Mmap on host A and mmap on host B pointing to the same NFS-backed file are two independent memory regions. The lock semantics may also be broken: NFS locks go through the network lock manager, which has its own failure modes and does not provide the same guarantees as local fcntl() locks.
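Since these failures tend to be silent, a container can at least check what kind of filesystem its data directory sits on before opening the database. The following is a rough guard, assuming a Linux host where /proc/self/mounts is readable. The helper names are hypothetical, the list of filesystem types is illustrative rather than exhaustive, and the sample mount table is invented for the demo.

```python
import os

# Illustrative, not exhaustive: fstype strings as they appear in /proc/mounts.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "ceph", "fuse.sshfs"}


def filesystem_type(path, mounts_text):
    """Return the fstype of the longest mount point that contains `path`."""
    target = os.path.abspath(path) + "/"
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount stacked on the same mount point win.
        if target.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_network_filesystem(path, mounts_text=None):
    """True if `path` appears to live on a network filesystem."""
    if mounts_text is None:
        with open("/proc/self/mounts") as f:
            mounts_text = f.read()
    return filesystem_type(path, mounts_text) in NETWORK_FS_TYPES


# Invented mount table for the demo; real input comes from /proc/self/mounts.
SAMPLE_MOUNTS = """\
overlay / overlay rw,relatime 0 0
/dev/nvme0n1p2 /data ext4 rw,relatime 0 0
fileserver:/export /mnt/shared nfs4 rw,relatime 0 0
"""

local_type = filesystem_type("/data/database.db", SAMPLE_MOUNTS)
shared_is_network = is_network_filesystem("/mnt/shared/database.db", SAMPLE_MOUNTS)
root_is_network = is_network_filesystem("/tmp/database.db", SAMPLE_MOUNTS)
print(local_type, shared_is_network, root_is_network)
```

This does not handle mount points containing escaped spaces, and it cannot tell whether a second host has mounted the same storage. It only catches the obvious case of a volume that is NFS or CIFS from the container's point of view.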
If you need SQLite across multiple hosts, you need a different tool. Litestream provides continuous replication of SQLite databases to object storage, which works as a disaster-recovery layer for single-writer setups. Turso (built on libSQL, an SQLite fork) provides a distributed database with an HTTP API that handles the multi-host coordination layer for you. Neither of these is a drop-in replacement for shared-volume access, but they solve the actual distributed problem rather than pretending a local-filesystem primitive works across network boundaries.
The Setup That Works in Practice
For same-host Docker deployments, the pattern is straightforward. Mount the same named volume into multiple containers:
services:
  api:
    image: myapp-api
    volumes:
      - sqlite-data:/data
  worker:
    image: myapp-worker
    volumes:
      - sqlite-data:/data
volumes:
  sqlite-data:
Ensure each container sets the relevant pragmas on connection open. Keep your write workload on one container if possible. WAL mode allows concurrent readers regardless of write activity, so a single-writer, multiple-reader split is both safe and efficient.
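One way to enforce the single-writer split in code, rather than by convention, is to have the reader containers open the database read-only. The sketch below assumes Python's sqlite3 module and SQLite's URI filename syntax with mode=ro; the helper and table names are hypothetical. The reader can see committed rows but any write attempt is rejected.

```python
import os
import pathlib
import sqlite3
import tempfile


def open_reader(db_path):
    """Open a read-only connection using SQLite's URI filename syntax."""
    uri = pathlib.Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True, isolation_level=None)
    conn.execute("PRAGMA busy_timeout=5000")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "database.db")

    # The single writer, standing in for the one container allowed to write.
    writer = sqlite3.connect(path)
    writer.execute("PRAGMA journal_mode=WAL")
    writer.execute("CREATE TABLE items (name TEXT)")
    writer.execute("INSERT INTO items VALUES ('first')")
    writer.commit()

    # A reader, standing in for a read-only container.
    reader = open_reader(path)
    rows_seen = reader.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    try:
        reader.execute("INSERT INTO items VALUES ('not allowed')")
        write_error = None
    except sqlite3.OperationalError as exc:
        write_error = str(exc)

    reader.close()
    writer.close()
    print(rows_seen, write_error)
```

Even a read-only WAL connection needs to map the -shm file, so the reader containers still need the volume mounted with write access to the database directory.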
One container doing writes while two or three others handle read-heavy query workloads is a legitimate architecture for moderate scale. SQLite in WAL mode on an NVMe-backed local volume can handle several hundred write transactions per second and far more reads. That covers a lot of real workloads before you need a networked database at all.
The limits are well-defined: same host, local filesystem, correct pragma configuration. Inside those bounds, the kernel makes the coordination work transparently. Outside them, the guarantees disappear and corruption becomes possible without obvious errors to alert you.