The Hidden Shared Memory Problem in SQLite WAL Mode and Docker Volumes

Source: simonwillison

Simon Willison wrote recently about running SQLite in WAL mode across Docker containers sharing a volume, which is the kind of thing that looks fine until it isn’t. The failure modes are subtle enough that many people ship this configuration and never hit them — right up until they do.

The core issue is not the one people usually assume. It isn’t that file locking doesn’t work across containers, or that the volume driver is unreliable. The deeper problem is that WAL mode’s correctness depends on a file that is designed to behave like shared memory, and shared memory semantics break down in ways that are difficult to observe until they cause data corruption or stale reads.

What WAL Mode Actually Does

By default, SQLite uses a rollback journal. Before modifying any page, it writes the original contents to a -journal file. On commit, the journal is deleted. To write its changes into the database file, a writer must take an exclusive lock, and while it holds that lock no reader can proceed.

WAL mode flips this around. Instead of writing original content to a side file, new content is appended to a write-ahead log (the -wal file). The main database file is only updated when a checkpoint runs. Readers always see a consistent snapshot because they read from the main file and pull any newer pages they need from the WAL. The headline benefit is that readers and writers no longer block each other.
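The reader and writer behavior described above can be observed with two connections in one process. This is a minimal sketch using Python’s standard sqlite3 module; the table name and the function name are illustrative, and it assumes a local filesystem where WAL mode is available.

```python
import os
import sqlite3
import tempfile


def reader_sees_snapshot():
    """Return (count_during_write, count_after_commit) as seen by a reader."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")

        # isolation_level=None leaves transaction control to explicit BEGIN/COMMIT.
        writer = sqlite3.connect(path, isolation_level=None)
        writer.execute("PRAGMA journal_mode=WAL")
        writer.execute("CREATE TABLE t (x INTEGER)")
        writer.execute("INSERT INTO t VALUES (1)")

        # timeout=0 makes the reader fail immediately instead of waiting on a lock.
        reader = sqlite3.connect(path, timeout=0, isolation_level=None)

        writer.execute("BEGIN IMMEDIATE")           # take the write lock
        writer.execute("INSERT INTO t VALUES (2)")  # uncommitted change

        # The reader is not blocked and sees only committed data.
        during = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]
        writer.execute("COMMIT")
        after = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]

        reader.close()
        writer.close()
    return during, after
```

The reader uses a zero timeout on purpose: if the writer did block it, the query would raise an error rather than quietly wait.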

But WAL mode creates a third file alongside database.db and database.db-wal: the database.db-shm file. This is the one that matters for the Docker problem.
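A quick way to see the three files is to list the directory while a WAL-mode connection is still open. The sketch below assumes Python’s sqlite3 module and a throwaway temporary directory; the function name is hypothetical.

```python
import os
import sqlite3
import tempfile


def wal_sidecar_files():
    """Return (journal_mode, file names present while the connection is open)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "database.db")
        conn = sqlite3.connect(path)
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.commit()
        # List before closing: the last connection to close may remove
        # the -wal and -shm files again.
        names = sorted(os.listdir(tmp))
        conn.close()
    return mode, names
```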

What the -shm File Actually Is

The -shm file is a WAL index. Its structure is documented in the SQLite source: it contains a header (duplicated for crash resilience) that stores the current WAL size in frames, plus a hash table mapping page numbers to WAL frame numbers. When any process needs to read page 42, it checks the hash table in the -shm file to find the most recent WAL frame containing that page, then reads from there rather than scanning the entire WAL linearly.
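For the curious, the duplicated header can be inspected directly. The following is a diagnostic sketch only: the field layout is taken from the WalIndexHdr struct in SQLite’s wal.c as I understand it (48 bytes per copy, native byte order), it is an internal format rather than a stable API, and the field names in the tuple are my own labels. It also assumes a platform such as Linux where an ordinary read() sees what other processes wrote through their mappings.

```python
import os
import sqlite3
import struct
import tempfile

# One WalIndexHdr copy: 48 bytes, native byte order (assumed from wal.c).
HDR = struct.Struct("=IIIBBHII2I2I2I")
FIELDS = ("iVersion", "unused", "iChange", "isInit", "bigEndCksum", "szPage",
          "mxFrame", "nPage", "frameCksum1", "frameCksum2",
          "salt1", "salt2", "cksum1", "cksum2")


def read_wal_index_header(shm_path):
    """Return both header copies from a -shm file as dicts."""
    with open(shm_path, "rb") as f:
        raw = f.read(2 * HDR.size)
    first = dict(zip(FIELDS, HDR.unpack_from(raw, 0)))
    second = dict(zip(FIELDS, HDR.unpack_from(raw, HDR.size)))
    return first, second


def demo_header():
    """Create a small WAL-mode database and return its two header copies."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "database.db")
        conn = sqlite3.connect(path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.commit()
        first, second = read_wal_index_header(path + "-shm")
        conn.close()
    return first, second
```

When no writer is active the two copies should match, and mxFrame gives the number of valid frames currently in the WAL.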

The critical design detail: every process opening the database in WAL mode mmap()s the -shm file directly into its address space. Writes to the WAL index happen through this memory mapping, not through standard file I/O. The name stands for “shared memory” — it is literally designed to function as shared memory between concurrent processes.

On a single machine, this works well. Multiple processes mmap() the same file, and because they are all running under the same Linux kernel, the kernel’s page cache ensures that all of them see the same physical memory pages. A write from process A at offset 0 is visible to process B’s mapping at the same offset because they share the same underlying page cache entry for that inode.
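The page-cache sharing described above is easy to demonstrate without SQLite. This sketch maps the same file twice through two independent descriptors, standing in for two processes on the same kernel; the file name is arbitrary.

```python
import mmap
import os
import tempfile


def shared_mapping_demo():
    """Write through one mapping of a file and read it back through another."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index-shm")
        with open(path, "wb") as f:
            f.write(b"\x00" * mmap.PAGESIZE)  # a mapping needs a non-empty file

        # Two independent descriptors and mappings of the same inode.
        with open(path, "r+b") as fa, open(path, "r+b") as fb:
            map_a = mmap.mmap(fa.fileno(), mmap.PAGESIZE)  # shared by default
            map_b = mmap.mmap(fb.fileno(), mmap.PAGESIZE)
            map_a[0:5] = b"hello"      # a store through mapping A...
            seen = bytes(map_b[0:5])   # ...is read back through mapping B
            map_a.close()
            map_b.close()
    return seen
```

No write() or fsync() call is involved: the second mapping sees the bytes because both mappings are backed by the same cached page.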

Where Docker Breaks the Model

The official SQLite documentation on WAL mode is direct about this constraint:

WAL mode does not work in a network filesystem environment. All processes using the database must be on the same machine and must use the same operating-system kernel.

“Same operating-system kernel” is the key phrase. Docker containers on a single host all run under the same kernel, so at first glance this seems fine. But the picture is more complicated.

The PID namespace problem. SQLite uses POSIX fcntl() byte-range locks to coordinate access to both the main database file and the WAL index. The kernel tracks each lock against the inode and the process that owns it. Each Docker container runs in its own PID namespace: the first process in a container is PID 1 in that namespace, regardless of what PID the kernel assigns it on the host. Lock ownership is resolved by the host kernel, not by the namespace-local PID, so two containers that each have a “PID 1” are still distinct lock owners, and this part is actually fine. The real problem is a different POSIX lock behavior: when a process closes any file descriptor for an inode, the kernel releases every fcntl() lock that process holds on that inode. This only ever drops the closing process’s own locks, never those of a process in another container, and SQLite’s Unix VFS works around it for the connections it manages itself. It becomes a hazard when application code in the same process opens and closes the database file outside SQLite (a backup copy or a checksum pass, for example): the process silently loses its locks, and a writer in another container can then proceed as if nothing were holding them.
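The close-releases-locks behavior can be reproduced outside SQLite. The sketch below assumes a POSIX system and uses Python’s fcntl.lockf, which wraps the fcntl() locking calls; a child process probes whether the lock is still held. The helper names and the probe script are illustrative.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child script: try a non-blocking exclusive lock; exit 0 if acquired, 1 if not.
PROBE = """
import fcntl, sys
with open(sys.argv[1], "r+b") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        sys.exit(1)
sys.exit(0)
"""


def other_process_can_lock(path):
    """Return True if a separate process can take the lock on path."""
    return subprocess.run([sys.executable, "-c", PROBE, path]).returncode == 0


def lock_lost_on_close():
    """Return (lockable_before, lockable_after) from another process's view."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "database.db")
        with open(path, "wb") as f:
            f.write(b"x")

        holder = open(path, "r+b")
        fcntl.lockf(holder, fcntl.LOCK_EX)       # POSIX lock held by this process
        before = other_process_can_lock(path)    # expected: blocked

        # Unrelated code in the same process opens and closes the same file.
        open(path, "rb").close()
        after = other_process_can_lock(path)     # the lock was silently dropped

        holder.close()
    return before, after
```

Note that the descriptor that took the lock is still open when the second probe succeeds; closing the unrelated descriptor was enough to release it.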

The overlayfs coherence problem. Docker’s default storage driver is overlay2, which uses overlayfs to provide container-private filesystem layers. The container’s own root filesystem goes through overlayfs, but bind mounts and named volumes bypass overlayfs entirely and map directly to the host filesystem. This distinction matters: if your SQLite database is on a bind-mounted host directory or a local named volume, the mmap coherence through the page cache actually works. If your database is inside the container’s own writable layer (not mounted externally), the overlayfs path is involved, and mmap coherence across containers is not guaranteed.

The -shm initialization race. If the -shm file does not exist when a process opens the database, that process creates and initializes it. If two containers race to open a WAL-mode database simultaneously and the -shm file doesn’t exist yet, both may attempt to create and initialize it. One will win the file creation, but the loser’s initialization writes may partially overwrite the winner’s, leaving a corrupted WAL index. SQLite has recovery logic for this, but it relies on the WAL recovery lock, which is itself an fcntl() lock on the -shm file and so inherits the same cross-container caveats.

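The checkpoint and WAL reset problem. When a container performs a TRUNCATE checkpoint (copying all WAL frames back to the main database file and truncating the WAL to zero bytes), the next write starts a fresh WAL with new salt values. The salt is stored in the WAL header and copied into the -shm header. SQLite normally makes such a checkpoint wait until no reader still depends on the WAL, so the failure here presupposes that a reader’s lock or read mark was not visible to the checkpointing container. In that case, a container that has cached an old salt and is in the middle of a read transaction will subsequently fail its consistency checks, potentially getting SQLITE_CORRUPT or silently reading stale data.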
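The effect of a TRUNCATE checkpoint on the WAL file can be observed from a single connection. This sketch assumes Python’s sqlite3 module and no other readers; the function name is hypothetical.

```python
import os
import sqlite3
import tempfile


def truncate_checkpoint_demo():
    """Return (wal_bytes_before, busy_flag, wal_bytes_after)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "database.db")
        conn = sqlite3.connect(path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
        conn.commit()

        before = os.path.getsize(path + "-wal")
        # The pragma returns one row: (busy, wal_frames, frames_checkpointed).
        busy, _, _ = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        after = os.path.getsize(path + "-wal")
        conn.close()
    return before, busy, after
```

A non-zero busy flag means the checkpoint could not finish because another connection was in the way, which is worth logging in any multi-container setup.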

What Volume Types Are Actually Safe

The volume type determines whether the underlying single-kernel coherence guarantee applies.

Bind mounts and local named volumes on a single Docker host both map to ext4, xfs, or similar local filesystems through the host kernel’s page cache. Mmap coherence works because both containers share the same physical page cache entries. fcntl() locks work because they go through the same kernel. This is the configuration Willison was exploring, and it is the only Docker configuration where WAL mode has any chance of functioning correctly.

services:
  app1:
    volumes:
      - /host/data:/app/data   # bind mount — mmap coherence works
  app2:
    volumes:
      - /host/data:/app/data   # same bind mount

NFS mounts, EFS, GlusterFS, CIFS, and any other network-backed volume are all unsafe for WAL mode. These filesystems either do not support fcntl() locks reliably, or their mmap semantics do not provide the inter-process coherence that the WAL index requires. The nolock NFS mount option makes this actively worse: SQLite believes it holds locks it doesn’t have.
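One way to catch an unsafe volume at startup is to look up the filesystem type behind the database path. The sketch below parses text in the format of /proc/self/mounts on Linux; the function names are hypothetical, the set of network filesystem types is illustrative rather than exhaustive, and mount points containing escaped spaces are not handled.

```python
import os

# Illustrative, not exhaustive: filesystem types that are network-backed.
NETWORK_FS = {"nfs", "nfs4", "cifs", "smb3", "ceph", "9p",
              "fuse.glusterfs", "fuse.sshfs"}


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        mount_point, fs_type = parts[1], parts[2]
        prefix = mount_point.rstrip("/") + "/"
        # Longest matching mount point wins, so /app/data beats /.
        if (path + "/").startswith(prefix) and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def looks_network_backed(path, mounts_text):
    """Return True if path sits on one of the listed network filesystems."""
    return fs_type_for(path, mounts_text) in NETWORK_FS
```

Inside a container this could be called with the contents of /proc/self/mounts and the database path, refusing to enable WAL mode when it returns True.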

Practical Configurations That Work

If you need multiple containers to access the same SQLite database on a bind-mounted volume on a single host, the safest approach is to not use WAL mode:

PRAGMA journal_mode=DELETE;

The rollback journal mode gives up concurrent reads during writes, but it uses simpler lock semantics and doesn’t require -shm file coherence. For most SQLite use cases — local caching, embedded analytics, configuration storage — the performance difference is not significant enough to justify the correctness risk.
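When switching an existing database, it is worth checking what the pragma returns rather than assuming the change took effect. A minimal sketch, assuming Python’s sqlite3 module and that no other connection has the database open; the function name is hypothetical.

```python
import os
import sqlite3
import tempfile


def switch_to_rollback_journal(path):
    """Ask for DELETE mode and return the mode SQLite actually reports."""
    conn = sqlite3.connect(path)
    try:
        # The pragma returns the resulting journal mode, which can differ
        # from the requested one if the switch could not be made.
        return conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
    finally:
        conn.close()


def demo_switch():
    """Create a WAL-mode database, switch it, and re-check from a new connection."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "database.db")
        conn = sqlite3.connect(path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.close()

        reported = switch_to_rollback_journal(path)

        check = sqlite3.connect(path)
        persisted = check.execute("PRAGMA journal_mode").fetchone()[0]
        check.close()
    return reported, persisted
```

The journal mode is stored in the database file, so the setting persists for every container that opens it afterwards.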

If you need WAL mode’s write performance (fewer fsyncs, sequential writes) and can accept that only one container touches the database at all, PRAGMA locking_mode=EXCLUSIVE on that container keeps its locks after the first transaction instead of releasing them. If it is set before the first WAL-mode access, SQLite keeps the WAL index in heap memory and never uses the -shm file. Any other connection will see SQLITE_BUSY until the locks are released, which in EXCLUSIVE mode never happens between transactions. This is effectively single-process access, which may be acceptable depending on your workload.
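The effect of EXCLUSIVE locking mode on a second connection can be checked in a few lines. This sketch assumes Python’s sqlite3 module; the function name is hypothetical. It reports whether a -shm file exists rather than assuming the answer, since that depends on the pragma order and on the SQLite build.

```python
import os
import sqlite3
import tempfile


def exclusive_wal_demo():
    """Return (second_connection_locked_out, shm_file_exists)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "database.db")
        owner = sqlite3.connect(path)
        # Order matters: request EXCLUSIVE before the first WAL-mode access.
        owner.execute("PRAGMA locking_mode=EXCLUSIVE")
        owner.execute("PRAGMA journal_mode=WAL")
        owner.execute("CREATE TABLE t (x INTEGER)")
        owner.execute("INSERT INTO t VALUES (1)")
        owner.commit()

        other = sqlite3.connect(path, timeout=0)  # fail fast instead of waiting
        try:
            other.execute("SELECT COUNT(*) FROM t").fetchone()
            locked_out = False
        except sqlite3.OperationalError as exc:
            locked_out = "locked" in str(exc)     # "database is locked"

        shm_exists = os.path.exists(path + "-shm")
        other.close()
        owner.close()
    return locked_out, shm_exists
```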

For containers that should only read, mounting the volume read-only prevents accidental writes and -shm creation races:

volumes:
  - /host/data:/app/data:ro

A read-only container can still read a WAL-mode database if the -shm file already exists and has been initialized by the writer. SQLite will attempt to open it in read-write mode (to update read marks), fail, and fall back to opening it read-only. This is a supported fallback that maintains read consistency.
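Independently of the mount flags, a reader can also ask SQLite itself for a read-only connection. This is a complementary application-level guard, not something the original setup requires: the sketch uses the mode=ro URI parameter through Python’s sqlite3 module, runs on an ordinary writable directory rather than a read-only mount, and assumes the path contains no characters that need URI escaping.

```python
import os
import sqlite3
import tempfile


def open_read_only(path):
    """Open path with SQLite's mode=ro URI parameter."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def read_only_demo():
    """Return (row_count_seen_by_reader, write_was_rejected)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "database.db")
        writer = sqlite3.connect(path)
        writer.execute("PRAGMA journal_mode=WAL")
        writer.execute("CREATE TABLE t (x INTEGER)")
        writer.execute("INSERT INTO t VALUES (1)")
        writer.commit()

        reader = open_read_only(path)
        count = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]
        try:
            reader.execute("INSERT INTO t VALUES (2)")
            write_rejected = False
        except sqlite3.OperationalError:  # attempt to write a readonly database
            write_rejected = True

        reader.close()
        writer.close()
    return count, write_rejected
```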

When You Actually Need Multiple Writers

If your use case genuinely requires multiple containers to write to the same SQLite database, the shared-volume approach has no reliable solution. The options worth considering are:

Litestream for durability and replication without multi-writer access. A single container writes, Litestream tails the WAL and replicates to S3 or similar. This solves disaster recovery and enables read replicas but does not solve multi-writer access.

rqlite for a distributed SQLite cluster with Raft consensus. Each node maintains its own local SQLite database. Writes go through the Raft leader and are applied to all nodes. The SQLite file is never shared across processes. This adds roughly 1-5ms per write for the consensus round-trip, which is the correct trade-off for multi-writer distributed access.

A sidecar service pattern where one container owns the SQLite file and other containers communicate with it over HTTP or gRPC. Datasette works well for read-heavy cases where you want SQL query access. For write access, a thin HTTP wrapper around SQLite is straightforward to build.
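As a rough illustration of the thin-wrapper idea, the sketch below puts a single SQLite connection behind Python’s standard http.server. Everything about the API is hypothetical (the notes table, the JSON shapes, the function name), there is no routing, authentication, or error handling, and it binds to 127.0.0.1 for the example where a container would typically bind to an address other containers can reach.

```python
import json
import sqlite3
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_server(db_path, port=0):
    """Build an HTTP server that is the only opener of db_path."""
    conn = sqlite3.connect(db_path, check_same_thread=False)
    conn.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT)")
    conn.commit()
    lock = threading.Lock()  # serialize access to the single connection

    class Handler(BaseHTTPRequestHandler):
        def _reply(self, status, payload):
            data = json.dumps(payload).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def do_GET(self):  # any GET returns the list of note bodies
            with lock:
                rows = conn.execute("SELECT body FROM notes").fetchall()
            self._reply(200, [r[0] for r in rows])

        def do_POST(self):  # any POST with JSON {"body": "..."} inserts a note
            length = int(self.headers.get("Content-Length", 0))
            body = json.loads(self.rfile.read(length))["body"]
            with lock:
                conn.execute("INSERT INTO notes VALUES (?)", (body,))
                conn.commit()
            self._reply(201, {"ok": True})

        def log_message(self, *args):  # keep the example quiet
            pass

    return HTTPServer(("127.0.0.1", port), Handler)
```

Calling serve_forever() on the returned server starts handling requests. Because only this process ever opens the file, WAL mode is back on safe ground: the other containers share an HTTP endpoint, not a -shm file.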

Turso/libSQL if you want SQLite-compatible semantics with an embedded replica model. Each container has a local read replica; writes go to a primary server. Reads are local and fast; write consistency is eventual unless you call sync() explicitly.

The Broader Pattern

WAL mode is one of the best features in SQLite for single-process applications. The performance advantages are real: fewer fsyncs on the write path, sequential WAL writes instead of random journal writes, concurrent reads during writes. For a process running in a single container, there is rarely a reason not to use it.

The Docker shared volume scenario is where the assumption embedded in WAL mode’s design becomes visible: all concurrent openers must share a kernel and be able to truly share memory through mmap. The -shm file is not a regular coordination file you can safely treat as a named pipe or lock file. It is an mmap-based shared memory structure with weaker durability guarantees than the main database, designed to be fast at the cost of relying on OS-level memory coherence.

Understanding that constraint makes the failure modes predictable: they show up exactly when you violate the single-kernel assumption, and they are invisible under light load and low concurrency, which is why they so often survive testing and fail in production.
