
CoW at the VM Level: How Snapshot Forking Kills the Cold Start

Source: hackernews

Back in March 2026, a project called zeroboot showed up on Hacker News with a deceptively simple premise: boot Firecracker once, snapshot the entire VM state with Python and numpy already loaded, then spin up every subsequent sandbox by forking that snapshot using Copy-on-Write memory. The result was sub-millisecond VM startup times.

This post is a retrospective look at how that technique actually works at the kernel level, where it fits in the history of checkpoint-restore systems, and what the real trade-offs are when you apply it to multi-tenant code execution.

The Cold Start Problem Is a Layered Problem

Every time you launch an isolated Python execution environment from scratch, you pay for multiple layers of initialization. The Linux kernel boots inside the VM, the init system runs, the Python interpreter loads, the standard library imports, and then your actual user-installed packages like numpy get initialized. For a typical Python 3 environment with numpy, that chain can take 400ms to over a second depending on the VM and storage configuration.

Firecracker was already a dramatic improvement over full QEMU-based VMs for this workload. It strips the device model down to a minimal set: a virtio-net NIC, a virtio-block device, a serial port, and not much else. It starts a minimal Linux kernel in under 125ms with a tuned guest configuration. AWS Lambda and Fargate both run on Firecracker for exactly this reason.

But 125ms is the floor, and you only reach it if you tune the kernel aggressively and pre-build your rootfs. Once you add Python and numpy, you’re back over 500ms before a single line of your user’s code runs. Firecracker improved the hypervisor layer; it did not touch the guest initialization layers above it.

Snapshots and the Restore Path

The idea behind zeroboot is to separate initialization from execution entirely. You pay the full boot cost exactly once: boot the VM, let Python import numpy, let the interpreter reach a clean ready state, and then freeze everything. Firecracker has had snapshot support since version 0.23 (released in 2020), which serializes the full microVM state: guest memory pages, CPU registers, device state, and the KVM internal state.

Restoring from that snapshot gets you back to the exact moment Python was sitting at a clean interpreter prompt. On its own, snapshot restore is already much faster than a cold boot, typically in the range of 50-150ms depending on memory size, because you skip all kernel and interpreter initialization.

But loading a full snapshot still requires reading and mapping all the guest memory into the host process. If your Python environment plus numpy occupies 200MB of guest memory, you’re copying or loading 200MB before the first instruction executes. For sub-millisecond startup, that is still too slow.

Where Copy-on-Write Changes the Equation

Copy-on-Write is a mechanism that Unix-like kernels, Linux included, have long used for fork(). When a parent process calls fork(), the child does not immediately get its own copy of every memory page. Instead, both parent and child share the same physical memory frames, marked read-only in their respective page tables. When either process writes to a shared page, the kernel catches the resulting page fault, copies the page, and updates the faulting process’s page table to point to the new copy. The other process continues to see the original.

Zeroboot applies this same principle at the VM level. Instead of giving each new sandbox VM its own copy of the snapshot memory, all VMs are backed by the same snapshot memory region, mapped as MAP_PRIVATE. On Linux, an anonymous or file-backed MAP_PRIVATE mapping behaves exactly like CoW: reads are satisfied from the underlying pages, writes trigger a page fault that causes the kernel to allocate a fresh page and copy the original content into it before the write completes.
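To make the MAP_PRIVATE behavior concrete, here is a minimal sketch using Python's mmap module on Linux. It is not zeroboot's code: the temporary file stands in for a snapshot memory file, and the names vm_a and vm_b are illustrative stand-ins for two sandboxes mapping it.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def demo_private_mappings():
    # Stand-in for a snapshot memory file: two pages filled with b"S".
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"S" * (2 * PAGE))

        # Two independent MAP_PRIVATE views of the same file, like two "VMs".
        prot = mmap.PROT_READ | mmap.PROT_WRITE
        vm_a = mmap.mmap(fd, 2 * PAGE, flags=mmap.MAP_PRIVATE, prot=prot)
        vm_b = mmap.mmap(fd, 2 * PAGE, flags=mmap.MAP_PRIVATE, prot=prot)

        # The write lands in a private copy of the first page of vm_a only.
        vm_a[0:5] = b"dirty"

        result = {
            "vm_a": bytes(vm_a[0:5]),    # sees its own write
            "vm_b": bytes(vm_b[0:5]),    # still sees the snapshot content
            "file": os.pread(fd, 5, 0),  # the backing file is untouched
        }
        vm_a.close()
        vm_b.close()
        return result
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(demo_private_mappings())
```

The write through vm_a never reaches vm_b or the file, which is the property that lets many sandboxes share one snapshot safely.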

KVM guest memory regions are just ranges of the host process’s virtual address space. When you create a KVM VM and configure its memory slots with KVM_SET_USER_MEMORY_REGION, you hand KVM a host virtual address range, and KVM translates guest physical addresses into that range. If that host virtual address range is a MAP_PRIVATE mapping of your snapshot’s memory file, every new VM starts sharing physical pages with the snapshot and with every other running VM backed by the same snapshot, without any upfront copy.
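As a rough illustration of what "handing KVM a host virtual address range" means, the sketch below packs the argument structure that KVM_SET_USER_MEMORY_REGION takes, following the kvm_userspace_memory_region layout in the Linux KVM headers. It only builds the bytes and does not issue the ioctl. The helper name pack_memory_region and all the values passed to it are hypothetical; in a real VMM the userspace address would be the start of the MAP_PRIVATE mapping.

```python
import struct

# Layout of struct kvm_userspace_memory_region from <linux/kvm.h>:
#   __u32 slot; __u32 flags; __u64 guest_phys_addr;
#   __u64 memory_size; __u64 userspace_addr;
REGION_FORMAT = "=IIQQQ"


def pack_memory_region(slot, guest_phys_addr, memory_size, userspace_addr, flags=0):
    """Build the argument bytes for one KVM memory slot (hypothetical helper)."""
    return struct.pack(
        REGION_FORMAT, slot, flags, guest_phys_addr, memory_size, userspace_addr
    )


# Illustrative values only: slot 0, guest physical address 0, 200MB of guest
# memory, and a made-up host virtual address for the snapshot mapping.
region = pack_memory_region(
    slot=0,
    guest_phys_addr=0,
    memory_size=200 * 1024 * 1024,
    userspace_addr=0x7F0000000000,
)
```

Nothing in this structure describes where the memory came from. KVM only sees an address range, so the CoW behavior is entirely a property of how the host process mapped it.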

The VM starts in microseconds because there is no data to copy. The first instruction executes immediately. As the guest writes to memory, the kernel’s CoW machinery hands out fresh pages on demand, one page fault at a time.

Reconstructing CPU and Device State

Memory is the bulk of the cost, but a full VM restore also requires loading CPU state. Firecracker snapshots serialize the vCPU registers, FPU state, LAPIC state, and the full KVM-managed MSR and segment register state. This is restored via KVM_SET_REGS, KVM_SET_SREGS, KVM_SET_FPU, and related ioctls. The device state, including the virtio queue positions and network MAC configuration, is deserialized and replayed through Firecracker’s device model.

This part of the restore is computationally cheap: you’re writing a few kilobytes of state into kernel data structures. It’s a fixed cost regardless of guest memory size, and it completes in tens of microseconds on modern hardware.

Combined with the CoW memory trick, the total startup path becomes: create a new KVM VM (a fresh file descriptor), configure memory slots pointing at the CoW mapping, restore CPU and device state, and unblock the vCPU execution thread. That is well under a millisecond for a modest VM configuration.

Prior Art: CRIU, Lambda SnapStart, and QEMU

CRIU (Checkpoint/Restore in Userspace) has done process-level snapshot-restore on Linux since around 2012. It serializes the full state of a running process: memory maps, open file descriptors, sockets, pipe state, and signal handlers. Restore creates a new process that continues execution from exactly where the checkpoint was taken. CRIU also supports lazy page restore via userfaultfd, which lets the restore process begin execution before all pages have been loaded, with missing pages faulted in on demand from the checkpoint image. This is conceptually close to what zeroboot achieves, but at the process level rather than the VM level.

The VM level offers stronger isolation guarantees. A sandboxed process shares the host kernel with all other processes; a VM does not. For untrusted code execution, this distinction is fundamental.

AWS Lambda SnapStart applies the same insight to JVM-based Lambda functions. The JVM is initialized at deploy time, a snapshot is taken, and cold starts are served by restoring that snapshot. Amazon reported cold start improvements of up to 90% for Java functions. The underlying implementation uses Firecracker snapshots, so it shares the same mechanism family as zeroboot, though Lambda SnapStart is opaque to users and focused specifically on the JVM lifecycle.

CRaC (Coordinated Restore at Checkpoint) takes a different angle on the same Java cold-start problem, building checkpoint-restore awareness directly into the JVM and requiring application code to participate via lifecycle hooks to handle resources that cannot be safely snapshotted (network connections, file handles, etc.).

QEMU has had savevm and loadvm since the early 2000s for live migration and development snapshots. The difference from zeroboot’s approach is that QEMU’s restore path copies the snapshot memory into the VM’s address space upfront rather than using CoW semantics for sharing across multiple VMs. It was designed for single-VM persistence, not for spawning hundreds of identical VMs cheaply.

Cloudflare Workers and Deno Deploy take a fundamentally different approach: instead of isolating at the VM level, they use V8 isolates, which share a process and kernel but isolate JavaScript execution contexts. The isolation is shallower and depends on V8’s security model, but startup times are microseconds because there is no OS to boot.

The Real Trade-offs

CoW VM forking is not free. The first write to any shared page triggers a page fault, which means early write-heavy workloads pay extra latency. If your sandboxed code immediately initializes a large array or fills a buffer, those writes will cause a burst of page faults before execution proceeds at full speed. The latency is distributed rather than front-loaded, which is often preferable for interactive use cases but can surprise benchmarks that measure total execution time for write-heavy tasks.

Memory pressure scales with divergence. When VMs share CoW pages, the physical memory cost of running N VMs is roughly the snapshot size plus the sum of the pages each VM has dirtied. If your workloads are read-heavy, this is excellent; you can run many VMs cheaply. If every VM immediately dirties most of the snapshot memory, the CoW benefit evaporates and you end up using roughly N times the snapshot memory, the same as giving each VM its own full copy.
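The cost model above is simple enough to write down. This is a back-of-the-envelope sketch, not a measurement: the 200MB snapshot size is the figure used earlier in this post, while the VM count and the 10MB dirtied per VM are assumptions picked for illustration.

```python
def cow_memory_mb(snapshot_mb, dirtied_mb_per_vm):
    """Approximate physical memory with CoW sharing: one snapshot plus dirty pages."""
    return snapshot_mb + sum(dirtied_mb_per_vm)


def full_copy_memory_mb(snapshot_mb, vm_count):
    """Approximate physical memory if every VM gets its own copy of the snapshot."""
    return snapshot_mb * vm_count


# 100 VMs forked from a 200MB snapshot, each dirtying an assumed 10MB.
shared = cow_memory_mb(200, [10] * 100)      # 200 + 100 * 10 = 1200
copied = full_copy_memory_mb(200, 100)       # 200 * 100 = 20000

# Worst case: every VM dirties the whole snapshot.
worst = cow_memory_mb(200, [200] * 100)      # 200 + 100 * 200 = 20200
```

In the worst case the shared figure slightly exceeds the full-copy figure, because the original snapshot pages still have to live somewhere in addition to every VM's private copies.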

Snapshot hygiene matters significantly. The snapshot captures every piece of state the VM had at the moment of freezing: open file descriptors, entropy pool state, heap layout, cached data from previous initialization steps. Any randomness seeded at interpreter startup is the same across all VMs spawned from that snapshot. A module that calls random.seed() or reads os.urandom() at import time, before the snapshot is taken, leaves identical PRNG state in every fork, so each fork produces the same sequence until it is explicitly re-seeded after restore. This is a correctness concern that must be addressed in the execution harness.
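The randomness problem is easy to reproduce without a VM. The sketch below simulates a fork by cloning a PRNG's state, then re-seeds each clone. The helpers reseed_after_restore and fork_from are hypothetical names, and the entropy bytes are placeholders; a real harness would need to obtain fresh entropy from outside the snapshot, for example from the host, and how zeroboot handles this is not covered here.

```python
import random


def reseed_after_restore(rng, fresh_entropy):
    """Re-seed a PRNG with entropy supplied after the snapshot was restored."""
    rng.seed(fresh_entropy)


def fork_from(snapshot_state):
    """Simulate a VM fork: a new PRNG that starts from the snapshotted state."""
    rng = random.Random()
    rng.setstate(snapshot_state)
    return rng


# PRNG state as it was "at snapshot time".
snapshot_state = random.Random(2026).getstate()

vm_a = fork_from(snapshot_state)
vm_b = fork_from(snapshot_state)

# Both forks inherit the same state, so their first draws are identical.
same_before = vm_a.random() == vm_b.random()

# Placeholder entropy; in practice this must differ per fork and must not
# come from state that was itself captured in the snapshot.
reseed_after_restore(vm_a, b"hypothetical-entropy-for-vm-a")
reseed_after_restore(vm_b, b"hypothetical-entropy-for-vm-b")
differ_after = vm_a.random() != vm_b.random()
```

The same reasoning applies to any other PRNG loaded into the snapshot, including numpy's, and to values such as tokens or temporary names that were already generated before the freeze.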

Network and storage device state in the snapshot also requires careful handling. Firecracker’s snapshot format records virtio queue state, but restoring that state into a new network context requires either resetting the device or coordinating with the host-side network setup to avoid descriptor ring confusion.

The Broader Pattern

What zeroboot demonstrates concretely is that the initialization cost and the isolation cost of a VM sandbox can be almost completely separated. The initialization happens once per snapshot version. The isolation cost per execution is a handful of syscalls and a KVM file descriptor, not a full boot sequence.

This pattern is not new in theory. Fork-based server architectures like Gunicorn’s prefork model have exploited OS-level CoW for decades: a parent process imports application code, then forks worker processes that share the initialized memory until they diverge. Zeroboot is applying that same pattern one level down, at the hypervisor boundary rather than the process boundary, which preserves the VM-level isolation that makes untrusted code execution safe.
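For comparison, the process-level version of the pattern fits in a few lines. This is a minimal sketch of the prefork idea on a Unix-like system, not Gunicorn's implementation: the parent initializes some state, forks a worker, and the worker's write stays in the worker. The function name prefork_demo and the table contents are illustrative.

```python
import os


def prefork_demo():
    # "Expensive" initialization done once, in the parent.
    table = bytearray(b"init" * 1024)

    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Worker: starts out sharing the parent's pages.
        os.close(read_end)
        table[0:4] = b"work"  # the write is private to the worker
        os.write(write_end, bytes(table[0:4]))
        os._exit(0)

    # Parent: collect what the worker saw, then check its own copy.
    os.close(write_end)
    worker_view = os.read(read_end, 4)
    os.close(read_end)
    os.waitpid(pid, 0)
    return bytes(table[0:4]), worker_view


if __name__ == "__main__":
    print(prefork_demo())
```

The parent still sees its original bytes while the worker sees its own write. The VM-level approach gives each sandbox the same kind of private view, with a hypervisor boundary instead of a shared kernel.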

For anyone building code execution infrastructure, the lesson is worth internalizing. The question is not how to boot a VM faster; it is how to avoid booting at all. Snapshots plus CoW forking is currently the most practical answer at the VM isolation level, and projects like zeroboot make the mechanics concrete and reproducible.
