The fork() Insight Applied to VMs: How CoW Memory Snapshots Eliminate Sandbox Cold Starts
Source: hackernews
The serverless cold start problem has a familiar shape. Boot a Linux VM, load a Python interpreter, import numpy, then finally run the function. The VM boot alone costs you 100-150ms on Firecracker, which is already the fastest microVM runtime in production. The Python interpreter adds another 50ms. Numpy’s import time is notorious enough to warrant its own benchmarks. By the time you are ready to execute user code, you have spent somewhere between one and three seconds just arriving.
Adam Miribyan’s zeroboot approaches this from a different direction. Boot the VM once, with Python and numpy already imported. Snapshot the entire VM state to RAM. Then, for every subsequent execution, create a new KVM VM instance backed by copy-on-write memory derived from that snapshot. Startup time: sub-millisecond.
This is the fork() idea applied to hardware virtualization, and it is older than it looks.
What CoW Memory Forking Actually Means at the KVM Level
When you run a Firecracker microVM, the guest’s physical memory is just a region of anonymous memory in the host process. Firecracker calls mmap(MAP_ANONYMOUS | MAP_PRIVATE) to allocate a chunk, then registers it with KVM via ioctl(KVM_SET_USER_MEMORY_REGION). From that point, KVM’s memory handling is essentially a translation layer: guest physical addresses map to host virtual addresses, which map to host physical pages through the normal Linux page tables.
This is the leverage point. Because the guest memory is ordinary Linux virtual memory, it inherits all of Linux’s virtual memory semantics, including copy-on-write.
When you want to fork a VM snapshot, you create a new MAP_PRIVATE mapping of the memory object that holds the snapshot, for example a memfd or a file on tmpfs. A fresh MAP_ANONYMOUS mapping cannot do this, because it always starts zero-filled rather than backed by existing pages. At the host level, Linux treats the privately mapped pages as copy-on-write: the snapshot and every new VM share the same physical pages until one of them writes. When the guest running in the new VM writes to a page, the write faults to the host, the host page fault handler sees a CoW fault, allocates a new physical page, copies the content, updates the page table, and resumes the guest. The guest never notices anything happened.
The net result is that a new “VM” starts with no private memory of its own and no boot time. It shares the entire base snapshot until execution diverges. Pages that are only read, such as Python bytecode, the numpy shared library, and untouched interpreter state, are never copied. Only pages written during execution get their own private copies.
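The sharing behaviour described above can be seen without KVM at all, since it is a property of the host mapping. The following is a minimal sketch, not zeroboot's code: it assumes a Unix host, uses a small buffer as a stand-in for the snapshot, and the names make_snapshot and fork_memory are hypothetical.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
SNAPSHOT_PAGES = 4  # tiny stand-in for a real guest memory image


def make_snapshot() -> int:
    """Return a file descriptor holding a stand-in 'snapshot' image."""
    try:
        # In-RAM file on Linux; no disk is involved.
        fd = os.memfd_create("snapshot")
    except (AttributeError, OSError):
        # Fallback for hosts without memfd: an unlinked temporary file.
        fd, path = tempfile.mkstemp()
        os.unlink(path)
    os.write(fd, b"S" * (PAGE * SNAPSHOT_PAGES))
    return fd


def fork_memory(fd: int) -> mmap.mmap:
    """Map the snapshot privately: reads share pages, writes are copied."""
    return mmap.mmap(
        fd,
        PAGE * SNAPSHOT_PAGES,
        flags=mmap.MAP_PRIVATE,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )


snapshot_fd = make_snapshot()
vm_a = fork_memory(snapshot_fd)
vm_b = fork_memory(snapshot_fd)

# "Guest A" writes to its first page; the kernel copies that page for A only.
vm_a[0:5] = b"dirty"

a_first = bytes(vm_a[0:5])
b_first = bytes(vm_b[0:5])
base_first = os.pread(snapshot_fd, 5, 0)
print(a_first, b_first, base_first)
```

The write through vm_a is visible only in vm_a; vm_b and the snapshot itself still hold the original bytes. A real VMM would additionally register each mapping with KVM as guest memory, which this sketch leaves out.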
The Linux userfaultfd subsystem, particularly the write-protect mode (uffd-wp) that landed in meaningful form in kernel 5.7 and was refined through 5.11, gives you an explicit userspace-controlled mechanism for this. You register a memory region with userfaultfd, mark pages write-protected, and receive notifications when writes occur, at which point your handler can copy the page and allow the write to proceed. This gives more control than relying purely on MAP_PRIVATE CoW: you can track which pages were dirtied, lazily populate the new VM’s memory from the snapshot, and implement custom eviction policies.
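To make the handler's job concrete, here is a toy model in plain Python of what such a write-fault handler does: copy the page on first write and remember that it is dirty. It is a simulation only and does not call userfaultfd; the class name TrackedGuestMemory and the 4096-byte page size are assumptions for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for the model


class TrackedGuestMemory:
    """Toy model of a write-protect fault handler; not real userfaultfd."""

    def __init__(self, snapshot: bytes):
        self.snapshot = snapshot  # shared, read-only base image
        self.private = {}         # page number -> private bytearray copy
        self.dirty = set()        # page numbers written since the fork

    def _locate(self, addr: int, length: int):
        page, offset = divmod(addr, PAGE_SIZE)
        if offset + length > PAGE_SIZE:
            raise ValueError("access crosses a page boundary")
        return page, offset

    def read(self, addr: int, length: int) -> bytes:
        page, offset = self._locate(addr, length)
        if page in self.private:
            return bytes(self.private[page][offset:offset + length])
        # Untouched page: serve the read straight from the shared snapshot.
        start = page * PAGE_SIZE + offset
        return self.snapshot[start:start + length]

    def write(self, addr: int, data: bytes) -> None:
        page, offset = self._locate(addr, len(data))
        if page not in self.private:
            # First write to this page: the "fault". Copy it, mark it dirty.
            start = page * PAGE_SIZE
            self.private[page] = bytearray(self.snapshot[start:start + PAGE_SIZE])
            self.dirty.add(page)
        self.private[page][offset:offset + len(data)] = data


snapshot = bytes(PAGE_SIZE * 8)  # eight zero-filled pages
mem = TrackedGuestMemory(snapshot)
mem.write(PAGE_SIZE * 2 + 10, b"hi")
mem.write(PAGE_SIZE * 5, b"x")
print(sorted(mem.dirty), len(mem.private) * PAGE_SIZE)
```

After the two writes, only pages 2 and 5 have private copies, so the instance's private footprint is two pages rather than eight. The dirty set is what a real handler could use for accounting or eviction decisions.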
Why Firecracker Snapshot Restore Does Not Already Solve This
Firecracker has supported snapshot and restore since version 0.23. You can snapshot a running VM to disk, then restore it. AWS Lambda SnapStart for Java uses a similar mechanism at a higher level, snapshotting the initialized JVM state before first invocation.
The problem with disk-based snapshot restore is that it still costs time proportional to the amount of memory you need to restore. Even with on-demand paging, where pages are loaded from the snapshot file lazily as the guest accesses them, the first accesses still incur disk I/O. The Firecracker team’s own documentation acknowledges restore times of 150ms or more for VMs with realistic memory footprints. The Snapfaas paper from Stanford, published at USENIX ATC 2020, measured Firecracker snapshot restores at around 5ms for minimal VMs and significantly more for Python workloads once you account for the working set.
In-memory CoW forking skips disk entirely. The snapshot lives in RAM. Creating a new VM instance is a mmap call and a handful of KVM ioctls. The guest memory starts fully mapped and accessible without a single disk read. That is where the sub-millisecond figure comes from: there is almost nothing to do.
Prior Art and the Convergence This Represents
The idea of forking an initialized runtime to avoid repeated startup costs is genuinely old. This use of fork() goes back at least to the Apache prefork model and to Unicorn, the Ruby server widely used for Rails applications, which forks after loading the full application so that worker processes share pre-loaded code pages through normal process CoW semantics.
For sandboxed execution specifically, Google’s V8 isolates, which power Cloudflare Workers and Deno Deploy, achieve sub-millisecond cold starts partly by keeping a snapshot of the initialized V8 heap and restoring it via memcpy for each new isolate. The technique is called V8 heap snapshots and has been in production since at least 2016. The constraint is obvious: it only works for JavaScript and WebAssembly.
The Catalyzer paper from ASPLOS 2020, from researchers at Shanghai Jiao Tong University, describes a nearly identical approach applied to gVisor. They call it “sfork” (sandbox fork) and report sub-millisecond function startup times in the best case by forking an initialized gVisor sandbox rather than booting a new one. The key difference from zeroboot is that gVisor uses a process-based isolation model, so the forking mechanism maps more directly onto Linux’s fork() semantics without needing to coordinate with KVM.
What zeroboot represents is applying this same insight to hardware VM isolation, which provides a stronger security boundary than process-level isolation. You get the security properties of a separate KVM guest, including a separate kernel, no shared kernel state, and hardware-enforced memory isolation, with startup costs approaching those of a process fork.
The Memory Cost Is the Real Constraint
The sub-millisecond headline obscures the actual engineering constraint: memory. A Python process with numpy imported occupies somewhere between 150MB and 300MB of RSS, depending on which parts of numpy have been paged in. If you are running 100 concurrent executions and each one dirties an average of 50MB of pages before exiting, your memory pressure is 5GB just for dirty pages, on top of the shared snapshot. On a host with 64GB of RAM you might support a few hundred concurrent functions before you start thrashing.
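The arithmetic is easy to rerun with your own numbers. This is a back-of-the-envelope sketch using the figures above and decimal units (1GB = 1000MB); the function names are hypothetical, and the result is a ceiling that ignores the host kernel, per-VM overhead, and any headroom needed to avoid thrashing.

```python
def dirty_pages_gb(concurrent_vms: int, dirty_mb_per_vm: float) -> float:
    """Total private (dirtied) memory across all forked VMs, in GB."""
    return concurrent_vms * dirty_mb_per_vm / 1000


def max_concurrent_vms(host_ram_gb: float, snapshot_mb: float,
                       dirty_mb_per_vm: float) -> int:
    """Upper bound on forks that fit once the shared snapshot is resident."""
    usable_mb = host_ram_gb * 1000 - snapshot_mb
    return int(usable_mb // dirty_mb_per_vm)


print(dirty_pages_gb(100, 50))          # 100 VMs, 50MB dirtied each
print(max_concurrent_vms(64, 300, 50))  # 64GB host, 300MB snapshot
```

The first call reproduces the 5GB figure. The second gives a theoretical ceiling of 1274 instances on a 64GB host, which is why the practical number, once real overheads are subtracted, lands in the hundreds.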
This is the same constraint that limits V8 isolates: Cloudflare Worker memory limits are 128MB per isolate partly because the economics of sharing a host require it.
The other pressure is snapshot staleness. The zeroboot snapshot captures Python and numpy at a specific version, with a specific set of pre-imported modules. If a function imports additional modules at runtime, those imports happen inside the forked VM and you pay the full import cost. The snapshot only accelerates what was included at snapshot time.
These are engineering constraints rather than fundamental objections. Tiered snapshotting, where you maintain multiple base snapshots at different levels of initialization depth, is a straightforward extension. REAP, from ASPLOS 2021, explores a related idea: it records a function's working set and prefetches those pages, minimizing the cold pages that need to be faulted in on first access.
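One way tiered snapshotting could work is to pick, per request, the base snapshot that already contains the most of what the function imports. The sketch below is purely illustrative: the tier names, their module sets, and the selection policy are all assumptions, not something zeroboot implements.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical tiers: snapshot name -> modules imported before snapshotting.
SNAPSHOT_TIERS = {
    "python-bare": frozenset(),
    "python-numpy": frozenset({"numpy"}),
    "python-numpy-pandas": frozenset({"numpy", "pandas"}),
}


def pick_snapshot(required: Set[str]) -> Tuple[str, List[str]]:
    """Choose the tier covering the most required modules.

    Ties go to the smaller tier. Returns the tier name and the modules
    that would still have to be imported inside the forked VM.
    """
    def rank(item: Tuple[str, FrozenSet[str]]) -> Tuple[int, int]:
        _, preloaded = item
        return (-len(preloaded & required), len(preloaded))

    name, preloaded = min(SNAPSHOT_TIERS.items(), key=rank)
    return name, sorted(required - preloaded)


print(pick_snapshot({"numpy", "scipy"}))
print(pick_snapshot({"requests"}))
```

A function needing numpy and scipy would fork from the numpy tier and pay only for the scipy import, while one needing only requests falls back to the bare interpreter tier. Each extra tier is another resident snapshot, so the memory constraint above limits how many tiers are worth keeping.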
What This Unlocks
The practical implication is that true per-request VM isolation becomes feasible without the cold start tax. Right now, serverless platforms make an explicit trade-off between isolation and startup latency. AWS Lambda reuses execution environments for warm invocations precisely because cold starting a new VM for every request is too expensive. That reuse is a security boundary relaxation: state from one invocation can leak into the next through the filesystem, environment variables, or leftover process state.
With CoW VM forking in RAM, you could allocate a fresh isolated VM for every single request, pay nothing for the isolation, and still achieve the latency profile of a reused warm environment. The security model becomes substantially cleaner.
Zeroboot is a proof of concept rather than a production system. It is missing network namespace setup, cgroup accounting, snapshot lifecycle management, and eviction policies that a real implementation would require. But the core mechanism, creating new KVM guests backed by CoW memory from an in-RAM snapshot, is sound and demonstrated. The latency numbers are real.
The interesting question is whether the memory economics and operational complexity of maintaining in-RAM snapshots at scale make this viable in production, or whether it stays in the category of techniques that hyperscalers quietly implement internally while the rest of the ecosystem waits for someone to build the plumbing around it. Given that Firecracker itself took years to go from internal AWS infrastructure to an open-source project, that wait might be shorter than it seems.