Portable MicroVMs With Subsecond Coldstarts: What smolvm Is Actually Doing
Source: hackernews
The serverless and edge computing world has been quietly wrestling with the same tension for years: you want the isolation guarantees of a virtual machine, but you cannot afford the startup latency. smolvm landed on Hacker News recently promising subsecond coldstarts with portable virtual machines, and the response was warm enough (276 points, 91 comments) to warrant a closer look at what that actually means technically and where it fits in a crowded ecosystem.
The Coldstart Problem Is Not Simple
When people talk about VM coldstart latency, they often conflate several distinct steps. There is kernel boot time, device initialization, userspace init, and then application startup. Each of these is a separate optimization target, and different projects attack different parts of the chain.
A stock QEMU/KVM virtual machine running a full Linux distribution boots in somewhere between five and thirty seconds, depending on the rootfs size and how much init work the guest OS does. That is fine for long-lived workloads, but completely untenable for serverless functions or any system where you are spawning VMs per-request.
Firecracker, AWS’s microVM hypervisor written in Rust, changed expectations when it shipped. By stripping the virtual device model down to a minimal set (virtio-net, virtio-block, a serial port, and not much else), Firecracker can bring a minimal Linux guest to readiness in roughly 125ms. That is the number AWS published for Lambda’s underlying infrastructure. In practice, with a real rootfs and application, you are looking at 300-800ms total before the process is ready to serve traffic. Still subsecond if your guest image is small and your init is lean, but tight.
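The gap between the 125ms boot figure and the 300-800ms readiness figure only shows up if you measure to the point where the workload accepts traffic, not to the point where the kernel finishes booting. Here is a minimal sketch of that measurement, assuming the guest exposes a TCP port reachable from the host. `time_to_ready` and its parameters are hypothetical names for illustration, not part of Firecracker or smolvm:

```python
import socket
import time


def time_to_ready(host: str, port: int, timeout_s: float = 5.0,
                  poll_interval_s: float = 0.005) -> float:
    """Return seconds until a TCP connect to (host, port) succeeds.

    Call this right after asking the VMM to start the guest, so the result
    covers kernel boot, init, and application startup together.
    Raises TimeoutError if the port never accepts a connection.
    """
    start = time.monotonic()
    deadline = start + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something in the guest is listening.
            with socket.create_connection((host, port), timeout=0.05):
                return time.monotonic() - start
        except OSError:
            # Refused or timed out: the guest is not ready yet, poll again.
            time.sleep(poll_interval_s)
    raise TimeoutError(f"{host}:{port} not ready after {timeout_s}s")
```

A successful connect only proves the listener is up. A stricter harness would send a real request and wait for a valid response.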
At the other extreme, WebAssembly runtimes like Wasmtime and WasmEdge can instantiate a module in under a millisecond. Cloudflare’s Workers platform leans on V8 isolates for similar sub-10ms startup. The tradeoff is obvious: you’re constrained to code that compiles to WASM or runs inside a JavaScript engine. Arbitrary Linux binaries are not in scope.
What Portable Means Here
The “portable” claim in smolvm’s pitch is worth unpacking because it is doing real work. Most microVM solutions are KVM-only, which means Linux hosts with hardware virtualization support. Firecracker does not run on macOS. Cloud Hypervisor has the same constraint. Running these in CI on macOS runners, or deploying them to edge nodes where you might be on a Windows host with Hyper-V, has historically meant reaching for a heavier compatibility layer.
Portability in this context generally means one of two things: a VM abstraction layer that maps to KVM on Linux, Hypervisor.framework on macOS, and Hyper-V on Windows; or userspace emulation for cases where hardware virtualization is not available. The former is the approach libkrun takes, providing a lightweight VM runtime that wraps platform-native hypervisor APIs (KVM on Linux and Hypervisor.framework on macOS). The latter is what QEMU’s TCG mode does, at significant performance cost.
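The abstraction-layer idea reduces to a dispatch on the host platform with emulation as the fallback. This is an illustrative sketch only: the backend labels and the `pick_backend` function are hypothetical, and it says nothing about how smolvm or libkrun actually select a backend:

```python
import platform
from typing import Optional

# Host OS name (as returned by platform.system()) -> native hypervisor API.
NATIVE_BACKENDS = {
    "Linux": "kvm",
    "Darwin": "hypervisor.framework",
    "Windows": "hyper-v",
}


def pick_backend(system: Optional[str] = None,
                 hw_virt_available: bool = True) -> str:
    """Choose a hypervisor backend label for the given host OS.

    Falls back to software emulation (the slow path, comparable in spirit
    to QEMU's TCG mode) when the OS has no known native backend or
    hardware virtualization is unavailable.
    """
    system = system or platform.system()
    backend = NATIVE_BACKENDS.get(system)
    if backend is None or not hw_virt_available:
        return "software-emulation"
    return backend
```

The hard part is not the dispatch. It is making every backend behind it expose the same boot and memory-mapping behavior.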
Getting subsecond coldstarts while maintaining cross-platform portability is a genuine engineering challenge. The fast path for boot latency almost always involves tight coupling to Linux-specific interfaces, such as the KVM_CREATE_VM ioctl, and to memory mapping tricks that do not translate cleanly across hypervisor backends.
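To make that coupling concrete, this is roughly what talking to KVM looks like from userspace. It is a Linux-only sketch: the ioctl numbers are the `_IO(0xAE, nr)` encodings used on x86-64 and arm64 (other architectures encode ioctls differently), and `probe_kvm` is a hypothetical helper name:

```python
import fcntl
import os
from typing import Optional

KVMIO = 0xAE
# _IO(KVMIO, nr) on x86-64/arm64: no direction bits, no size, just type and nr.
KVM_GET_API_VERSION = (KVMIO << 8) | 0x00
KVM_CREATE_VM = (KVMIO << 8) | 0x01


def probe_kvm(path: str = "/dev/kvm") -> Optional[int]:
    """Return the KVM API version if a VM can be created, else None."""
    try:
        kvm_fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None  # no KVM device, or no permission to open it
    try:
        version = fcntl.ioctl(kvm_fd, KVM_GET_API_VERSION, 0)
        # KVM_CREATE_VM returns a new file descriptor representing the VM.
        vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)
        os.close(vm_fd)
        return version
    except OSError:
        return None
    finally:
        os.close(kvm_fd)
```

Nothing here has a direct equivalent on macOS or Windows. A portable runtime has to hide the device file, the ioctl numbers, and the file-descriptor-per-VM model behind its own interface.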
Snapshot-Restore as the Fast Path
The most reliable technique for hitting aggressive coldstart targets without sacrificing compatibility is snapshot-restore. Instead of booting a VM from scratch on each invocation, you boot it once, checkpoint the memory and CPU state at a point where the application is initialized and ready, then restore that snapshot on demand.
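The boot-once, restore-many pattern can be shown with a toy model. This is a simulation, not a hypervisor: `ToyGuest`, `cold_boot`, `take_snapshot`, and `restore` are hypothetical names, and a deep copy stands in for what real systems do by mapping a memory file:

```python
import copy
import time
from dataclasses import dataclass, field


@dataclass
class ToyGuest:
    """Stand-in for guest memory and CPU state; not a real VM."""
    memory: dict = field(default_factory=dict)
    ready: bool = False


def cold_boot(init_cost_s: float = 0.05) -> ToyGuest:
    """Simulate the expensive path: kernel boot, init, application warm-up."""
    time.sleep(init_cost_s)  # placeholder for the real boot work
    return ToyGuest(memory={"app_cache": list(range(1000))}, ready=True)


def take_snapshot(guest: ToyGuest) -> ToyGuest:
    """Checkpoint the guest once it is initialized and ready."""
    return copy.deepcopy(guest)


def restore(snapshot: ToyGuest) -> ToyGuest:
    """Create a fresh, independent instance from the checkpoint."""
    return copy.deepcopy(snapshot)


# Pay the boot cost once, then serve every invocation from the snapshot.
snapshot = take_snapshot(cold_boot())
instance_a = restore(snapshot)
instance_b = restore(snapshot)
```

Each restored instance starts in the ready state without repeating the boot work, and writes in one instance do not leak into another.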
Firecracker supports this natively. Restoring a Firecracker snapshot takes roughly 8-20ms depending on memory footprint. Fly.io’s Machines product, which runs on Firecracker, builds on the same idea. The architectural consequence is that your VM image is no longer just a disk image; it includes a memory snapshot, which complicates the build pipeline and the storage model.
Snapshot-restore has subtleties that bite you in production. Any state that should not be shared across instances (entropy pools, network connections, monotonic clocks) needs to be re-initialized after restore. Linux 5.18 and later includes a driver for the Virtual Machine Generation ID (VMGenID) device, which lets the hypervisor signal a restored guest so the kernel can reseed its random number generator. That does not cover state held in userspace, and entropy re-seeding after a snapshot restore is a known footgun that has affected real systems.
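The entropy footgun is easy to reproduce in miniature: two instances restored from one snapshot carry identical RNG state until something reseeds them. The sketch below models a generation-ID check in the spirit of VMGenID. `ToyGuestRng` and its methods are hypothetical, and Python's `random` module is used only to make the duplication visible, not as a cryptographic RNG:

```python
import os
import random
import uuid


class ToyGuestRng:
    """Toy guest that owns an RNG and remembers its generation ID."""

    def __init__(self) -> None:
        self.generation_id = uuid.uuid4()
        self.rng = random.Random(os.urandom(32))

    def snapshot_state(self):
        # A snapshot captures the RNG state along with everything else.
        return (self.generation_id, self.rng.getstate())

    @classmethod
    def from_snapshot(cls, state) -> "ToyGuestRng":
        guest = cls.__new__(cls)  # skip __init__: state comes from the snapshot
        guest.generation_id, rng_state = state
        guest.rng = random.Random()
        guest.rng.setstate(rng_state)
        return guest

    def on_resume(self, host_generation_id: uuid.UUID) -> bool:
        """Reseed if the host reports a new generation; return True if reseeded."""
        if host_generation_id == self.generation_id:
            return False
        self.generation_id = host_generation_id
        self.rng.seed(os.urandom(32))
        return True


state = ToyGuestRng().snapshot_state()
clone_a = ToyGuestRng.from_snapshot(state)
clone_b = ToyGuestRng.from_snapshot(state)
# Without a reseed, both clones would emit the same "random" sequence.
clones_start_identical = clone_a.rng.getstate() == clone_b.rng.getstate()
# The host hands each restored instance a fresh generation ID.
clone_a.on_resume(uuid.uuid4())
clone_b.on_resume(uuid.uuid4())
```

The same pattern applies to anything derived from pre-snapshot entropy, such as session keys or UUID generators seeded at startup.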
The Unikernel Alternative
The other approach to fast VM boot, which smolvm’s architecture may draw from, is unikernels. A unikernel collapses the kernel and application into a single executable linked against a minimal OS library. There is no init system, no shell, no unused device drivers. The guest is exactly the application and the kernel code it needs.
MirageOS demonstrated this concept most thoroughly in the functional programming world. Unikraft has been doing serious engineering to make the approach practical for general-purpose workloads, with boot times in the 1-10ms range for minimal guests. The Unikraft paper from EuroSys 2021 is worth reading if you want to understand how much of a traditional kernel can be stripped before you lose meaningful compatibility.
The compatibility problem with unikernels is significant. Most real applications have implicit assumptions about POSIX semantics, filesystem layout, and signal handling that are annoying to port. Unikraft addresses this with a compatibility layer, but it adds complexity.
Where smolvm Sits
From the project description and community discussion, smolvm appears to occupy the space between Firecracker-style microVMs and unikernel approaches: a minimal hypervisor with a compact guest image format, designed from the start for fast snapshot-restore cycles and cross-platform host support. The “smol” framing aligns with a philosophy of stripping the stack to what is actually necessary.
The 276 points on the HN thread suggest the developer community finds this combination appealing. The practical demand is real: teams building serverless platforms, edge runtimes, or isolated execution environments for untrusted code keep running into the same problem. Firecracker is excellent but Linux-only and operationally complex to deploy outside AWS’s infrastructure. WASM is fast but does not run arbitrary Linux binaries. Something that bridges these with real portability and subsecond startup would fill a genuine gap.
Performance Numbers in Context
To frame what “subsecond” means in practice:
- QEMU full boot: 5,000-30,000ms
- Firecracker cold boot (minimal Linux): ~125ms
- Firecracker with real rootfs: 300-800ms
- Firecracker snapshot restore: 8-20ms
- Kata Containers (OCI-compatible VMs): ~1,000-2,000ms
- Docker container start: 50-200ms
- Unikraft minimal guest: 1-10ms
- WASM module instantiation: <1ms
Claiming subsecond cold (not snapshot-restore) boot puts smolvm in Firecracker’s league or better. Whether that is achieved through aggressive kernel trimming, a purpose-built minimal guest OS, or some snapshot-restore trick labeled as a “coldstart” is the question that matters for evaluating the claim.
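The figures in the list above can be checked against a latency budget mechanically. The ranges below are copied from that list, with the single-point and "<1ms" entries turned into ranges as an assumption, and `within_budget` is a hypothetical helper:

```python
# (low_ms, high_ms) startup ranges, taken from the list above.
STARTUP_MS = {
    "QEMU full boot": (5000, 30000),
    "Firecracker cold boot (minimal Linux)": (125, 125),
    "Firecracker with real rootfs": (300, 800),
    "Firecracker snapshot restore": (8, 20),
    "Kata Containers": (1000, 2000),
    "Docker container start": (50, 200),
    "Unikraft minimal guest": (1, 10),
    "WASM module instantiation": (0, 1),  # listed as "<1ms"
}


def within_budget(budget_ms: float) -> list:
    """Return options whose worst-case startup is under the budget, fastest first."""
    fits = [(high, name) for name, (_, high) in STARTUP_MS.items()
            if high < budget_ms]
    return [name for _, name in sorted(fits)]


subsecond = within_budget(1000)
```

Under a 1,000ms budget, everything except the QEMU full boot and Kata Containers qualifies, which is why "subsecond" alone does not say much about where smolvm lands in this range.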
Practical Implications
For anyone building multi-tenant infrastructure, the isolation story matters as much as the latency. Containers share a kernel, so a single kernel vulnerability can expose every tenant on the host. VMs with a proper hypervisor boundary are harder to escape from, which is why AWS Lambda and Fly.io, both built on Firecracker, accept the latency cost of hypervisor-based isolation. Cloudflare Workers made the opposite trade, relying on V8 isolates rather than a hypervisor boundary.
A genuinely portable microVM with subsecond coldstarts would make it significantly easier to build self-hosted serverless infrastructure without being locked to a specific cloud provider’s primitive. That is the actual value proposition, and it is worth watching this project develop. The implementation details, specifically how the guest image format works, how snapshot state is handled across instantiations, and what the network setup looks like, will determine whether it holds up under production workloads or stays a compelling demo.
The code is on GitHub. If this problem space is relevant to your infrastructure work, it is worth reading the source directly rather than waiting for the blog post circuit to catch up.