
Two Models of Process Containment: What the Capsicum vs seccomp Divide Reveals

Source: lobsters

Process sandboxing is one of those topics where implementation choices reveal deep philosophical differences about how security should work. Capsicum, FreeBSD’s capability framework, and seccomp, Linux’s syscall filtering mechanism, both exist to constrain what a process can do after it starts, but they take fundamentally opposite approaches to the problem.

Vivian Voss’s comparison article covers the surface-level mechanics well. What I want to dig into is the underlying security model behind each, why those models produce the trade-offs they do, and what the convergence between them in newer Linux kernels suggests about where process isolation is heading.

The Ambient Authority Problem

To understand why Capsicum exists, you need to understand what it’s reacting against. Traditional Unix processes run with ambient authority, meaning they can access any system resource that their user credentials and file permissions allow. An image editor running as your user account can, by default, open /etc/hosts, make network connections, execute other binaries, and read files you have never explicitly granted it access to. The permissions system says it’s allowed, so it can.

This creates a class of vulnerabilities related to the confused deputy problem, first described by Norm Hardy in 1988. A process with broad privileges can be tricked into performing privileged operations on behalf of malicious input. A PDF reader that happens to have network access can be exploited to exfiltrate data; a media player can be used as a pivot to read sensitive files. The process had no intention of doing those things, but ambient authority made them possible.

Capsicum’s answer is capability mode. Once a process calls cap_enter(), it voluntarily surrenders ambient authority entirely. From that point forward, the process can only interact with file descriptors it already holds or is later handed by another process, and even those are constrained by explicit rights attached to each descriptor.

#include <sys/capsicum.h>
#include <err.h>

// Before sandboxing: restrict the already-open descriptor fd
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_SEEK, CAP_FSTAT);
if (cap_rights_limit(fd, &rights) < 0)
    err(1, "cap_rights_limit");

// One-way transition into capability mode
if (cap_enter() < 0)
    err(1, "cap_enter");

// From here: no opening files by path, no new network connections, no exec by path
// Only operations permitted by the rights on held FDs

The cap_rights_t type provides fine-grained control. You can grant CAP_READ without CAP_SEEK, CAP_WRITE without CAP_FSYNC, or CAP_LOOKUP on a directory descriptor without granting write access to its contents. The original Capsicum paper by Watson et al., published at USENIX Security 2010, demonstrated this by sandboxing real applications such as tcpdump, gzip, and Chromium with relatively small code changes.
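To make the rights idea concrete, here is a small portable model of it. This is a sketch for illustration only, not FreeBSD's implementation: the R_* bits, struct model_fd, and both function names are hypothetical, and the real cap_rights_t is a richer structure. What it does mirror is the rule that rights on a descriptor can be narrowed but never widened.

```c
#include <stdint.h>

/* Hypothetical right bits for this model only; they are not the real
 * encoding used by FreeBSD's cap_rights_t. */
enum {
    R_READ  = 1u << 0,
    R_WRITE = 1u << 1,
    R_SEEK  = 1u << 2,
    R_FSTAT = 1u << 3
};

/* A stand-in for a file descriptor table entry. */
struct model_fd {
    uint32_t rights;   /* operations this descriptor still permits */
};

/* Narrow the rights on a descriptor. Asking for a right the descriptor
 * does not already hold is refused, so rights only ever shrink. */
int model_rights_limit(struct model_fd *fd, uint32_t wanted)
{
    if ((wanted & ~fd->rights) != 0)
        return -1;
    fd->rights = wanted;
    return 0;
}

/* The check made before an operation: every needed right must be held. */
int model_allowed(const struct model_fd *fd, uint32_t needed)
{
    return (fd->rights & needed) == needed;
}
```

In this model, a descriptor limited to R_READ | R_FSTAT passes a read check, fails a write check, and cannot be given R_WRITE back by a later call to model_rights_limit().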

seccomp’s Different Bet

seccomp starts from a completely different premise. Rather than asking what resources a process should be allowed to access, it asks what system calls a process should be allowed to make. The two questions sound similar but produce very different security models.

In its original form, shipped in Linux 2.6.12, seccomp was maximalist in its restriction: a process could only call read(), write(), exit(), and sigreturn(). This was essentially useless for most real applications. The useful form arrived in Linux 3.5 as seccomp-bpf, which allows processes to install Berkeley Packet Filter programs that examine each system call and return an action.

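// Using libseccomp for readable policy definitions
#include <seccomp.h>

// Default action for any syscall without an explicit rule: kill the process
// (plain SCMP_ACT_KILL only kills the calling thread)
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

// Any syscall not listed above kills the process
int rc = seccomp_load(ctx);   // 0 on success, negative errno on failure
seccomp_release(ctx);         // frees the context; a loaded filter stays active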

The BPF program runs in the kernel for every system call made by the sandboxed process. It can inspect the syscall number and the first six arguments as 64-bit integers, then return one of several actions: allow, deny with an error code, send SIGSYS, or kill the process. The kernel documentation on seccomp details the full set of return values and how they interact with signal handling.
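For a sense of what libseccomp generates underneath, here is a hand-written filter using the raw kernel interface. It is a minimal sketch, assuming Linux 3.5 or later: the function names install_demo_filter and run_filter_demo are my own, the choice of getppid() as the denied syscall is arbitrary, and a production filter would also check seccomp_data.arch before trusting the syscall number.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Install a filter that fails getppid() with EPERM and allows the rest.
 * Demo only: the architecture check is deliberately omitted. */
int install_demo_filter(void)
{
    struct sock_filter insns[] = {
        /* Load the syscall number into the accumulator. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it equals __NR_getppid fall through, else skip one instruction. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Needed so an unprivileged process is allowed to install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Run the filter in a child so the caller stays unfiltered.
 * Returns 0 if the child saw getppid() fail with EPERM. */
int run_filter_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_demo_filter() != 0)
            _exit(2);
        errno = 0;
        long rc = syscall(SYS_getppid);
        _exit((rc == -1 && errno == EPERM) ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The filter is four instructions: load, compare, and two possible return values. Everything libseccomp does is a more convenient way to emit programs of this shape.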

Where Each Model Breaks Down

Capsicum’s limitation is the one-way door. Once cap_enter() is called, there is no going back, and the process can only work with descriptors it already holds or is handed afterwards. This requires applications to be architected with sandboxing in mind from the start: acquire resources before entering capability mode, then delegate any further privileged operations to a helper process that never entered it. Chrome’s architecture on FreeBSD does exactly this: a privileged broker process holds resources and passes file descriptors to sandboxed renderer processes.
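The broker pattern rests on one primitive: passing an open descriptor between processes over a Unix-domain socket with SCM_RIGHTS. The sketch below shows that primitive on its own, using only POSIX socket calls; send_fd and recv_fd are names I chose for illustration, and a real broker would add a request protocol and policy checks around them.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open descriptor over a Unix-domain socket. Returns 0 on success. */
int send_fd(int sock, int fd)
{
    char byte = 'F';   /* one byte of ordinary data carries the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;   /* forces correct alignment of buf */
    } ctl;
    memset(&ctl, 0, sizeof(ctl));

    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor. Returns the new descriptor, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    memset(&ctl, 0, sizeof(ctl));

    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number referring to the same open file. Under Capsicum, any rights limits the broker applied before sending travel with it.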

seccomp’s limitation is that syscall numbers are a poor proxy for what a process is actually doing. You cannot tell from openat() whether the process is opening a configuration file or /etc/shadow. You cannot tell from ioctl() whether it’s a harmless terminal query or a privileged device operation. The BPF program can inspect argument values, but only as raw 64-bit integers, which means pointer arguments like filenames are opaque. The filter cannot dereference memory to check what filename is being passed to openat() without external coordination.
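A tiny portable illustration of why pointer arguments are opaque: the filter receives the pointer's numeric value, not the bytes behind it. The two function names here are hypothetical stand-ins for what a filter is given and the only kind of test it can apply.

```c
#include <stdint.h>

/* What a filter is handed for a pathname argument: the pointer's numeric
 * value widened to 64 bits. The bytes it points at are not included. */
uint64_t arg_as_filter_sees_it(const char *path)
{
    return (uint64_t)(uintptr_t)path;
}

/* The only test available on that value is an integer comparison. */
int filter_matches(uint64_t arg, uint64_t expected)
{
    return arg == expected;
}
```

Two separate buffers that both contain "/etc/shadow" produce two different values here, so a rule written against one address says nothing about the other, and nothing about the string itself.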

This creates policies that either over-allow or require complex argument matching that still cannot inspect pointer contents. Chrome’s seccomp policies, documented in the Chromium source, run to hundreds of lines and require constant maintenance as the kernel’s syscall surface changes.

Why seccomp Won on Linux

Despite Capsicum’s cleaner security model, seccomp has vastly more real-world adoption. Docker’s default seccomp profile blocks around 44 system calls. systemd’s SystemCallFilter directive lets administrators restrict service syscalls in unit files. Firefox, Chrome, and most Electron-based applications use seccomp on Linux. Capsicum, meanwhile, is primarily a FreeBSD feature with limited traction elsewhere.

Part of this is simply OS distribution: Linux runs on vastly more servers and desktops than FreeBSD. Part of it is ecosystem momentum; once Docker and systemd adopted seccomp, every container runtime and service manager followed. But there is also a philosophical match: seccomp fits the way Linux developers tend to think about security, as a series of explicit denials layered on top of permissive defaults.

Porting Capsicum to Linux has been attempted. Work from around 2014 to 2016 explored a kernel module approach, but it was never merged into mainline. The VFS changes required to properly implement capability-based file descriptor semantics in Linux are substantial, and the kernel maintainers had less motivation to pursue them when seccomp was already shipping and being adopted.

seccomp-notify as a Bridge

The most interesting recent development in seccomp is the notify mechanism, added in Linux 5.0. It addresses one of seccomp’s core weaknesses by allowing a supervisor process to intercept syscalls at runtime and make decisions with full access to process memory.

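// In the sandboxed process: install notify action for openat
seccomp_rule_add(ctx, SCMP_ACT_NOTIFY, SCMP_SYS(openat), 0);
seccomp_load(ctx);

// The notification FD is created in the process that loaded the filter;
// hand it to the supervisor, for example over a Unix socket with SCM_RIGHTS
int notif_fd = seccomp_notify_fd(ctx);

// In the supervisor: poll notif_fd, then call seccomp_notify_receive()
// to fill a struct seccomp_notif with the syscall number and raw args
// Read pointed-to data via /proc/PID/mem
// Reply with seccomp_notify_respond(): a return value, an errno,
// or SECCOMP_USER_NOTIF_FLAG_CONTINUE to let the syscall proceed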

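The supervisor can read the sandboxed process’s memory via /proc/PID/mem, check the actual filename being passed to openat(), and make a decision based on real information rather than an opaque pointer value. One caveat: the sandboxed process can change that memory after the supervisor has read it, so the safer pattern is for the supervisor to perform the operation itself on the target’s behalf rather than inspect the arguments and then let the original syscall continue. This is still more complex than Capsicum’s model, but it closes the gap considerably.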

seccomp-notify is used by container managers such as LXD to intercept selected syscalls, for example mknod and mount, and carry them out on the container’s behalf, and by tools like Sysbox to enable container-in-container setups without requiring full root privileges. The Linux kernel documentation for seccomp-notify covers the full protocol.

Performance in Practice

The performance profile of each approach reflects its implementation. Capsicum’s rights check is a table lookup on the file descriptor, effectively O(1), adding on the order of 50 to 100 nanoseconds per syscall as a rough figure that varies with hardware. No filter program runs on each call; the only other cost is the one-time cap_rights_limit() setup on each descriptor.

seccomp’s cost depends on the BPF program’s complexity. Simple policies with a few allowed syscalls run in a few hundred nanoseconds. A renderer policy that handles dozens of syscall conditions and argument checks can approach a few microseconds per syscall. In practice, Google’s engineering has described seccomp overhead as acceptable for user-facing performance, though it does appear in profiling data for syscall-intensive paths.

For most applications, neither overhead is the bottleneck. The cost of the system call itself dominates, and the filtering or capability check is a small fraction of that. Where it matters is in high-frequency syscall paths: tight loops calling read() or write() hundreds of thousands of times per second or more. At 10,000 calls per second, 100 nanoseconds of added overhead costs 1 millisecond of CPU time per second, or 0.1 percent; at a million calls per second, the same overhead costs 10 percent.

Two Security Philosophies, One Problem

What Capsicum and seccomp reveal is that process sandboxing is not a solved problem with a single correct answer. Capsicum’s capability model is formally cleaner: it is easier to reason about what a sandboxed process can do when the entire security state is captured in the file descriptors it holds and the rights attached to them. seccomp’s syscall filtering model is more practical for gradual adoption; you can add seccomp restrictions to an existing application without redesigning its resource acquisition patterns.

Both approaches have been extended over time toward what the other offers natively. seccomp-notify gives seccomp some of Capsicum’s ability to make context-aware decisions. FreeBSD’s Capsicum implementation has continued gaining finer-grained rights. Neither system is standing still.

The comparison at Voss’s blog provides a useful side-by-side walkthrough of the APIs if you want to see them compared concretely. What that comparison ultimately shows is that your OS makes the decision for you to a large extent: seccomp if you’re on Linux, Capsicum if you’re on FreeBSD. The more interesting question is whether the Linux ecosystem will eventually want the cleaner model badly enough to pursue a genuine port of Capsicum’s capability semantics into the kernel, or whether seccomp-notify’s pragmatic middle ground will prove sufficient.

Given the trajectory so far, the pragmatic answer is winning.
