
Capability Mode vs. Syscall Filtering: What Separates Capsicum from seccomp

Source: lobsters

Process sandboxing is one of those problems that looks solved on paper and turns difficult in implementation. Two mechanisms dominate the landscape: Capsicum, the capability model from FreeBSD, and seccomp, Linux’s syscall filter. This comparison puts both side by side; the deeper story is what the design philosophy behind each model costs you in practice.

Where They Come From

seccomp (secure computing mode) landed in Linux 2.6.12 in 2005, written by Andrea Arcangeli to support CPUShare, his project for renting out spare CPU cycles to run untrusted code. The original SECCOMP_MODE_STRICT allowed only four syscalls: read, write, _exit, and sigreturn. That proved too restrictive for most use cases, and seccomp found its modern form in Linux 3.5 (2012), when Will Drewry contributed SECCOMP_MODE_FILTER. This mode attaches a classic BPF program as a filter that runs on every syscall. The filter receives a seccomp_data struct containing the syscall number, architecture, instruction pointer, and up to six raw argument registers, then returns a verdict such as SECCOMP_RET_ALLOW, SECCOMP_RET_KILL_PROCESS, or SECCOMP_RET_ERRNO.
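To make the filter mechanics concrete, here is a minimal sketch of a SECCOMP_MODE_FILTER program. It assumes a Linux system with kernel headers installed. The helper names install_demo_filter and demo_in_child are made up for this illustration, and getppid is an arbitrary choice of syscall to deny. A production filter would also validate seccomp_data.arch before trusting the syscall number; that check is skipped here for brevity.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical helper: make getppid() fail with EPERM, allow everything else. */
static int install_demo_filter(void)
{
    struct sock_filter prog[] = {
        /* A = seccomp_data.nr (the syscall number) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if A == getppid, fall through to the ERRNO verdict; else skip it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof(prog) / sizeof(prog[0])),
        .filter = prog,
    };

    /* Needed so an unprivileged process is allowed to install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}

/*
 * Hypothetical demo: returns 1 if the filtered child saw EPERM,
 * 0 if the filter could not be installed here, -1 on anything unexpected.
 */
static int demo_in_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_demo_filter() != 0)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    return WEXITSTATUS(status) == 2 ? 0 : -1;
}
```

The demo forks because an installed filter cannot be removed; confining it to a child keeps the calling process unfiltered.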

Capsicum emerged from the University of Cambridge, designed by Robert Watson, Jonathan Anderson, Ben Laurie, and Kris Kennaway. Their 2010 USENIX Security paper described the model; FreeBSD 9.0 shipped it as an experimental feature, and FreeBSD 10.0 (2014) enabled it by default. The core idea: give processes a way to enter capability mode, where all resource access must flow through explicitly granted file descriptor capabilities.

The Core Design Difference

seccomp frames sandboxing as a question of which syscalls a process is permitted to invoke. Capsicum frames it as a question of which resources a process holds and what operations each fd permits.

This matters because syscall number and resource access are not the same thing. Consider read(fd, buf, len). A seccomp filter can allow or deny the read syscall globally, but it has no notion of which resource an fd refers to. The BPF program sees args[0] as a raw integer, so it can compare it against a specific fd number, but fd numbers are allocated dynamically: you cannot know at filter-install time which integer a later open() will return, and dup2() can rebind any number to a different resource. Reliably expressing “allow reads on this file, deny reads on that one” means involving SECCOMP_RET_USER_NOTIF to route the decision to a supervisor process.
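The gap between fd numbers and resources can be seen in a small sketch. The filter below allows write only when its first argument is the integer 1, and the demo then shows that the same pipe is refused under one number and accepted under another. It assumes Linux; install_fd1_only_filter and fd_number_demo are hypothetical names, and the filter again omits the arch check a real one needs.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Classic BPF loads 32-bit words, so pick the low half of the 64-bit args[0]. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define ARG0_LO (offsetof(struct seccomp_data, args[0]))
#else
#define ARG0_LO (offsetof(struct seccomp_data, args[0]) + 4)
#endif

/* Hypothetical helper: write(2) succeeds only when its fd argument equals 1. */
static int install_fd1_only_filter(void)
{
    struct sock_filter prog[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* not write? skip the next three instructions and allow */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 3),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG0_LO),
        /* fd == 1? skip the ERRNO verdict */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof(prog) / sizeof(prog[0])),
        .filter = prog,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}

/*
 * Hypothetical demo: returns 1 if the pipe was refused under its own number
 * but accepted once rebound to fd 1, 0 if no filter could be installed,
 * -1 on anything unexpected.
 */
static int fd_number_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int p[2];
        if (pipe(p) != 0)
            _exit(1);
        if (install_fd1_only_filter() != 0)
            _exit(2);
        errno = 0;
        int denied = write(p[1], "x", 1) == -1 && errno == EPERM;
        /* Same pipe, different integer: the filter cannot tell. */
        int allowed = dup2(p[1], 1) == 1 && write(1, "x", 1) == 1;
        _exit((denied && allowed) ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    return WEXITSTATUS(status) == 2 ? 0 : -1;
}
```

The verdict depends only on the integer in the register, so a filter like this constrains a slot in the fd table, not the object behind it.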

Capsicum avoids this entirely by attaching rights directly to file descriptors via cap_rights_limit(2). When the process calls read(fd, buf, len), the kernel checks whether that fd’s capability bitmask includes CAP_READ. The capability travels with the fd through dup, fork, and sendmsg, always carrying its constraints:

/* Open a file and restrict what any holder of this fd can do with it */
#include <sys/capsicum.h>
#include <err.h>
#include <fcntl.h>

int fd = open("/etc/passwd", O_RDONLY);
if (fd < 0)
    err(1, "open");

cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_FSTAT);
if (cap_rights_limit(fd, &rights) < 0)
    err(1, "cap_rights_limit");

/* From here, write(fd, ...) and lseek(fd, ...) will fail with ENOTCAPABLE */

After cap_enter() puts the process into capability mode, it can no longer call open() with absolute paths, create new sockets, or access global namespaces. The only way to reach a resource is through a pre-opened fd with appropriate rights, or through openat() relative to a directory fd carrying CAP_LOOKUP:

/* dirfd must be a directory fd opened before this point */
cap_enter(); /* irreversible */

/* Fails with ECAPMODE in capability mode */
int fd = open("/etc/something", O_RDONLY);

/* Works if dirfd carries CAP_LOOKUP */
int fd2 = openat(dirfd, "something", O_RDONLY);

The Pointer Problem

seccomp has a structural limitation that Capsicum does not: classic BPF programs cannot dereference userspace pointers. The args[] array in seccomp_data holds raw register values. For syscalls whose arguments are plain integers such as fd numbers, flags, or sizes, this is workable. But syscalls like openat(dirfd, path, flags) pass path as a pointer to userspace memory. The BPF filter sees the pointer’s address, not the string at that address.

This means seccomp cannot natively express path-based access control. It can deny openat entirely or allow it entirely, but not “allow openat only when the path argument begins with /tmp.” SECCOMP_RET_USER_NOTIF, added in Linux 5.0, allows routing the decision to a supervisor that can inspect the calling process’s memory, for example through /proc/<pid>/mem. This adds a round-trip to another process on every syscall routed that way, and the supervisor must guard against the target rewriting that memory after the check.

Capsicum’s rights are fixed when the fd is created or limited, and the check at syscall time is a bitmask comparison. No pointer chasing, no supervisor round-trip.
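The shape of that check can be modelled in a few lines of portable C. This is a toy model, not FreeBSD kernel code: the real cap_rights_t is a larger structure manipulated through cap_rights_init and cap_rights_limit, and every toy_ name below is invented for illustration. The one behaviour it borrows from the real interface is that a limit operation may only shrink the set of rights, never widen it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy rights, loosely mirroring CAP_READ, CAP_WRITE, CAP_SEEK, CAP_FSTAT. */
enum {
    TOY_READ  = 1u << 0,
    TOY_WRITE = 1u << 1,
    TOY_SEEK  = 1u << 2,
    TOY_FSTAT = 1u << 3,
};

/* A toy descriptor: just the rights mask that travels with it. */
typedef struct {
    uint32_t rights;
} toy_fd;

/* The syscall-time check: every required bit must be present. */
static bool toy_permits(const toy_fd *fd, uint32_t required)
{
    assert(fd != NULL);
    return (fd->rights & required) == required;
}

/* Limiting may only remove rights; asking for a new one is refused. */
static bool toy_limit(toy_fd *fd, uint32_t new_rights)
{
    assert(fd != NULL);
    if ((new_rights & ~fd->rights) != 0)
        return false;
    fd->rights = new_rights;
    return true;
}
```

Limiting a descriptor to TOY_READ | TOY_FSTAT makes toy_permits(&fd, TOY_WRITE) return false from then on, and a later toy_limit call that tries to add TOY_WRITE back is rejected.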

Making Capability Mode Usable with casper

A real tension in Capsicum: a process in capability mode cannot do DNS lookups, read /etc/passwd, write to syslog, or perform any task that requires opening new files or sockets. This matters because sandboxed code frequently still needs some of these services.

FreeBSD’s solution is casper(3) and libcasper. Before entering capability mode, the process forks a casper service and communicates with it via a capability-wrapped unix socket. Service capsules like cap_dns, cap_pwd, cap_grp, and cap_syslog perform privileged operations on behalf of the sandbox, returning results through the socket. tcpdump, OpenSSH, and several FreeBSD base utilities use this pattern.

seccomp’s equivalent is less standardized. Programs pre-open resources before installing the filter, maintain a hand-rolled IPC channel to a privileged parent, or use SECCOMP_RET_USER_NOTIF for a more structured handoff. Chrome on Linux implements its own sandbox IPC layer for this purpose. The casper model gives Capsicum adopters a shared, tested infrastructure rather than per-application reinvention.
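A hand-rolled broker of this kind usually comes down to a unix socket and SCM_RIGHTS fd passing. The sketch below is one possible shape, not how Chrome or casper implement it: a helper process is forked before any sandboxing, opens a path on request, and passes the resulting fd back. The names send_fd, recv_fd, and open_via_broker are hypothetical. The broker here opens whatever path it is asked for, which a real broker must never do; it would check each request against a policy first.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Buffer for one fd's worth of ancillary data, aligned for struct cmsghdr. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one descriptor over a unix socket as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    union fd_cmsg u;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };

    memset(&u, 0, sizeof(u));
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    assert(c != NULL);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns -1 if none arrived. */
static int recv_fd(int sock)
{
    char byte;
    union fd_cmsg u;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

/* Fork a one-shot broker, ask it to open `path` read-only, return the fd. */
static int open_via_broker(const char *path)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                      /* broker: stays unsandboxed */
        char req[256];
        close(sv[0]);
        ssize_t n = read(sv[1], req, sizeof(req) - 1);
        if (n <= 0)
            _exit(1);
        req[n] = '\0';
        int fd = open(req, O_RDONLY);    /* no policy check in this sketch */
        _exit((fd >= 0 && send_fd(sv[1], fd) == 0) ? 0 : 1);
    }
    close(sv[1]);
    /* A real client would install its filter or enter its sandbox here. */
    int fd = -1;
    size_t len = strlen(path);
    if (write(sv[0], path, len) == (ssize_t)len)
        fd = recv_fd(sv[0]);
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return fd;
}
```

Calling open_via_broker("/dev/null") returns a usable descriptor even though the caller never issued open itself, which is the property a sandboxed client relies on.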

Why Linux Took a Different Path

Capsicum has been ported to Linux more than once, most visibly in the patch series David Drysdale posted to the kernel mailing list in 2014, but none of the ports was merged. The obstacles are architectural. Linux’s VFS layer is tightly coupled to path-based lookup starting from a global root. Adapting capability mode to Linux would require changes throughout a large body of kernel code that assumes processes can always access the filesystem by path. FreeBSD’s codebase was more amenable to the retrofit.

Instead, Linux has layered multiple mechanisms: seccomp-bpf for syscall filtering, namespaces (mount, pid, network, user) for resource isolation, and Landlock, merged in Linux 5.13 (2021). Landlock is the closest Linux has come to Capsicum’s model: it lets unprivileged processes restrict their own filesystem access through a ruleset of paths and permitted operations, using a landlock_create_ruleset / landlock_add_rule / landlock_restrict_self sequence. Landlock’s ABI iterated through five versions across kernel releases from 5.13 through 6.10, adding cross-directory rename and link control, truncate control, TCP bind and connect restrictions, and device ioctl control along the way.
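That three-call sequence looks roughly like the sketch below. It assumes Linux kernel headers from 5.13 or later, so that linux/landlock.h and the SYS_landlock_* numbers exist. The names restrict_reads_to and landlock_demo are invented here, and /dev and / are arbitrary paths picked because they exist everywhere. A more careful program would first query the supported ABI version with the LANDLOCK_CREATE_RULESET_VERSION flag.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/landlock.h>

/*
 * Hypothetical helper: confine file and directory reads to the tree under
 * `dir`. Returns 0 on success, -1 if Landlock is unavailable, -2 otherwise.
 */
static int restrict_reads_to(const char *dir)
{
    struct landlock_ruleset_attr attr = {
        .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
                             LANDLOCK_ACCESS_FS_READ_DIR,
    };
    int ruleset = (int)syscall(SYS_landlock_create_ruleset,
                               &attr, sizeof(attr), 0);
    if (ruleset < 0)
        return -1;

    struct landlock_path_beneath_attr rule = {
        .allowed_access = attr.handled_access_fs,
        .parent_fd = open(dir, O_PATH | O_CLOEXEC),
    };
    int ok = rule.parent_fd >= 0 &&
             syscall(SYS_landlock_add_rule, ruleset,
                     LANDLOCK_RULE_PATH_BENEATH, &rule, 0) == 0 &&
             prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == 0 &&
             syscall(SYS_landlock_restrict_self, ruleset, 0) == 0;

    if (rule.parent_fd >= 0)
        close(rule.parent_fd);
    close(ruleset);
    return ok ? 0 : -2;
}

/*
 * Hypothetical demo: returns 1 if reads under /dev still work while opening
 * / is refused, 0 if Landlock is unavailable, -1 on anything unexpected.
 */
static int landlock_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int r = restrict_reads_to("/dev");
        if (r == -1)
            _exit(2);
        if (r != 0)
            _exit(1);
        int inside = open("/dev/null", O_RDONLY);
        errno = 0;
        int outside = open("/", O_RDONLY | O_DIRECTORY);
        _exit((inside >= 0 && outside == -1 && errno == EACCES) ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    return WEXITSTATUS(status) == 2 ? 0 : -1;
}
```

Unlike Capsicum's per-fd rights, the rule is anchored to a place in the filesystem hierarchy: anything opened beneath /dev after the restriction inherits the allowance, while everything else covered by the handled access types is denied.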

But Landlock’s coverage remains narrower than Capsicum’s, and the Linux mechanisms do not compose into a unified model. Chrome’s Linux sandbox combines seccomp-bpf with namespaces, keeping a setuid sandbox helper as a fallback. The Chromium port described in the original Capsicum paper, by contrast, sandboxed its renderer processes with Capsicum alone.

Performance Characteristics

seccomp runs a BPF program on every syscall invocation. A filter checking 30 syscall numbers on a program that performs a million read calls will execute that BPF program a million times. Overhead scales with filter complexity; simple filters add roughly 50 to 200 nanoseconds per syscall depending on hardware and filter length, and this compounds on syscall-heavy workloads.

Capsicum’s rights check is a bitmask AND against the fd’s capability set, costing a few nanoseconds at most. The real cost lies elsewhere, in engineering effort: restructuring a program to pre-open all required resources before calling cap_enter() can mean significant refactoring. Programs with dynamic resource acquisition patterns, ones that open files or create sockets based on runtime input, need the most adaptation.

Two Models, One Problem

The two mechanisms are not competing for the same use case on the same platform. On Linux, seccomp and, increasingly, Landlock are what you have. On FreeBSD, Capsicum is the native mechanism.

The philosophical contrast is worth sitting with. Capsicum’s design is coherent as a capability system: rights are per-resource, capabilities are first-class kernel objects, and the mode boundary is strict and irreversible. seccomp is a filter layered over the existing syscall interface, expressive within the limits of what classic BPF can observe, but unable to express resource-level access control without additional infrastructure.

For applications that need fine-grained resource isolation, Capsicum’s model is more principled. For quickly reducing a process’s syscall surface without restructuring its code, seccomp-bpf is more practical. Linux traded design coherence for adoption breadth and incremental deployability. FreeBSD made the opposite bet. Both choices have shaped what sandboxing looks like in real deployed software.
