Two Mental Models for Process Sandboxing: Capsicum and seccomp

Process sandboxing sits at an interesting intersection: the kernel has to enforce a policy you express, but the expressiveness of the policy language shapes what security properties you can even think about. Capsicum on FreeBSD and seccomp-BPF on Linux are both mechanisms for constraining what a process can do after startup, but they rest on different foundations and those foundations matter.

A recent writeup comparing the two prompted me to dig into something the straight comparison tends to gloss over: Capsicum and seccomp are not solving the same problem. They represent different theories about what process authority even is, and the practical consequences of that difference run deeper than API ergonomics.

Ambient Authority and Why It’s the Real Problem

Every standard POSIX process carries what capability theorists call ambient authority: the ability to access global system namespaces using nothing but its credentials. An unprivileged process with UID 1000 can call open("/home/user/secrets", O_RDONLY) and the kernel will permit it as long as the file is owned by that UID or its permission bits otherwise allow the read. The process’s authority to open that file comes not from any explicit capability it holds, but from the ambient context of who it is.

This sounds fine until you’re trying to sandbox a process that has already started. You want the post-sandbox process to be unable to open new files it shouldn’t see, but the ambient authority is baked into the POSIX model itself. You can’t revoke a UID.

Capsicum attacks this directly. seccomp does not.

Capsicum: Eliminating the Global Namespace

Capsicum was designed at Cambridge by Robert Watson and Jonathan Anderson, with early publication at USENIX Security 2010. The core idea is that file descriptors are already the right abstraction, since they’re kernel-managed unforgeable handles to objects. The problem is that processes also have the ability to acquire new file descriptors by name, through global namespace lookups. Capsicum severs that ability.

Calling cap_enter() puts the process into capability mode, irreversibly. From that point, any syscall that requires a global namespace lookup fails with ECAPMODE. You cannot call open("/etc/passwd"). You cannot call connect() to reach a new address. You cannot execve() a new binary by path. The only filesystem access you have is through file descriptors you already hold, and those can be further restricted with cap_rights_limit():

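#include <sys/capsicum.h>

cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_SEEK, CAP_FSTAT);  /* the only rights fd keeps */
cap_rights_limit(fd, &rights);                            /* rights can shrink, never grow */
cap_enter();                                              /* irreversible; returns -1 on failure */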

After these calls, fd can only be read, seeked, and fstat()ed. It cannot be written, executed, or used for ioctl, and no new files can be opened by path. To work with files at all after cap_enter(), you pre-open a directory before entering capability mode, limit it to rights that include CAP_LOOKUP, and then use openat() relative to that directory:

#include <sys/capsicum.h>
#include <fcntl.h>

/* Pre-open the directory while path lookups are still allowed */
int dirfd = open("/var/myapp/data", O_RDONLY | O_DIRECTORY);
cap_rights_t rights;
cap_rights_init(&rights, CAP_LOOKUP, CAP_READ);
cap_rights_limit(dirfd, &rights);
cap_enter();

/* Works: relative path through a pre-opened directory fd.
   The new descriptor inherits at most dirfd's rights. */
int f = openat(dirfd, "config.json", O_RDONLY);

/* Fails with ECAPMODE: absolute path requires a global namespace lookup */
int g = open("/etc/passwd", O_RDONLY);

This is the object-capability model applied to an operating system. Authority is represented entirely by objects the process holds, with rights that can only shrink, and that can be delegated by passing the fd over a socket (via SCM_RIGHTS) to another process. The global namespace is gone.
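Delegation over SCM_RIGHTS is plain POSIX socket code, so it can be sketched without any Capsicum-specific calls. The helper names send_fd and recv_fd below are my own, not part of any library, and the sketch assumes a connected AF_UNIX socket and a single descriptor per message. On FreeBSD, whatever rights were set with cap_rights_limit() travel with the descriptor.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one open descriptor over a connected
 * AF_UNIX socket. Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd)
{
    char byte = 0;  /* at least one byte of ordinary data must accompany the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;  /* forces correct alignment of the buffer */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));

    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

A privileged parent would typically open and limit the descriptor, then hand it to a sandboxed child over one end of a socketpair() created before the child enters capability mode.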

seccomp: Filtering the Mechanism

Linux’s approach is different in kind. seccomp-BPF, merged in Linux 3.5, lets you install a classic BPF program that runs at every syscall entry point. The program receives the syscall number, architecture, instruction pointer, and the six integer arguments, then returns a verdict: allow, kill, return an error, or escalate to a userspace supervisor.

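#include <seccomp.h>
#include <sys/prctl.h>

/* Default action: kill the process on any syscall not allowed below */
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
/* Required for unprivileged filters; libseccomp also sets it by default */
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
seccomp_load(ctx);     /* compile the rules to BPF and install them */
seccomp_release(ctx);  /* the installed filter stays in the kernel */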

This blocks everything not explicitly allowed. The process cannot call open, openat, socket, or anything else not on the list. That sounds similar to Capsicum in effect, but the mechanism reveals a structural limit.

The BPF program can inspect the syscall number and the integer arguments, but it cannot dereference pointer arguments. When a process calls openat(dirfd, path, flags), the path argument is a user-space pointer, which the filter sees only as a 64-bit integer. The BPF program cannot follow that pointer into user memory. This means seccomp cannot implement “allow openat only to paths under /var/myapp”, at least not directly.
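To make the limitation concrete, here is a small illustration of the filter's input. The struct seccomp_data type comes from the kernel's <linux/seccomp.h>; the function fake_openat_view is hypothetical and only mimics, in userspace, what the kernel would hand a filter for an openat call. It assumes a 64-bit target and leaves arch and instruction_pointer zeroed for brevity.

```c
#include <stdint.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Hypothetical helper: build the view a seccomp filter would get for
 * openat(dirfd, path, flags). Only register values are copied; the
 * bytes that path points at never appear anywhere in the struct. */
struct seccomp_data fake_openat_view(int dirfd, const char *path, int flags)
{
    struct seccomp_data d;
    memset(&d, 0, sizeof(d));        /* arch and instruction_pointer left as 0 here */
    d.nr = SYS_openat;               /* syscall number for the native ABI */
    d.args[0] = (uint64_t)(long)dirfd;
    d.args[1] = (uint64_t)(uintptr_t)path;  /* just an address, not a string */
    d.args[2] = (uint64_t)flags;
    return d;
}
```

The whole structure is 64 bytes of integers. A filter can compare args[1] against a constant, but an address comparison says nothing about which path the string at that address spells.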

You can allow openat unconditionally, or block it unconditionally, or route specific calls through SECCOMP_RET_USER_NOTIF to a supervisor process that does the inspection in userspace, at the cost of significant complexity and TOCTOU exposure. The path the filter sees as a pointer lives in user memory; between the moment the filter fires and the moment the supervisor reads it, another thread sharing that memory could have mutated the string.

Capsicum doesn’t have this problem. The fd-based model means the kernel checks rights against a kernel data structure at operation time. There’s no pointer to follow and nothing for the process to swap underneath you.

The Syscall Number Problem

seccomp filters are also architecture-specific. The syscall number for openat on x86_64 is 257; on ARM64 it’s 56; on 32-bit x86 it’s 295. Any filter that skips the architecture check is vulnerable to 32-bit/64-bit confusion: a process on an x86_64 kernel that issues an int $0x80 syscall uses 32-bit numbering, where the same numbers map to different syscalls. The kernel does not enforce this check; including it in every filter is the programmer’s responsibility.
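For reference, this is roughly what the architecture check looks like when written by hand. It is a minimal sketch, not a production filter: it handles only x86_64 and ARM64, it denies just openat with EPERM and allows everything else, and the names NATIVE_AUDIT_ARCH, openat_filter, and install_openat_filter are my own. It also ignores the x32 ABI on x86_64, which a real filter would need to consider.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86_64 and aarch64"
#endif

static struct sock_filter openat_filter[] = {
    /* [0] load the architecture token */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* [1] native arch: skip to [3]; anything else falls through to [2] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0),
    /* [2] foreign ABI, syscall numbers cannot be trusted: kill */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [3] load the syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [4] openat goes to [5], everything else to [6] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),
    /* [5] fail openat with EPERM */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* [6] allow the rest */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Install the filter on the calling thread. Returns 0 or -1. */
int install_openat_filter(void)
{
    struct sock_fprog prog = {
        .len = sizeof(openat_filter) / sizeof(openat_filter[0]),
        .filter = openat_filter,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

If instructions [0] through [2] were dropped, the comparison at [4] would match whatever syscall happens to share openat's number under the 32-bit ABI.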

libseccomp handles this automatically, which is why you should use it rather than hand-crafting BPF. But it means seccomp policies require maintenance as new architectures are added and new syscalls appear. Capsicum’s capability rights bitmap is architecture-independent.

What Landlock Is Doing

Landlock, merged in Linux 5.13 in 2021, represents Linux’s most serious attempt to close the gap Capsicum identified. It’s an unprivileged Linux Security Module that lets processes restrict their own filesystem access using path-based rules rather than syscall filtering:

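#include <fcntl.h>
#include <linux/landlock.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Access types this ruleset governs: each is denied unless a rule allows it */
struct landlock_ruleset_attr rs_attr = {
    .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
                         LANDLOCK_ACCESS_FS_WRITE_FILE |
                         LANDLOCK_ACCESS_FS_READ_DIR,
};
int rs_fd = syscall(SYS_landlock_create_ruleset, &rs_attr, sizeof(rs_attr), 0);

/* Allow reads (but not writes) for everything beneath /usr */
struct landlock_path_beneath_attr path_attr = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE | LANDLOCK_ACCESS_FS_READ_DIR,
    .parent_fd = open("/usr", O_PATH | O_CLOEXEC),
};
syscall(SYS_landlock_add_rule, rs_fd, LANDLOCK_RULE_PATH_BENEATH, &path_attr, 0);
close(path_attr.parent_fd);  /* the rule keeps its own reference */

/* Unprivileged callers must set no_new_privs before restricting themselves */
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
syscall(SYS_landlock_restrict_self, rs_fd, 0);
close(rs_fd);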

Landlock uses inode-based checks rather than path string matching, which avoids the TOCTOU problems of inspecting pathname arguments. Linux 6.7 extended it to TCP bind and connect rules. It stacks with seccomp and doesn’t require Capsicum’s full fd-refactor of the application.
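Because Landlock grows by ABI version, code that wants the TCP rules usually probes first. The following is a minimal sketch, assuming kernel headers from 5.13 or later are installed; the function names landlock_abi_version and landlock_has_tcp_rules are mine, and the threshold of 4 reflects the ABI level that shipped with Linux 6.7.

```c
#include <linux/landlock.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: ask the running kernel which Landlock ABI it
 * speaks. Returns 0 if Landlock is missing or disabled. */
int landlock_abi_version(void)
{
    long v = syscall(SYS_landlock_create_ruleset, NULL, 0,
                     LANDLOCK_CREATE_RULESET_VERSION);
    return v < 0 ? 0 : (int)v;
}

/* Hypothetical helper: TCP bind/connect rules arrived with ABI 4. */
int landlock_has_tcp_rules(int abi)
{
    return abi >= 4;
}
```

A caller would typically drop the access bits the running kernel does not know about, rather than failing outright, so that the same binary degrades gracefully on older kernels.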

The trade-off versus Capsicum: Landlock doesn’t eliminate ambient authority. A process with a Landlock ruleset that allows reads under /usr still has the ambient authority to open any file there, rather than only files reachable through a pre-opened handle. Capsicum’s model is stricter. But the porting burden is dramatically lower, which explains the adoption trajectory.

The Practical Calculus

Capsicum first shipped as an experimental feature in FreeBSD 9.0, has been enabled by default since 10.0, and is used in tcpdump, dhclient, ping, the OpenSSH pre-auth sandbox, and Chromium’s renderer sandbox on FreeBSD. The design is clean and theoretically sound, and the libcasper library provides capability-safe wrappers for operations that need privilege delegation (DNS lookup, syslog, etc.). The main cost is that applications must be refactored to use openat() throughout and pre-open all needed resources before calling cap_enter(). For a program like tcpdump this is tractable; for a general-purpose application with complex I/O patterns, it’s a significant rewrite.

seccomp-BPF, by contrast, is used in Chrome, Firefox, Docker’s default profile, systemd unit files (SystemCallFilter=), OpenSSH, and effectively every container runtime. The filter says nothing about which files can be opened, only which syscalls can be called. For many threat models, that’s enough: a renderer process that can’t call execve, fork, or ptrace is substantially constrained even if it can still call openat. Combined with Linux namespaces for filesystem and network isolation, seccomp covers most of what Capsicum would cover for container workloads.

OpenBSD took a third path with pledge() and unveil(): promise-based coarse sandboxing combined with path restriction. pledge("stdio rpath inet", NULL) limits a process to I/O, read-only filesystem, and network. unveil("/var/myapp", "rwc") restricts which paths are visible at all. The ergonomics are the best of the three approaches; the granularity is the coarsest.
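Put together, the two calls look something like the sketch below. The helper name restrict_self is hypothetical, and the non-OpenBSD branch is my own addition so the file compiles elsewhere; it simply reports ENOSYS. I use "r" for unveil here so the visible directory matches the read-only rpath promise.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: expose one directory read-only, then drop to a
 * small promise set. Returns 0 on success, -1 on failure. */
int restrict_self(const char *dir)
{
#ifdef __OpenBSD__
    if (unveil(dir, "r") == -1)    /* only dir stays visible, read-only */
        return -1;
    if (unveil(NULL, NULL) == -1)  /* lock the list: no further unveil calls */
        return -1;
    /* unveil comes first: pledge without the "unveil" promise forbids it */
    return pledge("stdio rpath inet", NULL);
#else
    (void)dir;
    errno = ENOSYS;                /* not OpenBSD: nothing to apply */
    return -1;
#endif
}
```

The ordering is the one design point worth noticing: path visibility is narrowed while the process can still call unveil(), and the promise set is reduced last.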

Where This Leaves You

If you’re writing software for FreeBSD or building something where the ambient authority model is genuinely the threat you want to address, Capsicum is the more principled tool. Its object-capability model makes security properties compositional and inspectable. The porting cost is real but the design pays off for long-lived, security-critical software.

On Linux, seccomp-BPF is unavoidable and worth understanding at the BPF level even if you use libseccomp day-to-day. The filter-as-program model is expressive but requires care around architecture handling and the inability to inspect pointer arguments. For filesystem access control, Landlock deserves attention: it has been in mainline since 5.13, requires no privilege, and handles the path-based restriction that seccomp structurally cannot express.

The gap between these two systems is ultimately a gap in how Linux chose to evolve its security model versus the cleaner redesign that Capsicum represents. The capsicum-linux patches have never been merged; the Linux position has been that namespaces plus seccomp plus Landlock covers the use cases. That may be true in practice. But working through Capsicum’s design, even if you’re shipping on Linux, clarifies what you’re actually trying to achieve when you sandbox a process, and that clarity is worth something regardless of which kernel you end up on.