
Two Models of Process Sandboxing: How Capsicum and seccomp Disagree on the Problem

Source: lobsters

Security engineers often talk about sandboxing as if it were a single technique, but the two most widely deployed process sandboxing mechanisms on Unix systems, Capsicum on FreeBSD and seccomp-BPF on Linux, reflect genuinely different theories about where the danger lies. Vivian Voss’s comparison post is a good starting point, and this post digs into the technical specifics and design implications that emerge when you actually use both.

The Core Disagreement

seccomp works by filtering system calls. You install a BPF program into the kernel that runs on every syscall the process makes, and that program decides whether to allow, deny, or handle the call in some other way. Capsicum works by eliminating ambient authority: once a process calls cap_enter(), it can no longer reference global namespaces. No opening files by path, no DNS lookups by name, no connecting to new network addresses. Everything the process needs must already be in its file descriptor table, and those descriptors carry explicit capability rights that constrain what operations are legal.

These are not just different implementations of the same idea. They encode different assumptions about what makes code dangerous.

seccomp: Syscall-Level Filtering

Linux seccomp has two modes. The original strict mode, introduced in 2.6.12, only permits read, write, _exit, and sigreturn. It’s essentially unusable for anything beyond trivial compute tasks. The useful mode is seccomp-filter, added in 3.5, which lets you attach a classic BPF (not eBPF) program that inspects the raw seccomp_data struct on each syscall entry:

struct seccomp_data {
    int nr;                    /* syscall number */
    __u32 arch;                /* AUDIT_ARCH_* value */
    __u64 instruction_pointer; /* CPU instruction pointer */
    __u64 args[6];             /* up to 6 syscall arguments */
};

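You install a filter with prctl(), first setting no_new_privs (required unless the process has CAP_SYS_ADMIN):

struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),   /* number of BPF instructions */
    .filter = filter,
};
/* Without this, an unprivileged caller gets EACCES from the next call. */
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);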

or the more modern seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog) syscall, which glibc does not wrap and is therefore usually invoked through syscall(). The return value of your BPF program controls the outcome: SECCOMP_RET_ALLOW lets the call through, SECCOMP_RET_KILL_PROCESS terminates the whole process immediately, SECCOMP_RET_ERRNO skips the call and returns a chosen errno to userspace, and SECCOMP_RET_USER_NOTIF (added in 5.0) hands the decision to a supervising process over a file descriptor.
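As a rough illustration of how the seccomp_data fields and the return actions fit together, the following hand-written filter checks the architecture, allows three syscalls, and fails everything else with EPERM. The helper name getpid_errno_under_allowlist, the NATIVE_ARCH and ALLOW macros, and the choice of allowed syscalls are assumptions made for this sketch; it only covers x86-64 and AArch64 and assumes kernel headers from 4.14 or later.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Allow one syscall number, otherwise fall through to the next rule. */
#define ALLOW(nr) \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

static struct sock_filter allowlist[] = {
    /* Refuse to interpret syscall numbers from a foreign ABI. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* Load the syscall number and compare it against the allowlist. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    ALLOW(__NR_read),
    ALLOW(__NR_write),
    ALLOW(__NR_exit_group),
    /* Everything else fails with EPERM instead of killing the process. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
};

/* Fork a child, install the filter there, and report the errno that a
 * non-allowlisted syscall (getpid) produces. Returns -1 if the child
 * could not be created or did not exit normally, 255 if the filter
 * could not be installed. */
int getpid_errno_under_allowlist(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sock_fprog prog = {
            .len = sizeof(allowlist) / sizeof(allowlist[0]),
            .filter = allowlist,
        };
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(255);
        errno = 0;
        long r = syscall(SYS_getpid);   /* not on the allowlist */
        _exit(r == -1 ? errno : 0);     /* exit_group is allowed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Running the filter in a forked child keeps the calling process unrestricted, which is convenient for experiments because an installed filter can never be removed.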

Most projects use libseccomp to avoid writing BPF bytecode by hand:

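#include <seccomp.h>   /* link with -lseccomp */

/* Default action: kill the process on any syscall not listed below. */
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
seccomp_load(ctx);     /* compiles the rules to BPF and installs them */
seccomp_release(ctx);  /* the loaded filter stays active after this */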

Filters are inherited across fork() and execve(), and they stack: every installed filter runs on each syscall and the most restrictive result wins, so a filter added later can tighten the policy but never loosen it. Once installed, a filter cannot be removed. That last property is deliberate: a compromised process cannot disable its own restrictions.

The Argument Inspection Problem

Here is where seccomp runs into a structural limitation. The args array in seccomp_data contains raw syscall arguments, integers or pointers. For syscalls like openat(dirfd, pathname, flags), the pathname argument is a pointer into user memory. The BPF program cannot dereference that pointer. The kernel explicitly prevents this, and for good reason: if the filter read from user memory and the syscall’s path resolution also read from user memory, an attacker could race the two reads and change the path between them. This is a classic TOCTOU attack.

The consequence is that seccomp cannot meaningfully filter on path arguments. You can allow or deny openat entirely, you can check the flags integer, but you cannot say “allow opens to /tmp but not to /etc”. Projects like Chrome work around this by using a dedicated privileged broker process that receives requests over IPC and performs path-based operations on behalf of the sandboxed renderer. The broker makes the policy decision; the sandbox just forbids direct syscalls.

Capsicum: Capability-Based Sandboxing

Capsicum, developed at the University of Cambridge and first shipped as an experimental feature in FreeBSD 9.0, takes the opposite approach. Rather than filtering what system calls a process can make, it changes what those calls can do.

cap_enter() is a one-way door. After calling it, the process is in capability mode. In capability mode:

  • open() fails with ECAPMODE. You cannot open a file by absolute path.
  • connect() on a socket fails with ECAPMODE. You cannot create new network connections by address.
  • chdir(), stat() on paths, access(), and other namespace-touching calls all fail.

What the process retains is the ability to operate on file descriptors it already holds. And those descriptors carry rights that can only be restricted, never expanded. You pre-load resources before calling cap_enter(), then strip rights down to what you actually need:

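#include <sys/capsicum.h>
#include <fcntl.h>
#include <unistd.h>

char buf[4096];

/* Acquire the resource while the global namespace is still reachable. */
int fd = open("/var/data/input.dat", O_RDONLY);

/* Keep only the rights this descriptor actually needs. */
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_SEEK, CAP_FSTAT);
cap_rights_limit(fd, &rights);

cap_enter(); /* now in capability mode */

/* this is fine */
read(fd, buf, sizeof(buf));

/* this fails with ENOTCAPABLE */
write(fd, buf, sizeof(buf));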

Capability rights are granular. The full list includes CAP_READ, CAP_WRITE, CAP_SEEK, CAP_MMAP, CAP_FSTAT, CAP_CONNECT, CAP_ACCEPT, CAP_BIND, CAP_LISTEN, CAP_FCHMOD, CAP_LINKAT_SOURCE, CAP_LINKAT_TARGET, and many others. You can also restrict which ioctl codes are permitted on a descriptor with cap_ioctls_limit(), and which fcntl commands with cap_fcntls_limit().
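A small sketch of the ioctl and fcntl limits, assuming a terminal descriptor that should only be read, queried for its window size, and inspected with F_GETFL. The function name limit_tty_fd and the choice of TIOCGWINSZ are illustrative; the non-FreeBSD branch is a stub added so the example compiles elsewhere.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifdef __FreeBSD__
#include <sys/capsicum.h>
#include <sys/ioctl.h>
#endif

/* Restrict fd to reads, one ioctl command, and fcntl(F_GETFL).
 * Returns 0 on success, -1 with errno set otherwise. */
int limit_tty_fd(int fd)
{
#ifdef __FreeBSD__
    cap_rights_t rights;
    const unsigned long ioctls[] = { TIOCGWINSZ };

    /* CAP_IOCTL and CAP_FCNTL must be present for the finer-grained
     * limits below to leave anything usable. */
    cap_rights_init(&rights, CAP_READ, CAP_IOCTL, CAP_FCNTL);
    if (cap_rights_limit(fd, &rights) < 0)
        return -1;
    if (cap_ioctls_limit(fd, ioctls, sizeof(ioctls) / sizeof(ioctls[0])) < 0)
        return -1;
    return cap_fcntls_limit(fd, CAP_FCNTL_GETFL);
#else
    /* Stub for platforms without Capsicum. */
    (void)fd;
    errno = ENOSYS;
    return -1;
#endif
}
```

Like the rights themselves, these limits can only be narrowed afterwards, never widened.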

The key insight is that you cannot have a TOCTOU problem when you’re working with descriptors rather than paths. An fstat(fd) on a capability-restricted fd operates directly on the kernel object the descriptor refers to. There is no second lookup, no race window.

Relative Opens with openat

Capsicum does not require processes to pre-open every file they will ever need. Instead, it leans on openat(). A process in capability mode can hold a descriptor for a directory and use openat(dirfd, "relative/path", flags) to open files beneath it. The rights on dirfd must include CAP_LOOKUP plus whatever the requested open needs (CAP_READ for O_RDONLY, for example), and the new descriptor inherits the rights of dirfd, so it can never be more powerful than the directory it came from. Absolute paths and lookups that would climb out of the directory are rejected. This is how tools like tcpdump and dhclient were adapted for Capsicum on FreeBSD: they receive a pre-opened directory descriptor and perform all their file access relative to it, eliminating any ability to escape into the broader filesystem.
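The relative-open pattern might look like the following sketch. The function name open_beneath is hypothetical, and the non-FreeBSD branch is a stub so the example compiles on other systems. Note that the function calls cap_enter(), so on FreeBSD the calling process stays in capability mode afterwards.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifdef __FreeBSD__
#include <sys/capsicum.h>
#endif

/* Open `dir`, keep only lookup and read rights on it, enter capability
 * mode, then open `name` relative to it. Returns the new descriptor,
 * or -1 with errno set. */
int open_beneath(const char *dir, const char *name)
{
#ifdef __FreeBSD__
    cap_rights_t rights;
    int dirfd, fd, saved;

    dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    cap_rights_init(&rights, CAP_LOOKUP, CAP_READ, CAP_FSTAT);
    if (cap_rights_limit(dirfd, &rights) < 0 || cap_enter() < 0) {
        saved = errno;
        close(dirfd);
        errno = saved;
        return -1;
    }

    /* From here on, only paths that stay beneath dirfd can be resolved;
     * the result carries at most the rights granted to dirfd. */
    fd = openat(dirfd, name, O_RDONLY);
    saved = errno;
    close(dirfd);
    errno = saved;
    return fd;
#else
    /* Stub for platforms without Capsicum. */
    (void)dir;
    (void)name;
    errno = ENOSYS;
    return -1;
#endif
}
```

A call such as open_beneath("/var/data", "input.dat") would succeed, while a name like "/etc/passwd" should be refused once the process is in capability mode.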

Casper: Privileged Services Outside the Sandbox

For operations that genuinely require global namespace access, FreeBSD provides Casper, a library for building privileged helper services. A sandboxed process can hold a Casper channel descriptor and make requests through it. The Casper service runs outside the sandbox with full privileges, validates requests against a policy, and performs the operation. This is structurally similar to Chrome’s broker process model, but Casper provides standard library services: cap_dns for DNS resolution, cap_grp for group database lookups, cap_pwd for password entries, cap_sysctl for sysctl reads, and others.

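The difference from an ad-hoc broker is that Casper services are composable and policy is explicit. You tell cap_dns which lookup directions and address families are allowed, and it refuses everything else; the newer cap_net service can additionally restrict which hostnames may be resolved.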
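A minimal sketch of the cap_dns flow, based on the pattern in the cap_dns(3) manual page: open the service before entering capability mode, narrow it, then resolve through the channel. The function name resolve_in_sandbox is hypothetical, the WITH_CASPER define is assumed to be needed to select the real implementation in the libcasper headers, and on FreeBSD the program would be linked with -lcasper -lcap_dns. The non-FreeBSD branch is a stub.

```c
#include <errno.h>
#include <netdb.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

#ifdef __FreeBSD__
#define WITH_CASPER          /* assumption: selects the non-stub API */
#include <sys/capsicum.h>
#include <libcasper.h>
#include <casper/cap_dns.h>
#endif

/* Resolve `host` to IPv4 addresses from inside capability mode.
 * Returns 0 and fills *res on success, -1 otherwise. */
int resolve_in_sandbox(const char *host, struct addrinfo **res)
{
#ifdef __FreeBSD__
    const char *types[] = { "NAME2ADDR" };   /* forward lookups only */
    const int families[] = { AF_INET };      /* IPv4 only */
    struct addrinfo hints;
    cap_channel_t *casper, *dns;
    int rc;

    casper = cap_init();                     /* must precede cap_enter() */
    if (casper == NULL)
        return -1;
    dns = cap_service_open(casper, "system.dns");
    cap_close(casper);                       /* main channel no longer needed */
    if (dns == NULL)
        return -1;

    if (cap_dns_type_limit(dns, types, 1) < 0 ||
        cap_dns_family_limit(dns, families, 1) < 0 ||
        cap_enter() < 0) {
        cap_close(dns);
        return -1;
    }

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    rc = cap_getaddrinfo(dns, host, NULL, &hints, res);
    cap_close(dns);
    return rc == 0 ? 0 : -1;
#else
    /* Stub for platforms without Casper. */
    (void)host;
    (void)res;
    errno = ENOSYS;
    return -1;
#endif
}
```

The sandboxed process never touches the resolver configuration or opens a network socket for the lookup itself; it only holds the channel descriptor, and the limits applied to that channel cannot be widened later.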

Porting Effort and Adoption

This is where the tradeoffs become practical. seccomp can usually be added to an existing Linux program without restructuring it. You decide which syscalls to allow, install the filter, and run. The program’s call sites for open(), connect(), and everything else remain unchanged. The work is in enumerating the syscall surface, which is non-trivial but tractable for most programs.

Capsicum generally requires restructuring. You need to identify all resources the program will need before entering capability mode, open them in advance, strip their rights appropriately, and then change every subsequent access to use descriptors rather than paths. For programs that open files on demand based on user input, this can require significant refactoring. OpenSSH, for example, was adapted for Capsicum in FreeBSD with a privilege-separated design where the unprivileged child enters capability mode holding only the descriptors it needs for the active connection.

On Linux, Capsicum support exists through the capsicum-linux project, and there is libcapsicum for userspace, but neither is part of the mainline kernel. The practical reach of Capsicum is largely FreeBSD and derivatives.

The Philosophical Difference

seccomp says: here is the set of kernel interfaces this process may call. Capsicum says: here are the objects this process holds, and here is what it may do with them. The seccomp model is closer to a firewall, inspecting traffic at the boundary. The Capsicum model is closer to a type system, encoding authority into the objects themselves.

seccomp’s weakness is that a large syscall allowlist can still leave considerable attack surface. Allowing mmap, mprotect, clone, and execve in combination is almost equivalent to no restriction at all. Building a minimal allowlist requires deep knowledge of what the program actually calls, and that knowledge gets stale as libraries change. Projects like syscall2seccomp and the --export-dynamic linker approach to static analysis help, but they are imperfect.

Capsicum’s weakness is that the model only constrains what you can do with kernel objects. If the process holds a directory descriptor whose rights include CAP_LOOKUP together with CAP_CREATE, CAP_WRITE, and CAP_UNLINKAT, it can create, modify, and delete files within that directory. A compromised library component in a Capsicum-sandboxed process has the same access to those fds as the rest of the process. Capsicum does not address intra-process isolation; for that you’d need separate processes with separate capability sets.

Used in Practice

On Linux, seccomp is used by Chrome, Firefox (in its content-process sandbox), Docker (which applies a default seccomp profile to containers), systemd (the SystemCallFilter= directive), OpenSSH since 6.0, and many others. The SECCOMP_RET_USER_NOTIF mechanism has made seccomp more useful for container runtimes, since a supervisor can intercept and handle specific syscalls rather than just blocking them.

On FreeBSD, Capsicum is used in the base system for tcpdump, dhclient, hastd, and parts of OpenSSH. The Chromium port on FreeBSD uses Capsicum in its sandboxing layer. Adoption has been steady but slower than seccomp’s, partly because Capsicum requires deeper code changes and partly because FreeBSD’s install base is smaller.

Which One to Reach For

If you are writing Linux software and want to reduce attack surface, seccomp is available, mature, and well-documented. The cost of adopting it is low, the tooling is solid, and SECCOMP_RET_USER_NOTIF gives you a reasonable path to handling edge cases without dropping the filter. Start with a permissive filter that logs instead of blocking (SECCOMP_RET_LOG as the default action, available since 4.14), capture what your program actually calls, and tighten from there.

If you are writing FreeBSD software, or software that will run on both and you care about correctness rather than just surface reduction, Capsicum forces a program structure that is harder to exploit even if part of the program is compromised. Pre-opening and right-limiting resources is not just a sandboxing step; it documents and enforces what the program is supposed to access. That discipline has value beyond security.

The deeper point is that these two mechanisms are not substitutes. An openat() call relative to a Capsicum-restricted directory descriptor is structurally race-free in a way that seccomp cannot replicate. Conversely, seccomp’s syscall-level filtering catches a class of privilege escalation attempts that Capsicum’s object model is silent on. The most defensible posture, if you are on a platform that supports both, is to use them together.
