
Object Capabilities vs Syscall Interposition: Two Philosophies of Process Sandboxing

Source: lobsters

Process sandboxing is one of those areas where the right answer depends heavily on which question you’re asking. A recent comparison of Capsicum and seccomp lays out how both mechanisms work at a practical level. What I want to explore is why they diverge so sharply in design philosophy, what that difference means for the security invariants you can actually reason about, and how the ecosystem has responded over time.

The Ambient Authority Problem

Before getting into either API, it helps to understand what both mechanisms are trying to solve. A freshly forked process inherits a substantial amount of authority from its parent: open file descriptors, environment variables, signal dispositions, and the ability to make arbitrary system calls to acquire more resources. This inherited authority is sometimes called “ambient authority”: it exists by default, without any explicit grant to the process that now holds it.

Ambient authority is dangerous because it is invisible. A process doing one legitimate thing can quietly also do something else, using authority it inherited but was never supposed to exercise. Both Capsicum and seccomp are trying to strip that ambient authority away, but they do so at completely different layers of the system.

Capsicum: Restricting What You Can Access

Capsicum was developed at the University of Cambridge and described in a 2010 USENIX Security paper by Robert Watson, Jonathan Anderson, Ben Laurie, and Kris Kennaway. It first shipped as an experimental feature in FreeBSD 9.0, has been enabled by default since FreeBSD 10.0, and has been refined steadily since. The model operates at the level of objects rather than system calls. The core abstraction is the capability, which in Capsicum’s case is a file descriptor with a restricted set of rights attached to it.

The workflow is straightforward to reason about. Before entering sandbox mode, you prepare all the file descriptors your process will need, narrow their rights to exactly what is necessary, and call cap_enter(). After that, the process cannot open files by absolute path or reach any other global namespace. New descriptors can only be derived from ones it already holds (for example, openat() relative to a directory descriptor, or accept() on a listening socket) or received from another process over a socket. Everything it can access, it already has or was explicitly handed, and each file descriptor carries its own permission set.

#include <sys/capsicum.h>
#include <err.h>

// Restrict an FD to read-only before entering sandbox
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_SEEK, CAP_FSTAT);
if (cap_rights_limit(fd, &rights) < 0)
    err(1, "cap_rights_limit");

// Enter capability mode - there is no going back
if (cap_enter() < 0)
    err(1, "cap_enter");

// The process can only use what its FDs allow

The key property here is that capabilities are unforgeable. A file descriptor is a kernel-managed integer. You cannot manufacture one without the kernel handing it to you, and once you are in capability mode, the kernel will not hand out new ones based on paths or names. If a process has a capability, it received it through an explicit grant. There is no escalation path through ambient authority the process did not know it had.
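To make the "no new descriptors from names" rule concrete, here is a minimal sketch of what a process might observe after cap_enter(). The function name capmode_lookup_demo and its return convention are made up for illustration, and the non-FreeBSD branch is only a stub so the file compiles elsewhere. It assumes a FreeBSD kernel with capability mode enabled, and it should run in a forked child because cap_enter() cannot be undone.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#if defined(__FreeBSD__)
#include <sys/capsicum.h>
#endif

/* Hypothetical demo helper (not part of any library).
 * Returns 0 if capability mode behaved as described, -1 on an unexpected
 * result, and 1 on systems without Capsicum (stub branch). */
static int capmode_lookup_demo(const char *dir, const char *name)
{
#if defined(__FreeBSD__)
    /* Acquire the directory capability while paths still resolve. */
    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    /* Keep only lookup/read/fstat on the directory descriptor. */
    cap_rights_t rights;
    cap_rights_init(&rights, CAP_LOOKUP, CAP_READ, CAP_FSTAT);
    if (cap_rights_limit(dirfd, &rights) != 0 || cap_enter() != 0)
        return -1;

    /* A path in the global namespace is refused with ECAPMODE. */
    if (open("/etc/passwd", O_RDONLY) != -1 || errno != ECAPMODE)
        return -1;

    /* A lookup relative to a held directory descriptor is still allowed. */
    int fd = openat(dirfd, name, O_RDONLY);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
#else
    (void)dir;
    (void)name;
    return 1; /* no Capsicum here; nothing was sandboxed */
#endif
}
```

Calling capmode_lookup_demo("/etc", "hosts") in a child process is one way to try it. The descriptor returned by openat() inherits the rights of the directory descriptor, so it cannot be used for anything the directory capability did not already permit.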

Capsicum’s model has deep roots in formal capability theory, going back to Lampson’s paper “Protection” (presented in 1971 and reprinted in 1974). The principle is that access rights should be explicit, unforgeable tokens tied to objects, not derived from ambient state. The security invariant is local and verifiable: if a process does not hold a file descriptor for a resource, it cannot access that resource.

For operations that legitimately require privilege, Casper handles them (originally a standalone daemon, provided through libcasper in current FreeBSD releases). A sandboxed process holds a capability to a Casper service running outside the sandbox, makes a restricted request (a DNS lookup, a user database query), and receives back only what it is permitted to see. It is a capability transfer rather than a privilege escalation, and the boundary remains clean.

OpenSSH on FreeBSD is the canonical example of Capsicum in production. The privilege-separated child process calls cap_enter() after receiving the socket from the parent, leaving it with no path to acquire additional resources even if an attacker achieves code execution inside it.

seccomp-bpf: Restricting What You Can Call

seccomp took a completely different approach. Rather than restricting which objects a process can access, it restricts which system calls the process can make. The seccomp-bpf mode, introduced in Linux 3.5 by Will Drewry at Google, lets you attach a BPF program that runs on every system call and decides whether to allow it, kill the process, return a specific errno, or notify a tracer.

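struct sock_filter filter[] = {
    // Load the syscall number from the seccomp_data the kernel passes in
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, nr)),
    // read(): fall through to ALLOW; anything else: skip one instruction to KILL
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};
// A production filter should also check seccomp_data.arch before trusting nr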

Writing raw BPF is unpleasant, and most practitioners reach for libseccomp, which wraps the interface in something more legible:

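// Default action: kill the process on any syscall not listed below
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
if (ctx == NULL)
    abort();
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
// The filter only takes effect once it is loaded into the kernel
if (seccomp_load(ctx) != 0)
    abort();
seccomp_release(ctx);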

seccomp-bpf can also inspect syscall arguments, which is where things get flexible and complicated simultaneously. You can allow mmap() only when called with MAP_PRIVATE, or restrict socket() to specific address families. But argument inspection only sees the raw 64-bit register values, not the underlying data in memory. If you allow open() based on the flags argument, you are not checking which path is being opened; you are checking how the caller wants to open it. Path resolution happens in the kernel after the seccomp filter has already run.
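As a rough illustration of argument inspection, the sketch below restricts socket() to AF_UNIX by comparing the first argument register. The function names are invented for this example. It assumes a little-endian Linux target (the 32-bit load reads the low half of args[0]) and, for brevity, skips the seccomp_data.arch check and allows every other syscall, which a real policy should not do.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical helper: after this call, socket() succeeds only for AF_UNIX;
 * any other address family fails with EPERM. All other syscalls pass. */
static int restrict_socket_to_af_unix(void)
{
    struct sock_filter filter[] = {
        /* 0: load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* 1: not socket()? jump ahead three instructions to ALLOW */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 3),
        /* 2: load the low 32 bits of args[0], the address family */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
        /* 3: AF_UNIX? skip the errno return */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 1, 0),
        /* 4: refuse every other family with EPERM */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* 5: allow */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Unprivileged processes must set no_new_privs before loading a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Applies the filter in a forked child so the caller stays unfiltered.
 * Returns 0 if AF_UNIX worked and AF_INET was refused with EPERM. */
static int socket_filter_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (restrict_socket_to_af_unix() != 0)
            _exit(2);
        int ok = socket(AF_UNIX, SOCK_STREAM, 0);
        int denied = socket(AF_INET, SOCK_STREAM, 0);
        _exit((ok >= 0 && denied == -1 && errno == EPERM) ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The filter can compare the address family because it is passed by value in a register. A path argument would only show up as a pointer, which the filter has no way to follow.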

Chromium’s renderer process policy shows the practical scope of this approach. It is a substantial file because the renderer needs a substantial list of permitted syscalls, and each one has to be reasoned about individually.

The Model Comparison

The fundamental asymmetry between the two mechanisms comes down to what security invariant you are enforcing. Capsicum enforces: “this process can only access resources it was explicitly granted.” seccomp enforces: “this process can only invoke system calls that pass this filter.”

Capsicum’s invariant is compositional. You can reason about each capability in isolation. The security of the whole system follows from the security of each individual capability grant. seccomp’s invariant requires enumerating the complete attack surface: every system call that could be misused, every argument pattern that could cause harm, with a single omission leaving a gap in coverage.

The ioctl() system call illustrates the difference well. It is one system call that performs hundreds of different operations depending on the request argument. seccomp can restrict which ioctl request codes are permitted, but doing it correctly requires per-device, per-operation knowledge. Capsicum sidesteps this by restricting the file descriptor itself; without CAP_IOCTL on the descriptor, ioctl() cannot be called on it at all. The security policy is attached to the object, not to a list of ways the object might be manipulated.

Performance and Overhead

The two mechanisms have meaningfully different performance profiles. Capsicum’s checks happen when the process attempts to use a file descriptor. The check is a bitmask comparison against the stored capability rights, and the overhead is negligible for most workloads.

seccomp-bpf filters run on every system call before the system call executes. For simple filters on a single syscall match, the overhead per call is roughly 5 to 20 nanoseconds on modern hardware. Complex filters with multiple conditions and argument comparisons can push toward 100 nanoseconds per call. Chromium’s filters are extensive enough that the overhead is measurable in syscall-heavy code paths, though the security benefit justifies the cost under Chromium’s threat model.

One subtle implication: seccomp filter complexity tends to grow over time as applications need new syscalls and policies get extended. Capsicum’s capability sets do not have this property; narrowing a capability set is always a one-way operation.
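The bitmask check and the one-way narrowing can be sketched with a toy model. This is not FreeBSD's implementation, and the TOY_* bits and toy_cap type are invented for illustration. It only shows the shape of the logic: a per-use subset test, and a limit operation that refuses any request containing rights the descriptor does not already hold.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rights bits for this toy model; not the real CAP_* values. */
enum {
    TOY_READ  = 1u << 0,
    TOY_WRITE = 1u << 1,
    TOY_SEEK  = 1u << 2,
    TOY_IOCTL = 1u << 3,
};

/* Stand-in for the rights the kernel stores alongside a descriptor. */
struct toy_cap {
    uint32_t rights;
};

/* Per-operation check: every needed bit must be present. */
static bool toy_cap_check(const struct toy_cap *cap, uint32_t needed)
{
    return (cap->rights & needed) == needed;
}

/* Narrowing: succeeds only if the requested set is a subset of the
 * current one, so rights can be dropped but never regained. */
static int toy_cap_limit(struct toy_cap *cap, uint32_t requested)
{
    if ((requested & ~cap->rights) != 0)
        return -1; /* would add a right; refused */
    cap->rights = requested;
    return 0;
}
```

In this model, a descriptor limited to TOY_READ stays read-only for good: a later toy_cap_limit() asking for TOY_READ | TOY_WRITE returns -1 and leaves the rights unchanged. The real cap_rights_limit() rejects such a request in a similar way.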

Landlock: Linux Moving Toward the Capability Model

Linux does not have Capsicum, but the kernel has been moving toward capability-style restrictions with Landlock, a Linux Security Module merged in kernel 5.13. Landlock lets you restrict a process’s filesystem access using a ruleset that grants specific access rights to specific path hierarchies, which is closer to Capsicum’s model than seccomp is.

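// glibc ships no wrappers for these calls; landlock_create_ruleset() and
// friends are assumed here to be thin syscall(2) wrappers defined elsewhere
struct landlock_ruleset_attr attr = {
    .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
                         LANDLOCK_ACCESS_FS_READ_DIR,
};
int ruleset_fd = landlock_create_ruleset(&attr, sizeof(attr), 0);

// Allow reading files beneath /etc; all other handled access is denied
struct landlock_path_beneath_attr rule = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
    .parent_fd = open("/etc", O_PATH | O_CLOEXEC),
};
landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &rule, 0);
close(rule.parent_fd);

// Unprivileged processes must set no_new_privs before restricting themselves
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
landlock_restrict_self(ruleset_fd, 0);
close(ruleset_fd);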

You are granting access to objects rather than allowing system calls, which means the security reasoning is closer to Capsicum’s compositional model. Landlock v4 (kernel 6.7) extended this to network restrictions, allowing you to restrict which TCP ports a process can bind or connect to. The project is still filling out its feature set, but the design direction is clear.
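Because features arrive with successive ABI versions, a program typically probes what the running kernel supports before building a ruleset. The sketch below is one way to do that. The helper names and the DEMO_ constant are local to this example. The constant mirrors the kernel's LANDLOCK_CREATE_RULESET_VERSION flag, and the fallback syscall number is only used when the libc headers are too old to define it.

```c
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback for older headers; 444 is the landlock_create_ruleset number. */
#ifndef SYS_landlock_create_ruleset
#define SYS_landlock_create_ruleset 444
#endif

/* Local copy of the kernel's LANDLOCK_CREATE_RULESET_VERSION flag. */
#define DEMO_LANDLOCK_CREATE_RULESET_VERSION (1U << 0)

/* Returns the Landlock ABI version (1 or higher), or 0 when Landlock is
 * unavailable (kernel too old, not built in, disabled, or blocked). */
static int landlock_abi_version(void)
{
    long abi = syscall(SYS_landlock_create_ruleset, NULL, (size_t)0,
                       DEMO_LANDLOCK_CREATE_RULESET_VERSION);
    return abi < 0 ? 0 : (int)abi;
}

/* TCP bind/connect rules arrived with ABI version 4. */
static int landlock_supports_tcp_rules(void)
{
    return landlock_abi_version() >= 4;
}
```

A sandboxing routine could call landlock_supports_tcp_rules() and add network rules only when it returns nonzero, falling back to filesystem-only restrictions on older kernels.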

Why seccomp Won in Practice

Capsicum requires application changes. You cannot sandbox an existing binary with Capsicum without modifying its code to prepare file descriptors, narrow their rights, and call cap_enter(). seccomp can be applied externally without touching the sandboxed program, which is why it fit naturally into container runtimes, systemd service units, and browser architectures.

Docker’s default seccomp profile blocks roughly 44 system calls out of the roughly 300 available on Linux x86_64, applied to every container without any container-side configuration. Kubernetes lets you attach seccomp profiles through the seccompProfile field of a pod’s or container’s securityContext (older releases used pod annotations for this). systemd service files expose SystemCallFilter= as a straightforward directive. The deployment path required no changes to the code being sandboxed.
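The external application works because a seccomp filter survives execve(). A launcher can install the filter and then exec an unmodified binary. The sketch below shows the idea with a hypothetical helper, run_with_syscall_denied, that fails a single syscall with EPERM. It is a deny-list for brevity, whereas real profiles are usually allow-lists, and it again omits the architecture check.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical launcher: runs argv[0] in a child with one syscall denied.
 * Returns the child's exit status, or -1 on failure or abnormal exit. */
static int run_with_syscall_denied(int denied_nr, char *const argv[])
{
    struct sock_filter filter[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* Denied syscall: fall through to EPERM; otherwise skip to ALLOW. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)denied_nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* The filter is installed by the launcher, not by the target. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(126);
        execv(argv[0], argv); /* the filter stays attached across exec */
        _exit(127);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, running /bin/sh -c 'cd /' with __NR_chdir denied should make the shell's cd fail, even though the shell itself was never modified. Container runtimes and systemd follow the same pattern at a larger scale: the policy is loaded before the target program starts.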

This is a pattern that appears repeatedly in security tooling: the mechanism with the cleaner theoretical model often loses adoption to the one that is easier to retrofit into existing systems. Capsicum’s object-capability model is more composable, easier to reason about formally, and less susceptible to the enumeration problem that plagues syscall filtering. seccomp met operators where they already were.

For FreeBSD applications that can invest in the privilege separation upfront, Capsicum remains the right choice, and OpenSSH shows how it looks when done well. For Linux systems today, the practical answer is usually seccomp-bpf for syscall restriction combined with Landlock for filesystem access control, using the two mechanisms together rather than choosing between them. Landlock’s continued development suggests the Linux ecosystem is gradually converging on something closer to Capsicum’s model, even if the path there has been circuitous.
