
Naming vs. Invocation: The Design Split Behind UNIX Process Sandboxing

Source: lobsters

Process sandboxing on UNIX has a well-known split. Linux reaches for seccomp; FreeBSD reaches for Capsicum. The two mechanisms appear to solve the same problem, and a side-by-side comparison makes the contrast vivid. But the deeper point is that they are not actually competing alternatives. They operate at different layers of the security stack, against different failure modes, and the gap between them explains why Linux had to invent Landlock a decade later.

Understanding that split requires going back to the root problem.

The Ambient Authority Problem

A UNIX process inherits broad authority from the moment it starts. It can open any file its user can access, by any path it knows. It can create sockets, look up other processes by PID, and in many cases invoke the kernel’s most sensitive subsystems. This is ambient authority: authority that comes with existence, not with explicit delegation.

Sandboxing is the problem of removing ambient authority from processes that do not need it. Two design questions follow from this: what should be removed, and how should removal be expressed? Capsicum and seccomp give different answers to both.

Capsicum: Restrict Naming

Capsicum, designed by Robert Watson and Ben Laurie and first shipped in FreeBSD 9.0 (2012), addresses ambient authority at the naming layer. The core primitive is cap_enter(). After this single call, the process can no longer reach global namespaces. Opening a file by path fails with ECAPMODE. Connecting a socket to an arbitrary address with connect() fails. Looking up a process by PID fails. The transition is one-way; there is no cap_leave().

if (cap_enter() < 0)
    err(1, "cap_enter");

From this point, the process can only operate on file descriptors it already holds. Capsicum layers per-fd rights on top of this via cap_rights_limit(2), which narrows the set of permitted operations on any given fd:

cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_SEEK);
cap_rights_limit(fd, &rights);

Rights can only be narrowed, never expanded. A file descriptor passed to a child process carries only the rights it was given. This is object-capability security in the classical sense: the capability (the fd) carries the authority, and that authority is unforgeable by the recipient.
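To see the two pieces working together, here is a minimal sketch (not taken from the Capsicum sources) that limits a descriptor to read and seek, forks, and enters capability mode in the child. The function name capsicum_readonly_demo and its return convention are hypothetical. The Capsicum calls exist only on FreeBSD, so on other systems the sketch compiles to a stub that reports ENOSYS.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifdef __FreeBSD__
#include <sys/capsicum.h>
#endif

/*
 * Hypothetical demo: returns 0 if a sandboxed child could read from a
 * rights-limited fd but could neither write to it nor open a path.
 * Returns -1 on any error; on non-FreeBSD systems errno is ENOSYS.
 */
int capsicum_readonly_demo(const char *path)
{
#ifdef __FreeBSD__
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* Narrow the fd to read + seek; this cannot be widened again. */
    cap_rights_t rights;
    cap_rights_init(&rights, CAP_READ, CAP_SEEK);
    if (cap_rights_limit(fd, &rights) < 0) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        char c = 0;
        if (cap_enter() < 0)
            _exit(2);
        /* Global namespace is gone: path lookup must fail. */
        if (open(path, O_RDONLY) >= 0 || errno != ECAPMODE)
            _exit(3);
        /* The fd lacks CAP_WRITE even though it was opened O_RDWR. */
        if (write(fd, &c, 1) >= 0 || errno != ENOTCAPABLE)
            _exit(4);
        /* Reading is still permitted. */
        if (read(fd, &c, 1) < 0)
            _exit(5);
        _exit(0);
    }

    close(fd);
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
#else
    (void)path;
    errno = ENOSYS;   /* Capsicum is not available on this platform */
    return -1;
#endif
}
```

Running the confined part in a forked child is only a convenience for the demo: it keeps the calling process outside capability mode so the result can be inspected afterwards.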

For operations that legitimately need global namespace access after sandbox entry, such as DNS resolution or syslog writes, FreeBSD ships Casper: originally the casperd(8) daemon, now libcasper(3), which keeps pre-forked helper services outside capability mode and serves requests over a socket pair. The sandboxed process sends structured requests; the helper performs the privileged operation and returns results. The trust boundary is explicit and auditable.
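The helper arrangement described above can be sketched in portable POSIX C. This is an illustration of the pattern only, not the Casper API: the wire format (a fixed 64-byte name in, a long out) and the names helper_start, helper_lookup_uid and helper_stop are all made up for the example, and a user-name lookup stands in for the privileged operation.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <pwd.h>
#include <string.h>
#include <unistd.h>

#define REQ_LEN 64   /* fixed-size request: NUL-terminated user name */

/* Read or write exactly len bytes; 0 on success, -1 on EOF or error. */
static int xfer(int fd, void *buf, size_t len, int writing)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = writing ? write(fd, p, len) : read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Helper side: stays outside the sandbox and answers name -> uid. */
static void helper_loop(int sock)
{
    char name[REQ_LEN];
    while (xfer(sock, name, sizeof(name), 0) == 0) {
        name[REQ_LEN - 1] = '\0';            /* do not trust the client */
        struct passwd *pw = getpwnam(name);  /* needs the global namespace */
        long uid = pw ? (long)pw->pw_uid : -1;
        if (xfer(sock, &uid, sizeof(uid), 1) != 0)
            break;
    }
    _exit(0);
}

/* Fork the helper BEFORE sandboxing. Returns the client socket or -1. */
int helper_start(pid_t *pid_out)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        close(sv[0]);
        helper_loop(sv[1]);   /* does not return */
    }
    close(sv[1]);
    *pid_out = pid;
    return sv[0];
}

/* Sandboxed side: uses only the already-open socket, no path or PID. */
long helper_lookup_uid(int sock, const char *name)
{
    char req[REQ_LEN] = {0};
    long uid;
    if (strlen(name) >= REQ_LEN)
        return -1;
    strcpy(req, name);
    if (xfer(sock, req, sizeof(req), 1) != 0 ||
        xfer(sock, &uid, sizeof(uid), 0) != 0)
        return -1;
    return uid;
}

/* Closing the socket makes the helper see EOF and exit. */
void helper_stop(int sock, pid_t pid)
{
    close(sock);
    waitpid(pid, NULL, 0);
}
```

A caller would invoke helper_start() first, then enter its sandbox, and from then on reach the user database only through helper_lookup_uid(). The request format is the whole attack surface of the helper, which is what makes the boundary auditable.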

seccomp: Restrict Invocation

seccomp takes a different approach. Rather than locking down namespace access, it places a filter on kernel entry points. Each time the sandboxed process issues a syscall, the kernel evaluates a cBPF program against the syscall’s metadata and decides whether to allow it, block it, or handle it in some other way.

The filter receives a seccomp_data struct:

struct seccomp_data {
    int   nr;                   /* syscall number */
    __u32 arch;                 /* AUDIT_ARCH_* value */
    __u64 instruction_pointer;
    __u64 args[6];              /* raw syscall arguments */
};

Return values range from SECCOMP_RET_ALLOW through SECCOMP_RET_KILL_PROCESS, with SECCOMP_RET_ERRNO for synthetic errors and SECCOMP_RET_USER_NOTIF (Linux 5.0+) for routing syscalls to a supervisor process. The libseccomp library abstracts away raw BPF bytecode generation:

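#include <seccomp.h>

/* Default action: kill on any syscall that no rule allows. */
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
seccomp_load(ctx);      /* install the filter in the kernel */
seccomp_release(ctx);   /* free the userspace context; the filter stays active */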

Chrome’s renderer sandbox, Firefox’s content process, Docker’s default container profile, Android’s app sandbox via Zygote, and most systemd service hardening all rely on seccomp in some form. For blocking dangerous syscalls (ptrace, perf_event_open, kexec_load, io_uring_setup) in processes that have no business calling them, seccomp is precise and effective.

The Pointer Problem

seccomp’s fundamental limitation is that BPF programs run at syscall entry and receive only the arguments as raw values. For integer arguments (a file descriptor number, a flag bitmask, a byte count), this is sufficient. For pointer arguments, the filter sees an address, not the data at that address. BPF programs cannot dereference user-space pointers; the kernel enforces this by design.

The consequence is that a seccomp filter cannot distinguish:

openat(AT_FDCWD, "/etc/passwd", O_RDONLY)   /* dangerous */
openat(AT_FDCWD, "/tmp/cache",  O_RDONLY)   /* benign */

Both calls have the same syscall number, the same first argument, and the same flags. The path string lives at a user-space pointer that the filter cannot read. seccomp can block openat entirely, or allow it entirely; it cannot allow it conditionally on which path is being opened.

This is not a gap in the implementation. It is a consequence of where in the kernel the filter runs. Fixing it would require something fundamentally different.

Capsicum sidesteps this by design. In capability mode, open(2) and openat(2) with AT_FDCWD fail with ECAPMODE, so there is no global path string to inspect. A process reaches new files only through directory fds carrying CAP_LOOKUP rights, which it either held before cap_enter() or received from another process, by calling openat relative to those fds. The authority travels with the fd, not the path string.

Landlock: Linux Fills the Gap

Linux acknowledged this gap in 2021. Landlock, merged in Linux 5.13 by Mickaël Salaün, provides path-based filesystem access control that any process can apply to itself without special privilege:

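/* Error checks omitted for brevity. */

/* Access types the ruleset governs: each is denied unless a rule allows it. */
struct landlock_ruleset_attr ruleset_attr = {
    .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
                         LANDLOCK_ACCESS_FS_WRITE_FILE,
};
int ruleset_fd = syscall(SYS_landlock_create_ruleset,
                         &ruleset_attr, sizeof(ruleset_attr), 0);

/* Allow reading files beneath /tmp; writing stays denied everywhere. */
struct landlock_path_beneath_attr path_attr = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
    .parent_fd = open("/tmp", O_PATH | O_CLOEXEC),
};
syscall(SYS_landlock_add_rule, ruleset_fd,
        LANDLOCK_RULE_PATH_BENEATH, &path_attr, 0);
close(path_attr.parent_fd);

/* An unprivileged process must set no_new_privs before restricting itself. */
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
syscall(SYS_landlock_restrict_self, ruleset_fd, 0);
close(ruleset_fd);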

Landlock is closer in spirit to Capsicum: describe which filesystem objects the process may access, rather than which syscalls it may call. Linux 6.7 extended Landlock to cover TCP bind and connect operations on specific ports, narrowing the gap with capability mode’s network restrictions further.
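Because Landlock has grown feature by feature, programs usually probe the running kernel before building a ruleset. The following is a minimal sketch of that probe, assuming Linux; the helper names are hypothetical, and the syscall number and flag value are spelled out as fallbacks so the sketch builds even against kernel headers that predate Landlock.

```c
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for old headers; 444 is the number on x86_64 and aarch64. */
#ifndef __NR_landlock_create_ruleset
#define __NR_landlock_create_ruleset 444
#endif
#ifndef LANDLOCK_CREATE_RULESET_VERSION
#define LANDLOCK_CREATE_RULESET_VERSION (1U << 0)
#endif

/*
 * Returns the Landlock ABI version (>= 1), or -1 if the kernel lacks
 * Landlock, has it disabled, or the call is blocked in this environment.
 */
int landlock_abi(void)
{
    long abi = syscall(__NR_landlock_create_ruleset, NULL, 0,
                       LANDLOCK_CREATE_RULESET_VERSION);
    return abi < 0 ? -1 : (int)abi;
}

/* Filesystem rules exist in every ABI version (Linux 5.13 onward). */
int landlock_has_fs_rules(int abi)
{
    return abi >= 1;
}

/* TCP bind/connect rules arrived with ABI version 4 (Linux 6.7). */
int landlock_has_tcp_rules(int abi)
{
    return abi >= 4;
}
```

A sandbox that wants to degrade gracefully would call landlock_abi() once at startup and drop from its ruleset whatever the reported version does not support, rather than failing outright on older kernels.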

Landlock is a complement to seccomp, not a replacement. It does not filter arbitrary syscall invocations, and its reach beyond the filesystem and TCP ports is narrow: scoping of signals and abstract UNIX sockets arrived only in Linux 6.12. The practical Linux sandbox now layers three mechanisms: seccomp for syscall surface reduction, Landlock for filesystem access control, and user namespaces (with the PID and network namespaces they unlock) for process and network isolation. This combination approaches what Capsicum provides with a single mechanism, though with considerably more moving parts and more surface for misconfiguration.

pledge and unveil: The Pragmatist’s Route

OpenBSD’s pledge(2) (OpenBSD 5.9, 2016) and unveil(2) (OpenBSD 6.4, 2018) are deliberately coarser than any of the above. pledge takes a space-separated string of promises, each naming a group of related operations:

pledge("stdio rpath wpath inet dns", NULL);

unveil builds a minimal filesystem view by specifying exact paths and permission strings:

unveil("/tmp", "rwc");
unveil("/usr/share/zoneinfo", "r");
unveil(NULL, NULL);   /* lock down: no further unveil calls allowed */

The granularity is much lower than Capsicum or seccomp-bpf. A pledge promise for rpath allows reading any file the process’s user can access; it does not restrict to specific paths, though unveil fills that role. Neither mechanism provides per-fd rights in Capsicum’s sense.
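How the two calls divide the work is easiest to see when they are used together. The sketch below is a hypothetical demo (the function name and return convention are invented): a forked child unveils one directory read-only, then pledges without the "unveil" promise, which also locks the filesystem view. Both calls exist only on OpenBSD, so elsewhere the sketch compiles to a stub that reports ENOSYS.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 0 if the confined child could open allowed_dir but not
 * outside_file; -1 otherwise (errno is ENOSYS on non-OpenBSD systems).
 */
int pledge_unveil_demo(const char *allowed_dir, const char *outside_file)
{
#ifdef __OpenBSD__
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* 1. Shrink the filesystem view to one read-only directory. */
        if (unveil(allowed_dir, "r") == -1)
            _exit(1);
        /* 2. Omitting the "unveil" promise forbids further unveil calls. */
        if (pledge("stdio rpath", NULL) == -1)
            _exit(2);
        /* rpath permits reading in general, but unveil hides this path. */
        if (open(outside_file, O_RDONLY) != -1)
            _exit(3);
        /* The unveiled directory itself remains readable. */
        if (open(allowed_dir, O_RDONLY | O_DIRECTORY) == -1)
            _exit(4);
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
#else
    (void)allowed_dir;
    (void)outside_file;
    errno = ENOSYS;   /* pledge/unveil are not available on this platform */
    return -1;
#endif
}
```

The order matters: unveil has to run while the process may still call it, so the usual shape is to unveil first and pledge last.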

The tradeoff is ergonomics. Theo de Raadt’s stated goal was that any developer could add sandboxing to an existing program in an afternoon. The result is broad adoption throughout OpenBSD’s base system: tcpdump, ssh, network daemons, and most system utilities carry pledge and unveil calls. The low adoption cost produced coverage that more expressive mechanisms have not achieved on other platforms.

Choosing the Right Layer

The comparison is most useful as a guide to threat models, not a feature checklist. For blocking specific dangerous syscalls from processes that have no business calling them, seccomp is the right tool. For restricting which files a process on Linux can access by path, Landlock is the right tool. For the structural guarantee that a process cannot name resources it was not explicitly given authority over, Capsicum is the most coherent single-mechanism answer available, and it remains the strongest solution to ambient authority as a category.

The original Capsicum paper observed in 2010 that POSIX ambient authority makes security retrofits extremely difficult. That observation has not dated. seccomp and Landlock together bring Linux considerably closer to addressing it, but the fact that they are separate mechanisms developed a decade apart reflects that the original problem was never designed away. It was worked around, layer by layer, as the gaps became impossible to ignore.
