
Ambient Authority Is the Root Problem: What Capsicum and seccomp Disagree About

Source: lobsters

Process sandboxing has two dominant schools of thought on Linux and FreeBSD, and they are frequently compared as though they were interchangeable mitigations doing the same job at different levels of granularity. They are not. Capsicum and seccomp differ on a more fundamental question: what is the actual source of a process’s dangerous authority, and where should it be removed?

Understanding that difference matters for anyone building sandboxed applications today, because the choice of mechanism shapes what guarantees you can make about a compromised process.

The Problem with Ambient Authority

In a conventional UNIX process, authority is ambient. A process running as a particular user can access any file that user can read, open any socket, send any signal to any process with the same UID, and invoke any syscall the kernel exposes. None of this authority needs to be explicitly granted at runtime; the process inherits it by virtue of its identity. The kernel consults credentials on every operation and decides whether to permit it.

This is convenient, but it means that a compromised process immediately inherits everything the running user could do. An attacker who gains control of a process serving PDF rendering, or a grammar checker, or a codec decoder gets the full ambient authority of that user account. The process never needed most of that authority to do its job; it simply had it by default.

Capsicum’s core argument, laid out in the 2010 USENIX Security paper by Watson, Anderson, Laurie, and Kennaway (a collaboration between the University of Cambridge and Google), is that ambient authority is the structural defect. The correct fix is to make all authority explicit: force every resource the process needs to be materialized as a capability (a file descriptor with specific rights), then revoke access to all global namespaces. After that, the process can only operate on objects it demonstrably holds references to.

seccomp takes a different position. It does not try to eliminate ambient authority. Instead, it reduces the syscall surface available to a process, limiting which kernel interfaces an exploit can reach. A process with a seccomp filter installed can still have broad ambient authority over files and network resources; the filter constrains which syscalls can be invoked to exercise that authority. The goal is attack surface reduction rather than capability confinement.

These are both useful properties. They are not the same property.

How Capsicum Works

Capsicum (shipped as an experimental feature in FreeBSD 9.0 in 2012 and enabled by default since FreeBSD 10.0 in 2014) provides two primitives.

The first is capability mode, entered via cap_enter(2). This call is irreversible: once made, the process cannot open files by path, bind or connect sockets by address, send signals to other processes by PID, or access any other global kernel namespace. Any such attempt fails with ECAPMODE. The process continues executing, but it can only operate on file descriptors it already holds.

The second is capability rights restriction, via cap_rights_limit(2). Each open fd can have a set of rights (a cap_rights_t, which replaced the 64-bit mask of the original FreeBSD 9 interface) applied to it, narrowing what operations are valid on that descriptor. The set can only be narrowed, never expanded:

#include <sys/capsicum.h>

cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_FSTAT);
cap_rights_limit(fd, &rights); // fd can now only be read from or fstat'd

cap_enter(); // no global namespaces from this point

The design pattern is consistent: before calling cap_enter, the process opens every file, socket, and device it will need, restricts each fd to the minimum necessary rights, then discards all ambient authority. Pre-opening resources and materializing authority as explicit capabilities is the entire model.
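The pre-open, restrict, enter sequence can be wrapped in a small helper. The following is a minimal sketch rather than a reference implementation: the function name open_then_sandbox is hypothetical, and the non-FreeBSD branch simply reports ENOSYS so that the same source still compiles on other systems.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#if defined(__FreeBSD__)
#include <sys/capsicum.h>
#endif

/*
 * Hypothetical helper: open `path` read-only, narrow the descriptor to
 * CAP_READ and CAP_FSTAT, then enter capability mode.
 * Returns the restricted fd, or -1 with errno set.
 */
int open_then_sandbox(const char *path)
{
#if defined(__FreeBSD__)
    /* Step 1: acquire the resource while ambient authority still exists. */
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* Step 2: narrow the descriptor to the minimum rights needed. */
    cap_rights_t rights;
    cap_rights_init(&rights, CAP_READ, CAP_FSTAT);

    /* Step 3: give up every global namespace. This cannot be undone. */
    if (cap_rights_limit(fd, &rights) < 0 || cap_enter() < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
#else
    /* Assumption for this sketch: no Capsicum outside FreeBSD. */
    (void)path;
    errno = ENOSYS;
    return -1;
#endif
}
```

On FreeBSD, once the helper returns, read(2) and fstat(2) on the returned descriptor keep working, write(2) on it fails with ENOTCAPABLE, and any further open(2) by path fails with ECAPMODE.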

The FreeBSD base system has adopted this broadly. Since FreeBSD 10 and 11, utilities including tcpdump, grep, sort, gzip, and dhclient use Capsicum. For processes that legitimately need services requiring global namespace access after entering capability mode (DNS resolution, syslog, group lookups), FreeBSD provides the Casper library (libcasper), which runs a supervised helper process that the sandboxed process communicates with over a socket. The Capsicum process holds a CAP_READ/CAP_WRITE fd to the Casper socket; Casper performs the privileged operation and returns the result.

How seccomp Works

seccomp (Secure Computing Mode) entered Linux in 2.6.12 in 2005 as a strict mode limiting processes to four syscalls: read, write, exit, and sigreturn. The practically useful seccomp-bpf mode arrived in Linux 3.5 in 2012, driven largely by the Chrome sandbox team.

In filter mode, the process installs a classic BPF program that runs on every syscall entry. The program receives a struct seccomp_data containing the syscall number, architecture, instruction pointer, and six register arguments. It returns an action:

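#include <linux/audit.h>    /* AUDIT_ARCH_X86_64 */
#include <linux/filter.h>   /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>
#include <stddef.h>         /* offsetof */
#include <sys/prctl.h>
#include <sys/syscall.h>    /* SYS_seccomp, __NR_read */
#include <unistd.h>         /* syscall */

/* Required before an unprivileged process may install a filter. */
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

struct sock_filter filter[] = {
    /* Load the architecture; kill the process unless it is x86-64. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* Load the syscall number; allow read, kill on anything else. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_THREAD),
};
struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
};
/* glibc provides no seccomp() wrapper, so the call goes through syscall(2). */
syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);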

Note the architecture check at the start. This is mandatory: syscall numbers vary across architectures (on x86-64, open is syscall 2; on 32-bit x86, it is syscall 5; ARM64 has no open syscall at all and offers only openat, which is syscall 56), and a 64-bit process can invoke 32-bit syscalls via int 0x80, bypassing an x86-64-numbered filter entirely unless the AUDIT_ARCH value is verified first. The libseccomp library handles this portability concern automatically when generating filters from higher-level rules.

Filters stack: each seccomp(SECCOMP_SET_MODE_FILTER, ...) call adds to the chain, with all filters running and the most restrictive result winning. The NO_NEW_PRIVS requirement prevents a sandboxed process from regaining privileges through setuid exec.
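The stacking rule is easy to check empirically. The sketch below is an illustration under stated assumptions, not production code: the names install_filter, stacked_filters_demo, and NATIVE_AUDIT_ARCH are invented for the example, and only x86-64 and ARM64 are handled. A forked child installs a filter that fails getppid with EPERM, then stacks an allow-everything filter on top and checks that the denial still holds.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* constant for this architecture"
#endif

/* Install one filter program; returns 0 on success, -1 on failure. */
static int install_filter(struct sock_filter *insns, unsigned short len)
{
    struct sock_fprog prog = { .len = len, .filter = insns };
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}

/*
 * Returns 0 when getppid is still denied after an allow-all filter has
 * been stacked on top of the denying one; other values signal a failure.
 */
int stacked_filters_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sock_filter deny_getppid[] = {
            /* Refuse to run under an unexpected architecture. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, arch)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
            /* Fail getppid with EPERM, allow everything else. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
            BPF_STMT(BPF_RET | BPF_K,
                     SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_filter allow_all[] = {
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
            _exit(2);
        if (install_filter(deny_getppid,
                sizeof(deny_getppid) / sizeof(deny_getppid[0])) != 0)
            _exit(3);
        /* A later, looser filter cannot undo the earlier denial. */
        if (install_filter(allow_all, 1) != 0)
            _exit(4);

        errno = 0;
        long r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Running the experiment in a child keeps the calling process unfiltered, which matters because an installed filter can never be removed.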

seccomp adoption is extremely broad on Linux. Chrome has used seccomp-bpf for renderer processes since 2012. Firefox followed for content processes in 2016. systemd uses SystemCallFilter= directives across many of its own services. Android has applied a seccomp filter to app processes since Oreo. Docker ships a default profile blocking roughly 44 syscalls. The ecosystem around seccomp is mature.

The Confused Deputy Problem

Norm Hardy described the confused deputy problem in 1988: a program acting on behalf of two principals can be tricked into using authority granted by one principal for the benefit of the other. A compiler with write access to system files, taking an output path from user input, could be tricked into writing to /etc/passwd by a user who supplies that path. The compiler holds the authority to write there; it just shouldn’t be exercising it in that context.

Capsicum eliminates this class of attack by construction. A Capsicum-sandboxed process holds no authority to resources it was not explicitly given. An attacker who compromises a capability-confined PDF renderer gets exactly the fds that process held when it entered capability mode, nothing else. There is no ambient authority to confuse. The renderer cannot write to /etc/passwd because it holds no fd for that file and can no longer open files by path.

seccomp does not address this. If write is in the allowed syscall list, the sandboxed process can write to any file descriptor it holds. If a process has an open fd to a sensitive file at the time the filter is installed, seccomp places no constraint on using it. The mechanism restricts the vocabulary of syscalls, not the objects those syscalls can be applied to. For attack surface reduction this is useful; for the confused deputy problem it provides no formal guarantee.

This distinction is not theoretical. Consider a process that opens a configuration file, a log file, and a database socket before installing a seccomp filter. The filter allows read, write, close, and a short list of others. An attacker who exploits a memory-corruption bug in that process can write to the database socket or log file using the allowed write syscall. Under Capsicum’s model, the same process would hold each resource as a capability fd with narrowed rights, so the attacker could perform only the operations originally granted on each descriptor: an fd limited to CAP_READ, for instance, cannot be written to no matter which syscalls are reachable.
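The same point holds even under seccomp's original strict mode, the tightest setting it offers. The sketch below is a minimal demonstration, not a hardening recipe; the function name strict_mode_still_writes is hypothetical, and a pipe stands in for whatever sensitive descriptor the process happened to hold when it sandboxed itself.

```c
#include <linux/seccomp.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a child confined by seccomp strict mode could still write
 * to a descriptor it already held, 0 if it could not, -1 on setup failure.
 */
int strict_mode_still_writes(void)
{
    static const char msg[] = "tampered";
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(fds[0]);
        /* From here on only read, write, exit and sigreturn are allowed. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);
        /* write is permitted, and seccomp does not care which fd it hits. */
        ssize_t w = write(fds[1], msg, sizeof(msg) - 1);
        /* _exit() would call exit_group, which strict mode forbids. */
        syscall(SYS_exit, w == (ssize_t)(sizeof(msg) - 1) ? 0 : 1);
    }

    close(fds[1]);
    char buf[16] = { 0 };
    ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
    close(fds[0]);

    int status;
    waitpid(pid, &status, 0);

    return n == (ssize_t)(sizeof(msg) - 1) &&
           memcmp(buf, msg, sizeof(msg) - 1) == 0;
}
```

The filter constrains which syscalls run, so the only defence against this write is to make sure the descriptor is closed, or was never opened, before the sandbox is entered.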

Landlock: Linux Moves Toward Object-Level Control

Landlock, merged into Linux 5.13 in 2021, reflects an explicit effort to bring something closer to Capsicum’s model to Linux. Rather than filtering syscalls, Landlock restricts which filesystem paths a process can access and (since Linux 6.7, in version 4 of the ABI) which TCP ports it can bind or connect to.

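#define _GNU_SOURCE          /* for O_PATH */
#include <fcntl.h>
#include <linux/landlock.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Rights the ruleset governs; a handled right not granted by a rule is denied. */
struct landlock_ruleset_attr attr = {
    .handled_access_fs =
        LANDLOCK_ACCESS_FS_READ_FILE |
        LANDLOCK_ACCESS_FS_WRITE_FILE,
};
int ruleset_fd = syscall(SYS_landlock_create_ruleset, &attr, sizeof(attr), 0);

/* Grant read-only access to files beneath /etc. */
struct landlock_path_beneath_attr pb = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
    .parent_fd = open("/etc", O_PATH | O_CLOEXEC),
};
syscall(SYS_landlock_add_rule, ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &pb, 0);
close(pb.parent_fd);

/* Required before an unprivileged process may restrict itself. */
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
syscall(SYS_landlock_restrict_self, ruleset_fd, 0);
close(ruleset_fd);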

Landlock is path-based rather than fd-based, which is more approachable but less strict than Capsicum. A process constrained by Landlock still has ambient authority over every path it was explicitly permitted; Capsicum eliminates the ambient namespace entirely. Landlock is best understood as a complement to seccomp rather than a replacement for it, and the two can be layered in the same sandbox: seccomp narrows the syscall surface while Landlock narrows the objects reachable through the syscalls that remain.

The ABI versioning of Landlock also highlights a practical concern: new access rights arrive with successive kernel releases (LANDLOCK_ACCESS_FS_TRUNCATE in 6.2, network controls in 6.7), and applications must query the supported ABI version (by calling landlock_create_ruleset with the LANDLOCK_CREATE_RULESET_VERSION flag) and handle missing rights gracefully to remain portable across kernel versions. Capsicum on FreeBSD has a more stable API surface by comparison, owing partly to the BSD model of shipping the kernel and base system together.
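A version probe with graceful degradation might look like the following. This is a sketch under assumptions: the helper names landlock_abi_version and fs_rights_for_abi are invented for the example, only the truncate right mentioned above is handled, and the fallback definition exists solely so the code builds against kernel headers older than 6.2.

```c
#include <linux/landlock.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback for older kernel headers; the value matches the upstream header. */
#ifndef LANDLOCK_ACCESS_FS_TRUNCATE
#define LANDLOCK_ACCESS_FS_TRUNCATE (1ULL << 14)
#endif

/* Returns the Landlock ABI version, or 0 when Landlock is unavailable. */
int landlock_abi_version(void)
{
    long abi = syscall(SYS_landlock_create_ruleset, NULL, 0,
                       LANDLOCK_CREATE_RULESET_VERSION);
    return abi < 0 ? 0 : (int)abi;
}

/*
 * Hypothetical helper: remove rights that the running kernel's ABI does
 * not know about, so that ruleset creation does not fail on older kernels.
 */
uint64_t fs_rights_for_abi(int abi, uint64_t wanted)
{
    if (abi < 3)    /* truncate control arrived with ABI version 3 */
        wanted &= ~(uint64_t)LANDLOCK_ACCESS_FS_TRUNCATE;
    return wanted;
}
```

A caller would pass the result as handled_access_fs when creating the ruleset, and skip Landlock altogether when the reported version is 0. Silently dropping a right weakens the sandbox on older kernels, so an application with strict requirements may prefer to refuse to start instead.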

OpenBSD’s Pragmatic Middle Ground

OpenBSD’s pledge(2) (OpenBSD 5.9, 2016) and unveil(2) (OpenBSD 6.4, 2018) offer a different trade-off. Rather than requiring applications to be restructured around explicit capability objects, pledge lets a process declare a set of high-level promise categories:

pledge("stdio rpath inet", NULL);

From that point, the kernel kills the process if it invokes any syscall outside the declared promises. unveil restricts the filesystem namespace to a declared set of paths, analogous to Landlock but with a simpler API and integrated into the pledge model.
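Combining the two calls takes only a few lines. The sketch below assumes a hypothetical helper name, restrict_to_readonly_dir, and reuses the promise string shown earlier; the non-OpenBSD branch reports ENOSYS so the source still compiles elsewhere.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: expose a single directory read-only, lock the
 * unveil list, then pledge. Returns 0 on success, -1 with errno set.
 */
int restrict_to_readonly_dir(const char *dir)
{
#if defined(__OpenBSD__)
    /* Only `dir` remains visible in the filesystem, and only for reading. */
    if (unveil(dir, "r") == -1)
        return -1;
    /* Two NULLs lock the list: no further unveil calls are accepted. */
    if (unveil(NULL, NULL) == -1)
        return -1;
    /* Anything outside these promise categories now kills the process. */
    return pledge("stdio rpath inet", NULL);
#else
    /* Assumption for this sketch: no pledge or unveil outside OpenBSD. */
    (void)dir;
    errno = ENOSYS;
    return -1;
#endif
}
```

The order matters: unveil has to run first, because a pledge that omits the unveil promise also forbids any later unveil call.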

The practical result is that nearly every utility in OpenBSD’s base system uses pledge, because adding a single call requires little architectural change. The granularity is coarser than Capsicum (you permit the inet category rather than specific socket fds with specific rights), but the porting burden is far lower. Theo de Raadt’s design explicitly prioritized adoption over theoretical completeness, and the OpenBSD base system shows the payoff.

Capsicum is theoretically superior at expressing minimal authority, but that superiority has a cost. Porting an application requires pre-opening all needed resources, restructuring initialization sequences, and often integrating with Casper for services that need global namespace access. The result is correct and verifiable, but the barrier to entry is high enough that adoption outside FreeBSD’s base system has remained limited.

Choosing Between Them

For anyone building on Linux today, the decision space is mostly seccomp plus Landlock, with the formal properties of Capsicum available only if you are targeting FreeBSD. The mechanisms are not in competition; they address different layers. seccomp reduces syscall attack surface, Landlock restricts filesystem and network access by path and port, and neither eliminates ambient authority the way Capsicum does.

The comparison at vivianvoss.net covers the mechanical differences clearly. The more durable takeaway is that the two mechanisms embody genuinely different theories of what sandboxing is for. seccomp assumes that limiting the syscall vocabulary limits the damage an exploit can do. Capsicum assumes that ambient authority is the root defect and that making authority explicit is the only complete fix. Both assumptions are defensible; they lead to different tools with different guarantees.

For threat models centered on memory-corruption exploits using unexpected kernel interfaces, seccomp is mature, widely supported, and battle-tested. For threat models where a compromised component must not be able to reach resources beyond its stated purpose, Capsicum’s object-capability model provides the right primitives, even if it requires more architectural investment to use correctly.
