The Networking Stack Behind pfSense, Netflix's CDN, and Three Decades of Firewall Appliances
Source: hackernews
When a recent post on it-notes.dragas.net surfaced on Hacker News and pulled 500 points, the thread predictably split between people enumerating FreeBSD’s coherent codebase and those asking why it lost the Linux wars. Both conversations are worth having. But the one I find more technically interesting concerns a narrower question: why do the people building firewall appliances, ISP broadband gateways, and high-throughput CDN nodes keep reaching for FreeBSD specifically when they need to do something serious with packets?
The answer is not a single feature. It is a combination of networking primitives that accumulated in FreeBSD over thirty years and compose unusually well.
Two Firewalls, Different Philosophies
FreeBSD ships two mature, in-kernel packet filters: IPFW and PF. Having both is not redundancy for its own sake; they represent genuinely different design philosophies.
IPFW is FreeBSD’s native firewall, present since FreeBSD 2.x. Rules are numbered 1 to 65535 (65535 is reserved for the default rule) and evaluated sequentially, with the first match winning. Stateful rules use dynamic rule generation: a keep-state keyword causes the kernel to create per-connection state entries, which are checked before the static ruleset via an explicit check-state rule. IPFW integrates with dummynet, a traffic-shaping subsystem created by Luigi Rizzo at the University of Pisa, which landed in FreeBSD around 1998.
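A minimal stateful ruleset illustrates the interplay: check-state consults the dynamic table first, and keep-state on the outbound rule populates it. A sketch (rule numbers are arbitrary):

```sh
# Consult dynamic (per-connection) rules before the static ones
ipfw add 100 check-state
# Allow outbound TCP from the LAN; "setup" matches the initial SYN,
# keep-state creates the dynamic entry for the rest of the connection
ipfw add 200 allow tcp from 192.168.1.0/24 to any setup keep-state
# Everything else is dropped
ipfw add 65000 deny ip from any to any
```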
Dummynet works through pipes and queues. A pipe is a configurable bandwidth limiter with delay and packet loss parameters; traffic directed into a pipe experiences those characteristics in-kernel, without any userspace involvement:
# 1 Mbit/s pipe with 50ms RTT emulation and 1% random loss
ipfw pipe 1 config bw 1Mbit/s delay 50ms plr 0.01
# Send traffic to a specific host through that pipe
ipfw add 1000 pipe 1 ip from any to 203.0.113.5
ipfw add 1010 pipe 1 ip from 203.0.113.5 to any
This makes a FreeBSD box inserted inline between two network segments a precise WAN emulator with no additional software. Schedulers available through dummynet include FIFO, WF2Q+, Round Robin, QFQ, FQ-CoDel, and FQ-PIE, selectable per-queue.
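Queues and schedulers layer on top of pipes: a pipe sets the aggregate rate, a scheduler decides how queues share it, and weighted queues classify traffic. A sketch, with syntax as described in ipfw(8) (the numbers and weights here are illustrative):

```sh
# 10 Mbit/s aggregate, shared via the WF2Q+ scheduler
ipfw pipe 2 config bw 10Mbit/s
ipfw sched 2 config pipe 2 type wf2q+
ipfw queue 10 config sched 2 weight 90    # interactive traffic
ipfw queue 20 config sched 2 weight 10    # bulk traffic
# Classify traffic into the weighted queues
ipfw add 2000 queue 10 tcp from any to any 22
ipfw add 2010 queue 20 tcp from any to any 80,443
```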
PF arrived from a different direction. The OpenBSD team developed PF in 2001 as a replacement for IPFilter after a license dispute, and it was ported to FreeBSD starting with 5.3 (2004). PF uses a declarative, macro-based syntax with a fundamentally different approach to NAT, state tracking, and traffic normalization. The scrub directive reassembles fragmented packets before rule evaluation, a security measure that predates most application-layer firewalls:
# Macros and tables
ext_if = "em0"
tcp_services = "{ 22, 80, 443 }"
table <brute_force> persist
# Normalize and reassemble fragments before rule evaluation
scrub in all fragment reassemble
# NAT outbound
nat on $ext_if from 192.168.1.0/24 to any -> ($ext_if)
# Drop anything the overload rule below has already caught
block in quick from <brute_force>
# Stateful pass for the listed services
pass in on $ext_if proto tcp to ($ext_if) port $tcp_services
# Tighter rule for SSH with connection rate limiting (last match wins)
pass in on $ext_if proto tcp to ($ext_if) port 22 \
    keep state (max-src-conn 5, max-src-conn-rate 3/10, \
        overload <brute_force> flush global)
The overload mechanism moves source IPs that exceed the rate limit into a named table, which can then be blocked by another rule. This kind of composable, table-driven policy is harder to express cleanly in IPFW’s numbered-rule model.
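The resulting table can be inspected and pruned at runtime with pfctl’s table subcommands (the address here is an example from the documentation range):

```sh
# Show the offenders PF has collected
pfctl -t brute_force -T show
# Release one address, or flush the whole table
pfctl -t brute_force -T delete 198.51.100.7
pfctl -t brute_force -T flush
```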
pfSense (now based on FreeBSD 14) and OPNsense are built around PF as their core firewall engine. Both projects also depend on pfsync, a protocol and pseudo-interface that synchronizes PF state tables between HA peers, enabling stateful failover without TCP resets when the primary node goes down.
CARP: Failover Without Patent Liability
High availability in network appliances requires address failover: multiple physical nodes share a virtual IP address, and traffic continues flowing when the primary dies. The IETF standard for this, VRRP, was encumbered by Cisco patents. OpenBSD’s team designed CARP (Common Address Redundancy Protocol) in 2003 specifically to accomplish the same thing without licensing exposure; it was later ported to FreeBSD, where it remains in the base system today.
CARP uses multicast advertisements on 224.0.0.18. The master sends periodic announcements; backups listen and promote themselves if the master stays silent past a configurable dead interval. All CARP messages are authenticated with HMAC-SHA1 keyed on a shared password, preventing rogue hosts from hijacking addresses.
The priority model uses two tunables, advbase (advertisement interval in seconds) and advskew (a skew of 0-255): a node advertises every advbase + advskew/256 seconds, and the node with the lowest effective interval becomes master. On a two-node pair:
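The arithmetic is simple enough to check directly; a quick sketch of the effective intervals at advbase 1:

```sh
# Effective advertisement interval = advbase + advskew/256 seconds;
# the lowest interval wins mastership
awk 'BEGIN { for (skew = 0; skew <= 200; skew += 100) printf "advskew %3d -> %.2f s\n", skew, 1 + skew / 256 }'
```

At advskew 100 a backup advertises roughly every 1.39 seconds, comfortably slower than the primary’s 1.00 seconds, so it never preempts a healthy master.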
# /etc/rc.conf on primary (advskew 0 = highest priority).
# Since FreeBSD 10, CARP is configured as an option on an address,
# not as a cloned carp0 interface (load carp(4) with kldload carp).
ifconfig_em0_alias0="inet vhid 1 pass shared_secret advskew 0 alias 192.168.1.1/32"
# Secondary (advskew 100 = lower priority)
ifconfig_em0_alias0="inet vhid 1 pass shared_secret advskew 100 alias 192.168.1.1/32"
CARP also supports demotion counters tied to interface tracking. If the primary’s upstream link fails, its demotion counter increments, raising its effective advskew and triggering failover, even though the CARP interface itself remains up. pfSense and OPNsense expose all of this in their high-availability configuration pages, and it works reliably enough that two-node FreeBSD firewall pairs with sub-second failover are standard practice in production.
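The demotion counter is visible and adjustable at runtime through sysctl; a sketch, assuming the names documented in carp(4) (exact write semantics vary, treat this as illustrative):

```sh
# Current demotion value, summed across all demotion reasons
sysctl net.inet.carp.demotion
# Manually raise the counter to force failover during maintenance
sysctl net.inet.carp.demotion=+100
```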
Netgraph: The Framework With No Linux Equivalent
Netgraph is a kernel-space, graph-based networking framework introduced in FreeBSD 3.4 (1999) by Archie Cobbs and Julian Elischer. It has no real counterpart in Linux. The core idea is that networking functions are implemented as composable kernel modules called nodes, connected via named hooks. Packets, as kernel mbufs, flow between nodes along hook connections without touching the socket layer or returning to userspace.
Node types include Ethernet interfaces (ng_ether), Layer 2 bridges, PPP stacks, PPPoE encapsulation, L2TP, MPPC compression, BPF filter nodes, NAT, VLAN tagging, and others. The topology is built and reconfigured at runtime via ngctl:
# Create a PPPoE client graph.
# ng_ether's "orphans" hook hands non-IP Ethernet frames (like PPPoE)
# to its peer; ng_pppoe's hook toward the wire is named "ethernet".
ngctl mkpeer em0: pppoe orphans ethernet
ngctl name em0:orphans pppoe0
# Per-session hooks on pppoe0 are then connected to ng_ppp and ng_iface
# nodes; in practice a control daemon such as mpd5 or ppp(8) wires these
# up, while the data path itself stays entirely in the kernel
The entire PPPoE client stack, including negotiation, compression, and routing, runs in-kernel as a graph of nodes. ISPs that use FreeBSD as broadband network gateway equipment run Netgraph with ng_pppoe to terminate thousands of subscriber sessions without any userspace daemon involvement in the data path.
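Once built, the graph can be inspected live with ngctl’s standard subcommands:

```sh
# List every node in the kernel's netgraph
ngctl list
# Show a node's type, ID, and connected hooks (assumes the pppoe0 name above)
ngctl show pppoe0:
```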
Linux handles PPPoE through pppd and a kernel module, which is functional but architecturally different: session negotiation and management live in userspace, and in the traditional user-mode setup the data itself exits the kernel, passes through pppd, and re-enters. For high subscriber density, that per-session userspace involvement and context-switch overhead matter. Netgraph eliminates them by design.
VNET: Per-Jail Network Stacks
FreeBSD jails have supported full virtual network stack isolation since FreeBSD 8.0 (2009), via VNET. Rather than sharing global kernel networking state, a VNET jail gets its own interface list, routing table, ARP cache, TCP and UDP connection tables, firewall state, and sysctl namespace for network tunables.
The implementation uses macro-based variable virtualization: kernel variables that would otherwise be globals are accessed through a per-VNET context pointer stored in the current thread. From inside a jail, ifconfig shows only the interfaces assigned to that jail; netstat shows only that jail’s connections; pfctl manages only that jail’s PF instance.
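Creating a minimal VNET jail from the command line takes a single invocation; a sketch, using a throwaway jail named myjail rooted at the host filesystem:

```sh
# "vnet" gives the jail its own network stack;
# "persist" keeps it alive even with no processes running inside
jail -c name=myjail path=/ vnet persist
```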
Virtual Ethernet pairs (epair) provide connectivity between jails and the host:
# Create a virtual Ethernet pair; the command prints the new host-side
# name (e.g. epair0a, with epair0b as the other end)
ifconfig epair create
# epair0a stays in the host; epair0b moves into the jail
ifconfig epair0b vnet myjail
# Inside the jail, configure the interface normally
jexec myjail ifconfig epair0b inet 10.0.0.2/24
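To give the jail outside connectivity, the host-side leg is typically added to a bridge alongside the physical NIC; a sketch, assuming em0 is the host’s wired interface:

```sh
# Bridge the host-side epair leg with the physical interface
ifconfig bridge0 create
ifconfig bridge0 addm em0 addm epair0a up
ifconfig epair0a up
```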
Tools like Bastille and iocage automate this wiring with ZFS-backed jail filesystems. The result is container-style isolation where the network stack is fully independent, without the assembly-of-namespaces model Linux uses for Docker and LXC.
What Netflix Put Into the Stack
Netflix runs FreeBSD on its Open Connect Appliance CDN nodes, co-located at ISPs to deliver streaming traffic. Their engineering contributions to FreeBSD’s networking stack represent some of the most significant performance work in the codebase.
The RACK TCP stack, written primarily by Randall Stewart (a long-time FreeBSD committer who works at Netflix), implements RFC 8985’s time-based loss detection. FreeBSD supports pluggable TCP stacks, switchable per-socket or system-wide:
# Load the RACK stack module and make it the system-wide default
kldload tcp_rack
sysctl net.inet.tcp.functions_default=rack
# List the TCP stacks the running kernel offers
sysctl net.inet.tcp.functions_available
RACK uses recent-acknowledgment timing rather than duplicate-ACK counting to detect loss, which performs better on high-bandwidth paths and reduces tail latency through Tail Loss Probes sent before the retransmission timeout expires.
The more impactful contribution for raw throughput is the combination of sendfile and kernel TLS. FreeBSD’s sendfile(2) syscall maps file pages from the VM page cache directly into mbuf structures as DMA references, avoiding any copy between kernel and userspace. The NIC’s DMA engine reads directly from the page cache onto the wire.
Kernel TLS (landed in FreeBSD 12.0, substantially improved in 13.x) hands TLS record encryption to the kernel after the handshake completes in userspace. On capable NICs (Chelsio T6, Mellanox/NVIDIA ConnectX-5 and later), encryption is offloaded entirely to the card:
/* Application configures TLS normally; kernel takes over encryption after handshake */
SSL_CTX *ctx = SSL_CTX_new(TLS_server_method());
SSL_CTX_set_options(ctx, SSL_OP_ENABLE_KTLS);
/* SSL_write() now routes records through the kernel TLS path */
With sendfile and ktls combined: file data is read by DMA from disk into the page cache, TLS encryption happens in the kernel or on the NIC, and encrypted data is DMA’d to the wire. CPU involvement in the data path is minimal. Netflix has published OCA throughput figures of 200+ Gbps per server on modern hardware with this path active, with CPU utilization remaining well under the levels required by userspace TLS at equivalent throughput.
Linux gained kTLS support in kernel 4.13 (2017), but FreeBSD’s sendfile and ktls integration, particularly the NIC offload path, has been more mature and production-validated for longer.
Why This Accumulation Matters
None of these pieces is magic in isolation. PF, CARP, Netgraph, VNET, sendfile, and ktls are all technically achievable in Linux, and Linux equivalents exist in various forms. The difference is depth of integration and cohesion.
PF’s pfsync works cleanly with CARP because both are maintained in the same project. VNET jails run their own PF instance because the same team that added per-VNET state management also maintained PF. Netflix’s ktls work landed in the base system and improved sendfile because both are owned by the same project, not coordinated across an upstream kernel team, a libc team, and a distribution.
For teams building network appliances, firewalls, or CDN infrastructure, that integration density means fewer integration bugs, fewer surprise interactions between components at version boundaries, and man pages that describe the actual behavior of the installed system. That may not matter for a web application deployment. For a device where the entire product is the network stack, it matters considerably.