
How NetBSD's TCP Stack Loses Throughput and What It Takes to Get It Back

Source: lobsters

BSD networking has a long and genuinely impressive history. Van Jacobson’s congestion control work, which landed in 4.3BSD in 1988, changed the internet. The socket API that came out of Berkeley became the universal interface every OS copies. But that history also means the code carries decisions made for hardware that no longer exists, and those decisions compound over time in ways that are hard to notice until you run a benchmark and wonder where your throughput went.

This two-part series on BSD TCP performance is an unusually detailed look at what that actually means in practice for NetBSD. Part one diagnoses the problems; part two fixes them. What makes it worth reading carefully is that it shows the work as it happened rather than just presenting the result.

The Bandwidth-Delay Product Problem

TCP throughput is bounded by a simple relationship: the rate at which you can push data is constrained by the size of the pipe you’re willing to fill. That size is the bandwidth-delay product, the amount of data that can be in flight at once given the link speed and round-trip latency. On a 1 Gbps link with a 10ms RTT, you need roughly 1.25 MB of in-flight data to saturate the link. The socket send and receive buffers have to be large enough to hold that.
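The arithmetic above is easy to check; a minimal sketch using the example numbers:

```shell
# Bandwidth-delay product: how many bytes must be in flight to fill the link.
bandwidth_bps=1000000000            # 1 Gbps
rtt_ms=10                           # 10 ms round trip
# BDP (bytes) = bandwidth (bits/s) * RTT (s) / 8 bits per byte;
# dividing by 1000 converts the RTT from milliseconds to seconds.
bdp_bytes=$(( bandwidth_bps / 1000 * rtt_ms / 8 ))
echo "need ~${bdp_bytes} bytes in flight"   # 1250000, i.e. ~1.25 MB
```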

NetBSD’s defaults for net.inet.tcp.sendspace and net.inet.tcp.recvspace have historically been conservative. The values that made sense on the hardware and link speeds of the 1990s are inadequate for a machine with a modern NIC. When the buffer fills before the window opens far enough, the sender stalls waiting for acknowledgments, and throughput collapses well below what the hardware can deliver.

FreeBSD addressed this years ago with auto-tuning: the socket buffer size grows dynamically up to kern.ipc.maxsockbuf based on observed throughput. Linux has had similar auto-scaling since at least 2.6. NetBSD’s approach has been more static, which means a knowledgeable administrator can tune it manually, but the defaults leave performance on the table.
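Until NetBSD grows equivalent auto-tuning, the static knobs can be raised by hand. A hedged sketch, using the sysctl names from above; the 2 MB values are illustrative, not recommendations, and should be derived from your own path's BDP:

```shell
# Raise NetBSD's static TCP buffer defaults (illustrative values; run as root).
sysctl -w net.inet.tcp.sendspace=2097152
sysctl -w net.inet.tcp.recvspace=2097152
# To persist across reboots, add the same key=value lines to /etc/sysctl.conf.
```

Note that the new defaults apply only to sockets created after the change; existing connections keep the buffers they were given.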

This isn’t a NetBSD-specific failure so much as the accumulated cost of a slower development pace. The fixes exist elsewhere; integrating them takes someone willing to sit down with the kernel source and trace the path from tcp_output through the socket buffer machinery.

Socket Buffer Internals in BSD

The BSD socket buffer is built around a struct sockbuf, which tracks the current byte count (sb_cc), the high-water mark (sb_hiwat), and a chain of mbufs. The high-water mark is what the user-space SO_SNDBUF and SO_RCVBUF options control. When sb_cc approaches sb_hiwat, the send path blocks or returns EWOULDBLOCK.

The issue with static sizing is that sb_hiwat gets set at connection time and typically doesn’t change. If it’s set too low relative to the bandwidth-delay product of the actual path, the window never gets large enough to keep the link busy. You can verify this by watching the TCP window size in tcpdump or netstat -s; if it’s consistently smaller than the BDP, the buffer is the bottleneck.
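The ceiling a too-small window imposes is worth making concrete: throughput cannot exceed window / RTT, no matter how fast the link is. A sketch with an illustrative 64 KB window on the 10 ms path from earlier:

```shell
# A window smaller than the BDP caps throughput at window / RTT,
# regardless of link speed.
window_bytes=65536    # e.g. a 64 KB advertised window seen in tcpdump
rtt_ms=10
# ceiling (bits/s) = window (bytes) * 8 / RTT (s)
max_bps=$(( window_bytes * 8 * 1000 / rtt_ms ))
echo "throughput ceiling: ${max_bps} bits/s"   # 52428800, ~52 Mbps on a 1 Gbps link
```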

FreeBSD’s auto-tuning adjusts sb_hiwat up dynamically when it detects that the connection is bandwidth-constrained rather than buffer-constrained, up to a ceiling controlled by kern.ipc.maxsockbuf (which defaults to 2 MB in recent FreeBSD versions, compared to much lower historical defaults). Getting equivalent behavior in NetBSD requires either the same auto-tuning logic or at minimum raising the defaults to something reasonable for modern hardware.
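The growth policy itself is simple. Here is a toy sketch of doubling-with-a-ceiling behavior; it is not FreeBSD's actual algorithm, which also checks whether the connection is genuinely bandwidth-constrained before growing:

```shell
# Toy model: grow the high-water mark by doubling, clamped at a maxsockbuf ceiling.
sb_hiwat=65536         # illustrative starting buffer size
maxsockbuf=2097152     # FreeBSD's kern.ipc.maxsockbuf default (2 MB)
while [ "$sb_hiwat" -lt "$maxsockbuf" ]; do
  sb_hiwat=$(( sb_hiwat * 2 ))
  # never exceed the administrator-set ceiling
  if [ "$sb_hiwat" -gt "$maxsockbuf" ]; then sb_hiwat=$maxsockbuf; fi
done
echo "final high-water mark: ${sb_hiwat} bytes"
```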

TSO, Checksum Offloading, and the Ways Hardware Help Can Hurt

TCP Segmentation Offload is one of those features that is almost always a net win: instead of the kernel constructing many small segments, it hands a large buffer to the NIC, and the NIC segments it according to the MSS before transmission. This reduces CPU overhead and often improves throughput substantially.

But TSO interacts badly with several things. If the hardware’s TSO implementation has bugs, or if the driver doesn’t correctly advertise TSO capabilities, or if the TSO path in the kernel doesn’t handle certain packet formats correctly, you get either silent corruption or performance that’s worse than without TSO. Checksum offloading has similar failure modes: if the hardware computes checksums incorrectly, or if software fallbacks activate silently for offloaded traffic, you can end up retransmitting a lot.

Kernel networking performance investigations almost always include a pass where you disable TSO and checksum offloading entirely to establish a baseline, then re-enable them selectively to find which component is causing the regression. If throughput jumps when you disable TSO, the TSO path has a bug. If it jumps when you change the driver parameters, the driver is at fault. The BSDs expose most of these knobs through ifconfig flags like -tso and -rxcsum.
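The disable-then-re-enable pass looks roughly like this; wm0 and the exact flag spellings are illustrative, since capability names vary by BSD and by driver (check ifconfig(8) for your system):

```shell
# Turn hardware assists off to establish a software-only baseline:
ifconfig wm0 -tso -rxcsum -txcsum
# ...run the benchmark, then re-enable one feature at a time:
ifconfig wm0 tso
# ...benchmark again; a regression at this step isolates the TSO path.
```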

NetBSD has historically had less rigorous driver coverage for TSO compared to FreeBSD, which benefits from having more driver contributors and more scrutiny of the network paths. This is part of why performance tuning for NetBSD often involves verifying hardware offload behavior as a first step rather than assuming it works.

The Receive Path and Interrupt Handling

High-speed networking performance also depends heavily on what happens when packets arrive. The original interrupt-per-packet model, where every received packet triggers a hardware interrupt, does not scale. Linux’s NAPI (New API) switched to a polling model under high load: take one interrupt to start processing, then poll for more packets in a batch before returning to interrupt mode. FreeBSD has an equivalent mechanism through its interrupt filter and handler split, plus iflib for modern drivers.

NetBSD’s interrupt handling for network receive is worth examining when throughput is unexpectedly low. If the system is spending more time in interrupt context than processing data, the problem isn’t the TCP stack at all; it’s the driver and interrupt subsystem. Profiling tools like systat -ifstat and watching the IRQ distribution with vmstat -i can surface this. If a single CPU is handling all network interrupts while others sit idle, that’s a strong signal that interrupt steering (IRQ affinity) needs attention.
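Both checks from the paragraph above fit in two commands; interpreting the output, not running the commands, is the hard part:

```shell
# Interrupt counts per source; a NIC whose count dwarfs everything else
# suggests all receive processing is funneled through one CPU:
vmstat -i
# Live per-interface traffic rates while a benchmark runs:
systat -ifstat
```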

What the Diagnostic Methodology Looks Like

The approach in the series mirrors what any competent kernel networking investigation looks like: establish a baseline, change one thing, measure again. The tools are mostly standard: iperf3 for bulk throughput, tcpdump to verify what the stack is actually sending, netstat -s to track retransmission and window statistics, and reading the kernel source to understand what the numbers mean.
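A single measurement pass with those tools might look like this; the server address is illustrative, and netstat's exact flag spelling varies slightly across the BSDs:

```shell
# One baseline pass, assuming an iperf3 server at 192.0.2.10 (illustrative).
netstat -s -p tcp > /tmp/tcp.before   # snapshot the TCP counters
iperf3 -c 192.0.2.10 -t 30            # 30-second bulk transfer
netstat -s -p tcp > /tmp/tcp.after
# The counter deltas show retransmissions and window behavior during the run:
diff /tmp/tcp.before /tmp/tcp.after
```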

What makes this particular write-up worth citing is the level of specificity. It’s not generic advice about increasing buffer sizes; it’s a trace through what NetBSD’s implementation actually does, where it diverges from FreeBSD’s, and what specific changes bring the behavior in line. That kind of documentation is what makes it reproducible and useful to anyone who hits the same wall later.

Comparing against FreeBSD is a sound methodology because the two kernels share ancestry and the FreeBSD networking stack has received substantially more performance work in the last decade. When a FreeBSD system significantly outperforms a NetBSD system on the same hardware with the same configuration, the delta is usually in specific code paths, and FreeBSD’s history can point you toward where to look.

Where NetBSD Stands Relative to the Other BSDs

FreeBSD has been the BSD most focused on server workloads, and its networking stack reflects that. The sendfile optimization, zero-copy receive paths, SO_MAX_PACING_RATE for pacing, and TCP_NOTSENT_LOWAT for backpressure control in streaming applications all landed in FreeBSD before the other BSDs. OpenBSD has different priorities, favoring correctness and security over raw throughput. NetBSD sits in an interesting position: it’s the most portable kernel, running on an enormous range of hardware, and that breadth makes it harder to optimize for any specific case.

But portability and throughput are not actually in conflict. The buffer sizing and window management improvements that matter for performance are not architecture-specific. The work described in this series is portable by nature; it’s fixing parameters and behaviors that are wrong on every platform, not tuning for a specific CPU or NIC.

Linux, for comparison, has entire teams at major network equipment vendors contributing to its stack. Its defaults for TCP buffer auto-tuning, its BPF-based congestion control hooks, and its highly optimized receive path are the product of enormous engineering investment. The BSDs are not going to match that with volunteer effort, but that’s not the relevant comparison. The question is whether NetBSD’s stack performs reasonably on modern hardware, and the answer has too often been no, not because of any fundamental limitation, but because the tuning work hasn’t been done.

Why Documented Kernel Work Has Value Beyond the Fix

The broader point here is about how knowledge accumulates in small open-source projects. When someone tunes the Linux networking stack, the change tends to propagate into distributions quickly and the details get written up in blog posts and conference talks. In NetBSD’s world, that pipeline barely exists and the documentation is patchier.

A write-up like this one fills a gap that goes beyond the specific patches. It gives future contributors a methodology, a set of tools to use, and a documented baseline for what the stack should do. That has compounding value: the next person who hits a NetBSD networking performance problem can start with this work rather than from scratch.

For anyone running NetBSD on hardware where network throughput matters, reading both parts of this series is worth doing. The fixes described reflect real understanding of where the stack was losing performance, not surface-level tuning. And the diagnostic approach is transferable to other parts of the kernel where similar accumulated debt sits waiting to be addressed.
