
NetBSD TCP and the Hidden Cost of Conservative Kernel Defaults

Source: lobsters

If you’ve ever run a file transfer over NetBSD and felt like something was off, you weren’t imagining it. A recent investigation into NetBSD TCP performance goes through the kernel internals and identifies concrete problems. The findings are worth unpacking in more depth, because they touch on design patterns that go back to the original 4.4BSD TCP implementation and have propagated through the BSDs in different ways ever since.

Where NetBSD’s TCP Comes From

NetBSD’s networking stack is a close descendant of the 4.4BSD-Lite2 code, released in 1994. That code predates most of the internet as it exists today. Gigabit Ethernet was not a thing. Trans-Pacific RTTs were not a common engineering concern. The defaults in that codebase reflect a world of 10Mbps LANs and dialup links, and a lot of those defaults survived into the 21st century with only modest revision.

FreeBSD took a more aggressive modernization path. Their TCP stack received substantial rewrites, including per-connection SACK state, the tcp_hostcache for caching per-host parameters, and gradual adoption of fine-grained locking to replace the global softnet_lock. OpenBSD went a different direction, prioritizing correctness and minimal attack surface over raw throughput. NetBSD has been somewhere in the middle: closer to the original code than FreeBSD, less radically trimmed than OpenBSD.

This history matters because the bugs and performance problems you find in NetBSD TCP today are often not implementation bugs in the conventional sense. They are the accumulated cost of defaults that were never wrong for their time.

The Socket Buffer Problem

The most common cause of poor TCP throughput, on any platform, is socket buffer sizing. The TCP receive window is bounded by the socket receive buffer. If the buffer is small, the window advertised to the remote peer is small, and the remote side cannot have more than that many bytes in flight at once.

NetBSD’s defaults:

$ sysctl net.inet.tcp.recvspace net.inet.tcp.sendspace
net.inet.tcp.recvspace = 32768
net.inet.tcp.sendspace = 32768

32KB. That is the default window ceiling for every TCP connection unless the application or administrator explicitly raises it.

To understand why this hurts, the key formula is the bandwidth-delay product: the amount of data that must be in flight simultaneously to keep a link fully utilized.

BDP = bandwidth * round-trip-time

Example: 1 Gbps link, 50ms RTT
BDP = 1,000,000,000 bits/sec * 0.05 sec
    = 50,000,000 bits
    = ~6.25 MB

With a 32KB window, the maximum theoretical throughput on that same link is:

max_throughput = window_size / RTT
              = 32768 bytes / 0.05 sec
              = 655,360 bytes/sec
              = ~5.2 Mbps

On a gigabit link with 50ms latency, you would be capped at about 5 Mbps. The pipe could carry 1000 Mbps. You are using 0.5% of it.

This is not a corner case. Any connection crossing a WAN link, a VPN, or a cloud provider’s backbone will see RTTs in the 20-100ms range. Even local datacenter links with 1-5ms RTT will see meaningful throughput limits with 32KB windows if the bandwidth is high enough.

Autotuning: What Linux Did Differently

Linux gained TCP socket buffer autotuning in stages: sender-side autotuning appeared in the 2.4 series, and receive-side autotuning (dynamic right-sizing) landed in 2.6.7. Rather than using a fixed buffer size, Linux grows each connection's buffers as the connection ramps up, guided by the observed RTT and bandwidth. The relevant sysctls are:

# Linux defaults (min, default, max, in bytes)
net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304

The kernel watches the actual data rate and RTT of each connection and adjusts buffer sizes accordingly, up to the configured maximum. This means a localhost connection does not waste memory on a 6MB buffer, while a high-latency remote connection gets the memory it actually needs.

NetBSD’s equivalent mechanism exists but is less aggressive. The tcp_autorcvbuf and tcp_autosndbuf sysctls control whether per-connection buffer growth is enabled, but the defaults and the growth algorithm have historically been more conservative. The global SB_MAX constant in sys/sockbuf.h also imposes a hard ceiling on how large any socket buffer can grow, and it has historically been set lower than Linux’s equivalent.

TCP Window Scaling and the RFC 1323 Trap

RFC 1323, published in 1992, introduced TCP window scaling to allow windows larger than 65535 bytes. The original TCP header encodes the window field in 16 bits, capping it at 64KB. Window scaling multiplies this by a power of two negotiated during the handshake.

The catch: both sides must agree to window scaling during the SYN exchange. If either side omits the window scale option, the negotiated scale is zero, and both sides are stuck with a 64KB maximum window regardless of their socket buffer sizes.

This has historically been a source of subtle bugs. Middleboxes that normalize TCP headers sometimes strip window scale options. Some older BSD implementations had bugs in the SYN/SYN-ACK handling that caused the window scale to be set incorrectly. Diagnosing this requires looking at the actual packet exchange, not just the application-level throughput:

# Capture and inspect window scale negotiation
tcpdump -i em0 -v 'tcp[tcpflags] & (tcp-syn) != 0'

# Look for wscale in the options field
# A healthy output looks like:
# Flags [S], ... options [mss 1460,nop,wscale 6,...]

If you see SYN packets without wscale in the options, your connections are capped at 64KB regardless of sysctl settings.

Delayed ACK and Nagle Interaction

Another performance sink that shows up repeatedly in BSD TCP investigations is the interaction between delayed ACKs and the Nagle algorithm.

Delayed ACK (RFC 1122) says the receiver need not acknowledge every segment immediately; it may hold the ACK briefly in the hope of piggybacking it on data flowing the other way. RFC 1122 caps the delay at 500ms; BSD-derived stacks have traditionally used a 200ms timer. This is sensible for reducing ACK traffic on high-volume streams, but it creates a problem with small writes.

The Nagle algorithm (RFC 896) buffers small outgoing segments until either the previous outstanding segment is acknowledged or the buffered data reaches a full MSS. The intent is to avoid the “tinygram” problem of flooding the network with packets that are mostly header.

The two interact badly in a request-response pattern with small messages. The sender buffers a small write waiting for an ACK. The receiver delays the ACK for up to 200ms. Both sides wait. The result is up to 200ms of unnecessary latency on every round trip, which on a connection doing many small exchanges compounds into catastrophic throughput degradation.

BSD implementations have historically used a fixed 200ms delayed ACK timer, comfortably under RFC 1122’s 500ms ceiling. Linux also caps its delayed ACK timer at 200ms, but the observed behavior often differs because Linux adapts the delay downward (to roughly 40ms) and switches to quick ACKs when it detects a request-response pattern.

For applications sensitive to this, TCP_NODELAY disables Nagle and forces every write to be sent immediately:

/* fd is a connected TCP socket; needs <netinet/tcp.h> */
int flag = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));

But this is an application-level workaround for a kernel-level heuristic that is not always appropriate.

Locking and Concurrency

Beyond the algorithmic issues, NetBSD’s networking stack has historically used a coarse-grained locking model. The softnet_lock mutex serializes a significant portion of the network stack. On a multicore system under high connection rates, this becomes a bottleneck: only one core can process network packets at a time through the locked section.

FreeBSD addressed this systematically, gradually replacing global locks with per-connection and per-interface locks and using the WITNESS lock-order verifier to catch deadlock-prone orderings along the way. NetBSD’s MP-safe networking work has been ongoing but has progressed more slowly.

For workloads involving many simultaneous connections, like a web server or a file transfer daemon, this lock contention shows up in profiling as high time spent in the kernel networking stack even on machines that are not CPU-bound overall.

How to Actually Measure This

The right approach when investigating TCP performance on NetBSD is to start with netstat -s to see if the kernel is dropping packets or running out of buffer space:

$ netstat -s -p tcp | grep -i 'overflow\|drop\|limit'

Then check whether window scaling is being negotiated correctly with tcpdump. Then verify the actual sysctl values in play and compare them against the BDP of the connection you care about.

For a baseline throughput test, iperf3 between two machines gives you the ground truth:

# On receiver:
iperf3 -s

# On sender, with window size tuning:
iperf3 -c receiver_host -w 4M -P 4 -t 30

The -w flag overrides the default socket buffer, and -P 4 runs four parallel streams to see whether concurrency changes throughput meaningfully. If single-stream throughput is low but parallel streams scale linearly, the bottleneck is window size, not locking. If parallel streams do not scale, locking or CPU is more likely.

What This Reveals About BSD Development

The deeper issue is that NetBSD’s defaults have lagged behind what modern networking conditions demand, and the gap is not always visible without deliberate measurement. A developer running file transfers locally will never see these problems. The issues surface on cross-region connections, cloud deployments, or any scenario with more than a few milliseconds of RTT.

FreeBSD has been more willing to change defaults in ways that break from the original BSD behavior. NetBSD’s conservatism is in some ways a feature: the system is predictable and the codebase is readable. But predictably wrong defaults are still wrong defaults.

The good news is that most of these issues are sysctl-tunable without kernel changes. Raising net.inet.tcp.recvspace and net.inet.tcp.sendspace to something like 1MB or 4MB, enabling autotuning, and verifying window scaling negotiation will recover most of the performance on offer. The harder work, the locking improvements and the algorithmic modernization, is ongoing in the NetBSD tree. For anyone running NetBSD in production where network throughput matters, watching that work and testing against the -current branch periodically is worth the effort.
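As a concrete starting point, the defaults can be raised persistently in /etc/sysctl.conf. These are illustrative values, not universal recommendations; tune the ceilings to the BDP of the paths you actually use, and check the sysctl names against your NetBSD release:

```
# /etc/sysctl.conf -- illustrative values; size to your worst-case BDP
# 1MB default socket buffers instead of the historical 32KB
net.inet.tcp.recvspace=1048576
net.inet.tcp.sendspace=1048576
# keep RFC 1323 window scaling and timestamps enabled
net.inet.tcp.rfc1323=1
```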
