Small Buffers, Frozen Windows: The NetBSD TCP Performance Trap

Source: lobsters

TCP performance bugs are among the most satisfying to investigate because the symptoms look mysterious until they suddenly make complete sense. A machine with a gigabit NIC transferring at 30 Mbps. Retransmit counters near zero. Packets flowing freely. Everything looks healthy from the outside, and then you look at the window field in the TCP headers and see it hover near zero for long stretches. That is the NetBSD TCP receive path telling you it has run out of room.

This investigation into NetBSD TCP throughput follows a well-worn path that anyone who has debugged BSD networking will recognize, and the core problem it uncovers is instructive precisely because it is not exotic. It is a consequence of defaults and design assumptions that made sense in 1990 and have not aged well.

The Bandwidth-Delay Product Is Not Optional

TCP performance on any kernel comes down to one constraint: the amount of data that can be in flight at any given moment is bounded by the smaller of the congestion window and the receive window. The receive window is bounded by how much free space the receiver has in its socket buffer. If the socket buffer fills up before the sender has used all available bandwidth, throughput caps out well below line rate.

The bandwidth-delay product (BDP) is what determines how large the in-flight data needs to be. A 1 Gbps link with a 10ms round-trip time requires roughly 1.25 MB of data in flight to stay saturated. On a 100ms WAN path that becomes 12.5 MB. If the receive socket buffer is 32 KB, you are limited to 32,768 bytes / 0.010 seconds ≈ 3.3 MB/s, or about 26 Mbps, on that 10ms link, regardless of what the NIC and the congestion control algorithm would otherwise allow.

NetBSD’s default for net.inet.tcp.recvspace has historically been 32,768 bytes. That was reasonable for a 10 Mbps Ethernet with machines behind it doing modest file transfers in 1993. It is not reasonable for modern hardware.

Linux learned this lesson and encodes it in net.ipv4.tcp_rmem, a three-value sysctl that specifies minimum, default, and maximum receive buffer sizes. The default is 4096 131072 6291456, giving auto-tuning up to 6 MB. The key mechanism is tcp_moderate_rcvbuf, enabled by default since the 2.6 era, which adjusts the receive buffer dynamically based on measured throughput and RTT. The kernel estimates the receive-side RTT (tcp_rcv_rtt_measure()) and tracks how much data the application consumes per RTT, then grows the buffer so the advertised window stays ahead of the sender. The application never has to set SO_RCVBUF explicitly.

How BSD Socket Buffers Work

In BSD kernels, socket buffer management lives in sys/kern/uipc_socket2.c. Each socket has two buffers, so_rcv and so_snd, each of type struct sockbuf. The key fields are sb_hiwat (the high-water mark, effectively the buffer size limit) and sb_cc (current byte count in the buffer).

When the TCP input path receives a segment, it calls sbappend() or sbappendstream() to add data to so_rcv. The space available for advertisement in the next ACK is computed as:

win = sbspace(&so->so_rcv);
/* which, ignoring mbuf-space accounting, expands to: */
win = (long)so->so_rcv.sb_hiwat - (long)so->so_rcv.sb_cc;

When the application reads data via recv() or read(), the socket layer issues a PRU_RCVD request to the protocol. TCP uses this as a cue to recalculate the window and potentially send a window update to the peer. If the application reads slowly, or if sb_hiwat is small to begin with, the window advertisement shrinks. When it hits zero the sender must stop transmitting and start sending window probes, which adds latency and destroys throughput.

BSD has had mechanisms for buffer auto-sizing, but they have historically been more conservative and less automatic than Linux’s implementation. The SB_AUTOSIZE flag can be set on a socket buffer to allow it to grow, but the growth policy in NetBSD is cautious and not always active by default. The maximum is capped by kern.sbmax, which defaults to 262,144 bytes on some NetBSD versions, far below Linux’s ceiling.

What the Investigation Reveals

The pathology in practice follows a predictable sequence. A sender opens a connection and starts pushing data. Initially the receive buffer has headroom and the window is advertised generously. As data arrives faster than the application drains it, the buffer fills. The advertised window shrinks in proportion. The sender, compliant with the spec, slows its transmission rate. By the time the application catches up and frees buffer space, the sender has backed off significantly and needs time to ramp back up through slow start or congestion avoidance.

The result is throughput that oscillates rather than running at line rate. A tcpdump trace makes this visible immediately: look at the win field in the TCP headers of ACK packets. On a well-tuned connection it should be large and relatively stable. On a buffer-starved connection it oscillates between some fraction of the buffer size and values approaching zero.

$ tcpdump -n -r capture.pcap 'tcp[13] & 16 != 0' | grep -o 'win [0-9]*'
# win 32768
# win 28000
# win 14000
# win 2000
# win 0         <- sender stalls here
# win 32768     <- window update after application reads

This sawtooth pattern is the fingerprint of a receive buffer that is too small relative to the bandwidth-delay product.

netstat -s gives aggregate statistics that confirm the diagnosis. The rcvbuf overflow counter tracks segments dropped because the receive buffer was full. High values there, combined with low retransmit counts (meaning the network itself is not lossy), indicate buffer exhaustion rather than congestion.

The Diagnostic Methodology

Debugging TCP performance on any BSD system benefits from a layered approach. Start with netstat -s -p tcp to get a snapshot of cumulative counters, run the transfer, then diff the counters. Pay attention to:

  • tcpRcvbufOvflow or equivalent: segments dropped due to full receive buffer
  • tcpRcvPackAfterWin: data arriving outside the advertised window
  • tcpRcvAfterClose: segments arriving after the connection closes (usually noise)
  • tcpPersistTimeout: how many times the sender has had to send window probes

High tcpPersistTimeout values are particularly damning. They mean the sender has repeatedly been forced into the persist timer loop, waiting for a window update that the receiver was slow to send. Each persist interval is at minimum 5 seconds by default, so even a handful of these events can noticeably inflate transfer times.

For kernel-level visibility, BSD offers the SO_DEBUG socket option, which logs TCP state transitions to a circular buffer readable with trpt(8), provided the kernel was built with TCP debugging support. It is coarse, but it needs nothing from the application beyond setting the option. DTrace probes in tcp::: and ip::: provide finer granularity on systems where DTrace is available, though NetBSD’s DTrace support has historically been incomplete compared to FreeBSD or Solaris descendants.

Fixes and Their Trade-offs

The immediate fix is sysctl tuning. Raising net.inet.tcp.recvspace to match the BDP of the target network is straightforward:

# For a 1 Gbps LAN with 1ms RTT:
# BDP = 1e9 * 0.001 / 8 = 125,000 bytes
# Round up with overhead headroom:
sysctl -w net.inet.tcp.recvspace=262144
sysctl -w net.inet.tcp.sendspace=262144
sysctl -w kern.sbmax=16777216

Raising kern.sbmax is important because it sets the ceiling that socket buffer auto-sizing is allowed to reach. Without it, even applications that set SO_RCVBUF to a large value via setsockopt() will be silently capped.

The deeper fix is improving auto-tuning. FreeBSD has done more work here than NetBSD, adding receive buffer auto-sizing that tracks measured throughput and adjusts sb_hiwat dynamically, similar in spirit to Linux’s approach but with a different implementation. NetBSD’s auto-tuning is less aggressive. The SB_AUTOSIZE path exists but the growth heuristics have not been tuned as carefully, and the default maximum is conservative.

There is a legitimate reason to be conservative: each socket’s receive buffer is physical memory, and a server handling thousands of connections cannot give each one a 6 MB buffer without exhausting RAM. Linux handles this with a global TCP memory accounting system (tcp_mem sysctl) that throttles buffer growth when the system is under memory pressure. BSD has kern.sbmax as a per-socket ceiling, which is a cruder instrument.

Why This Keeps Coming Up

The reason BSD TCP performance debugging articles appear periodically, and the reason the same issues surface in NetBSD, OpenBSD, and to a lesser extent FreeBSD, is that each BSD variant is independently maintained with limited resources. A fix landed in FreeBSD in 2010 does not automatically appear in NetBSD. Networking improvements require someone to port them, test them, and navigate the commit process, and that work happens unevenly.

Linux benefits from corporate investment in network performance from Google, Meta, Red Hat, and others who have strong incentives to squeeze throughput. The result is a TCP stack that has received continuous performance attention for two decades, with auto-tuning, pacing, BBR congestion control, and careful buffer accounting that most BSD stacks still lack or have only partially implemented.

None of that makes BSD TCP wrong. For many workloads, particularly on modest hardware doing modest transfers, the defaults are fine. But for anyone pushing real throughput through a NetBSD machine, understanding the receive path, measuring the buffer usage, and tuning the relevant sysctls is not optional. The kernel is not going to do it for you by default, and now you know why.
