SO_REUSEPORT and a 1981 RFC: How TCP Hole Punching Works at the Socket Level
Source: hackernews
The robertsdotpm article on TCP hole punching that surfaced on Hacker News recently builds its case around a state machine transition that RFC 793 defined in 1981 and that almost no application code deliberately triggers: simultaneous open. Understanding why this matters requires going through what NAT actually does with SYN packets, and what the socket API needs to look like for the OS to cooperate.
Why UDP Hole Punching Is Easier
NATs maintain state tables mapping internal (IP, port) pairs to external ones. When a UDP datagram leaves an internal host, the NAT records a mapping. Subsequent packets from the remote address that match the tuple get forwarded inbound. The hole-punching recipe for UDP is: both peers discover their public address via STUN, exchange those addresses through a signaling server, and then both send UDP packets to each other’s public address at roughly the same time. Both NATs create outbound mappings, both inbound packets match those mappings, and communication proceeds.
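The UDP recipe above can be sketched in a few lines. This is a minimal illustration, assuming the peer's public endpoint has already been learned through signaling; the function name `udp_punch` and its parameters are hypothetical and not taken from the article.

```python
import socket
import time

def udp_punch(sock, peer_addr, attempts=10, interval=0.2):
    """Send probes to peer_addr until a datagram from it arrives.

    sock is an already-bound UDP socket (ideally the one used for STUN);
    peer_addr is the peer's public (ip, port) learned through signaling.
    Returns True once the peer is heard from, False if every attempt fails.
    """
    sock.settimeout(interval)
    for _ in range(attempts):
        # The outbound probe is what creates or refreshes the NAT mapping.
        sock.sendto(b"punch", peer_addr)
        try:
            _, addr = sock.recvfrom(1500)
        except socket.timeout:
            continue
        except ConnectionError:
            # Some platforms report an ICMP port-unreachable as an error here.
            time.sleep(interval)
            continue
        if addr == peer_addr:
            return True
    return False
```

Both peers would run the same function against each other's public address at roughly the same time; the first probe in each direction may be lost, which is why the loop repeats.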
TCP complicates this because SYN packets are not treated like ordinary datagrams by many NATs. The NAT expects to see the canonical three-way handshake: outbound SYN, inbound SYN-ACK, outbound ACK. When an inbound SYN arrives for a port that has an outbound SYN mapping but no established connection, some NATs drop it outright, interpreting it as an unsolicited connection attempt rather than as the first half of a simultaneous open. The fraction of NATs that behave this way has shrunk over time as firmware has improved, but it is not zero.
What Simultaneous Open Actually Looks Like
RFC 793, Section 3.4 describes simultaneous open explicitly. Both ends start in CLOSED. Both call connect() to the other’s address. Both send SYN. Both receive the other’s SYN while in SYN_SENT state. Both transition to SYN_RCVD. Both send SYN-ACK. Both receive SYN-ACK. Both transition to ESTABLISHED. No server, no listen(), no accept(). The state machine handles it.
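The transitions described above can be modeled as a small lookup table. This is a toy model for illustration only, covering just the states involved in opening a connection; the names `TRANSITIONS`, `step`, and `simultaneous_open` are hypothetical, and a real TCP stack also checks sequence and acknowledgment numbers.

```python
# (state, event) -> (next state, segment sent in response, or None)
TRANSITIONS = {
    ("CLOSED", "connect"): ("SYN_SENT", "SYN"),
    ("SYN_SENT", "recv SYN"): ("SYN_RCVD", "SYN-ACK"),     # simultaneous open
    ("SYN_SENT", "recv SYN-ACK"): ("ESTABLISHED", "ACK"),  # ordinary client path
    ("SYN_RCVD", "recv SYN-ACK"): ("ESTABLISHED", None),
    ("SYN_RCVD", "recv ACK"): ("ESTABLISHED", None),
}

def step(state, event):
    """Look up the next state and the segment to send for one event."""
    return TRANSITIONS[(state, event)]

def simultaneous_open():
    """Run both peers through a simultaneous open and return the state pairs."""
    trace = []
    # Both peers call connect() and emit a SYN.
    a_state, a_out = step("CLOSED", "connect")
    b_state, b_out = step("CLOSED", "connect")
    trace.append((a_state, b_state))
    # Each round delivers the segments that crossed in flight.
    while a_out is not None and b_out is not None:
        a_next, a_reply = step(a_state, "recv " + b_out)
        b_next, b_reply = step(b_state, "recv " + a_out)
        a_state, a_out = a_next, a_reply
        b_state, b_out = b_next, b_reply
        trace.append((a_state, b_state))
    return trace
```

Calling `simultaneous_open()` walks both peers through SYN_SENT and SYN_RCVD to ESTABLISHED without either side ever entering a listening state.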
The segment exchange looks like this:
Peer A            NAT A        NAT B               Peer B
|                   |            |                    |
|--- SYN ---------->|            |                    |
|                   |--- SYN --->| (dropped)          |
|                   |            |<--- SYN -----------|
|         (dropped) |<--- SYN ---|                    |
... (retry after backoff, now both NATs have outbound mappings) ...
|--- SYN ---------->|            |<--- SYN -----------|
|                   |--- SYN --->|                    |
|<-- SYN (forwarded)|            |-- SYN (forwarded)->|
|--- SYN-ACK ------>|            |<--- SYN-ACK -------|
|<-- SYN-ACK -------|            |--- SYN-ACK ------->|
|                     ESTABLISHED                     |
NAT A sees an outbound SYN from A toward B’s public address and creates a mapping. When Peer B’s SYN arrives at NAT A from B’s public address destined for A’s public address, NAT A has a matching tuple and forwards it inward. Peer A’s kernel sees an inbound SYN while in SYN_SENT state and transitions to SYN_RCVD. The same happens symmetrically at NAT B. The connection completes without a server on either side.
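The forwarding rule in this paragraph can be made concrete with a toy NAT model. This is a sketch under simplifying assumptions (one external port per internal endpoint, inbound traffic admitted only from remotes the internal host has already contacted, no timeouts or TCP state tracking); the class `ToyNat` and the addresses used with it are hypothetical.

```python
class ToyNat:
    """Toy NAT: maps internal endpoints to external ports and filters inbound."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.mappings = {}    # internal (ip, port) -> external port
        self.allowed = set()  # (external port, remote (ip, port))

    def outbound(self, internal, remote):
        """Record an outbound packet and return the public (ip, port) it leaves from."""
        if internal not in self.mappings:
            self.mappings[internal] = self.next_port
            self.next_port += 1
        ext_port = self.mappings[internal]
        self.allowed.add((ext_port, remote))
        return (self.public_ip, ext_port)

    def inbound(self, remote, ext_port):
        """Return the internal endpoint an inbound packet is forwarded to, or None if dropped."""
        if (ext_port, remote) not in self.allowed:
            return None
        for internal, port in self.mappings.items():
            if port == ext_port:
                return internal
        return None
```

In this model a SYN from Peer B's public address is dropped until Peer A has sent its own SYN toward that address, after which the same inbound packet is forwarded to Peer A.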
The Socket API Mechanics
Getting the OS to cooperate requires two socket options working together. On Linux:
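import socket

def connect_with_hole_punch(local_port, remote_ip, remote_port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow re-binding the port while an earlier attempt sits in TIME_WAIT
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Allow another socket (for example a fallback listener) on the same port
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    # Pin the local port before connect() so the NAT mapping is predictable
    s.bind(('0.0.0.0', local_port))
    # Both peers call this at nearly the same time
    s.connect((remote_ip, remote_port))
    return s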
SO_REUSEADDR allows re-binding to a port in TIME_WAIT state, which matters when retrying after a failed attempt. SO_REUSEPORT allows multiple sockets to share a port, necessary when you also want a listening socket on the same port as a fallback path. The critical point is that bind() locks in the local port before connect() sends the SYN. That bound port is the one the NAT creates a mapping for, and it is the same port that an inbound SYN must arrive at to be handled by this socket.
The simultaneous open transition happens automatically in the kernel. When a socket in SYN_SENT state receives an inbound SYN from the same remote address it is connecting to, the kernel transitions to SYN_RCVD and sends SYN-ACK. The application does not need to observe or respond to this; connect() eventually returns success once the full exchange completes.
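In practice a single connect() call can fail early, for example when a RST arrives before the peer's SYN, so implementations commonly retry until a deadline. The following is a minimal sketch of such a loop, not the article's implementation; the name `punch_tcp`, its parameters, and the retry interval are assumptions chosen for illustration.

```python
import socket
import time

def punch_tcp(local_port, remote_addr, deadline_s=10.0, attempt_timeout=1.0):
    """Retry connect() from a fixed local port until it succeeds or time runs out.

    Returns a connected socket, or None if the deadline passes.
    """
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        # A socket whose connect() failed is not reused; create a fresh one.
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        if hasattr(socket, "SO_REUSEPORT"):  # not defined on every platform
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.settimeout(attempt_timeout)
        try:
            s.bind(("0.0.0.0", local_port))
            s.connect(remote_addr)
        except OSError:
            # Refused, timed out, or port still busy: back off briefly and retry.
            s.close()
            time.sleep(0.1)
            continue
        s.settimeout(None)  # restore blocking mode for the caller
        return s
    return None
```

Both peers would call this with their own bound port and the other side's public address. The kernel performs the simultaneous open transition inside connect(); the loop only covers attempts that are aborted outright.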
OS portability is a real constraint. Linux handles simultaneous open correctly. Windows has historically not completed the simultaneous open transition reliably, with behavior varying across versions, and it does not provide SO_REUSEPORT at all. FreeBSD and macOS implement simultaneous open, but their socket option semantics and retransmission timing differ from Linux. Any implementation targeting multiple platforms needs explicit testing on each.
The Timing Problem and Coordination
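The hard operational constraint is timing. If Peer A’s SYN reaches NAT B before Peer B has sent its own SYN, NAT B has no outbound mapping and drops A’s packet. A silent drop is recoverable: the kernel retransmits the SYN after a backoff, by which time B’s own SYN has usually opened the mapping. The damaging case is a NAT that answers the unsolicited SYN with a RST or an ICMP error, which can abort connect() on the sender or tear down the mapping it just created. The peer’s SYN is then rejected as well, and both sides see a failure.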
The coordination mechanism is a signaling server combined with STUN (RFC 5389, since obsoleted by RFC 8489) for address discovery. Each peer binds a local port, queries a STUN server from that port to learn its public (IP, port), and shares those values through the signaling channel. The signaling server then tells both peers to begin connecting at the same moment. The acceptable timing skew depends on NAT processing latency and network round-trip time, but a window of a few hundred milliseconds is typically sufficient.
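For the address discovery step, the STUN wire format is small enough to sketch directly. The code below builds a Binding Request and extracts the XOR-MAPPED-ADDRESS attribute from a response, following the message layout in the STUN specification; it is a minimal illustration for IPv4 only, with no transport, retransmission, or transaction ID checking, and the function names are hypothetical.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
XOR_MAPPED_ADDRESS = 0x0020

def build_binding_request():
    """Return (message, transaction_id) for a Binding Request with no attributes."""
    txid = os.urandom(12)
    # Header: message type, attribute length (0), magic cookie, 96-bit transaction ID.
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return header + txid, txid

def parse_xor_mapped_address(message):
    """Return the public (ip, port) from a STUN response, or None if absent."""
    if len(message) < 20:
        return None
    _msg_type, length, cookie = struct.unpack("!HHI", message[:8])
    if cookie != MAGIC_COOKIE:
        return None
    pos, end = 20, min(20 + length, len(message))
    while pos + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", message[pos:pos + 4])
        value = message[pos + 4:pos + 4 + attr_len]
        # value[1] is the address family; 0x01 means IPv4.
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            port = struct.unpack("!H", value[2:4])[0] ^ (MAGIC_COOKIE >> 16)
            addr = struct.unpack("!I", value[4:8])[0] ^ MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        # Attributes are padded to a 4-byte boundary.
        pos += 4 + attr_len + (-attr_len % 4)
    return None
```

The (ip, port) pair returned by `parse_xor_mapped_address` is what each peer would publish through the signaling channel.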
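Reusing the same local port for both the STUN query and the subsequent connect() is what keeps the advertised address valid. A NAT with endpoint-independent mapping (any of the cone types) ties the external port to the local (IP, port), so the external port the STUN server observes is the one the peer needs to target. With UDP this means literally the same socket. With TCP, a socket that has connected to the STUN server cannot connect() again, so the STUN query and the hole-punch attempt use separate sockets bound to the same local port, which is what SO_REUSEADDR and SO_REUSEPORT permit. If the connect() goes out from a different local port, the NAT allocates a different external port and the address you shared becomes stale. Symmetric NATs break this assumption regardless, as described below.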
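With TCP, a socket that is already connected to a STUN server cannot be pointed at the peer afterwards, so the practical way to keep the mapping stable is two sockets bound to one local port. The sketch below shows only the binding step; the helper name `bound_tcp_socket` is hypothetical, and whether two such binds succeed depends on the platform's SO_REUSEADDR and SO_REUSEPORT semantics.

```python
import socket

def bound_tcp_socket(local_port):
    """Create a TCP socket bound to local_port with port sharing enabled."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):  # not defined on every platform
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("0.0.0.0", local_port))
    return s

# First socket: let the OS pick a free port; this one would talk to the STUN server.
stun_sock = bound_tcp_socket(0)
local_port = stun_sock.getsockname()[1]

# Second socket: same local port; this one would connect() to the peer.
peer_sock = bound_tcp_socket(local_port)
```

Both sockets must set the sharing options before bind(); setting them on only the second socket is not enough.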
NAT Types and Failure Cases
RFC 3489, the original STUN specification, categorized NAT behavior into four types: full cone, address-restricted cone, port-restricted cone, and symmetric. The 2005 paper by Ford, Srisuresh, and Kegel analyzed how hole punching, including the TCP variant, fares across NATs in practice. Full cone NATs forward any inbound packet to a mapped internal address regardless of source. Symmetric NATs allocate a fresh external port for each distinct (destination IP, destination port) pair, so the port observed during STUN discovery differs from the port used during the hole-punch connect(), because those go to different destinations.
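The difference between cone and symmetric mapping comes down to what the NAT uses as the key for its port table. The two toy classes below illustrate only that difference; the class names, the sequential port allocation, and the addresses are assumptions for the example, and real NATs add filtering and timeouts on top.

```python
import itertools

class ConeMapping:
    """External port depends only on the internal endpoint."""

    def __init__(self, first_port=50000):
        self._ports = itertools.count(first_port)
        self._table = {}

    def external_port(self, internal, destination):
        # The destination is ignored: every destination sees the same port.
        if internal not in self._table:
            self._table[internal] = next(self._ports)
        return self._table[internal]

class SymmetricMapping:
    """External port depends on the internal endpoint and the destination."""

    def __init__(self, first_port=50000):
        self._ports = itertools.count(first_port)
        self._table = {}

    def external_port(self, internal, destination):
        # Each new (internal, destination) pair gets its own external port.
        key = (internal, destination)
        if key not in self._table:
            self._table[key] = next(self._ports)
        return self._table[key]
```

With `ConeMapping`, the port a STUN server reports is the same one the peer will reach. With `SymmetricMapping`, the STUN server and the peer are different destinations, so the reported port is not the one used toward the peer.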
Symmetric NAT is the case that defeats simultaneous open cleanly. Port prediction, where one peer tries to guess the next port the other’s NAT will allocate, works against NATs with sequential port allocation but fails with randomized allocation, which has become standard. The clean approach is to accept symmetric NAT as a case requiring TURN relay (RFC 8656) rather than adding unreliable heuristics to handle it sometimes.
TURN relays all traffic through a server. It works universally but adds latency, consumes bandwidth at the relay, and introduces a server dependency. Direct TCP via simultaneous open, when the NAT cooperates, is peer-to-peer with none of those costs. ICE (RFC 8445), used in WebRTC, gathers candidates and checks candidate pairs in priority order, which in practice usually means direct UDP first, then direct TCP, then a TURN relay. Simultaneous open fits naturally into ICE: RFC 6544 defines a simultaneous-open TCP candidate type alongside the active and passive ones.
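The fallback ordering can be expressed as a loop over connection strategies. This is a deliberately simplified sketch and not how ICE itself is specified, since ICE runs connectivity checks over prioritized candidate pairs; the function name `connect_with_fallback` and the strategy callables are hypothetical.

```python
def connect_with_fallback(strategies):
    """Try each (name, attempt) pair in order and return the first that works.

    Each attempt is a zero-argument callable that returns a connection object,
    returns None, or raises OSError on failure.
    Returns (name, connection), or (None, None) if every strategy fails.
    """
    for name, attempt in strategies:
        try:
            conn = attempt()
        except OSError:
            conn = None
        if conn is not None:
            return name, conn
    return None, None
```

A caller would pass strategies ordered from cheapest to most expensive, for example direct UDP, then TCP simultaneous open, then a relay, so the relay is used only when the direct paths fail.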
Projects like libp2p, which underpins IPFS and several other distributed systems, implement TCP hole punching to establish direct connections between peers behind NATs. Tailscale documents its NAT traversal approach in detail; it focuses on UDP because WireGuard is UDP-only, but the NAT state machine reasoning is identical.
What Elegance Means Here
The word elegant in the article title describes a specific property: the algorithm does not require special-casing, bolt-on state, or fighting the underlying abstractions. The TCP state machine as defined in RFC 793 handles simultaneous open correctly when triggered. The NAT, when it behaves as a stateful packet forwarder rather than a connection-state tracker, forwards the inbound SYN when the outbound mapping exists. SO_REUSEPORT lets the socket API express the binding constraint without raw socket manipulation.
The contrast is with approaches that race between connect() and accept() on separate ports, or that use raw sockets to craft SYN packets with specific sequence numbers, or that probe NAT port allocation patterns before attempting a connection. Those work in some cases but accumulate complexity proportional to the number of NAT behaviors they try to handle.
The Hacker News comments thread surfaces useful counterpoints: specific NAT firmware versions that mishandle the inbound SYN despite having the mapping, OS scheduler interactions that affect timing precision, and cases where SO_REUSEPORT semantics differ from what the implementation assumes. These are worth reading alongside the article. The elegance holds at the algorithm level; the implementation surface still has rough edges that vary by environment.