
TCP Hole Punching and the Elegance of Simultaneous Open

Source: lobsters

The premise is deceptively simple: two computers behind NATs want to connect to each other without a relay. For UDP, this problem was solved cleanly enough that it became standard practice in P2P systems a decade ago. For TCP, the same goal involves a corner of RFC 793 that most developers have never encountered, specific socket options that are easy to get wrong, and NAT behavior that varies enough between devices to make any solution feel fragile.

This article by Matthew Roberts presents an algorithm for TCP hole punching that earns the word “elegant” by working within the constraints of how NATs actually behave, rather than fighting them. To understand what makes it elegant, it helps to understand the full stack of what makes TCP hole punching difficult in the first place.

The NAT Problem, Precisely

A NAT device sits between a private network and the public internet. When an internal host at 192.168.1.5:45000 sends a packet to some external server, the NAT rewrites the source to its own public address, say 203.0.113.1:9001, and records that mapping. When the server responds to 203.0.113.1:9001, the NAT translates it back and forwards it to 192.168.1.5:45000. This works for client-initiated connections.
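The translation step can be pictured as a small lookup table. The following is a minimal sketch, not a model of any real NAT: the type and function names (`nat`, `nat_outbound`, `nat_inbound`) are hypothetical, the table size is arbitrary, and mapping timeouts are ignored.

```c
#include <stddef.h>
#include <stdint.h>

#define NAT_MAX_MAPPINGS 16

/* One mapping: internal endpoint <-> external port on the NAT's public IP. */
struct nat_mapping {
    uint32_t internal_ip;    /* host byte order, e.g. 192.168.1.5 */
    uint16_t internal_port;
    uint16_t external_port;
};

struct nat {
    struct nat_mapping table[NAT_MAX_MAPPINGS];
    size_t count;
    uint16_t next_port;      /* next external port to hand out */
};

/* Outbound packet: reuse the mapping for this internal endpoint or create one.
   Returns the external source port, or 0 if the table is full. */
uint16_t nat_outbound(struct nat *n, uint32_t ip, uint16_t port) {
    for (size_t i = 0; i < n->count; i++) {
        if (n->table[i].internal_ip == ip && n->table[i].internal_port == port)
            return n->table[i].external_port;
    }
    if (n->count == NAT_MAX_MAPPINGS)
        return 0;
    struct nat_mapping *m = &n->table[n->count++];
    m->internal_ip = ip;
    m->internal_port = port;
    m->external_port = n->next_port++;
    return m->external_port;
}

/* Inbound packet addressed to external_port: find the internal endpoint.
   NULL means no mapping exists, so the packet would be dropped. */
const struct nat_mapping *nat_inbound(const struct nat *n, uint16_t external_port) {
    for (size_t i = 0; i < n->count; i++) {
        if (n->table[i].external_port == external_port)
            return &n->table[i];
    }
    return NULL;
}
```

An inbound lookup succeeds only after some outbound packet has created the entry, which is the whole reason unsolicited connections from outside fail.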

The problem for P2P is that both peers are internal hosts: Peer A sits at 192.168.1.5 behind NAT-A, and Peer B at 10.0.0.3 behind NAT-B. Neither peer knows the other’s public endpoint in advance, and even once it learns it, neither can simply initiate a connection, because neither NAT has an inbound mapping for the other peer’s packets.

NATs fall into a rough taxonomy based on how permissive their mapping behavior is. The most important distinction is between cone NATs and symmetric NATs. A cone NAT maps a given internal address and port to the same external port for all outbound connections, regardless of destination. A symmetric NAT allocates a fresh external port for each distinct destination, so the port the rendezvous server observes is not the port a peer would have to target. Hole punching works reliably only when both sides sit behind cone NATs; a symmetric NAT on either side generally rules out a direct connection, short of fragile port-prediction tricks, and forces a fallback to a relay.
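The difference between the two mapping policies fits in one lookup function. This is an illustrative sketch under simplifying assumptions: `nat_model` and `external_port_for` are hypothetical names, ports are handed out sequentially, and real devices differ in how they choose ports.

```c
#include <stddef.h>
#include <stdint.h>

enum nat_kind { NAT_CONE, NAT_SYMMETRIC };

struct mapping {
    uint16_t internal_port;
    uint32_t dest_ip;        /* only consulted for symmetric NATs */
    uint16_t dest_port;      /* only consulted for symmetric NATs */
    uint16_t external_port;
};

struct nat_model {
    enum nat_kind kind;
    struct mapping table[32];
    size_t count;
    uint16_t next_port;
};

/* Returns the external port used for a flow, allocating one if needed
   (0 if the table is full). */
uint16_t external_port_for(struct nat_model *n, uint16_t internal_port,
                           uint32_t dest_ip, uint16_t dest_port) {
    for (size_t i = 0; i < n->count; i++) {
        const struct mapping *m = &n->table[i];
        if (m->internal_port != internal_port)
            continue;
        /* Cone: any destination reuses the mapping.
           Symmetric: the destination must match as well. */
        if (n->kind == NAT_CONE ||
            (m->dest_ip == dest_ip && m->dest_port == dest_port))
            return m->external_port;
    }
    if (n->count == sizeof(n->table) / sizeof(n->table[0]))
        return 0;
    struct mapping *m = &n->table[n->count++];
    m->internal_port = internal_port;
    m->dest_ip = dest_ip;
    m->dest_port = dest_port;
    m->external_port = n->next_port++;
    return m->external_port;
}
```

Under the cone policy, the port seen by a rendezvous server is the same port a peer can target. Under the symmetric policy, the flow toward the peer gets a different port that nobody has observed.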

UDP Hole Punching: The Baseline

Bryan Ford’s 2005 paper at USENIX ATC defined the terminology and the core technique. For UDP, the algorithm is clean. Both peers connect to a public rendezvous server, which observes each peer’s external NAT endpoint from the incoming packets. The server tells each peer the other’s external endpoint simultaneously. Each peer sends a UDP packet to the other’s external endpoint. The outbound packet causes each peer’s NAT to create a mapping permitting inbound traffic from that destination. When the other peer’s packet arrives, the NAT already has an entry for it and forwards it to the internal host.

The reason this works gracefully for UDP is that UDP is connectionless: the NAT keeps only a mapping with an idle timeout and has no handshake state to validate. Sending a packet from A to B’s endpoint creates A’s mapping regardless of whether B’s NAT accepts that first packet. Timing is lenient. If A’s first packet arrives before B has punched its hole and gets dropped, subsequent packets from A will get through once B’s outbound packet has created B’s mapping. The application layer handles retries naturally.
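The drop-then-succeed sequence can be traced with a toy filter table. The sketch below assumes port-restricted filtering (inbound packets are accepted only from an endpoint the host has already sent to); `nat_filter`, `nat_send`, `nat_accepts`, and `send_through` are hypothetical names, and no real packets are involved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Remote endpoints a NAT will accept inbound packets from. An entry is
   added whenever the internal host sends a packet to that endpoint. */
struct nat_filter {
    uint32_t allowed_ip[8];
    uint16_t allowed_port[8];
    size_t count;
};

/* Internal host sends to (ip, port): the NAT opens a hole for replies. */
void nat_send(struct nat_filter *f, uint32_t ip, uint16_t port) {
    for (size_t i = 0; i < f->count; i++)
        if (f->allowed_ip[i] == ip && f->allowed_port[i] == port)
            return;
    if (f->count < 8) {
        f->allowed_ip[f->count] = ip;
        f->allowed_port[f->count] = port;
        f->count++;
    }
}

/* Inbound packet from (ip, port): forwarded only if a hole already exists. */
bool nat_accepts(const struct nat_filter *f, uint32_t ip, uint16_t port) {
    for (size_t i = 0; i < f->count; i++)
        if (f->allowed_ip[i] == ip && f->allowed_port[i] == port)
            return true;
    return false;
}

/* One send attempt: punches the sender's own NAT, then reports whether the
   receiver's NAT lets the packet through. */
bool send_through(struct nat_filter *sender_nat,
                  uint32_t receiver_ip, uint16_t receiver_port,
                  const struct nat_filter *receiver_nat,
                  uint32_t sender_ip, uint16_t sender_port) {
    nat_send(sender_nat, receiver_ip, receiver_port);
    return nat_accepts(receiver_nat, sender_ip, sender_port);
}
```

If A sends first, `send_through` returns false because B’s NAT has no hole yet, but A’s own hole is now open. B’s first packet therefore passes, and A’s retry passes too.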

Where TCP Gets Complicated

TCP is connection-oriented, and NATs track TCP connection state. When a NAT sees an outbound SYN from an internal host, it creates a mapping and expects to see the corresponding SYN-ACK inbound. If it instead receives an inbound SYN on that port (as would arrive from the other peer attempting to connect simultaneously), many NATs either drop it or send a TCP RST back, because an inbound SYN does not match what they consider a valid next packet for that connection.

Beyond NAT behavior, TCP itself imposes a timing constraint that UDP does not. Both peers must send their SYNs close enough together in time that each peer’s outbound SYN creates a NAT mapping before the other peer’s SYN arrives. If A’s SYN reaches B’s NAT before B has sent its own outbound SYN, B’s NAT has no mapping for A’s source address and port, and the SYN is dropped. On modern stacks the initial SYN retransmission timeout is 1 second (RFC 6298; older stacks following RFC 1122 used 3 seconds) and doubles on each retry, so missing the simultaneous open window means waiting 1, then 2, then 4 seconds for the next attempts.
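The cost of a missed window is easy to compute for any initial timeout. This is a small sketch of the doubling schedule only; `syn_retransmit_offset_ms` is a hypothetical helper, and real stacks also cap the timeout and the number of retries, which is not modeled here.

```c
#include <stdint.h>

/* Milliseconds after the original SYN at which retransmission number
   `attempt` (1-based) is sent, assuming the timeout doubles after every
   retry. Attempt 0 is the original SYN and returns 0. */
uint64_t syn_retransmit_offset_ms(uint64_t initial_rto_ms, unsigned attempt) {
    uint64_t offset = 0;
    uint64_t rto = initial_rto_ms;
    for (unsigned i = 0; i < attempt; i++) {
        offset += rto;
        rto *= 2;
    }
    return offset;
}
```

With an initial timeout of 1000 ms the third retransmission leaves 7 seconds after the original SYN; with 3000 ms it leaves after 21 seconds.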

Simultaneous Open: The RFC 793 Corner Case

TCP has a feature called simultaneous open that most programmers never encounter in practice. RFC 793 specifies that if both endpoints send SYN to each other at the same time, each receives the other’s SYN while in SYN_SENT state. Each transitions to SYN_RECEIVED, sends a SYN-ACK, receives the other’s SYN-ACK, and the connection reaches ESTABLISHED without either side ever being in LISTEN state. Both sides are active openers.

A: SYN_SENT  → receives B's SYN → sends SYN-ACK → SYN_RECEIVED
B: SYN_SENT  → receives A's SYN → sends SYN-ACK → SYN_RECEIVED
A: receives B's SYN-ACK → ACK → ESTABLISHED
B: receives A's SYN-ACK → ACK → ESTABLISHED

This state machine is the mechanism that makes TCP hole punching possible. The challenge is engineering the conditions where both peers’ SYNs arrive at each other’s NATs only after their own outbound SYNs have created NAT mappings.
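The handshake subset of the state machine can be written as a transition function. This is a sketch for illustration, covering only the transitions discussed here; the enum and function names are hypothetical and it does not handle RST, timeouts, or teardown.

```c
enum tcp_state { ST_CLOSED, ST_LISTEN, ST_SYN_SENT, ST_SYN_RECEIVED, ST_ESTABLISHED };
enum tcp_event { EV_ACTIVE_OPEN, EV_RECV_SYN, EV_RECV_SYN_ACK, EV_RECV_ACK };

/* Next state for the handshake portion of the TCP state machine.
   Unhandled (state, event) pairs leave the state unchanged. */
enum tcp_state tcp_next(enum tcp_state s, enum tcp_event e) {
    switch (s) {
    case ST_CLOSED:
        if (e == EV_ACTIVE_OPEN) return ST_SYN_SENT;        /* send SYN */
        break;
    case ST_SYN_SENT:
        if (e == EV_RECV_SYN_ACK) return ST_ESTABLISHED;    /* ordinary open: send ACK */
        if (e == EV_RECV_SYN) return ST_SYN_RECEIVED;       /* simultaneous open: send SYN-ACK */
        break;
    case ST_SYN_RECEIVED:
        /* The peer's SYN-ACK (or a plain ACK) acknowledges our SYN. */
        if (e == EV_RECV_SYN_ACK || e == EV_RECV_ACK) return ST_ESTABLISHED;
        break;
    default:
        break;
    }
    return s;
}
```

Feeding either peer the sequence active open, receive SYN, receive SYN-ACK walks it from CLOSED to ESTABLISHED without ever passing through LISTEN.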

The Socket-Level Implementation

The practical requirement that makes TCP hole punching hard to implement correctly is port reuse. When a peer connects to the rendezvous server, the connection uses a specific local port. The NAT mapping that the rendezvous server observes is tied to that exact local port. To punch a hole using that mapping, the outbound SYN to the peer must come from the same local port.
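If the rendezvous connection was made from an OS-assigned ephemeral port, the program first has to find out which port that was. One way, sketched here for IPv4 only, is getsockname(); the helper name `local_port_of` is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns the local port (host byte order) an IPv4 socket is bound to,
   or -1 if the lookup fails or the socket is not IPv4. */
int local_port_of(int fd) {
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return -1;
    if (addr.sin_family != AF_INET)
        return -1;
    return ntohs(addr.sin_port);
}
```

Calling this on the rendezvous socket after connect() yields the port that the hole-punching socket must bind to.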

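On Linux, this requires setting SO_REUSEADDR, and in some configurations SO_REUSEPORT, before binding, on every socket that shares the port, including the one already connected to the rendezvous server. The socket must be bound to the specific local port before calling connect().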

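#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns a socket connected to the peer from local_port, or -1 on failure.
int punch_connect(uint16_t local_port, const char *peer_ip, uint16_t peer_port) {
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) return -1;

    int opt = 1;  // both options must be set before bind()
    setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
    setsockopt(sock, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));

    struct sockaddr_in local = {0};
    local.sin_family = AF_INET;
    local.sin_port = htons(local_port);  // same local port as the rendezvous connection
    local.sin_addr.s_addr = htonl(INADDR_ANY);

    struct sockaddr_in remote = {0};
    remote.sin_family = AF_INET;
    remote.sin_port = htons(peer_port);

    if (inet_pton(AF_INET, peer_ip, &remote.sin_addr) != 1 ||
        bind(sock, (struct sockaddr *)&local, sizeof(local)) < 0 ||
        connect(sock, (struct sockaddr *)&remote, sizeof(remote)) < 0) {  // sends the SYN
        close(sock);
        return -1;
    }
    return sock;
}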

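The connect() call sends the SYN. If the timing is right and the remote NAT has already been punched, the peer’s SYN and SYN-ACK come back and connect() returns once the handshake completes. If not, the OS retransmits the SYN according to its backoff schedule until the attempt times out.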

A complication arises on Windows, which has no SO_REUSEPORT option at all, gives SO_REUSEADDR different semantics from Linux, and has historically handled simultaneous open inconsistently in its TCP stack. Any cross-platform implementation needs to account for these differences, usually through platform-specific branches.

What Makes an Algorithm Elegant Here

The difficulty with TCP hole punching is not the concept, which is straightforward. The difficulty lies in the failure modes: NATs that RST inbound SYNs, timing windows that are too narrow, OS implementations that do not correctly handle the simultaneous open state transition, and symmetric NATs that make the approach inapplicable.

An elegant algorithm minimizes the number of things that need to go right simultaneously. One approach is to use the relayed connection that the peers already share to synchronize timing. libp2p’s DCUtR protocol does this: the initiating peer measures the round-trip time to the other peer over the relay, sends a Sync message, and waits half that round-trip time before dialing, while the other peer dials as soon as the Sync message arrives. No shared wall clock is needed. The result is that both SYNs depart at approximately the same instant, maximizing the probability that each arrives after the other peer’s NAT has been punched.
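The scheduling arithmetic is small. The sketch below assumes the half-round-trip rule from the libp2p DCUtR specification, in which one peer measures the relay round trip, sends a sync message, and waits half the measured time before dialing while the other peer dials on receipt. The function name and millisecond timestamps are hypothetical.

```c
#include <stdint.h>

/* Delay, in milliseconds, that the peer sending the sync message waits
   before dialing, assuming the message takes about half a relay round trip
   to arrive. Both timestamps come from the same monotonic clock on the
   measuring peer. */
uint64_t dial_delay_ms(uint64_t request_sent_ms, uint64_t reply_received_ms) {
    if (reply_received_ms <= request_sent_ms)
        return 0;  /* no usable measurement: dial immediately */
    return (reply_received_ms - request_sent_ms) / 2;
}
```

Because only one peer’s clock is involved, the scheme does not depend on the two machines agreeing on the time of day. It does assume the relay path is roughly symmetric.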

Another dimension of elegance is graceful NAT type detection. STUN-based behavior discovery can classify a NAT as cone or symmetric before attempting hole punching. The change-request mechanism used for this was defined in the original STUN specification, RFC 3489, dropped from RFC 5389, and revived as an experimental extension in RFC 5780. Knowing the NAT type up front allows the algorithm to fall back to a TURN relay immediately rather than wasting time on a technique that will fail. Coupling NAT classification tightly into the connection setup flow removes a class of timeouts that otherwise make P2P software feel unreliable.
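The mapping half of that classification reduces to a comparison: ask two different STUN servers what external endpoint they see for the same local socket and check whether the answers agree. The sketch below assumes the two observations have already been collected; the type and function names are hypothetical, and the fallback policy is deliberately conservative.

```c
#include <stdbool.h>
#include <stdint.h>

struct endpoint {
    uint32_t ip;    /* IPv4 address, host byte order */
    uint16_t port;
};

enum nat_mapping { NAT_MAPPING_CONE, NAT_MAPPING_SYMMETRIC };

/* Compare the external endpoint reported by two different STUN servers for
   the same local socket. Identical answers suggest cone-style mapping. */
enum nat_mapping classify_mapping(struct endpoint from_server1,
                                  struct endpoint from_server2) {
    if (from_server1.ip == from_server2.ip &&
        from_server1.port == from_server2.port)
        return NAT_MAPPING_CONE;
    return NAT_MAPPING_SYMMETRIC;
}

/* Conservative policy: attempt a direct connection only when both peers
   look cone-like; otherwise go straight to the relay. */
bool should_try_hole_punch(enum nat_mapping local, enum nat_mapping remote) {
    return local == NAT_MAPPING_CONE && remote == NAT_MAPPING_CONE;
}
```

A peer would run `classify_mapping` once at startup, exchange the result through the rendezvous server, and skip the punching attempt entirely when `should_try_hole_punch` returns false.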

The Modern Context

QUIC, standardized in RFC 9000, sidesteps much of this complexity by running over UDP while providing TCP-like reliability, multiplexed streams, and built-in TLS 1.3. Because it is UDP-based, hole punching reverts to the simpler, timing-tolerant UDP technique. WebRTC, although it is built on DTLS and SRTP rather than QUIC, prefers UDP candidates for the same reason and treats TCP candidates in ICE as a lower-priority fallback.

That said, TCP hole punching remains relevant. Some network environments block all UDP traffic, leaving TCP as the only viable transport. Firewalls configured for corporate security postures often pass TCP on ports 80 and 443 while dropping UDP wholesale. In these environments, getting TCP hole punching right is the difference between having a direct connection and routing everything through a relay with its associated latency and bandwidth costs.

The ICE specification for TCP candidates (RFC 6544) defines three TCP candidate types, carried in the tcptype attribute: active (initiates), passive (listens), and so (simultaneous open). The so type is the formal standardization of TCP hole punching within the ICE framework. Its implementation remains optional in many stacks, and support is uneven across WebRTC implementations, which is part of why new treatments of the algorithm keep appearing.
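RFC 6544 also restricts which candidate types may be paired: an active candidate pairs with a passive one, and a simultaneous-open candidate only with another simultaneous-open candidate. A minimal sketch of that rule, with hypothetical enum and function names:

```c
#include <stdbool.h>

enum tcp_type { TCP_ACTIVE, TCP_PASSIVE, TCP_SO };

/* True when a local and a remote TCP candidate of these types may form a
   candidate pair under the RFC 6544 pairing rule. */
bool tcp_types_pair(enum tcp_type local, enum tcp_type remote) {
    switch (local) {
    case TCP_ACTIVE:  return remote == TCP_PASSIVE;
    case TCP_PASSIVE: return remote == TCP_ACTIVE;
    case TCP_SO:      return remote == TCP_SO;
    }
    return false;
}
```

The consequence for hole punching is that both peers must offer a simultaneous-open candidate. If only one side implements it, ICE has no pair on which to attempt the technique.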

Getting two peers to connect directly through NAT is a problem that looks solved at a distance and reveals substantial complexity up close. TCP’s version of it sits at the intersection of socket programming, NAT device behavior, and a little-used TCP state machine transition. The space for careful, elegant solutions has not been exhausted.
