TCP Hole Punching Is Harder Than You Think, and That's What Makes It Interesting
Source: hackernews
Most developers who’ve dealt with NAT traversal reach for UDP without thinking twice. UDP is stateless, forgiving about timing, and the hole punching story is well-understood: send a packet out, a mapping appears in your NAT table, the other side sends a packet to that mapping, and you’re connected. Clean, simple, repeatable.
TCP hole punching works on the same fundamental principle, but almost everything about the execution is harder. A recent article on robertsdotpm.github.io calls its approach “most elegant,” and the elegance is real — but it’s worth understanding exactly why TCP requires elegance at all, and what the gaps look like in practice.
The RFC 793 Path Nobody Uses
The foundational piece of knowledge here is that TCP simultaneous open is not a hack. It’s a documented, valid state machine transition defined in RFC 793 from 1981, preserved through the 2022 consolidation in RFC 9293.
In normal TCP, one side is in LISTEN, the other sends a SYN, and you get the familiar three-way handshake. In simultaneous open, both sides are in SYN_SENT when they receive each other’s SYN. The state machine handles this without drama:
SYN_SENT + receive SYN (no ACK bit set)
→ SYN_RECEIVED (simultaneous open branch)
→ send SYN+ACK
→ receive SYN+ACK
→ ESTABLISHED
Both sides send a SYN, both receive a bare SYN, both transition to SYN_RECEIVED, both send SYN+ACK, and both reach ESTABLISHED. This is a four-way exchange rather than three-way, and it works on any RFC-compliant TCP stack. The spec authors explicitly anticipated this scenario as a legitimate, if unusual, connection pattern.
The hole punching algorithm exploits this path. Each peer’s outgoing SYN creates a NAT mapping. If the SYN from the remote peer arrives while that mapping exists, many NAT implementations will forward it inward rather than dropping it. The TCP state machine then takes over via simultaneous open semantics.
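The simultaneous open branch can be traced with a toy model. This is a minimal sketch that covers only the transitions listed above; the names State, on_segment, and simulate_simultaneous_open are illustrative and do not come from any real TCP stack.

```python
from enum import Enum


class State(Enum):
    SYN_SENT = "SYN_SENT"
    SYN_RECEIVED = "SYN_RECEIVED"
    ESTABLISHED = "ESTABLISHED"


def on_segment(state, syn, ack):
    """Return (next_state, reply) for one inbound segment.

    Models only the simultaneous-open branch; reply is None or "SYN+ACK".
    """
    if state is State.SYN_SENT and syn and not ack:
        # Bare SYN while our own SYN is outstanding: simultaneous open.
        return State.SYN_RECEIVED, "SYN+ACK"
    if state is State.SYN_RECEIVED and ack:
        # The peer acknowledged our SYN, so the connection is up.
        return State.ESTABLISHED, None
    return state, None


def simulate_simultaneous_open():
    """Walk two peers through crossing SYNs and return their final states."""
    a = b = State.SYN_SENT                      # both have already sent a SYN
    a, _ = on_segment(a, syn=True, ack=False)   # A receives B's bare SYN
    b, _ = on_segment(b, syn=True, ack=False)   # B receives A's bare SYN
    a, _ = on_segment(a, syn=True, ack=True)    # A receives B's SYN+ACK
    b, _ = on_segment(b, syn=True, ack=True)    # B receives A's SYN+ACK
    return a, b


print(simulate_simultaneous_open())
```

Neither peer ever passes through LISTEN in this trace, which is the property hole punching depends on.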
Why Timing Is the Core Problem
With UDP, you can send multiple probes and be generous with the timing window. If a probe arrives too early, nothing breaks: the NAT just drops it, and you try again. TCP is unforgiving. A SYN arriving at a NAT for a port that has no outbound mapping gets dropped or, worse, gets a RST. A SYN that gets through the NAT but reaches a host whose kernel has no socket in SYN_SENT state (and no listener) for that address and port pair is also answered with a RST, for example when an earlier connect() attempt has already given up while its NAT mapping lingers.
This means both peers need their SYN packets to be in flight at roughly the same time. If peer A’s SYN reaches peer B’s NAT before peer B has sent its own SYN outward, B’s NAT drops A’s SYN. The hole hasn’t been punched yet. The timing window is on the order of the round-trip time to the rendezvous server — which is typically tens to hundreds of milliseconds, not generous.
The rendezvous coordination protocol has to account for this. The libp2p dcutr (Direct Connection Upgrade through Relay) protocol handles it by measuring the round-trip time between the two peers over the relayed connection and sending a “Sync” message timed so that both peers’ SYN packets are likely to be crossing in the network at the same moment: the receiver dials as soon as the message arrives, and the sender dials after waiting half the measured RTT. The implementation is in Rust at protocols/dcutr in rust-libp2p and in Go in go-libp2p.
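The arithmetic behind a timed “connect now” signal can be sketched for a coordinator-driven variant. This is an illustrative scheme, not dcutr’s wire protocol: the function name send_delays, the peer labels, and the assumption that one-way delay is half the RTT are all hypothetical.

```python
def send_delays(rtt_to_server_ms):
    """Map each peer to the time (ms) it should wait after the signal arrives.

    Assumes the coordinator sends the signal to every peer at the same
    instant and that one-way delay is half the measured round-trip time.
    """
    one_way = {peer: rtt / 2 for peer, rtt in rtt_to_server_ms.items()}
    slowest = max(one_way.values())
    # The peer that hears the signal last sends immediately; the others
    # wait out the difference so all SYNs leave at the same moment.
    return {peer: slowest - delay for peer, delay in one_way.items()}


delays = send_delays({"peer_a": 40.0, "peer_b": 120.0})
print(delays)
```

With these inputs peer_a hears the signal 20 ms after it is sent and waits 40 ms, while peer_b hears it at 60 ms and sends at once, so both SYNs leave at the 60 ms mark. Real paths are asymmetric and jittery, so a scheme like this narrows the race rather than eliminating it.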
The OS-Level Friction
Beyond NAT behavior, there’s an OS-level problem that makes TCP hole punching awkward to implement. You need to bind a local port, use it to connect() to the remote peer’s NAT endpoint, and simultaneously make that same port available to accept() an incoming connection in case the remote peer’s SYN arrives first.
This requires SO_REUSEPORT on Linux. The basic skeleton looks like this:
import socket
import threading

LOCAL_PORT = 54321
REMOTE_ADDR = ('203.0.113.42', 9000)  # remote peer's public endpoint

def make_socket():
    # Both sockets must set the reuse options before bind() to share the port.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(('0.0.0.0', LOCAL_PORT))
    return s

# Listener thread
def listen():
    srv = make_socket()
    srv.listen(1)
    conn, addr = srv.accept()  # blocks until a peer connects
    print(f'Accepted connection from {addr}')

# Connector thread
def connect():
    c = make_socket()
    try:
        c.connect(REMOTE_ADDR)  # the outgoing SYN creates the NAT mapping
        print('Connected outbound')
    except (ConnectionRefusedError, TimeoutError):
        c.close()  # the SYN was reset or dropped; may need retry

t1 = threading.Thread(target=listen)
t2 = threading.Thread(target=connect)
t1.start()
t2.start()
Both sockets bind to the same port. The kernel allows this with SO_REUSEPORT. The connect() sends the outgoing SYN that creates the NAT mapping. The listen()/accept() path catches the incoming SYN from the remote peer if it arrives as an ordinary connection (not simultaneous open). The simultaneous open path is handled transparently by the kernel’s TCP state machine if both SYNs cross.
On Windows, SO_REUSEADDR has different semantics and you need careful socket ordering to avoid port conflicts. BSD systems behave differently again. Cross-platform TCP hole punching implementations spend a meaningful amount of code on these variations.
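One small piece of that variation can be handled by feature detection. The sketch below assumes that Python’s socket module exposes SO_REUSEPORT only on platforms that define it (it is typically absent on Windows builds); the helper name make_reusable_socket is hypothetical, and the sketch does not address the differing SO_REUSEADDR semantics themselves.

```python
import socket


def make_reusable_socket(port):
    """Bind a TCP socket with whatever port-sharing options the platform offers."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # SO_REUSEPORT is looked up dynamically because not every platform has it.
    reuseport = getattr(socket, "SO_REUSEPORT", None)
    if reuseport is not None:
        s.setsockopt(socket.SOL_SOCKET, reuseport, 1)
    s.bind(("0.0.0.0", port))
    return s


sock = make_reusable_socket(0)   # port 0 lets the OS pick a free port
print(sock.getsockname()[1])
sock.close()
```

Whether two sockets can then share the port for listen() and connect() still depends on the operating system, so a real implementation would need per-platform testing.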
What NATs Actually Do
NAT behavior is theoretically governed by RFC 5382, which mandates that NATs support simultaneous open and keep TCP mappings for established connections alive for at least 2 hours 4 minutes of idle time. In practice, consumer NATs vary widely.
The four traditional NAT categories, defined in RFC 3489 (the original STUN specification, 2003) and used throughout the hole punching literature, including Bryan Ford’s 2005 paper on peer-to-peer communication across NATs, still hold:
- Full Cone: Any external host can reach the mapped port. Hole punching is trivial.
- Address-Restricted Cone: External host must have received a packet from the internal host first. Standard hole punching works.
- Port-Restricted Cone: External host and port must match a previous outbound packet. Simultaneous SYN required.
- Symmetric: Different external port used per destination. Hole punching fails without port prediction.
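The list above can be turned into a small lookup. This is a simplified sketch: the names NatType, OUTLOOK, and pair_outlook are illustrative, and the rule that the stricter of the two NATs decides the outcome is an assumption made for brevity, not a claim about every NAT pairing.

```python
from enum import IntEnum


class NatType(IntEnum):
    # Ordered from most permissive to most restrictive.
    FULL_CONE = 0
    ADDRESS_RESTRICTED = 1
    PORT_RESTRICTED = 2
    SYMMETRIC = 3


OUTLOOK = {
    NatType.FULL_CONE: "trivial",
    NatType.ADDRESS_RESTRICTED: "standard hole punching works",
    NatType.PORT_RESTRICTED: "simultaneous SYN required",
    NatType.SYMMETRIC: "fails without port prediction",
}


def pair_outlook(a, b):
    """Assume the stricter of the two NATs determines the outcome."""
    return OUTLOOK[max(a, b)]


print(pair_outlook(NatType.ADDRESS_RESTRICTED, NatType.PORT_RESTRICTED))
```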
The IMC 2022 paper from the libp2p team measured hole punching success rates in production across roughly 2,700 peers. TCP hole punching succeeded about 70% of the time; QUIC (over UDP) succeeded about 80% of the time. The gap reflects NATs that pass UDP but are stricter about TCP SYN handling.
Carrier-Grade NAT (CGN, RFC 6888) is the most problematic modern deployment. ISPs use CGN at scale, and these are almost always symmetric NATs with short mapping timeouts and aggressive SYN filtering. A meaningful fraction of mobile and residential users now sit behind CGN that no hole punching technique can reliably traverse.
The RST Problem
One failure mode that doesn’t get enough attention: some NAT implementations send a RST when they receive an inbound SYN for a port that has an outbound SYN_SENT mapping. This is non-compliant with RFC 5382 but common enough to affect real-world success rates significantly.
From the NAT’s perspective, it sees an outbound SYN (which it might interpret as the start of a half-open connection it should track), then an inbound SYN to the same mapped port from an unexpected source. Some implementations treat this as a port scan or a bogus connection attempt and actively reset it rather than forwarding it to the internal host.
This is one reason why the simultaneous open timing matters so much. If both SYNs cross in transit simultaneously, neither NAT has received the peer’s SYN before its own SYN has created the mapping. The inbound SYN arrives at a NAT that already has a mapping in SYN_SENT state, which compliant NATs should forward. The race is won by being synchronized enough that the mappings exist on both sides before either SYN arrives.
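Because a single lost race can end in a RST, implementations usually retry. The sketch below assumes a hypothetical zero-argument callable, attempt, that performs one bind-and-connect and raises on failure; the function name, retry count, and delay are illustrative defaults rather than recommended values.

```python
import time


def connect_with_retry(attempt, max_attempts=5, delay_s=0.05):
    """Call attempt() until it returns a connection or attempts run out.

    Returns whatever attempt() returns on success, or None if every
    attempt was reset.
    """
    for _ in range(max_attempts):
        try:
            return attempt()
        except (ConnectionRefusedError, ConnectionResetError):
            # A RST from a NAT or from the peer's kernel: the SYNs did not
            # cross cleanly this time, so pause briefly and try again.
            time.sleep(delay_s)
    return None
```

In practice each retry would be re-synchronized with the remote peer; retrying blindly from one side only helps if the other side is retrying too.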
QUIC Has Made This Easier, But Not Obsolete
Most modern P2P systems prefer UDP-based transports such as QUIC for hole punching. Because QUIC runs over UDP, NATs track it as an ordinary UDP flow, so traversal is essentially identical to UDP hole punching, with none of the simultaneous open complexity. Tailscale, ZeroTier, libp2p, and WebRTC all prefer UDP/QUIC paths when available.
But TCP hole punching remains relevant for a specific and important reason: corporate and institutional firewalls commonly block UDP entirely or rate-limit it aggressively, while allowing TCP on ports 80 and 443. A system that can only do UDP hole punching will fail in a significant fraction of enterprise environments. TCP hole punching, potentially on port 443, is the fallback for these environments.
WebRTC addresses this through RFC 6544 TCP candidates in the ICE framework. ICE tries candidates in priority order: direct host connections first, then STUN-discovered server-reflexive addresses, then TURN relay. TCP simultaneous-open candidates are included in this hierarchy. When UDP paths fail, ICE can fall back to TCP hole punching before resorting to full TURN relay.
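The priority ordering comes from a formula in RFC 8445, which combines a type preference, a local preference, and the component ID. The sketch below uses the type preference values that RFC recommends; the function name and the choice of 65535 as local preference are illustrative, and RFC 6544’s rules for deriving the local preference of TCP candidates are not modeled here.

```python
# Recommended type preferences from RFC 8445.
TYPE_PREFERENCE = {
    "host": 126,
    "prflx": 110,   # peer-reflexive
    "srflx": 100,   # server-reflexive (STUN-discovered)
    "relay": 0,     # TURN relay
}


def candidate_priority(cand_type, local_pref, component_id):
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    type_pref = TYPE_PREFERENCE[cand_type]
    return (type_pref << 24) + (local_pref << 8) + (256 - component_id)


for kind in ("host", "srflx", "relay"):
    print(kind, candidate_priority(kind, 65535, 1))
```

Higher values are tried first, so host candidates outrank server-reflexive ones, and relayed candidates come last.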
Fallback Is Not Optional
No production system should rely on TCP hole punching without a relay fallback. The engineering target, based on the libp2p production data, is roughly 70% direct connections and 30% relayed. TURN (RFC 8656) is the standard relay protocol; the IETF MASQUE working group is developing HTTP/3-based UDP and TCP proxying as a higher-performance alternative.
The relay is not a fallback in the sense of a failure mode to be ashamed of. It’s part of the design. Symmetric NAT exists, CGN exists, misconfigured firewalls exist. A P2P system that works 70% of the time and silently fails the other 30% is not production-ready. One that works 70% of the time directly and 100% of the time including relay is.
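Treating the relay as part of the design can be as simple as making it the second branch of the dial path. This is a minimal sketch in which direct and relay are hypothetical zero-argument callables supplied by the application; it is not taken from any particular library.

```python
def connect_with_fallback(direct, relay):
    """Try the hole-punched path first, then the relay.

    Returns a (path, connection) tuple so callers can record which path
    was used without changing how they use the connection.
    """
    try:
        return "direct", direct()
    except OSError:
        # Covers refused, reset, and timed-out attempts, which are all
        # OSError subclasses in Python.
        return "relayed", relay()
```

Returning the path label makes it straightforward to measure the direct-to-relayed ratio in production.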
Why the Algorithm Matters
What makes a TCP hole punching algorithm “elegant” is primarily about handling the timing coordination cleanly. The core mechanics are not complicated — RFC 793 simultaneous open does the work, SO_REUSEPORT handles the OS constraint, a rendezvous server provides address exchange. The elegance is in the synchronization: measuring RTT to both peers, issuing the connect signal at the right moment, handling the race between the outbound SYN and the inbound SYN without requiring the application layer to know which path won.
Libp2p’s dcutr protocol is a good reference implementation of these ideas. The dcutr spec is readable and the RTT-based synchronization approach is well-documented. For anyone building a system that needs P2P connectivity through NAT, reading dcutr alongside the Roberts article gives a complete picture of what the production version of this problem looks like.