
TCP Hole Punching and the Simultaneous Open Nobody Uses

Source: hackernews

NAT traversal has a reputation for being solved. UDP hole punching is well-documented, widely deployed, and the foundation of most peer-to-peer systems, from WebRTC to WireGuard-based overlays. The UDP case is tractable because UDP has no connection setup for the NAT to track: you send a packet out, the NAT records a mapping for the outgoing flow, and for a window of time it allows inbound traffic from the destination you sent to. No handshake, no connection object, no server-side state to establish first.

TCP is a different problem. The protocol was designed around an asymmetric relationship: one side listens, one side connects. The three-way handshake encodes this asymmetry. NAT implementations internalize it. When your TCP stack issues a SYN, the NAT creates a state entry for that outgoing session. When an unsolicited SYN arrives from outside, the NAT drops it, because nothing about the existing state says to expect it. This mismatch is why most P2P systems default to UDP even when they would prefer TCP’s reliability and ordering guarantees.

But TCP hole punching is possible. The technique is underused partly because it is harder to implement correctly than the UDP version, and partly because the crucial mechanism it relies on, simultaneous open, is a corner of the TCP spec that most programmers have never needed to think about.

This write-up by robertsdotpm explores what a clean formulation of the algorithm looks like. Its central observation, that the algorithm is elegant, is worth unpacking at the protocol level.

The State Machine Branch Nobody Teaches

RFC 793, the original TCP specification from 1981, defines a state machine with two distinct opening sequences. The normal case: one socket calls listen(), the other calls connect(), SYN goes out, SYN-ACK comes back, ACK is sent, connection is established. This is what every networking course teaches.

The other case, simultaneous open, handles what happens when both sides call connect() at the same time. When a socket in the SYN_SENT state receives an incoming SYN (not a SYN-ACK), it does not treat this as an error. It transitions to SYN_RECEIVED and responds with a SYN-ACK. The remote side is in the same state, sees the same thing, sends its own SYN-ACK. Both sides receive SYN-ACK, both send ACK, both reach ESTABLISHED. Neither side was the server. No listen(), no accept().
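
To make the branch concrete, here is a toy model of the transitions just described. It is a simplified sketch, not a TCP implementation: sequence numbers, retransmission, and every state outside the opening handshake are left out, and the names step and simultaneous_open are made up for this illustration.

```python
from collections import deque

def step(state, segment):
    """Return (new_state, reply) for one inbound segment (handshake subset only)."""
    if state == "SYN_SENT" and segment == "SYN":
        return "SYN_RECEIVED", "SYN-ACK"   # the simultaneous open branch
    if state == "SYN_SENT" and segment == "SYN-ACK":
        return "ESTABLISHED", "ACK"        # the ordinary active open
    if state == "SYN_RECEIVED" and segment == "SYN-ACK":
        return "ESTABLISHED", "ACK"
    if state == "SYN_RECEIVED" and segment == "ACK":
        return "ESTABLISHED", None
    return state, None                     # anything else is ignored in this model

def simultaneous_open():
    """Both peers have already sent a SYN; deliver segments until none remain."""
    states = {"A": "SYN_SENT", "B": "SYN_SENT"}
    in_flight = deque([("A", "B", "SYN"), ("B", "A", "SYN")])
    trace = []
    while in_flight:
        src, dst, segment = in_flight.popleft()
        old_state = states[dst]
        states[dst], reply = step(old_state, segment)
        trace.append((dst, old_state, segment, states[dst]))
        if reply:
            in_flight.append((dst, src, reply))
    return states, trace

if __name__ == "__main__":
    final_states, trace = simultaneous_open()
    for row in trace:
        print(row)
    print(final_states)
```

Neither peer ever passes through a LISTEN state in the trace, which is the whole point of this branch of the state machine.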

Linux, BSD, macOS, and Windows all implement this correctly at the kernel level. The spec requires it. What varies is how easy the socket API makes it to trigger this path intentionally.

The Socket API Problem

To punch a TCP hole, you need a socket that is both connecting outward to the remote peer and capable of receiving the remote peer’s incoming SYN on the same local port. The standard socket API resists this. Two bind() calls to the same address and port fail with EADDRINUSE unless you set socket options specifically to permit sharing.

The options are SO_REUSEADDR and, on Linux 3.9+ and BSD, SO_REUSEPORT. Their semantics differ subtly.

SO_REUSEADDR primarily allows binding to a local address that still has connections lingering in TIME_WAIT. On Linux it also allows multiple sockets to bind to the same port as long as none of them is listening. It does not, by itself, allow a listening socket and a connecting socket to share the same local endpoint.

SO_REUSEPORT goes further: it allows multiple sockets to bind to the same address and port combination, with the kernel disambiguating incoming traffic by the full four-tuple (local address, local port, remote address, remote port). This is what makes hole punching viable. Two sockets sharing a local port will each only receive traffic that matches their specific remote address.
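
A quick way to see the effect of these options is to bind two sockets to the same loopback port with and without them. This is a minimal sketch assuming a Linux or BSD-style stack where socket.SO_REUSEPORT is defined; the helper name bind_same_port_twice is invented for the example.

```python
import socket

def bind_same_port_twice(share):
    """Try to bind two TCP sockets to one loopback port; return True on success."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if share:
            # Both sockets must opt in before bind() for sharing to be allowed
            for s in (first, second):
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        first.bind(("127.0.0.1", 0))       # port 0: let the kernel pick a free port
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
            return True
        except OSError:                    # EADDRINUSE when sharing is not permitted
            return False
    finally:
        first.close()
        second.close()

if __name__ == "__main__":
    print("with options:   ", bind_same_port_twice(share=True))
    print("without options:", bind_same_port_twice(share=False))
```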

Here is the skeleton of what this looks like in Python:

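import socket
import threading

def make_punch_socket(local_addr):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # SO_REUSEPORT is not defined on every platform (for example Windows)
    if hasattr(socket, "SO_REUSEPORT"):
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(local_addr)
    return s

def tcp_hole_punch(local_addr, remote_addr, timeout=5.0):
    # The connector: sends our SYN, opens the NAT hole
    connector = make_punch_socket(local_addr)
    connector.settimeout(timeout)

    # The listener: receives the peer's SYN if they connect first
    listener = make_punch_socket(local_addr)
    listener.settimeout(timeout)
    listener.listen(1)

    winner = []
    lock = threading.Lock()

    def offer(sock):
        # Keep the first socket that succeeds, close any later one
        with lock:
            if winner:
                sock.close()
            else:
                sock.settimeout(None)  # hand back a blocking socket
                winner.append(sock)

    def try_connect():
        try:
            connector.connect(remote_addr)
            offer(connector)
        except OSError:
            connector.close()

    def try_accept():
        try:
            conn, _ = listener.accept()
            offer(conn)
        except OSError:
            pass

    t1 = threading.Thread(target=try_connect)
    t2 = threading.Thread(target=try_accept)
    t1.start()
    t2.start()
    # The socket timeouts bound both calls, so these joins cannot hang
    t1.join()
    t2.join()
    listener.close()

    return winner[0] if winner else None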

The outgoing connect() sends a SYN and creates a NAT state entry. When the peer’s SYN arrives (because they are doing the same thing toward your public address), your NAT forwards it because the outgoing state entry matches. The simultaneous open completes without either side having acted as a server in the traditional sense.

Why Port Reuse Is the Key Insight

The cleanest formulations of TCP hole punching reuse the same local port that was used to contact the rendezvous server. This is not an arbitrary choice.

When you connect to a rendezvous server from an ephemeral local port, your NAT assigns you a public IP and port for that session. That public endpoint is what the rendezvous server sees and what it relays to your peer. If you then initiate the hole punch from a different local port, your NAT will assign a different public port, and the peer will be connecting to an address nobody is listening at.

By reusing the same local port, with SO_REUSEADDR and SO_REUSEPORT set before the new bind(), you guarantee that the NAT is already maintaining state for that local port and, on endpoint-independent NATs, the same public port will remain in use. The peer connects to exactly the address the rendezvous server reported.
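
The local half of this can be tried without any NAT. The sketch below uses a throwaway loopback server as a stand-in for the rendezvous server, connects to it from a kernel-chosen port, and then binds a second socket to that same local port. The function names reuse_socket and demo_port_reuse are invented here, and the example assumes a platform where socket.SO_REUSEPORT exists.

```python
import socket
import threading

def reuse_socket(local_addr):
    """Create a TCP socket that may share its local port with other sockets."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(local_addr)
    return s

def demo_port_reuse():
    # Stand-in rendezvous server on loopback (hypothetical, no NAT in the path)
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    observed = []

    def serve():
        conn, peer_addr = server.accept()
        observed.append(peer_addr)         # the endpoint a real server would relay
        conn.close()

    t = threading.Thread(target=serve)
    t.start()

    rendezvous = reuse_socket(("127.0.0.1", 0))
    rendezvous.connect(server.getsockname())
    local_addr = rendezvous.getsockname()
    t.join()

    # Second socket on the SAME local port, ready to connect() toward the peer
    punch = reuse_socket(local_addr)
    punch_addr = punch.getsockname()

    for s in (rendezvous, punch, server):
        s.close()
    return observed[0], local_addr, punch_addr

if __name__ == "__main__":
    print(demo_port_reuse())
```

On loopback the server sees the local port directly. Behind a NAT it would see the mapped public port instead, which is why the mapping behavior of the NAT decides whether the trick works.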

This is the coordination protocol:

  1. Both peers open a persistent connection to the rendezvous server from the port they intend to use for P2P traffic.
  2. The server records both peers’ public endpoints as observed from the network.
  3. The server sends each peer the other’s public endpoint, triggering both to begin the hole punch simultaneously.
  4. Both peers issue connect() to each other’s public endpoint on the same local port used for the rendezvous connection, with the listen socket also bound to that port.
  5. Both SYNs traverse both NATs, simultaneous open completes, the connection is established.
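
The server side of steps 2 and 3 can be sketched as a small pairing table. Everything here is illustrative: the class name Rendezvous, the session_id key, and the shape of the "go" message are assumptions made for the example, and the actual network transport is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Rendezvous:
    """In-memory pairing table; transport and message framing are omitted."""
    waiting: dict = field(default_factory=dict)   # session_id -> (peer_id, endpoint)

    def register(self, session_id, peer_id, observed_endpoint):
        """Record a peer's public endpoint as seen by the server (step 2).

        Returns a list of (recipient, message) pairs. The list is empty until
        a second peer joins the same session, at which point both "go"
        messages are produced together (step 3).
        """
        if session_id not in self.waiting:
            self.waiting[session_id] = (peer_id, observed_endpoint)
            return []
        first_id, first_endpoint = self.waiting.pop(session_id)
        # Each peer is told the OTHER peer's public endpoint
        return [
            (first_id, {"type": "go", "connect_to": observed_endpoint}),
            (peer_id, {"type": "go", "connect_to": first_endpoint}),
        ]

if __name__ == "__main__":
    server = Rendezvous()
    print(server.register("demo", "alice", ("203.0.113.5", 40000)))
    print(server.register("demo", "bob", ("198.51.100.7", 51000)))
```

Producing both messages in one call keeps them as close together as the transport allows, which is what the simultaneous start depends on.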

The timing of step 3 matters. The server should send both “go” messages as close together as possible. The window you have is the NAT’s timeout for the half-open entry your SYN created, typically 30 to 75 seconds on consumer NAT hardware. Timing pressure is therefore not extreme, but the symmetry of the algorithm benefits from simultaneity.

NAT Types and Where This Fails

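RFC 4787 classifies NAT behavior along several dimensions (RFC 5382 carries the same terminology over to TCP, and RFC 5128 surveys the traversal techniques built on it). The critical dimension for hole punching is the mapping behavior.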

NATs with endpoint-independent mapping (the full-cone, restricted-cone, and port-restricted-cone NATs of older terminology) assign one public port per local address and port, regardless of where the socket is sending to. This is the case hole punching relies on. Your public port for a given local port is stable across different remote destinations.

Symmetric NATs assign a different public port per destination. The port you used to contact the rendezvous server is not the same port the NAT assigns when you connect to your peer. The peer connects to the wrong address. The hole punch fails.
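
The difference between the two mapping behaviors fits in a few lines. The model below is a deliberately crude sketch: ToyNat and punch_would_land are invented names, ports are handed out sequentially, and filtering and timeouts are ignored entirely.

```python
import itertools

class ToyNat:
    """Assigns public ports to outbound flows. Illustrative only."""

    def __init__(self, endpoint_independent, first_port=40000):
        self.endpoint_independent = endpoint_independent
        self.ports = itertools.count(first_port)
        self.table = {}

    def public_port(self, local, dest):
        # Endpoint-independent mapping keys on the local endpoint alone;
        # symmetric mapping keys on the (local, destination) pair.
        key = local if self.endpoint_independent else (local, dest)
        if key not in self.table:
            self.table[key] = next(self.ports)
        return self.table[key]

def punch_would_land(nat, local, rendezvous, peer):
    """True if the port the server advertised is the port the punch SYN uses."""
    advertised = nat.public_port(local, rendezvous)
    actual = nat.public_port(local, peer)
    return advertised == actual

if __name__ == "__main__":
    local = ("192.168.1.10", 50000)
    rendezvous = ("203.0.113.1", 443)
    peer = ("198.51.100.7", 51000)
    print(punch_would_land(ToyNat(True), local, rendezvous, peer))
    print(punch_would_land(ToyNat(False), local, rendezvous, peer))
```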

There is no clean workaround for symmetric NAT without either a relay server (TURN) or port prediction, which attempts to guess what public port the NAT will assign based on observed allocation patterns. Port prediction has poor reliability in practice and is not worth implementing when TURN is available as a fallback.

ICE, which WebRTC uses for NAT traversal, handles this gracefully: it gathers host, server-reflexive, and relay candidates and checks them in priority order, so direct paths are preferred and the TURN relay is the last resort. TCP candidates are supported alongside UDP (RFC 6544), and one of the TCP candidate types uses exactly the simultaneous open mechanism described here.
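
The priority ordering can be illustrated with the candidate priority formula and recommended type preferences from RFC 8445. This is only a sketch of the ranking idea: real ICE agents rank candidate pairs rather than single candidates, and the function names below are made up for the example.

```python
# Recommended type preferences from RFC 8445 (higher is preferred)
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}

def candidate_priority(cand_type, local_pref=65535, component_id=1):
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)"""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_pref << 8) + (256 - component_id)

def order_candidates(cand_types):
    """Sort candidate types from most to least preferred."""
    return sorted(cand_types, key=candidate_priority, reverse=True)

if __name__ == "__main__":
    for cand_type in order_candidates(["relay", "srflx", "host"]):
        print(cand_type, candidate_priority(cand_type))
```

The relay candidate sorts last because its type preference is zero, which matches the fallback order described above.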

OS-Level Complications

Even when the network cooperates, OS behavior introduces failure modes.

SO_REUSEPORT semantics differ between Linux and macOS. On Linux, the kernel load-balances incoming connections across all listening sockets sharing a port, while connected sockets are demultiplexed by the full four-tuple. On macOS, the behavior is similar but has historically had edge cases around which socket receives a given incoming connection when the remote address is not yet fixed in the kernel’s internal state.

Linux before 3.9 did not have SO_REUSEPORT at all. If you need to support old kernels, you are limited to SO_REUSEADDR alone, which restricts what you can do with port sharing.

Windows supports port reuse through a different mechanism. Winsock’s SO_REUSEADDR has semantics closer to Linux’s SO_REUSEPORT, and Winsock has traditionally not defined SO_REUSEPORT at all (Python’s socket module does not expose it on Windows), so cross-platform implementations need a platform check and careful testing.

The race condition to watch: if the peer’s SYN arrives at your listening socket before you have also issued connect(), the accept() will complete, but the socket you get is the product of an ordinary passive open, not a simultaneous open. This is fine for the application, and it means the connector socket turns out to be unnecessary in that run. Implementations should handle both paths and close whichever socket does not complete first.

Where This Gets Used

libp2p, the networking library used by IPFS and several blockchain systems, implements TCP hole punching as part of its NAT traversal stack, with STUN-like address discovery and circuit relay as the fallback. The Go implementation is a reasonable reference for seeing the socket mechanics done in production code.

WebRTC’s ICE implementation in browsers handles TCP hole punching transparently, though most browser P2P traffic goes over DTLS/SRTP on UDP because the latency profile is better. The TCP path exists for environments where UDP is blocked.

For custom implementations in Go, pion/ice provides a full ICE stack including TCP candidate support. If you are building a P2P system and do not need to own the NAT traversal layer, starting there is more practical than implementing raw simultaneous open.

What Makes the Algorithm Elegant

The simultaneous open technique is elegant in a specific sense: it uses TCP’s own state machine as the connection mechanism rather than fighting it. You are not bypassing the handshake or faking packets. You are using a code path the RFC explicitly specifies, on a feature every compliant TCP stack must implement, to create a symmetric connection between two clients without either acting as a server.

The elegance breaks down in implementation because the socket API was designed for the asymmetric case and does not expose simultaneous open as a first-class operation. You have to assemble it from the combination of SO_REUSEPORT, careful binding order, and concurrent connect/accept calls. The port reuse trick closes another gap by ensuring the NAT mapping you discover through the rendezvous server is the same one your hole punch will use.

TCP hole punching will fail more often in deployment than UDP hole punching, and symmetric NAT is common enough in corporate networks that a TURN fallback is not optional for production systems. But for the cases where it works, it produces a genuine TCP connection with no relay overhead, no extra latency, and none of the complexity of layering reliability on top of UDP yourself.
