
How Native IPv6 Flattens the Edge Kubernetes Networking Stack

Source: lobsters

Henrik Gerdes’s writeup on native IPv6 Kubernetes for edge routing is a good practical account of setting this up, but the more interesting question underneath it is why the approach works as well as it does architecturally. The answer has everything to do with what NAT costs you at the edge and what IPv6 makes structurally unnecessary.

The Edge Routing Problem

In a managed Kubernetes cluster on a major cloud provider, you mostly do not think about how traffic reaches your pods. A cloud load balancer accepts external traffic, the cloud CNI manages pod networking, and the provider’s control plane stitches it together. The routing problem is solved by the infrastructure you are renting.

At the edge, none of that exists. You are running Kubernetes on bare metal or on small VMs in a facility that hands you a physical or virtual uplink and tells you to figure the rest out. That uplink carries one or a few IPv4 addresses. Your pods need to be reachable, your services need to be reachable, and the only tool traditionally available for this is NAT in various configurations: DNAT for inbound, SNAT or masquerade for outbound, with NodePort or MetalLB bridging the gap between the pod network and the external world.

This works, but it introduces statefulness, asymmetric routing risks, and protocol-level problems for anything that embeds IP addresses in its application layer. SIP, FTP, and ICE-based WebRTC all behave differently or worse behind NAT, and QUIC-based traffic can suffer when NAT mappings for UDP time out or rebind. More mundanely, every NAT hop depends on a connection-tracking table with finite capacity. kube-proxy’s iptables-based NAT has well-documented scaling problems past a few thousand services, and the nftables backend, while better, does not change the fundamental architecture.

What “Native Routing” Means

Before getting to IPv6, it is worth being precise about what “native routing” means in the Kubernetes CNI context, because the term gets used loosely.

All CNIs give pods IP addresses and allow pod-to-pod traffic. Where they differ is in how they carry that traffic between nodes. Overlay CNIs like Flannel with VXLAN or Calico with IPIP encapsulate pod-to-pod packets inside node-to-node packets. The pod network is a virtual topology layered on top of the physical one. This works anywhere you can route between nodes, but it adds encapsulation overhead (VXLAN costs about 50 bytes per packet over an IPv4 underlay and 70 over an IPv6 one), requires MTU adjustments to avoid fragmentation, and hides pod IPs from the physical network.

Native routing CNIs, of which Cilium in native routing mode is the most capable, forward pod packets directly at the IP layer without encapsulation. Nodes need to know routes to each other’s pod CIDRs, either via static routes, a routing daemon, or BGP. The advantage is lower overhead and the fact that pod IPs are visible to the physical network. That visibility is what makes true edge routing possible.

IPv6 Changes the Premise

With IPv4, the fundamental problem at the edge is address scarcity. You have one or a handful of public addresses. Your pod CIDR is RFC 1918 space. The gap between those two realities is bridged by NAT. The complexity of Kubernetes service networking, the existence of NodePort, the baroque iptables rules kube-proxy generates, all of this is downstream of the original sin of not having enough public addresses.

IPv6 eliminates address scarcity by construction. A /48 prefix, which is a standard allocation for a site, gives you 65,536 /64 subnets, each with 2^64 addresses. Your pod CIDR can be publicly routable space. Your service CIDR can be publicly routable space. There is no architectural reason for NAT to exist in the path.
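The arithmetic is easy to check with Python's standard `ipaddress` module. This is a minimal sketch; the prefix below is taken from the IPv6 documentation range and only stands in for a real site allocation.

```python
import ipaddress

# Documentation prefix standing in for a real /48 site allocation (assumption).
site = ipaddress.ip_network("2001:db8:abcd::/48")

# Number of /64 subnets in a /48 is 2^(64 - 48).
subnet_count = 2 ** (64 - site.prefixlen)

# subnets() is a generator, so taking three does not materialize all 65,536.
subnets = site.subnets(new_prefix=64)
first_three = [str(next(subnets)) for _ in range(3)]

# Every /64 holds 2^64 addresses.
addresses_per_subnet = ipaddress.ip_network(first_three[0]).num_addresses

print(subnet_count)                      # 65536
print(first_three)
print(addresses_per_subnet == 2 ** 64)   # True
```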

This is not just a nice-to-have. It means that with proper BGP routing, every pod IP is directly reachable from the public internet. Services do not need NodePort. LoadBalancer-type services can be announced directly via BGP without needing MetalLB to manage a separate pool of VIPs that then get NATed. The routing model becomes flat.

Configuring a Single-Stack IPv6 Cluster

Kubernetes has supported dual-stack (simultaneous IPv4 and IPv6) as stable since 1.23, and single-stack IPv6 has worked for longer. For a true edge deployment where your uplink is IPv6-native, single-stack is often the cleaner choice.

With kubeadm, a single-stack IPv6 cluster requires setting both the pod and service CIDRs to IPv6 ranges:

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: "fd00:10:244::/48"
  serviceSubnet: "fd00:10:96::/112"

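The service subnet uses a /112 because kube-apiserver has historically limited how large the service range can be: for IPv6 it rejects anything bigger than a /108. A /112 gives you 65,536 service IPs, which is plenty. The pod subnet is larger to accommodate per-node allocations, typically a /64 per node carved from a larger block. Note that fd00::/8 is unique local address space, which is not routed on the public internet. The ranges here are placeholders, and a cluster whose pods must be reachable from outside would use a globally routable prefix delegated by your provider.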
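To make the sizing concrete, here is a small sketch that carves per-node /64 blocks out of the pod subnet and counts the service range. The helper `node_pod_cidr` is hypothetical and hands out blocks in simple sequential order; it illustrates the address math only and is not how the Kubernetes controller actually picks node CIDRs.

```python
import ipaddress

pod_subnet = ipaddress.ip_network("fd00:10:244::/48")
service_subnet = ipaddress.ip_network("fd00:10:96::/112")


def node_pod_cidr(index: int, node_prefix: int = 64) -> ipaddress.IPv6Network:
    """Return the index-th per-node block carved from the pod subnet.

    Hypothetical helper: sequential carving, for illustration only.
    """
    block_size = 2 ** (128 - node_prefix)
    base = int(pod_subnet.network_address) + index * block_size
    block = ipaddress.IPv6Network((base, node_prefix))
    if not block.subnet_of(pod_subnet):
        raise ValueError("node index exceeds pod subnet capacity")
    return block


print(node_pod_cidr(0))              # fd00:10:244::/64
print(node_pod_cidr(1))              # fd00:10:244:1::/64
print(service_subnet.num_addresses)  # 65536
```

A /48 pod subnet with /64 node blocks leaves room for 65,536 nodes, far more than an edge site needs, so a smaller pod range would work just as well.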

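The node also has to register with an IPv6 address, and the API server has to advertise one. In kubeadm, both settings live in the InitConfiguration:

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: "2001:db8::1"  # address the API server advertises
nodeRegistration:
  kubeletExtraArgs:
    node-ip: "2001:db8::1"  # address the kubelet registers for this node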

If you are using k3s instead of kubeadm, the equivalent flags are --cluster-cidr and --service-cidr passed to the server command:

k3s server \
  --cluster-cidr fd00:10:244::/48 \
  --service-cidr fd00:10:96::/112 \
  --flannel-backend=none \
  --disable-network-policy

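The --flannel-backend=none flag stops k3s from deploying its bundled Flannel, and --disable-network-policy turns off the embedded network policy controller, since Cilium will take over both roles.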

Cilium and BGP Control Plane

Cilium is the CNI that makes this architecture practical. Its native routing mode requires no encapsulation, and its BGP Control Plane, which gained IPv6 support in Cilium 1.13, handles the route announcements that make pod IPs externally reachable.

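Installing Cilium for native IPv6 routing takes one Helm command. Two settings are easy to miss: Cilium masquerades IPv6 traffic by default, so enableIPv6Masquerade=false is what actually keeps pod source addresses intact, and ipv6NativeRoutingCIDR names the range Cilium should treat as natively routed:

helm install cilium cilium/cilium \
  --namespace kube-system \
  --set ipv6.enabled=true \
  --set ipv4.enabled=false \
  --set routingMode=native \
  --set autoDirectNodeRoutes=true \
  --set ipv6NativeRoutingCIDR=fd00:10:244::/48 \
  --set enableIPv6Masquerade=false \
  --set bgpControlPlane.enabled=true \
  --set ipam.mode=kubernetes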

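With BGP Control Plane enabled, you configure peering through a CiliumBGPPeeringPolicy custom resource. Recent Cilium releases also ship a newer set of BGP resources (CiliumBGPClusterConfig and related kinds), so check which API your version expects: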

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: edge-router-peer
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/os: linux
  virtualRouters:
    - localASN: 65001
      exportPodCIDR: true
      neighbors:
        - peerAddress: "2001:db8:ffff::1/128"
          peerASN: 65000
          eBGPMultihopTTL: 1  # 1 means a directly connected peer, no multihop
          connectRetryTimeSeconds: 10
          holdTimeSeconds: 90
          keepAliveTimeSeconds: 30

This tells Cilium to peer with your upstream edge router at 2001:db8:ffff::1 using eBGP. With exportPodCIDR: true, Cilium announces each node’s pod CIDR to the upstream router as the node comes up. The router then knows to forward traffic for those prefixes to the corresponding Kubernetes node, which Cilium routes onward to the destination pod without any NAT in the path.
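What the upstream router does with those announcements is an ordinary longest-prefix match. The sketch below models that lookup in Python; the route table, the node addresses, and the `next_hop` function are all made up for illustration and do not come from any router's real API.

```python
import ipaddress
from typing import Optional

# Hypothetical routes as the upstream router might hold them after each node
# announces its pod CIDR: prefix -> next hop (the announcing node's address).
ROUTES = {
    ipaddress.ip_network("fd00:10:244::/64"): "2001:db8:ffff::10",
    ipaddress.ip_network("fd00:10:244:1::/64"): "2001:db8:ffff::11",
    ipaddress.ip_network("::/0"): "2001:db8:ffff::fe",
}


def next_hop(destination: str) -> Optional[str]:
    """Return the next hop of the most specific route covering destination."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


print(next_hop("fd00:10:244:1::abcd"))  # 2001:db8:ffff::11
print(next_hop("2001:db8::1"))          # 2001:db8:ffff::fe
```

The point is that the router holds one entry per node rather than one per pod, and no per-connection state at all.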

You can also announce specific LoadBalancer service IPs by adding serviceSelector blocks to the virtual router configuration, which lets you control which services are externally reachable at the BGP level rather than through firewall rules.

The NDP Proxy Problem

One nuance that does not show up until you test this is Neighbor Discovery Protocol proxying. NDP is IPv6’s equivalent of ARP. When your upstream router wants to send traffic to a pod IP that falls within your announced prefix, it needs to resolve the MAC address for the next hop.

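If pod IPs are drawn from a prefix that is routed to the node, the router only has to resolve the node's own address, and NDP works naturally. The trouble starts when pod IPs come out of the same /64 that is on-link on the node’s uplink interface, which is common when a provider hands you a single /64 with no routed prefix. The upstream router then treats every pod address as a directly attached neighbor and sends neighbor solicitations for it, but the pods are not on that link, so the node has to respond on their behalf. This requires either NDP proxy mode on the node, which you enable with: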

sysctl -w net.ipv6.conf.eth0.proxy_ndp=1
ip -6 neigh add proxy 2001:db8:1:2::1 dev eth0

or a routing configuration where the upstream router holds a route for the pod prefix whose next hop is the node, rather than trying to resolve pod IPs directly. The latter is cleaner and is what a proper BGP setup achieves: the router learns that fd00:10:244:0::/64 is reachable via a next hop on the interface facing the node (for example the node’s link-local address, shown here as fe80::1), so it performs neighbor discovery only for that next hop and never for individual pod addresses. Cilium’s native routing mode operates this way when the peering is configured correctly.
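The difference between the two setups comes down to which address the router has to resolve with neighbor discovery. The following sketch captures that decision under simplifying assumptions: `ndp_target` is a hypothetical function, a routed prefix is assumed to take precedence over the on-link prefix, and real routers apply full longest-prefix matching instead.

```python
import ipaddress


def ndp_target(destination: str, onlink_prefix: str, routed: dict) -> str:
    """Return the address the router must resolve via NDP to deliver a packet.

    routed maps prefix strings to next-hop strings (illustrative model only).
    """
    dest = ipaddress.ip_address(destination)
    # Routed prefix: only the next hop needs neighbor resolution.
    for prefix, hop in routed.items():
        if dest in ipaddress.ip_network(prefix):
            return hop
    # On-link destination: the router solicits for the destination itself.
    if dest in ipaddress.ip_network(onlink_prefix):
        return destination
    raise LookupError("no route to destination")


routed = {"fd00:10:244::/64": "fe80::1"}
print(ndp_target("fd00:10:244::42", "2001:db8:1:2::/64", routed))   # fe80::1
print(ndp_target("2001:db8:1:2::42", "2001:db8:1:2::/64", routed))  # 2001:db8:1:2::42
```

In the second call the router solicits for the destination address itself. If that address belongs to a pod behind a node, something on the link has to answer for it, which is the job proxy NDP does.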

Single-Stack vs Dual-Stack Trade-offs

The case for single-stack IPv6 at the edge is simplicity. Every component, the CNI, the service CIDR, kube-proxy or Cilium’s kube-proxy replacement, DNS, ingress controllers, only needs to handle one address family. Configuration surfaces are halved. Failure modes are more legible.

The cost is that you cannot directly reach IPv4-only endpoints from your pods without a transition mechanism. The two standard options are DNS64 with NAT64, or a dual-stack setup for external traffic. DNS64 synthesizes AAAA records for hosts that only publish A records, embedding the IPv4 address in an IPv6 prefix (the well-known one is 64:ff9b::/96) so that IPv6 clients can address them. NAT64 then translates packets sent to those synthesized addresses into IPv4 at the border. This is the mechanism used in mobile carrier networks that have moved to IPv6-only internal infrastructure while still reaching the IPv4 internet.
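The address synthesis itself is simple bit arithmetic. As a rough sketch, assuming the well-known 64:ff9b::/96 prefix (with a /96, the IPv4 address occupies the low 32 bits), the mapping and its inverse look like this. The function names are made up for illustration; a real deployment would rely on a DNS64 resolver and a NAT64 gateway rather than application code.

```python
import ipaddress

# Well-known NAT64 prefix; other prefix lengths use a different bit layout.
WELL_KNOWN_PREFIX = ipaddress.ip_network("64:ff9b::/96")


def synthesize_aaaa(ipv4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of the /96 prefix (DNS64 side)."""
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(WELL_KNOWN_PREFIX.network_address) | int(v4))


def extract_ipv4(ipv6: str) -> ipaddress.IPv4Address:
    """Recover the embedded IPv4 address (what the NAT64 gateway does)."""
    addr = ipaddress.IPv6Address(ipv6)
    if addr not in WELL_KNOWN_PREFIX:
        raise ValueError("address is not inside the NAT64 prefix")
    return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)


print(synthesize_aaaa("192.0.2.33"))      # 64:ff9b::c000:221
print(extract_ipv4("64:ff9b::c000:221"))  # 192.0.2.33
```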

For clusters where all upstream peers and external services you care about are reachable via IPv6, which is increasingly true in 2026, single-stack is the right call. For clusters that need to reach legacy IPv4-only infrastructure, dual-stack is worth the complexity overhead, or you deploy a NAT64 gateway at the cluster boundary.

Sysctl and MTU Considerations

A few kernel-level details matter for this to work reliably.

First, forwarding must be enabled for IPv6 on nodes that route pod traffic:

sysctl -w net.ipv6.conf.all.forwarding=1

With forwarding enabled, nodes by default stop processing Router Advertisements on their uplink interface, which means they lose their default route. The fix is to set accept_ra=2 on the uplink interface, which processes RAs even when forwarding is on:

sysctl -w net.ipv6.conf.eth0.accept_ra=2

Second, MTU. IPv6 mandates a minimum link MTU of 1280 bytes. Native routing with no encapsulation means pods can use the full MTU of your physical interface (typically 1500 on Ethernet, or 9000 on jumbo-frame links). This is a concrete advantage over overlay modes, where VXLAN’s overhead (50 bytes over an IPv4 underlay, 70 over IPv6) forces you to reduce the pod MTU to 1450 or 1430 on a 1500-byte link. Relying on fragmentation is not a dependable alternative, because IPv6 routers never fragment packets in transit. Encapsulation also costs header bytes and CPU work on every packet, and Cilium’s native routing mode simply avoids it.
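The MTU numbers follow directly from the header sizes. A small sketch of the calculation, using the standard fixed header lengths; the function names are illustrative only:

```python
# Header sizes in bytes. The VXLAN payload carries the inner Ethernet frame,
# so the inner Ethernet header counts against the underlay MTU.
INNER_ETHERNET = 14
UDP = 8
VXLAN = 8
OUTER_IP = {"ipv4": 20, "ipv6": 40}


def vxlan_overhead(underlay: str) -> int:
    """Bytes of encapsulation added to each pod packet for a given underlay."""
    return OUTER_IP[underlay] + UDP + VXLAN + INNER_ETHERNET


def pod_mtu(link_mtu: int, vxlan_underlay=None) -> int:
    """Usable pod MTU: the full link MTU when routing natively (no underlay)."""
    if vxlan_underlay is None:
        return link_mtu
    return link_mtu - vxlan_overhead(vxlan_underlay)


print(pod_mtu(1500))          # 1500 (native routing)
print(pod_mtu(1500, "ipv4"))  # 1450
print(pod_mtu(1500, "ipv6"))  # 1430
```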

The Resulting Architecture

What you end up with, when this is configured correctly, is a Kubernetes cluster where pod IPs are first-class participants in your network topology. Traffic from the internet reaches a pod without a single DNAT rule in the path. Outbound traffic from a pod reaches the internet with its pod IP as the source address. Load balancing happens at the BGP level by announcing the same service prefix from multiple nodes with ECMP, not inside the cluster via iptables chains.
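ECMP itself is conceptually small: the router hashes each flow's identifying fields and uses the result to pick one of the equal-cost next hops, so packets of one flow stay on one path. The sketch below shows the idea with a generic hash; `ecmp_next_hop` is hypothetical, and real routers use their own hash functions and field selections.

```python
import hashlib


def ecmp_next_hop(flow: tuple, next_hops: list) -> str:
    """Pick a next hop by hashing the flow tuple (illustrative, not a router's hash)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Nodes announcing the same service prefix (addresses are placeholders).
nodes = ["2001:db8:ffff::10", "2001:db8:ffff::11", "2001:db8:ffff::12"]

# Flow tuple: source address, destination address, protocol, source port, destination port.
flow = ("2001:db8:aaaa::5", "fd00:10:96::1f", "tcp", 51514, 443)

# The same flow always maps to the same node.
print(ecmp_next_hop(flow, nodes) == ecmp_next_hop(flow, nodes))  # True
```

One consequence of plain modulo hashing is that adding or removing a node can remap existing flows, which is worth keeping in mind for long-lived connections.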

This is the architecture that cloud providers have been running internally for years. The pods in a GKE or EKS cluster are not all NATed either; the difference is that the cloud provider’s control plane handles the routing plumbing for you. Henrik’s setup demonstrates that the same architecture is achievable on bare metal edge hardware once you stop treating IPv4 NAT as a given and design around IPv6 from the start.

The tooling to do this, Cilium’s BGP Control Plane, single-stack IPv6 cluster configuration, NDP-aware native routing, is mature enough in 2026 that the main barrier is familiarity with IPv6 networking concepts rather than missing software features. The cases where you still need NAT are shrinking as IPv6 adoption continues, and at the edge, where you control the uplink configuration, those cases can often be eliminated entirely.
