
The Hidden Cost of Topology Spread Constraints During Kubernetes Rolling Updates

Source: lobsters

Cloudflare recently published details about a one-line Kubernetes fix that recovered 600 hours of deployment time per year. The change is a single YAML field added to their topology spread constraints. The underlying problem is a subtle interaction between scheduling constraints and rolling updates that affects more production clusters than most teams realize.

What Topology Spread Constraints Do

Kubernetes introduced topologySpreadConstraints as a structured way to distribute pods across failure domains. Before this feature, you had pod affinity and anti-affinity rules, which were expressive but verbose, and they gave you no way to control the degree of imbalance, only whether specific pods could coexist. Topology spread constraints let you say: spread my pods across availability zones, and don’t let any zone have more than one extra pod compared to the least-loaded zone.

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-service

maxSkew: 1 means the number of matching pods in the most populated topology domain can exceed the number in the least populated domain by at most 1. whenUnsatisfiable: DoNotSchedule makes this a hard constraint: if placing a pod would violate the skew limit, the scheduler will not place it. The feature graduated to stable in Kubernetes 1.19 and has been widely adopted since.

This is sound practice for high-availability services. Spreading pods across zones means a single AZ outage doesn’t take down your entire workload. The DoNotSchedule mode ensures the cluster doesn’t drift into an imbalanced state over time.

The Rollout Problem

During a rolling update, Kubernetes replaces old pods with new ones incrementally. The Deployment controller scales the new ReplicaSet up and the old one down in steps: it creates new pods, waits for them to become ready, then terminates old ones. The maxSurge and maxUnavailable settings control how many pods can be in transition at once.
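As a rough illustration of where those two settings live, here is a Deployment fragment with both written out. The name my-service and the replica count are assumptions chosen to match the example in this article, and the fragment omits the selector and pod template that a complete manifest needs. The 25% values are the Kubernetes defaults.

```yaml
# Fragment only: selector and pod template are omitted for brevity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service          # hypothetical name
spec:
  replicas: 30              # illustrative replica count
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%         # extra pods allowed above the desired replica count
      maxUnavailable: 25%   # pods allowed to be unavailable during the rollout
```

With 30 replicas, the percentage for maxSurge rounds up to 8 pods and the percentage for maxUnavailable rounds down to 7, so several old and new pods can coexist at any moment of the rollout.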

The topology spread constraint’s labelSelector matches pods by label. In most deployments, that label selector matches the application label, which means it matches both the old pods from the previous ReplicaSet and the new pods from the current one simultaneously.

Consider a deployment with 30 pods spread evenly across 3 zones: 10 per zone. A new version starts deploying. The scheduler counts all pods matching the label selector, including the 30 old pods. The zones are balanced at 10 each, so placing the first new pod in zone A creates an 11/10/10 distribution, still within maxSkew: 1. The rollout continues.

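Problems emerge as old pods are terminated and zone counts diverge temporarily. With maxSkew: 1, each termination creates a brief imbalance that restricts where the next new pod can go. Suppose the counts stand at 11/10/10 and an old pod in zone B terminates, leaving 11/9/10. The next new pod can only go to zone B: placing it in zone A or zone C would put that zone 3 or 2 pods above the minimum. If the only valid zone doesn’t have capacity, or other scheduling constraints conflict, the new pod stays Pending. The rolling update controller waits for it to become ready, and the rollout stalls until the scheduler can place it.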

With larger deployments, this isn’t occasional interference; it’s a systematic pattern. A deployment with 200 pods rolling across dozens of nodes hits constraint violations repeatedly throughout the rollout. Each violation adds latency, and the delays accumulate over the course of the rollout.

The Root Cause

The label selector in topologySpreadConstraints has no concept of pod generations. It treats all matching pods as equivalent for the spread calculation, whether they belong to the old ReplicaSet or the new one. During a rolling update, the scheduler is computing spread across a population that is, by design, temporarily inconsistent.

This is a semantic mismatch between the constraint’s intent and its implementation. The goal is to ensure the running workload is spread evenly for resilience purposes. But the constraint doesn’t distinguish between “currently serving traffic” and “about to be replaced.” It sees a mixed-generation population and enforces the spread limit against that entire population, including pods that will be terminated within seconds.

The Fix: matchLabelKeys

Kubernetes added a field called matchLabelKeys to topology spread constraints, first as alpha in 1.25 and then as beta, enabled by default, in 1.27. It lets you specify additional label keys whose values are ANDed with the labelSelector when identifying pods for the spread calculation: the scheduler reads the value of each listed key from the pod being scheduled and counts only pods that carry the same value. When you include pod-template-hash, the scheduler computes spread only across pods that share the same template hash, which corresponds to a single ReplicaSet.

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-service
    matchLabelKeys:
      - pod-template-hash

That single addition changes the spread calculation so it only considers the new pods being scheduled, not the old ones still running. The rollout no longer sees interference from the previous generation. New pods spread across zones as if the old ones weren’t there. Once the rollout completes and old pods are terminated, the spread reflects only the current generation, which is what the constraint was always intended to enforce.
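For context, this is roughly where the constraint sits in a complete Deployment manifest. It is a minimal sketch, not Cloudflare’s configuration: the name, replica count, and container image are hypothetical placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                      # hypothetical name
spec:
  replicas: 30                          # illustrative replica count
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service                 # pod-template-hash is NOT written here
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-service
          matchLabelKeys:
            - pod-template-hash         # scope the spread count to one revision
      containers:
        - name: my-service
          image: registry.example.com/my-service:1.2.3   # hypothetical image
```

The constraint belongs to the pod spec, under spec.template.spec, not to the Deployment spec itself. The pod template only names the pod-template-hash key; the label’s value is attached to the pods by the controller.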

The Deployment controller automatically adds the pod-template-hash label to every ReplicaSet it creates and to the pods in it, with a value derived from a hash of the pod template spec. Pods from different revisions of a Deployment get different hash values, so scoping the spread calculation by this label gives you per-revision isolation without any extra labeling work.
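To make the per-revision scoping concrete, here is what the labels on two pods from the same Deployment might look like in the middle of a rollout. The pod names and hash values below are made up for illustration; real values are generated by the controller.

```yaml
# Pod from the old ReplicaSet (hash value is made up)
metadata:
  name: my-service-6b7d9c5f8-x7k2p
  labels:
    app: my-service
    pod-template-hash: 6b7d9c5f8
---
# Pod from the new ReplicaSet (hash value is made up)
metadata:
  name: my-service-7c9d6b5f4-q9w4z
  labels:
    app: my-service
    pod-template-hash: 7c9d6b5f4
```

Both pods match app: my-service, so a constraint with only the labelSelector counts them together. With pod-template-hash listed under matchLabelKeys, a new pod being scheduled is compared only against pods carrying its own hash value.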

Why This Wasn’t the Default

matchLabelKeys wasn’t added until Kubernetes 1.25, years after topology spread constraints shipped. The gap reflects the general-purpose design of the field rather than a missed edge case.

matchLabelKeys isn’t specific to pod-template-hash or Deployment rollouts. It’s a mechanism for scoping spread calculations to any label dimension. Some workloads use StatefulSets, DaemonSets, or custom controllers that don’t use ReplicaSets at all. Building pod-template-hash awareness into the scheduler as a special case would have been inconsistent with the constraint model, and it would have introduced implicit behavior that’s hard to reason about without reading scheduler internals.

Backward compatibility also shapes the decision. Changing the default spread calculation for existing constraints would break any deployment that depended on cross-generation spread semantics, whether intentionally or by accident. Making matchLabelKeys opt-in was the right call, even if it leaves this footgun in place for teams that don’t know to look for it.

Alternatives and Their Tradeoffs

The other common response to rollout stalls is switching whenUnsatisfiable: DoNotSchedule to whenUnsatisfiable: ScheduleAnyway. This converts the hard constraint into a soft preference. Rollouts proceed without stalling, but the cluster can drift into imbalanced zone distributions during periods of churn. For services where zone balance matters for failure isolation, this trades away the guarantee you were trying to enforce in the first place.

Increasing maxSkew from 1 to 2 or 3 is a middle ground. It relaxes the constraint enough to accommodate temporary imbalance during rollouts without abandoning hard enforcement entirely. The risk is that a maxSkew of 2 on a large deployment can allow meaningful imbalance in practice, and choosing the right value requires understanding the deployment’s scaling and rollout behavior in detail.
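For comparison, the two workarounds described above would look something like this. These are sketches of alternative configurations, shown as two separate YAML documents; they are not meant to be applied together.

```yaml
# Alternative A: soft constraint. The scheduler prefers balance but never blocks.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-service
---
# Alternative B: hard constraint with a looser skew limit.
topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-service
```

In both variants the labelSelector still matches old and new pods alike, so the spread count remains mixed across generations; the constraint is only made more tolerant of the resulting imbalance.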

Adding matchLabelKeys: [pod-template-hash] is the cleaner solution because it preserves the original constraint semantics. The hard limit still applies; it just applies to the generation being deployed, which is what the constraint was designed to express.

Scale Makes It Visible

The rollout stall behavior is invisible on small clusters or low-replica deployments. With 5 replicas across 3 zones, the constraint rarely interferes, and when it does, the delay is a matter of seconds. The problem only becomes measurable with many replicas, frequent deployments, and tight spread requirements: the conditions that large-scale Kubernetes deployments exist to manage.

Cloudflare’s 600-hour figure reflects straightforward arithmetic. A large organization running hundreds of services, each with dozens or hundreds of replicas, deploying many times per day, can accumulate minutes of extra rollout latency per deployment. That compounds into substantial time across a year. The Kubernetes scheduler is working correctly; the configuration just wasn’t complete.

Most teams copy topology spread constraint configuration from documentation or internal templates that predate the matchLabelKeys field. The basic configuration looks right in testing, and the stall behavior only manifests at scale. If you’re running Kubernetes 1.27 or later, where the field is enabled by default, with whenUnsatisfiable: DoNotSchedule on rolling deployments, auditing your constraints for this field is a small amount of work with potentially large payoff.
