The Arithmetic of Kubernetes Deployment Overhead

Source: lobsters

Cloudflare published a post describing how a single configuration change saved 600 engineer-hours per year. The specific fix is worth understanding, but the broader pattern is more instructive: at sufficient deployment velocity, seemingly harmless Kubernetes defaults become expensive in aggregate.

The arithmetic is straightforward. If an organization runs 200 rolling deployments per day across a fleet of services, and each deployment incurs an unnecessary 30-second stall, the annual cost is roughly:

200 deployments/day × 30s × 365 days = 2,190,000 seconds ≈ 608 hours

That is not a hypothetical. It is the regime Cloudflare operates in, and it is the regime any reasonably active engineering organization will eventually reach. The fix itself was one line of YAML. The discovery required understanding exactly where those seconds were going.

Where the Time Goes: The Pod Termination Sequence

To understand where Kubernetes hides latency, you need to trace what happens when a rolling update replaces a pod. The sequence is not as simple as “old pod stops, new pod starts.” There are several actors involved, and they do not coordinate synchronously.

When Kubernetes decides to terminate a pod during a rolling update, two things begin at approximately the same time. The control plane removes the pod from the relevant Endpoints (and EndpointSlices) objects, and the kubelet starts shutting the pod down: it runs any preStop lifecycle hook and, once the hook returns, sends SIGTERM to the container’s main process. The termination grace period clock starts ticking as soon as deletion is requested, so it covers the hook as well as the shutdown that follows.

The problem is that “removes from Endpoints” does not mean traffic immediately stops flowing to the pod. The kube-proxy daemon on each node watches the API server for endpoint changes and updates iptables or IPVS rules accordingly. In a large cluster, this propagation takes time. Kubernetes SIG Network documentation acknowledges that there is no synchronization guarantee between endpoint removal and proxy rule updates.

So you have a window, often several seconds, where a pod has received SIGTERM and is shutting down, but is still receiving new incoming connections because some nodes’ proxy rules have not yet updated. If the application exits cleanly on SIGTERM during that window, those connections get reset. That produces errors.

The preStop Sleep: A Workaround Masquerading as a Best Practice

The conventional solution to the endpoint propagation race is a preStop hook with a sleep:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]

Because the kubelet does not send SIGTERM until the hook finishes, the sleep keeps the application serving long enough for proxy rules to propagate. It is effective. It is also everywhere in Kubernetes documentation and Stack Overflow answers, which means it gets copy-pasted into manifests without anyone verifying whether the sleep duration is appropriate for the specific cluster and workload.

Consider a 15-second preStop sleep on every pod in a deployment with 10 replicas and maxUnavailable: 1. If pods are replaced one at a time and each replacement waits for the previous pod to finish terminating, the rolling update spends 10 × 15 = 150 seconds on artificial stalling before any application-level work happens. Increase the replica count, decrease maxUnavailable, and those seconds compound fast.

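The same dynamic applies to terminationGracePeriodSeconds. The default is 30 seconds, which is reasonable for most applications. But it is common to find manifests where someone set it to 120 or 300 seconds to be “safe.” The grace period is an upper bound, not a fixed wait: an application that exits within a second of SIGTERM never pays it. The cost appears when the process never sees or handles SIGTERM, for example because a shell wrapper running as PID 1 does not forward signals. Every pod termination then waits for the full grace period before the kubelet sends SIGKILL, and the pod lingers in Terminating the whole time.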
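As a sketch, the two settings can be sized together from measurements rather than copied from elsewhere. The fragment below assumes endpoint propagation in the cluster was measured at under 5 seconds and that the application drains in roughly 10; the container name, image, and every duration are hypothetical.

```yaml
# Pod template fragment (illustrative values, not recommendations).
spec:
  # Upper bound for the whole shutdown: preStop hook time plus time after SIGTERM.
  terminationGracePeriodSeconds: 20
  containers:
  - name: my-app                    # hypothetical container name
    image: example.com/my-app:1.0   # hypothetical image; must contain /bin/sh
    lifecycle:
      preStop:
        exec:
          # Assumed: proxy rules in this cluster converge in under 5 seconds.
          command: ["/bin/sh", "-c", "sleep 5"]
```

The grace period here leaves about 15 seconds for the application after the hook returns. If the hook alone consumed the whole budget, the container would be killed before it could shut down cleanly.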

minReadySeconds and the Readiness Probe Lag

There is a less-discussed source of deployment latency: the interaction between minReadySeconds and readiness probe configuration.

minReadySeconds specifies how long a newly created pod must be in a Ready state before Kubernetes considers it available and proceeds with the next step of a rolling update. The default is 0, which means Kubernetes moves immediately once the readiness probe passes.

This sounds fine until you account for probe timing. If a readiness probe is configured with:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3

Then a pod that starts up in 2 seconds will still wait about 10 seconds before the first probe fires, and 20 to 30 seconds before Kubernetes marks it Ready if the first probe or two fail. With minReadySeconds: 30 added on top, that single pod contributes 40 to 60 seconds of latency to the rollout. Multiply by replica count and you see the pattern.

The fix is often just removing an overly conservative initialDelaySeconds and using a startupProbe instead. Startup probes have been enabled by default since Kubernetes 1.18 and stable since 1.20, and they are the recommended approach for applications with variable startup times.
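A minimal sketch of that substitution, reusing the /health endpoint and port from the probe above. The periods and thresholds are illustrative assumptions and should be tuned against measured startup times.

```yaml
# Container fragment: the startup probe absorbs slow starts,
# so the readiness probe needs no initialDelaySeconds.
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 1        # poll every second while the application starts
  failureThreshold: 60    # allow roughly 60 seconds before the container is restarted
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5        # readiness probing begins only after the startup probe succeeds
  failureThreshold: 3
```

With this shape, a pod that starts in 2 seconds passes the startup probe within a second or so of being able to serve, instead of idling until a fixed initial delay elapses.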

Pod Disruption Budgets and the Serialization Tax

PodDisruptionBudgets are a frequent suspect when rollouts slow down, but their role is often misunderstood. A PDB limits voluntary evictions made through the Eviction API, such as those issued by kubectl drain or the cluster autoscaler. It does not constrain the Deployment controller: the pace of a rolling update is set by the strategy's maxSurge and maxUnavailable, whatever the PDB says. Pods that a rollout has made unavailable do count against the budget, though, so an eviction that arrives mid-rollout is more likely to be refused. And a PDB with maxUnavailable: 0, or with minAvailable equal to the total replica count, allows no voluntary disruptions at all, which means node drains stall completely.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 5
  selector:
    matchLabels:
      app: my-app

If the deployment has exactly 5 replicas, this PDB allows zero disruptions. A rolling update still proceeds at the pace its strategy allows, but every eviction of a my-app pod is refused, so a node drain waits until something else moves the pod. If the PDB is wrong and the intent was actually minAvailable: 4, every drain that touches the deployment is slower than it needs to be.

These misconfigurations accumulate quietly because nothing fails outright; operations just take longer.
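If the intent is to tolerate one disrupted pod at a time, one option is to express the budget as maxUnavailable, so it does not silently become zero-tolerance when the replica count changes. A sketch, reusing the names from the example above:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  # Allows one voluntary disruption at a time regardless of replica count
  # (equivalent to minAvailable: 4 when there are 5 replicas).
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
```

Whether one disruption at a time is the right budget depends on the workload; the point is only that the number tracks intent rather than a replica count that may later change.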

Topology Spread Constraints and Scheduling Latency

A more subtle source of rollout delay is topology spread constraints. Stable since Kubernetes 1.19, these allow you to express that pods should be distributed across zones or nodes:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app

whenUnsatisfiable: DoNotSchedule means the scheduler will not place a pod if doing so would violate the skew constraint. In a rolling update, this can cause new pods to sit in Pending while waiting for old pods to terminate and free up capacity in the over-represented zone. The rollout serializes not just because of application behavior but because of scheduler decisions.

Changing whenUnsatisfiable to ScheduleAnyway or adjusting maxSkew to 2 can unblock these situations without meaningfully compromising the distribution goal. Whether that trade-off is acceptable depends on the workload, but many teams do not realize the constraint is causing the delay in the first place.
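A sketch of the relaxed constraint follows. The matchLabelKeys field is an addition not discussed above and assumes a cluster where it is available (beta and on by default since Kubernetes 1.27); drop it on older clusters.

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  # Soft constraint: the scheduler prefers balanced zones
  # but never leaves a pod Pending because of skew.
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: my-app
  # Assumption: matchLabelKeys is supported. With pod-template-hash, skew is
  # computed per ReplicaSet, so old-revision pods do not count against new ones.
  matchLabelKeys:
  - pod-template-hash
```

Scoping the skew calculation to the new revision addresses the rolling-update case specifically, which may make it possible to keep DoNotSchedule for workloads that need a hard guarantee.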

Auditing Your Own Deployments

The actionable takeaway from the Cloudflare case is not to copy their specific fix but to measure your own rollouts. Kubernetes events and pod timestamps give you the raw data:

kubectl rollout history deployment/my-app
kubectl describe pod <pod-name> | grep -A 10 Events
kubectl get events --sort-by=.lastTimestamp

The timestamps in pod events tell you exactly how long each phase took: time from scheduled to running, time from running to ready, time from deletion requested to container exit. When those numbers do not match your expectations for how fast your application starts or stops, there is a configuration in the way.

For clusters with Prometheus, the kube_pod_status_phase and kube_deployment_status_condition metrics exported by kube-state-metrics let you derive rollout durations and plot their distribution. A spike in the p95 or p99 rollout time that does not correspond to any application-level change usually points to a Kubernetes configuration interaction, not a code issue.
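One way to surface slow rollouts is a Prometheus alerting rule, sketched below. It assumes kube-state-metrics is installed and uses two of its metrics that are not mentioned above, kube_deployment_status_replicas_updated and kube_deployment_spec_replicas. The group name, alert name, and 10-minute threshold are placeholders.

```yaml
groups:
- name: rollout-duration            # hypothetical group name
  rules:
  - alert: DeploymentRolloutSlow    # hypothetical alert name
    # True while a Deployment has fewer updated replicas than desired,
    # which is the case during a rollout (and briefly during a scale-up).
    expr: |
      kube_deployment_status_replicas_updated
        < kube_deployment_spec_replicas
    for: 10m                        # assumed threshold; tune to your normal rollout time
    labels:
      severity: warning
    annotations:
      summary: "Rollout of {{ $labels.deployment }} has been in progress for over 10 minutes"
```

An alert like this does not say which setting is responsible, only that a rollout is taking longer than expected, which is the cue to go read the pod event timestamps.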

The broader point is that Kubernetes configuration is not fire-and-forget. Defaults that were sensible for one replica count or deployment frequency stop being sensible at scale. A preStop sleep that added negligible overhead when a service had 3 replicas and deployed twice a week becomes a meaningful cost when that service has 50 replicas and deploys 20 times a day. Cloudflare found their 600 hours by looking carefully at what was happening during pod termination. Most teams have not looked that carefully yet.
