The Compounding Cost of Kubernetes Deployment Defaults at Scale

Source: lobsters

Cloudflare published a post recently about recovering 600 hours of engineering time per year from a single line change in a Kubernetes manifest. The number sounds dramatic, but the underlying mechanics are straightforward once you understand how deployment time accumulates at scale. The interesting part is not the fix itself, but the general class of problem it represents: configuration defaults that are entirely reasonable for small clusters become compounding liabilities when you run hundreds of services at Cloudflare’s deployment frequency.

The Multiplication Problem

Kubernetes rolling deployments have several configurable parameters that control how fast pods are replaced. Each one introduces a wait time per pod. Multiply that wait by the number of pods in a service, multiply again by the number of deployments per week, multiply again by the number of services, and small values become very large numbers.

Consider a service with 200 pods that gets deployed three times a week. If each pod takes 30 extra seconds to terminate during a rolling update, that single service costs:

200 pods × 30s × 3 deployments/week × 52 weeks = 936,000 seconds ≈ 260 hours/year
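The same multiplication can be written as a small helper, so you can plug in your own numbers. This is a minimal sketch: the function and parameter names are made up for illustration, and it treats every extra second as cumulative, as the arithmetic above does.

```python
SECONDS_PER_HOUR = 3600


def annual_wait_hours(pods, extra_seconds_per_pod, deploys_per_week, weeks_per_year=52):
    """Cumulative hours per year spent waiting on pod termination for one service."""
    total_seconds = pods * extra_seconds_per_pod * deploys_per_week * weeks_per_year
    return total_seconds / SECONDS_PER_HOUR


# The service from the example: 200 pods, 30 s extra each, 3 deployments a week.
hours = annual_wait_hours(pods=200, extra_seconds_per_pod=30, deploys_per_week=3)
print(f"{hours:.0f} hours/year")  # prints "260 hours/year"
```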

At Cloudflare’s scale, with hundreds of services and deployments happening continuously across a global fleet, 600 hours is not a surprising outcome. The real question is which configuration value was responsible.

terminationGracePeriodSeconds: The Most Common Culprit

The default value of terminationGracePeriodSeconds in Kubernetes is 30. This means when a pod receives a SIGTERM signal, Kubernetes waits up to 30 seconds before sending SIGKILL. The intent is to give the application time to finish in-flight requests and close connections cleanly.

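The problem is that “up to 30 seconds” becomes “exactly 30 seconds” when the containerized process does not handle SIGTERM properly. Inside a container the application usually runs as PID 1, and the kernel does not apply the default terminate action to PID 1: a signal with no registered handler is simply dropped. Many applications, particularly those using runtimes or frameworks that do not register signal handlers by default, or those launched through a shell wrapper that does not forward signals, will therefore ignore SIGTERM and keep running until SIGKILL arrives. Kubernetes obliges by waiting the full grace period for every single pod.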

A manifest entry like this:

spec:
  terminationGracePeriodSeconds: 30

Looks harmless and is often left at its default. But if the process never sees or never handles SIGTERM, every pod sits out the full 30 seconds waiting for SIGKILL. If a clean shutdown would have taken under two seconds, you are paying roughly 28 seconds per pod for nothing.

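For services where the application genuinely handles SIGTERM and shuts down quickly, the grace period is only a ceiling. Lowering it to something like 5 or 10 seconds is safe, though it saves time only when a shutdown hangs. For applications that ignore SIGTERM entirely, the right fix is to handle the signal in code; lowering the value alone shortens the wait but turns every shutdown into a hard kill. A preStop hook can also run cleanup before SIGTERM is sent, but its run time counts against the grace period rather than extending it.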
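Handling the signal in code can be very little work. The sketch below is one way to do it in Python, not a prescription: ShutdownFlag and run_until_stopped are hypothetical names, and a real service would drain in-flight requests and close connections where the loop exits.

```python
import signal
import time


class ShutdownFlag:
    """Records that SIGTERM arrived so the main loop can exit on its own."""

    def __init__(self):
        self.requested = False

    def _handle(self, signum, frame):
        # Keep the handler trivial: set a flag and let the main loop clean up.
        self.requested = True

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)
        return self


def run_until_stopped(flag, do_work, poll_interval_s=0.5):
    """Call do_work() repeatedly until SIGTERM has been received."""
    while not flag.requested:
        do_work()
        time.sleep(poll_interval_s)
    # Reaching this point means SIGTERM arrived; release resources here.
```

With a handler like this the process exits within roughly one poll interval of receiving SIGTERM, so the grace period acts as a ceiling instead of a fixed wait.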

Rolling Update Strategy Parameters

The other major lever is the rolling update strategy itself. Two fields control the pace:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1

With maxUnavailable: 0 and maxSurge: 1, Kubernetes schedules one new pod, waits for it to pass readiness checks, then terminates one old pod, then repeats. For a 200-pod deployment, that is 200 sequential cycles. Each cycle is gated by the startup and readiness time of the new pod and, wherever the replacement cannot be scheduled until the old pod has released its resources, by the old pod's termination grace period as well.

Changing maxUnavailable to something like 25% (which is also what a Deployment uses when the field is left unset) allows Kubernetes to terminate 50 pods simultaneously, then bring up 50 new ones, cutting the rollout from 200 sequential cycles to roughly 4 waves of 50. Combined with a reduced terminationGracePeriodSeconds, the time savings compound.
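To compare strategy settings before touching a manifest, a rough wave count is often enough. The sketch below is a deliberately simplified model with made-up helper names: it assumes each wave replaces maxUnavailable + maxSurge pods, whereas a real rollout proceeds continuously as pods become ready. The rounding follows the Kubernetes rule of rounding a maxSurge percentage up and a maxUnavailable percentage down.

```python
def resolve(value, replicas, round_up):
    """Turn an absolute count or a percentage string such as "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point surprises when rounding.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def estimate_waves(replicas, max_unavailable, max_surge):
    """Approximate number of replacement waves for a rolling update."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    batch = unavailable + surge
    if batch < 1:
        raise ValueError("maxUnavailable and maxSurge cannot both be 0")
    return -(-replicas // batch)  # ceiling division


# Compare the one-at-a-time strategy with a 25% maxUnavailable.
slow = estimate_waves(200, max_unavailable=0, max_surge=1)
fast = estimate_waves(200, max_unavailable="25%", max_surge=1)
```

Multiplying the wave count by the typical pod startup and readiness time gives a first-order estimate of rollout duration.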

The reason services so commonly use maxUnavailable: 0 is that operators want to avoid any reduction in capacity during a deployment. This is a valid concern for latency-sensitive services at the edges of a network. But for most backend services, briefly running at 75% capacity during a deployment is an acceptable trade-off for a deployment that finishes in minutes rather than an hour.

The preStop Hook and Endpoint Propagation

There is a well-documented race condition in Kubernetes that causes services to add a preStop sleep as a workaround. When a pod is terminated, two things happen in parallel: the pod receives SIGTERM, and the Endpoints controller removes the pod from the service’s endpoint list. If the pod shuts down before kube-proxy has had a chance to update its iptables or IPVS rules, new connections will briefly route to a pod that is no longer accepting traffic.

The conventional fix is:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]

This delays the SIGTERM signal by 5 seconds, giving kube-proxy time to propagate the endpoint removal. It works, but it adds 5 seconds to every pod termination. At scale, those 5 seconds add up just as much as the grace period does.
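The sleep also has to fit inside the grace period, because the kubelet starts the grace-period clock before the preStop hook runs. The helper below is a hypothetical sketch for checking that budget; the name and parameters are illustrative only.

```python
def shutdown_headroom_s(grace_period_s, pre_stop_sleep_s, app_shutdown_s):
    """Seconds left before SIGKILL after the preStop sleep and the app's own shutdown.

    A negative result means the pod is killed before it finishes shutting down.
    """
    return grace_period_s - pre_stop_sleep_s - app_shutdown_s


# Default grace period, 5 s sleep, 2 s shutdown: plenty of room.
comfortable = shutdown_headroom_s(grace_period_s=30, pre_stop_sleep_s=5, app_shutdown_s=2)

# Grace period trimmed to 5 s while the 5 s sleep stays: no time left for the app.
too_tight = shutdown_headroom_s(grace_period_s=5, pre_stop_sleep_s=5, app_shutdown_s=2)
```

This matters when tuning both values together: a grace period cut down to the length of the sleep leaves the application no time to handle SIGTERM at all.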

Kubernetes 1.29 added a native sleep action for lifecycle hooks (initially behind a feature gate), which removes the need for a shell binary in the image but not the delay itself. More recent approaches use EndpointSlice terminating conditions, introduced in Kubernetes 1.20, which let proxies and load balancers see that a pod is shutting down and stop sending it new connections without requiring a sleep hack. If your cluster and kube-proxy are new enough, you may be able to shorten the preStop sleep or remove it entirely.

PodDisruptionBudgets Interact With All of This

PodDisruptionBudgets (PDBs) add another layer. A PDB with minAvailable: 100% or maxUnavailable: 0 tells Kubernetes it may never voluntarily evict a pod. Rolling updates are not bound by PDBs, since the Deployment's own strategy governs them, but anything that goes through the eviction API is: node draining during maintenance or cluster upgrades will block on such a budget.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: my-service

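A PDB like this blocks voluntary evictions outright: a node drain keeps retrying and makes no progress until someone relaxes the budget or deletes the pods by hand. Loosening it to maxUnavailable: 1 unblocks the drain but serializes it, because Kubernetes can then evict only one pod at a time from this service and must wait for a replacement to become healthy before evicting the next. Each of those evictions still pays the pod's termination and startup time, so the PDB interacts multiplicatively with terminationGracePeriodSeconds. Teams that have tuned their deployment strategy but left their PDB at maxUnavailable: 0 will still see stalled or slow drains during cluster upgrades.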
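The budget arithmetic is simple enough to sketch. The function below is an illustrative model of how many evictions a maxUnavailable budget permits at a given moment, using absolute counts only; the name is hypothetical and it is not the controller's actual code.

```python
def allowed_disruptions(expected_pods, healthy_pods, max_unavailable):
    """Evictions a maxUnavailable budget would currently permit."""
    desired_healthy = expected_pods - max_unavailable
    return max(0, healthy_pods - desired_healthy)


# maxUnavailable: 0 never permits an eviction, even with every pod healthy.
blocked = allowed_disruptions(expected_pods=200, healthy_pods=200, max_unavailable=0)

# maxUnavailable: 1 permits one eviction, and none while a replacement is still starting.
one_at_a_time = allowed_disruptions(expected_pods=200, healthy_pods=200, max_unavailable=1)
waiting = allowed_disruptions(expected_pods=200, healthy_pods=199, max_unavailable=1)
```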

minReadySeconds as a Less Obvious Factor

minReadySeconds defaults to 0, meaning a pod is considered available as soon as its readiness probe passes. Some teams set this to a value like 30 or 60 to allow metrics and logs to stabilize before the deployment proceeds to the next pod. The intent is reasonable, but 30 seconds per pod across a large deployment can dwarf the actual startup time of the pod itself.

This is worth auditing separately from grace periods and update strategy. If your readiness probe is well-calibrated and your application genuinely is ready when the probe passes, minReadySeconds is often unnecessary.

Measuring Your Own Deployments

The easiest first measurement is the total duration of a rollout, which you can get by timing the status command until it returns:

time kubectl rollout status deployment/my-service --watch=true

For more granular data, Kubernetes emits events for each pod transition that include timestamps. Tools like kube-state-metrics expose kube_deployment_status_observed_generation and related metrics that let you track rollout duration in Prometheus over time.

The pattern to look for is a long flat line in pod termination events, where every pod takes exactly the same amount of time to terminate. That uniformity is a sign that pods are hitting the grace period ceiling rather than exiting on their own.
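Once you have per-pod termination durations, however you collect them, the check for that pattern is a few lines. This is a sketch under stated assumptions: the durations are a plain list of seconds, and the function name and the one-second tolerance are arbitrary choices for illustration.

```python
def fraction_at_ceiling(termination_durations_s, grace_period_s, tolerance_s=1.0):
    """Share of pods whose termination took (almost) the whole grace period."""
    if not termination_durations_s:
        return 0.0
    threshold = grace_period_s - tolerance_s
    at_ceiling = sum(1 for d in termination_durations_s if d >= threshold)
    return at_ceiling / len(termination_durations_s)


# Every pod takes about 30 s: the process is probably waiting for SIGKILL.
suspicious = fraction_at_ceiling([30.2, 30.1, 30.4, 29.9], grace_period_s=30)

# Mostly fast exits with one straggler: the grace period is acting as a ceiling only.
healthy = fraction_at_ceiling([1.2, 0.8, 30.1, 2.0], grace_period_s=30)
```

A value near 1.0 across many rollouts suggests the signal is not being handled; a value near 0 suggests the grace period is not what is slowing you down.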

Defaults Optimized for Safety, Not Speed

Kubernetes defaults, and the conventions that grow up around them, are conservative by design. A 30-second grace period is a reasonable default for a general-purpose orchestrator that does not know anything about your application’s shutdown behavior. maxUnavailable: 0 is not the built-in value (that is 25%), but it is a common override because it avoids any capacity reduction. A preStop sleep prevents connection errors during fast shutdowns. Each of these choices is defensible on its own.

The issue is that teams often accept these defaults across dozens or hundreds of services without revisiting them. Configuration written to be safe for a service that was never profiled accumulates over months and years, and the total cost only becomes visible when someone finally instruments rollout duration at the fleet level.

The lesson from Cloudflare’s experience is not that one specific YAML field is dangerous. It is that deployment configuration deserves the same measurement discipline as application performance. The numbers are there, and the fix, once found, is often one line.
