feat: add readinessFailureDelay to shutdown config by consideRatio · Pull Request #9363 · envoyproxy/gateway

consideRatio · 2026-06-28T14:32:41Z

What

Adds readinessFailureDelay to ShutdownConfig.

When configured, the shutdown manager starts Envoy listener drain immediately with /drain_listeners?graceful&skip_exit, then delays /healthcheck/fail until the configured duration has elapsed. The default remains 0s, preserving the existing behavior where /healthcheck/fail starts listener drain immediately.

This is useful for environments where failing readiness immediately can leave a node without ready local endpoints before upstream load balancers have stopped sending traffic to it.

In practice, I needed this on GKE with Cilium (which GKE calls Dataplane V2), with the Envoy Gateway Helm chart deployed as the Gateway API controller. The Envoy Gateway Service of type: LoadBalancer had externalTrafficPolicy: Local by default, and the GCP-provided load balancer for the Service takes many seconds to notice that no non-terminating pods are available on the node, so it kept sending traffic to the node. The node receiving the traffic then didn't forward it to pods on other nodes (because of externalTrafficPolicy: Local), and since the pod on the node wasn't just terminating but also non-ready, new connections were refused.

fixes #9210 (Add a delay to the shutdown-manager before failing healthchecks)
closes #9211 (feat: add drain delay to graceful shutdown process; alternative PR for #9210, pinging @mwain and @zirain for visibility as you worked on that)

Validation

I ran an e2e test in the GKE cluster where I had previously observed the issue, and confirmed that it was resolved using an image built from this PR branch.

go test ./internal/cmd/envoy ./internal/infrastructure/kubernetes/proxy ./api/v1alpha1/validation
git diff --check

netlify · 2026-06-28T14:32:46Z

✅ Deploy Preview for cerulean-figolla-1f9435 ready!

Name	Link
🔨 Latest commit	`77d11ae`
🔍 Latest deploy log	https://app.netlify.com/projects/cerulean-figolla-1f9435/deploys/6a41382087e0b9000888136c
😎 Deploy Preview	https://deploy-preview-9363--cerulean-figolla-1f9435.netlify.app

consideRatio · 2026-06-28T14:45:20Z

-1. Kubernetes sends SIGTERM to the pod
-2. Shutdown manager fails health checks via `/healthcheck/fail`
-   - This causes Kubernetes readiness probes to fail
-   - External load balancers and services stop routing new traffic to the pod


They do stop routing traffic, but it can take many seconds, for example with load balancers like the one GCP provides for a Service of type: LoadBalancer.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b759222e4

chatgpt-codex-connector · 2026-06-28T14:45:33Z

+			logger.Error(err, "error starting listener drain")
+		}
+		logger.Info(fmt.Sprintf("delaying readiness failure by %.0f seconds", readinessFailureDelay.Seconds()))
+		readinessFailurePending = true


Keep shutdown alive until the readiness delay elapses

When readinessFailureDelay is greater than minDrainDuration and the proxy has no active downstream connections, setting readinessFailurePending here is not enough because the loop can still take the later allowedToExit && conn <= exitAtConnections break before the pending /healthcheck/fail is sent. In that case the preStop hook writes the ready file and lets Envoy terminate after only the minimum drain period, so examples like a 40s readiness delay with the default/15s minimum drain never actually keep the pod ready for the configured delay in low-traffic pods.

The behavior is intentional. readinessFailureDelay is not meant to extend the pod termination/drain lifetime or block the shutdown manager from completing. It only controls when /healthcheck/fail is called while the drain sequence is still running.

Updated the docs to be clear about this.

Allow Envoy listener drain to start immediately while delaying `/healthcheck/fail` during pod termination. This helps deployments that need the terminating pod to remain a ready local endpoint while upstream load balancers stop sending traffic to the node. The default remains 0s, preserving the existing behavior where `/healthcheck/fail` starts listener drain immediately.

Signed-off-by: Erik Sundell <erik@sundellopensource.se>

consideRatio marked this pull request as ready for review June 28, 2026 14:42

consideRatio requested a review from a team as a code owner June 28, 2026 14:42

consideRatio force-pushed the readiness-failure-delay branch from d39dc07 to 30c704f Compare June 28, 2026 15:02

consideRatio force-pushed the readiness-failure-delay branch from 30c704f to 77d11ae Compare June 28, 2026 15:05

consideRatio mentioned this pull request Jun 29, 2026

feat: add drain delay to graceful shutdown process #9211

Open

consideRatio wants to merge 1 commit into envoyproxy:main from consideRatio:readiness-failure-delay
