feat: add readinessFailureDelay to shutdown config #9363

Open
consideRatio wants to merge 1 commit into envoyproxy:main from consideRatio:readiness-failure-delay

@consideRatio consideRatio commented Jun 28, 2026

What

Adds readinessFailureDelay to ShutdownConfig.

When configured, the shutdown manager starts Envoy listener drain immediately with /drain_listeners?graceful&skip_exit, then delays /healthcheck/fail until the configured duration has elapsed. The default remains 0s, preserving the existing behavior where /healthcheck/fail starts listener drain immediately.

This is useful for environments where failing readiness immediately can leave a node without ready local endpoints before upstream load balancers have stopped sending traffic to it.


In practice, I ran into this on GKE with Cilium (which GKE calls Dataplane V2), with the Envoy Gateway Helm chart deployed as the Gateway API controller. The Envoy Gateway Service of type: LoadBalancer had externalTrafficPolicy: Local by default, and the GCP-provided load balancer for the Service takes many seconds to notice that a node has no non-terminating pods left, so it kept sending traffic to the node. Because of externalTrafficPolicy: Local, the node did not forward that traffic to pods on other nodes, and since the local pod was not only terminating but also non-ready, new connections were refused.

Validation

I ran an e2e test in the GKE cluster where I had observed the issue before, and confirmed that it was resolved using an image built from this PR branch.

go test ./internal/cmd/envoy ./internal/infrastructure/kubernetes/proxy ./api/v1alpha1/validation
git diff --check

netlify Bot commented Jun 28, 2026

Deploy Preview for cerulean-figolla-1f9435 ready!

Name Link
🔨 Latest commit 77d11ae
🔍 Latest deploy log https://app.netlify.com/projects/cerulean-figolla-1f9435/deploys/6a41382087e0b9000888136c
😎 Deploy Preview https://deploy-preview-9363--cerulean-figolla-1f9435.netlify.app

@consideRatio consideRatio marked this pull request as ready for review June 28, 2026 14:42
@consideRatio consideRatio requested a review from a team as a code owner June 28, 2026 14:42
1. Kubernetes sends SIGTERM to the pod
2. Shutdown manager fails health checks via `/healthcheck/fail`
   - This causes Kubernetes readiness probes to fail
   - External load balancers and services stop routing new traffic to the pod

Author

They do stop routing traffic, but it can take many seconds, for example with the load balancer that GCP provides for a Service of type: LoadBalancer.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b759222e4

logger.Error(err, "error starting listener drain")
}
logger.Info(fmt.Sprintf("delaying readiness failure by %.0f seconds", readinessFailureDelay.Seconds()))
readinessFailurePending = true

P2: Keep shutdown alive until the readiness delay elapses

When readinessFailureDelay is greater than minDrainDuration and the proxy has no active downstream connections, setting readinessFailurePending here is not enough: the loop can still take the later allowedToExit && conn <= exitAtConnections break before the pending /healthcheck/fail is sent. In that case the preStop hook writes the ready file and lets Envoy terminate after only the minimum drain period. As a result, a configuration such as a 40s readiness delay with the default/15s minimum drain never actually keeps a low-traffic pod ready for the configured delay.

@consideRatio consideRatio Jun 28, 2026

Author

The behavior is intentional. readinessFailureDelay is not meant to extend the pod termination/drain lifetime or block the shutdown manager from completing. It only controls when /healthcheck/fail is called while the drain sequence is still running.

Updated the docs to be clear about this.

@consideRatio consideRatio force-pushed the readiness-failure-delay branch from d39dc07 to 30c704f Compare June 28, 2026 15:02
Allow Envoy listener drain to start immediately while delaying
`/healthcheck/fail` during pod termination. This helps deployments that
need the terminating pod to remain a ready local endpoint while
upstream load balancers stop sending traffic to the node.

The default remains 0s, preserving the existing behavior where
`/healthcheck/fail` starts listener drain immediately.

Signed-off-by: Erik Sundell <erik@sundellopensource.se>
Development

Successfully merging this pull request may close these issues.

Add a delay to the shutdown-manager before failing healthchecks
