feat: add readinessFailureDelay to shutdown config#9363
Conversation
✅ Deploy Preview for cerulean-figolla-1f9435 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
| 1. Kubernetes sends SIGTERM to the pod | ||
| 2. Shutdown manager fails health checks via `/healthcheck/fail` | ||
| - This causes Kubernetes readiness probes to fail | ||
| - External load balancers and services stop routing new traffic to the pod |
There was a problem hiding this comment.
They stop routing traffic, but it can take many seconds - for example for an load balancers like provided for GCP for a service of type: LoadBalancer.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b759222e4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| logger.Error(err, "error starting listener drain") | ||
| } | ||
| logger.Info(fmt.Sprintf("delaying readiness failure by %.0f seconds", readinessFailureDelay.Seconds())) | ||
| readinessFailurePending = true |
There was a problem hiding this comment.
Keep shutdown alive until the readiness delay elapses
When readinessFailureDelay is greater than minDrainDuration and the proxy has no active downstream connections, setting readinessFailurePending here is not enough because the loop can still take the later allowedToExit && conn <= exitAtConnections break before the pending /healthcheck/fail is sent. In that case the preStop hook writes the ready file and lets Envoy terminate after only the minimum drain period, so examples like a 40s readiness delay with the default/15s minimum drain never actually keep the pod ready for the configured delay in low-traffic pods.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
The behavior is intentional. readinessFailureDelay is not meant to extend the pod termination/drain lifetime or block the shutdown manager from completing. It only controls when /healthcheck/fail is called while the drain sequence is still running.
Updated the docs to be clear about this.
d39dc07 to
30c704f
Compare
Allow Envoy listener drain to start immediately while delaying `/healthcheck/fail` during pod termination. This helps deployments that need the terminating pod to remain a ready local endpoint while upstream load balancers stop sending traffic to the node. The default remains 0s, preserving the existing behavior where `/healthcheck/fail` starts listener drain immediately. Signed-off-by: Erik Sundell <erik@sundellopensource.se>
30c704f to
77d11ae
Compare
What
Adds
readinessFailureDelaytoShutdownConfig.When configured, the shutdown manager starts Envoy listener drain immediately with
/drain_listeners?graceful&skip_exit, then delays/healthcheck/failuntil the configured duration has elapsed. The default remains0s, preserving the existing behavior where/healthcheck/failstarts listener drain immediately.This is useful for environments where failing readiness immediately can leave a node without ready local endpoints before upstream load balancers have stopped sending traffic to it.
In practice, I experienced the need for this on GKE with Cilium / (they call it Dataplane v2), with a envoy gateway helm chart deployed as Gateway API controller. The envoy gateway Service of
type: LoadBalancerhadexternalTrafficPolicy: Localby default, and the GCP provided LoadBalancer for the Service resource takes many seconds to realize that no non-terminating pods on the node are available - so it kept sending traffic to the node. The node receiving the traffic then didn't pass it forward to other pods on other node (because ofexternalTrafficPolicy: Local), and since the pod on the node wasn't just terminating, but non-ready, new connection was refused.Validation
I did a e2e test in the GKE cluster where I observed issues before, and confirmed that the issue was resolved using an image built from this PR branch.
go test ./internal/cmd/envoy ./internal/infrastructure/kubernetes/proxy ./api/v1alpha1/validation git diff --check