Keep canary alive when primary promotion fails#1931
Conversation
eb69ba1 to
120c187
Compare
| c.alert(canary, fmt.Sprintf("Promotion failed, primary not ready: %v", err), | ||
| false, flaggerv1.SeverityError) | ||
|
|
||
| if err := canaryController.SetStatusPhase(canary, flaggerv1.CanaryPhaseFailed); err != nil { |
There was a problem hiding this comment.
SetStatusPhase internally sets the canary weight to 0 when the status is Failed. this creates a mismatch b/w what's happening and what's being reported
There was a problem hiding this comment.
Fixed. Replaced SetStatusPhase(Failed) with SyncStatus({Phase: Failed, CanaryWeight: 100}) so the reported weight matches the actual routing now that all traffic is on the canary.
| // during promotion the canary is the only healthy copy, halt | ||
| // instead of rolling back traffic to the unhealthy primary | ||
| if cd.Status.Phase == flaggerv1.CanaryPhasePromoting || | ||
| cd.Status.Phase == flaggerv1.CanaryPhaseFinalising { |
There was a problem hiding this comment.
this works but only partially. we move to the Finalizing phase inside runPromotionTrafficShift - the function which routes all the traffic from the canary to the primary. if the primary deployment starts failing after runPromotionTrafficShift runs, we'd call promotionFailed here. promotionFailed does not set any traffic weights leaving the primary to receive all of the traffic, when we want the canary to be receiving the traffic instead.
There was a problem hiding this comment.
Good catch, you're right. promotionFailed now routes all traffic back to the canary (SetRoutes with primary=0, canary=100), so it also covers the case where runPromotionTrafficShift already moved traffic to the primary before it started failing. Added TestScheduler_DeploymentPromotionFailedAfterTrafficShift, which drives to the Finalising phase with traffic already on the primary and asserts it gets routed back to the canary.
On a failed promotion the canary keeps serving, but the traffic may already have been shifted to the primary by runPromotionTrafficShift before it started failing. Route all traffic back to the canary and report the matching canary weight instead of zeroing it. Addresses review feedback on fluxcd#1931. Signed-off-by: Pedram Pourmohammad <eragon.pedy@gmail.com>
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1931 +/- ##
==========================================
+ Coverage 30.00% 30.08% +0.08%
==========================================
Files 288 288
Lines 18455 18482 +27
==========================================
+ Hits 5537 5561 +24
- Misses 12189 12190 +1
- Partials 729 731 +2 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
When the canary analysis succeeds, Flagger copies the canary pod spec to the primary and waits for the primary rollout to finish. If the primary fails to become ready, the non-retriable readiness error triggered the standard analysis rollback, which routes all traffic to the primary and scales the canary to zero. During promotion the primary already runs the new (failing) spec while the canary is the only healthy copy of the new revision still serving traffic. Rolling back therefore sends all traffic to the broken primary and deletes the working canary, taking the application down. Halt the promotion instead: when the primary is not ready and the canary is in the Promoting or Finalising phase, mark the rollout as failed and alert, but keep the canary running and leave routing untouched until the primary recovers or a corrected revision is applied. Fixes fluxcd#1898 Signed-off-by: Pedram Pourmohammad <eragon.pedy@gmail.com>
On a failed promotion the canary keeps serving, but the traffic may already have been shifted to the primary by runPromotionTrafficShift before it started failing. Route all traffic back to the canary and report the matching canary weight instead of zeroing it. Addresses review feedback on fluxcd#1931. Signed-off-by: Pedram Pourmohammad <eragon.pedy@gmail.com>
0db298a to
e461dd9
Compare
Problem
Reported in #1898. When a primary pod fails to initialize after a canary
promotion, Flagger takes the application down instead of preserving the
healthy canary.
Flow:
primary (
Promote), then moves to thePromoting/Finalisingphase andwaits for the primary rollout to finish.
slow/again-failing init, etc.).
IsPrimaryReadyeventually returns a non-retriable error (progress deadlineexceeded), which triggered the standard analysis
rollback().rollback()routes all traffic to the primary and scales the canary tozero.
The problem is that during promotion the primary already runs the new
(failing) spec, while the canary is the only healthy copy of the new revision
still serving traffic. "Rolling back to the primary" therefore sends all
traffic to the broken primary and deletes the only working pods — a full
outage (worst in
Recreatemode, where no old primary pod remains).rollback()is correct for an analysis failure duringProgressing(there theprimary still holds the old, good spec), but wrong once promotion has started.
Fix
When
IsPrimaryReadyreturns a non-retriable error and the canary is in thePromotingorFinalisingphase, halt the promotion instead of rolling back:Failedand emit a warning event + alert, so it stopsadvancing and surfaces the failure;
The canary keeps serving traffic until the primary recovers or a corrected
revision is applied. Behaviour during
Progressing(and every other phase) isunchanged.
This is the minimal, non-destructive safety fix. Follow-up #1932 tracks the
model-correct behaviour — note that a promotion only starts after the canary
passes analysis, so the canary running the new revision is healthy and only the
primary's separately-rendered copy failed; whether Flagger should revert the
primary to its last-known-good spec or keep serving the healthy canary is an
open question to settle there.
Tests
Added
TestScheduler_DeploymentPromotionPrimaryNotReady, which drives thecanary to
Promoting, makes the primary stuck (ProgressDeadlineExceeded),and asserts the canary is not scaled to zero and traffic is not shifted onto
the broken primary. The full
pkg/controllerandpkg/canarysuites pass(
go test ./pkg/controller/ ./pkg/canary/),gofmtandgo vetare clean.Fixes #1898