Keep canary alive when primary promotion fails #1931

Open
pedrampdd wants to merge 2 commits into fluxcd:main from pedrampdd:fix/1898-keep-canary-on-promotion-failure

@pedrampdd pedrampdd commented Jun 11, 2026

Problem

Reported in #1898. When a primary pod fails to initialize after a canary
promotion, Flagger takes the application down instead of preserving the
healthy canary.

Flow:

  1. The canary analysis succeeds and Flagger copies the canary pod spec to the
    primary (Promote), then moves to the Promoting/Finalising phase and
    waits for the primary rollout to finish.
  2. The promoted primary fails to become ready (bad image, failing sidecar,
    slow or repeatedly failing init, etc.).
  3. IsPrimaryReady eventually returns a non-retriable error (progress deadline
    exceeded), which previously triggered the standard analysis rollback().
  4. rollback() routes all traffic to the primary and scales the canary to
    zero.

The problem is that during promotion the primary already runs the new
(failing) spec, while the canary is the only healthy copy of the new revision
still serving traffic. "Rolling back to the primary" therefore sends all
traffic to the broken primary and deletes the only working pods — a full
outage (worst in Recreate mode, where no old primary pod remains).

rollback() is correct for an analysis failure during Progressing (there the
primary still holds the old, good spec), but wrong once promotion has started.
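The difference between the two phases can be sketched with a toy model. Everything here is hypothetical (the `workload` and `state` types, the replica counts, and the `rollback` helper are illustrations, not Flagger's code); it only replays the behaviour described above, where rollback sends all traffic to the primary and scales the canary to zero.

```go
package main

import "fmt"

// workload and state are hypothetical stand-ins, not Flagger types.
type workload struct {
	healthy  bool
	replicas int
}

type state struct {
	primary, canary             workload
	primaryWeight, canaryWeight int
}

// rollback mimics the behaviour described in the PR text:
// all traffic goes to the primary and the canary is scaled to zero.
func rollback(s state) state {
	s.primaryWeight, s.canaryWeight = 100, 0
	s.canary.replicas = 0
	return s
}

// servingHealthy reports whether any routed traffic reaches healthy, running pods.
func servingHealthy(s state) bool {
	primaryOK := s.primaryWeight > 0 && s.primary.healthy && s.primary.replicas > 0
	canaryOK := s.canaryWeight > 0 && s.canary.healthy && s.canary.replicas > 0
	return primaryOK || canaryOK
}

func main() {
	// Progressing: the primary still holds the old, good spec; the canary failed analysis.
	progressing := state{
		primary:       workload{healthy: true, replicas: 2},
		canary:        workload{healthy: false, replicas: 2},
		primaryWeight: 50, canaryWeight: 50,
	}
	// Promoting: the primary runs the new spec and failed; the canary is healthy.
	promoting := state{
		primary:       workload{healthy: false, replicas: 2},
		canary:        workload{healthy: true, replicas: 2},
		primaryWeight: 0, canaryWeight: 100,
	}

	fmt.Println("rollback during Progressing keeps serving:", servingHealthy(rollback(progressing)))
	fmt.Println("rollback during Promoting keeps serving:", servingHealthy(rollback(promoting)))
}
```

In this model the same rollback is safe in the first case and produces an outage in the second, which is the asymmetry the fix targets.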

Fix

When IsPrimaryReady returns a non-retriable error and the canary is in the
Promoting or Finalising phase, halt the promotion instead of rolling back:

  • mark the rollout as Failed and emit a warning event + alert, so it stops
    advancing and surfaces the failure;
  • do not route traffic to the unhealthy primary;
  • do not scale the canary to zero.

The canary keeps serving traffic until the primary recovers or a corrected
revision is applied. Behaviour during Progressing (and every other phase) is
unchanged.
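A minimal sketch of that decision, assuming hypothetical `phase` and `action` types and a hypothetical `onPrimaryNotReady` helper (the phase strings follow the names used in this description; the actual change lives in pkg/controller/scheduler.go and is not reproduced here). Treating a retriable error as "wait for the next tick" is also an assumption of this sketch.

```go
package main

import "fmt"

// phase and action are hypothetical illustration types, not Flagger's.
type phase string

const (
	phaseProgressing phase = "Progressing"
	phasePromoting   phase = "Promoting"
	phaseFinalising  phase = "Finalising"
)

type action string

const (
	actionWait     action = "wait"           // retriable error: try again later
	actionRollback action = "rollback"       // traffic to primary, canary scaled to zero
	actionHalt     action = "halt-promotion" // mark Failed, alert, keep the canary serving
)

// onPrimaryNotReady picks what to do when the primary readiness check fails.
func onPrimaryNotReady(p phase, retriable bool) action {
	if retriable {
		return actionWait
	}
	switch p {
	case phasePromoting, phaseFinalising:
		// The primary already runs the new spec; the canary is the healthy copy.
		return actionHalt
	default:
		// The primary still holds the old, good spec.
		return actionRollback
	}
}

func main() {
	for _, p := range []phase{phaseProgressing, phasePromoting, phaseFinalising} {
		fmt.Printf("%-12s -> %s\n", p, onPrimaryNotReady(p, false))
	}
}
```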

This is the minimal, non-destructive safety fix. Follow-up #1932 tracks the
model-correct behaviour. A promotion only starts after the canary passes
analysis, so the canary running the new revision is healthy and only the
primary's separately rendered copy failed. Whether Flagger should revert the
primary to its last-known-good spec or keep serving the healthy canary is an
open question to settle there.

Tests

Added TestScheduler_DeploymentPromotionPrimaryNotReady, which drives the
canary to Promoting, makes the primary stuck (ProgressDeadlineExceeded),
and asserts the canary is not scaled to zero and traffic is not shifted onto
the broken primary. The full pkg/controller and pkg/canary suites pass
(go test ./pkg/controller/ ./pkg/canary/), gofmt and go vet are clean.

Fixes #1898

@aryan9600 aryan9600 left a comment

Member

thanks for taking this up!

Comment thread pkg/controller/scheduler.go Outdated
c.alert(canary, fmt.Sprintf("Promotion failed, primary not ready: %v", err),
	false, flaggerv1.SeverityError)

if err := canaryController.SetStatusPhase(canary, flaggerv1.CanaryPhaseFailed); err != nil {

Member

SetStatusPhase internally sets the canary weight to 0 when the status is Failed. This creates a mismatch between what's happening and what's being reported.

Author

Fixed. Replaced SetStatusPhase(Failed) with SyncStatus({Phase: Failed, CanaryWeight: 100}) so the reported weight matches the actual routing now that all traffic is on the canary.
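The mismatch raised in this thread can be illustrated with a small model. The `status` type and both helpers below are hypothetical stand-ins that only mirror the behaviour described in the comments above (a phase-only setter that zeroes the weight on Failed, versus writing phase and weight together); they are not Flagger's implementations.

```go
package main

import "fmt"

// status is a hypothetical stand-in for the reported canary status.
type status struct {
	Phase        string
	CanaryWeight int
}

// setStatusPhase mirrors the reviewer's description: setting Failed
// also resets the reported canary weight to 0.
func setStatusPhase(s status, phase string) status {
	s.Phase = phase
	if phase == "Failed" {
		s.CanaryWeight = 0
	}
	return s
}

// syncStatus writes the phase and weight exactly as supplied by the caller.
func syncStatus(desired status) status {
	return desired
}

func main() {
	const routedCanaryWeight = 100 // what the router actually sends to the canary

	viaPhase := setStatusPhase(status{Phase: "Promoting", CanaryWeight: 100}, "Failed")
	viaSync := syncStatus(status{Phase: "Failed", CanaryWeight: routedCanaryWeight})

	fmt.Println("phase-only setter matches routing:", viaPhase.CanaryWeight == routedCanaryWeight)
	fmt.Println("explicit sync matches routing:", viaSync.CanaryWeight == routedCanaryWeight)
}
```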

// during promotion the canary is the only healthy copy, halt
// instead of rolling back traffic to the unhealthy primary
if cd.Status.Phase == flaggerv1.CanaryPhasePromoting ||
	cd.Status.Phase == flaggerv1.CanaryPhaseFinalising {

Member

This works, but only partially. We move to the Finalising phase inside runPromotionTrafficShift, the function that routes all the traffic from the canary to the primary. If the primary deployment starts failing after runPromotionTrafficShift runs, we'd call promotionFailed here. promotionFailed does not set any traffic weights, so the primary keeps receiving all of the traffic when we want the canary to receive it instead.

Author

Good catch, you're right. promotionFailed now routes all traffic back to the canary (SetRoutes with primary=0, canary=100), so it also covers the case where runPromotionTrafficShift already moved traffic to the primary before it started failing. Added TestScheduler_DeploymentPromotionFailedAfterTrafficShift, which drives to the Finalising phase with traffic already on the primary and asserts it gets routed back to the canary.
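A rough sketch of the case discussed in this thread, using a hypothetical `fakeRouter` with a simplified two-argument `SetRoutes` (the signature here is an assumption made for the example and is not Flagger's router interface). It starts with all traffic already on the primary, as it would be after the traffic shift, and shows the failure path sending it back to the canary.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeRouter is a hypothetical in-memory router used only for this sketch.
type fakeRouter struct {
	primaryWeight, canaryWeight int
}

// SetRoutes stores the weights after checking they form a valid split.
func (r *fakeRouter) SetRoutes(primaryWeight, canaryWeight int) error {
	if primaryWeight < 0 || canaryWeight < 0 || primaryWeight+canaryWeight != 100 {
		return errors.New("weights must be non-negative and sum to 100")
	}
	r.primaryWeight, r.canaryWeight = primaryWeight, canaryWeight
	return nil
}

// promotionFailed routes everything back to the canary, whatever the
// current split is, so a failing primary stops receiving traffic.
func promotionFailed(r *fakeRouter) error {
	if err := r.SetRoutes(0, 100); err != nil {
		return fmt.Errorf("routing traffic back to canary: %w", err)
	}
	return nil
}

func main() {
	// Finalising: traffic was already shifted to the primary before it failed.
	r := &fakeRouter{primaryWeight: 100, canaryWeight: 0}
	if err := promotionFailed(r); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("primary=%d canary=%d\n", r.primaryWeight, r.canaryWeight)
}
```

Setting the weights unconditionally keeps the helper idempotent: it gives the same result whether the failure happens before or after the traffic shift.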

pedrampdd added a commit to pedrampdd/flagger that referenced this pull request Jun 26, 2026
On a failed promotion the canary keeps serving, but the traffic may
already have been shifted to the primary by runPromotionTrafficShift
before it started failing. Route all traffic back to the canary and
report the matching canary weight instead of zeroing it.

Addresses review feedback on fluxcd#1931.

Signed-off-by: Pedram Pourmohammad <eragon.pedy@gmail.com>
@pedrampdd pedrampdd requested a review from aryan9600 June 26, 2026 11:59
@codecov-commenter

Codecov Report

❌ Patch coverage is 71.42857% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 30.08%. Comparing base (61582f7) to head (060bb77).
⚠️ Report is 10 commits behind head on main.

Files with missing lines      | Patch % | Lines
pkg/controller/scheduler.go   | 71.42%  | 5 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1931      +/-   ##
==========================================
+ Coverage   30.00%   30.08%   +0.08%     
==========================================
  Files         288      288              
  Lines       18455    18482      +27     
==========================================
+ Hits         5537     5561      +24     
- Misses      12189    12190       +1     
- Partials      729      731       +2     

@pedrampdd pedrampdd requested a review from aryan9600 June 27, 2026 12:39
When the canary analysis succeeds, Flagger copies the canary pod spec
to the primary and waits for the primary rollout to finish. If the
primary fails to become ready, the non-retriable readiness error
triggered the standard analysis rollback, which routes all traffic to
the primary and scales the canary to zero.

During promotion the primary already runs the new (failing) spec while
the canary is the only healthy copy of the new revision still serving
traffic. Rolling back therefore sends all traffic to the broken primary
and deletes the working canary, taking the application down.

Halt the promotion instead: when the primary is not ready and the canary
is in the Promoting or Finalising phase, mark the rollout as failed and
alert, but keep the canary running and leave routing untouched until the
primary recovers or a corrected revision is applied.

Fixes fluxcd#1898

Signed-off-by: Pedram Pourmohammad <eragon.pedy@gmail.com>
@pedrampdd pedrampdd force-pushed the fix/1898-keep-canary-on-promotion-failure branch from 0db298a to e461dd9 Compare June 27, 2026 16:26

Development

Successfully merging this pull request may close these issues.

If a primary pod fails to initialize, flagger doesn't always do the right thing

3 participants