Keep canary alive when primary promotion fails by pedrampdd · Pull Request #1931 · fluxcd/flagger

pedrampdd · 2026-06-11T20:46:01Z

Problem

Reported in #1898. When a primary pod fails to initialize after a canary
promotion, Flagger takes the application down instead of preserving the
healthy canary.

Flow:

1. The canary analysis succeeds and Flagger copies the canary pod spec to the
   primary (Promote), then moves to the Promoting/Finalising phase and
   waits for the primary rollout to finish.
2. The promoted primary fails to become ready (bad image, failing sidecar,
   slow or repeatedly failing init, etc.).
3. IsPrimaryReady eventually returns a non-retriable error (progress deadline
   exceeded), which triggers the standard analysis rollback().
4. rollback() routes all traffic to the primary and scales the canary to
   zero.

The problem is that during promotion the primary already runs the new
(failing) spec, while the canary is the only healthy copy of the new revision
still serving traffic. "Rolling back to the primary" therefore sends all
traffic to the broken primary and deletes the only working pods — a full
outage (worst in Recreate mode, where no old primary pod remains).

rollback() is correct for an analysis failure during Progressing (there the
primary still holds the old, good spec), but wrong once promotion has started.

Fix

When IsPrimaryReady returns a non-retriable error and the canary is in the
Promoting or Finalising phase, halt the promotion instead of rolling back:

- mark the rollout as Failed and emit a warning event + alert, so it stops
  advancing and surfaces the failure;
- do not route traffic to the unhealthy primary;
- do not scale the canary to zero.

The canary keeps serving traffic until the primary recovers or a corrected
revision is applied. Behaviour during Progressing (and every other phase) is
unchanged.

This is the minimal, non-destructive safety fix. Follow-up #1932 tracks the
model-correct behaviour — note that a promotion only starts after the canary
passes analysis, so the canary running the new revision is healthy and only the
primary's separately-rendered copy failed; whether Flagger should revert the
primary to its last-known-good spec or keep serving the healthy canary is an
open question to settle there.

Tests

Added TestScheduler_DeploymentPromotionPrimaryNotReady, which drives the
canary to Promoting, makes the primary stuck (ProgressDeadlineExceeded),
and asserts the canary is not scaled to zero and traffic is not shifted onto
the broken primary. The full pkg/controller and pkg/canary suites pass
(go test ./pkg/controller/ ./pkg/canary/), gofmt and go vet are clean.

Fixes #1898

aryan9600

thanks for taking this up!

aryan9600 · 2026-06-26T10:58:09Z

+	c.alert(canary, fmt.Sprintf("Promotion failed, primary not ready: %v", err),
+		false, flaggerv1.SeverityError)
+
+	if err := canaryController.SetStatusPhase(canary, flaggerv1.CanaryPhaseFailed); err != nil {


SetStatusPhase internally sets the canary weight to 0 when the status is Failed. this creates a mismatch b/w what's happening and what's being reported

Fixed. Replaced SetStatusPhase(Failed) with SyncStatus({Phase: Failed, CanaryWeight: 100}) so the reported weight matches the actual routing now that all traffic is on the canary.
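
To illustrate the mismatch raised in this thread, here is a minimal sketch with local stand-in types. `Status`, `setPhaseOnly`, `setPhaseAndWeight`, and `matchesRouting` are hypothetical names; `setPhaseOnly` only imitates the behaviour the reviewer describes and is not Flagger's `SetStatusPhase`.

```go
package main

import "fmt"

// Status is a local stand-in for the canary status fields discussed here.
type Status struct {
	Phase        string
	CanaryWeight int
}

// setPhaseOnly imitates the behaviour the reviewer describes: switching the
// phase to Failed also resets the reported canary weight to 0.
func setPhaseOnly(s Status, phase string) Status {
	s.Phase = phase
	if phase == "Failed" {
		s.CanaryWeight = 0
	}
	return s
}

// setPhaseAndWeight writes both fields explicitly, as the reply describes.
func setPhaseAndWeight(s Status, phase string, canaryWeight int) Status {
	s.Phase = phase
	s.CanaryWeight = canaryWeight
	return s
}

// matchesRouting reports whether the status agrees with the live canary weight.
func matchesRouting(s Status, routedCanaryWeight int) bool {
	return s.CanaryWeight == routedCanaryWeight
}

func main() {
	routed := 100 // assumption: all traffic is on the canary after the halt
	before := Status{Phase: "Promoting", CanaryWeight: routed}

	phaseOnly := setPhaseOnly(before, "Failed")
	explicit := setPhaseAndWeight(before, "Failed", routed)

	fmt.Println("phase only:", phaseOnly.CanaryWeight, matchesRouting(phaseOnly, routed))
	fmt.Println("explicit:  ", explicit.CanaryWeight, matchesRouting(explicit, routed))
}
```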

aryan9600 · 2026-06-26T11:01:40Z

+				// during promotion the canary is the only healthy copy, halt
+				// instead of rolling back traffic to the unhealthy primary
+				if cd.Status.Phase == flaggerv1.CanaryPhasePromoting ||
+					cd.Status.Phase == flaggerv1.CanaryPhaseFinalising {


this works but only partially. we move to the Finalising phase inside runPromotionTrafficShift, the function that routes all the traffic from the canary to the primary. if the primary deployment starts failing after runPromotionTrafficShift runs, we'd call promotionFailed here. promotionFailed does not set any traffic weights, which leaves the primary receiving all of the traffic when we want the canary to receive it instead.

Good catch, you're right. promotionFailed now routes all traffic back to the canary (SetRoutes with primary=0, canary=100), so it also covers the case where runPromotionTrafficShift already moved traffic to the primary before it started failing. Added TestScheduler_DeploymentPromotionFailedAfterTrafficShift, which drives to the Finalising phase with traffic already on the primary and asserts it gets routed back to the canary.
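
The Finalising case from this thread can be sketched against a fake router. The `Router` interface below is a simplified, hypothetical one (Flagger's real routers take more arguments), and `fakeRouter` and `haltPromotion` are illustrative names rather than the code added in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Router is a simplified, hypothetical routing interface for this sketch.
type Router interface {
	SetRoutes(primaryWeight, canaryWeight int) error
}

// fakeRouter records the last weights it was asked to apply.
type fakeRouter struct {
	primary, canary int
}

func (r *fakeRouter) SetRoutes(primaryWeight, canaryWeight int) error {
	if primaryWeight < 0 || canaryWeight < 0 || primaryWeight+canaryWeight != 100 {
		return errors.New("weights must be non-negative and sum to 100")
	}
	r.primary, r.canary = primaryWeight, canaryWeight
	return nil
}

// haltPromotion sends all traffic to the canary regardless of the current
// split, so it also covers a promotion that had already shifted traffic.
func haltPromotion(r Router) error {
	if err := r.SetRoutes(0, 100); err != nil {
		return fmt.Errorf("routing traffic back to canary: %w", err)
	}
	return nil
}

func main() {
	// Finalising case: the traffic shift already moved everything to the primary.
	r := &fakeRouter{primary: 100, canary: 0}
	if err := haltPromotion(r); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("primary=%d canary=%d\n", r.primary, r.canary)
}
```

Setting the weights unconditionally, rather than only when traffic is known to be on the primary, keeps the halt idempotent: calling it while the canary already holds all traffic changes nothing.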

On a failed promotion the canary keeps serving, but the traffic may already have been shifted to the primary by runPromotionTrafficShift before it started failing. Route all traffic back to the canary and report the matching canary weight instead of zeroing it.

Addresses review feedback on fluxcd#1931.

Signed-off-by: Pedram Pourmohammad <eragon.pedy@gmail.com>

codecov-commenter · 2026-06-27T12:15:37Z

Codecov Report

❌ Patch coverage is 71.42857% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 30.08%. Comparing base (61582f7) to head (060bb77).
⚠️ Report is 10 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/controller/scheduler.go	71.42%	5 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1931      +/-   ##
==========================================
+ Coverage   30.00%   30.08%   +0.08%     
==========================================
  Files         288      288              
  Lines       18455    18482      +27     
==========================================
+ Hits         5537     5561      +24     
- Misses      12189    12190       +1     
- Partials      729      731       +2

When the canary analysis succeeds, Flagger copies the canary pod spec to the primary and waits for the primary rollout to finish. If the primary fails to become ready, the non-retriable readiness error triggers the standard analysis rollback, which routes all traffic to the primary and scales the canary to zero.

During promotion the primary already runs the new (failing) spec while the canary is the only healthy copy of the new revision still serving traffic. Rolling back therefore sends all traffic to the broken primary and deletes the working canary, taking the application down.

Halt the promotion instead: when the primary is not ready and the canary is in the Promoting or Finalising phase, mark the rollout as failed and alert, but keep the canary running and leave routing untouched until the primary recovers or a corrected revision is applied.

Fixes fluxcd#1898

Signed-off-by: Pedram Pourmohammad <eragon.pedy@gmail.com>

pedrampdd requested review from aryan9600 and stefanprodan as code owners June 11, 2026 20:46

pedrampdd mentioned this pull request Jun 11, 2026

Decide correct behaviour after a failed promotion (primary unhealthy, canary healthy) #1932

Open

pedrampdd force-pushed the fix/1898-keep-canary-on-promotion-failure branch from eb69ba1 to 120c187 Compare June 11, 2026 21:00

aryan9600 reviewed Jun 26, 2026

pedrampdd requested a review from aryan9600 June 26, 2026 11:59

aryan9600 reviewed Jun 27, 2026

pedrampdd requested a review from aryan9600 June 27, 2026 12:39

pedrampdd added 2 commits June 27, 2026 19:56

pedrampdd force-pushed the fix/1898-keep-canary-on-promotion-failure branch from 0db298a to e461dd9 Compare June 27, 2026 16:26

pedrampdd wants to merge 2 commits into fluxcd:main from pedrampdd:fix/1898-keep-canary-on-promotion-failure
