Skip to content

Recover OAuth workloads from transient refresh failures#5350

Merged
jhrozek merged 2 commits into
stacklok:mainfrom
gkatz2:fix/auth-retrying-cross-tick
Jun 17, 2026
Merged

Recover OAuth workloads from transient refresh failures#5350
jhrozek merged 2 commits into
stacklok:mainfrom
gkatz2:fix/auth-retrying-cross-tick

Conversation

@gkatz2

@gkatz2 gkatz2 commented May 20, 2026

Copy link
Copy Markdown
Contributor

Summary

  • When an OAuth token refresh fails as a transient error for longer
    than the in-loop short retry (~1–2 minutes at defaults), the
    monitor today marks the workload unauthenticated and exits its
    goroutine. The workload stays permanently dead even after the
    underlying condition clears on its own. The canonical trigger is a
    brief client-side network-context change—a VPN disconnect, a
    laptop sleeping on one network and resuming on another, etc.—that
    causes token-refresh requests to reach the OAuth server from a
    different network origin. See OAuth monitor gives up on transient failures, leaving workloads dead #5349 for the full failure mode.
  • Add a WorkloadStatusAuthRetrying workload status. After the short
    retry exhausts on a still-transient error, the monitor stays alive
    and re-attempts refresh on a longer cadence (default 10 min,
    configurable via TOOLHIVE_TOKEN_AUTH_RETRYING_TICK_INTERVAL) until
    either success (→ Running) or a configurable ceiling (default 24
    h, configurable via TOOLHIVE_TOKEN_AUTH_RETRYING_MAX_ELAPSED)
    (→ Unauthenticated).
  • Hot callers (request-path Token() calls, e.g. from the token
    injection middleware) fast-fail with the cached error during
    AuthRetrying, returning 503+Retry-After immediately rather than
    blocking on another short-retry duration against the broken
    endpoint.
  • New pkg/oauthproto/oauthtest/server.go provides a scriptable
    OAuth server fixture used by two end-to-end integration tests that
    drive the state machine against the real golang.org/x/oauth2
    library + real HTTP responses.
  • The vmcp backend health mapping treats AuthRetrying as
    BackendDegraded so tool discovery still aggregates the workload
    while invocation surfaces the underlying 503.

Fixes #5349

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)—11 new tests across the PR:
    7 covering the AuthRetrying state machine (entry on short-retry
    exhaustion, recovery to Running, ceiling transition to
    Unauthenticated, hot-caller fast-fail, permanent-error mid-
    AuthRetrying, DCR Warn silence on ceiling, post-stop gate);
    2 integration tests that drive the state machine against a
    real httptest.NewServer through the real golang.org/x/oauth2
    library; 2 covering the new switch arms in
    mapWorkloadStatusToVMCPHealth and workloadStatusIndicator.
  • Linting (task lint-fix)
  • Manual testing—built a local override binary, replaced the
    deployed ~/.local/bin/thv, restarted two real OAuth-backed
    remote workloads (Sourcegraph and Sourcegraph Deep Search) on
    the new binary. Verified MCP initialize round-trip works
    end-to-end with token injection through the proxy. Verified
    thv list (default, no flag) and thv list --all both render
    the new auth_retrying status via a manual status-file
    simulation.

Note on confidence. The bug's natural trigger (a real WAF block
during a real token refresh) recurs every 1–3 days in the affected
environment and needs sudo-level network controls (pfctl /
iptables) to drive on demand—neither fits CI or routine
verification. The two TestIntegration_* cases in
pkg/auth/monitored_token_source_test.go are where regressions in
the long-tail behavior would surface: they drive the full state
machine through the real golang.org/x/oauth2 library against a
scriptable OAuth server fixture (pkg/oauthproto/oauthtest/)
returning the actual <!DOCTYPE html> 4xx-without-RFC-6749-error-
code response shape observed in production. Two scenarios:
Running → AuthRetrying → Running (recovery on next tick) and
Running → AuthRetrying → Unauthenticated (ceiling exceeded). The
mock-based unit tests in the same file pin individual edge cases at
higher speed; together with the integration tests they cover both
the state machine logic and the real-library response-parsing path.

API Compatibility

  • This PR does not break the v1beta1 API.

The v1beta1 operator API surface is not touched by this PR. The
new auth_retrying workload status is added to the runtime enum and
to the OpenAPI swagger for the workloads HTTP API, both of which are
additive—existing clients that don't know about the new value will
see it as an unrecognised string and route through their default
handlers.

Does this introduce a user-facing change?

Yes:

  • New auth_retrying workload status, surfaced in default
    thv list alongside running (the workload is still self-recovering
    without user intervention – every other non-Running status is terminal
    until external intervention, and stays hidden behind --all).
    Also in the OpenAPI swagger and the architecture state diagram.
  • Operators can tune the AuthRetrying cadence (default 10m) and
    ceiling (default 24h) via TOOLHIVE_TOKEN_AUTH_RETRYING_TICK_INTERVAL
    and TOOLHIVE_TOKEN_AUTH_RETRYING_MAX_ELAPSED.
  • In environments where the bug surfaces (described in OAuth monitor gives up on transient failures, leaving workloads dead #5349),
    operators no longer need to manually re-auth workloads after a
    transient network-context change resolves on its own.

Large PR Justification

The 1825-line additions break down as:

  • Tests: ~1213 lines across 5 *_test.go files. The bulk is in
    pkg/auth/monitored_token_source_test.go (1029 new lines), covering
    the new state machine, the concurrent stress test of the gate-under-mu
    invariant (1000 iterations), 2 integration tests that drive the full
    flow against a real golang.org/x/oauth2 library, and regression
    tests for the bug fixes from review iteration.
  • Test infrastructure: 136 lines in
    pkg/oauthproto/oauthtest/server.go, a new scriptable OAuth server
    fixture used by the integration tests.
  • Generated swagger: 12 lines, regenerated from the new
    auth_retrying enum value in pkg/api/v1/workload_types.go.
  • Architecture docs: 15 lines documenting the new state.
  • Production code: ~450 lines. The change is necessarily atomic –
    state transitions, ceiling logic, hot-caller fast-fail, and the
    MonitoredTokenSource plumbing reference each other and would not
    function if split.

Generated with Claude Code.

@github-actions github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label May 20, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@codecov

codecov Bot commented May 20, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.43216% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.00%. Comparing base (141f9a4) to head (cd67aeb).

Files with missing lines Patch % Lines
pkg/auth/monitored_token_source.go 86.20% 17 Missing and 3 partials ⚠️
pkg/oauthproto/oauthtest/server.go 84.44% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5350      +/-   ##
==========================================
+ Coverage   69.87%   70.00%   +0.12%     
==========================================
  Files         648      649       +1     
  Lines       65873    66054     +181     
==========================================
+ Hits        46029    46240     +211     
+ Misses      16500    16470      -30     
  Partials     3344     3344              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions

Copy link
Copy Markdown
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions Bot dismissed their stale review May 20, 2026 21:55

Large PR justification has been provided. Thank you!

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 20, 2026
@gkatz2 gkatz2 force-pushed the fix/auth-retrying-cross-tick branch from 178577a to 978fc6f Compare June 5, 2026 15:07
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jun 5, 2026
@gkatz2

gkatz2 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Gentle nudge — I'm still seeing this issue across multiple OAuth-backed MCP servers in our environment. Happy to address review feedback whenever a maintainer can take a look. Thanks!

@jhrozek jhrozek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test comment

Comment thread docs/arch/02-core-concepts.md

@jhrozek jhrozek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test

Comment thread pkg/auth/monitored_token_source.go

@jhrozek jhrozek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test

Comment thread pkg/auth/monitored_token_source.go

@jhrozek jhrozek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test

Comment thread pkg/auth/monitored_token_source.go

@jhrozek jhrozek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test

Comment thread pkg/auth/monitored_token_source.go Outdated

@jhrozek jhrozek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test

Comment thread pkg/workloads/manager.go

@jhrozek jhrozek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test

Comment thread pkg/auth/monitored_token_source.go

@jhrozek jhrozek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the state-machine design is solid and the test suite is thorough — especially the concurrent stress test and the real-HTTP integration tests. A few things worth discussing below.

Comment thread pkg/auth/monitored_token_source.go
Comment thread pkg/auth/monitored_token_source.go
Comment thread pkg/auth/monitored_token_source.go
Comment thread pkg/auth/monitored_token_source.go Outdated
Comment thread pkg/workloads/manager.go
Comment thread docs/arch/02-core-concepts.md
@jhrozek

jhrozek commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

sorry about the stray test messages, I was driving the review from a CC session and it couldn't figure out how to post an inline comment :-/

@gkatz2 gkatz2 force-pushed the fix/auth-retrying-cross-tick branch from 978fc6f to 09ca821 Compare June 15, 2026 23:49
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jun 15, 2026
When an OAuth refresh fails as a transient error for longer than the
in-loop short retry (~1-2 minutes at defaults), the monitor used to
mark the workload unauthenticated and exit. The workload stayed dead
even after the underlying condition cleared on its own (e.g. a brief
VPN disconnect routing requests through a different network path).
Now the monitor keeps retrying on a configurable longer cadence
(default 10 minutes) until either success or a configurable ceiling
(default 24 hours), at which point the workload is finally marked
unauthenticated.

Fixes stacklok#5349

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
@gkatz2 gkatz2 force-pushed the fix/auth-retrying-cross-tick branch from 09ca821 to 4328bc8 Compare June 16, 2026 00:02
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jun 16, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jun 16, 2026
@jhrozek jhrozek merged commit 2add525 into stacklok:main Jun 17, 2026
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OAuth monitor gives up on transient failures, leaving workloads dead

2 participants