Skip to content

feat: auto failover for API with LK Cloud#936

Open
davidzhao wants to merge 4 commits into
mainfrom
dz/api-failover
Open

feat: auto failover for API with LK Cloud#936
davidzhao wants to merge 4 commits into
mainfrom
dz/api-failover

Conversation

@davidzhao

Copy link
Copy Markdown
Member

automatically fail over 5xx and transport errors

automatically fail over 5xx and transport errors
Comment thread .github/workflows/test-api.yml
Comment thread .github/workflows/test-api.yml Fixed
Comment thread failover.go Fixed
Comment thread failover.go Outdated
Comment thread failover.go Outdated
Comment thread failover.go Outdated

// discover regions lazily
if !fetchedRegions {
settings = t.regions.get(originalScheme, originalHost, req.Header)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the discovery get/fetch respect the caller's context? Right now RoundTrip calls t.regions.get(originalScheme, originalHost, req.Header) without req.Context(), and fetch uses http.NewRequest on the 5s-timeout client — so it ignores the caller's deadline/cancel.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it shouldn't.. this is an internal request to discover regions, we'll limit it to a much shorter timeout. so for example, if someone was doing a CreateSIPParticipant with a 30s timeout, we do not want region refresh to take the same 30s.

@milos-lk milos-lk Jun 28, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree the discovery fetch shouldn't ride the caller's 30s — capping it short is right. One clarification though: threading the context doesn't undo that cap. http.NewRequestWithContext(ctx, …) on the Timeout: 2s client is bounded by min(2s, remaining deadline), not the max - the client derives its deadline from the request ctx, so whichever fires first wins - right? Your CreateSIPParticipant/30s case still gets capped at 2s; the 30s never applies.
The reason to still pass ctx is the two cases the fixed 2s alone doesn't cover:

  • Cancellation - if the caller cancels (client disconnected, op aborted, process draining), the in-flight discovery keeps burning up to 2s on an abandoned request instead of stopping right away.
  • Sub-2s deadlines (maybe that's not too frequent in practice - but theoretical possibility) — a caller with a 500ms timeout: the primary fails at 500ms, then discovery still runs up to 2s, so the call returns ~2.5s and the result gets discarded anyway.

@milos-lk

Copy link
Copy Markdown
Contributor

just for my education: are all APIs idempotent and safe to be retried on 5xx?

@milos-lk

milos-lk commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

related to my comment above - just thinking out loud here
putting failover in the http.RoundTripper is neat and low-touch, but at that layer the failover decision is being made on opaque bytes + an HTTP status. It can read the method name (twirp.MethodName(ctx) rides on the request ctx), but not the typed request or the typed twirp error — so based on what I can see it can't really tell a safe-to-replay read from a mutation, and it ends up retrying every 5xx for every call. Should CreateRoom/StartRoomCompositeEgress/etc. silently replay against another region on a 5xx or dropped connection? That's an at-least-once semantic for operations that were effectively at-most-once before, and the transport can't distinguish them — which makes me think the retry/idempotency policy wants to live somewhere it can see the method and error.
Not sure how all of that reflects observability (if that's a concern now or in future) - would metrics be generic HTTP retries rather than “Egress.StartRoomCompositeEgress failed over from region A to region B?

Comment thread failover_apitest_test.go
client := lksdk.NewRoomServiceClient(testServerURL(t), "devkey", "secret")
ctx := failoverCtx(t, lksdk.FailoverOn, map[string]string{hdrFailRegions: "0"})
_, err := client.CreateRoom(ctx, &livekit.CreateRoomRequest{Name: "api-test"})
require.NoError(t, err, "should fail over to a healthy region")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this test the exact region that it connects to? Guess not as CreateRoomResponse does not have a region indication.

Comment thread failover.go Outdated
// original host plus (MaxAttempts-1) fallback regions. Defaults to 3.
MaxAttempts int
// BackoffBase is the initial delay before the first retry; each subsequent
// retry doubles it. Defaults to 200ms. Set to 0 to retry without delay.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not allow this no delay if this is at API level. Have horror stories at a previous job where a no back off retry kept piling up and eventually killing the service.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess users can still bypass and do their own back off, but at least not provide that option here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great point.. I've removed these options, since this is internal, I've kept it hard coded. we can always introduce new options later

Comment thread failover.go Outdated
@davidzhao

Copy link
Copy Markdown
Member Author

related to my comment above - just thinking out loud here putting failover in the http.RoundTripper is neat and low-touch, but at that layer the failover decision is being made on opaque bytes + an HTTP status. It can read the method name (twirp.MethodName(ctx) rides on the request ctx), but not the typed request or the typed twirp error — so based on what I can see it can't really tell a safe-to-replay read from a mutation, and it ends up retrying every 5xx for every call. Should CreateRoom/StartRoomCompositeEgress/etc. silently replay against another region on a 5xx or dropped connection? That's an at-least-once semantic for operations that were effectively at-most-once before, and the transport can't distinguish them — which makes me think the retry/idempotency policy wants to live somewhere it can see the method and error. Not sure how all of that reflects observability (if that's a concern now or in future) - would metrics be generic HTTP retries rather than “Egress.StartRoomCompositeEgress failed over from region A to region B?

In this case, all requests should be retried. our server should be as idempotent as possible. that would be the right design patterns here. otherwise, if a request returns 500, how should we react to it? the contract says the API errored out and the most natural thing for any end user to handle is to retry it.

for observability, we should be able to see the HTTP path to determine the method/etc.

Comment thread regionurlprovider.go Dismissed
@davidzhao

Copy link
Copy Markdown
Member Author

@milos-lk updated the implementation to address feedback. now we have a consolidated region cache.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants