feat: auto failover for API with LK Cloud by davidzhao · Pull Request #936 · livekit/server-sdk-go

davidzhao · 2026-06-27T21:15:06Z

automatically fail over 5xx and transport errors

milos-lk · 2026-06-27T22:55:27Z

+
+		// discover regions lazily
+		if !fetchedRegions {
+			settings = t.regions.get(originalScheme, originalHost, req.Header)


Should the discovery get/fetch respect the caller's context? Right now RoundTrip calls t.regions.get(originalScheme, originalHost, req.Header) without req.Context(), and fetch uses http.NewRequest on the 5s-timeout client — so it ignores the caller's deadline/cancel.

it shouldn't.. this is an internal request to discover regions, we'll limit it to a much shorter timeout. so for example, if someone was doing a CreateSIPParticipant with a 30s timeout, we do not want region refresh to take the same 30s.

Totally agree the discovery fetch shouldn't ride the caller's 30s — capping it short is right. One clarification though: threading the context doesn't undo that cap. http.NewRequestWithContext(ctx, …) on the Timeout: 2s client is bounded by min(2s, remaining deadline), not the max - the client derives its deadline from the request ctx, so whichever fires first wins - right? Your CreateSIPParticipant/30s case still gets capped at 2s; the 30s never applies.
The reason to still pass ctx is the two cases the fixed 2s alone doesn't cover:

Cancellation - if the caller cancels (client disconnected, op aborted, process draining), the in-flight discovery keeps burning up to 2s on an abandoned request instead of stopping right away.

Sub-2s deadlines (maybe that's not too frequent in practice - but theoretical possibility) — a caller with a 500ms timeout: the primary fails at 500ms, then discovery still runs up to 2s, so the call returns ~2.5s and the result gets discarded anyway.

milos-lk · 2026-06-27T23:03:28Z

just for my education: are all APIs idempotent and safe to be retried on 5xx?

milos-lk · 2026-06-27T23:20:04Z

related to my comment above - just thinking out loud here
putting failover in the http.RoundTripper is neat and low-touch, but at that layer the failover decision is being made on opaque bytes + an HTTP status. It can read the method name (twirp.MethodName(ctx) rides on the request ctx), but not the typed request or the typed twirp error — so based on what I can see it can't really tell a safe-to-replay read from a mutation, and it ends up retrying every 5xx for every call. Should CreateRoom/StartRoomCompositeEgress/etc. silently replay against another region on a 5xx or dropped connection? That's an at-least-once semantic for operations that were effectively at-most-once before, and the transport can't distinguish them — which makes me think the retry/idempotency policy wants to live somewhere it can see the method and error.
Not sure how all of that reflects observability (if that's a concern now or in future) - would metrics be generic HTTP retries rather than “Egress.StartRoomCompositeEgress failed over from region A to region B?

boks1971 · 2026-06-28T12:41:42Z

+	client := lksdk.NewRoomServiceClient(testServerURL(t), "devkey", "secret")
+	ctx := failoverCtx(t, lksdk.FailoverOn, map[string]string{hdrFailRegions: "0"})
+	_, err := client.CreateRoom(ctx, &livekit.CreateRoomRequest{Name: "api-test"})
+	require.NoError(t, err, "should fail over to a healthy region")


can this test the exact region that it connects to? Guess not as CreateRoomResponse does not have a region indication.

boks1971 · 2026-06-28T13:12:24Z

+	// original host plus (MaxAttempts-1) fallback regions. Defaults to 3.
+	MaxAttempts int
+	// BackoffBase is the initial delay before the first retry; each subsequent
+	// retry doubles it. Defaults to 200ms. Set to 0 to retry without delay.


Probably not allow this no delay if this is at API level. Have horror stories at a previous job where a no back off retry kept piling up and eventually killing the service.

I guess users can still bypass and do their own back off, but at least not provide that option here?

great point.. I've removed these options, since this is internal, I've kept it hard coded. we can always introduce new options later

davidzhao · 2026-06-28T13:27:21Z

related to my comment above - just thinking out loud here putting failover in the http.RoundTripper is neat and low-touch, but at that layer the failover decision is being made on opaque bytes + an HTTP status. It can read the method name (twirp.MethodName(ctx) rides on the request ctx), but not the typed request or the typed twirp error — so based on what I can see it can't really tell a safe-to-replay read from a mutation, and it ends up retrying every 5xx for every call. Should CreateRoom/StartRoomCompositeEgress/etc. silently replay against another region on a 5xx or dropped connection? That's an at-least-once semantic for operations that were effectively at-most-once before, and the transport can't distinguish them — which makes me think the retry/idempotency policy wants to live somewhere it can see the method and error. Not sure how all of that reflects observability (if that's a concern now or in future) - would metrics be generic HTTP retries rather than “Egress.StartRoomCompositeEgress failed over from region A to region B?

In this case, all requests should be retried. our server should be as idempotent as possible. that would be the right design patterns here. otherwise, if a request returns 500, how should we react to it? the contract says the API errored out and the most natural thing for any end user to handle is to retry it.

for observability, we should be able to see the HTTP path to determine the method/etc.

davidzhao · 2026-06-28T14:38:57Z

@milos-lk updated the implementation to address feedback. now we have a consolidated region cache.

feat: auto failover for API with LK Cloud

7e4efd6

automatically fail over 5xx and transport errors

davidzhao requested a review from a team as a code owner June 27, 2026 21:15

github-actions Bot added the slack-ts:cs-devs:1782594916.679099 label Jun 27, 2026

davidzhao commented Jun 27, 2026

View reviewed changes

Comment thread .github/workflows/test-api.yml

github-advanced-security AI found potential problems Jun 27, 2026

View reviewed changes

Comment thread .github/workflows/test-api.yml Fixed

github-advanced-security AI found potential problems Jun 27, 2026

View reviewed changes

Comment thread failover.go Fixed

fix codeql perms

b952fb6

milos-lk reviewed Jun 27, 2026

View reviewed changes

Comment thread failover.go Outdated

milos-lk reviewed Jun 27, 2026

View reviewed changes

Comment thread failover.go Outdated

milos-lk reviewed Jun 27, 2026

View reviewed changes

tweaks

2e82b8c

boks1971 approved these changes Jun 28, 2026

View reviewed changes

address feedback

8de18bd

github-advanced-security AI found potential problems Jun 28, 2026

View reviewed changes

Comment thread regionurlprovider.go Dismissed

Uh oh!

Conversation

davidzhao commented Jun 27, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

milos-lk Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

davidzhao Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

milos-lk Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

milos-lk commented Jun 27, 2026

Uh oh!

milos-lk commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

boks1971 Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

boks1971 Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

boks1971 Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

davidzhao Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

davidzhao commented Jun 28, 2026

Uh oh!

Uh oh!

davidzhao commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

milos-lk Jun 28, 2026 •

edited

Loading

milos-lk commented Jun 27, 2026 •

edited

Loading