feat: auto failover for API with LK Cloud #936
automatically fail over 5xx and transport errors
// discover regions lazily
if !fetchedRegions {
	settings = t.regions.get(originalScheme, originalHost, req.Header)
Should the discovery get/fetch respect the caller's context? Right now RoundTrip calls t.regions.get(originalScheme, originalHost, req.Header) without req.Context(), and fetch uses http.NewRequest on the 5s-timeout client — so it ignores the caller's deadline/cancel.
It shouldn't. This is an internal request to discover regions, and we'll limit it to a much shorter timeout. For example, if someone calls CreateSIPParticipant with a 30s timeout, we do not want the region refresh to take the same 30s.
Totally agree the discovery fetch shouldn't ride the caller's 30s; capping it short is right. One clarification though: threading the context doesn't undo that cap. http.NewRequestWithContext(ctx, …) on the Timeout: 2s client is bounded by min(2s, remaining deadline), not the max. The client enforces its own Timeout and also honors the request ctx, so whichever fires first wins, right? Your CreateSIPParticipant/30s case still gets capped at 2s; the 30s never applies.
The reason to still pass ctx is the two cases the fixed 2s alone doesn't cover:
- Cancellation - if the caller cancels (client disconnected, op aborted, process draining), the in-flight discovery keeps burning up to 2s on an abandoned request instead of stopping right away.
- Sub-2s deadlines (perhaps not frequent in practice, but possible): a caller with a 500ms timeout sees the primary fail at 500ms, then discovery still runs for up to 2s, so the call returns after ~2.5s and the result gets discarded anyway.
Just for my education: are all APIs idempotent and safe to be retried on 5xx?
Related to my comment above, just thinking out loud here.
client := lksdk.NewRoomServiceClient(testServerURL(t), "devkey", "secret")
ctx := failoverCtx(t, lksdk.FailoverOn, map[string]string{hdrFailRegions: "0"})
_, err := client.CreateRoom(ctx, &livekit.CreateRoomRequest{Name: "api-test"})
require.NoError(t, err, "should fail over to a healthy region")
Can this test assert the exact region it connects to? I guess not, since CreateRoomResponse does not carry a region indication.
// original host plus (MaxAttempts-1) fallback regions. Defaults to 3.
MaxAttempts int
// BackoffBase is the initial delay before the first retry; each subsequent
// retry doubles it. Defaults to 200ms. Set to 0 to retry without delay.
We probably shouldn't allow zero delay if this is exposed at the API level. I have horror stories from a previous job where retries without backoff kept piling up and eventually killed the service.
I guess users can still bypass this and do their own backoff, but at least we wouldn't provide that option here?
Great point. I've removed these options; since this is internal, I've kept the values hard-coded. We can always introduce new options later.
In this case, all requests should be retried. Our server should be as idempotent as possible; that would be the right design pattern here. Otherwise, if a request returns 500, how should we react to it? The contract says the API errored out, and the most natural thing for any end user to do is retry. For observability, we should be able to see the HTTP path to determine the method, etc.
@milos-lk updated the implementation to address feedback. We now have a consolidated region cache.