fix: Start listening after schema cache load #4880
Previous discussion on the motivation for this change is in #4703 (comment).
@mkleczek As per #4703 (comment), this would clearly benefit the case of … But let's consider the scenario of a single instance managed by systemd behind a proxy:
With this change, we won't respond at all and clients will get a … So under this scenario, the new behavior looks worse? Thinking about it more, what we need is zero-downtime restarts, which I guess are easier under this new behavior since we could rely on systemd socket activation?
Not really. From the client's point of view, there is not much gain from these
Thought about removing the
So it's a minimum, not an exact time. I think it should be fine to make this clear in the docs and recommend jitter.
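What "a minimum, not an exact time" plus jitter means for a client can be sketched as follows. This is a minimal illustration, not PostgREST code: it assumes the client receives some server-advertised minimum delay in seconds (the thread does not say here which mechanism carries it), and the helper `retry_delay` and its `max_jitter` parameter are hypothetical.

```python
import random
from typing import Optional


def retry_delay(min_delay: float, max_jitter: float,
                rng: Optional[random.Random] = None) -> float:
    """Seconds a client should wait before retrying (hypothetical helper).

    min_delay is a lower bound: the result is never smaller than it.
    A random jitter in [0, max_jitter] is added so that many clients
    given the same minimum do not all retry at the same moment.
    """
    if min_delay < 0 or max_jitter < 0:
        raise ValueError("min_delay and max_jitter must be non-negative")
    if rng is None:
        rng = random.Random()
    return min_delay + rng.uniform(0.0, max_jitter)
```

For example, with `min_delay=5.0` and `max_jitter=2.0` each client waits somewhere between 5 and 7 seconds, so retries from many clients spread out instead of arriving together.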
@mkleczek The direction here is good; make sure to address the comments and then we can merge this.
I am marking this PR as a draft to address concerns related to handling schema cache loading errors during startup. It seems the right course of action cannot be either of these two extremes:
The first one forces clients to handle normal conditions as errors. It seems like the best (ultimate?) startup sequence should be:
That way we achieve both:
The above requires wider refactoring: today the whole schema cache loading loop is implemented in a single function without any means to introspect the state of the loading process. Clients can only trigger an asynchronous schema load and wait for it to finish. @steve-chavez WDYT?
I wrote up two different proposals but threw them away, because I always came to the conclusion that this is the sensible thing to do. So 👍
Looks much better. Also 👍 from me.
So between these two steps, we'd still return the connection error; however, after that we'd retry and get the 503. I agree with this. @steve-chavez Not sure if it was discussed elsewhere, but this would mean that the proposal to wait until the schema cache is loaded on startup is no longer desired, right?
@laurenceisla The waiting is being discussed in #4873 (comment). #4129 won't be solved here.
Updated the code to implement the above.
@mkleczek While the happy path is devoid of errors, the "usual path" always has some db connection errors (while the db is coming up, as happens with docker compose). Should we perhaps allow a number of retries before returning a 503? Otherwise, I'm wondering if there's value in merging this independently (separate from #4703), since under a connection error we'll now force a client to handle both … If we agree it's not an improvement on its own, maybe we should merge it together with #4703 (which is of course great on its own). That way we avoid a change in behavior here, since on #4703 this change will be guarded by the reuse port config. Thoughts?
I remember that on #4703 (comment) the third option sounded great and IIRC it was the main motivation for this PR, but under real-world conditions we've seen
The problem is with … There are several scenarios we have to consider, I guess:
I am now starting to think that the best strategy is the original idea of this PR: do not listen until the schema cache is loaded (even in case of errors). It handles scenarios 2 and 3 properly, and in scenario 1 it makes clients receive … WDYT?
Yes, that's why I mentioned above that this behavior of
Yes, this would only make things harder for non-reuseport cases (since connection refused errors can last a long time). There's no benefit in changing the behavior for the default case; we would be introducing a breaking change unnecessarily.
The problem with what we have currently is that clients get errors during startup even if all is fine. That's confusing and IMHO wrong. I'd say that gives us a choice:
The first option stays compatible with what we have now (and does not improve anything in cases where the initial schema cache load fails). The second option seems cleaner to me, but it is not clear-cut.
Needs a rebase after 1a6ba20. |
The reason I didn't do the refactoring first was to avoid hard-to-resolve conflicts. I'd be grateful if we collaborated more on PRs to make our job easier instead of harder.
Rebased |
Same reasoning here, but with an eye on our future selves, when we need to maintain things. It's much more likely that we'd want to revert this fix than the refactor. If we do the refactor first and then the fix, the fix is easy to revert. If we do it the other way around, we'd need to be very careful at that time. Btw, rebasing your changeset onto it should not have been hard. It should have been as easy as:
The result after the two commits is the same, so that part is really easy. The harder-to-resolve conflict, which required actually looking at the code, was the one I resolved when I cherry-picked it. That's why I did it myself and didn't force it onto you.
steve-chavez left a comment:
Looked at all the changes in tests; they look fine.
All that's left is resolving https://github.com/PostgREST/postgrest/pull/4880/changes#r3306654090.
This change ensures PostgREST starts listening on a server socket only after it has loaded the schema cache and is ready to handle requests. It no longer returns 503 errors during startup while the schema cache is still loading.
DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.
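The startup ordering described in the commit message (load the schema cache first, only then listen) can be sketched with Python's standard `socket` module. This is an illustration under stated assumptions, not PostgREST's Haskell implementation: `start_after_schema_cache` and the `load_schema_cache` callable are hypothetical stand-ins, and the loader is assumed to block (retrying internally if needed) until the cache is ready. Until `socket.create_server` runs, nothing listens on the port, so clients see a refused connection rather than a 503. The `reuse_port` argument is only passed through to echo the reuse port config mentioned for #4703.

```python
import socket
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def start_after_schema_cache(
    load_schema_cache: Callable[[], T],
    host: str = "127.0.0.1",
    port: int = 0,
    reuse_port: bool = False,
) -> Tuple[socket.socket, T]:
    """Load the schema cache first, then open the listening socket.

    Hypothetical sketch: if load_schema_cache raises, no socket is ever
    opened, so clients keep getting "connection refused" instead of 503s.
    """
    # Step 1: block until the schema cache is ready. Nothing is bound or
    # listening on (host, port) while this runs.
    schema_cache = load_schema_cache()

    # Step 2: only now bind and listen; create_server() calls listen().
    server = socket.create_server((host, port), reuse_port=reuse_port)
    return server, schema_cache
```

The design choice this illustrates is that readiness is signalled by the socket itself: a client that can connect can also be served, at the cost of connection refused errors (instead of 503s) for as long as the initial load takes.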