fix: Do not clear the schema cache during retries#4869
Conversation
c468e80 to
36be579
Compare
|
Clarifying the motivation on #4873 |
5201ea9 to
82cf2e7
Compare
|
Follow up question on #4873 (comment), not sure if it's correct to do the fix like this. |
02f8fbc to
4d6a87d
Compare
@steve-chavez - So what's the conclusion after #4873 (comment), #4873 (comment) and all discussions in #4873 ? Are we going to proceed with this PR? |
|
@mkleczek We need to clarify what are we going to do with the request "waiting" (not sure how to call this) mentioned on #4873 (comment) because there's also a scenario here where we void the schema cache and we force clients to wait while the schema cache is reloaded again. If that "waiting" is useless, let's settle it on #4873 (comment). |
f92122f to
3b60b73
Compare
|
Now that we clarified the "waiting" was ineffective on #4937, let's proceed with reviewing this. |
| @@ -763,7 +763,7 @@ def test_admin_ready_includes_schema_cache_state(defaultenv, metapostgrest): | |||
| assert response.status_code == 503 | |||
|
|
|||
| response = postgrest.session.get("/projects", timeout=1) | |||
| assert response.status_code == 503 | |||
| assert response.status_code == 200 | |||
There was a problem hiding this comment.
@mkleczek One thing that is confusing, this is the full test we're changing here:
Lines 740 to 768 in 86d6ed1
So the /ready endpoint will give 503 but a request to /projects will give 200? An admin will not be able to correlate the time the health check was failing with the users' requests failing. I don't think this behavior is correct.
There was a problem hiding this comment.
Hmm... but what is there to correlate if a request returns 200? There is no error in this case. Schema cache is stale but we still can handle the query.
But this is a good question: what should admin server return from /ready during schema cache reloading? 200 or 503?
There was a problem hiding this comment.
But this is a good question: what should admin server return from /ready during schema cache reloading? 200 or 503?
Under this change, it looks we should return 200 on /ready once the startup scache load is done, no matter if it fails later.
And perhaps only return 503 if the db is down (detected by connection error or perhaps by pool metric).
But TBH not sure if any of the above are right.
There was a problem hiding this comment.
@wolfgangwalther I wonder if you have any opinions on what the /ready endpoint should report with this change?
There was a problem hiding this comment.
In general, the /ready endpoint is kind of a hypothetical request. A request that allows the infrastructure provider to tell whether a real request would likely succeed or not (and thus whether it makes sense to route requests to this instance). As such, its a basic requirement that ready returns true when the actual request succeeds - and vice versa.
Any combination of "ready says yes, but request says nope" or "ready says no, but request would succeed" is inherently wrong.
Which result the ready endpoint should return is thus closely related to the long discussion in #4873.
TLDR: If you intend to optimistically serve requests while the schema cache is reloaded, the ready endpoint must report "ready" at this stage, too.
There was a problem hiding this comment.
/live return 200 only when the schema cache has initially been loaded successfully.
I think it does make more sense now that we're switching to serving requests as "best effort", we at least need the initial schema cache load for that. @mkleczek Any thoughts?
There was a problem hiding this comment.
I think it does make more sense now that we're switching to serving requests as "best effort", we at least need the initial schema cache load for that
On second thought, if we go forward with #4468 (comment) then not even an initial schema cache load would be required for serving some requests (like request to table with simple filters). So it wouldn't be correct in that case for /live to wait until schema cache load is done.
There was a problem hiding this comment.
it only tries to load once and if there's an error we start listening on the port anyway, during this point then
/livewill be 200I think this is wrong. With our new understanding of "initial schema load is different than reloads", we should also make
/livereturn 200 only when the schema cache has initially been loaded successfully.
I am not convinced.
I think that we should base our thinking on the goals of each endpoint in a production runtime environment. What do we want to achieve with various endpoints?
My thinking is: the endpoints should be based on k8s probes as canonical (as they are the most common and quite obvious to copy in different environments as they simply make sense).
There are 3 kinds of probes:
- Startup probe -
k8suses it during startup to determine if program is started. - Liveness probe -
k8suses it to determine if the process is "alive" (ie. did not hang) - if liveness probe failsk8swill kill the process and start a new instance. - Readiness probe -
k8suses it to determine if the process is capable of serving traffic. If readiness probe fails, program will stop receiving traffic (but will stay alive and will be checked for liveness and readiness).
We are discussing readiness probe here. I see two different cases when PostgREST might be considered "not ready":
- No schema cache
- Stale schema cache (when known)
IMO it should be an operator's decision which one to check in readiness probe. Hence I would suggest two different URIs: one would return 5xx on "no schema cache", the second one - on "no schema cache or known stale schema cache". It does not matter if these URIs are different paths or some different query parameters - don't want to bikeshed about this.
Note that the first case is well suited for startup and readiness probe, while the second does not really make sense on startup. But none of them is well suited for liveness probe really as there is no point in restarting PostgREST due to schema cache not being loaded or stale.
That leaves us with liveness probe - we should provide a simple way to tell if PostgREST should be restarted. I would say current /live endpoint is pretty OK but in reality - redundant because k8s provides TCP liveness probe OOTB.
Nevertheless - /live is substantially different from others because its goal is different.
(Of course - an operator can configure liveness probe to check one if the schema cache related URIs if needed.)
@steve-chavez WDYT?
There was a problem hiding this comment.
Hence I would suggest two different URIs: one would return 5xx on "no schema cache", the second one - on "no schema cache or known stale schema cache". It does not matter if these URIs are different paths or some different query parameters - don't want to bikeshed about this.
@mkleczek Agree, I've opened #4985 for that.
I think only some docs are missing here, somewhere in:
There was a problem hiding this comment.
My thinking is: the endpoints should be based on
k8sprobes as canonical
[...]
Thanks, I now see that my proposal about /live does not make sense here.
e685d2f to
5f3f373
Compare
8a684e0 to
0bbfaf9
Compare
retryingSchemaCacheLoad should not clear existing schema cache upon failure - there is no reason to do that. If there is a communication issue with the database server or db is down, clients are going to get 502 anyway. If it was a glitch when loading the schema cache - the clients are going to use old (stale) schema cache for some time until next retry re-loads it successfully.
0bbfaf9 to
31998c5
Compare
| - Shutdown should wait for in flight requests by @mkleczek in #4702 | ||
| - Remove automatic transaction retries on `40001 (serialization_failure)` errors to prevent replication lag by @laurenceisla in #3673 | ||
| - Fix unexpected results when embedding and filtering the same table more than once by @laurenceisla in #4075 | ||
| - PostgREST no longer returns voids schema cache during loading retries by @mkleczek in #4873 #4869 |
There was a problem hiding this comment.
This needs to be more user facing, currently it mentions an implementation detail. Maybe:
| - PostgREST no longer returns voids schema cache during loading retries by @mkleczek in #4873 #4869 | |
| - If the schema cache fails to reload, PostgREST will no longer stop serving requests and will continue doing so in a "best effort" basic by @mkleczek in #4873 #4869 |
DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.
retryingSchemaCacheLoadshould not clear existing schema cache upon failure - there is no reason to do that.If there is a communication issue with the database server or db is down, clients are going to get 502 anyway. If it was a glitch when loading the schema cache - the clients are going to use old (stale) schema cache for some time until next retry re-loads it successfully.
Fixes #4873