fix: Do not clear the schema cache during retries by mkleczek · Pull Request #4869 · PostgREST/postgrest

mkleczek · 2026-05-04T07:28:49Z

DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

retryingSchemaCacheLoad should not clear the existing schema cache upon failure - there is no reason to do that.
If there is a communication issue with the database server or the db is down, clients are going to get 502 anyway. If it was a glitch when loading the schema cache, clients are going to use the old (stale) schema cache for some time until the next retry re-loads it successfully.

Fixes #4873
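
To make the intended behavior concrete, here is a minimal Python sketch of "keep the old cache when a reload attempt fails". It is not the PostgREST implementation (which is written in Haskell); `SchemaCacheHolder`, `reload_with_retries` and `flaky_loader` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SchemaCacheHolder:
    """Toy stand-in for the server state that owns the schema cache."""

    cache: Optional[dict] = None  # last successfully loaded cache, if any
    stale: bool = False           # True when a reload failed and an old cache is kept

    def reload_with_retries(self, load: Callable[[], dict], max_attempts: int) -> bool:
        """Try to load a new cache; on failure keep whatever was there before."""
        for _ in range(max_attempts):
            try:
                new_cache = load()
            except Exception:
                # The behavior this PR removes would correspond to setting
                # self.cache = None here. Instead we only record staleness.
                self.stale = self.cache is not None
                continue
            self.cache, self.stale = new_cache, False
            return True
        return False


def flaky_loader(failures: int, result: dict) -> Callable[[], dict]:
    """Hypothetical loader that fails `failures` times before succeeding."""
    remaining = {"n": failures}

    def load() -> dict:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TimeoutError("statement timeout while loading schema cache")
        return result

    return load


holder = SchemaCacheHolder(cache={"tables": ["projects"]})
# All three attempts fail, yet the previously loaded cache is still available.
ok = holder.reload_with_retries(
    flaky_loader(5, {"tables": ["projects", "tasks"]}), max_attempts=3
)
```

In this sketch a request arriving after the failed reload would still see `holder.cache`, which mirrors the "stale but usable" situation described above.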

steve-chavez · 2026-05-04T18:02:40Z

Clarifying the motivation on #4873

steve-chavez · 2026-05-04T18:14:49Z

Follow up question on #4873 (comment), not sure if it's correct to do the fix like this.

mkleczek · 2026-05-12T07:32:19Z

> Follow up question on #4873 (comment), not sure if it's correct to do the fix like this.

@steve-chavez - So what's the conclusion after #4873 (comment), #4873 (comment) and all the discussions in #4873?

Are we going to proceed with this PR?

steve-chavez · 2026-05-12T18:18:36Z

@mkleczek We need to clarify what we are going to do with the request "waiting" (not sure what to call this) mentioned on #4873 (comment), because there's also a scenario here where we void the schema cache and force clients to wait while the schema cache is reloaded. If that "waiting" is useless, let's settle it on #4873 (comment).

steve-chavez · 2026-05-20T16:21:51Z

Now that we clarified the "waiting" was ineffective on #4937, let's proceed with reviewing this.

steve-chavez · 2026-05-20T16:25:46Z

@@ -763,7 +763,7 @@ def test_admin_ready_includes_schema_cache_state(defaultenv, metapostgrest):
        assert response.status_code == 503

        response = postgrest.session.get("/projects", timeout=1)
-        assert response.status_code == 503
+        assert response.status_code == 200


@mkleczek One thing that is confusing, this is the full test we're changing here:

postgrest/test/io/test_io.py

Lines 740 to 768 in 86d6ed1

def test_admin_ready_includes_schema_cache_state(defaultenv, metapostgrest):
    "Should get a failed response from the admin server ready endpoint when the schema cache is not loaded"

    role = "timeout_authenticator"

    env = {
        **defaultenv,
        "PGUSER": role,
        "PGRST_DB_ANON_ROLE": role,
        "PGRST_INTERNAL_SCHEMA_CACHE_QUERY_SLEEP": "500",
    }

    with run(env=env) as postgrest:
        # The schema cache query takes at least 500ms, due to PGRST_INTERNAL_SCHEMA_CACHE_QUERY_SLEEP above.
        # Make it impossible to load the schema cache, by setting statement timeout to 400ms.
        set_statement_timeout(metapostgrest, role, 400)

        # force a reconnection so the new role setting is picked up
        postgrest.process.send_signal(signal.SIGUSR1)

        postgrest.wait_until_scache_starts_loading()

        response = postgrest.admin.get("/ready", timeout=1)
        assert response.status_code == 503

        response = postgrest.session.get("/projects", timeout=1)
        assert response.status_code == 503

        reset_statement_timeout(metapostgrest, role)

So the /ready endpoint will give 503 but a request to /projects will give 200? An admin will not be able to correlate the time the health check was failing with the users' requests failing. I don't think this behavior is correct.

Hmm... but what is there to correlate if a request returns 200? There is no error in this case. Schema cache is stale but we still can handle the query.

But this is a good question: what should admin server return from /ready during schema cache reloading? 200 or 503?

> But this is a good question: what should admin server return from /ready during schema cache reloading? 200 or 503?

Under this change, it looks like we should return 200 on /ready once the startup scache load is done, no matter if it fails later.

And perhaps only return 503 if the db is down (detected by connection error or perhaps by pool metric).

But TBH not sure if any of the above are right.

@wolfgangwalther I wonder if you have any opinions on what the /ready endpoint should report with this change?

In general, the /ready endpoint is kind of a hypothetical request. A request that allows the infrastructure provider to tell whether a real request would likely succeed or not (and thus whether it makes sense to route requests to this instance). As such, it's a basic requirement that ready returns true when the actual request succeeds - and vice versa.

Any combination of "ready says yes, but request says nope" or "ready says no, but request would succeed" is inherently wrong.

Which result the ready endpoint should return is thus closely related to the long discussion in #4873.

TLDR: If you intend to optimistically serve requests while the schema cache is reloaded, the ready endpoint must report "ready" at this stage, too.
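
The requirement that /ready and real requests agree can be checked mechanically. The following Python sketch uses a toy three-state model (`CacheState`, `request_status`, `ready_status` and `strict_ready_status` are all hypothetical names, not PostgREST APIs) and lists the states in which readiness and request outcome disagree.

```python
from enum import Enum
from typing import Callable, List


class CacheState(Enum):
    NOT_LOADED = "not loaded"  # the initial load has not succeeded yet
    FRESH = "fresh"            # the last load succeeded
    STALE = "stale"            # a reload failed and the old cache was kept


def request_status(state: CacheState) -> int:
    """Status a data request gets in this toy model once stale caches are served."""
    return 503 if state is CacheState.NOT_LOADED else 200


def ready_status(state: CacheState) -> int:
    """Readiness that mirrors what a real request would get."""
    return request_status(state)


def strict_ready_status(state: CacheState) -> int:
    """Readiness that fails whenever the cache is not fresh."""
    return 200 if state is CacheState.FRESH else 503


def inconsistent_states(
    ready: Callable[[CacheState], int],
    request: Callable[[CacheState], int],
) -> List[CacheState]:
    """States where 'ready' and 'request would succeed' disagree."""
    return [s for s in CacheState if (ready(s) == 200) != (request(s) == 200)]


mirrored = inconsistent_states(ready_status, request_status)
strict = inconsistent_states(strict_ready_status, request_status)
```

Under these assumptions `mirrored` is empty, while `strict` contains only `CacheState.STALE`, which is the "/ready says 503 but /projects returns 200" combination raised earlier in this thread.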

> /live return 200 only when the schema cache has initially been loaded successfully.

I think it does make more sense now that we're switching to serving requests as "best effort", we at least need the initial schema cache load for that. @mkleczek Any thoughts?

> I think it does make more sense now that we're switching to serving requests as "best effort", we at least need the initial schema cache load for that

On second thought, if we go forward with #4468 (comment) then not even an initial schema cache load would be required for serving some requests (like requests to a table with simple filters). So it wouldn't be correct in that case for /live to wait until the schema cache load is done.

> it only tries to load once and if there's an error we start listening on the port anyway, during this point then /live will be 200

I think this is wrong. With our new understanding of "initial schema load is different than reloads", we should also make /live return 200 only when the schema cache has initially been loaded successfully.

I am not convinced.

I think that we should base our thinking on the goals of each endpoint in a production runtime environment. What do we want to achieve with various endpoints?

My thinking is: the endpoints should be based on k8s probes as canonical (as they are the most common and quite obvious to copy in different environments as they simply make sense).
There are 3 kinds of probes:

- Startup probe - k8s uses it during startup to determine if the program has started.
- Liveness probe - k8s uses it to determine if the process is "alive" (i.e. did not hang). If the liveness probe fails, k8s will kill the process and start a new instance.
- Readiness probe - k8s uses it to determine if the process is capable of serving traffic. If the readiness probe fails, the program will stop receiving traffic (but will stay alive and will still be checked for liveness and readiness).

We are discussing the readiness probe here. I see two different cases when PostgREST might be considered "not ready":

1. No schema cache
2. Stale schema cache (when known)

IMO it should be an operator's decision which one to check in readiness probe. Hence I would suggest two different URIs: one would return 5xx on "no schema cache", the second one - on "no schema cache or known stale schema cache". It does not matter if these URIs are different paths or some different query parameters - don't want to bikeshed about this.

Note that the first case is well suited for startup and readiness probe, while the second does not really make sense on startup. But none of them is well suited for liveness probe really as there is no point in restarting PostgREST due to schema cache not being loaded or stale.

That leaves us with liveness probe - we should provide a simple way to tell if PostgREST should be restarted. I would say current /live endpoint is pretty OK but in reality - redundant because k8s provides TCP liveness probe OOTB.

Nevertheless - /live is substantially different from others because its goal is different.
(Of course, an operator can configure the liveness probe to check one of the schema cache related URIs if needed.)
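
The two readiness variants proposed above (one failing only on a missing schema cache, the other also on a known-stale one) could be sketched in Python as follows. `SchemaCacheStatus`, `readiness` and the `fail_on_stale` flag are hypothetical names, and 503 is an assumed choice for the "5xx" mentioned above; how the variant is selected (path or query parameter) is deliberately left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaCacheStatus:
    loaded: bool       # a schema cache exists in memory
    known_stale: bool  # the last reload attempt is known to have failed


def readiness(status: SchemaCacheStatus, fail_on_stale: bool) -> int:
    """Return the HTTP status a readiness check would report in this sketch."""
    if not status.loaded:
        # Case 1: no schema cache at all, not ready in either variant.
        return 503
    if fail_on_stale and status.known_stale:
        # Case 2: only the stricter variant treats a stale cache as not ready.
        return 503
    return 200


stale = SchemaCacheStatus(loaded=True, known_stale=True)
lenient_result = readiness(stale, fail_on_stale=False)
strict_result = readiness(stale, fail_on_stale=True)
```

With a stale but loaded cache, the lenient variant reports 200 and the strict one 503, so the operator decides which condition should take an instance out of rotation.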

@steve-chavez WDYT?

> Hence I would suggest two different URIs: one would return 5xx on "no schema cache", the second one - on "no schema cache or known stale schema cache". It does not matter if these URIs are different paths or some different query parameters - don't want to bikeshed about this.

@mkleczek Agree, I've opened #4985 for that.

I think only some docs are missing here, somewhere in:

postgrest/docs/references/schema_cache.rst

Line 1 in fae5817

.. _schema_cache:

> My thinking is: the endpoints should be based on k8s probes as canonical
> [...]

Thanks, I now see that my proposal about /live does not make sense here.

steve-chavez · 2026-06-03T17:59:20Z

 - Shutdown should wait for in flight requests by @mkleczek in #4702
 - Remove automatic transaction retries on `40001 (serialization_failure)` errors to prevent replication lag by @laurenceisla in #3673
 - Fix unexpected results when embedding and filtering the same table more than once by @laurenceisla in #4075
+- PostgREST no longer returns voids schema cache during loading retries by @mkleczek in #4873 #4869


This needs to be more user facing, currently it mentions an implementation detail. Maybe:

Suggested change

- PostgREST no longer returns voids schema cache during loading retries by @mkleczek in #4873 #4869

- If the schema cache fails to reload, PostgREST will no longer stop serving requests and will continue doing so on a "best effort" basis by @mkleczek in #4873 #4869

mkleczek requested a review from steve-chavez May 4, 2026 07:49

mkleczek force-pushed the push-olktzqvzkonu branch from c468e80 to 36be579 Compare May 4, 2026 17:35

steve-chavez reviewed May 4, 2026

Comment thread CHANGELOG.md Outdated

mkleczek force-pushed the push-olktzqvzkonu branch 2 times, most recently from 5201ea9 to 82cf2e7 Compare May 4, 2026 18:08

wolfgangwalther reviewed May 5, 2026

Comment thread test/io/test_io.py

Comment thread test/io/fixtures/roles.sql

mkleczek force-pushed the push-olktzqvzkonu branch 3 times, most recently from 02f8fbc to 4d6a87d Compare May 12, 2026 06:29

mkleczek mentioned this pull request May 12, 2026

A statement timeout can void the schema cache #4873

Open

mkleczek force-pushed the push-olktzqvzkonu branch 8 times, most recently from f92122f to 3b60b73 Compare May 20, 2026 10:35

steve-chavez reviewed May 20, 2026

mkleczek force-pushed the push-olktzqvzkonu branch 3 times, most recently from e685d2f to 5f3f373 Compare May 31, 2026 06:04

mkleczek force-pushed the push-olktzqvzkonu branch 2 times, most recently from 8a684e0 to 0bbfaf9 Compare May 31, 2026 21:32

steve-chavez mentioned this pull request Jun 1, 2026

New metric for failed_attempts_since_last_successful_schema_reload #4971

Open

mkleczek force-pushed the push-olktzqvzkonu branch from 0bbfaf9 to 31998c5 Compare June 2, 2026 07:13

steve-chavez mentioned this pull request Jun 2, 2026

Response body for health checks #4181

Open

steve-chavez reviewed Jun 3, 2026

steve-chavez mentioned this pull request Jun 3, 2026

excluding checks in admin /ready endpoint #4985

Open

