Skip to content

fix: implement connection tracking in metrics#4672

Merged
steve-chavez merged 1 commit into
PostgREST:mainfrom
mkleczek:refactor/connection-tracking
May 18, 2026
Merged

fix: implement connection tracking in metrics#4672
steve-chavez merged 1 commit into
PostgREST:mainfrom
mkleczek:refactor/connection-tracking

Conversation

@mkleczek
Copy link
Copy Markdown
Collaborator

@mkleczek mkleczek commented Feb 25, 2026

DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

Right now metrics observation handler does not track database connections but updates a single Gauge based on HasqlPoolObs events.

This is problematic because Hasql pool reports various connection events in various states that make it impossible to predict the state change from the received event. The connection state machine is not simple and to precisely report the number of connections in various states, it is necessary to track their lifecycles.

Fixes #4622

@steve-chavez
Copy link
Copy Markdown
Member

Fixes #4622

Fix should have an entry on the CHANGELOG

Comment thread src/PostgREST/Metrics.hs Outdated
Comment thread src/PostgREST/Metrics.hs
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 5 times, most recently from 0f475e3 to 592e5a3 Compare March 10, 2026 14:53
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 2 times, most recently from 1f3b4bc to bce1f6e Compare March 13, 2026 06:49
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 8 times, most recently from 7383b59 to 9b3c004 Compare April 2, 2026 14:45
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch from 9b3c004 to 6af7b30 Compare April 8, 2026 12:33
Comment thread test/observability/Observation/MetricsSpec.hs
@steve-chavez
Copy link
Copy Markdown
Member

@mkleczek Since this is a fix, let's change the commit, PR title and add a fix entry in the CHANGELOG.

@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 5 times, most recently from 27d9a1d to e438eed Compare April 14, 2026 07:02
@steve-chavez
Copy link
Copy Markdown
Member

@mkleczek Could you please address the feedback here? I've repeated the same twice now 😩

@mkleczek
Copy link
Copy Markdown
Collaborator Author

mkleczek commented Apr 14, 2026

@mkleczek Could you please address the feedback here? I've repeated the same twice now 😩

Done

Comment thread test/observability/Observation/MetricsSpec.hs
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 4 times, most recently from 90134d4 to c196a70 Compare May 13, 2026 13:21
@mkleczek
Copy link
Copy Markdown
Collaborator Author

@steve-chavez @wolfgangwalther - what do you think is required to push this forward?

#4622 is being now reported by our support and it is becoming urgent to fix it.

@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 3 times, most recently from c389f35 to 202f84e Compare May 14, 2026 08:30
@wolfgangwalther
Copy link
Copy Markdown
Member

what do you think is required to push this forward?

This does not introduce any new test infrastructure that I'd be opposed to, so I won't block on the long term vision of how our tests should be structured. It uses the existing infrastructure. I might not like the way the test is written, but I don't see a need to block on that either.

To be clear, the question in #4672 (comment) was asked to get a feeling of how things could be done differently, if we had better test infrastructure elsewhere - and not to block this PR's progress.

Imho the only previously blocking comment is #4672 (comment). Now, since I wrote that comment, I started a major discussion on how we should test in general, which is blocking the other, test-infra related, issues/PRs. I don't think we should hold this PR hostage to that either.

TLDR: No blockers for me.

@steve-chavez
Copy link
Copy Markdown
Member

what do you think is required to push this forward?

It's gonna be really confusing when we look back in history and we say we fix #4622 and there isn't a precise test proving it (the current test does not).

Let's not set a precedent here that can later hurt us with tech debt, so we should first clear the above thread.

@wolfgangwalther
Copy link
Copy Markdown
Member

It's gonna be really confusing when we look back in history and we say we fix #4622 and there isn't a precise test proving it (the current test does not).

Let's not set a precedent here that can later hurt us with tech debt, so we should first clear the above thread.

The thread you linked is about the currently existing test, so that doesn't quite match the first sentence about a missing test. If you're concerned about a missing test, then we need to clear #4672 (comment).

I'd still say the situation now is different compared to when #1766 happened - we are actively working on improving the test situation and we have an open PR to track the addition of the herein-missing test. Thus, I'd say the risk of this getting forgotten is much smaller than earlier.

I'd say we should go ahead with this.

@mkleczek mkleczek force-pushed the refactor/connection-tracking branch from 202f84e to 8e12a9a Compare May 15, 2026 07:59
@steve-chavez
Copy link
Copy Markdown
Member

@wolfgangwalther Let's not merge because I have a much simpler test almost ready for PR, let's merge this after that.

@wolfgangwalther
Copy link
Copy Markdown
Member

No worries, I don't intend to merge. I just wanted to make my implicit approval explicit. I am well aware that you still have a thread open (this is now actually blocking the merge as well) - and I'm not just going to override you and resolve that thread. That's for you to decide :)

Right now metrics observation handler does not track database connections but updates a single Gauge based on HasqlPoolObs events. This is problematic because Hasql pool reports various connection events in multiple phases. The connection state machine is not simple and to precisely report the number of connections in various states, it is necessary to track their lifecycles.

This change adds a ConnTrack data structure and logic to track database connections lifecycles. At the moment it supports "connected" and "inUse" connection counts precisely. The "pgrst_db_pool_available" metric is implemented on top of ConnTrack instead of a simple Gauge.
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch from bf18e9a to 7785329 Compare May 18, 2026 12:16
@mkleczek mkleczek requested a review from steve-chavez May 18, 2026 12:25
@steve-chavez steve-chavez merged commit a297391 into PostgREST:main May 18, 2026
33 checks passed
Comment thread CHANGELOG.md
### Fixed

- Fix unnecessary connection pool flushes during schema cache reloading by @mkleczek in #4645
- Fix race condition in pool_available metric causing negative values during network instability by @mkleczek in #4622
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkleczek This was added in an old version Fixed section https://github.com/PostgREST/postgrest/blob/main/CHANGELOG.md#fixed-2 😕

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehh... rebasing changelog is inherently tricky. My bad.

Raised #4942

@steve-chavez
Copy link
Copy Markdown
Member

I think we should backport this, it will require:

@mkleczek
Copy link
Copy Markdown
Collaborator Author

I think we should backport this, it will require:

Hmm... most probably the test in observability test suite is a problem for backporting.

Maybe we should split it into 2 PRs then (separate the test in MetricsSpec)?

@steve-chavez
Copy link
Copy Markdown
Member

Maybe we should split it into 2 PRs then (separate the test in MetricsSpec)?

Yup, sounds good.

@taimoorzaeem
Copy link
Copy Markdown
Member

I think we just follow the order in the git log. It involves 5 commits in order:

Let me try doing this in 1 PR, so it's easier to revert, just in case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Race condition in pool_available metric causes negative values during network instability

4 participants