THREESCALE-14969: Rotate OIDC tokens for Zync#4310
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #4310 +/- ##
==========================================
- Coverage 88.92% 88.87% -0.06%
==========================================
Files 1752 1753 +1
Lines 44131 44146 +15
Branches 689 689
==========================================
- Hits 39245 39235 -10
- Misses 4870 4895 +25
Partials 16 16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| def self.refresh_oidc_sync | ||
| user_id = scope_attributes["owner_id"] | ||
| cache_key = "access_tokens/user:#{user_id}/oidc" | ||
|
|
||
| # Hot path: skip the transaction entirely on cache hit (zero DB queries). | ||
| cached = Rails.cache.read(cache_key) | ||
| return cached if cached | ||
|
|
||
| transaction do | ||
| # Lock existing OIDC tokens to serialize concurrent cache misses (e.g. full resync). | ||
| # Workers that arrive simultaneously will queue here. On cold start (no tokens yet) | ||
| # there is nothing to lock and stampede churn is harmless — all created tokens are valid. | ||
| lock.where(name: OIDC_SYNC_TOKEN).load | ||
|
|
||
| # Double-check inside the transaction: a concurrent worker may have populated the cache | ||
| # while we were waiting on the row lock above. | ||
| Rails.cache.fetch(cache_key, expires_in: 1.hour) do | ||
| # Expire (not delete) the current active token so Zync can keep using it for up to | ||
| # 1 day while it picks up the new one. The janitor cleans up expired tokens weekly. | ||
| where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now) | ||
| create!(name: OIDC_SYNC_TOKEN, scopes: %w[account_management], permission: 'ro').plaintext_value | ||
| end | ||
| end | ||
| end |
There was a problem hiding this comment.
This method could be much more simple, like:
def self.refresh_oidc_sync
user_id = scope_attributes["owner_id"]
Rails.cache.fetch("access_tokens/user:#{user_id}/oidc", expires_in: 1.hour) do
where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)
create!(name: OIDC_SYNC_TOKEN, scopes: %w[account_management], permission: 'ro').plaintext_value
end
endBut that would cause a non-deterministic result on concurrent cache misses. e.g. When we run the full resync task from #4307 and the cache key is invalid, it could happen that many concurrent processes enter the block at the same time and update-then-create tokens, which would en up on the last of them prevailing and expiring all previously created token. Claude explains:
Simple version (bare Rails.cache.fetch)
1. A, B, C all call Rails.cache.read — all miss
2. A enters Rails.cache.fetch — cache miss, enters block
3. B enters Rails.cache.fetch — also cache miss (A hasn't finished yet), enters block
4. C enters Rails.cache.fetch — also cache miss, enters block
5. A: update_all(expires_at: 1.day.from_now) — expires the active token. Then create! — new token.
6. B: update_all(expires_at: 1.day.from_now) — expires A's freshly created token (it has expires_at: nil). Then create! — another new token.
7. C: same — expires B's token, creates yet another.
8. Each process writes its own value to the cache — last writer wins.
Result: 3 rotations, 2 wasted tokens (still valid for 1 day, so no broken auth, but unnecessary churn).
But we could also end up with multiple non-expired tokens, in this scenario:
1. A: update_all — expires old token, commits
2. B: update_all — no rows with expires_at: nil, no-op
3. A: create! — creates T1 (expires_at: nil)
4. C: update_all — expires T1
5. B: create! — creates T2 (expires_at: nil)
6. C: create! — creates T3 (expires_at: nil)
The committed code prevents this because it sets a DB lock at lock.where(name: OIDC_SYNC_TOKEN).load, and processes B and C will wait there, until A closes the transaction block. Since the cache.fetch call is inside the transaction, we ensure the cache is also written when A finished, so B and C will always get a hit when calling cache.fetch.
1. A, B, C all call Rails.cache.read — all miss
2. A enters the transaction first, runs SELECT ... FOR UPDATE — locks the OIDC token row
3. B and C enter transactions, hit SELECT ... FOR UPDATE — blocked at DB level, waiting for A's lock
4. A runs Rails.cache.fetch — cache miss, executes block: expires old token, creates new one, writes to cache. Transaction commits, lock released.
5. B gets the lock, runs SELECT ... FOR UPDATE. Then Rails.cache.fetch — cache hit (A populated it). Returns cached value. Transaction commits.
6. C same as B — cache hit, no DB writes.
Result: 1 rotation, 0 wasted tokens.
This is convenient but not really necessary since such race conditions will be rare and the result is just some wasted tokens that the Janitor will purge anyway. So if you want the simple version, I'm fine with it.
There was a problem hiding this comment.
I also considered the :race_condition_ttl option for Rails.cache.fetch but discarded it because it only prevents the cache stampede in the short period of time of N seconds after the cache expires, in any other scenario all processes would enter the block anyway.
There was a problem hiding this comment.
The lock increases DB query count. If we really need locking, them maybe better use redlock like we do for billing.
Given the complexity I wonder if we should avoid caching, expire the tokens and clear them with janitor every night.
There was a problem hiding this comment.
Yes, I'll make changes to use Redlock instead. I wouldn't avoid caching because that would create a new token for each zync event, that is, a few additional DB hits per zync event, if you worry about DB hits, I think caching is needed. Also, I have a strong opinion about caching because not doing so would turn our access tokens table into a dump. The Janitor runs once per week, not per night:
porta/app/lib/three_scale/jobs.rb
Lines 106 to 111 in 066c2d1
What I would accept is caching but not locking at all, then the code would look like #4310 (comment)
There was a problem hiding this comment.
ok, lets try to use locking sparsely then. Only if cache doesn't exist, then try to acquire a no-wait lock, if lock acquisition fails, then wait for lock and read the cache created by the other process. Something like that I imagine. WDYT?
There was a problem hiding this comment.
I've been trying to implement locking with Redlock and I don't think it's worth it. These are the reasons:
- With DB locking, waiting processes will wait indefinitely until the lock is released, at that very moment another process will take the lock. But Redis locking is based in polling: we must tweak the
:retry_countand:retry_delayoptions in order to configure how polling behaves, which is error prone and implies higher complexity to handle all possible scenarios (e.g. what if all retries complete before the lock is released?). Also, in the best scenario, the waiting process will not be awaken the moment the lock is released but when the current delay period finishes. - Redis lock requires implementing the logic to release the lock which adds more complexity to the method. With DB lock, it's the DB who holds all locking logic. It's better for us to delegate such logic to DB engine.
- Cost of DB locking in DB queries is not so high, because the cache miss will only happen once per hour per provider.
For those reasons I don't think the Redis lock is the proper tool for our scenario.
| # Expire (not delete) the current active token so Zync can keep using it for up to | ||
| # 1 day while it picks up the new one. The janitor cleans up expired tokens weekly. | ||
| where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now) |
There was a problem hiding this comment.
Doesn't zync pick up then new one almost immediately? 1 day to pick just he key looks excessive.
Also didn't we think to provide expiring tokens always to begin with?
There was a problem hiding this comment.
Zync should in most of cases run the jobs fast, but there's no warranty, the only thing that happens immediately is job enqueuing, that's why the rotated token doesn't have an expiration time, we don't know when the job is going to run and use the token.
Rotation means removing the old token, the only reason we set an expiration time instead of just removing token immediately is to mitigate the race condition that happens when a token is already rotated in porta but didn't reach zync yet, and zync sends a request at that moment using the old token.
The expiration time we set here is debatable, I explained it here: #4309 (comment)
I think a period of about 5-10 hours would be better, but you mentioned 1 day in some comment so I used that value.
| DeleteAllStaleObjectsWorker.perform_later(MessageRecipient.name, Message.name) | ||
| DeleteExpiredOIDCSyncTokensWorker.perform_async |
There was a problem hiding this comment.
How about just reusing this one?
| DeleteAllStaleObjectsWorker.perform_later(MessageRecipient.name, Message.name) | |
| DeleteExpiredOIDCSyncTokensWorker.perform_async | |
| DeleteAllStaleObjectsWorker.perform_later(MessageRecipient.name, Message.name, AccessToken.name) |
Just we should define the stale scope to equal expired_oidc_sync
1950368 to
50cff64
Compare
❌ 7 blocking issues (7 total)
|
| return cached if cached | ||
|
|
||
| transaction do | ||
| # Lock to serialize concurrent cache misses (e.g. full resync). |
There was a problem hiding this comment.
(e.g. full resync)
What should this mean 😬
There was a problem hiding this comment.
full resync (#4307) is a scenario when concurrent cache misses can happen
There was a problem hiding this comment.
IMO the comment can be removed. It brought only confusion to me, also it is clear that the lock prevents concurrent refreshes. But leaving up to you.
There was a problem hiding this comment.
I prefer to keep the comment, I think it'll help me in the future if I come back to this.
| create_with(scopes: %w[account_management], permission: 'ro').find_or_create_by!(name: OIDC_SYNC_TOKEN) | ||
| scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: ...Time.now.utc) } | ||
|
|
||
| def self.refresh_oidc_sync |
There was a problem hiding this comment.
Will it be annoying if I request to rename back to oidc_sync?
Maybe I suggested the name refresh_oidc_sync but in my version there was no cache and all calls actually refreshed the token. Now it can just read it from cache. So the name is incorrect for this usage.
Or in an extreme case might be #cached_or_refresh_oidc_sync.
There was a problem hiding this comment.
Will it be annoying if I request to rename back to oidc_sync?
Yes, but 1115be6
akostadinov
left a comment
There was a problem hiding this comment.
Looks great, thank you!
1115be6 to
8b1ec05
Compare
|
|
||
| def self.oidc_sync | ||
| create_with(scopes: %w[account_management], permission: 'ro').find_or_create_by!(name: OIDC_SYNC_TOKEN) | ||
| scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: ...Time.now.utc) } |
There was a problem hiding this comment.
The old oidc_sync method created tokens without setting expires_at, so existing tokens in production have expires_at: NULL. The stale scope won't match them because NULL < NOW() is NULL in SQL, not TRUE.
These will stay in the access_tokens table indefinitely , should the scope also handle NULLs ?
There was a problem hiding this comment.
Thanks for your review!
Yeah, what you say makes sense, and I think we could add null to the scope. My only objection is that NULLs won't accumulate on the table, so I don't think the Janitor is the place to remove them, since they would be removed on the first janitor run and then there would never be any non-expirable OIDC token ever again. It seems to me more appropriate for a one-off removal. IMO, the best way to approach this is to add a release note to heads up customers about the legacy tokens being stale in DB and providing them with the command to perform the cleanup if they wish:
bundle exec rails runner "AccessToken.where(name: AccessToken::OIDC_SYNC_TOKEN, expires_at: nil).delete_all"@akostadinov WDYT?
There was a problem hiding this comment.
Claude suggestion for release note:
OIDC sync tokens are now created with a 1-day expiration and rotated automatically. Legacy tokens (created before this change) remain in the database with no expiration date. They are no longer used but are not cleaned up automatically. To remove them, run:
bundle exec rails runner "AccessToken.where(name: AccessToken::OIDC_SYNC_TOKEN, expires_at: nil).delete_all"
There was a problem hiding this comment.
I actually second the null check in the stale scope. In fact it occurred to me in a random moment I was supposed not to think about work. And then I saw Madni suggest it. That's a good way to handle it without user interaction and we can leave a comment that this condition can be removed after a couple of releases.
Or we can add a migration to add expiration time. But letting user deal with it seems unnecessary.
There was a problem hiding this comment.
Can we done in another PR, even better.
There was a problem hiding this comment.
Because we will be able to deploy, see things are working fine and then update existing tokens. As you decide.
There was a problem hiding this comment.
OK, if there's two of you asking for this, here it is: b0b1b15
There was a problem hiding this comment.
Because we will be able to deploy, see things are working fine and then update existing tokens. As you decide.
I committed before reading this. But I think we are fine because the cron job will only launch the Janitor once per week, the night between Sunday and Monday, at 12 AM. So if we deploy this Monday or Tuesday, we still have 3 or 4 labor days to observe how everything works and rollback if something fails.
|
|
||
| scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: ...Time.now.utc) } | ||
| # nil covers legacy tokens created before expiration was added; can be removed after a release | ||
| scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: [...Time.now.utc, nil]) } |
…ctsWorker Adds AccessToken.name to the list of models cleaned by DeleteAllStaleObjectsWorker. This allows stale OIDC sync tokens to be automatically purged via the AccessToken.stale scope, keeping the cleanup logic centralized.
Refactors AccessToken.oidc_sync to implement automatic token rotation with a 1-hour cache window. Tokens now expire after 1 day, and the stale scope (including nil expires_at for legacy tokens) enables automatic cleanup via JanitorWorker. Key changes: - oidc_sync returns plaintext value directly (not the model) - Hot-path cache read skips DB entirely (zero queries on cache hit) - Lock serializes concurrent cache misses during full resyncs - Stale tokens remain valid until janitor cleanup runs - nil in stale scope covers legacy tokens created before expiration tracking was added (can be removed after a release) ZyncWorker updated to use the new plaintext return value.
b0b1b15 to
c9f9b8b
Compare
| OIDC_SYNC_TOKEN = 'OIDC Synchronization Token'.freeze | ||
|
|
||
| # nil covers legacy tokens created before expiration was added; can be removed after a release | ||
| scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: [...Time.now.utc, nil]) } |
There was a problem hiding this comment.
Can we find a better name for this scope please? 🙃
If I see AccessToken.stale, I would assume that it's any expired token 🙃
But it's in reality:
- only tokens related to OIDC sync
- including non-expirable tokens
Maybe at least smth like stale_oidc_sync ?
There was a problem hiding this comment.
UPDATE: ah, as I progressed with the review, I understood why you chose this name 😅
That's to make DeleteAllStaleObjectsWorker pick these objects automatically...
Mmm, OK, makes more sense now, but maybe we can at least leave a comment explaining the name choice?
There was a problem hiding this comment.
stale is a pretty meaningful name IMO :) Objects that will not be used and hold no value for anybody. We do not do this to user created tokens because users may wish to see what they previously had.
If you think a comment will help, please suggest one, rather than wait for the ai to add something meaningless 👼
| # Lock to serialize concurrent cache misses (e.g. full resync). | ||
| lock.where(name: OIDC_SYNC_TOKEN).load | ||
|
|
||
| Rails.cache.fetch(cache_key, expires_in: 1.hour) do |
There was a problem hiding this comment.
Also, I am wondering - this approach means that at least every hour (given that zync updates happen regularly) a new token will be generated. So, in a week (Janitor runs once a week, right?), there will be about 24*7 tokens generated. Just trying to think if this can be an issue 🙃 I guess not, it's not a huge amount.
There was a problem hiding this comment.
It's actually 24 x 7 per provider, so it can be a lot.
There was a problem hiding this comment.
Do zync updates happen regularly in a normal situation?
Btw the cache survival for 1 hour is not a hard guarantee.
|
I left some random questions... |
One can't be 100% sure of not missing something but I considered a lot of scenarios. Maybe you can discuss with the AI to try to figure out what I missed. |
What this PR does / why we need it:
The issue description explains the situation, but summarizing, #4236 broke the integration with zync. Porta calls
PUT /tenantendpoint on zync and pushes the access token to Zync, which will use that access token later to pull data from porta. After #4236, now the access token is sent hashed to zync, which is useless as authentication method, so zync can't retrieve anymore data from porta.To solve that, this PR implements token rotation every hour, that way we mitigate possible zync DB leaks containing plaintext tokens.
The rotation process implies a small race condition window between the moment the old token is expired and the new one is received by zync and available for next requests. For that reason, old tokens are not set to expire immediately or deleted, instead, they are set to expire 1 day after being discarded.
This way, even in the worse scenario when the zync queue gets stuck for some reason and doesn't process jobs, and also it holds a discarded token in the DB, it still has one complete day to recover.
In order to further mitigate the leaking problem, our plan is to make changes in Zync to implement client token encryption.
In order to further mitigate the race condition problems when rotating, we also include some caching that forces the rotation to happen once per hour, and includes protection against cache stampede when many processes get a cache miss concurrently at the same time. This is better explained in the in-code comments below.
Besides, we also plan to implement retry logic for auth errors in Zync.
Finally, the PR also includes some changes in the Janitor to purge all discarded tokens once per week.
Replaces #4304 and #4309
Which issue(s) this PR fixes
https://redhat.atlassian.net/browse/THREESCALE-14969
Verification steps
Test should pass. Also, just work normally with porta + zync and verify everuthing works as expected, creating domains, pushing apps to Keycloak, etc.