THREESCALE-14969: Rotate OIDC tokens for Zync by jlledom · Pull Request #4310 · 3scale/porta

jlledom · 2026-05-29T08:37:44Z

What this PR does / why we need it:

The issue description explains the situation, but summarizing, #4236 broke the integration with zync. Porta calls PUT /tenant endpoint on zync and pushes the access token to Zync, which will use that access token later to pull data from porta. After #4236, now the access token is sent hashed to zync, which is useless as authentication method, so zync can't retrieve anymore data from porta.

To solve that, this PR implements token rotation every hour, that way we mitigate possible zync DB leaks containing plaintext tokens.

The rotation process implies a small race condition window between the moment the old token is expired and the new one is received by zync and available for next requests. For that reason, old tokens are not set to expire immediately or deleted, instead, they are set to expire 1 day after being discarded.

This way, even in the worse scenario when the zync queue gets stuck for some reason and doesn't process jobs, and also it holds a discarded token in the DB, it still has one complete day to recover.

In order to further mitigate the leaking problem, our plan is to make changes in Zync to implement client token encryption.

In order to further mitigate the race condition problems when rotating, we also include some caching that forces the rotation to happen once per hour, and includes protection against cache stampede when many processes get a cache miss concurrently at the same time. This is better explained in the in-code comments below.

Besides, we also plan to implement retry logic for auth errors in Zync.

Finally, the PR also includes some changes in the Janitor to purge all discarded tokens once per week.

Replaces #4304 and #4309

Which issue(s) this PR fixes

https://redhat.atlassian.net/browse/THREESCALE-14969

Verification steps

Test should pass. Also, just work normally with porta + zync and verify everuthing works as expected, creating domains, pushing apps to Keycloak, etc.

codecov · 2026-05-29T09:07:26Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.87%. Comparing base (c708373) to head (1950368).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4310      +/-   ##
==========================================
- Coverage   88.92%   88.87%   -0.06%     
==========================================
  Files        1752     1753       +1     
  Lines       44131    44146      +15     
  Branches      689      689              
==========================================
- Hits        39245    39235      -10     
- Misses       4870     4895      +25     
  Partials       16       16

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

jlledom · 2026-05-29T11:34:18Z

+  def self.refresh_oidc_sync
+    user_id = scope_attributes["owner_id"]
+    cache_key = "access_tokens/user:#{user_id}/oidc"
+
+    # Hot path: skip the transaction entirely on cache hit (zero DB queries).
+    cached = Rails.cache.read(cache_key)
+    return cached if cached
+
+    transaction do
+      # Lock existing OIDC tokens to serialize concurrent cache misses (e.g. full resync).
+      # Workers that arrive simultaneously will queue here. On cold start (no tokens yet)
+      # there is nothing to lock and stampede churn is harmless — all created tokens are valid.
+      lock.where(name: OIDC_SYNC_TOKEN).load
+
+      # Double-check inside the transaction: a concurrent worker may have populated the cache
+      # while we were waiting on the row lock above.
+      Rails.cache.fetch(cache_key, expires_in: 1.hour) do
+        # Expire (not delete) the current active token so Zync can keep using it for up to
+        # 1 day while it picks up the new one. The janitor cleans up expired tokens weekly.
+        where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)
+        create!(name: OIDC_SYNC_TOKEN, scopes: %w[account_management], permission: 'ro').plaintext_value
+      end
+    end
  end


This method could be much more simple, like:

def self.refresh_oidc_sync user_id = scope_attributes["owner_id"] Rails.cache.fetch("access_tokens/user:#{user_id}/oidc", expires_in: 1.hour) do where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now) create!(name: OIDC_SYNC_TOKEN, scopes: %w[account_management], permission: 'ro').plaintext_value end end

But that would cause a non-deterministic result on concurrent cache misses. e.g. When we run the full resync task from #4307 and the cache key is invalid, it could happen that many concurrent processes enter the block at the same time and update-then-create tokens, which would en up on the last of them prevailing and expiring all previously created token. Claude explains:

Simple version (bare Rails.cache.fetch) 1. A, B, C all call Rails.cache.read — all miss 2. A enters Rails.cache.fetch — cache miss, enters block 3. B enters Rails.cache.fetch — also cache miss (A hasn't finished yet), enters block 4. C enters Rails.cache.fetch — also cache miss, enters block 5. A: update_all(expires_at: 1.day.from_now) — expires the active token. Then create! — new token. 6. B: update_all(expires_at: 1.day.from_now) — expires A's freshly created token (it has expires_at: nil). Then create! — another new token. 7. C: same — expires B's token, creates yet another. 8. Each process writes its own value to the cache — last writer wins. Result: 3 rotations, 2 wasted tokens (still valid for 1 day, so no broken auth, but unnecessary churn).

But we could also end up with multiple non-expired tokens, in this scenario:

1. A: update_all — expires old token, commits 2. B: update_all — no rows with expires_at: nil, no-op 3. A: create! — creates T1 (expires_at: nil) 4. C: update_all — expires T1 5. B: create! — creates T2 (expires_at: nil) 6. C: create! — creates T3 (expires_at: nil)

The committed code prevents this because it sets a DB lock at lock.where(name: OIDC_SYNC_TOKEN).load, and processes B and C will wait there, until A closes the transaction block. Since the cache.fetch call is inside the transaction, we ensure the cache is also written when A finished, so B and C will always get a hit when calling cache.fetch.

1. A, B, C all call Rails.cache.read — all miss 2. A enters the transaction first, runs SELECT ... FOR UPDATE — locks the OIDC token row 3. B and C enter transactions, hit SELECT ... FOR UPDATE — blocked at DB level, waiting for A's lock 4. A runs Rails.cache.fetch — cache miss, executes block: expires old token, creates new one, writes to cache. Transaction commits, lock released. 5. B gets the lock, runs SELECT ... FOR UPDATE. Then Rails.cache.fetch — cache hit (A populated it). Returns cached value. Transaction commits. 6. C same as B — cache hit, no DB writes. Result: 1 rotation, 0 wasted tokens.

This is convenient but not really necessary since such race conditions will be rare and the result is just some wasted tokens that the Janitor will purge anyway. So if you want the simple version, I'm fine with it.

I also considered the :race_condition_ttl option for Rails.cache.fetch but discarded it because it only prevents the cache stampede in the short period of time of N seconds after the cache expires, in any other scenario all processes would enter the block anyway.

The lock increases DB query count. If we really need locking, them maybe better use redlock like we do for billing.

Given the complexity I wonder if we should avoid caching, expire the tokens and clear them with janitor every night.

Yes, I'll make changes to use Redlock instead. I wouldn't avoid caching because that would create a new token for each zync event, that is, a few additional DB hits per zync event, if you worry about DB hits, I think caching is needed. Also, I have a strong opinion about caching because not doing so would turn our access tokens table into a dump. The Janitor runs once per week, not per night:

porta/app/lib/three_scale/jobs.rb

Lines 106 to 111 in 066c2d1

WEEK = Task.map([

[Pdf::Dispatch, :weekly],

[JanitorWorker, :perform_async],

[SuspendInactiveAccountsWorker, :perform_async],

[StaleAccountWorker, :perform_async]

]).freeze

What I would accept is caching but not locking at all, then the code would look like #4310 (comment)

ok, lets try to use locking sparsely then. Only if cache doesn't exist, then try to acquire a no-wait lock, if lock acquisition fails, then wait for lock and read the cache created by the other process. Something like that I imagine. WDYT?

I've been trying to implement locking with Redlock and I don't think it's worth it. These are the reasons:

With DB locking, waiting processes will wait indefinitely until the lock is released, at that very moment another process will take the lock. But Redis locking is based in polling: we must tweak the :retry_count and :retry_delay options in order to configure how polling behaves, which is error prone and implies higher complexity to handle all possible scenarios (e.g. what if all retries complete before the lock is released?). Also, in the best scenario, the waiting process will not be awaken the moment the lock is released but when the current delay period finishes.

Redis lock requires implementing the logic to release the lock which adds more complexity to the method. With DB lock, it's the DB who holds all locking logic. It's better for us to delegate such logic to DB engine.

Cost of DB locking in DB queries is not so high, because the cache miss will only happen once per hour per provider.

For those reasons I don't think the Redis lock is the proper tool for our scenario.

akostadinov · 2026-05-29T13:47:16Z

+        # Expire (not delete) the current active token so Zync can keep using it for up to
+        # 1 day while it picks up the new one. The janitor cleans up expired tokens weekly.
+        where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)


Doesn't zync pick up then new one almost immediately? 1 day to pick just he key looks excessive.

Also didn't we think to provide expiring tokens always to begin with?

Zync should in most of cases run the jobs fast, but there's no warranty, the only thing that happens immediately is job enqueuing, that's why the rotated token doesn't have an expiration time, we don't know when the job is going to run and use the token.

Rotation means removing the old token, the only reason we set an expiration time instead of just removing token immediately is to mitigate the race condition that happens when a token is already rotated in porta but didn't reach zync yet, and zync sends a request at that moment using the old token.

The expiration time we set here is debatable, I explained it here: #4309 (comment)

I think a period of about 5-10 hours would be better, but you mentioned 1 day in some comment so I used that value.

akostadinov · 2026-06-01T16:00:08Z

    DeleteAllStaleObjectsWorker.perform_later(MessageRecipient.name, Message.name)
+    DeleteExpiredOIDCSyncTokensWorker.perform_async


How about just reusing this one?

Suggested change

DeleteAllStaleObjectsWorker.perform_later(MessageRecipient.name, Message.name)

DeleteExpiredOIDCSyncTokensWorker.perform_async

DeleteAllStaleObjectsWorker.perform_later(MessageRecipient.name, Message.name, AccessToken.name)

Just we should define the stale scope to equal expired_oidc_sync

qltysh · 2026-06-02T09:00:13Z

❌ 7 blocking issues (7 total)

Tool	Category	Rule	Count
rubocop	Lint	Avoid using `update\_columns` because it skips validations.	5	❌
reek	Lint	AccessToken#self.oidc_sync calls 'Rails.cache' 2 times	2	❌

akostadinov · 2026-06-02T09:14:31Z

+    return cached if cached
+
+    transaction do
+      # Lock to serialize concurrent cache misses (e.g. full resync).


(e.g. full resync)

What should this mean 😬

full resync (#4307) is a scenario when concurrent cache misses can happen

IMO the comment can be removed. It brought only confusion to me, also it is clear that the lock prevents concurrent refreshes. But leaving up to you.

I prefer to keep the comment, I think it'll help me in the future if I come back to this.

:unacceptable:

akostadinov · 2026-06-02T09:42:01Z

-    create_with(scopes: %w[account_management], permission: 'ro').find_or_create_by!(name: OIDC_SYNC_TOKEN)
+  scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: ...Time.now.utc) }
+
+  def self.refresh_oidc_sync


Will it be annoying if I request to rename back to oidc_sync?

Maybe I suggested the name refresh_oidc_sync but in my version there was no cache and all calls actually refreshed the token. Now it can just read it from cache. So the name is incorrect for this usage.

Or in an extreme case might be #cached_or_refresh_oidc_sync.

Will it be annoying if I request to rename back to oidc_sync?

Yes, but 1115be6

akostadinov

Looks great, thank you!

madnialihussain · 2026-06-02T10:48:36Z


-  def self.oidc_sync
-    create_with(scopes: %w[account_management], permission: 'ro').find_or_create_by!(name: OIDC_SYNC_TOKEN)
+  scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: ...Time.now.utc) }


The old oidc_sync method created tokens without setting expires_at, so existing tokens in production have expires_at: NULL. The stale scope won't match them because NULL < NOW() is NULL in SQL, not TRUE.

These will stay in the access_tokens table indefinitely , should the scope also handle NULLs ?

Thanks for your review!

Yeah, what you say makes sense, and I think we could add null to the scope. My only objection is that NULLs won't accumulate on the table, so I don't think the Janitor is the place to remove them, since they would be removed on the first janitor run and then there would never be any non-expirable OIDC token ever again. It seems to me more appropriate for a one-off removal. IMO, the best way to approach this is to add a release note to heads up customers about the legacy tokens being stale in DB and providing them with the command to perform the cleanup if they wish:

bundle exec rails runner "AccessToken.where(name: AccessToken::OIDC_SYNC_TOKEN, expires_at: nil).delete_all"

@akostadinov WDYT?

Claude suggestion for release note:

OIDC sync tokens are now created with a 1-day expiration and rotated automatically. Legacy tokens (created before this change) remain in the database with no expiration date. They are no longer used but are not cleaned up automatically. To remove them, run:

bundle exec rails runner "AccessToken.where(name: AccessToken::OIDC_SYNC_TOKEN, expires_at: nil).delete_all"

I actually second the null check in the stale scope. In fact it occurred to me in a random moment I was supposed not to think about work. And then I saw Madni suggest it. That's a good way to handle it without user interaction and we can leave a comment that this condition can be removed after a couple of releases.

Or we can add a migration to add expiration time. But letting user deal with it seems unnecessary.

Can we done in another PR, even better.

Because we will be able to deploy, see things are working fine and then update existing tokens. As you decide.

OK, if there's two of you asking for this, here it is: b0b1b15

Because we will be able to deploy, see things are working fine and then update existing tokens. As you decide.

I committed before reading this. But I think we are fine because the cron job will only launch the Janitor once per week, the night between Sunday and Monday, at 12 AM. So if we deploy this Monday or Tuesday, we still have 3 or 4 labor days to observe how everything works and rollback if something fails.

akostadinov · 2026-06-03T10:03:14Z


-  scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: ...Time.now.utc) }
+  # nil covers legacy tokens created before expiration was added; can be removed after a release
+  scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: [...Time.now.utc, nil]) }


so cool 🫠

…ctsWorker Adds AccessToken.name to the list of models cleaned by DeleteAllStaleObjectsWorker. This allows stale OIDC sync tokens to be automatically purged via the AccessToken.stale scope, keeping the cleanup logic centralized.

Refactors AccessToken.oidc_sync to implement automatic token rotation with a 1-hour cache window. Tokens now expire after 1 day, and the stale scope (including nil expires_at for legacy tokens) enables automatic cleanup via JanitorWorker. Key changes: - oidc_sync returns plaintext value directly (not the model) - Hot-path cache read skips DB entirely (zero queries on cache hit) - Lock serializes concurrent cache misses during full resyncs - Stale tokens remain valid until janitor cleanup runs - nil in stale scope covers legacy tokens created before expiration tracking was added (can be removed after a release) ZyncWorker updated to use the new plaintext return value.

mayorova · 2026-06-03T14:43:40Z

  OIDC_SYNC_TOKEN = 'OIDC Synchronization Token'.freeze

+  # nil covers legacy tokens created before expiration was added; can be removed after a release
+  scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: [...Time.now.utc, nil]) }


Can we find a better name for this scope please? 🙃

If I see AccessToken.stale, I would assume that it's any expired token 🙃

But it's in reality:

only tokens related to OIDC sync

including non-expirable tokens

Maybe at least smth like stale_oidc_sync ?

UPDATE: ah, as I progressed with the review, I understood why you chose this name 😅
That's to make DeleteAllStaleObjectsWorker pick these objects automatically...

Mmm, OK, makes more sense now, but maybe we can at least leave a comment explaining the name choice?

stale is a pretty meaningful name IMO :) Objects that will not be used and hold no value for anybody. We do not do this to user created tokens because users may wish to see what they previously had.

If you think a comment will help, please suggest one, rather than wait for the ai to add something meaningless 👼

mayorova · 2026-06-03T15:54:12Z

+      # Lock to serialize concurrent cache misses (e.g. full resync).
+      lock.where(name: OIDC_SYNC_TOKEN).load
+
+      Rails.cache.fetch(cache_key, expires_in: 1.hour) do


Also, I am wondering - this approach means that at least every hour (given that zync updates happen regularly) a new token will be generated. So, in a week (Janitor runs once a week, right?), there will be about 24*7 tokens generated. Just trying to think if this can be an issue 🙃 I guess not, it's not a huge amount.

It's actually 24 x 7 per provider, so it can be a lot.

Do zync updates happen regularly in a normal situation?

Btw the cache survival for 1 hour is not a hard guarantee.

mayorova · 2026-06-03T15:54:54Z

I left some random questions...
The solution seems reasonable, but I don't know... I have a feeling that I'm missing something, some corner case maybe. Maybe just overthinking.

jlledom · 2026-06-04T11:39:10Z

I left some random questions... The solution seems reasonable, but I don't know... I have a feeling that I'm missing something, some corner case maybe. Maybe just overthinking.

One can't be 100% sure of not missing something but I considered a lot of scenarios. Maybe you can discuss with the AI to try to figure out what I missed.

jlledom self-assigned this May 29, 2026

jlledom requested review from akostadinov, madnialihussain and mayorova May 29, 2026 11:09

jlledom marked this pull request as ready for review May 29, 2026 11:09

This was referenced May 29, 2026

THREESCALE-14969: Allow authentication by zync token #4304

Closed

zync needs plaintext tokens #4309

Closed

jlledom commented May 29, 2026

View reviewed changes

akostadinov reviewed May 29, 2026

View reviewed changes

akostadinov reviewed Jun 1, 2026

View reviewed changes

jlledom force-pushed the THREESCALE-14969-zync-rotate-tokens branch from 1950368 to 50cff64 Compare June 2, 2026 08:58

akostadinov reviewed Jun 2, 2026

View reviewed changes

akostadinov previously approved these changes Jun 2, 2026

View reviewed changes

jlledom dismissed akostadinov’s stale review via 1115be6 June 2, 2026 10:31

akostadinov previously approved these changes Jun 2, 2026

View reviewed changes

jlledom force-pushed the THREESCALE-14969-zync-rotate-tokens branch from 1115be6 to 8b1ec05 Compare June 2, 2026 11:24

madnialihussain reviewed Jun 2, 2026

View reviewed changes

madnialihussain previously approved these changes Jun 2, 2026

View reviewed changes

jlledom dismissed stale reviews from madnialihussain and akostadinov via b0b1b15 June 3, 2026 08:24

akostadinov reviewed Jun 3, 2026

View reviewed changes

akostadinov approved these changes Jun 3, 2026

View reviewed changes

madnialihussain approved these changes Jun 3, 2026

View reviewed changes

jlledom added 2 commits June 3, 2026 14:34

jlledom force-pushed the THREESCALE-14969-zync-rotate-tokens branch from b0b1b15 to c9f9b8b Compare June 3, 2026 12:39

mayorova reviewed Jun 3, 2026

View reviewed changes

madnialihussain mentioned this pull request Jun 3, 2026

THREESCALE-12102: Add token_exchange_enabled OIDC flow toggle #4313

Open

mayorova reviewed Jun 3, 2026

View reviewed changes

Comment thread app/models/access_token.rb

mayorova reviewed Jun 3, 2026

View reviewed changes

Comment thread app/models/access_token.rb

mayorova reviewed Jun 3, 2026

View reviewed changes

mayorova approved these changes Jun 8, 2026

View reviewed changes

jlledom merged commit c391790 into master Jun 9, 2026
21 of 25 checks passed

jlledom deleted the THREESCALE-14969-zync-rotate-tokens branch June 9, 2026 08:30

	WEEK = Task.map([
	[Pdf::Dispatch, :weekly],
	[JanitorWorker, :perform_async],
	[SuspendInactiveAccountsWorker, :perform_async],
	[StaleAccountWorker, :perform_async]
	]).freeze

		DeleteAllStaleObjectsWorker.perform_later(MessageRecipient.name, Message.name)
		DeleteExpiredOIDCSyncTokensWorker.perform_async

Conversation

jlledom commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jlledom May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akostadinov Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jlledom Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akostadinov May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

qltysh Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akostadinov left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jlledom Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mayorova Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jlledom commented May 29, 2026 •

edited

Loading

codecov Bot commented May 29, 2026 •

edited

Loading

jlledom May 29, 2026 •

edited

Loading

akostadinov Jun 1, 2026 •

edited

Loading

jlledom Jun 1, 2026 •

edited

Loading

akostadinov May 29, 2026 •

edited

Loading

qltysh Bot commented Jun 2, 2026 •

edited

Loading

jlledom Jun 2, 2026 •

edited

Loading

mayorova Jun 3, 2026 •

edited

Loading