Skip to content

THREESCALE-14969: Rotate OIDC tokens for Zync#4310

Merged
jlledom merged 2 commits into
masterfrom
THREESCALE-14969-zync-rotate-tokens
Jun 9, 2026
Merged

THREESCALE-14969: Rotate OIDC tokens for Zync#4310
jlledom merged 2 commits into
masterfrom
THREESCALE-14969-zync-rotate-tokens

Conversation

@jlledom

@jlledom jlledom commented May 29, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

The issue description explains the situation, but summarizing, #4236 broke the integration with zync. Porta calls PUT /tenant endpoint on zync and pushes the access token to Zync, which will use that access token later to pull data from porta. After #4236, now the access token is sent hashed to zync, which is useless as authentication method, so zync can't retrieve anymore data from porta.

To solve that, this PR implements token rotation every hour, that way we mitigate possible zync DB leaks containing plaintext tokens.

The rotation process implies a small race condition window between the moment the old token is expired and the new one is received by zync and available for next requests. For that reason, old tokens are not set to expire immediately or deleted, instead, they are set to expire 1 day after being discarded.

This way, even in the worse scenario when the zync queue gets stuck for some reason and doesn't process jobs, and also it holds a discarded token in the DB, it still has one complete day to recover.

In order to further mitigate the leaking problem, our plan is to make changes in Zync to implement client token encryption.

In order to further mitigate the race condition problems when rotating, we also include some caching that forces the rotation to happen once per hour, and includes protection against cache stampede when many processes get a cache miss concurrently at the same time. This is better explained in the in-code comments below.

Besides, we also plan to implement retry logic for auth errors in Zync.

Finally, the PR also includes some changes in the Janitor to purge all discarded tokens once per week.

Replaces #4304 and #4309

Which issue(s) this PR fixes

https://redhat.atlassian.net/browse/THREESCALE-14969

Verification steps

Test should pass. Also, just work normally with porta + zync and verify everuthing works as expected, creating domains, pushing apps to Keycloak, etc.

@jlledom jlledom self-assigned this May 29, 2026
@codecov

codecov Bot commented May 29, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.87%. Comparing base (c708373) to head (1950368).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4310      +/-   ##
==========================================
- Coverage   88.92%   88.87%   -0.06%     
==========================================
  Files        1752     1753       +1     
  Lines       44131    44146      +15     
  Branches      689      689              
==========================================
- Hits        39245    39235      -10     
- Misses       4870     4895      +25     
  Partials       16       16              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@jlledom jlledom marked this pull request as ready for review May 29, 2026 11:09
Comment thread app/models/access_token.rb Outdated
Comment on lines 150 to 173
def self.refresh_oidc_sync
user_id = scope_attributes["owner_id"]
cache_key = "access_tokens/user:#{user_id}/oidc"

# Hot path: skip the transaction entirely on cache hit (zero DB queries).
cached = Rails.cache.read(cache_key)
return cached if cached

transaction do
# Lock existing OIDC tokens to serialize concurrent cache misses (e.g. full resync).
# Workers that arrive simultaneously will queue here. On cold start (no tokens yet)
# there is nothing to lock and stampede churn is harmless — all created tokens are valid.
lock.where(name: OIDC_SYNC_TOKEN).load

# Double-check inside the transaction: a concurrent worker may have populated the cache
# while we were waiting on the row lock above.
Rails.cache.fetch(cache_key, expires_in: 1.hour) do
# Expire (not delete) the current active token so Zync can keep using it for up to
# 1 day while it picks up the new one. The janitor cleans up expired tokens weekly.
where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)
create!(name: OIDC_SYNC_TOKEN, scopes: %w[account_management], permission: 'ro').plaintext_value
end
end
end

@jlledom jlledom May 29, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method could be much more simple, like:

def self.refresh_oidc_sync
  user_id = scope_attributes["owner_id"]
  Rails.cache.fetch("access_tokens/user:#{user_id}/oidc", expires_in: 1.hour) do
    where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)
    create!(name: OIDC_SYNC_TOKEN, scopes: %w[account_management], permission: 'ro').plaintext_value
  end
end

But that would cause a non-deterministic result on concurrent cache misses. e.g. When we run the full resync task from #4307 and the cache key is invalid, it could happen that many concurrent processes enter the block at the same time and update-then-create tokens, which would en up on the last of them prevailing and expiring all previously created token. Claude explains:

Simple version (bare Rails.cache.fetch)
  
  1. A, B, C all call Rails.cache.read — all miss
  2. A enters Rails.cache.fetch — cache miss, enters block
  3. B enters Rails.cache.fetch — also cache miss (A hasn't finished yet), enters block
  4. C enters Rails.cache.fetch — also cache miss, enters block
  5. A: update_all(expires_at: 1.day.from_now) — expires the active token. Then create! — new token.
  6. B: update_all(expires_at: 1.day.from_now) — expires A's freshly created token (it has expires_at: nil). Then create! — another new token.
  7. C: same — expires B's token, creates yet another.
  8. Each process writes its own value to the cache — last writer wins.

  Result: 3 rotations, 2 wasted tokens (still valid for 1 day, so no broken auth, but unnecessary churn).

But we could also end up with multiple non-expired tokens, in this scenario:

1. A: update_all — expires old token, commits
2. B: update_all — no rows with expires_at: nil, no-op
3. A: create! — creates T1 (expires_at: nil)
4. C: update_all — expires T1
5. B: create! — creates T2 (expires_at: nil)
6. C: create! — creates T3 (expires_at: nil)

The committed code prevents this because it sets a DB lock at lock.where(name: OIDC_SYNC_TOKEN).load, and processes B and C will wait there, until A closes the transaction block. Since the cache.fetch call is inside the transaction, we ensure the cache is also written when A finished, so B and C will always get a hit when calling cache.fetch.

1. A, B, C all call Rails.cache.read — all miss
2. A enters the transaction first, runs SELECT ... FOR UPDATE — locks the OIDC token row
3. B and C enter transactions, hit SELECT ... FOR UPDATE — blocked at DB level, waiting for A's lock
4. A runs Rails.cache.fetch — cache miss, executes block: expires old token, creates new one, writes to cache. Transaction commits, lock released.
5. B gets the lock, runs SELECT ... FOR UPDATE. Then Rails.cache.fetch — cache hit (A populated it). Returns cached value. Transaction commits.
6. C same as B — cache hit, no DB writes.

Result: 1 rotation, 0 wasted tokens.

This is convenient but not really necessary since such race conditions will be rare and the result is just some wasted tokens that the Janitor will purge anyway. So if you want the simple version, I'm fine with it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also considered the :race_condition_ttl option for Rails.cache.fetch but discarded it because it only prevents the cache stampede in the short period of time of N seconds after the cache expires, in any other scenario all processes would enter the block anyway.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock increases DB query count. If we really need locking, them maybe better use redlock like we do for billing.

Given the complexity I wonder if we should avoid caching, expire the tokens and clear them with janitor every night.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll make changes to use Redlock instead. I wouldn't avoid caching because that would create a new token for each zync event, that is, a few additional DB hits per zync event, if you worry about DB hits, I think caching is needed. Also, I have a strong opinion about caching because not doing so would turn our access tokens table into a dump. The Janitor runs once per week, not per night:

WEEK = Task.map([
[Pdf::Dispatch, :weekly],
[JanitorWorker, :perform_async],
[SuspendInactiveAccountsWorker, :perform_async],
[StaleAccountWorker, :perform_async]
]).freeze

What I would accept is caching but not locking at all, then the code would look like #4310 (comment)

@akostadinov akostadinov Jun 1, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, lets try to use locking sparsely then. Only if cache doesn't exist, then try to acquire a no-wait lock, if lock acquisition fails, then wait for lock and read the cache created by the other process. Something like that I imagine. WDYT?

@jlledom jlledom Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to implement locking with Redlock and I don't think it's worth it. These are the reasons:

  1. With DB locking, waiting processes will wait indefinitely until the lock is released, at that very moment another process will take the lock. But Redis locking is based in polling: we must tweak the :retry_count and :retry_delay options in order to configure how polling behaves, which is error prone and implies higher complexity to handle all possible scenarios (e.g. what if all retries complete before the lock is released?). Also, in the best scenario, the waiting process will not be awaken the moment the lock is released but when the current delay period finishes.
  2. Redis lock requires implementing the logic to release the lock which adds more complexity to the method. With DB lock, it's the DB who holds all locking logic. It's better for us to delegate such logic to DB engine.
  3. Cost of DB locking in DB queries is not so high, because the cache miss will only happen once per hour per provider.

For those reasons I don't think the Redis lock is the proper tool for our scenario.

Comment thread app/models/access_token.rb Outdated
Comment on lines +167 to +169
# Expire (not delete) the current active token so Zync can keep using it for up to
# 1 day while it picks up the new one. The janitor cleans up expired tokens weekly.
where(name: OIDC_SYNC_TOKEN, expires_at: nil).update_all(expires_at: 1.day.from_now)

@akostadinov akostadinov May 29, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't zync pick up then new one almost immediately? 1 day to pick just he key looks excessive.

Also didn't we think to provide expiring tokens always to begin with?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zync should in most of cases run the jobs fast, but there's no warranty, the only thing that happens immediately is job enqueuing, that's why the rotated token doesn't have an expiration time, we don't know when the job is going to run and use the token.

Rotation means removing the old token, the only reason we set an expiration time instead of just removing token immediately is to mitigate the race condition that happens when a token is already rotated in porta but didn't reach zync yet, and zync sends a request at that moment using the old token.

The expiration time we set here is debatable, I explained it here: #4309 (comment)

I think a period of about 5-10 hours would be better, but you mentioned 1 day in some comment so I used that value.

Comment thread app/workers/janitor_worker.rb Outdated
Comment on lines +11 to +12
DeleteAllStaleObjectsWorker.perform_later(MessageRecipient.name, Message.name)
DeleteExpiredOIDCSyncTokensWorker.perform_async

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just reusing this one?

Suggested change
DeleteAllStaleObjectsWorker.perform_later(MessageRecipient.name, Message.name)
DeleteExpiredOIDCSyncTokensWorker.perform_async
DeleteAllStaleObjectsWorker.perform_later(MessageRecipient.name, Message.name, AccessToken.name)

Just we should define the stale scope to equal expired_oidc_sync

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jlledom jlledom force-pushed the THREESCALE-14969-zync-rotate-tokens branch from 1950368 to 50cff64 Compare June 2, 2026 08:58
@qltysh

qltysh Bot commented Jun 2, 2026

Copy link
Copy Markdown

❌ 7 blocking issues (7 total)

Tool Category Rule Count
rubocop Lint Avoid using update\_columns because it skips validations. 5
reek Lint AccessToken#self.oidc_sync calls 'Rails.cache' 2 times 2

return cached if cached

transaction do
# Lock to serialize concurrent cache misses (e.g. full resync).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(e.g. full resync)

What should this mean 😬

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

full resync (#4307) is a scenario when concurrent cache misses can happen

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the comment can be removed. It brought only confusion to me, also it is clear that the lock prevents concurrent refreshes. But leaving up to you.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to keep the comment, I think it'll help me in the future if I come back to this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:unacceptable:

Comment thread app/models/access_token.rb Outdated
create_with(scopes: %w[account_management], permission: 'ro').find_or_create_by!(name: OIDC_SYNC_TOKEN)
scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: ...Time.now.utc) }

def self.refresh_oidc_sync

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be annoying if I request to rename back to oidc_sync?

Maybe I suggested the name refresh_oidc_sync but in my version there was no cache and all calls actually refreshed the token. Now it can just read it from cache. So the name is incorrect for this usage.

Or in an extreme case might be #cached_or_refresh_oidc_sync.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be annoying if I request to rename back to oidc_sync?

Yes, but 1115be6

akostadinov
akostadinov previously approved these changes Jun 2, 2026

@akostadinov akostadinov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you!

akostadinov
akostadinov previously approved these changes Jun 2, 2026
@jlledom jlledom force-pushed the THREESCALE-14969-zync-rotate-tokens branch from 1115be6 to 8b1ec05 Compare June 2, 2026 11:24
Comment thread app/models/access_token.rb Outdated

def self.oidc_sync
create_with(scopes: %w[account_management], permission: 'ro').find_or_create_by!(name: OIDC_SYNC_TOKEN)
scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: ...Time.now.utc) }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old oidc_sync method created tokens without setting expires_at, so existing tokens in production have expires_at: NULL. The stale scope won't match them because NULL < NOW() is NULL in SQL, not TRUE.

These will stay in the access_tokens table indefinitely , should the scope also handle NULLs ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review!

Yeah, what you say makes sense, and I think we could add null to the scope. My only objection is that NULLs won't accumulate on the table, so I don't think the Janitor is the place to remove them, since they would be removed on the first janitor run and then there would never be any non-expirable OIDC token ever again. It seems to me more appropriate for a one-off removal. IMO, the best way to approach this is to add a release note to heads up customers about the legacy tokens being stale in DB and providing them with the command to perform the cleanup if they wish:

bundle exec rails runner "AccessToken.where(name: AccessToken::OIDC_SYNC_TOKEN, expires_at: nil).delete_all"

@akostadinov WDYT?

@jlledom jlledom Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude suggestion for release note:

OIDC sync tokens are now created with a 1-day expiration and rotated automatically. Legacy tokens (created before this change) remain in the database with no expiration date. They are no longer used but are not cleaned up automatically. To remove them, run:

bundle exec rails runner "AccessToken.where(name: AccessToken::OIDC_SYNC_TOKEN, expires_at: nil).delete_all"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually second the null check in the stale scope. In fact it occurred to me in a random moment I was supposed not to think about work. And then I saw Madni suggest it. That's a good way to handle it without user interaction and we can leave a comment that this condition can be removed after a couple of releases.

Or we can add a migration to add expiration time. But letting user deal with it seems unnecessary.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we done in another PR, even better.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we will be able to deploy, see things are working fine and then update existing tokens. As you decide.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, if there's two of you asking for this, here it is: b0b1b15

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we will be able to deploy, see things are working fine and then update existing tokens. As you decide.

I committed before reading this. But I think we are fine because the cron job will only launch the Janitor once per week, the night between Sunday and Monday, at 12 AM. So if we deploy this Monday or Tuesday, we still have 3 or 4 labor days to observe how everything works and rollback if something fails.

madnialihussain
madnialihussain previously approved these changes Jun 2, 2026
@jlledom jlledom dismissed stale reviews from madnialihussain and akostadinov via b0b1b15 June 3, 2026 08:24

scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: ...Time.now.utc) }
# nil covers legacy tokens created before expiration was added; can be removed after a release
scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: [...Time.now.utc, nil]) }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so cool 🫠

jlledom added 2 commits June 3, 2026 14:34
…ctsWorker

Adds AccessToken.name to the list of models cleaned by
DeleteAllStaleObjectsWorker. This allows stale OIDC sync tokens
to be automatically purged via the AccessToken.stale scope, keeping
the cleanup logic centralized.
Refactors AccessToken.oidc_sync to implement automatic token rotation
with a 1-hour cache window. Tokens now expire after 1 day, and the
stale scope (including nil expires_at for legacy tokens) enables
automatic cleanup via JanitorWorker.

Key changes:
- oidc_sync returns plaintext value directly (not the model)
- Hot-path cache read skips DB entirely (zero queries on cache hit)
- Lock serializes concurrent cache misses during full resyncs
- Stale tokens remain valid until janitor cleanup runs
- nil in stale scope covers legacy tokens created before expiration
  tracking was added (can be removed after a release)

ZyncWorker updated to use the new plaintext return value.
@jlledom jlledom force-pushed the THREESCALE-14969-zync-rotate-tokens branch from b0b1b15 to c9f9b8b Compare June 3, 2026 12:39
OIDC_SYNC_TOKEN = 'OIDC Synchronization Token'.freeze

# nil covers legacy tokens created before expiration was added; can be removed after a release
scope :stale, -> { where(name: OIDC_SYNC_TOKEN, expires_at: [...Time.now.utc, nil]) }

@mayorova mayorova Jun 3, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we find a better name for this scope please? 🙃

If I see AccessToken.stale, I would assume that it's any expired token 🙃

But it's in reality:

  • only tokens related to OIDC sync
  • including non-expirable tokens

Maybe at least smth like stale_oidc_sync ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UPDATE: ah, as I progressed with the review, I understood why you chose this name 😅
That's to make DeleteAllStaleObjectsWorker pick these objects automatically...

Mmm, OK, makes more sense now, but maybe we can at least leave a comment explaining the name choice?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stale is a pretty meaningful name IMO :) Objects that will not be used and hold no value for anybody. We do not do this to user created tokens because users may wish to see what they previously had.

If you think a comment will help, please suggest one, rather than wait for the ai to add something meaningless 👼

Comment thread app/models/access_token.rb
Comment thread app/models/access_token.rb
# Lock to serialize concurrent cache misses (e.g. full resync).
lock.where(name: OIDC_SYNC_TOKEN).load

Rails.cache.fetch(cache_key, expires_in: 1.hour) do

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I am wondering - this approach means that at least every hour (given that zync updates happen regularly) a new token will be generated. So, in a week (Janitor runs once a week, right?), there will be about 24*7 tokens generated. Just trying to think if this can be an issue 🙃 I guess not, it's not a huge amount.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually 24 x 7 per provider, so it can be a lot.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do zync updates happen regularly in a normal situation?

Btw the cache survival for 1 hour is not a hard guarantee.

@mayorova

mayorova commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

I left some random questions...
The solution seems reasonable, but I don't know... I have a feeling that I'm missing something, some corner case maybe. Maybe just overthinking.

@jlledom

jlledom commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

I left some random questions... The solution seems reasonable, but I don't know... I have a feeling that I'm missing something, some corner case maybe. Maybe just overthinking.

One can't be 100% sure of not missing something but I considered a lot of scenarios. Maybe you can discuss with the AI to try to figure out what I missed.

@jlledom jlledom merged commit c391790 into master Jun 9, 2026
21 of 25 checks passed
@jlledom jlledom deleted the THREESCALE-14969-zync-rotate-tokens branch June 9, 2026 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants