Skip to content

support background entry removal for events cache#10523

Open
gbates101 wants to merge 1 commit into
temporalio:mainfrom
gbates101:events-cache-background-eviction
Open

support background entry removal for events cache#10523
gbates101 wants to merge 1 commit into
temporalio:mainfrom
gbates101:events-cache-background-eviction

Conversation

@gbates101
Copy link
Copy Markdown

What changed?

Using what was contributed in #7902, this adds a background eviction goroutine + config for the events cache.

Why?

I think the reason this PR (and the original) are useful is how cache size is computed. The amount of heap used by the MS/events cache reaches well beyond the bytes limit set in the config (more below), which I think is partly due to how cache size is computed against the entry's serialized size. For events its historyEventCacheItemImpl.CacheSize() returning event.Size(), which is the protobuf wire size rather than the size of the unmarshalled *HistoryEvent (with its nested duration/timestamp messages, maps, etc), which is presumably much larger than its serialized bytes, so the cache's upper bound is seemingly much higher than what is set in config. This lets cache that's past its TTL eat into the rest of the heap, potentially consuming most or all of it.

In our case, we were running 2500 concurrent workflows as a benchmark (e.g. as one workflow completes another is scheduled to maintain this number) with 1GB events cache limit, and after ~6 hours of monotonic heap growth the GOMEMLIMIT of 10GB was reached for all three history service pods, with the inuse_space memory profile showing nearly all of the heap consumed by events-related call stacks. cache_usage{cache_type="events"} counter showed only ~730MB used. Rerunning the benchmark with the change in this PR enabled avoided the monotonic heap growth once the 1h TTL was reached, since stale entries were evicted at the rate new ones were created, and our 2500 concurrent workflow benchmark has now been running indefinitely at 5GB heap consumed and ~130MB cache usage per pod.

How did you test it?

  • built
  • run locally and tested manually
  • updated existing tests
  • added new unit test(s)
  • added new functional test(s)

Also benchmarked the change as mentioned above.

Potential risks

Off by default, so behavior is unchanged unless explicitly enabled, and it reuses the same background-eviction machinery already proven for the workflow cache in #7902. When enabled, the only side effect is that an evicted event is re-read from persistence if a workflow genuinely needs it again. The TTL provides the grace window for near-term reads (e.g. start/completion events read by transfer tasks shortly after close), so this should be rare. As with the workflow cache setting, it requires a service restart to take effect.

@gbates101 gbates101 requested review from a team as code owners June 4, 2026 02:55
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Jun 4, 2026

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants