support background entry removal for events cache#10523
Open
gbates101 wants to merge 1 commit into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changed?
Using what was contributed in #7902, this adds a background eviction goroutine + config for the events cache.
Why?
I think the reason this PR (and the original) are useful is how cache size is computed. The amount of heap used by the MS/events cache reaches well beyond the bytes limit set in the config (more below), which I think is partly due to how cache size is computed against the entry's serialized size. For events its
historyEventCacheItemImpl.CacheSize()returningevent.Size(), which is the protobuf wire size rather than the size of the unmarshalled*HistoryEvent(with its nested duration/timestamp messages, maps, etc), which is presumably much larger than its serialized bytes, so the cache's upper bound is seemingly much higher than what is set in config. This lets cache that's past its TTL eat into the rest of the heap, potentially consuming most or all of it.In our case, we were running 2500 concurrent workflows as a benchmark (e.g. as one workflow completes another is scheduled to maintain this number) with 1GB events cache limit, and after ~6 hours of monotonic heap growth the GOMEMLIMIT of 10GB was reached for all three history service pods, with the inuse_space memory profile showing nearly all of the heap consumed by events-related call stacks.
cache_usage{cache_type="events"}counter showed only ~730MB used. Rerunning the benchmark with the change in this PR enabled avoided the monotonic heap growth once the 1h TTL was reached, since stale entries were evicted at the rate new ones were created, and our 2500 concurrent workflow benchmark has now been running indefinitely at 5GB heap consumed and ~130MB cache usage per pod.How did you test it?
Also benchmarked the change as mentioned above.
Potential risks
Off by default, so behavior is unchanged unless explicitly enabled, and it reuses the same background-eviction machinery already proven for the workflow cache in #7902. When enabled, the only side effect is that an evicted event is re-read from persistence if a workflow genuinely needs it again. The TTL provides the grace window for near-term reads (e.g. start/completion events read by transfer tasks shortly after close), so this should be rare. As with the workflow cache setting, it requires a service restart to take effect.