fix: stop unbounded event-bus RuntimeState recorder leak on long-lived processes #6056
mattatcha wants to merge 1 commit into
…leak

The process-global event bus recorded every emitted event into a RuntimeState (entity `root` list + `event_record`) on every kickoff, unconditionally and with no eviction. Only the checkpoint/replay machinery ever reads that recorder, so for the common "construct a Flow/Crew, kickoff, discard" pattern it grew ~linearly with kickoff count until a long-lived process (worker, request handler, scheduler) was OOM-killed.

Gate recording behind an armed flag: the bus only registers entities and records events once recording is enabled, which happens when a CheckpointConfig is resolved on a Crew/Flow/Agent or when a state is restored via set_runtime_state(). Plain and @persist kickoff loops now record nothing; checkpoint/replay behavior is unchanged.

Also expose a public reset_runtime_state() so embedders that checkpoint but never replay in-process can bound memory between runs.
Walkthrough
This PR adds an explicit recording gate to the event bus singleton that prevents unbounded RuntimeState growth in long-lived processes. Recording defaults to disabled and arms only when checkpointing is configured or explicitly enabled; it can be reset while staying armed for subsequent runs.
Changes
Event Bus Recording Gate and Checkpoint Integration
Estimated code review effort: 3 (Moderate), ~22 minutes
Pre-merge checks: 5 passed
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 654e738.
    # between the check and the deref.
    state = self._runtime_state
    if state is not None:
        state.event_record.add(event)
Persistence resume skips event replay
High Severity
Gating _register_source and _record_event on _recording_enabled stops the process-global bus from building RuntimeState for flows that use persistence but not checkpointing. _replay_recorded_events on resume still reads that record; when it is missing, replay returns immediately and completed-step MethodExecution* events are not dispatched.
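The failure mode Bugbot describes can be illustrated with a toy replay function. This is only a sketch of the reported concern, not crewAI's code: `replay_recorded_events`, its signature, and the event names in the usage note are hypothetical stand-ins for the private `_replay_recorded_events` and the `MethodExecution*` events mentioned above.

```python
from typing import Callable, List, Optional


def replay_recorded_events(
    record: Optional[List[str]],
    dispatch: Callable[[str], None],
) -> int:
    """Re-dispatch recorded events and return how many were replayed.

    Hypothetical model: when the record is missing (recording was never
    armed), the function returns immediately and dispatches nothing.
    """
    if not record:
        return 0  # missing or empty record: nothing to replay
    for event in record:
        dispatch(event)
    return len(record)
```

With `seen = []`, calling `replay_recorded_events(None, seen.append)` leaves `seen` empty, while passing a populated list such as `["step_a started", "step_a finished"]` dispatches both entries. If the report is accurate, a `@persist`-only flow would hit the first case on resume because the gate never armed the recorder.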


Summary
The process-global event bus (`crewai_event_bus`) recorded every emitted event into a `RuntimeState` (the entity `root` list plus an `event_record`) on every kickoff, unconditionally and with no eviction. Only the checkpoint/replay machinery ever reads that recorder, so for the common "construct a Flow/Crew, `kickoff()`, discard" pattern it grew ~linearly with kickoff count until a long-lived process (worker, request handler, scheduler) was OOM-killed.

Key Changes
- Gate recording behind an armed flag: the bus only registers entities and records events once recording is enabled, which happens when a `CheckpointConfig` is resolved on a Crew/Flow/Agent, or when a state is restored via `set_runtime_state()` (checkpoint restore / fork). Plain and `@persist` kickoff loops now record nothing.
- Expose a public `crewai_event_bus.reset_runtime_state()` so embedders that checkpoint but never replay in-process can bound memory between runs.

Checkpoint/replay behavior is unchanged: the recorder is still populated whenever checkpointing is configured.
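The gating idea can be sketched with a toy bus. This is a minimal model under stated assumptions, not the crewAI implementation: `ToyEventBus`, `ToyRuntimeState`, `emit`, and `run_kickoffs` are hypothetical names, and only the semantics of `enable_recording()`, `set_runtime_state()`, and `reset_runtime_state()` follow this PR's description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyRuntimeState:
    """Hypothetical stand-in for RuntimeState: entities plus an event record."""
    root: list = field(default_factory=list)
    event_record: list = field(default_factory=list)


class ToyEventBus:
    """Toy model of a process-global bus whose recorder is gated by a flag."""

    def __init__(self) -> None:
        self._recording_enabled = False  # recording is off by default
        self._runtime_state: Optional[ToyRuntimeState] = None

    def enable_recording(self) -> None:
        # In the PR this is triggered when a CheckpointConfig is resolved.
        self._recording_enabled = True

    def set_runtime_state(self, state: ToyRuntimeState) -> None:
        # Restoring a state also arms recording.
        self._runtime_state = state
        self._recording_enabled = True

    def reset_runtime_state(self) -> None:
        # Drop recorded data but stay armed for the next run.
        self._runtime_state = None

    def emit(self, source: object, event: str) -> None:
        if not self._recording_enabled:
            return  # unarmed: nothing is registered or recorded
        if self._runtime_state is None:
            self._runtime_state = ToyRuntimeState()
        state = self._runtime_state
        if source not in state.root:
            state.root.append(source)
        state.event_record.append(event)


def run_kickoffs(bus: ToyEventBus, n: int) -> int:
    """Emit one event per simulated kickoff and return the recorder size."""
    for i in range(n):
        bus.emit(source=f"flow-{i}", event=f"kickoff-{i}")
    state = bus._runtime_state
    return 0 if state is None else len(state.event_record)
```

On an unarmed bus, `run_kickoffs(bus, 100)` records nothing and leaves the state unset. After `enable_recording()`, the record grows by one entry per kickoff until `reset_runtime_state()` clears it, which is how a long-lived embedder could bound memory between runs.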
Repro
Tests
RuntimeState.
Note
Medium Risk
Touches global singleton event-bus behavior for checkpoint/replay recording; incorrect gating could break checkpoints or resume, but scoped to recording paths with new regression tests.
Overview
Fixes an unbounded memory leak on the process-global `crewai_event_bus`: it no longer records every kickoff's entities and events into `RuntimeState` unless recording is explicitly armed.

Recording gate: A new `_recording_enabled` flag defaults to off. `_register_source` and `_record_event` become no-ops until armed via `enable_recording()` or `set_runtime_state()`. Checkpoint handler registration now calls `enable_recording()` when a `CheckpointConfig` is first resolved, so checkpoint/replay behavior stays the same for configured runs.

Memory control: Adds public `reset_runtime_state()` to clear the attached `RuntimeState` and entity id set between runs while leaving recording armed; intended for long-lived embedders that checkpoint but don't replay in-process. `_record_event` reads `_runtime_state` once to tolerate concurrent resets.

Tests: Replay test arms recording explicitly; new `TestRecordingGate` and `_isolated_recording_state()` assert plain flows leave `runtime_state` as `None` and armed flows still populate the recorder.
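The single read of `_runtime_state` mentioned above can be sketched as follows. This is an illustrative model only: `ToyBus`, `ToyState`, `attach`, and `stress` are hypothetical, and the toy record is a plain list with `append` rather than the real `event_record.add(event)`. A reset running on another thread can set the attribute to `None`, but it cannot invalidate a local reference that was already taken.

```python
import threading
from typing import List, Optional


class ToyState:
    def __init__(self) -> None:
        self.event_record: List[str] = []


class ToyBus:
    def __init__(self) -> None:
        self._runtime_state: Optional[ToyState] = ToyState()

    def reset_runtime_state(self) -> None:
        self._runtime_state = None

    def attach(self) -> None:
        # Hypothetical helper: attach a fresh state for the next run.
        self._runtime_state = ToyState()

    def record_event(self, event: str) -> None:
        # Read the attribute exactly once. Checking self._runtime_state and
        # then dereferencing it again could observe None in between.
        state = self._runtime_state
        if state is not None:
            state.event_record.append(event)


def stress(bus: ToyBus, rounds: int = 20000) -> List[Exception]:
    """Record and reset concurrently; return any exceptions raised."""
    errors: List[Exception] = []

    def writer() -> None:
        try:
            for i in range(rounds):
                bus.record_event(f"event-{i}")
        except Exception as exc:  # collected so the caller can inspect it
            errors.append(exc)

    def resetter() -> None:
        for _ in range(rounds):
            bus.reset_runtime_state()
            bus.attach()

    threads = [threading.Thread(target=writer), threading.Thread(target=resetter)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

Events recorded against a state that is reset a moment later are simply dropped with it, which matches the stated intent of clearing data between runs.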
Summary by CodeRabbit
New Features
Bug Fixes
`reset_runtime_state()` improved to safely clear recorded data while maintaining recording state.
Tests