Add timeout and retries to git operations #2133
A stalled subprocess (notably git clone of track or team repos) could wedge the calling actor indefinitely because subprocess.communicate() was invoked without a timeout. This adds an optional timeout kwarg to run_subprocess_with_logging that kills the child process on timeout expiry, drains output, and returns the (non-zero) child return code. All git operations in esrally/utils/git.py now pass a 600s timeout.
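The kill-and-drain behavior described above can be sketched as follows. This is a minimal illustration, not Rally's actual `run_subprocess_with_logging`: the helper name `run_with_timeout` and its return shape are assumptions made for this example. It relies only on the standard-library `subprocess.Popen.communicate(timeout=...)` call, which raises `subprocess.TimeoutExpired` but does not kill the child on its own.

```python
import subprocess
import sys


def run_with_timeout(args, timeout=None):
    """Run a command and return (returncode, combined_output).

    Hypothetical helper for illustration only. If the command exceeds
    `timeout` seconds, the child is killed and its remaining output drained.
    """
    proc = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # communicate() does not kill the child on timeout, so do it here,
        # then call communicate() again to drain buffered output and reap it.
        proc.kill()
        output, _ = proc.communicate()
    return proc.returncode, output


if __name__ == "__main__":
    # A child that sleeps far longer than the timeout is killed promptly
    # and reports a non-zero return code.
    code, _ = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], timeout=1
    )
    print("return code after timeout:", code)
```

Because the killed child's return code is non-zero, a caller that already treats non-zero as failure needs no extra branch for the timeout case.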
Pull request overview
This PR adds timeout handling to subprocess execution and applies a 600s timeout to git operations to prevent Rally from hanging indefinitely on stalled git commands.
Changes:
- Adds an optional `timeout` parameter to `run_subprocess_with_logging()`.
- Applies `GIT_TIMEOUT` to several git commands.
- Adds tests for timeout handling and git clone timeout behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| esrally/utils/process.py | Adds timeout handling and logging for subprocess execution. |
| esrally/utils/git.py | Introduces GIT_TIMEOUT and passes it to git subprocess calls. |
| tests/utils/process_test.py | Adds coverage for timeout behavior in subprocess logging. |
| tests/utils/git_test.py | Updates assertions and adds clone timeout error coverage. |
As well as the timeout, what about also adding a retry? I can see a few benchmarks in the last 24-48 hours or so that have hung due to git clone hanging. Whilst the timeout will surface the issue quickly, the next step would then be to rerun the benchmark. If we retry a number of times, that could increase our chances of success.

I'm happy to treat retries as a follow-up if you prefer. However, the tests are flaky. Failed in https://github.com/elastic/rally/actions/runs/26627364075/job/78467278902?pr=2133:

Removing es-perf review, sorry that my last comment was not a GitHub review.
I think this makes sense to include now given GitHub's trending instability. I've wrapped |
A stalled subprocess (observed during a git clone of rally-tracks) can block indefinitely because `subprocess.communicate()` is invoked without a timeout.

This commit adds an optional timeout kwarg to `run_subprocess_with_logging` that kills the child process on timeout expiry, drains output, and returns the (non-zero) child return code. Additionally, we ensure all networked git operations now pass a 600s timeout to avoid blocking indefinitely, and are retried in the event of failure.

See example stack trace and ~2.5h git operation hang here
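A retry wrapper around an operation that reports success through its return code might look roughly like the sketch below. The thread does not state how many attempts are made or whether there is a delay between them, so the `retry` helper, its `attempts` and `delay` parameters, and their defaults are all assumptions for illustration, not the code in this PR.

```python
import time


def retry(func, attempts=3, delay=0.0):
    """Call func() until it returns 0 or attempts are exhausted.

    Hypothetical helper: `attempts` and `delay` are illustrative defaults,
    not values taken from the PR. Returns the last return code.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = None
    for attempt in range(1, attempts + 1):
        code = func()
        if code == 0:
            return code
        if attempt < attempts:
            # Pause before the next try; no sleep after the final failure.
            time.sleep(delay)
    return code


if __name__ == "__main__":
    # Simulate a networked operation that fails twice and then succeeds.
    outcomes = iter([1, 1, 0])
    print("final return code:", retry(lambda: next(outcomes), attempts=3))
```

Combined with a timeout, each attempt is bounded, so the worst-case wall time is roughly `attempts * timeout` plus any delays rather than unbounded.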