Skip to content

Add timeout and retries to git operations#2133

Open
b-deam wants to merge 9 commits into
elastic:masterfrom
b-deam:add-subprocess-timeout
Open

Add timeout and retries to git operations#2133
b-deam wants to merge 9 commits into
elastic:masterfrom
b-deam:add-subprocess-timeout

Conversation

@b-deam

@b-deam b-deam commented May 29, 2026

Copy link
Copy Markdown
Member

A stalled subprocess (observed during a git clone of rally-tracks) can block indefinitely because subprocess.communicate() is invoked without a timeout.

This commit adds an optional timeout kwarg to run_subprocess_with_logging that kills the child process on timeout expiry, drains output, and returns the (non-zero) child return code. Additionally, we ensure all networked git operations now pass a 600s timeout to avoid blocking indefinitely, and are retried in the event of failure.

See example stack trace and ~2.5h git operation hang here
$ sudo py-spy dump --pid 6272
[...]
Thread 6272 (idle): "MainThread"
    communicate (subprocess.py:1196)
    run_subprocess_with_logging (esrally/utils/process.py:112)
    clone (esrally/utils/git.py:72)
    __init__ (esrally/utils/repo.py:44)
    __init__ (esrally/track/loader.py:406)
    track_repo (esrally/track/loader.py:369)
    load_track (esrally/track/loader.py:252)
    setup (esrally/racecontrol.py:235)
    receiveMsg_Setup (esrally/racecontrol.py:114)
    guard (esrally/actor.py:96)
    receiveMessage (thespian/actors.py:840)
    _handleOneMessage (thespian/system/actorManager.py:164)
    handleMessages (thespian/system/actorManager.py:122)
    _runWithExpiry (thespian/system/transport/TCPTransport.py:1337)
    _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80)
    run (thespian/system/transport/wakeupTransportBase.py:71)
    run (thespian/system/actorManager.py:88)
    startChild (thespian/system/multiprocCommon.py:696)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _launch (multiprocessing/popen_fork.py:71)
    __init__ (multiprocessing/popen_fork.py:19)
    _Popen (multiprocessing/context.py:281)
    start (multiprocessing/process.py:121)
    _startChildActor (thespian/system/multiprocCommon.py:358)
    h_PendingActor (thespian/system/admin/adminCore.py:318)
    h_PendingActor (thespian/system/admin/globalNames.py:19)
    handleIncoming (thespian/system/admin/adminCore.py:114)
    _runWithExpiry (thespian/system/transport/TCPTransport.py:1337)
    _run_subtransport (thespian/system/transport/wakeupTransportBase.py:80)
    run (thespian/system/transport/wakeupTransportBase.py:71)
    run (thespian/system/admin/convention.py:757)
    startAdmin (thespian/system/multiprocCommon.py:208)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _launch (multiprocessing/popen_fork.py:71)
    __init__ (multiprocessing/popen_fork.py:19)
    _Popen (multiprocessing/context.py:281)
    start (multiprocessing/process.py:121)
    _startAdmin (thespian/system/multiprocCommon.py:105)
    __init__ (thespian/system/systemBase.py:326)
    __init__ (thespian/system/multiprocCommon.py:83)
    __init__ (thespian/system/multiprocTCPBase.py:28)
    _startupActorSys (thespian/actors.py:678)
    __init__ (thespian/actors.py:637)
    bootstrap_actor_system (esrally/actor.py:311)
    start (esrally/rallyd.py:43)
    main (esrally/rallyd.py:120)
    <module> (esrallyd:6)

$ sudo ps -fp 6273,6274,6276 -o pid,ppid,etime,stat,cmd
    PID    PPID     ELAPSED STAT CMD
   6273    6272    02:32:07 Sl   git clone https://github.com/elastic/rally-tracks /home/esbench/.rally/benchmarks/tracks/default
   6274    6273    02:32:07 S     \_ /usr/bin/ssh -o SendEnv=GIT_PROTOCOL git@github.com git-upload-pack '/elastic/rally-tracks'
   6276    6273    02:32:05 S     \_ /usr/lib/git-core/git index-pack --stdin --fix-thin --keep=fetch-pack 6273 on rally-0 --check-self-contained-and-connected

A stalled subprocess (notably git clone of track or team repos) could wedge the calling actor
indefinitely because subprocess.communicate() was invoked without a
timeout. This adds an optional timeout kwarg to
run_subprocess_with_logging that kills the child process on timeout expiry, drains
output, and returns the (non-zero) child return code. All git operations in
esrally/utils/git.py now pass a 600s timeout.
@b-deam b-deam self-assigned this May 29, 2026
Copilot AI review requested due to automatic review settings May 29, 2026 07:28
@b-deam b-deam requested a review from a team as a code owner May 29, 2026 07:28
@b-deam b-deam added the enhancement Improves the status quo label May 29, 2026
@b-deam b-deam added :internal Changes for internal, undocumented features: e.g. experimental, release scripts and removed :internal Changes for internal, undocumented features: e.g. experimental, release scripts labels May 29, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds timeout handling to subprocess execution and applies a 600s timeout to git operations to prevent Rally from hanging indefinitely on stalled git commands.

Changes:

  • Adds an optional timeout parameter to run_subprocess_with_logging().
  • Applies GIT_TIMEOUT to several git commands.
  • Adds tests for timeout handling and git clone timeout behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
esrally/utils/process.py Adds timeout handling and logging for subprocess execution.
esrally/utils/git.py Introduces GIT_TIMEOUT and passes it to git subprocess calls.
tests/utils/process_test.py Adds coverage for timeout behavior in subprocess logging.
tests/utils/git_test.py Updates assertions and adds clone timeout error coverage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread esrally/utils/process.py Outdated
Comment thread esrally/utils/git.py
@gareth-ellis

Copy link
Copy Markdown
Member

As well as the timeout, what about also adding a retry? I can see a few benchmarks the last 24-48hours or so that have hung due to git clone hanging - whilst the timeout will surface the issue quickly, the next step would then be rerun the benchmark. If we retry a number of times, that could increase our chances of success.

@pquentin

pquentin commented Jun 3, 2026

Copy link
Copy Markdown
Member

I'm happy to treat retries as a follow-up if you prefer. However, the tests are flaky. Failed inhttps://github.com/elastic/rally/actions/runs/26627364075/job/78467278902?pr=2133:

_________ test_run_subprocess_with_logging_timeout_kills_process_group _________

caplog = <_pytest.logging.LogCaptureFixture object at 0x12060c440>
tmp_path = PosixPath('/private/var/folders/p8/qyz0lmpd2mld64f_f4c66y4c0000gn/T/pytest-of-runner/pytest-0/test_run_subprocess_with_loggi0')

    def test_run_subprocess_with_logging_timeout_kills_process_group(caplog, tmp_path):
        pid_file = tmp_path / "grandchild.pid"
        # The shell backgrounds `sleep 600` in the same process group; both should die on timeout.
        cmd = f"sh -c 'sleep 600 & echo $! > {pid_file}; wait'"
        timeout = 1
    
        with caplog.at_level(logging.ERROR, logger="esrally.utils.process"):
            returncode = process.run_subprocess_with_logging(cmd, timeout=timeout)
        grandchild_pid = int(pid_file.read_text().strip())
    
        assert returncode == -signal.SIGKILL
>       assert not psutil.pid_exists(grandchild_pid), f"grandchild PID {grandchild_pid} survived process-group kill"
E       AssertionError: grandchild PID 3629 survived process-group kill
E       assert not True
E        +  where True = <function pid_exists at 0x104553e20>(3629)
E        +    where <function pid_exists at 0x104553e20> = psutil.pid_exists

tests/utils/process_test.py:225: AssertionError
------------------------------ Captured log call -------------------------------
2026-05-29 08:41:30 ERROR    esrally.utils.process:process.py:124 Subprocess [sh -c 'sleep 600 & echo $! > /private/var/folders/p8/qyz0lmpd2mld64f_f4c66y4c0000gn/T/pytest-of-runner/pytest-0/test_run_subprocess_with_loggi0/grandchild.pid; wait'] exceeded timeout of [1]s and was terminated with return code [-9].

@pquentin pquentin removed the request for review from a team June 19, 2026 12:05
@pquentin

Copy link
Copy Markdown
Member

Removing es-perf review, sorry that my last comment was not a GitHub review.

@b-deam

b-deam commented Jun 22, 2026

Copy link
Copy Markdown
Member Author

As well as the timeout, what about also adding a retry? I can see a few benchmarks the last 24-48hours or so that have hung due to git clone hanging - whilst the timeout will surface the issue quickly, the next step would then be rerun the benchmark. If we retry a number of times, that could increase our chances of success.

I think this makes sense to include now given GitHub's trending instability. I've wrapped clone and fetch git operations (network based) with a retry helper in eb446ae

@b-deam b-deam changed the title Add timeout to all git operations Add timeout and retries to git operations Jun 22, 2026
@b-deam b-deam requested review from Copilot and pquentin June 22, 2026 04:34

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread esrally/utils/process.py
Comment thread esrally/utils/process.py
Comment thread esrally/utils/git.py

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread tests/utils/process_test.py Outdated
Comment thread esrally/utils/process.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Improves the status quo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants