Skip to content

Use a stop flag instead of thread interrupt to shut down the logs checkpoint#7193

Open
pditommaso wants to merge 1 commit into
masterfrom
logs-checkpoint-graceful-shutdown
Open

Use a stop flag instead of thread interrupt to shut down the logs checkpoint#7193
pditommaso wants to merge 1 commit into
masterfrom
logs-checkpoint-graceful-shutdown

Conversation

@pditommaso
Copy link
Copy Markdown
Member

Rationale

LogsCheckpoint runs a background thread that periodically uploads the
.nextflow.log, timeline and report files to Seqera Platform. The thread was
stopped by calling thread.interrupt() (guarded by a lock) and using the
interrupt status as the loop-exit signal.

Using interrupt this way is the wrong tool for graceful shutdown:

  • Wrong semantics. Thread.interrupt() is the JVM's cancellation primitive
    for unblocking a thread stuck in a blocking call — not a "please wind down when
    convenient" signal. Here it was overloaded to also mean "exit the loop", mixing
    the wake-up mechanism with the control decision.
  • Fragile signal. Interrupt is a single shared status bit. saveFiles()
    ultimately calls into provider/cloud SDK code (FileHelper.copyPath); any such
    code that catches InterruptedException and clears the flag without re-asserting
    it could swallow the stop signal, leaving the thread running.
  • Implicit coupling. Correctness relied on a non-obvious invariant: interrupt
    was only ever called while holding lock, and saveFiles() also ran under
    lock, so the interrupt could never land mid-upload. Easy to break in a later edit.

What changed

Replace interrupt-as-control-signal with an explicit volatile boolean stopped
flag coordinated through the existing intrinsic monitor (no new concurrency
primitives):

  • stop() sets stopped = true, notifyAll()s to wake the parked thread,
    then thread.join()s. The join is deliberately outside the synchronized
    block — joining while holding the monitor would deadlock, because the woken
    thread must re-acquire the lock to return from wait() and exit.
  • run() parks in lock.wait(interval) and the loop condition is the flag
    (while(!stopped) + post-wait if(stopped) break). The stopped check is
    co-located with wait() under the same lock, so there is no lost-wakeup race.
  • Interrupt is now only handled defensively. It is no longer a control signal;
    a single catch(InterruptedException) around the loop ensures that if the JVM or
    external code interrupts the thread (e.g. on abrupt shutdown) it still terminates
    cleanly, logs at debug, and re-asserts the interrupt flag — instead of leaking an
    uncaught exception out of the thread.
  • stop() hardening: guarded against a null thread (if onFlowCreate failed
    before starting it) and against repeated invocation (both onFlowError and
    onFlowComplete can fire), making shutdown idempotent.
  • Removed the now-redundant await() helper.

Behaviour

Unchanged from the caller's perspective: shutdown still waits for any in-flight
saveFiles() cycle to finish (stop() blocks on the monitor / join()), and the
loop still breaks before starting a new save — no new final flush was introduced.

Tests

Added lifecycle coverage to LogsCheckpointTest:

  • start the thread then onFlowComplete() → thread terminates;
  • onFlowComplete() with no thread started → no exception.

🤖 Generated with Claude Code

…ckpoint

The LogsCheckpoint background thread was terminated by calling
`thread.interrupt()` and using the interrupt status as the loop-exit
signal. Thread interruption is meant for cancelling blocking operations,
not for graceful "please finish" signaling: it conflates the wake-up
mechanism with the control decision, and any library code that swallows
the interrupt flag (e.g. cloud SDK uploads) could lose the signal.

Replace it with an explicit `volatile boolean stopped` flag coordinated
through the existing intrinsic monitor:

- `stop()` sets the flag and `notifyAll()`s to wake the thread, then
  joins (outside the synchronized block to avoid deadlocking the woken
  thread that must re-acquire the lock).
- `run()` parks in `lock.wait(interval)` and exits the loop on the flag.
- Interrupt is no longer a control signal; it is only handled
  defensively so external/JVM interruption still terminates the thread
  cleanly and re-asserts the flag.
- `stop()` is guarded against a null thread (failed onFlowCreate) and
  repeated invocation (onFlowError + onFlowComplete).

Add lifecycle tests covering start/stop and stop-before-start.

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@netlify
Copy link
Copy Markdown

netlify Bot commented Jun 1, 2026

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 73982da
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/6a1d8fdb303a640008a30658

@pditommaso pditommaso requested review from bentsherman and jorgee and removed request for bentsherman June 1, 2026 13:59
@jorgee
Copy link
Copy Markdown
Contributor

jorgee commented Jun 1, 2026

Flagging an overlap with #7188, which touches the same class for what turns out to be a related but distinct goal.

This PR cleanly fixes the signaling mechanism — dropping interrupt-as-control-signal in favor of a stopped flag is the right call, and the lost-wakeup reasoning is sound. But it preserves the two properties that cause the zombie-head-job hang reported in #7188:

  1. saveFiles() still runs while holding lock:

    synchronized(lock) {
        while( !stopped ) {
            lock.wait(interval.toMillis())
            if( stopped ) break
            handler.saveFiles()   // holds lock for the whole upload
        }
    }
  2. thread.join() is still unbounded, and stop() must acquire lock to signal:

    synchronized(lock) {          // blocks here if the worker is hung in saveFiles()
        if( stopped ) return
        stopped = true
        lock.notifyAll()
    }
    thread.join()                 // no timeout

So if a cloud upload stalls on a half-open socket inside saveFiles(), the worker holds lock, stop() blocks trying to acquire it, and even past that the unbounded join() blocks forever. The worker being a daemon doesn't help, because it's the non-daemon shutdown thread parked in onFlowComplete()join() that keeps the JVM alive. The description's note that "shutdown still waits for any in-flight saveFiles() cycle to finish" is exactly this failure mode.

#7188 addresses the liveness side: it moves saveFiles() out of any lock and bounds the join with a configurable terminateTimeout (default 120s), abandoning a genuinely-hung daemon worker so the head job exits. It also includes a regression test where saveFiles() blocks forever and shutdown still returns within the timeout — the scenario that triggered the incident, which isn't covered here.

Suggest reconciling the two: either land #7188 for the hang fix, or, if the monitor-based design is preferred here, fold in moving saveFiles() outside the lock and adding a bounded join — otherwise the shutdown deadlock remains.

Minor: stopped is read/written only under synchronized(lock), so the volatile modifier is redundant.

@pditommaso
Copy link
Copy Markdown
Member Author

I'd argue that very hard to have a sensible timeout for thead.join since it takes the overall run executions that could span from mins to days

@jorgee
Copy link
Copy Markdown
Contributor

jorgee commented Jun 2, 2026

But, in this case the join is just waiting for the copy of the logs, not the whole execution.

@pditommaso
Copy link
Copy Markdown
Member Author

Got your point, think timeout makes sense

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants