Use a stop flag instead of thread interrupt to shut down the logs checkpoint by pditommaso · Pull Request #7193 · nextflow-io/nextflow

pditommaso · 2026-06-01T13:57:43Z

Rationale

LogsCheckpoint runs a background thread that periodically uploads the
.nextflow.log, timeline and report files to Seqera Platform. The thread was
stopped by calling thread.interrupt() (guarded by a lock) and using the
interrupt status as the loop-exit signal.

Using interrupt this way is the wrong tool for graceful shutdown:

Wrong semantics. Thread.interrupt() is the JVM's cancellation primitive
for unblocking a thread stuck in a blocking call — not a "please wind down when
convenient" signal. Here it was overloaded to also mean "exit the loop", mixing
the wake-up mechanism with the control decision.
Fragile signal. Interrupt is a single shared status bit. saveFiles()
ultimately calls into provider/cloud SDK code (FileHelper.copyPath); any such
code that catches InterruptedException and clears the flag without re-asserting
it could swallow the stop signal, leaving the thread running.
Implicit coupling. Correctness relied on a non-obvious invariant: interrupt
was only ever called while holding lock, and saveFiles() also ran under
lock, so the interrupt could never land mid-upload. Easy to break in a later edit.

What changed

Replace interrupt-as-control-signal with an explicit volatile boolean stopped
flag coordinated through the existing intrinsic monitor (no new concurrency
primitives):

stop() sets stopped = true, notifyAll()s to wake the parked thread,
then thread.join()s. The join is deliberately outside the synchronized
block — joining while holding the monitor would deadlock, because the woken
thread must re-acquire the lock to return from wait() and exit.
run() parks in lock.wait(interval) and the loop condition is the flag
(while(!stopped) + post-wait if(stopped) break). The stopped check is
co-located with wait() under the same lock, so there is no lost-wakeup race.
Interrupt is now only handled defensively. It is no longer a control signal;
a single catch(InterruptedException) around the loop ensures that if the JVM or
external code interrupts the thread (e.g. on abrupt shutdown) it still terminates
cleanly, logs at debug, and re-asserts the interrupt flag — instead of leaking an
uncaught exception out of the thread.
stop() hardening: guarded against a null thread (if onFlowCreate failed
before starting it) and against repeated invocation (both onFlowError and
onFlowComplete can fire), making shutdown idempotent.
Removed the now-redundant await() helper.

Behaviour

Unchanged from the caller's perspective: shutdown still waits for any in-flight
saveFiles() cycle to finish (stop() blocks on the monitor / join()), and the
loop still breaks before starting a new save — no new final flush was introduced.

Tests

Added lifecycle coverage to LogsCheckpointTest:

start the thread then onFlowComplete() → thread terminates;
onFlowComplete() with no thread started → no exception.

🤖 Generated with Claude Code

…ckpoint The LogsCheckpoint background thread was terminated by calling `thread.interrupt()` and using the interrupt status as the loop-exit signal. Thread interruption is meant for cancelling blocking operations, not for graceful "please finish" signaling: it conflates the wake-up mechanism with the control decision, and any library code that swallows the interrupt flag (e.g. cloud SDK uploads) could lose the signal. Replace it with an explicit `volatile boolean stopped` flag coordinated through the existing intrinsic monitor: - `stop()` sets the flag and `notifyAll()`s to wake the thread, then joins (outside the synchronized block to avoid deadlocking the woken thread that must re-acquire the lock). - `run()` parks in `lock.wait(interval)` and exits the loop on the flag. - Interrupt is no longer a control signal; it is only handled defensively so external/JVM interruption still terminates the thread cleanly and re-asserts the flag. - `stop()` is guarded against a null thread (failed onFlowCreate) and repeated invocation (onFlowError + onFlowComplete). Add lifecycle tests covering start/stop and stop-before-start. Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>

netlify · 2026-06-01T13:57:53Z

✅ Deploy Preview for nextflow-docs-staging canceled.

Name	Link
🔨 Latest commit	`73982da`
🔍 Latest deploy log	https://app.netlify.com/projects/nextflow-docs-staging/deploys/6a1d8fdb303a640008a30658

jorgee · 2026-06-01T15:36:14Z

Flagging an overlap with #7188, which touches the same class for what turns out to be a related but distinct goal.

This PR cleanly fixes the signaling mechanism — dropping interrupt-as-control-signal in favor of a stopped flag is the right call, and the lost-wakeup reasoning is sound. But it preserves the two properties that cause the zombie-head-job hang reported in #7188:

saveFiles() still runs while holding lock:

synchronized(lock) {
    while( !stopped ) {
        lock.wait(interval.toMillis())
        if( stopped ) break
        handler.saveFiles()   // holds lock for the whole upload
    }
}

thread.join() is still unbounded, and stop() must acquire lock to signal:

synchronized(lock) {          // blocks here if the worker is hung in saveFiles()
    if( stopped ) return
    stopped = true
    lock.notifyAll()
}
thread.join()                 // no timeout

So if a cloud upload stalls on a half-open socket inside saveFiles(), the worker holds lock, stop() blocks trying to acquire it, and even past that the unbounded join() blocks forever. The worker being a daemon doesn't help, because it's the non-daemon shutdown thread parked in onFlowComplete() → join() that keeps the JVM alive. The description's note that "shutdown still waits for any in-flight saveFiles() cycle to finish" is exactly this failure mode.

#7188 addresses the liveness side: it moves saveFiles() out of any lock and bounds the join with a configurable terminateTimeout (default 120s), abandoning a genuinely-hung daemon worker so the head job exits. It also includes a regression test where saveFiles() blocks forever and shutdown still returns within the timeout — the scenario that triggered the incident, which isn't covered here.

Suggest reconciling the two: either land #7188 for the hang fix, or, if the monitor-based design is preferred here, fold in moving saveFiles() outside the lock and adding a bounded join — otherwise the shutdown deadlock remains.

Minor: stopped is read/written only under synchronized(lock), so the volatile modifier is redundant.

pditommaso · 2026-06-01T15:42:53Z

I'd argue that very hard to have a sensible timeout for thead.join since it takes the overall run executions that could span from mins to days

jorgee · 2026-06-02T07:23:48Z

But, in this case the join is just waiting for the copy of the logs, not the whole execution.

pditommaso · 2026-06-02T07:36:40Z

Got your point, think timeout makes sense

pditommaso requested review from bentsherman and jorgee and removed request for bentsherman June 1, 2026 13:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use a stop flag instead of thread interrupt to shut down the logs checkpoint#7193

Use a stop flag instead of thread interrupt to shut down the logs checkpoint#7193
pditommaso wants to merge 1 commit into
masterfrom
logs-checkpoint-graceful-shutdown

pditommaso commented Jun 1, 2026

Uh oh!

netlify Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

jorgee commented Jun 1, 2026

Uh oh!

pditommaso commented Jun 1, 2026

Uh oh!

jorgee commented Jun 2, 2026

Uh oh!

pditommaso commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pditommaso commented Jun 1, 2026

Rationale

What changed

Behaviour

Tests

Uh oh!

netlify Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for nextflow-docs-staging canceled.

Uh oh!

jorgee commented Jun 1, 2026

Uh oh!

pditommaso commented Jun 1, 2026

Uh oh!

jorgee commented Jun 2, 2026

Uh oh!

pditommaso commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

netlify Bot commented Jun 1, 2026 •

edited

Loading