Use a stop flag instead of thread interrupt to shut down the logs checkpoint #7193
pditommaso wants to merge 1 commit into
…ckpoint

The LogsCheckpoint background thread was terminated by calling `thread.interrupt()` and using the interrupt status as the loop-exit signal. Thread interruption is meant for cancelling blocking operations, not for graceful "please finish" signaling: it conflates the wake-up mechanism with the control decision, and any library code that swallows the interrupt flag (e.g. cloud SDK uploads) could lose the signal.

Replace it with an explicit `volatile boolean stopped` flag coordinated through the existing intrinsic monitor:

- `stop()` sets the flag and `notifyAll()`s to wake the thread, then joins (outside the synchronized block to avoid deadlocking the woken thread that must re-acquire the lock).
- `run()` parks in `lock.wait(interval)` and exits the loop on the flag.
- Interrupt is no longer a control signal; it is only handled defensively so external/JVM interruption still terminates the thread cleanly and re-asserts the flag.
- `stop()` is guarded against a null thread (failed onFlowCreate) and repeated invocation (onFlowError + onFlowComplete).

Add lifecycle tests covering start/stop and stop-before-start.

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Flagging an overlap with #7188, which touches the same class for what turns out to be a related but distinct goal. This PR cleanly fixes the signaling mechanism — dropping interrupt-as-control-signal in favor of a
So if a cloud upload stalls on a half-open socket inside #7188 addresses the liveness side: it moves Suggest reconciling the two: either land #7188 for the hang fix, or, if the monitor-based design is preferred here, fold in moving Minor: |
I'd argue that it is very hard to choose a sensible timeout for `thread.join`, since it covers the overall run execution, which could span from minutes to days.

But in this case the join is just waiting for the copy of the logs, not the whole execution.

Got your point, think timeout makes sense |
Rationale
`LogsCheckpoint` runs a background thread that periodically uploads the `.nextflow.log`, timeline and report files to Seqera Platform. The thread was stopped by calling `thread.interrupt()` (guarded by a lock) and using the interrupt status as the loop-exit signal.
Using interrupt this way is the wrong tool for graceful shutdown:

- `Thread.interrupt()` is the JVM's cancellation primitive for unblocking a thread stuck in a blocking call, not a "please wind down when convenient" signal. Here it was overloaded to also mean "exit the loop", mixing the wake-up mechanism with the control decision.
- `saveFiles()` ultimately calls into provider/cloud SDK code (`FileHelper.copyPath`); any such code that catches `InterruptedException` and clears the flag without re-asserting it could swallow the stop signal, leaving the thread running.
- `thread.interrupt()` was only ever called while holding `lock`, and `saveFiles()` also ran under `lock`, so the interrupt could never land mid-upload. Easy to break in a later edit.

What changed
Replace interrupt-as-control-signal with an explicit `volatile boolean stopped` flag coordinated through the existing intrinsic monitor (no new concurrency primitives):

- `stop()` sets `stopped = true`, `notifyAll()`s to wake the parked thread, then `thread.join()`s. The join is deliberately outside the `synchronized` block: joining while holding the monitor would deadlock, because the woken thread must re-acquire the lock to return from `wait()` and exit.
- `run()` parks in `lock.wait(interval)` and the loop condition is the flag (`while(!stopped)` + post-wait `if(stopped) break`). The `stopped` check is co-located with `wait()` under the same lock, so there is no lost-wakeup race.
- Interrupt is no longer a control signal: a single `catch(InterruptedException)` around the loop ensures that if the JVM or external code interrupts the thread (e.g. on abrupt shutdown) it still terminates cleanly, logs at debug, and re-asserts the interrupt flag, instead of leaking an uncaught exception out of the thread.
- `stop()` hardening: guarded against a `null` thread (if `onFlowCreate` failed before starting it) and against repeated invocation (both `onFlowError` and `onFlowComplete` can fire), making shutdown idempotent.
- `await()` helper.

Behaviour
Unchanged from the caller's perspective: shutdown still waits for any in-flight `saveFiles()` cycle to finish (`stop()` blocks on the monitor / `join()`), and the loop still breaks before starting a new save; no new final flush was introduced.
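The shutdown scheme described above can be illustrated with a minimal, self-contained Java sketch. It is not the code from this PR: `CheckpointSketch`, `start()`, `isAlive()` and the no-op `saveFiles()` body are hypothetical stand-ins, and the way repeated `stop()` calls are detected here is an assumption. Only the `stopped` flag, the `lock` monitor, `lock.wait(interval)`, and the join-outside-the-lock ordering are taken from the description.

```java
// Hypothetical stand-in for LogsCheckpoint; not the PR's actual code.
class CheckpointSketch implements Runnable {
    private final Object lock = new Object();
    private final long interval;          // milliseconds between save cycles
    private volatile boolean stopped;     // explicit stop flag (the control signal)
    private Thread thread;                // stays null until start() is called

    CheckpointSketch(long intervalMillis) {
        this.interval = intervalMillis;
    }

    void start() {
        thread = new Thread(this, "checkpoint-sketch");
        thread.setDaemon(true);
        thread.start();
    }

    @Override
    public void run() {
        try {
            synchronized (lock) {
                while (!stopped) {
                    lock.wait(interval);   // releases the monitor while parked
                    if (stopped) break;    // woken by stop(): do not start a new save
                    saveFiles();           // runs while holding the monitor
                }
            }
        } catch (InterruptedException e) {
            // Defensive only: interrupt is not the stop signal in this design.
            Thread.currentThread().interrupt();
        }
    }

    // Placeholder for the real upload work (assumption: no-op here).
    void saveFiles() { }

    void stop() {
        final Thread t;
        synchronized (lock) {
            t = thread;
            if (t == null || stopped) return;  // never started, or already stopped
            stopped = true;
            lock.notifyAll();                  // wake the parked thread
        }
        // Join outside the synchronized block: the woken thread must
        // re-acquire the monitor before it can return from wait() and exit.
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    boolean isAlive() {
        return thread != null && thread.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        CheckpointSketch c = new CheckpointSketch(50);
        c.stop();              // stop before start: returns without error
        c.start();
        Thread.sleep(200);     // let a few save cycles run
        c.stop();              // blocks until the worker has exited
        c.stop();              // repeated call: returns immediately
        System.out.println("alive after stop: " + c.isAlive());
    }
}
```

Because the flag is checked and set under the same monitor that guards `wait()`, a `stop()` that arrives before the worker first parks is still seen by the `while (!stopped)` test, so the wake-up cannot be lost.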
Tests
Added lifecycle coverage to `LogsCheckpointTest`: `onFlowComplete()` → thread terminates; `onFlowComplete()` with no thread started → no exception.

🤖 Generated with Claude Code