Surface allocated resources to self-sizing task commands (nf-seqera)#7199
Draft
pditommaso wants to merge 1 commit into
Draft
Surface allocated resources to self-sizing task commands (nf-seqera)#7199pditommaso wants to merge 1 commit into
pditommaso wants to merge 1 commit into
Conversation
…(sched#492) Companion client change to seqeralabs/sched#493. Fixes the qr/v2 request-vs-allocation OOM: when the scheduler reduces a task's memory allocation below its request, tools that self-size from the requested memory (bcftools --max-mem, JVM -Xmx, STAR, etc.) over-commit and get OOM-killed, because the command was rendered from the request before the allocation was decided. At materialization (no re-execution of the user script) the executor now: - retains the live command GString on TaskRun (scriptlet path only); - ResourceInterpolator captures command values derived from task.memory/task.cpus using the live GString + the raw body.source + the AST `task` reference, validates the parsed template against the GString, and replaces each resource-derived numeric value with a ${SEQERA_TASK_MEM_n}/${SEQERA_TASK_CPUS_n} shell placeholder; - SeqeraTaskHandler rewrites task.script, ships the bindings on the task, and seeds each placeholder with its as-requested value as an env default (so backends that don't resolve bindings, or an older scheduler, still produce a valid command). The scheduler (sched#493) fills the placeholders with value*allocated/requested at launch. Safety: a mis-identified template, a placeholder that would land inside shell single-quotes, a joint memory+cpus expression, or any parse failure all fall back to the original command — never worse than today. Note: the nf-seqera dependency is bumped to sched-client:0.60.0-SNAPSHOT; this PR builds once that artifact (the API in sched#493) is published. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
✅ Deploy Preview for nextflow-docs-staging canceled.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
Companion client change to seqeralabs/sched#493 — together they fix the
qr/v2request-vs-allocation OOM (seqeralabs/sched#492).When the Seqera scheduler's optimizer reduces a task's memory allocation below its request, tools that self-size from the requested memory over-commit and get OOM-killed. The motivating case: nf-core
BCFTOOLS_SORTrenders--max-mem 33177.6M(=task.memory × 0.90) from a 36,864 MiB request, but the container cgroup enforced only 8,704 MiB — bcftools tried to reserve ~29.8 GB in ~8.5 GB and died withexit 255. Affects any self-sizing tool (bcftools/samtools --max-mem, JVM-Xmx, STAR--limitBAMsortRAM,skesa --memory, …).The command is rendered before the scheduler decides the allocation, so it can't be patched afterward. This change captures the resource-derived command values and turns them into shell placeholders; the scheduler (sched#493) fills them with the allocated value at launch — keeping the memory cost savings while eliminating the OOM.
How it works (Seqera executor only; no re-execution of the user script)
At task materialization:
TaskRunretains the live commandGString(before it's flattened to a string), on the plain scriptlet path only — a minimal hook so the executor can inspect which command values were interpolated.ResourceInterpolator(new) locates resource-derived values from three signals — the live commandGString(exact positions, recursing into nested GStrings), the rawbody.source(the interpolation expressions), and the ASTvalRefs(atask-referenced gate). A value is memory/cpus-derived when its expression referencestask.memory/task.cpusdirectly or through one level of local-variable indirection. Each such numeric value is replaced with${SEQERA_TASK_MEM_n}/${SEQERA_TASK_CPUS_n}.SeqeraTaskHandlerrewritestask.scriptbefore the launcher builds.command.sh, ships the bindings on the task, and seeds each placeholder with its as-requested value as an environment default.The placeholder is a plain shell variable expanded by bash at runtime —
.command.shis never rewritten by the scheduler.Safety (never worse than today)
The parsed command template is validated against the live
GString(literal text must match, ignoring whitespace/escaping) before any attribution, so a mis-identified template cannot produce wrong bindings. The original command is returned unchanged on: a placeholder that would land inside shell single-quotes (where it wouldn't expand), a jointmemory+cpusexpression, a non-numeric value, or any parse/validation failure. And because the client seeds the placeholders with the request value, backends that don't resolve bindings (or an older scheduler) still produce a valid command identical to today's behavior.Dependency / merge order
nf-seqerais bumped toio.seqera:sched-client:0.60.0-SNAPSHOT, which provides theResourceBindingAPI added in seqeralabs/sched#493. This PR builds once that artifact is published, so it depends on sched#493 landing first. Draft until then.Tests
ResourceInterpolatorTest— 21 cases: direct self-sizing (bcftools,-Xmxgiga, STAR bytes), cpus arithmetic, multi-line commands with continuations, two memory refs, double-quote expansion, local-variable indirection (skesa ternary), joint mem+cpus left untouched, and the safety fallbacks (single-quote, template mismatch, unrecognized form, anchoredtask.memoryUsage).SeqeraTaskHandlerTest—prepareLauncherrewritestask.scriptand captures bindings; no-op when nothing is resource-derived.🤖 Generated with Claude Code