Refactor task spawner into QueryCoordinator and StageCoordinator by gabotechs · Pull Request #479 · datafusion-contrib/datafusion-distributed

gabotechs · 2026-06-02T08:48:37Z

This is one PR from the following stack of PRs:

This is a pure refactor PR with a couple of implementation detail changes:

Reducing the bloat of complex functions in already complicated parts of the codebase.

For this, two key structs are introduced:

QueryCoordinator: scoped to a whole distributed query. It holds references to data that is global to the query's lifetime, such as the TaskContext, the metrics, and the JoinSet used for spawning tasks. It is also in charge of building StageCoordinator instances, which are scoped per stage instead of per query.
StageCoordinator: this is the old CoordinatorToWorkerTaskSpawner, with some extra methods that allow reuse and some better naming. It handles all the communication between workers and the coordinator needed to drive a stage forward.

This allows reducing the bloat in prepare_static_plan and the future prepare_dynamic_plan functions.
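As a rough illustration of the scoping described above, here is a minimal std-only sketch. Everything except the two struct names is an assumption: `QueryState`, `stage_coordinator`, `spawn_task` and `tasks_spawned` are hypothetical stand-ins, and the real structs hold a TaskContext, metrics and a JoinSet rather than a counter.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the data that lives for a whole query
/// (the PR mentions the TaskContext, the metrics and a JoinSet).
struct QueryState {
    query_id: String,
    tasks_spawned: AtomicUsize,
}

/// Scoped to a whole query; hands out per-stage coordinators.
pub struct QueryCoordinator {
    state: Arc<QueryState>,
}

/// Scoped to a single stage; shares the query-wide state.
pub struct StageCoordinator {
    state: Arc<QueryState>,
    stage: usize,
}

impl QueryCoordinator {
    pub fn new(query_id: &str) -> Self {
        let state = QueryState {
            query_id: query_id.to_string(),
            tasks_spawned: AtomicUsize::new(0),
        };
        Self { state: Arc::new(state) }
    }

    /// Builds a coordinator for one stage that points at the same query state.
    pub fn stage_coordinator(&self, stage: usize) -> StageCoordinator {
        StageCoordinator { state: Arc::clone(&self.state), stage }
    }

    pub fn tasks_spawned(&self) -> usize {
        self.state.tasks_spawned.load(Ordering::Relaxed)
    }
}

impl StageCoordinator {
    /// Pretends to send a task to a worker and returns a label for it.
    pub fn spawn_task(&self, task: usize) -> String {
        self.state.tasks_spawned.fetch_add(1, Ordering::Relaxed);
        format!("{}/stage-{}/task-{}", self.state.query_id, self.stage, task)
    }
}
```

The point of the split is that planner functions only need to ask the query-level object for a stage-level one, instead of threading every shared handle through their arguments.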

Ensuring a coordinator->worker channel is held active for as long as the `DistributedExec` node is executing the query on the coordinator.

For the static planner, this is a noop, as the previous model worked fine, but it will become important in the future for the dynamic planner. In the dynamic planner, the plan can be set by some stages, but they might never reach execution, so instead of coupling the task entry cache invalidation to the end of task execution, it is coupled to the lifetime of the coordinator->worker channel.

This has one collateral effect: WorkUnit feeds can no longer rely on the global coordinator->worker EOS signal for ensuring that no further WorkUnit feed is going to be sent by the coordinator, so they need an explicit EOS message that signals that no further WorkUnits will be received, even though the coordinator->worker channel will still be alive for a while.
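The difference between the feed-level EOS and the channel-level EOS can be sketched with a plain `std::sync::mpsc` channel. This is only an illustration under assumptions: the message enum, its variant names and `WorkerLog` are hypothetical, and the real coordinator->worker channel is a network stream, not an in-process channel.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hypothetical coordinator->worker messages; variant names are assumptions.
pub enum CoordinatorMsg {
    WorkUnit(u32),
    /// Explicit end-of-stream for the WorkUnit feed only.
    WorkUnitFeedEos,
    /// Any other traffic that may still flow after the feed is closed.
    KeepAlive,
}

#[derive(Debug, Default, PartialEq)]
pub struct WorkerLog {
    pub work_units: Vec<u32>,
    pub feed_closed: bool,
    pub msgs_after_feed_eos: usize,
    pub cache_invalidated: bool,
}

pub fn run_worker(rx: Receiver<CoordinatorMsg>) -> WorkerLog {
    let mut log = WorkerLog::default();
    // The iterator ends only when every sender is dropped (channel-level EOS).
    for msg in rx.iter() {
        match msg {
            CoordinatorMsg::WorkUnit(id) if !log.feed_closed => log.work_units.push(id),
            CoordinatorMsg::WorkUnit(_) => {} // ignored: the feed was already closed
            CoordinatorMsg::WorkUnitFeedEos => log.feed_closed = true,
            CoordinatorMsg::KeepAlive => {
                if log.feed_closed {
                    log.msgs_after_feed_eos += 1;
                }
            }
        }
    }
    // The channel is gone: this is where the task entry would be invalidated.
    log.cache_invalidated = true;
    log
}

pub fn demo() -> WorkerLog {
    let (tx, rx) = channel();
    tx.send(CoordinatorMsg::WorkUnit(1)).unwrap();
    tx.send(CoordinatorMsg::WorkUnit(2)).unwrap();
    tx.send(CoordinatorMsg::WorkUnitFeedEos).unwrap();
    tx.send(CoordinatorMsg::KeepAlive).unwrap(); // channel outlives the feed
    drop(tx); // coordinator side finished: channel-level EOS
    run_worker(rx)
}
```

In this sketch the worker learns that no more WorkUnits are coming from `WorkUnitFeedEos`, while cache invalidation waits for the channel itself to close.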

Add a `plan_for_viz` field in `DistributedExec`.

This is a new slot in DistributedExec that holds a reference to the plan that is supposed to be rewritten with metrics for visualization purposes.

Again, for the static planner this is a noop, because the plan meant for visualization is equal to the plan that arrived as the child of DistributedExec in the first place. However, during dynamic planning, the plan that arrives at DistributedExec is not going to be the same as the final one after execution, so we need a slot for storing that final plan.
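A minimal sketch of such a slot, assuming a mutex-guarded `Option`. Only the names `DistributedExec` and `plan_for_viz` come from this PR; the `Plan` alias, the setter, the error type and the choice to pre-fill the slot with the child are hypothetical simplifications.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for `Arc<dyn ExecutionPlan>`; here a plan is just a label.
pub type Plan = Arc<String>;

pub struct DistributedExec {
    child: Plan,
    /// Plan that should be decorated with metrics for visualization.
    plan_for_viz: Mutex<Option<Plan>>,
}

impl DistributedExec {
    pub fn new(child: Plan) -> Self {
        // Assumption: with static planning the viz plan is the child itself.
        let viz = Some(Arc::clone(&child));
        Self { child, plan_for_viz: Mutex::new(viz) }
    }

    pub fn child(&self) -> &Plan {
        &self.child
    }

    /// Hypothetical setter a dynamic planner could call with the final plan.
    pub fn set_plan_for_viz(&self, plan: Plan) -> Result<(), String> {
        let mut slot = self.plan_for_viz.lock().map_err(|e| e.to_string())?;
        *slot = Some(plan);
        Ok(())
    }

    /// Returns the plan to rewrite with metrics, or an error if none is set.
    pub fn plan_for_viz(&self) -> Result<Plan, String> {
        let slot = self.plan_for_viz.lock().map_err(|e| e.to_string())?;
        slot.clone()
            .ok_or_else(|| "no plan available for visualization".to_string())
    }
}
```

With a static plan the getter returns the original child; after a dynamic planner stores a final plan, the same getter returns that one instead, while `child()` is unchanged.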

gabotechs · 2026-06-02T09:27:52Z

+pub(super) struct LatencyMetric {
+    max: Time,
+    avg: Time,
+    max_latency_micros: AtomicU64,
+    sum_latency_micros: AtomicU64,
+    count_latency_micros: AtomicU64,
+}


All the contents of this file are unchanged code that previously lived inside task_spawner.rs. I just extracted it to its own file as it's kind of isolated.
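For readers unfamiliar with the pattern, the three atomic counters in the struct above are enough to derive a max and an average without locking. The sketch below is an assumption about how such counters are typically updated, not the code from latency_metric.rs: it reuses the atomic field names from the diff, omits the `max: Time` and `avg: Time` fields, and the type and method names are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Std-only sketch; field names mirror the atomics shown in the diff.
#[derive(Default)]
pub struct LatencyTracker {
    max_latency_micros: AtomicU64,
    sum_latency_micros: AtomicU64,
    count_latency_micros: AtomicU64,
}

impl LatencyTracker {
    /// Records one observation; safe to call from many threads.
    pub fn record(&self, latency: Duration) {
        let micros = latency.as_micros() as u64;
        self.max_latency_micros.fetch_max(micros, Ordering::Relaxed);
        self.sum_latency_micros.fetch_add(micros, Ordering::Relaxed);
        self.count_latency_micros.fetch_add(1, Ordering::Relaxed);
    }

    pub fn max(&self) -> Duration {
        Duration::from_micros(self.max_latency_micros.load(Ordering::Relaxed))
    }

    /// Average latency, or `None` if nothing was recorded yet.
    pub fn avg(&self) -> Option<Duration> {
        let count = self.count_latency_micros.load(Ordering::Relaxed);
        if count == 0 {
            return None;
        }
        let sum = self.sum_latency_micros.load(Ordering::Relaxed);
        Some(Duration::from_micros(sum / count))
    }
}
```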

gabotechs · 2026-06-02T09:33:36Z

-        plan.children()[0].clone(),
+        distributed_exec.plan_for_viz()?,


During AQE, the plan meant for visualization is the one that gets constructed dynamically during execution, not necessarily the one that arrived at DistributedExec in the first place. This plan_for_viz() method ensures the appropriate plan is rewritten with metrics for visualization purposes.

gabotechs · 2026-06-02T09:37:02Z

 }

-/// DataFusion metrics system is pretty limited from an API standpoint. This intermediate struct
-/// bridges the gaps that are not satisfied by upstream API for measuring latency.
-pub(super) struct LatencyMetric {


Pretty unfortunate diff here:

LatencyMetric was just moved to latency_metric.rs as-is

New keep_stream_alive and NotifyGuard are introduced

CoordinatorToWorkerMetrics was moved from the top of this file to the bottom as-is

This is one PR from the following stack of PRs:
- #477 <- you are here
- #463
- #464
- #478
- #479
- #432

---

Network boundaries in this project are currently breaking one assumption from upstream DataFusion:

`SendableRecordBatchStream`s yield record batches in two situations:
- If explicitly polled
- Eagerly on a spawned task triggered by the first poll

https://github.com/apache/datafusion/blob/d9ea38b95123159161c017840d3e6256e41988dd/datafusion/physical-plan/src/execution_plan.rs#L973-L988

Today, network boundaries pulling from remote sources are breaking this rule, because they start yielding `RecordBatches` over the network even if no poll has ever happened to the `SendableRecordBatchStream` returned by the network boundary.

This has two consequences:
1. Greater memory consumption, as data will get accumulated in the network boundaries while nobody is polling for it.
2. Greater speed on JOINs, as an artifact of eagerly buffering right sides even before they are ever polled

Consequence 2 is nice, but it should be delivered using standard upstream mechanisms:

https://github.com/apache/datafusion/blob/d9ea38b95123159161c017840d3e6256e41988dd/datafusion/common/src/config.rs#L695-L709

Not accidentally by how remote network boundaries work.

---

This PR makes it so that remote network boundaries only start the network stream on first poll, instead of on `.execute()` call, as stated by the `EvaluationType::Eager` docs.
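The "start on first poll" behaviour can be sketched with a plain iterator standing in for a record batch stream. This is an illustration only: `LazyStream` is a hypothetical name, the closure stands in for opening the network stream, and the real code works with async streams rather than `Iterator`.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Wraps a source that is only opened when the first item is requested.
pub struct LazyStream<F, I> {
    open: Option<F>,
    inner: Option<I>,
}

impl<F, I> LazyStream<F, I>
where
    F: FnOnce() -> I,
    I: Iterator,
{
    /// Analogous to `.execute()`: builds the stream without touching the source.
    pub fn new(open: F) -> Self {
        Self { open: Some(open), inner: None }
    }
}

impl<F, I> Iterator for LazyStream<F, I>
where
    F: FnOnce() -> I,
    I: Iterator,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // First poll: open the underlying source exactly once.
        if let Some(open) = self.open.take() {
            self.inner = Some(open());
        }
        self.inner.as_mut()?.next()
    }
}

/// Returns (opened before first poll, items, opened after polling).
pub fn demo() -> (bool, Vec<u32>, bool) {
    let opened = Rc::new(Cell::new(false));
    let flag = Rc::clone(&opened);
    let stream = LazyStream::new(move || {
        flag.set(true); // stands in for starting the network stream
        vec![1u32, 2, 3].into_iter()
    });
    let opened_before_poll = opened.get();
    let items: Vec<u32> = stream.collect();
    (opened_before_poll, items, opened.get())
}
```

Constructing the stream does no work, so nothing can accumulate in a buffer until a consumer actually asks for data.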

This is one PR from the following stack of PRs:
- #477
- #463 <- you are here
- #464
- #478
- #479
- #486
- #432

This PR introduces a NetworkBoundaryBuilder argument to the network boundary injection logic, allowing more flexible and configurable strategies for determining which exchanges require network communication. This enables better optimization of data movement across distributed tasks.

This is one PR from the following stack of PRs:
- #477
- #463
- #464 <- you are here
- #478
- #479
- #486
- #432

This PR introduces a MaxGauge metric to provide better tracking of peak values in distributed metrics collection. This enables more accurate monitoring of resource utilization and helps identify bottlenecks in the execution pipeline.

This is one PR from the following stack of PRs:
- #477
- #463
- #464
- #478 <- you are here
- #479
- #486
- #432

---

Introduces the `ProducerHead` type:

```rust
pub enum ProducerHead {
    /// No specific head node is necessary.
    None,
    /// The head node should be a [BroadcastExec].
    BroadcastExec { output_partitions: usize },
    /// The head node should be a [RepartitionExec].
    RepartitionExec { partitioning: Partitioning },
}
```

Which is passed over the network while remotely executing tasks in order to set the appropriate node at the head of a stage.

Today, this is a noop because the right head node in stages is ensured statically at planning time, but in follow-up PRs, network boundaries can get swapped and reorganized arbitrarily.

One example that happens in AQE:

1. A JOIN is planned as a CollectLeft

```js
HashJoinExec: mode=CollectLeft
  CoalescePartitionsExec:
    [Stage 1] => NetworkBroadcastExec
      BroadcastExec
        DistributedLeafExec: unknown size
  DistributedLeafExec: unknown size
```

2. While collecting runtime statistics, it happens that `Stage 1` is huge, and during AQE the JOINs are swapped

```js
HashJoinExec: mode=CollectLeft
  DistributedLeafExec: small size
  CoalescePartitionsExec:
    [Stage 1] => NetworkBroadcastExec
      BroadcastExec
        DistributedLeafExec: big size
```

3. The `Stage 1` is now on the probe side, so it needs to be rewritten to a `NetworkShuffleExec`, otherwise duplicate data will be returned:

```js
HashJoinExec: mode=CollectLeft
  DistributedLeafExec: small size
  CoalescePartitionsExec:
    [Stage 1] => NetworkShuffleExec
      RepartitionExec // <- dynamically swapped at runtime based on the passed `ProducerHead`
        DistributedLeafExec: big size
```

Passing a `ProducerHead` at execution time unlocks two things:
1. dynamically set the fanout width accounting for a dynamically scaled upper stage
2. dynamically set the correct operator `BroadcastExec` or `RepartitionExec` based on the network boundary above, which might have changed because of AQE
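To make the head swap in step 3 concrete, here is a small self-contained sketch. It is not the project's implementation: `Plan`, `strip_head` and `apply_head` are hypothetical, and `Partitioning` is a simplified stand-in for the DataFusion type of the same name. Only the `ProducerHead` enum mirrors the one quoted above.

```rust
/// Simplified stand-in for DataFusion's `Partitioning`.
#[derive(Debug, Clone, PartialEq)]
pub enum Partitioning {
    RoundRobinBatch(usize),
    Hash(Vec<String>, usize),
}

#[derive(Debug, Clone, PartialEq)]
pub enum ProducerHead {
    None,
    BroadcastExec { output_partitions: usize },
    RepartitionExec { partitioning: Partitioning },
}

/// Hypothetical toy plan tree for a stage.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Leaf(&'static str),
    Broadcast { output_partitions: usize, input: Box<Plan> },
    Repartition { partitioning: Partitioning, input: Box<Plan> },
}

/// Removes an existing broadcast/repartition head, if there is one.
fn strip_head(plan: Plan) -> Plan {
    match plan {
        Plan::Broadcast { input, .. } | Plan::Repartition { input, .. } => *input,
        other => other,
    }
}

/// Installs the head requested at execution time on top of the stage plan.
pub fn apply_head(plan: Plan, head: &ProducerHead) -> Plan {
    match head {
        // No specific head is required: leave the plan untouched.
        ProducerHead::None => plan,
        ProducerHead::BroadcastExec { output_partitions } => Plan::Broadcast {
            output_partitions: *output_partitions,
            input: Box::new(strip_head(plan)),
        },
        ProducerHead::RepartitionExec { partitioning } => Plan::Repartition {
            partitioning: partitioning.clone(),
            input: Box::new(strip_head(plan)),
        },
    }
}
```

In this sketch a stage planned with a broadcast head and executed with `ProducerHead::RepartitionExec` ends up with a repartition head over the same leaf, which is the rewrite the AQE example calls for.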


gabotechs changed the base branch from main to gabrielmusat/producer-head June 2, 2026 08:51

gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch 3 times, most recently from 59d810a to de99ac6 Compare June 2, 2026 09:23


gabotechs marked this pull request as ready for review June 2, 2026 09:45

gabotechs changed the title ~~Refactor task spawner~~ Refactor task spawner into QueryCoordinator and StageCoordinator Jun 2, 2026

gabotechs force-pushed the gabrielmusat/producer-head branch from 520fe2c to 82af353 Compare June 2, 2026 13:30

gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from de99ac6 to 03aeeb9 Compare June 2, 2026 13:30

gabotechs force-pushed the gabrielmusat/producer-head branch from 82af353 to b2f8e5e Compare June 2, 2026 15:06

gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from 03aeeb9 to d551cde Compare June 2, 2026 15:06

gabotechs force-pushed the gabrielmusat/producer-head branch from b2f8e5e to ca79034 Compare June 2, 2026 15:15

gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from d551cde to 9a9f0de Compare June 2, 2026 15:15

gabotechs force-pushed the gabrielmusat/producer-head branch from ca79034 to 431d5a2 Compare June 8, 2026 10:48

gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from 9a9f0de to bda1c2d Compare June 8, 2026 10:49

gabotechs mentioned this pull request Jun 8, 2026

Add cost model #486

Open

gabotechs force-pushed the gabrielmusat/producer-head branch from 431d5a2 to fe92488 Compare June 8, 2026 12:19

gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from bda1c2d to 2879131 Compare June 8, 2026 12:19

gabotechs force-pushed the gabrielmusat/producer-head branch from fe92488 to 7b7b571 Compare June 8, 2026 14:51

gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch 2 times, most recently from 4e96136 to 5ad9e3f Compare June 8, 2026 15:11

gabotechs force-pushed the gabrielmusat/producer-head branch 2 times, most recently from 423e39a to 8ab2da5 Compare June 9, 2026 06:42

gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from 5ad9e3f to 4cb23c0 Compare June 9, 2026 06:42

gabotechs force-pushed the gabrielmusat/producer-head branch from 8ab2da5 to 657a531 Compare June 11, 2026 07:11

Base automatically changed from gabrielmusat/producer-head to main June 11, 2026 07:59

gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch 4 times, most recently from fc261f1 to a4ceac5 Compare June 12, 2026 11:03

gabotechs added 2 commits June 14, 2026 19:16

Add NetworkBoundaryBuilder argument to inject_network_boundaries.rs

554daf8

Refactor coordinator module and ensure cache invalidation on coordina…

1c864d5

…tor->worker stream drop

gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from a4ceac5 to 1c864d5 Compare June 14, 2026 17:20

gabotechs wants to merge 2 commits into main from gabrielmusat/task-spawner-refactor-and-cache-invalidation
