
Refactor task spawner into QueryCoordinator and StageCoordinator #479

Open
gabotechs wants to merge 2 commits into main from gabrielmusat/task-spawner-refactor-and-cache-invalidation

@gabotechs gabotechs commented Jun 2, 2026


This is one PR from the following stack of PRs:


This is a pure refactor PR with a couple of implementation detail changes:

Reducing the bloat of complex functions in already complicated parts of the codebase.

For this, two key structs are introduced:

  • QueryCoordinator: scoped to a whole distributed query. It holds references to data that is global to a query's lifetime, like the TaskContext, the metrics, and the JoinSet used for spawning tasks. It is also in charge of building StageCoordinator instances, which are scoped per-stage instead of per-query.
  • StageCoordinator: this is the old CoordinatorToWorkerTaskSpawner, with better naming and a few more methods that make it reusable. It handles all the communication between coordinator and workers needed for driving a stage forward.

This allows reducing the bloat in prepare_static_plan and the future prepare_dynamic_plan functions.
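To illustrate the scoping described above, here is a minimal sketch using only the standard library. It is not the code from this PR: `QueryState`, `stage_coordinator`, and `spawn_tasks` are hypothetical names, and the query-wide state is reduced to a single counter standing in for the TaskContext, metrics, and JoinSet.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for state that lives as long as the whole query
/// (the PR mentions the TaskContext, the metrics and a JoinSet).
#[derive(Default)]
pub struct QueryState {
    /// Counts tasks spawned across all stages of the query.
    pub spawned_tasks: AtomicUsize,
}

/// Query-scoped: owns the shared state and hands out per-stage coordinators.
pub struct QueryCoordinator {
    state: Arc<QueryState>,
}

/// Stage-scoped: drives a single stage while sharing the query-wide state.
pub struct StageCoordinator {
    stage_id: usize,
    state: Arc<QueryState>,
}

impl QueryCoordinator {
    pub fn new() -> Self {
        Self { state: Arc::new(QueryState::default()) }
    }

    /// Builds a coordinator scoped to one stage (illustrative name).
    pub fn stage_coordinator(&self, stage_id: usize) -> StageCoordinator {
        StageCoordinator { stage_id, state: Arc::clone(&self.state) }
    }

    /// Total number of tasks spawned by every stage of this query.
    pub fn spawned_tasks(&self) -> usize {
        self.state.spawned_tasks.load(Ordering::Relaxed)
    }
}

impl StageCoordinator {
    /// Pretends to send `task_count` tasks to workers, records them in the
    /// query-wide counter, and returns one label per task.
    pub fn spawn_tasks(&self, task_count: usize) -> Vec<String> {
        self.state.spawned_tasks.fetch_add(task_count, Ordering::Relaxed);
        (0..task_count)
            .map(|task| format!("stage {} task {}", self.stage_id, task))
            .collect()
    }
}
```

With this shape, a planning function only needs to hold one `QueryCoordinator` and ask it for a `StageCoordinator` per stage, which is roughly how the bloat reduction is described.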

Ensuring a coordinator->worker channel is held active for as long as the DistributedExec node is executing the query on the coordinator.

For the static planner, this is a noop, since the previous model already worked fine, but it will become important for the dynamic planner. There, a plan can be set on some stages that then never reach execution, so instead of coupling the task entry cache invalidation to the end of task execution, it is now coupled to the lifetime of the coordinator->worker channel.

This has one side effect: WorkUnit feeds can no longer rely on the global coordinator->worker EOS signal to know that the coordinator will send no further WorkUnits. They now need an explicit EOS message for that, because the coordinator->worker channel stays alive for a while after the last WorkUnit is sent.
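The explicit EOS idea can be sketched with a standard library channel. This is an assumption-laden illustration, not the PR's protocol: `CoordinatorMsg`, `WorkUnitsEos`, and `collect_work_units` are hypothetical names, and a synchronous `mpsc` channel stands in for the real coordinator->worker stream.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical coordinator->worker message type.
#[derive(Debug, PartialEq)]
pub enum CoordinatorMsg {
    /// One unit of work for the worker.
    WorkUnit(u32),
    /// Explicit end-of-stream for the WorkUnit feed only.
    WorkUnitsEos,
    /// Any other message that can still arrive after the feed has ended.
    Other(&'static str),
}

/// Collects work units until the explicit EOS marker arrives, without
/// waiting for the channel itself to be closed by the coordinator.
pub fn collect_work_units(rx: &Receiver<CoordinatorMsg>) -> Vec<u32> {
    let mut units = Vec::new();
    while let Ok(msg) = rx.recv() {
        match msg {
            CoordinatorMsg::WorkUnit(unit) => units.push(unit),
            CoordinatorMsg::WorkUnitsEos => break,
            // Not part of the WorkUnit feed, so it is skipped here.
            CoordinatorMsg::Other(_) => {}
        }
    }
    units
}
```

The point of the sketch is that the feed terminates on `WorkUnitsEos` while the sender is still alive, so the channel's own closure can be reserved for signalling the end of the query.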

Add a plan_for_viz field in DistributedExec.

This is a new slot in DistributedExec that holds a reference to the plan that is supposed to be rewritten with metrics for visualization purposes.

Again, for the static planner this is a noop, because the plan meant for visualization is the same plan that arrived as the child of DistributedExec in the first place. However, during dynamic planning, the plan that arrives at DistributedExec is not going to be the same as the final one after execution, so we need a slot for storing that final plan.
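A rough sketch of such a slot follows. The types are simplified assumptions: `Plan` is an `Arc<str>` standing in for a real execution plan, `set_plan_for_viz` is a hypothetical setter, and falling back to the child when nothing was stored is a guess at how the static case could behave, not a description of the actual implementation.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a shared execution plan.
pub type Plan = Arc<str>;

pub struct DistributedExec {
    /// The plan that arrived as the child of this node.
    child: Plan,
    /// Slot for the plan that should be rewritten with metrics.
    plan_for_viz: Mutex<Option<Plan>>,
}

impl DistributedExec {
    pub fn new(child: Plan) -> Self {
        Self { child, plan_for_viz: Mutex::new(None) }
    }

    /// Would be called by a dynamic planner once the final plan is known
    /// (hypothetical method).
    pub fn set_plan_for_viz(&self, plan: Plan) -> Result<(), String> {
        let mut slot = self.plan_for_viz.lock().map_err(|e| e.to_string())?;
        *slot = Some(plan);
        Ok(())
    }

    /// Returns the stored plan, or the original child if none was stored
    /// (assumed behavior for the static planner).
    pub fn plan_for_viz(&self) -> Result<Plan, String> {
        let slot = self.plan_for_viz.lock().map_err(|e| e.to_string())?;
        Ok(match &*slot {
            Some(plan) => Arc::clone(plan),
            None => Arc::clone(&self.child),
        })
    }
}
```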

@gabotechs gabotechs changed the base branch from main to gabrielmusat/producer-head June 2, 2026 08:51
@gabotechs gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch 3 times, most recently from 59d810a to de99ac6 Compare June 2, 2026 09:23
Comment on lines +11 to +17
```rust
pub(super) struct LatencyMetric {
    max: Time,
    avg: Time,
    max_latency_micros: AtomicU64,
    sum_latency_micros: AtomicU64,
    count_latency_micros: AtomicU64,
}
```

@gabotechs gabotechs Jun 2, 2026

All the contents of this file are unchanged code that previously lived inside task_spawner.rs. I just extracted it to its own file as it's kind of isolated.
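For readers unfamiliar with the pattern, the three atomic fields in the struct above suggest a lock-free max/sum/count tracker. The sketch below is an assumption about how such counters are typically updated, not the contents of latency_metric.rs; `LatencyTracker` and its methods are hypothetical, and the DataFusion `Time` fields are left out.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Hypothetical tracker mirroring the three atomic fields of `LatencyMetric`.
#[derive(Default)]
pub struct LatencyTracker {
    max_latency_micros: AtomicU64,
    sum_latency_micros: AtomicU64,
    count_latency_micros: AtomicU64,
}

impl LatencyTracker {
    /// Records one latency sample without taking any lock.
    pub fn record(&self, latency: Duration) {
        let micros = latency.as_micros() as u64;
        self.max_latency_micros.fetch_max(micros, Ordering::Relaxed);
        self.sum_latency_micros.fetch_add(micros, Ordering::Relaxed);
        self.count_latency_micros.fetch_add(1, Ordering::Relaxed);
    }

    /// Largest latency seen so far (zero if nothing was recorded).
    pub fn max(&self) -> Duration {
        Duration::from_micros(self.max_latency_micros.load(Ordering::Relaxed))
    }

    /// Average latency, or `None` if nothing was recorded yet.
    pub fn avg(&self) -> Option<Duration> {
        let count = self.count_latency_micros.load(Ordering::Relaxed);
        if count == 0 {
            return None;
        }
        let sum = self.sum_latency_micros.load(Ordering::Relaxed);
        Some(Duration::from_micros(sum / count))
    }
}
```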

Comment on lines -69 to +65
```diff
- plan.children()[0].clone(),
+ distributed_exec.plan_for_viz()?,
```

During AQE, the plan meant for visualization is the one that gets constructed dynamically during execution, not necessarily the one that arrived at DistributedExec in the first place. This plan_for_viz() method ensures the appropriate plan is rewritten with metrics for visualization purposes.

```rust
}

/// DataFusion metrics system is pretty limited from an API standpoint. This intermediate struct
/// bridges the gaps that are not satisfied by upstream API for measuring latency.
pub(super) struct LatencyMetric {
```

Pretty unfortunate diff here:

  • LatencyMetric was just moved to latency_metric.rs as-is
  • New keep_stream_alive and NotifyGuard are introduced
  • CoordinatorToWorkerMetrics was moved from the top of this file to the bottom as-is

@gabotechs gabotechs marked this pull request as ready for review June 2, 2026 09:45
@gabotechs gabotechs changed the title from "Refactor task spawner" to "Refactor task spawner into QueryCoordinator and StageCoordinator" Jun 2, 2026
@gabotechs gabotechs force-pushed the gabrielmusat/producer-head branch from 520fe2c to 82af353 Compare June 2, 2026 13:30
@gabotechs gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from de99ac6 to 03aeeb9 Compare June 2, 2026 13:30
gabotechs added a commit that referenced this pull request Jun 2, 2026
This is one PR from the following stack of PRs:
- #477 <- you are here
- #463
- #464
- #478
- #479
- #432

---

Network boundaries in this project are currently breaking one assumption
from upstream DataFusion:

`SendableRecordBatchStream`s yield record batches in two situations:
- If explicitly polled
- Eagerly on a spawned task triggered by the first poll


https://github.com/apache/datafusion/blob/d9ea38b95123159161c017840d3e6256e41988dd/datafusion/physical-plan/src/execution_plan.rs#L973-L988

Today, network boundaries pulling from remote sources are breaking this
rule, because they start yielding `RecordBatches` over the network even
if no poll has ever happened to the `SendableRecordBatchStream` returned
by the network boundary.

This has two consequences:
1. Greater memory consumption, as data will get accumulated in the
network boundaries while nobody is polling for it.
2. Greater speed on JOINs, as an artifact of eagerly buffering right
sides even before they are ever polled

Consequence 2 is nice, but it should be delivered using standard
upstream mechanisms:


https://github.com/apache/datafusion/blob/d9ea38b95123159161c017840d3e6256e41988dd/datafusion/common/src/config.rs#L695-L709

Not accidentally by how remote network boundaries work.

---

This PR makes it so that remote network boundaries only start the
network stream on first poll, instead of on `.execute()` call, as stated
by the `EvaluationType::Eager` docs.
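The "start on first poll" behavior can be sketched with a plain iterator standing in for a record batch stream. This is only an analogy under stated assumptions: `LazyStream` is a hypothetical type, and the closure passed to it represents whatever opens the network stream.

```rust
/// Hypothetical wrapper that defers opening the underlying stream until the
/// first call to `next()`, instead of opening it at construction time.
pub struct LazyStream<I, F> {
    /// Opens the underlying stream; consumed on first use.
    start: Option<F>,
    /// The underlying stream, once opened.
    inner: Option<I>,
}

impl<I: Iterator, F: FnOnce() -> I> LazyStream<I, F> {
    pub fn new(start: F) -> Self {
        Self { start: Some(start), inner: None }
    }

    /// Whether the underlying stream has been opened yet.
    pub fn is_started(&self) -> bool {
        self.inner.is_some()
    }
}

impl<I: Iterator, F: FnOnce() -> I> Iterator for LazyStream<I, F> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // Open the underlying stream only when the first item is requested.
        if let Some(start) = self.start.take() {
            self.inner = Some(start());
        }
        self.inner.as_mut()?.next()
    }
}
```

Constructing a `LazyStream` does no work, so nothing is buffered until a consumer actually asks for data, which is the memory behavior the commit message is after.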
@gabotechs gabotechs force-pushed the gabrielmusat/producer-head branch from 82af353 to b2f8e5e Compare June 2, 2026 15:06
@gabotechs gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from 03aeeb9 to d551cde Compare June 2, 2026 15:06
@gabotechs gabotechs force-pushed the gabrielmusat/producer-head branch from b2f8e5e to ca79034 Compare June 2, 2026 15:15
@gabotechs gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from d551cde to 9a9f0de Compare June 2, 2026 15:15
@gabotechs gabotechs force-pushed the gabrielmusat/producer-head branch from ca79034 to 431d5a2 Compare June 8, 2026 10:48
@gabotechs gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from 9a9f0de to bda1c2d Compare June 8, 2026 10:49
@gabotechs gabotechs mentioned this pull request Jun 8, 2026
@gabotechs gabotechs force-pushed the gabrielmusat/producer-head branch from 431d5a2 to fe92488 Compare June 8, 2026 12:19
@gabotechs gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from bda1c2d to 2879131 Compare June 8, 2026 12:19
@gabotechs gabotechs force-pushed the gabrielmusat/producer-head branch from fe92488 to 7b7b571 Compare June 8, 2026 14:51
@gabotechs gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch 2 times, most recently from 4e96136 to 5ad9e3f Compare June 8, 2026 15:11
@gabotechs gabotechs force-pushed the gabrielmusat/producer-head branch 2 times, most recently from 423e39a to 8ab2da5 Compare June 9, 2026 06:42
@gabotechs gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from 5ad9e3f to 4cb23c0 Compare June 9, 2026 06:42
gabotechs added a commit that referenced this pull request Jun 11, 2026
This is one PR from the following stack of PRs:
- #477
- #463 <- you are here
- #464
- #478
- #479
- #486
- #432

This PR introduces a NetworkBoundaryBuilder argument to the network
boundary injection logic, allowing more flexible and configurable
strategies for determining which exchanges require network
communication. This enables better optimization of data movement across
distributed tasks.
gabotechs added a commit that referenced this pull request Jun 11, 2026
This is one PR from the following stack of PRs:

- #477
- #463
- #464 <- you are here
- #478
- #479
- #486
- #432

This PR introduces a MaxGauge metric to provide better tracking of peak
values in distributed metrics collection. This enables more accurate
monitoring of resource utilization and helps identify bottlenecks in the
execution pipeline.
@gabotechs gabotechs force-pushed the gabrielmusat/producer-head branch from 8ab2da5 to 657a531 Compare June 11, 2026 07:11
gabotechs added a commit that referenced this pull request Jun 11, 2026
This is one PR from the following stack of PRs:
- #477
- #463
- #464
- #478 <- you are here
- #479
- #486
- #432

---

Introduces the `ProducerHead` type:

```rust
pub enum ProducerHead {
    /// No specific head node is necessary.
    None,
    /// The head node should be a [BroadcastExec].
    BroadcastExec { output_partitions: usize },
    /// The head node should be a [RepartitionExec].
    RepartitionExec { partitioning: Partitioning },
}
```

Which is passed over the network while remotely executing tasks in order
to set the appropriate node at the head of a stage.

Today, this is a noop because the right head node in stages is ensured
statically at planning time, but in follow-up PRs, network boundaries
can get swapped and reorganized arbitrarily.

One example that happens in AQE:

1. A JOIN is planned as a CollectLeft

```js
HashJoinExec: mode=CollectLeft
  CoalescePartitionsExec:
    [Stage 1] => NetworkBroadcastExec
      BroadcastExec
        DistributedLeafExec: unknown size
  DistributedLeafExec: unknown size
```

2. While collecting runtime statistics, it turns out that `Stage 1` is
huge, and during AQE the JOIN sides are swapped

```js
HashJoinExec: mode=CollectLeft
  DistributedLeafExec: small size
  CoalescePartitionsExec:
    [Stage 1] => NetworkBroadcastExec
      BroadcastExec
        DistributedLeafExec: big size
```

3. The `Stage 1` is now on the probe side, so it needs to be rewritten
to a `NetworkShuffleExec`, otherwise duplicate data will be returned:

```js
HashJoinExec: mode=CollectLeft
  DistributedLeafExec: small size
  CoalescePartitionsExec:
    [Stage 1] => NetworkShuffleExec
      RepartitionExec // <- dynamically swapped at runtime based on the passed `ProducerHead`
        DistributedLeafExec: big size
```

Passing a `ProducerHead` at execution time unlocks two things:
1. dynamically set the fanout width accounting for a dynamically scaled
upper stage
2. dynamically set the correct operator `BroadcastExec` or
`RepartitionExec` based on the network boundary above, which might have
changed because of AQE
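
As a toy illustration of the head swap in step 3, the sketch below applies a `ProducerHead` to a simplified plan tree. Everything except the enum's variant names is an assumption: `Node`, `strip_head`, and `apply_producer_head` are hypothetical, and a plain `partitions: usize` replaces the real `Partitioning` type.

```rust
/// Simplified mirror of `ProducerHead`; `partitions` stands in for the real
/// `Partitioning` value.
#[derive(Debug, Clone, PartialEq)]
pub enum ProducerHead {
    None,
    BroadcastExec { output_partitions: usize },
    RepartitionExec { partitions: usize },
}

/// Toy plan node used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Leaf(&'static str),
    Broadcast { output_partitions: usize, input: Box<Node> },
    Repartition { partitions: usize, input: Box<Node> },
}

/// Removes a broadcast or repartition head, if any, returning its input.
fn strip_head(plan: Node) -> Node {
    match plan {
        Node::Broadcast { input, .. } | Node::Repartition { input, .. } => *input,
        other => other,
    }
}

/// Puts the requested head on top of the stage plan, replacing whatever head
/// was placed at planning time. `ProducerHead::None` leaves the plan as is.
pub fn apply_producer_head(plan: Node, head: &ProducerHead) -> Node {
    match head {
        ProducerHead::None => plan,
        ProducerHead::BroadcastExec { output_partitions } => Node::Broadcast {
            output_partitions: *output_partitions,
            input: Box::new(strip_head(plan)),
        },
        ProducerHead::RepartitionExec { partitions } => Node::Repartition {
            partitions: *partitions,
            input: Box::new(strip_head(plan)),
        },
    }
}
```

In this sketch, a stage planned with a broadcast head and later executed with `ProducerHead::RepartitionExec { partitions: 8 }` ends up with a repartition head over the same leaf, matching the rewrite shown in step 3.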
Base automatically changed from gabrielmusat/producer-head to main June 11, 2026 07:59
@gabotechs gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch 4 times, most recently from fc261f1 to a4ceac5 Compare June 12, 2026 11:03
@gabotechs gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from a4ceac5 to 1c864d5 Compare June 14, 2026 17:20