No eager buffering in network connections by gabotechs · Pull Request #477 · datafusion-contrib/datafusion-distributed

gabotechs · 2026-06-01T07:56:17Z

This is one PR from the following stack of PRs:

Network boundaries in this project are currently breaking one assumption from upstream DataFusion:

SendableRecordBatchStreams yield record batches in two situations:

If explicitly polled
Eagerly on a spawned task triggered by the first poll

https://github.com/apache/datafusion/blob/d9ea38b95123159161c017840d3e6256e41988dd/datafusion/physical-plan/src/execution_plan.rs#L973-L988

Today, network boundaries pulling from remote sources break this rule: they start streaming RecordBatches over the network even if the SendableRecordBatchStream returned by the network boundary has never been polled.

This has two consequences:

1. Greater memory consumption, as data accumulates in the network boundaries while nobody is polling for it.
2. Greater speed on JOINs, as an artifact of eagerly buffering right sides even before they are ever polled.

Consequence 2 is nice, but it should be delivered using standard upstream mechanisms:

https://github.com/apache/datafusion/blob/d9ea38b95123159161c017840d3e6256e41988dd/datafusion/common/src/config.rs#L695-L709

Not accidentally by how remote network boundaries work.

This PR makes it so that remote network boundaries only start the network stream on first poll, instead of on .execute() call, as stated by the EvaluationType::Eager docs.
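To make the behavior change concrete, here is a minimal sketch of "start on first poll" semantics. It is a synchronous analogy, not the code in this PR: `Iterator::next` stands in for polling a SendableRecordBatchStream, and the `open` closure stands in for starting the network stream. `LazyRemote`, `is_started`, and `demo` are hypothetical names used only for illustration.

```rust
/// Synchronous stand-in for a remote stream (illustrative assumption):
/// `open` plays the role of the network call, `next` plays the role of a poll.
pub struct LazyRemote<F, I>
where
    F: FnOnce() -> I,
    I: Iterator,
{
    open: Option<F>,
    inner: Option<I>,
}

impl<F, I> LazyRemote<F, I>
where
    F: FnOnce() -> I,
    I: Iterator,
{
    /// Analogous to `.execute()`: stores the opener but does not call it.
    pub fn new(open: F) -> Self {
        Self { open: Some(open), inner: None }
    }

    /// True once the underlying "connection" has been opened.
    pub fn is_started(&self) -> bool {
        self.inner.is_some()
    }
}

impl<F, I> Iterator for LazyRemote<F, I>
where
    F: FnOnce() -> I,
    I: Iterator,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // Open the connection on the first poll only.
        if let Some(open) = self.open.take() {
            self.inner = Some(open());
        }
        self.inner.as_mut()?.next()
    }
}

/// Returns (started before the first poll, first item, started after the first poll).
pub fn demo() -> (bool, Option<u32>, bool) {
    let mut stream = LazyRemote::new(|| vec![1u32, 2, 3].into_iter());
    let started_before = stream.is_started();
    let first = stream.next();
    (started_before, first, stream.is_started())
}
```

In this sketch nothing is fetched or buffered between construction and the first `next()` call, which is the property the PR restores for remote network boundaries.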

gabotechs · 2026-06-01T08:00:14Z

    fn default() -> Self {
        let cache = Cache::builder()
-            .time_to_idle(Duration::from_secs(60))
+            .time_to_idle(Duration::from_mins(10))


As we are no longer eagerly calling the remote streams, there's now a legitimate use case where the time gap between when the plan was sent in CoordinatorChannel until it's executed with ExecuteTask is significant.

This typically happens in JOINs, where the stage on the build side is called immediately, but the stage on the probe side is delayed until the build (left) side has been fully gathered. Having 10 minutes here gives enough headroom for the join to fully collect the build side before it starts executing the probe side.
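The effect of the idle timeout can be illustrated with a toy time-to-idle cache. This is a simplified model written for this explanation, not the cache used in the PR: `IdleCache`, `plan_survives`, and the `"stage-2"` key are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Toy time-to-idle cache (illustrative assumption, not the real cache).
/// `now` is an offset from an arbitrary start time.
pub struct IdleCache {
    time_to_idle: Duration,
    // key -> (value, time of last access)
    entries: HashMap<String, (String, Duration)>,
}

impl IdleCache {
    pub fn new(time_to_idle: Duration) -> Self {
        Self { time_to_idle, entries: HashMap::new() }
    }

    pub fn insert(&mut self, key: &str, value: &str, now: Duration) {
        self.entries.insert(key.to_string(), (value.to_string(), now));
    }

    /// Returns the value and refreshes its last-access time if it has been
    /// idle for less than `time_to_idle`; otherwise evicts it.
    pub fn get(&mut self, key: &str, now: Duration) -> Option<String> {
        let last_access = self.entries.get(key)?.1;
        if now.saturating_sub(last_access) >= self.time_to_idle {
            self.entries.remove(key);
            return None;
        }
        let entry = self.entries.get_mut(key)?;
        entry.1 = now;
        Some(entry.0.clone())
    }
}

/// The plan is stored at t=0 and the probe side first asks for it after `delay`.
pub fn plan_survives(time_to_idle: Duration, delay: Duration) -> bool {
    let mut cache = IdleCache::new(time_to_idle);
    cache.insert("stage-2", "plan", Duration::ZERO);
    cache.get("stage-2", delay).is_some()
}
```

Under this model, a build side that takes three minutes to collect would find the probe-side plan already evicted with a 60-second timeout, while a 10-minute timeout keeps it available.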

gene-bordegaray

The idea makes sense here and I understand the issue with buffering data in memory ahead of late consumers.

I am curious though: do you see a case where this was actually beneficial? With the previous behavior we could effectively be prefetching / buffering data before it is actually requested, so it is ready to go right away when the consumer asks for it. I was thinking there may be cases where the time to first batch for the consumer might now be slower without this?

Could we bench this? But trust your judgement on the call 👍

EDIT:

I am seeing your point about the JOINs and ya I think I agree here, this is more like a happy accident from unintended behavior

gabotechs · 2026-06-02T10:19:07Z

Could we bench this?

Doing that now. I do expect some performance impact.

gene-bordegaray · 2026-06-02T11:45:53Z

@gabotechs nice feel free to ping when some results are ready 😄

gabotechs · 2026-06-02T11:48:49Z

This is the benchmark comparison against main. As expected, the branch is slower:

  ┌────────────┬────────┬────────┬───────────────────────────────┐
  │  Dataset   │  main  │ branch │            verdict            │
  ├────────────┼────────┼────────┼───────────────────────────────┤
  │ TPC-H SF1  │ 17.1s  │ 24.3s  │ 1.42x slower ❌               │
  ├────────────┼────────┼────────┼───────────────────────────────┤
  │ TPC-H SF10 │ 59.3s  │ 83.3s  │ 1.41x slower ❌               │
  ├────────────┼────────┼────────┼───────────────────────────────┤
  │ TPC-DS SF1 │ 129.9s │ 171.5s │ 1.32x slower ❌               │
  ├────────────┼────────┼────────┼───────────────────────────────┤
  │ ClickBench │ 151.2s │ 154.8s │ 1.02x slower ✖ (within noise) │
  └────────────┴────────┴────────┴───────────────────────────────┘

I'll now try running with hash_join_buffering_capacity: 1_000_000. With this, I'd expect it to be roughly as performant as main

gabotechs · 2026-06-02T12:27:54Z

Now against main with hash_join_buffering_capacity: 1_000_000:

  ┌────────────┬────────┬────────┬───────────────────────────────┐
  │  Dataset   │  main  │ branch │            verdict            │
  ├────────────┼────────┼────────┼───────────────────────────────┤
  │ TPC-H SF1  │ 17.1s  │ 16.3s  │ 1.05x faster ✔                │
  ├────────────┼────────┼────────┼───────────────────────────────┤
  │ TPC-H SF10 │ 59.3s  │ 65.5s  │ 1.11x slower ✖                │
  ├────────────┼────────┼────────┼───────────────────────────────┤
  │ TPC-DS SF1 │ 129.9s │ 117.4s │ 1.11x faster ✔                │
  ├────────────┼────────┼────────┼───────────────────────────────┤
  │ ClickBench │ 151.2s │ 171.0s │ 1.13x slower ✖                │
  └────────────┴────────┴────────┴───────────────────────────────┘

With hash join buffering enabled, things look as before

gene-bordegaray · 2026-06-02T12:49:46Z

@gabotechs noice, thank you 🙇

gabotechs · 2026-06-02T12:50:17Z

I think the CI failure is because of #480

This is one PR from the following stack of PRs:

- #477
- #463 <- you are here
- #464
- #478
- #479
- #486
- #432

This PR introduces a NetworkBoundaryBuilder argument to the network boundary injection logic, allowing more flexible and configurable strategies for determining which exchanges require network communication. This enables better optimization of data movement across distributed tasks.

This is one PR from the following stack of PRs:

- #477
- #463
- #464 <- you are here
- #478
- #479
- #486
- #432

This PR introduces a MaxGauge metric to provide better tracking of peak values in distributed metrics collection. This enables more accurate monitoring of resource utilization and helps identify bottlenecks in the execution pipeline.

This is one PR from the following stack of PRs:

- #477
- #463
- #464
- #478 <- you are here
- #479
- #486
- #432

---

Introduces the `ProducerHead` type:

```rust
pub enum ProducerHead {
    /// No specific head node is necessary.
    None,
    /// The head node should be a [BroadcastExec].
    BroadcastExec { output_partitions: usize },
    /// The head node should be a [RepartitionExec].
    RepartitionExec { partitioning: Partitioning },
}
```

Which is passed over the network while remotely executing tasks in order to set the appropriate node at the head of a stage.

Today, this is a no-op because the right head node in stages is ensured statically at planning time, but in follow-up PRs, network boundaries can get swapped and reorganized arbitrarily. One example that happens in AQE:

1. A JOIN is planned as a CollectLeft

```js
HashJoinExec: mode=CollectLeft
  CoalescePartitionsExec:
    [Stage 1] => NetworkBroadcastExec
      BroadcastExec
        DistributedLeafExec: unknown size
  DistributedLeafExec: unknown size
```

2. While collecting runtime statistics, it turns out that `Stage 1` is huge, and during AQE the JOIN sides are swapped

```js
HashJoinExec: mode=CollectLeft
  DistributedLeafExec: small size
  CoalescePartitionsExec:
    [Stage 1] => NetworkBroadcastExec
      BroadcastExec
        DistributedLeafExec: big size
```

3. The `Stage 1` is now on the probe side, so it needs to be rewritten to a `NetworkShuffleExec`, otherwise duplicate data will be returned:

```js
HashJoinExec: mode=CollectLeft
  DistributedLeafExec: small size
  CoalescePartitionsExec:
    [Stage 1] => NetworkShuffleExec
      RepartitionExec // <- dynamically swapped at runtime based on the passed `ProducerHead`
        DistributedLeafExec: big size
```

Passing a `ProducerHead` at execution time unlocks two things:

1. dynamically set the fanout width accounting for a dynamically scaled upper stage
2. dynamically set the correct operator `BroadcastExec` or `RepartitionExec` based on the network boundary above, which might have changed because of AQE
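As a rough illustration of how a worker could act on a received `ProducerHead`, the sketch below picks the head node at execution time and renders the resulting stage as text. It is not the implementation from #478: `Partitioning` here is a simplified stand-in for DataFusion's type, and `with_head` is a hypothetical helper that works on plan strings rather than real execution plans.

```rust
/// Simplified stand-in for DataFusion's `Partitioning` (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum Partitioning {
    RoundRobinBatch(usize),
    Hash { columns: Vec<String>, partitions: usize },
}

/// Mirrors the shape of the `ProducerHead` enum described above.
#[derive(Debug, Clone, PartialEq)]
pub enum ProducerHead {
    None,
    BroadcastExec { output_partitions: usize },
    RepartitionExec { partitioning: Partitioning },
}

/// Hypothetical helper: returns the stage as display lines, with the
/// requested head node placed on top of `stage_body`.
pub fn with_head(head: &ProducerHead, stage_body: &str) -> Vec<String> {
    let mut lines = Vec::new();
    match head {
        // No head requested: the stage body is used as-is.
        ProducerHead::None => {}
        ProducerHead::BroadcastExec { output_partitions } => {
            lines.push(format!("BroadcastExec: output_partitions={output_partitions}"));
        }
        ProducerHead::RepartitionExec { partitioning } => {
            lines.push(format!("RepartitionExec: partitioning={partitioning:?}"));
        }
    }
    // Indent the body only when a head node was placed above it.
    let indent = if lines.is_empty() { "" } else { "  " };
    lines.push(format!("{indent}{stage_body}"));
    lines
}
```

With this shape, the same stage body can be served under a `BroadcastExec` or a `RepartitionExec` depending on what the consumer sends, which is the flexibility the AQE example relies on.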

This was referenced Jun 1, 2026

Add NetworkBoundaryBuilder argument to inject_network_boundaries #463

Merged

Add MaxGauge metric #464

Merged

Adaptive task count assignation #432

Open

gabotechs commented Jun 1, 2026

gabotechs force-pushed the gabrielmusat/no-eager-buffering-in-remote-connections branch from 5d52360 to f43a75a Compare June 1, 2026 13:01

gabotechs mentioned this pull request Jun 1, 2026

Lazily set the producer head at execution time #478

Merged

gene-bordegaray approved these changes Jun 1, 2026

Comment thread src/worker/worker_service.rs Outdated

Comment thread src/worker/worker_connection_pool.rs Outdated

gene-bordegaray reviewed Jun 1, 2026

Comment thread src/worker/worker_service.rs Outdated

gabotechs mentioned this pull request Jun 2, 2026

Refactor task spawner into QueryCoordinator and StageCoordinator #479

Open

gabotechs force-pushed the gabrielmusat/no-eager-buffering-in-remote-connections branch 3 times, most recently from 56908a9 to 08fa7fd Compare June 2, 2026 10:18

No eager buffering in network connections

5064fd0

gabotechs force-pushed the gabrielmusat/no-eager-buffering-in-remote-connections branch from 08fa7fd to 5064fd0 Compare June 2, 2026 13:30

gabotechs merged commit a9e0df3 into main Jun 2, 2026
17 checks passed

gabotechs deleted the gabrielmusat/no-eager-buffering-in-remote-connections branch June 2, 2026 15:04

gabotechs mentioned this pull request Jun 8, 2026

Add cost model #486

Open
