
Adaptive task count assignation #432

Open
gabotechs wants to merge 1 commit into gabrielmusat/cost-calculation from gabrielmusat/dynamic-task-count


@gabotechs gabotechs commented May 4, 2026

Collaborator

This is one PR from the following stack of PRs:


This PR implements support for dynamic task count assignation, allowing the distributed planner to determine the optimal number of tasks for each stage based on runtime characteristics and available resources. This enables adaptive query execution where task counts are optimized for the specific execution context rather than being determined statically.

There are two fundamental pieces delivered in this PR that make adaptive task count assignation possible:

  1. SamplerExec
  2. prepare_dynamic_plan

SamplerExec

This is a new ExecutionPlan implementation that peeks at a few initial record batches before execution, gathers statistics over them, and then reports a LoadInfo message to the prepare_dynamic_plan function running on the coordinator:

                    ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐                       
                        ┌─────────────────────┐                           
                    │   │                     │   │                       
   ┌───────────────────▶│   DistributedExec   │◀──────────────────────┐   
   │                │   │                     │   │                   │   
   │                    └─────────────────────┘                       │   
┌──┴─────┐          │   ┌─────────────────────┐   │             ┌─────┴──┐
│LoadInfo│              │                     │                 │LoadInfo│
└──┬─────┘          │   │         ...         │   │             └─────┬──┘
   │                    │                     │                       │   
   │                │   └─────────────────────┘   │                   │   
   │                 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─                    │   
   │                                                                  │   
   │ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐│   
   │     ┌─────────────────────┐           ┌─────────────────────┐    │   
   │ │   │    ProducerHead     │   │   │   │    ProducerHead     │   ││   
   │     │ (RepartitionExec or │           │ (RepartitionExec or │    │   
   │ │   │   BroadcastExec)    │   │   │   │   BroadcastExec)    │   ││   
   │     └─────────────────────┘           └─────────────────────┘    │   
   │ │   ┌─────────────────────┐   │   │   ┌─────────────────────┐   ││   
   │     │                     │           │                     │    │   
   └─┼───│     SamplerExec     │   │   │   │     SamplerExec     │───┼┘   
         │                     │           │                     │        
     │   └─────────────────────┘   │   │   └─────────────────────┘   │    
         ┌─────────────────────┐           ┌─────────────────────┐        
     │   │                     │   │   │   │                     │   │    
         │         ...         │           │         ...         │        
     │   │                     │   │   │   │                     │   │    
         └─────────────────────┘           └─────────────────────┘        
     │   ┌─────────────────────┐   │   │   ┌─────────────────────┐   │    
         │                     │           │                     │        
     │   │   DataSourceExec    │   │   │   │   DataSourceExec    │   │    
         │                     │           │                     │        
     │   └─────────────────────┘   │   │   └─────────────────────┘   │    
      ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─     ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─     

Each SamplerExec node contains one PartitionSampler per partition, and each partition is sampled independently:

┌────────────────────────────────────────────────────────────────────────────┐
│                                SamplerExec                                 │
│ ┌────────────────┐┌────────────────┐                    ┌────────────────┐ │
│ │                ││                │                    │                │ │
│ │PartitionSampler││PartitionSampler│         ...        │PartitionSampler│ │
│ │                ││                │                    │                │ │
│ └────────────────┘└────────────────┘                    └────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘

Before actual execution, a PartitionSampler peeks at a representative number of RecordBatches and gathers stats over them (per-column byte count, per-column NDV, per-column null count, total row count, etc.). To do so, the following things happen:

  1. Pulls all the RecordBatches that are available synchronously without async gaps. Some operators like Aggregate(Partial) retain big internal representations in memory that can yield more than 1 RecordBatch synchronously, because they were already computed and stored in memory.
┌───────────────────────────────────────────────────────────────┐
│                       PartitionSampler                        │
│                                                               │
│┌───────────┬───────────┬───────────┐                          │
││           │           │           │                          │
││RecordBatch│RecordBatch│RecordBatch│                          │
││           │           │           │                          │
│└───────────┴───────────┴───────────┘                          │
└───────────────────────────────────────────────────────────────┘
  2. Once an async gap is reached, it pulls just 1 more RecordBatch and measures the time spent in this async gap:
┌───────────────────────────────────────────────────────────────┐
│                       PartitionSampler                        │
│                                                               │
│┌───────────┬───────────┬───────────┐             ┌───────────┐│
││           │           │           │             │           ││
││RecordBatch│RecordBatch│RecordBatch│  async wait │RecordBatch││
││           │           │           │             │           ││
│└───────────┴───────────┴───────────┘             └───────────┘│
└───────────────────────────────────────────────────────────────┘
  3. With this information, two useful signals are available:
  • Bytes ready to be yielded immediately
  • The velocity at which new RecordBatches are arriving
┌───────────────────────────────────────────────────────────────┐
│                       PartitionSampler                        │
│                                                               │
│┌───────────┬───────────┬───────────┐             ┌───────────┐│
││           │           │           │             │           ││
││RecordBatch│RecordBatch│RecordBatch│  async wait │RecordBatch││
││           │           │           │             │           ││
│└─────┬─────┴─────┬─────┴───┬─────┬─┘             └────▲────┬─┘│
└──────┼───────────┼─────────┼─────┼────────────────────┼────┼──┘
       │           │         │     ─────────────────────│    │   
       │           │         │                               │   
       │           │         │      RecordBatch velocity     │   
       │           │         │                               │   
       │           │         │                               │   
       │           │         │                               │   
       │           │         │     ┌─────────────────────────┘   
       ▼           ▼         ▼     ▼                             
   Batches to be yielded immediately                                                                                         
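
As a rough illustration of the three steps above, here is a minimal std-only model of one partition being sampled. It is a sketch, not the code in this PR: `Poll`, `PartitionSample` and `sample_partition` are hypothetical names, and the real sampler works over async record batch streams rather than a pre-recorded slice of poll outcomes.

```rust
use std::time::Duration;

/// One observed poll of the sampled stream (hypothetical model, not the PR's types).
#[derive(Debug, Clone, Copy)]
pub enum Poll {
    /// A RecordBatch of this many bytes was returned without waiting.
    Ready(u64),
    /// The stream waited this long (an async gap), then returned a batch of this many bytes.
    Waited(Duration, u64),
}

/// What the sampler learned about a single partition.
#[derive(Debug, PartialEq)]
pub struct PartitionSample {
    /// Bytes that were available synchronously, before the first async gap.
    pub bytes_ready: u64,
    /// Bytes per second observed across the first async gap, if one was reached.
    pub bytes_per_sec: Option<f64>,
}

pub fn sample_partition(polls: &[Poll]) -> PartitionSample {
    let mut bytes_ready = 0u64;
    for poll in polls {
        match *poll {
            // Step 1: keep draining batches that are ready immediately.
            Poll::Ready(bytes) => bytes_ready += bytes,
            // Step 2: on the first async gap, take one more batch, time it, and stop.
            Poll::Waited(gap, bytes) => {
                let secs = gap.as_secs_f64();
                let bytes_per_sec = if secs > 0.0 {
                    Some(bytes as f64 / secs)
                } else {
                    None
                };
                // Step 3: both signals are now known for this partition.
                return PartitionSample { bytes_ready, bytes_per_sec };
            }
        }
    }
    // The stream ended before any async gap: there is no velocity to report.
    PartitionSample { bytes_ready, bytes_per_sec: None }
}
```

Anything polled after the first async gap is ignored in this model, which mirrors the "pull just 1 more" rule.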

This velocity information is what really matters when deciding how many tasks the next stage should get.

  • Scanning a lot of data is not a problem if it can be done in a streaming fashion and the DataSources are bottlenecked on IO. In this situation, 1 node might be enough.
  • If DataSources yield data very quickly, or if the compute cost over that data is very high, even a small amount of data can bottleneck the query on CPU power, so we'll benefit from more CPUs (more machines).
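
To make the two bullets concrete, the sketch below uses a simple throughput-ratio heuristic: give the consuming stage enough tasks to keep up with the rate at which data arrives. This is an assumption made for illustration, not the cost model used in this PR, and `tasks_for_velocity` and its parameters are hypothetical.

```rust
/// Hypothetical heuristic: how many tasks are needed so that consumers keep up
/// with the observed arrival rate. A more expensive computation is modelled as
/// a lower per-task capacity.
pub fn tasks_for_velocity(
    arrival_bytes_per_sec: f64,
    task_capacity_bytes_per_sec: f64,
    max_tasks: usize,
) -> usize {
    // Missing, zero, or NaN measurements fall back to a single task.
    if !(arrival_bytes_per_sec > 0.0) || !(task_capacity_bytes_per_sec > 0.0) {
        return 1;
    }
    let needed = (arrival_bytes_per_sec / task_capacity_bytes_per_sec).ceil();
    // Float-to-integer `as` casts saturate, so very large ratios are safe here.
    (needed as usize).clamp(1, max_tasks.max(1))
}
```

With an IO-bound source (say 50 MB/s arriving against 200 MB/s of per-task capacity) this returns 1, while the same 50 MB/s against an expensive computation that only sustains 10 MB/s per task returns 5.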

prepare_dynamic_plan

This is analogous to the existing prepare_static_plan, but with a runtime-based dynamic task count assignation.

In order to understand how it works, let's use the classical aggregation example:

┌───────────────────┐
│CoalescePartitions │
└───────────────────┘
┌───────────────────┐
│  ProjectionExec   │
└───────────────────┘
┌───────────────────┐
│ Aggregate(final)  │
└───────────────────┘
┌───────────────────┐
│  RepartitionExec  │
└───────────────────┘
┌───────────────────┐
│Aggregate(partial) │
└───────────────────┘
┌───────────────────┐
│    FilterExec     │
└───────────────────┘
┌───────────────────┐
│  DataSourceExec   │
└───────────────────┘

The plan is walked from bottom to top until a stage is delimited, same as before:

                  Stage 1 
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ 
  ┌───────────────────┐   
│ │  RepartitionExec  │ │ 
  └───────────────────┘   
│ ┌───────────────────┐ │ 
  │Aggregate(partial) │   
│ └───────────────────┘ │ 
  ┌───────────────────┐   
│ │    FilterExec     │ │ 
  └───────────────────┘   
│ ┌───────────────────┐ │ 
  │  DataSourceExec   │   
│ └───────────────────┘ │ 
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  

We'll use runtime information of "Stage 1" in order to determine the task count for the future "Stage 2". For that, the plan is modified and a SamplerExec is inserted just below the producer head (RepartitionExec):

                  Stage 1           
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐           
  ┌───────────────────┐             
│ │  RepartitionExec  │ │           
  └───────────────────┘             
│ ┌───────────────────┐ │           
  │    SamplerExec    │◀────────────
│ └───────────────────┘ │           
  ┌───────────────────┐             
│ │Aggregate(partial) │ │           
  └───────────────────┘             
│ ┌───────────────────┐ │           
  │    FilterExec     │             
│ └───────────────────┘ │           
  ┌───────────────────┐             
│ │  DataSourceExec   │ │           
  └───────────────────┘             
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘           
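
The rewrite shown in the diagram can be sketched with a toy single-child plan tree. The `Plan` enum and the helpers below are hypothetical stand-ins for illustration only. The real code operates on DataFusion ExecutionPlan nodes, and how it detects the producer head is not shown here.

```rust
/// Toy single-child plan tree (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Leaf(&'static str),
    Node(&'static str, Box<Plan>),
}

/// The producer head is either a RepartitionExec or a BroadcastExec.
fn is_producer_head(name: &str) -> bool {
    matches!(name, "RepartitionExec" | "BroadcastExec")
}

/// Inserts a SamplerExec just below the first producer head found from the top.
pub fn insert_sampler(plan: Plan) -> Plan {
    match plan {
        Plan::Node(name, child) if is_producer_head(name) => {
            Plan::Node(name, Box::new(Plan::Node("SamplerExec", child)))
        }
        Plan::Node(name, child) => Plan::Node(name, Box::new(insert_sampler(*child))),
        leaf @ Plan::Leaf(_) => leaf,
    }
}

/// Builds a linear plan from top to bottom, e.g. ["RepartitionExec", ..., "DataSourceExec"].
pub fn chain(names: &[&'static str]) -> Plan {
    let (last, rest) = names.split_last().expect("at least one operator");
    rest.iter()
        .rev()
        .fold(Plan::Leaf(*last), |child, name| Plan::Node(*name, Box::new(child)))
}

/// Lists operator names from top to bottom.
pub fn names(plan: &Plan) -> Vec<&'static str> {
    let mut out = Vec::new();
    let mut cur = plan;
    loop {
        match cur {
            Plan::Leaf(n) => {
                out.push(*n);
                return out;
            }
            Plan::Node(n, child) => {
                out.push(*n);
                cur = &**child;
            }
        }
    }
}
```

Applying `insert_sampler` to the "Stage 1" chain yields RepartitionExec, SamplerExec, Aggregate(partial), FilterExec, DataSourceExec, which matches the diagram.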

Let's say that the TaskEstimator for the leaf node DataSourceExec returned a Desired(3) task estimation. This subplan is then immediately sent to three workers. As soon as the plan is set on the workers, all the SamplerExecs are kicked off and start sampling, even before any call to .execute() on any node:

┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─Stage 1 
   ┌ ─ ─ ─ ─ ─ ─ ─Worker 1  ┌ ─ ─ ─ ─ ─ ─ ─Worker 2  ┌ ─ ─ ─ ─ ─ ─ ─Worker 3       
│   ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐   │    
   ││  RepartitionExec  ││  ││  RepartitionExec  ││  ││  RepartitionExec  ││       
│   └───────────────────┘    └───────────────────┘    └───────────────────┘   │    
   │┌───────────────────┐│  │┌───────────────────┐│  │┌───────────────────┐│       
│   │    SamplerExec    │    │    SamplerExec    │    │    SamplerExec    │   │    
   │└───────────────────┘│  │└───────────────────┘│  │└───────────────────┘│       
│             │                        │                        │             │    
   │   Eager kick off    │  │   Eager kick off    │  │   Eager kick off    │       
│             ▼                        ▼                        ▼             │    
   │┌───────────────────┐│  │┌───────────────────┐│  │┌───────────────────┐│       
│   │Aggregate(partial) │    │Aggregate(partial) │    │Aggregate(partial) │   │    
   │└───────────────────┘│  │└───────────────────┘│  │└───────────────────┘│       
│   ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐   │    
   ││    FilterExec     ││  ││    FilterExec     ││  ││    FilterExec     ││       
│   └───────────────────┘    └───────────────────┘    └───────────────────┘   │    
   │┌───────────────────┐│  │┌───────────────────┐│  │┌───────────────────┐│       
│   │  DataSourceExec   │    │  DataSourceExec   │    │  DataSourceExec   │   │    
   │└───────────────────┘│  │└───────────────────┘│  │└───────────────────┘│       
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │    
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─     

All the SamplerExecs start sampling and reporting LoadInfo messages to the prepare_dynamic_plan function, which builds the next NetworkShuffleExec boundary with pre-defined Statistics, namely the ones that were collected at runtime by the SamplerExecs below:

                               ┌───────────────────┐ ─┐
                               │  ProjectionExec   │  │
                               └───────────────────┘  │
                               ┌───────────────────┐  │
                               │ Aggregate(final)  │  │ Still not planned
                               └───────────────────┘  │
                               ┌───────────────────┐  │
                               │NetworkShuffleExec │  │
                               │ (+ runtime stats) │  │
                               └───────▲──▲──▲─────┘ ─┘
                                       │  │  │                                       
┌──────────────────────────────────────┘  │  └────────┐                              
│                            ┌────────────┘           │                              
│                            │                        │                              
│ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─Stage 1 
│    ┌ ─ ─ ─ ─ ─ ─ ─Worker 1 │┌ ─ ─ ─ ─ ─ ─ ─Worker 2 │┌ ─ ─ ─ ─ ─ ─ ─Worker 3       
│ │   ┌───────────────────┐  │ ┌───────────────────┐  │ ┌───────────────────┐   │    
│    ││  RepartitionExec  ││ │││  RepartitionExec  ││ │││  RepartitionExec  ││       
│ │   └───────────────────┘  │ └───────────────────┘  │ └───────────────────┘   │    
│    │┌───────────────────┐│ ││┌───────────────────┐│ ││┌───────────────────┐│       
└─┼───┤    SamplerExec    │  └─┤    SamplerExec    │  └─┤    SamplerExec    │   │    
     │└───────────────────┘│  │└───────────────────┘│  │└───────────────────┘│       
  │             │                        │                        │             │    
     │   Eager kick off    │  │   Eager kick off    │  │   Eager kick off    │       
  │             ▼                        ▼                        ▼             │    
     │┌───────────────────┐│  │┌───────────────────┐│  │┌───────────────────┐│       
  │   │Aggregate(partial) │    │Aggregate(partial) │    │Aggregate(partial) │   │    
     │└───────────────────┘│  │└───────────────────┘│  │└───────────────────┘│       
  │   ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐   │    
     ││    FilterExec     ││  ││    FilterExec     ││  ││    FilterExec     ││       
  │   └───────────────────┘    └───────────────────┘    └───────────────────┘   │    
     │┌───────────────────┐│  │┌───────────────────┐│  │┌───────────────────┐│       
  │   │  DataSourceExec   │    │  DataSourceExec   │    │  DataSourceExec   │   │    
     │└───────────────────┘│  │└───────────────────┘│  │└───────────────────┘│       
  │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │    
   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─     

While delimiting the next stage, the one that contains the ProjectionExec and the Aggregate(final), the task count for that stage is decided purely from its compute cost, based on stats. At this point, the stats we have for that slice of the plan are very accurate, as we managed to gather them at runtime and condense them in the NetworkShuffleExec, which at that point acts as a leaf node (its stage was already set to Stage::Remote because it was already sent to the workers and started sampling).

Based on the compute cost inferred from the runtime stats, we realize that there's very little data flowing through the SamplerExecs. As a consequence, the compute cost of "Stage 2" is estimated to be very low, so we collapse early to a single node:

┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─Stage 2   
                            ┌ ─ ─ ─ ─ ─ ─ ─Worker 1                                
│                            ┌───────────────────┐                            │    
                            ││  ProjectionExec   ││                                
│                            └───────────────────┘                            │    
                            │┌───────────────────┐│                                
│                            │ Aggregate(final)  │  Very little data flowing, │    
                            │└───────────────────┘│     1 task is enough           
│                            ┌───────────────────┐                            │    
                            ││NetworkShuffleExec ││                                
│                            │ (+ runtime stats) │                            │    
                            │└───────────────────┘│                                
│                            ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─                            │    
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─     
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─Stage 1 
   ┌ ─ ─ ─ ─ ─ ─ ─Worker 1  ┌ ─ ─ ─ ─ ─ ─ ─Worker 2  ┌ ─ ─ ─ ─ ─ ─ ─Worker 3       
│   ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐   │    
   ││  RepartitionExec  ││  ││  RepartitionExec  ││  ││  RepartitionExec  ││       
│   └───────────────────┘    └───────────────────┘    └───────────────────┘   │    
   │┌───────────────────┐│  │┌───────────────────┐│  │┌───────────────────┐│       
│   │    SamplerExec    │    │    SamplerExec    │    │    SamplerExec    │   │    
   │└───────────────────┘│  │└───────────────────┘│  │└───────────────────┘│       
│   ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐   │    
   ││Aggregate(partial) ││  ││Aggregate(partial) ││  ││Aggregate(partial) ││       
│   └───────────────────┘    └───────────────────┘    └───────────────────┘   │    
   │┌───────────────────┐│  │┌───────────────────┐│  │┌───────────────────┐│       
│   │    FilterExec     │    │    FilterExec     │    │    FilterExec     │   │    
   │└───────────────────┘│  │└───────────────────┘│  │└───────────────────┘│       
│   ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐   │    
   ││  DataSourceExec   ││  ││  DataSourceExec   ││  ││  DataSourceExec   ││       
│   └───────────────────┘    └───────────────────┘    └───────────────────┘   │    
   └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘  └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘       
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘    
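
The "condense the stats, then decide" step can be sketched as follows. Everything here is an assumption made for illustration: the fields of `LoadInfoSketch`, the NDV bounds, and the bytes-per-task threshold are hypothetical and do not reflect the actual LoadInfo message or the cost model in this PR.

```rust
/// Hypothetical per-worker report. The real LoadInfo fields are not shown in this PR description.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LoadInfoSketch {
    pub rows: u64,
    pub bytes: u64,
    /// Distinct values seen for one column on this worker.
    pub ndv: u64,
}

/// Stats condensed across all workers of a stage.
#[derive(Debug, PartialEq)]
pub struct MergedStats {
    pub rows: u64,
    pub bytes: u64,
    /// NDV cannot simply be summed: workers may have seen the same values.
    pub ndv_min: u64,
    pub ndv_max: u64,
}

pub fn merge(reports: &[LoadInfoSketch]) -> MergedStats {
    let rows: u64 = reports.iter().map(|r| r.rows).sum();
    let bytes: u64 = reports.iter().map(|r| r.bytes).sum();
    // Lower bound: every worker saw the same values. Upper bound: no overlap at all.
    let ndv_min = reports.iter().map(|r| r.ndv).max().unwrap_or(0);
    let ndv_sum: u64 = reports.iter().map(|r| r.ndv).sum();
    MergedStats { rows, bytes, ndv_min, ndv_max: ndv_sum.min(rows) }
}

/// Hypothetical decision rule: one task per `bytes_per_task` of input, capped at `max_tasks`.
pub fn stage_task_count(stats: &MergedStats, bytes_per_task: u64, max_tasks: u64) -> u64 {
    let per = bytes_per_task.max(1);
    // Ceiling division without overflow.
    let needed = stats.bytes / per + u64::from(stats.bytes % per != 0);
    needed.clamp(1, max_tasks.max(1))
}
```

With three workers reporting around 1 KB each and a 64 MiB per-task budget, `stage_task_count` returns 1, which corresponds to the single-node "Stage 2" in the diagram.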

Benchmarks

Obtained by comparing runs of this same branch with --dynamic false vs --dynamic true:

TPCH SF1: Tasks 279 -> 318, prev=14631 ms, new=6727 ms, diff=2.17 faster ✅
      q1: prev= 319 ms, new= 307 ms, diff=1.04 faster ✔
      q2: prev=1054 ms, new= 393 ms, diff=2.68 faster ✅
      q3: prev= 989 ms, new= 504 ms, diff=1.96 faster ✅
      q4: prev= 436 ms, new= 449 ms, diff=1.03 slower ✖
      q5: prev=1019 ms, new= 314 ms, diff=3.25 faster ✅
      q6: prev= 255 ms, new= 206 ms, diff=1.24 faster ✅
      q7: prev= 934 ms, new= 310 ms, diff=3.01 faster ✅
      q8: prev=1307 ms, new= 467 ms, diff=2.80 faster ✅
      q9: prev=1208 ms, new= 331 ms, diff=3.65 faster ✅
     q10: prev= 585 ms, new= 381 ms, diff=1.54 faster ✅
     q11: prev= 380 ms, new= 183 ms, diff=2.08 faster ✅
     q12: prev= 328 ms, new= 170 ms, diff=1.93 faster ✅
     q13: prev= 477 ms, new= 176 ms, diff=2.71 faster ✅
     q14: prev= 644 ms, new= 222 ms, diff=2.90 faster ✅
     q15: prev= 946 ms, new= 175 ms, diff=5.41 faster ✅
     q16: prev= 268 ms, new= 297 ms, diff=1.11 slower ✖
     q17: prev= 828 ms, new= 281 ms, diff=2.95 faster ✅
     q18: prev= 838 ms, new= 261 ms, diff=3.21 faster ✅
     q19: prev= 424 ms, new= 313 ms, diff=1.35 faster ✅
     q20: prev= 374 ms, new= 200 ms, diff=1.87 faster ✅
     q21: prev= 815 ms, new= 608 ms, diff=1.34 faster ✅
     q22: prev= 203 ms, new= 179 ms, diff=1.13 faster ✔
   TOTAL: prev=14631 ms, new=6727 ms, diff=2.17 faster ✅
TPCH SF10: Tasks 363 -> 734, prev=35577 ms, new=13014 ms, diff=2.73 faster ✅
      q1: prev=1498 ms, new= 604 ms, diff=2.48 faster ✅
      q2: prev=2258 ms, new= 529 ms, diff=4.27 faster ✅
      q3: prev=2005 ms, new= 567 ms, diff=3.54 faster ✅
      q4: prev= 841 ms, new= 468 ms, diff=1.80 faster ✅
      q5: prev=2051 ms, new= 701 ms, diff=2.93 faster ✅
      q6: prev= 679 ms, new= 575 ms, diff=1.18 faster ✔
      q7: prev=1810 ms, new= 723 ms, diff=2.50 faster ✅
      q8: prev=3028 ms, new= 769 ms, diff=3.94 faster ✅
      q9: prev=3011 ms, new= 885 ms, diff=3.40 faster ✅
     q10: prev=1699 ms, new= 764 ms, diff=2.22 faster ✅
     q11: prev=1142 ms, new= 304 ms, diff=3.76 faster ✅
     q12: prev= 913 ms, new= 427 ms, diff=2.14 faster ✅
     q13: prev= 789 ms, new= 925 ms, diff=1.17 slower ✖
     q14: prev=1261 ms, new= 489 ms, diff=2.58 faster ✅
     q15: prev=2132 ms, new= 541 ms, diff=3.94 faster ✅
     q16: prev= 610 ms, new= 242 ms, diff=2.52 faster ✅
     q17: prev=1771 ms, new= 589 ms, diff=3.01 faster ✅
     q18: prev=2162 ms, new= 795 ms, diff=2.72 faster ✅
     q19: prev=1157 ms, new= 480 ms, diff=2.41 faster ✅
     q20: prev=1350 ms, new= 669 ms, diff=2.02 faster ✅
     q21: prev=2830 ms, new= 747 ms, diff=3.79 faster ✅
     q22: prev= 580 ms, new= 221 ms, diff=2.62 faster ✅
   TOTAL: prev=35577 ms, new=13014 ms, diff=2.73 faster ✅
TPCH SF100: Tasks 387 -> 1175, prev=307187 ms, new=62943 ms, diff=4.88 faster ✅
      q1: prev=16724 ms, new=3086 ms, diff=5.42 faster ✅
      q2: prev=3282 ms, new=1038 ms, diff=3.16 faster ✅
      q3: prev=14599 ms, new=2683 ms, diff=5.44 faster ✅
      q4: prev=9351 ms, new=1378 ms, diff=6.79 faster ✅
      q5: prev=16400 ms, new=3628 ms, diff=4.52 faster ✅
      q6: prev=9168 ms, new=1552 ms, diff=5.91 faster ✅
      q7: prev=17370 ms, new=4049 ms, diff=4.29 faster ✅
      q8: prev=22325 ms, new=4329 ms, diff=5.16 faster ✅
      q9: prev=24036 ms, new=5788 ms, diff=4.15 faster ✅
     q10: prev=14957 ms, new=5010 ms, diff=2.99 faster ✅
     q11: prev=2642 ms, new= 852 ms, diff=3.10 faster ✅
     q12: prev=10561 ms, new=1726 ms, diff=6.12 faster ✅
     q13: prev=4839 ms, new=1503 ms, diff=3.22 faster ✅
     q14: prev=9423 ms, new=1625 ms, diff=5.80 faster ✅
     q15: prev=26224 ms, new=3120 ms, diff=8.41 faster ✅
     q16: prev=1426 ms, new= 566 ms, diff=2.52 faster ✅
     q17: prev=21121 ms, new=4623 ms, diff=4.57 faster ✅
     q18: prev=25218 ms, new=5256 ms, diff=4.80 faster ✅
     q19: prev=11387 ms, new=1751 ms, diff=6.50 faster ✅
     q20: prev=13517 ms, new=2141 ms, diff=6.31 faster ✅
     q21: prev=29777 ms, new=6310 ms, diff=4.72 faster ✅
     q22: prev=2840 ms, new= 929 ms, diff=3.06 faster ✅
   TOTAL: prev=307187 ms, new=62943 ms, diff=4.88 faster ✅
TPCDS SF1: Tasks 3079->2770, prev=62815 ms, new=44737 ms, diff=1.40 faster ✅
      q1: prev=1039 ms, new= 531 ms, diff=1.96 faster ✅
      q2: prev= 677 ms, new= 396 ms, diff=1.71 faster ✅
      q3: prev= 618 ms, new= 288 ms, diff=2.15 faster ✅
      q4: prev=3420 ms, new=2582 ms, diff=1.32 faster ✅
      q5: prev= 767 ms, new= 554 ms, diff=1.38 faster ✅
      q6: prev=1002 ms, new= 967 ms, diff=1.04 faster ✔
      q7: prev= 614 ms, new= 421 ms, diff=1.46 faster ✅
      q8: prev= 484 ms, new= 422 ms, diff=1.15 faster ✔
      q9: prev= 267 ms, new= 341 ms, diff=1.28 slower ❌
     q10: prev= 648 ms, new= 826 ms, diff=1.27 slower ❌
     q11: prev=2281 ms, new=1663 ms, diff=1.37 faster ✅
     q12: prev= 287 ms, new= 229 ms, diff=1.25 faster ✅
     q13: prev= 962 ms, new= 720 ms, diff=1.34 faster ✅
     q14: prev=1095 ms, new= 737 ms, diff=1.49 faster ✅
     q15: prev= 314 ms, new= 212 ms, diff=1.48 faster ✅
     q16: prev= 528 ms, new= 597 ms, diff=1.13 slower ✖
     q17: prev= 768 ms, new= 257 ms, diff=2.99 faster ✅
     q18: prev= 780 ms, new= 339 ms, diff=2.30 faster ✅
     q19: prev= 280 ms, new= 269 ms, diff=1.04 faster ✔
     q20: prev= 180 ms, new= 169 ms, diff=1.07 faster ✔
     q21: prev= 708 ms, new= 382 ms, diff=1.85 faster ✅
     q22: prev= 490 ms, new= 435 ms, diff=1.13 faster ✔
     q23: prev=1149 ms, new= 567 ms, diff=2.03 faster ✅
     q24: prev=1148 ms, new= 586 ms, diff=1.96 faster ✅
     q25: prev= 558 ms, new= 275 ms, diff=2.03 faster ✅
     q26: prev= 346 ms, new= 265 ms, diff=1.31 faster ✅
     q27: prev= 480 ms, new= 328 ms, diff=1.46 faster ✅
     q28: prev= 329 ms, new= 221 ms, diff=1.49 faster ✅
     q29: prev= 663 ms, new= 251 ms, diff=2.64 faster ✅
     q31: prev=1268 ms, new= 218 ms, diff=5.82 faster ✅
     q32: prev= 426 ms, new= 190 ms, diff=2.24 faster ✅
     q33: prev= 429 ms, new= 321 ms, diff=1.34 faster ✅
     q34: prev= 424 ms, new= 373 ms, diff=1.14 faster ✔
     q35: prev= 655 ms, new= 546 ms, diff=1.20 faster ✔
     q36: prev= 388 ms, new= 290 ms, diff=1.34 faster ✅
     q37: prev= 389 ms, new= 384 ms, diff=1.01 faster ✔
     q38: prev= 568 ms, new= 252 ms, diff=2.25 faster ✅
     q39: prev= 547 ms, new= 479 ms, diff=1.14 faster ✔
     q40: prev= 612 ms, new= 417 ms, diff=1.47 faster ✅
     q41: prev= 128 ms, new= 143 ms, diff=1.12 slower ✖
     q42: prev= 331 ms, new= 139 ms, diff=2.38 faster ✅
     q43: prev= 158 ms, new= 204 ms, diff=1.29 slower ❌
     q44: prev= 514 ms, new= 353 ms, diff=1.46 faster ✅
     q45: prev= 398 ms, new= 237 ms, diff=1.68 faster ✅
     q46: prev= 641 ms, new= 495 ms, diff=1.29 faster ✅
     q47: prev= 964 ms, new= 488 ms, diff=1.98 faster ✅
     q48: prev= 514 ms, new= 418 ms, diff=1.23 faster ✅
     q49: prev= 550 ms, new= 314 ms, diff=1.75 faster ✅
     q50: prev= 549 ms, new= 445 ms, diff=1.23 faster ✅
     q51: prev= 425 ms, new= 226 ms, diff=1.88 faster ✅
     q52: prev= 252 ms, new= 117 ms, diff=2.15 faster ✅
     q53: prev= 202 ms, new= 223 ms, diff=1.10 slower ✖
     q54: prev= 679 ms, new= 383 ms, diff=1.77 faster ✅
     q55: prev= 191 ms, new= 155 ms, diff=1.23 faster ✅
     q56: prev= 448 ms, new= 332 ms, diff=1.35 faster ✅
     q57: prev= 673 ms, new= 288 ms, diff=2.34 faster ✅
     q58: prev= 644 ms, new= 289 ms, diff=2.23 faster ✅
     q59: prev= 447 ms, new= 294 ms, diff=1.52 faster ✅
     q60: prev= 301 ms, new= 291 ms, diff=1.03 faster ✔
     q61: prev=1017 ms, new=1095 ms, diff=1.08 slower ✖
     q62: prev= 776 ms, new= 761 ms, diff=1.02 faster ✔
     q63: prev= 243 ms, new= 223 ms, diff=1.09 faster ✔
     q64: prev=1907 ms, new=1370 ms, diff=1.39 faster ✅
     q65: prev= 432 ms, new= 259 ms, diff=1.67 faster ✅
     q66: prev= 839 ms, new= 781 ms, diff=1.07 faster ✔
     q67: prev= 486 ms, new= 389 ms, diff=1.25 faster ✅
     q68: prev= 432 ms, new= 373 ms, diff=1.16 faster ✔
     q69: prev= 596 ms, new= 618 ms, diff=1.04 slower ✖
     q70: prev= 440 ms, new= 541 ms, diff=1.23 slower ❌
     q71: prev= 406 ms, new= 298 ms, diff=1.36 faster ✅
     q72: prev=5276 ms, new=3156 ms, diff=1.67 faster ✅
     q73: prev= 259 ms, new= 294 ms, diff=1.14 slower ✖
     q74: prev= 733 ms, new= 656 ms, diff=1.12 faster ✔
     q75: prev=1004 ms, new= 542 ms, diff=1.85 faster ✅
     q76: prev= 362 ms, new= 232 ms, diff=1.56 faster ✅
     q77: prev= 549 ms, new= 305 ms, diff=1.80 faster ✅
     q78: prev= 954 ms, new= 299 ms, diff=3.19 faster ✅
     q79: prev= 278 ms, new= 266 ms, diff=1.05 faster ✔
     q80: prev= 689 ms, new= 398 ms, diff=1.73 faster ✅
     q81: prev= 441 ms, new= 344 ms, diff=1.28 faster ✅
     q82: prev= 315 ms, new= 310 ms, diff=1.02 faster ✔
     q83: prev= 551 ms, new= 299 ms, diff=1.84 faster ✅
     q84: prev= 321 ms, new= 359 ms, diff=1.12 slower ✖
     q85: prev= 594 ms, new= 545 ms, diff=1.09 faster ✔
     q86: prev= 168 ms, new= 186 ms, diff=1.11 slower ✖
     q87: prev= 394 ms, new= 283 ms, diff=1.39 faster ✅
     q88: prev= 479 ms, new= 484 ms, diff=1.01 slower ✖
     q89: prev= 262 ms, new= 207 ms, diff=1.27 faster ✅
     q90: prev= 352 ms, new= 274 ms, diff=1.28 faster ✅
     q91: prev= 541 ms, new= 404 ms, diff=1.34 faster ✅
     q92: prev= 326 ms, new= 207 ms, diff=1.57 faster ✅
     q93: prev= 287 ms, new= 226 ms, diff=1.27 faster ✅
     q94: prev= 401 ms, new= 451 ms, diff=1.12 slower ✖
     q95: prev= 444 ms, new= 354 ms, diff=1.25 faster ✅
     q96: prev= 193 ms, new= 235 ms, diff=1.22 slower ❌
     q97: prev= 268 ms, new= 174 ms, diff=1.54 faster ✅
     q98: prev= 165 ms, new= 176 ms, diff=1.07 slower ✖
     q99: prev=1038 ms, new=1229 ms, diff=1.18 slower ✖
   TOTAL: prev=62815 ms, new=44737 ms, diff=1.40 faster ✅
ClickBench 0-100: Tasks 912->609, prev=35116 ms, new=34185 ms, diff=1.03 faster ✔
      q0: prev=   2 ms, new=   2 ms, diff=1.00 slower ✖
      q1: prev= 506 ms, new= 396 ms, diff=1.28 faster ✅
      q2: prev= 477 ms, new= 337 ms, diff=1.42 faster ✅
      q3: prev= 323 ms, new= 452 ms, diff=1.40 slower ❌
      q4: prev= 405 ms, new= 439 ms, diff=1.08 slower ✖
      q5: prev= 585 ms, new= 555 ms, diff=1.05 faster ✔
      q6: prev=   2 ms, new=   2 ms, diff=1.00 slower ✖
      q7: prev= 253 ms, new= 363 ms, diff=1.43 slower ❌
      q8: prev= 492 ms, new= 477 ms, diff=1.03 faster ✔
      q9: prev= 620 ms, new= 635 ms, diff=1.02 slower ✖
     q10: prev= 478 ms, new= 461 ms, diff=1.04 faster ✔
     q11: prev= 449 ms, new= 392 ms, diff=1.15 faster ✔
     q12: prev= 596 ms, new= 550 ms, diff=1.08 faster ✔
     q13: prev= 771 ms, new= 792 ms, diff=1.03 slower ✖
     q14: prev= 536 ms, new= 549 ms, diff=1.02 slower ✖
     q15: prev= 477 ms, new= 414 ms, diff=1.15 faster ✔
     q16: prev= 775 ms, new= 753 ms, diff=1.03 faster ✔
     q17: prev= 749 ms, new= 726 ms, diff=1.03 faster ✔
     q18: prev=1114 ms, new=1038 ms, diff=1.07 faster ✔
     q19: prev= 380 ms, new= 391 ms, diff=1.03 slower ✖
     q20: prev=1886 ms, new=1786 ms, diff=1.06 faster ✔
     q21: prev=1807 ms, new=1775 ms, diff=1.02 faster ✔
     q22: prev=2144 ms, new=2053 ms, diff=1.04 faster ✔
     q23: prev=5267 ms, new=5354 ms, diff=1.02 slower ✖
     q24: prev= 461 ms, new= 554 ms, diff=1.20 slower ❌
     q25: prev= 409 ms, new= 482 ms, diff=1.18 slower ✖
     q26: prev= 582 ms, new= 463 ms, diff=1.26 faster ✅
     q27: prev=1958 ms, new=1955 ms, diff=1.00 faster ✔
     q28: prev=2504 ms, new=2367 ms, diff=1.06 faster ✔
     q29: prev= 297 ms, new= 261 ms, diff=1.14 faster ✔
     q30: prev= 614 ms, new= 589 ms, diff=1.04 faster ✔
     q31: prev= 714 ms, new= 656 ms, diff=1.09 faster ✔
     q32: prev= 946 ms, new= 891 ms, diff=1.06 faster ✔
     q33: prev=2557 ms, new=2397 ms, diff=1.07 faster ✔
     q34: prev=2521 ms, new=2440 ms, diff=1.03 faster ✔
     q35: prev= 459 ms, new= 438 ms, diff=1.05 faster ✔
   TOTAL: prev=35116 ms, new=34185 ms, diff=1.03 faster ✔
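
For reading the numbers above: the diff column is the ratio between the larger and the smaller of the two timings. The sketch below reproduces that arithmetic. The 1.2 cut-off between ✔/✖ and ✅/❌ is inferred from the rows in this table rather than taken from the benchmark tool, so treat it as an assumption. `format_diff` is a hypothetical helper.

```rust
/// Formats the diff column for one benchmark row (hypothetical helper).
pub fn format_diff(prev_ms: u64, new_ms: u64) -> String {
    // Equal timings are reported as "slower" in the table (see ClickBench q0).
    let faster = new_ms < prev_ms;
    let (hi, lo) = if faster { (prev_ms, new_ms) } else { (new_ms, prev_ms) };
    let ratio = hi as f64 / lo as f64;
    // Assumed threshold: the table looks consistent with a cut-off at 1.2 on the raw ratio.
    let significant = ratio >= 1.2;
    let (word, mark) = match (faster, significant) {
        (true, true) => ("faster", "✅"),
        (true, false) => ("faster", "✔"),
        (false, true) => ("slower", "❌"),
        (false, false) => ("slower", "✖"),
    };
    format!("{ratio:.2} {word} {mark}")
}
```

For example, the TPCH SF1 total (prev=14631 ms, new=6727 ms) gives 14631 / 6727 ≈ 2.17, reported as "2.17 faster ✅".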

@gabotechs gabotechs changed the base branch from main to gabrielmusat/local-worker-connections May 4, 2026 12:50
@gabotechs gabotechs force-pushed the gabrielmusat/dynamic-task-count branch from 37a2990 to 9f3d99b Compare May 4, 2026 12:52
@gabotechs gabotechs changed the title Adaptative task count assignation Adaptive task count assignation May 4, 2026
@gabotechs gabotechs force-pushed the gabrielmusat/local-worker-connections branch from 024e0f1 to 5751a15 Compare May 6, 2026 19:46
@gabotechs gabotechs force-pushed the gabrielmusat/dynamic-task-count branch from 9f3d99b to 48cae57 Compare May 6, 2026 19:50
@gabotechs gabotechs force-pushed the gabrielmusat/local-worker-connections branch from 5751a15 to d2a57e1 Compare May 6, 2026 19:52
@gabotechs gabotechs force-pushed the gabrielmusat/dynamic-task-count branch from 48cae57 to 2aa45ce Compare May 6, 2026 19:52
gabotechs added a commit that referenced this pull request May 11, 2026
PR factored out from
#416.

This is one PR from the following stack of PRs:
- #422
<- you are here
- #424
- #416
- #425
- #426
- #427
- #432

Previously, we were force-propagating a max task count assignation
below
the NetworkBroadcast so that the remote build side never has more tasks
than the stage above.

In a dynamic task count assignation context, we can no longer do this,
as by the time
you realize a remote build side is going to have more tasks than the
stage above, the build side might have
already started executing, and by that time its task count is set in
stone.

This is fine: the build side of a broadcast may be arbitrarily
expensive
to execute. What matters there is not that the build side is cheap to
execute, but that it returns a small amount of data. A build side can
return
very little data (just a couple of rows) and still be very expensive to
execute.

This is actually the reason why there is a small speedup in benchmarks.

---

<details><summary>tpch_sf10 1.09 faster ✔</summary>

```text
=== Comparing tpch_sf10 results from engine 'datafusion-distributed-main' [prev] with 'datafusion-distributed-dynamic-task-allocation' [new] ===
      q1: prev=1269 ms, new=1286 ms, diff=1.01 slower ✖
      q2: prev= 390 ms, new= 395 ms, diff=1.01 slower ✖
      q3: prev= 784 ms, new= 826 ms, diff=1.05 slower ✖
      q4: prev= 413 ms, new= 392 ms, diff=1.05 faster ✔
      q5: prev=1306 ms, new=1242 ms, diff=1.05 faster ✔
      q6: prev= 534 ms, new= 528 ms, diff=1.01 faster ✔
      q7: prev=1483 ms, new=1420 ms, diff=1.04 faster ✔
      q8: prev=3001 ms, new=1585 ms, diff=1.89 faster ✅
      q9: prev=2054 ms, new=2009 ms, diff=1.02 faster ✔
     q10: prev= 951 ms, new= 921 ms, diff=1.03 faster ✔
     q11: prev= 322 ms, new= 304 ms, diff=1.06 faster ✔
     q12: prev= 670 ms, new= 676 ms, diff=1.01 slower ✖
     q13: prev= 624 ms, new= 613 ms, diff=1.02 faster ✔
     q14: prev= 594 ms, new= 546 ms, diff=1.09 faster ✔
     q15: prev= 778 ms, new= 756 ms, diff=1.03 faster ✔
     q16: prev= 223 ms, new= 219 ms, diff=1.02 faster ✔
     q17: prev=1644 ms, new=1733 ms, diff=1.05 slower ✖
     q18: prev=1884 ms, new=1966 ms, diff=1.04 slower ✖
     q19: prev= 802 ms, new= 727 ms, diff=1.10 faster ✔
     q20: prev= 784 ms, new= 706 ms, diff=1.11 faster ✔
     q21: prev=2112 ms, new=1925 ms, diff=1.10 faster ✔
     q22: prev= 251 ms, new= 261 ms, diff=1.04 slower ✖
   TOTAL: prev=68651.703894 ms, new=63144.566305999986 ms, diff=1.09 faster ✔
```

</details>

<details><summary>tpcds_sf1 1.02 faster ✔</summary>

```text
=== Comparing tpcds_sf1 results from engine 'datafusion-distributed-dynamic-task-allocation' [prev] with 'datafusion-distributed-dynamic-task-allocation' [new] ===
      q1: prev= 260 ms, new= 336 ms, diff=1.29 slower ❌
      q2: prev= 290 ms, new= 321 ms, diff=1.11 slower ✖
      q3: prev= 181 ms, new= 215 ms, diff=1.19 slower ✖
      q4: prev=2039 ms, new=2184 ms, diff=1.07 slower ✖
      q5: prev= 333 ms, new= 325 ms, diff=1.02 faster ✔
      q6: prev= 622 ms, new= 676 ms, diff=1.09 slower ✖
      q7: prev= 225 ms, new= 225 ms, diff=1.00 slower ✖
      q8: prev= 312 ms, new= 200 ms, diff=1.56 faster ✅
      q9: prev= 242 ms, new= 189 ms, diff=1.28 faster ✅
     q10: prev= 480 ms, new= 494 ms, diff=1.03 slower ✖
     q11: prev=1511 ms, new=1382 ms, diff=1.09 faster ✔
     q12: prev= 262 ms, new= 292 ms, diff=1.11 slower ✖
     q13: prev= 477 ms, new= 487 ms, diff=1.02 slower ✖
     q14: prev= 637 ms, new= 782 ms, diff=1.23 slower ❌
     q15: prev= 170 ms, new= 144 ms, diff=1.18 faster ✔
     q16: prev= 350 ms, new= 379 ms, diff=1.08 slower ✖
     q17: prev= 229 ms, new= 250 ms, diff=1.09 slower ✖
     q18: prev= 295 ms, new= 281 ms, diff=1.05 faster ✔
     q19: prev= 286 ms, new= 254 ms, diff=1.13 faster ✔
     q20: prev= 206 ms, new= 143 ms, diff=1.44 faster ✅
     q21: prev= 305 ms, new= 282 ms, diff=1.08 faster ✔
     q22: prev= 390 ms, new= 401 ms, diff=1.03 slower ✖
     q23: prev= 672 ms, new= 640 ms, diff=1.05 faster ✔
     q24: prev= 368 ms, new= 376 ms, diff=1.02 slower ✖
     q25: prev= 203 ms, new= 279 ms, diff=1.37 slower ❌
     q26: prev= 147 ms, new= 198 ms, diff=1.35 slower ❌
     q27: prev= 406 ms, new= 358 ms, diff=1.13 faster ✔
     q28: prev= 195 ms, new= 161 ms, diff=1.21 faster ✅
     q29: prev= 237 ms, new= 219 ms, diff=1.08 faster ✔
     q31: prev= 343 ms, new= 327 ms, diff=1.05 faster ✔
     q32: prev= 142 ms, new= 152 ms, diff=1.07 slower ✖
     q33: prev= 277 ms, new= 211 ms, diff=1.31 faster ✅
     q34: prev= 199 ms, new= 188 ms, diff=1.06 faster ✔
     q35: prev= 514 ms, new= 498 ms, diff=1.03 faster ✔
     q36: prev= 341 ms, new= 311 ms, diff=1.10 faster ✔
     q37: prev= 256 ms, new= 302 ms, diff=1.18 slower ✖
     q38: prev= 228 ms, new= 245 ms, diff=1.07 slower ✖
     q39: prev= 259 ms, new= 266 ms, diff=1.03 slower ✖
     q40: prev= 281 ms, new= 325 ms, diff=1.16 slower ✖
     q41: prev=  87 ms, new=  90 ms, diff=1.03 slower ✖
     q42: prev= 116 ms, new= 124 ms, diff=1.07 slower ✖
     q43: prev= 190 ms, new= 132 ms, diff=1.44 faster ✅
     q44: prev= 214 ms, new= 144 ms, diff=1.49 faster ✅
     q45: prev= 244 ms, new= 186 ms, diff=1.31 faster ✅
     q46: prev= 355 ms, new= 288 ms, diff=1.23 faster ✅
     q47: prev= 374 ms, new= 387 ms, diff=1.03 slower ✖
     q48: prev= 384 ms, new= 360 ms, diff=1.07 faster ✔
     q49: prev= 285 ms, new= 229 ms, diff=1.24 faster ✅
     q50: prev= 352 ms, new= 343 ms, diff=1.03 faster ✔
     q51: prev= 305 ms, new= 224 ms, diff=1.36 faster ✅
     q52: prev= 138 ms, new= 127 ms, diff=1.09 faster ✔
     q53: prev= 143 ms, new= 158 ms, diff=1.10 slower ✖
     q54: prev= 331 ms, new= 271 ms, diff=1.22 faster ✅
     q55: prev= 132 ms, new= 145 ms, diff=1.10 slower ✖
     q56: prev= 298 ms, new= 233 ms, diff=1.28 faster ✅
     q57: prev= 335 ms, new= 354 ms, diff=1.06 slower ✖
     q58: prev= 280 ms, new= 284 ms, diff=1.01 slower ✖
     q59: prev= 293 ms, new= 270 ms, diff=1.09 faster ✔
     q60: prev= 361 ms, new= 311 ms, diff=1.16 faster ✔
     q61: prev= 856 ms, new= 849 ms, diff=1.01 faster ✔
     q62: prev= 639 ms, new= 665 ms, diff=1.04 slower ✖
     q63: prev= 224 ms, new= 148 ms, diff=1.51 faster ✅
     q64: prev=1159 ms, new=1193 ms, diff=1.03 slower ✖
     q65: prev= 229 ms, new= 228 ms, diff=1.00 faster ✔
     q66: prev= 730 ms, new= 714 ms, diff=1.02 faster ✔
     q67: prev= 406 ms, new= 420 ms, diff=1.03 slower ✖
     q68: prev= 289 ms, new= 320 ms, diff=1.11 slower ✖
     q69: prev= 513 ms, new= 570 ms, diff=1.11 slower ✖
     q70: prev= 394 ms, new= 386 ms, diff=1.02 faster ✔
     q71: prev= 250 ms, new= 329 ms, diff=1.32 slower ❌
     q72: prev=6644 ms, new=6609 ms, diff=1.01 faster ✔
     q73: prev= 201 ms, new= 210 ms, diff=1.04 slower ✖
     q74: prev= 797 ms, new= 743 ms, diff=1.07 faster ✔
     q75: prev= 375 ms, new= 452 ms, diff=1.21 slower ❌
     q76: prev= 165 ms, new= 230 ms, diff=1.39 slower ❌
     q77: prev= 232 ms, new= 271 ms, diff=1.17 slower ✖
     q78: prev= 341 ms, new= 353 ms, diff=1.04 slower ✖
     q79: prev= 226 ms, new= 228 ms, diff=1.01 slower ✖
     q80: prev= 332 ms, new= 336 ms, diff=1.01 slower ✖
     q81: prev= 216 ms, new= 191 ms, diff=1.13 faster ✔
     q82: prev= 258 ms, new= 262 ms, diff=1.02 slower ✖
     q83: prev= 240 ms, new= 287 ms, diff=1.20 slower ✖
     q84: prev= 240 ms, new= 228 ms, diff=1.05 faster ✔
     q85: prev= 455 ms, new= 364 ms, diff=1.25 faster ✅
     q86: prev= 124 ms, new= 138 ms, diff=1.11 slower ✖
     q87: prev= 203 ms, new= 208 ms, diff=1.02 slower ✖
     q88: prev= 404 ms, new= 350 ms, diff=1.15 faster ✔
     q89: prev= 237 ms, new= 167 ms, diff=1.42 faster ✅
     q90: prev= 189 ms, new= 187 ms, diff=1.01 faster ✔
     q91: prev= 377 ms, new= 328 ms, diff=1.15 faster ✔
     q92: prev= 284 ms, new= 131 ms, diff=2.17 faster ✅
     q93: prev= 154 ms, new= 142 ms, diff=1.08 faster ✔
     q94: prev= 302 ms, new= 308 ms, diff=1.02 slower ✖
     q95: prev= 365 ms, new= 290 ms, diff=1.26 faster ✅
     q96: prev= 177 ms, new= 157 ms, diff=1.13 faster ✔
     q97: prev= 235 ms, new= 170 ms, diff=1.38 faster ✅
     q98: prev= 165 ms, new= 159 ms, diff=1.04 faster ✔
     q99: prev= 951 ms, new= 995 ms, diff=1.05 slower ✖
   TOTAL: prev=123029.07797800002 ms, new=120962.55825200005 ms, diff=1.02 faster ✔
```

</details>
gabotechs added a commit that referenced this pull request May 11, 2026
An independent refactor factored out from
#416

This is one PR from the following stack of PRs:
- #422
- #424
<- you are here
- #416
- #425
- #426
- #427
- #432


Previously the stage struct was a "hidden" state machine that could have
two states:

1. A state where the Stage contains the input plan and is locally
accessible and traversable.

```rust
pub struct Stage {
    query_id: ...
    num: ...
    plan: Some(plan),
    tasks: vec![None, None, None]
}
```

2. A state where the input plan is serialized, and the worker URLs are
assigned. This happens in `DistributedExec` right before execution on
`prepare_plan()`

```rust
pub struct Stage {
    query_id: ...
    num: ...
    plan: None,
    tasks: vec![Some("http://1"), Some("http://2"), Some("http://3")]
}
```

This PR makes this behavior explicit, and represented with an `enum`:

```rust
pub enum Stage {
    Local(LocalStage),
    Remote(RemoteStage),
}

pub struct LocalStage {
    pub query_id: Uuid,
    pub num: usize,
    pub plan: Arc<dyn ExecutionPlan>,
    pub tasks: usize,
}

pub struct RemoteStage {
    pub query_id: Uuid,
    pub num: usize,
    pub workers: Vec<Url>,
}
```
@gabotechs gabotechs force-pushed the gabrielmusat/local-worker-connections branch from d2a57e1 to a147fd7 Compare May 11, 2026 17:41
@gabotechs gabotechs force-pushed the gabrielmusat/dynamic-task-count branch from 2aa45ce to d4a4236 Compare May 11, 2026 17:42
@gabotechs gabotechs force-pushed the gabrielmusat/local-worker-connections branch 2 times, most recently from f643281 to b0cb082 Compare May 11, 2026 21:00
@gabotechs gabotechs force-pushed the gabrielmusat/dynamic-task-count branch 3 times, most recently from 139aa3f to 96dfd61 Compare May 12, 2026 13:33
gabotechs added a commit that referenced this pull request May 14, 2026
… planner (#416)

This is a preparatory step towards:
-
#377

This is one PR from the following stack of PRs:
- #422
- #424
- #416
<- you are here
- #425
- #426
- #427
- #432

The main purpose of this PR is to perform distributed planning in a single
pass, rather than the current two passes that communicate with each other via
an intermediate struct (`AnnotatedPlan`). This change cascades into several
other changes that produce a nicer public API for building custom
distributed plans, but also produce a big diff.

## Dropping the two-step annotation + NB injection

In a dynamic task assignation context, choosing the task count for a
stage based on the previous one can no longer
be done statically.

After "annotating" a stage, and before "annotating" the
one above, we need to be able to send it for execution, collect runtime
metrics, and based on that decide the task count for the stage above.

This means that the stage below must be ready to be sent for execution
before the full annotation process has finished. We therefore need to
do everything the "annotation" process does in one go: we can no
longer divide the distribution process into several steps that each
recurse over the whole plan.

## Network boundaries no longer mutate their children

In order for network boundaries to know what mutations to apply to their
children, they need to know how many consumer tasks they are going to be
running, but this might not be known until execution time, so if we want
to
dynamically assign tasks to stages, there's no way at planning time that
we can know how to mutate the children.

For example, we do not know how to scale up a `RepartitionExec` if we
don't
know how many `NetworkShuffleExec`s are going to be consuming it.

The responsibility of preparing network boundary inputs (e.g., scaling
RepartitionExec)
is now factored out into a separate `network_boundary_scale_input()`
function
that can be called either at planning time or at execution time.

Right now, it's still just called at planning time.
gabotechs added a commit that referenced this pull request May 15, 2026
This is a preparatory step towards:
-
#377

This is one PR from the following stack of PRs:
- #422
- #424
- #416 
- #425
<- you are here
- #426
- #427
- #432


Removes `impl_set_plan.rs` in favor of just inlining its contents to
`impl_coordinator_channel.rs`.

In future changes, the relationship between `impl_set_plan.rs` and
`impl_coordinator_channel.rs` will get more complex, increasing the
function signature `impl_set_plan.rs` exposes to
`impl_coordinator_channel.rs`. This proves that the split between those
two files does not make sense, as they have never been able to evolve
independently, so we may as well just not pay the price of a complex
function signature in between.
@gabotechs gabotechs force-pushed the gabrielmusat/local-worker-connections branch from 00eeb5b to 62e00ac Compare May 15, 2026 15:13
@gabotechs gabotechs force-pushed the gabrielmusat/dynamic-task-count branch from cdf6f6a to 07c09e9 Compare May 15, 2026 15:13
gabotechs added a commit that referenced this pull request May 16, 2026
This is a preparatory step towards:
-
#377

This is one PR from the following stack of PRs:
- #422
- #424
- #416 
- #425
- #426
<- you are here
- #427
- #432

`distributed.rs` contains the `DistributedExec` node, which has evolved
towards acting as a "coordinator". It's in charge of assigning tasks to
worker URLs, setting the subplans in the appropriate workers, collecting
metrics, streaming work units, etc...

Soon, it will evolve even more as we prepare for adaptive query
execution.

This PR ships two things:
- A refactor that dismantles the old `distributed.rs` into smaller
reusable modules in the `coordinator/` module.
- Bypass the metrics collection machinery if metrics collection is
disabled
@gabotechs gabotechs force-pushed the gabrielmusat/local-worker-connections branch from 62e00ac to ca2d4a9 Compare May 16, 2026 15:29
@gabotechs gabotechs force-pushed the gabrielmusat/dynamic-task-count branch 2 times, most recently from eef3497 to 6be38b4 Compare May 18, 2026 13:17
@gabotechs gabotechs force-pushed the gabrielmusat/max-gauge branch 2 times, most recently from 64a36b4 to 8e539eb Compare June 1, 2026 07:47
@gabotechs gabotechs force-pushed the gabrielmusat/dynamic-task-count branch from fe050cc to 709612a Compare June 1, 2026 07:47
@gabotechs gabotechs force-pushed the gabrielmusat/dynamic-task-count branch 2 times, most recently from 66478d3 to e81bfcf Compare June 1, 2026 08:35
@gabotechs gabotechs force-pushed the gabrielmusat/max-gauge branch from 8e539eb to c1922d9 Compare June 1, 2026 13:01
@gabotechs gabotechs force-pushed the gabrielmusat/dynamic-task-count branch from e81bfcf to 2a69af1 Compare June 1, 2026 13:02
@gabotechs gabotechs changed the base branch from gabrielmusat/max-gauge to gabrielmusat/producer-head June 1, 2026 13:21
@gabotechs gabotechs force-pushed the gabrielmusat/dynamic-task-count branch 4 times, most recently from c80380c to 7b9b3ab Compare June 2, 2026 08:48
@gabotechs gabotechs changed the base branch from gabrielmusat/producer-head to gabrielmusat/task-spawner-refactor-and-cache-invalidation June 2, 2026 08:49
@gabotechs gabotechs force-pushed the gabrielmusat/task-spawner-refactor-and-cache-invalidation branch from 77e9d3f to dc07ea6 Compare June 2, 2026 08:58
@gabotechs gabotechs mentioned this pull request Jun 8, 2026

@jayshrivastava jayshrivastava left a comment

Collaborator

Pretty promising results. Left a few comments. I also had codex do a sweep:

Last Commit Review Findings

Review target: last commit, Support dynamic task count assignation.

1. Blocking: dynamic planning can deadlock on set-but-never-executed stages

DistributedExec::execute keeps the end-query guard alive until after drain_pending_tasks():

  • src/coordinator/distributed.rs:199
  • src/coordinator/distributed.rs:218

At the same time, send_plan_task keeps the coordinator-to-worker request stream open with keep_stream_alive:

  • src/coordinator/query_coordinator.rs:184

The worker response stream also waits on the metrics stream:

  • src/worker/impl_coordinator_channel.rs:180

If the dynamic planner sets and samples a stage but later never executes that stage, metrics never completes. Then drain_pending_tasks() waits for the worker response stream to end, but the response stream is kept alive by the coordinator channel, and that channel only closes when the guard drops. The guard cannot drop because execution is waiting inside drain_pending_tasks().

This is a circular wait.

2. Blocking: task-data invalidation is still coupled to execution finish

The commit rationale says task entry invalidation moves from task execution finish to coordinator-channel lifetime, but the old invalidation path is still present:

  • src/worker/impl_execute_task.rs:97

The new coordinator-channel invalidation is also present:

  • src/worker/impl_coordinator_channel.rs:167

That means task data can still be removed when partition execution drains, even while the coordinator channel is intentionally held open. This keeps the old early-cleanup behavior and can break dynamic cases with late consumers, retries, or future execution requests that rely on the coordinator channel still owning task-data lifetime.

3. High: per-column throughput stats are almost always zero

set_per_col_bytes_per_second computes:

ready / total_ready * total_bytes_per_second

at:

  • src/execution_plans/sampler.rs:357

Because this is integer division, any column with ready < total_ready becomes 0 before multiplication. In ordinary multi-column cases, most or all columns will report zero bytes per second.

The fix is to multiply before dividing, with overflow-safe arithmetic if needed:

ready.saturating_mul(total_bytes_per_second as u64) / total_ready
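To make the difference between the two orderings concrete, here is a minimal standalone sketch. The function names and the plain `u64` arguments are assumptions for illustration only; the real signature of `set_per_col_bytes_per_second` is not shown in this thread.

```rust
/// Divides first: integer division truncates `ready / total_ready` to 0
/// whenever `ready < total_ready`, so the final product is 0 as well.
fn share_divide_first(ready: u64, total_ready: u64, total_bytes_per_second: u64) -> u64 {
    if total_ready == 0 {
        return 0;
    }
    ready / total_ready * total_bytes_per_second
}

/// Multiplies first (saturating on overflow), then divides, so the
/// proportional share survives integer truncation.
fn share_multiply_first(ready: u64, total_ready: u64, total_bytes_per_second: u64) -> u64 {
    if total_ready == 0 {
        return 0;
    }
    ready.saturating_mul(total_bytes_per_second) / total_ready
}
```

With `ready = 1`, `total_ready = 4` and a total of 1000 bytes per second, the first variant reports 0 while the second reports 250.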

4. High: bytes_per_partition_per_second = 0 can panic

The config setter accepts any usize:

  • src/distributed_ext.rs:761

But dynamic planning uses the value as a divisor:

  • src/coordinator/prepare_dynamic_plan.rs:49

If a user sets bytes_per_partition_per_second to 0, planning can panic in div_ceil. The setter should reject zero, or planning should clamp or error before dividing.
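Both options mentioned above could look roughly like the sketch below. The helper names are hypothetical and the error type is a plain `String` so the example stays standalone; the actual setter in `distributed_ext.rs` and the sizing code in `prepare_dynamic_plan.rs` are not reproduced here.

```rust
/// Hypothetical setter-side validation: reject zero up front.
fn validate_bytes_per_partition_per_second(value: usize) -> Result<usize, String> {
    if value == 0 {
        return Err("bytes_per_partition_per_second must be greater than 0".to_string());
    }
    Ok(value)
}

/// Hypothetical planning-side guard: clamp the divisor to at least 1
/// so that `div_ceil` can never divide by zero.
fn partitions_for(bytes_per_second: usize, bytes_per_partition_per_second: usize) -> usize {
    bytes_per_second.div_ceil(bytes_per_partition_per_second.max(1))
}
```

Rejecting zero in the setter surfaces the misconfiguration early, while clamping keeps planning alive at the cost of silently reinterpreting the value.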

5. High: remote partition statistics ignore the requested partition

Stage::partition_statistics returns the whole remote stage statistics regardless of the requested partition:

  • src/stage.rs:149

If DataFusion calls partition_statistics(Some(partition)), each partition can appear to contain the full stage statistics. That can inflate rows and bytes by partition count and distort AQE decisions.

The safer behavior is either:

  • return unknown stats for per-partition requests when only global stats are available, or
  • track real per-partition stats and return the matching partition.

6. Medium: SamplerExec::execute can panic on stale partition ids

SamplerExec::execute indexes into partition_samplers before validating the partition:

  • src/execution_plans/sampler.rs:547

If a stale or mismatched dynamic partition id reaches this plan, this panics instead of returning a DataFusion execution error.

This should be changed to get(partition) and return exec_err! when the partition is invalid.
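A minimal sketch of the bounds-checked lookup follows. `PartitionSampler` and `sampler_for` are hypothetical stand-ins, and a `String` error replaces DataFusion's `exec_err!` so the example compiles without any dependency.

```rust
/// Hypothetical stand-in for the per-partition sampler state.
struct PartitionSampler {
    id: usize,
}

/// Looks the sampler up with `get` instead of indexing, so an invalid
/// partition id becomes an error rather than a panic.
fn sampler_for(
    samplers: &[PartitionSampler],
    partition: usize,
) -> Result<&PartitionSampler, String> {
    samplers.get(partition).ok_or_else(|| {
        format!(
            "invalid partition {partition}: only {} partitions available",
            samplers.len()
        )
    })
}
```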

7. Medium: dynamic coalesce boundaries discard runtime stats

Dynamic planning explicitly skips runtime-stat collection for NetworkCoalesceExec:

  • src/coordinator/prepare_dynamic_plan.rs:98

This may be fine for the final gather stage, but if a coalesce boundary feeds another planned stage, downstream task sizing gets unknown stats. That can make later dynamic decisions less accurate or effectively static.

If NetworkCoalesceExec can appear below additional distributed operators, it should either gather stats too or document why it is always terminal for dynamic planning.

8. Medium: plan reconstruction is destructive

PlanReconstructor removes entries from stage_map while reconstructing:

  • src/coordinator/prepare_dynamic_plan.rs:157

That makes reconstruction one-shot. If a stage is referenced by more than one network boundary, or reconstruction is retried or reused for diagnostics, the second lookup fails.

Prefer borrowing and cloning from the map instead of removing entries.

9. Medium: sampler/load-info errors can surface too late

Worker load-info errors are emitted through the same response stream as metrics:

  • src/worker/impl_coordinator_channel.rs:180

The dynamic planner waits on the separate load-info receiver, while send_plan_task sees response-stream errors later:

  • src/coordinator/query_coordinator.rs:202

This can let planning continue with partial or zero stats, while the actual sampler error surfaces later during pending-task draining. Runtime-stat collection should receive explicit errors, not just missing load-info messages.

10. Low: the new lifecycle behavior needs a targeted regression test

The highest-risk scenario is:

  1. Dynamic planner sends SetPlanRequest for a stage.
  2. Worker starts sampling and sends LoadInfo.
  3. AQE chooses a different shape and that stage never reaches execute_task.
  4. The query still must finish and drain_pending_tasks() must not hang.

Existing correctness tests likely exercise stages that are eventually consumed, so they may not catch this lifecycle bug. Add a targeted test for a planned-but-abandoned dynamic stage and assert that pending-task draining completes.

Comment thread src/execution_plans/sampler.rs Outdated
/// Maximum number of record batches buffered by a sampler.
max_batches_buffered: MaxGaugeMetric,
/// Peak memory buffered by any partition sampler during the sampling phase.
max_mem_used: Gauge,

Collaborator

I don't think this measures what you want. Right now we do this for every batch in all partitions.

        self.max_mem_used.add(batch_size);

So this measures the size of all the batches pushed to the buffer.

You probably want to use an int to store the total size of bytes in the buffer on every push/pop. Then make this a MaxGaugeMetric and update it on every push/pop.
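As a rough sketch of that suggestion, assuming a buffer that supports both push and pop: `PeakTracker` is a hypothetical helper, not the `MaxGaugeMetric` type from this PR, and it only illustrates tracking the current size alongside a running maximum.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical tracker: keeps the bytes currently buffered and the
/// highest value that count has ever reached.
#[derive(Default)]
struct PeakTracker {
    current: AtomicUsize,
    peak: AtomicUsize,
}

impl PeakTracker {
    /// Records a batch entering the buffer and updates the peak.
    fn push(&self, bytes: usize) {
        let now = self.current.fetch_add(bytes, Ordering::Relaxed) + bytes;
        self.peak.fetch_max(now, Ordering::Relaxed);
    }

    /// Records a batch leaving the buffer; the peak is left untouched.
    fn pop(&self, bytes: usize) {
        self.current.fetch_sub(bytes, Ordering::Relaxed);
    }

    fn current(&self) -> usize {
        self.current.load(Ordering::Relaxed)
    }

    fn peak(&self) -> usize {
        self.peak.load(Ordering::Relaxed)
    }
}
```

If entries are only ever pushed, the running sum and the peak coincide, and a plain additive gauge reports the same number.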

Collaborator

Also none of these metrics are tagged by partition. We should probably do that. SamplerExecMetrics::new() can take a partition id.

Collaborator

And the comment on MaxGaugeMetric should say that it only stores a max.

/// Similar to DataFusion's Gauge metric, but aggregates between instances using `max` instead of
/// `sum`.

I honestly thought it was a regular Gauge with a different aggregator.

@gabotechs gabotechs Jun 18, 2026

Collaborator Author

It is indeed a regular Gauge with a max aggregator.

> So this measures the size of all the batches pushed to the buffer.

This is equal to the max memory used. Note that entries in the buffer are never popped. Instead, they are converted into a chained stream of the buffered entries plus the ones that are yet to come, so this is a push-only buffer.

This is actually not really a buffer, it's more like a peek, where some entries are peeked and then streamed back as-is without any extra buffering in the middle.

I'll update the name.

let metrics: Arc<LazyLock<_, Box<dyn FnOnce() -> SamplerExecMetrics + Send>>> =
    Arc::new(LazyLock::new(Box::new(move || {
        SamplerExecMetrics::new(&metric_set_clone)
    })));

Collaborator

I don't understand this, or the comment.

> otherwise the coordinator side will register them when they are never relevant there

Collaborator Author

I extended the comment a bit to explain it further.

Comment thread src/execution_plans/sampler.rs Outdated
let n_cols = self.input.schema().fields.len();

let mut reporter = LoadInfoDropHandler {
load_info: pb::LoadInfo {

Collaborator

Nit: can we implement new on this?

Collaborator Author

This is an autogenerated protobuf message, and additional methods are typically not implemented on those.

I can add a separate function though, I'll do that.

Comment thread src/execution_plans/sampler.rs
Some(self.worker_connections.metrics.clone_inner())
}

fn partition_statistics(&self, partition: Option<usize>) -> Result<Arc<Statistics>> {

Collaborator

I wonder if this could use a little test coverage. Maybe it's covered elsewhere, I'm still reviewing.

Collaborator Author

You might notice that there are very few new unit tests overall. This is mainly because this new feature is covered by all the tpch, tpcds and clickbench integration tests, which run the full queries with AQE enabled.

Collaborator Author

I have deliberately added almost no new unit tests because of this, and because of the number of from-scratch iterations I've done on this. As this stabilizes, if you see opportunities for covering things that are not already covered by the integration tests, those are more than welcome.

stage_map: DashMap<usize, (Arc<dyn ExecutionPlan>, MetricsSet)>,
}

impl PlanReconstructor {

Collaborator

this feels super random. it would be nice if it was shared between here and prepare_network_boundaries which also does things like insert_producer_head

Collaborator Author

Both prepare_network_boundaries and this code are doing very different things:

  • prepare_network_boundaries is actually laying out the plan that will eventually get executed and sent to workers
  • This is just reconstructing the plan dynamically as different stages get sent to the workers during AQE so that we can then have something to visualize. The insert_producer_head for example is really just for the sake of a future visualization

Collaborator Author

I'm going to add some comments about this

return Ok(TreeNodeRecursion::Continue);
};

if let Stage::Remote(remote) = nb.input_stage()

Collaborator

Would it be ok to add the remote stage to the NetworkBoundaryBuilder? So you don't have to recurse here. You could store an Option<RemoteStage>

Collaborator Author

It would be a bit messy, because there can be an arbitrary number of RemoteStages present in the subplan here.

I think it's not too bad. Recursing is negligible as long as no Arc operations or plan restructures happen during the recursion. .apply takes the plan by reference, and at no point does it need an owned reference to an Arc<dyn ExecutionPlan>, so the cost of such a recursion is ~0, unlike doing a transform_down() that returns Transformed::yes at some point in the query.

if d_cfg.dynamic_task_count {
    // The task count will be decided dynamically at execution time.
    return Ok(Arc::new(
        DistributedExec::new(plan).with_metrics_collection(d_cfg.collect_metrics),

Collaborator

Just a little bit of a smell.

The dynamic prepare plan does a lot of stuff that inject_network_boundaries and prepare_network_boundaries are supposed to do. Should we move those function calls to the static prepare plan? So the DistributedExec is always responsible for that stuff?

The downside is that the prepare static plan which happens below during planning would happen during execution in DistributedExec

Collaborator Author

Yeah, I think that's a big drawback. For static planning, we really like to see the full plan before it's sent for execution.

The split of responsibilities seems correct here:

  • static planner: we plan statically inside distributed_query_planner.rs, so all further logic is here (e.g., prepare_network_boundaries)
  • dynamic planner: we plan during execution, so the logic is in DistributedExec

}));

// Stream back the metrics once the task finishes executing.
// The oneshot receiver resolves when impl_execute_task sends the collected

Collaborator

Also I noticed that we still have the old code in the impl_execute_task.rs

            if num_partitions_remaining.fetch_sub(1, Ordering::SeqCst) == 1 {

Collaborator Author

Yeah, we still need that old code for a couple of things:

  • Marking execution as finished for stage metrics
  • Sending the runtime metrics over the wire back to the coordinator

Additionally, the cache entry is still promptly invalidated here rather than waiting for the coordinator->worker channel to drop.

@gabotechs

gabotechs commented Jun 19, 2026

Collaborator Author
  1. Blocking: dynamic planning can deadlock on set-but-never-executed stages

This should not be a problem: if the query finishes streaming arrow data, it does not matter what is waiting for what, as everything is abruptly cancelled (except the metrics collection).

This is actually not even related to this PR, although I think there might be better ways of doing this. As it's not related to this PR, I think it might be worth evaluating in a different one

@gabotechs

Collaborator Author
  2. Blocking: task-data invalidation is still coupled to execution finish

This is fine. The only consequence is that task data gets invalidated a bit earlier, which is acceptable and even desirable.

Again, this did not change in this PR.

@gabotechs

Collaborator Author
  1. High: per-column throughput stats are almost always zero

👍 This is a good call, doing this

@gabotechs

Collaborator Author
  1. High: bytes_per_partition_per_second = 0 can panic

👍 Doing that
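A minimal sketch of the kind of guard this could be, assuming the panic comes from an integer division by the configured rate (the review item does not show the exact expression). The function name `estimated_seconds` is hypothetical and not part of this PR.

```rust
/// Hypothetical helper (an assumption, not the PR's code): estimates how many
/// seconds one partition needs to process `total_bytes` at the given rate.
/// Returns None instead of panicking when the rate is zero.
fn estimated_seconds(total_bytes: u64, bytes_per_partition_per_second: u64) -> Option<u64> {
    // Plain `/` on integers panics on a zero divisor; checked_div yields None.
    total_bytes.checked_div(bytes_per_partition_per_second)
}
```

The caller can then decide what a zero rate means, for example rejecting the config value up front or falling back to a default task count.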

@gabotechs

Collaborator Author
  1. High: remote partition statistics ignore the requested partition

Because there are repartitions in between samplers and network boundaries, it's not possible to attribute statistics to specific partitions.

One thing that comes to mind: if a specific partition is requested, maybe it's better to just divide the stats by the number of partitions available, assuming that the data is perfectly evenly distributed across partitions.
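A rough sketch of that even-split idea, using a simplified stand-in for the statistics type. `StageStats` and `stats_for` are hypothetical names for illustration, not types or functions from this PR.

```rust
/// Hypothetical, simplified stand-in for the stats gathered by the samplers.
#[derive(Debug, Clone, Copy, PartialEq)]
struct StageStats {
    num_rows: u64,
    total_bytes: u64,
}

/// Returns the stage-wide stats when no partition is requested; otherwise an
/// even share, assuming rows are spread uniformly across partitions.
fn stats_for(stats: StageStats, partition: Option<usize>, num_partitions: usize) -> StageStats {
    match partition {
        // The partition index itself is ignored: every partition gets the same share.
        Some(_) if num_partitions > 0 => {
            let n = num_partitions as u64;
            StageStats {
                num_rows: stats.num_rows / n,
                total_bytes: stats.total_bytes / n,
            }
        }
        // No partition requested (or no partitions known): report the totals.
        _ => stats,
    }
}
```

This will be off for skewed data, but it is at least closer than reporting the whole stage's stats for a single partition.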

@gabotechs

Collaborator Author
  1. Medium: SamplerExec::execute can panic on stale partition ids

👍 Sounds reasonable

@gabotechs

Collaborator Author
  1. Medium: dynamic coalesce boundaries discard runtime stats

This is fine and expected. We don't care about sampling these boundaries because we always know that the stage above should have Maximum(1) tasks.

@gabotechs

Collaborator Author
  1. Medium: plan reconstruction is destructive

This is fine. We are good with taking ownership of the plans here, since we don't need to keep them in memory afterwards.

@gabotechs

Collaborator Author
  1. Medium: sampler/load-info errors can surface too late

I think this is fine. The timing for all of this should be negligible, and I'm afraid that doing it differently would add complexity to the code.

@gabotechs

Collaborator Author
  1. Low: the new lifecycle behavior needs a targeted regression test

I actually fought with this already, as it was surfaced by the existing tests. This situation happens in many TPC-H and TPC-DS queries, so I think we are covered on this front.
