fix: keep large LOAD external scan multi-CN #24855
Pull request overview
This PR adjusts the statistics produced for LOAD DATA external scans so that large LOAD jobs keep row/cardinality semantics for `Cost`, `Outcnt`, `TableCnt`, and `BlockNum`, while still preserving `Cost * Rowsize` as a byte-size hint for external-scan parallel sizing. This helps large CSV/TBL LOAD pick the expected multi-CN AP execution path.
Changes:
- Reworked `makeLoadExternalStats` to estimate row count and blocks instead of treating input bytes as `Cost`.
- Added row-size estimation helpers to keep `Cost * Rowsize` close to input bytes.
- Expanded/updated tests to validate byte-hint preservation and multi-CN selection for large LOAD.
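
To make the intended stat semantics concrete, here is a minimal sketch of how row-based stats could be derived while keeping `Cost * Rowsize` close to the input size. It is not the PR's actual `makeLoadExternalStats`: the `loadStats` struct, the `estimateLoadStats` function, the `avgRowBytes` parameter, and the `defaultBlockMaxRows` value of 8192 are all assumptions made for illustration. Only the `BlockNum` formula follows the diff excerpt quoted in this review.

```go
package main

import "fmt"

// loadStats is a hypothetical stand-in for the plan stats struct; it carries
// only the fields discussed in this PR.
type loadStats struct {
	Cost        float64
	Outcnt      float64
	TableCnt    float64
	Rowsize     float64
	Selectivity float64
	BlockNum    int32
}

// defaultBlockMaxRows is an assumed value standing in for
// options.DefaultBlockMaxRows.
const defaultBlockMaxRows = 8192

// estimateLoadStats derives row-based stats from the input size in bytes and
// an assumed average row width, keeping Cost*Rowsize close to inputBytes.
func estimateLoadStats(inputBytes, avgRowBytes float64) loadStats {
	if avgRowBytes < 1 {
		avgRowBytes = 1
	}
	rowCount := inputBytes / avgRowBytes
	if rowCount < 1 {
		rowCount = 1 // tiny inputs still count as one row
	}
	// Back the row size out of the (possibly clamped) row count so that
	// rowCount*rowSize stays equal to the input size.
	rowSize := inputBytes / rowCount
	return loadStats{
		Cost:        rowCount,
		Outcnt:      rowCount,
		TableCnt:    rowCount,
		Rowsize:     rowSize,
		Selectivity: 1,
		BlockNum:    int32(rowCount/float64(defaultBlockMaxRows)) + 1,
	}
}

func main() {
	// A 1 GiB input with an assumed 128-byte average row.
	s := estimateLoadStats(1<<30, 128)
	fmt.Printf("rows=%.0f blocks=%d byteHint=%.0f\n", s.Cost, s.BlockNum, s.Cost*s.Rowsize)
}
```

With stats shaped this way, `Cost`, `Outcnt`, and `TableCnt` carry row counts and `BlockNum` grows with the input, while any consumer that multiplies `Cost` by `Rowsize` still sees roughly the original byte size.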
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/sql/plan/build_load.go | Recomputes LOAD external scan stats using row/cardinality estimates while preserving byte-hint semantics. |
| pkg/sql/plan/build_load_parquet_test.go | Updates tests to validate new stat semantics and multi-CN exec-type behavior for large LOAD. |
| pkg/sql/plan/bind_load.go | Updates binder path to pass tableDef into the revised stats builder. |

    stats.TableCnt = rowCount
    stats.Rowsize = rowSize
    stats.Selectivity = 1
    stats.BlockNum = int32(rowCount/float64(options.DefaultBlockMaxRows)) + 1

    @@ -281,30 +284,66 @@ func TestValidateLoadParquetOptionsIgnoresNonParquet(t *testing.T) {
     }

     func TestMakeLoadExternalStatsUsesInputBytes(t *testing.T) {

What type of PR is this?
Which issue(s) this PR fixes:
issue #24846
What this PR does / why we need it:
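This fixes LOAD external scan stats so large LOAD jobs keep row/cardinality semantics for `Cost`, `Outcnt`, `TableCnt`, and `BlockNum`, while preserving `Cost * Rowsize` as the input-size hint used by external scan parallel sizing.

Previously the LOAD stats used input bytes as `Cost` with `Rowsize=1` and forced `BlockNum=1`/`TableCnt=1`. That can make large CSV/TBL LOAD choose `AP_ONECN` instead of the expected multi-CN AP path.

Tests: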
CostwithRowsize=1and forcedBlockNum=1/TableCnt=1. That can make large CSV/TBL LOAD chooseAP_ONECNinstead of the expected multi-CN AP path.Tests: