Skip to content

fix: keep large LOAD external scan multi-CN#24855

Draft
aptend wants to merge 2 commits into
matrixorigin:mainfrom
aptend:fix-load-external-stats-multicn
Draft

fix: keep large LOAD external scan multi-CN#24855
aptend wants to merge 2 commits into
matrixorigin:mainfrom
aptend:fix-load-external-stats-multicn

Conversation

@aptend
Copy link
Copy Markdown
Contributor

@aptend aptend commented Jun 4, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24846

What this PR does / why we need it:

This fixes LOAD external scan stats so large LOAD jobs keep row/cardinality semantics for Cost, Outcnt, TableCnt, and BlockNum, while preserving Cost * Rowsize as the input-size hint used by external scan parallel sizing.

Previously the LOAD stats used input bytes as Cost with Rowsize=1 and forced BlockNum=1 / TableCnt=1. That can make large CSV/TBL LOAD choose AP_ONECN instead of the expected multi-CN AP path.

Tests:

CGO_CFLAGS="-I$(pwd)/cgo -I$(pwd)/thirdparties/install/include" go test ./pkg/sql/plan -count=1

@aptend aptend requested review from aunjgr and ouyuanning as code owners June 4, 2026 15:19
Copilot AI review requested due to automatic review settings June 4, 2026 15:19
@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the statistics produced for LOAD DATA external scans so that large LOAD jobs keep row/cardinality semantics for Cost, Outcnt, TableCnt, and BlockNum, while still preserving Cost * Rowsize as a byte-size hint for external-scan parallel sizing—helping large CSV/TBL LOAD pick the expected multi-CN AP execution path.

Changes:

  • Reworked makeLoadExternalStats to estimate row count and blocks instead of treating input bytes as Cost.
  • Added row-size estimation helpers to keep Cost * Rowsize close to input bytes.
  • Expanded/updated tests to validate byte-hint preservation and multi-CN selection for large LOAD.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
pkg/sql/plan/build_load.go Recomputes LOAD external scan stats using row/cardinality estimates while preserving byte-hint semantics.
pkg/sql/plan/build_load_parquet_test.go Updates tests to validate new stat semantics and multi-CN exec-type behavior for large LOAD.
pkg/sql/plan/bind_load.go Updates binder path to pass tableDef into the revised stats builder.

stats.TableCnt = rowCount
stats.Rowsize = rowSize
stats.Selectivity = 1
stats.BlockNum = int32(rowCount/float64(options.DefaultBlockMaxRows)) + 1
@@ -281,30 +284,66 @@ func TestValidateLoadParquetOptionsIgnoresNonParquet(t *testing.T) {
}

func TestMakeLoadExternalStatsUsesInputBytes(t *testing.T) {
@aptend aptend requested a review from iamlinjunhong June 5, 2026 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working size/M Denotes a PR that changes [100,499] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants