Skip to content

GH-50247: Reuse abstraction for null partitions in sorting functions#50248

Open
taepper wants to merge 12 commits into
apache:mainfrom
taepper:better-null-partitions
Open

GH-50247: Reuse abstraction for null partitions in sorting functions#50248
taepper wants to merge 12 commits into
apache:mainfrom
taepper:better-null-partitions

Conversation

@taepper

@taepper taepper commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Rationale for this change

@pitrou mentioned this as a follow-up in #46926

What changes are included in this PR?

Refactoring sorting methods to reuse the helper methods avoid maintaining two abstractions for null partitions. The new abstraction was very seamless to implement in most cases, but a few spots required some care

In particular, these functions were severly simlpified by the new abstraction:

  • MarkDuplicates: duplicate nulls and nans were detected by checking every single row for Null one additional time, after we already had (and discarded) the nullness information
  • GenericMergeImpl: merging of null-ranges involved repartitioning null and nan values in every merge invocation. Now, we track this distinction and do not need any merge function for null and nan blocks

Are these changes tested?

Yes, the compute test suite passes as before

Are there any user-facing changes?

No.

@taepper taepper requested a review from pitrou as a code owner June 25, 2026 01:48
@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #50247 has been automatically assigned in GitHub to PR creator.

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This is excellent, and the simplification is very welcome. Just a couple minor comments below.

IndexType* non_nulls_end;
IndexType* nulls_begin;
IndexType* nulls_end;
struct GenericPartitionResultByNullLikeness {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just keep the old name? Or name it GenericNullLikePartition which is a bit shorter?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is definitely sensible, I like removing the Result from the name as it is used to store that Partition in various places. NullPartitionResult sort of implied it is only used as a single-use struct which is only returned by a NullPartition function.

Having NullPartition, NanPartition (as helpers), and NullLikePartition for the total struct sounds great!

null_range.size()}};
}

static GenericPartitionResultByNullLikeness fromCounts(std::span<IndexType> indices,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: FromCounts

sorted[i].null_range.size()),
batch.num_rows());
begin_offset = end_offset;
// XXX this is an upper bound on the true null count

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this XXX still true? Presumably it implied that null_count could also account for nan values, but that is not the case anymore?

@taepper taepper Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems right. I also noticed that this null_count was able to be removed entirely (no longer used in Merge{,AtStart,AtEnd})

Comment on lines +665 to +667
DCHECK_EQ(static_cast<int64_t>(sorted[i].non_null_like_range.size() +
sorted[i].null_range.size()),
batch.num_rows());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we also add nan_range.size() here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 25, 2026
@pitrou

pitrou commented Jun 25, 2026

Copy link
Copy Markdown
Member

Hmm, there are some regressions in the test suite (see CI runs).

Also this runtime assertion on Windows CI might give a clue:

27/98 Test  #49: arrow-compute-vector-sort-test ...............Exit code 0xc0000409
***Exception:   0.43 sec
[==========] Running 284 tests from 130 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from TestNthToIndicesForReal/0, where TypeParam = arrow::FloatType
[ RUN      ] TestNthToIndicesForReal/0.NthToIndicesDoesNotProvideDefaultOptions
[       OK ] TestNthToIndicesForReal/0.NthToIndicesDoesNotProvideDefaultOptions (1 ms)
[ RUN      ] TestNthToIndicesForReal/0.Real
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.44.35207\include\span(160) : Assertion failed: cannot compare incompatible span iterators

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants