GH-50247: Reuse abstraction for null partitions in sorting functions #50248
taepper wants to merge 12 commits into
pitrou
left a comment
Thank you! This is excellent, and the simplification is very welcome. Just a couple minor comments below.
IndexType* non_nulls_end;
IndexType* nulls_begin;
IndexType* nulls_end;
struct GenericPartitionResultByNullLikeness {
Let's just keep the old name? Or name it GenericNullLikePartition which is a bit shorter?
Yes, that is definitely sensible, I like removing the Result from the name as it is used to store that Partition in various places. NullPartitionResult sort of implied it is only used as a single-use struct which is only returned by a NullPartition function.
Having NullPartition, NanPartition (as helpers), and NullLikePartition for the total struct sounds great!
null_range.size()}};
}

static GenericPartitionResultByNullLikeness fromCounts(std::span<IndexType> indices,
sorted[i].null_range.size()),
batch.num_rows());
begin_offset = end_offset;
// XXX this is an upper bound on the true null count
Is this XXX still true? Presumably it implied that null_count could also account for nan values, but that is not the case anymore?
That seems right. I also noticed that this null_count was able to be removed entirely (no longer used in Merge{,AtStart,AtEnd})
DCHECK_EQ(static_cast<int64_t>(sorted[i].non_null_like_range.size() +
sorted[i].null_range.size()),
batch.num_rows());
Shouldn't we also add nan_range.size() here?
Hmm, there are some regressions in the test suite (see CI runs). Also this runtime assertion on Windows CI might give a clue:
Rationale for this change
@pitrou mentioned this as a follow-up in #46926
What changes are included in this PR?
This PR refactors the sorting methods to reuse the helper methods, so that we avoid maintaining two abstractions for null partitions. The new abstraction was straightforward to adopt in most cases, but a few spots required some care.
In particular, these functions were greatly simplified by the new abstraction:
- MarkDuplicates: duplicate nulls and NaNs were detected by checking every single row for Null one additional time, after we already had (and discarded) the nullness information.
- GenericMergeImpl: merging of null ranges involved repartitioning null and NaN values in every merge invocation. Now we track this distinction and do not need any merge function for null and NaN blocks.
Are these changes tested?
Yes, the compute test suite passes as before.
Are there any user-facing changes?
No.
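
For readers unfamiliar with the merge step mentioned under GenericMergeImpl, the following is a rough sketch of why tracking NaN and null blocks separately removes the need to compare or repartition them during a merge. It is an illustration under stated assumptions, not the code in this PR: `ChunkCounts`, `MergeNullsAtEnd`, and the nulls-at-end layout `[non-null-like | NaN | null]` are hypothetical. Only the non-null-like values go through a comparison-based merge; the NaN and null blocks are moved into place with `std::rotate`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical block sizes of one sorted chunk, laid out contiguously as
// [non-null-like | NaN | null].
struct ChunkCounts {
  std::size_t non_null_like;
  std::size_t nan;
  std::size_t null;
  std::size_t total() const { return non_null_like + nan + null; }
};

// Merge two adjacent sorted chunks starting at `begin` in `indices`.
// `less` is only ever called on non-null-like entries.
template <typename Less>
ChunkCounts MergeNullsAtEnd(std::vector<std::uint64_t>& indices,
                            std::size_t begin, ChunkCounts left,
                            ChunkCounts right, Less less) {
  auto first = indices.begin() + static_cast<std::ptrdiff_t>(begin);
  auto left_values_end = first + static_cast<std::ptrdiff_t>(left.non_null_like);
  auto right_begin = first + static_cast<std::ptrdiff_t>(left.total());
  auto right_values_end =
      right_begin + static_cast<std::ptrdiff_t>(right.non_null_like);
  auto right_nan_end = right_values_end + static_cast<std::ptrdiff_t>(right.nan);

  // Step 1: [v1 | n1 u1 | v2 | n2 u2] -> [v1 v2 | n1 u1 | n2 u2]
  std::rotate(left_values_end, right_begin, right_values_end);
  auto values_end =
      left_values_end + static_cast<std::ptrdiff_t>(right.non_null_like);

  // Step 2: [n1 | u1 | n2 | u2] -> [n1 n2 | u1 u2]
  auto left_null_begin = values_end + static_cast<std::ptrdiff_t>(left.nan);
  std::rotate(left_null_begin, right_values_end, right_nan_end);

  // Step 3: merge the two sorted runs of non-null-like values.
  std::inplace_merge(first, left_values_end, values_end, less);

  return {left.non_null_like + right.non_null_like, left.nan + right.nan,
          left.null + right.null};
}
```

Because `std::rotate` and `std::inplace_merge` both preserve relative order within each block, the sketch keeps the merge stable without ever inspecting the NaN or null entries.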