[cub] Replace cub parameter framework with cuda::argument#9074
Conversation
845daaf to
5dd3c87
Compare
5dd3c87 to
8a3b299
Compare
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
OverviewThis PR replaces the CUB parameter framework in cub/cub/detail/segmented_params.cuh with CUDA ::cuda::__argument wrappers (from PR Key ChangesCore Framework Refactoringcub/cub/detail/segmented_params.cuh
Dispatch and Kernel Updatescub/cub/device/dispatch/dispatch_batched_topk.cuh
cub/cub/device/dispatch/kernels/kernel_batched_topk.cuh
Agent and Worker Updatescub/cub/agent/agent_batched_topk.cuh
Benchmarks and TestsBenchmarks (cub/benchmarks/bench/segmented_topk/fixed/keys.cu and variable/keys.cu)
Tests (cub/test/device_segmented_topk.cu)
Public API / Notable Signature Changes
Files Touched (high-level)
Metrics
Notes for reviewers
important: WalkthroughRefactors batched top-K parameter handling to use cuda::__argument wrapper types and ::cuda::__argument::__traits with a unified detail::params::get_param API; dispatch now accepts unwrapped select_direction and num_segments, kernels and agent code read parameters via get_param, and tests/benchmarks updated accordingly. ChangesCUB Batched Top-K Integration
Possibly related PRs
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
cub/cub/detail/segmented_params.cuh (1)
31-43: 💤 Low valuesuggestion: Missing
[[nodiscard]]onget_paramoverloads. Per coding guidelines, most functions with non-void return should have this attribute._CCCL_TEMPLATE(class _Tp) _CCCL_REQUIRES((!::cuda::argument::__is_wrapper_v<::cuda::std::remove_cv_t<::cuda::std::remove_reference_t<_Tp>>>) ) -_CCCL_HOST_DEVICE constexpr auto get_param(_Tp&& __arg, [[maybe_unused]] size_t __index) noexcept +[[nodiscard]] _CCCL_HOST_DEVICE constexpr auto get_param(_Tp&& __arg, [[maybe_unused]] size_t __index) noexceptSame applies to the other
get_paramoverloads on lines 46-47, 53-54, 67-68, 74-75. As per coding guidelines, most functions with a non-void return type should use[[nodiscard]].cub/cub/device/dispatch/dispatch_batched_topk.cuh (1)
51-66: 💤 Low valuesuggestion: Both
wrap_select_directionoverloads return non-void and should have[[nodiscard]].-_CCCL_HOST_DEVICE inline auto wrap_select_direction(detail::topk::select dir) +[[nodiscard]] _CCCL_HOST_DEVICE inline auto wrap_select_direction(detail::topk::select dir)-_CCCL_HOST_DEVICE auto wrap_select_direction(IteratorT iter) +[[nodiscard]] _CCCL_HOST_DEVICE auto wrap_select_direction(IteratorT iter)libcudacxx/include/cuda/__argument/argument_bounds.h (1)
103-113: ⚡ Quick winsuggestion: Complete Doxygen tags for the documented
__boundsoverloads. The documented non-void factory functions currently only provide//!@brief; add `//! `@paramfor each parameter and//!@return`` for both overloads to satisfy header documentation requirements.As per coding guidelines: "When a function is documented with Doxygen, it must include:
//!@brief, `//! `@param`[in/out/in,out]` for every parameter, and `//! `@returnfor non-void functions."
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a53ca9f6-f66d-4f20-a942-2e8bd23c2c84
📒 Files selected for processing (20)
cub/benchmarks/bench/segmented_topk/fixed/keys.cucub/benchmarks/bench/segmented_topk/variable/keys.cucub/cub/agent/agent_batched_topk.cuhcub/cub/detail/segmented_params.cuhcub/cub/device/dispatch/dispatch_batched_topk.cuhcub/cub/device/dispatch/kernels/kernel_batched_topk.cuhcub/test/catch2_test_device_segmented_topk_keys.cucub/test/catch2_test_device_segmented_topk_pairs.culibcudacxx/include/cuda/__argument/argument.hlibcudacxx/include/cuda/__argument/argument_bounds.hlibcudacxx/include/cuda/argumentlibcudacxx/include/cuda/std/__internal/namespaces.hlibcudacxx/test/libcudacxx/cuda/argument/argument_bounds.pass.cpplibcudacxx/test/libcudacxx/cuda/argument/argument_traits.pass.cpplibcudacxx/test/libcudacxx/cuda/argument/deferred_argument.pass.cpplibcudacxx/test/libcudacxx/cuda/argument/dynamic_argument.pass.cpplibcudacxx/test/libcudacxx/cuda/argument/static_argument.pass.cpplibcudacxx/test/libcudacxx/cuda/argument/static_bounds_conversion.fail.cpplibcudacxx/test/libcudacxx/cuda/argument/usage_example.pass.cpplibcudacxx/test/support/test_macros.h
| template <auto _Lowest, auto _Max> | ||
| _CCCL_API constexpr __immediate(_Arg __arg, __static_bounds<_Lowest, _Max>) noexcept | ||
| : arg{::cuda::std::move(__arg)} | ||
| { | ||
| __validate_bounds(); | ||
| __validate_value(); | ||
| } |
There was a problem hiding this comment.
important: These constructors accept a __static_bounds<_Lowest, _Max> argument but validate against the class template parameter _StaticBounds, so explicitly-instantiated types can silently ignore the bounds token passed at construction. Add a compile-time constraint (for example, static_assert(::cuda::std::is_same_v<_StaticBounds, __static_bounds<_Lowest, _Max>>) or a requires clause) so construction fails when the token and _StaticBounds disagree.
Also applies to: 294-302
| template <auto _Lowest, auto _Max> | ||
| _CCCL_API constexpr __deferred_base(_Arg __arg, __static_bounds<_Lowest, _Max>) noexcept | ||
| : arg{::cuda::std::move(__arg)} | ||
| { | ||
| __validate_bounds_intersection<__element_type, _StaticBounds>(__runtime_bounds_); | ||
| } |
There was a problem hiding this comment.
important: __deferred_base has the same bounds-token mismatch risk: constructor parameters carry __static_bounds<_Lowest, _Max> but all checks use _StaticBounds. This can make user-provided static bounds inert for explicitly-typed wrappers. Enforce _StaticBounds == __static_bounds<_Lowest, _Max> at compile time (or remove the redundant bounds parameter in favor of _StaticBounds-typed overloads).
Also applies to: 357-366
This comment has been minimized.
This comment has been minimized.
8a3b299 to
06977ac
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
cub/cub/detail/segmented_params.cuh (3)
16-19: ⚡ Quick winsuggestion: Use a direct
cstddefinclude and a qualified size type here. These new signatures andsupported_options::countintroduce baresize_t, but this header does not include the defining header directly. Switching to::cuda::std::size_tand adding the precisecstddefheader removes the transitive-include dependency and matches CCCL header rules. As per coding guidelines, "Type names must be fully qualified" and "Files must include all headers related to the symbols that they are using. No transitive header inclusions are allowed."Also applies to: 34-70, 83-83
152-152: ⚡ Quick winsuggestion: Qualify the free-function call at Line 152.
dispatch_impl(...)is invoked unqualified, but CCCL headers require free-function calls to start from the global namespace even inside the same namespace. As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace."
29-31: ⚡ Quick winsuggestion: Complete the Doxygen tags on the documented functions. The new
//!blocks forget_param,dispatch_impl, anddispatch_discreteomit@paramand@return, which this repo requires whenever a function is documented. As per coding guidelines, "When a function is documented with Doxygen, it must include://!@brief, `//! `@param`[in/out/in,out]` for every parameter, and `//! `@returnfor non-void functions."Also applies to: 135-146
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7627989d-73f6-4559-875e-d2820ef16b49
📒 Files selected for processing (8)
cub/benchmarks/bench/segmented_topk/fixed/keys.cucub/benchmarks/bench/segmented_topk/variable/keys.cucub/cub/agent/agent_batched_topk.cuhcub/cub/detail/segmented_params.cuhcub/cub/device/dispatch/dispatch_batched_topk.cuhcub/cub/device/dispatch/kernels/kernel_batched_topk.cuhcub/test/catch2_test_device_segmented_topk_keys.cucub/test/catch2_test_device_segmented_topk_pairs.cu
💤 Files with no reviewable changes (2)
- cub/test/catch2_test_device_segmented_topk_pairs.cu
- cub/test/catch2_test_device_segmented_topk_keys.cu
🚧 Files skipped from review as they are similar to previous changes (5)
- cub/cub/agent/agent_batched_topk.cuh
- cub/benchmarks/bench/segmented_topk/fixed/keys.cu
- cub/benchmarks/bench/segmented_topk/variable/keys.cu
- cub/cub/device/dispatch/kernels/kernel_batched_topk.cuh
- cub/cub/device/dispatch/dispatch_batched_topk.cuh
This comment has been minimized.
This comment has been minimized.
elstehle
left a comment
There was a problem hiding this comment.
I think this looks much cleaner with the argument annotation framework replacements. Thanks for working on this!
| const auto num_segments = ::cuda::std::max<std::size_t>(1, (max_elements / segment_size)); | ||
| const auto elements = num_segments * segment_size; | ||
| const auto total_num_items = total_num_items_guarantee_t{static_cast<::cuda::std::int64_t>(elements)}; | ||
| const auto total_num_items = ::cuda::__argument::__immediate{static_cast<::cuda::std::int64_t>(elements)}; |
There was a problem hiding this comment.
note to myself: I think this should become part of the guarantees API, as, in the device interface, there is no concrete argument that could be annotated.
| using segment_size_val_t = typename ::cuda::__argument::__traits<SegmentSizeParameterT>::element_type; | ||
| using num_segments_val_t = typename ::cuda::__argument::__traits<NumSegmentsParameterT>::element_type; |
There was a problem hiding this comment.
Note to myself: I think we want to narrower the index/offset/size types to the more narrow of the two: the static upper bound or the element type.
|
|
||
| constexpr bool is_full_tile = params::has_single_static_value_v<SegmentSizeParameterT> | ||
| && params::static_min_value_v<SegmentSizeParameterT> == tile_size; | ||
| constexpr bool is_full_tile = ::cuda::__argument::__traits<SegmentSizeParameterT>::is_constant |
There was a problem hiding this comment.
question: For something like an __immediate where we have sharp bounds, i.e., lowest=max, we wouldn't hit the full_tile branch, right? Do you think this is something we should somehow cover or do you think we can expect users to be always be using __constant? I guess there could be a scenario where users themselves do have something like bounds template parameters and would not check for narrow bounds before instantiating something with __immediate?
There was a problem hiding this comment.
I think we could make __immediate smarter and report is_constant when it has static bounds where lower == higher. I will take a look into it in a follow-up PR.
This comment has been minimized.
This comment has been minimized.
🥳 CI Workflow Results🟩 Finished in 1h 12m: Pass: 100%/284 | Total: 2d 20h | Max: 42m 30s | Hits: 89%/216258See results here. |
This PR replaces most of the functionality in
segmented_params.cuhwithcuda::argumentwrappers from #8875. This PR contains the other one, since it's not merged yet.There are two things that were left from the original implementation, the static dispatch over bounded set of values and
get_paramthat either gets item from a sequence at a given index or returns a uniform value depending on the argument. Both of those things were more fitting for a cub-specific functionality, but its not set in stone