[libcu++] Adds exec::guarantee and the max_total_num_items guarantee#9278

Open
elstehle wants to merge 4 commits into
NVIDIA:mainfrom
elstehle:fea/total-num-items-guarantees

@elstehle
Contributor

@elstehle elstehle commented Jun 5, 2026

Closes #9279

Description

Adds cuda::execution::guarantee together with its first guarantee, cuda::execution::max_total_num_items. Where require lets a caller demand properties from an algorithm, guarantee lets a caller promise properties of the problem that an algorithm may exploit. Guarantees are bundled with guarantee(...) and surfaced through a dedicated __get_guarantees query, mirroring require.

max_total_num_items communicates an upper bound on the total number of items processed (e.g. the combined size of all segments in cub::DeviceBatchedTopK), which an algorithm can use to size intermediate offset types. Because this bound cannot always be attached to a specific parameter (e.g. DeviceBatchedTopK, and similarly other segmented algorithms), we decided it belongs in the guarantees API.
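To illustrate what "sizing intermediate offset types" could look like, here is a minimal standalone sketch in plain C++17. It does not use the PR's API: `offset_for`, `offset_for_t`, and `segment_offsets` are hypothetical names, and the choice between 32-bit and 64-bit unsigned offsets is an assumption about how an algorithm might react to a static bound.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>
#include <vector>

namespace sketch {

// Hypothetical helper: pick the narrowest unsigned offset type that can
// represent every value up to a compile-time upper bound.
template <auto StaticMax>
struct offset_for {
  static_assert(std::is_integral_v<decltype(StaticMax)>, "bound must be integral");
  static_assert(StaticMax >= 0, "bound must be non-negative");
  static constexpr auto bound = static_cast<std::uint64_t>(StaticMax);
  using type = std::conditional_t<(bound <= std::numeric_limits<std::uint32_t>::max()),
                                  std::uint32_t,
                                  std::uint64_t>;
};

template <auto StaticMax>
using offset_for_t = typename offset_for<StaticMax>::type;

// Hypothetical consumer: builds segment begin offsets (an exclusive prefix sum
// plus the total) using the offset type implied by the promised bound.
template <auto StaticMax>
std::vector<offset_for_t<StaticMax>> segment_offsets(const std::vector<std::uint32_t>& segment_sizes) {
  using offset_type = offset_for_t<StaticMax>;
  std::vector<offset_type> offsets;
  offsets.reserve(segment_sizes.size() + 1);
  offsets.push_back(0);
  std::uint64_t total = 0;
  for (const auto size : segment_sizes) {
    total += size;
    // The caller promised the combined size never exceeds StaticMax.
    assert(total <= static_cast<std::uint64_t>(StaticMax));
    offsets.push_back(static_cast<offset_type>(total));
  }
  return offsets;
}

} // namespace sketch
```

With a bound of 1'000'000'000 the sketch selects 32-bit offsets, while a bound of 10'000'000'000 forces 64-bit offsets; the per-segment sizes themselves never carry that information.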

Design decisions

  • max_total_num_items instead of a single total_num_items taking both lower and upper bounds: Lower bounds are presumably rare in practice, so we optimize for convenience in the common case and keep the two as separate, composable guarantees (guarantee(max_total_num_items<N>(), min_total_num_items<M>())), with min_total_num_items left as follow-up work.
  • Compile-time and runtime upper bounds, both first-class: max_total_num_items<N>() (static), max_total_num_items(n) (runtime), and max_total_num_items<N>(n) (static bound + runtime refinement, asserting n <= N).
  • Inferred integral bound type rather than a hard-coded int64_t: a 32-bit bound stays 32-bit instead of widening to 64-bit, so max_total_num_items(1000000) still provides an int32 static upper bound. Narrower types can be requested explicitly (max_total_num_items<cuda::std::int16_t{1000}>()).
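The three construction modes and the type inference described above can be modeled in standalone C++17 roughly as follows. This is a sketch under assumptions, not the PR's implementation: everything lives in a hypothetical `sketch` namespace, and the holder only mirrors the member names mentioned in the review summary (`element_type`, `static_highest`, `highest()`).

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

namespace sketch {

// Hypothetical holder: a compile-time bound plus an optional runtime refinement.
template <class T, T StaticMax>
struct max_total_num_items_holder {
  static_assert(std::is_integral_v<T>, "bounds must be integral");
  using element_type = T;
  static constexpr T static_highest = StaticMax;
  T runtime_highest = StaticMax; // defaults to the static bound
  constexpr T highest() const { return runtime_highest; }
};

// Static bound only: the element type is inferred from the template argument.
template <auto N>
constexpr auto max_total_num_items() {
  return max_total_num_items_holder<decltype(N), N>{};
}

// Runtime bound only: the static bound falls back to the type's maximum.
template <class T>
constexpr auto max_total_num_items(T n) {
  return max_total_num_items_holder<T, std::numeric_limits<T>::max()>{n};
}

// Static bound with a runtime refinement, asserting n <= N.
template <auto N>
constexpr auto max_total_num_items(decltype(N) n) {
  assert(n <= N);
  return max_total_num_items_holder<decltype(N), N>{n};
}

} // namespace sketch
```

In this model, `sketch::max_total_num_items<1000>()` yields an `int` element type, while `sketch::max_total_num_items<std::int16_t{1000}>(std::int16_t{10})` keeps a 16-bit element type with a static bound of 1000 and a runtime bound of 10.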

Example

#include <cuda/execution.guarantee.h>
#include <cuda/execution.max_total_num_items.h>

namespace ex = cuda::execution;

// Compile-time upper bound (type inferred: fits int -> 32-bit offsets):
auto static_env = cuda::std::execution::env{ex::guarantee(ex::max_total_num_items<1'000'000'000>())};

// ... or a runtime upper bound (num_items is an integral value known only at run time):
auto runtime_env = cuda::std::execution::env{ex::guarantee(ex::max_total_num_items(num_items))};

// Either env is passed to an algorithm that understands the guarantee, e.g. (once wired up) cub::DeviceBatchedTopK(..., static_env);

@elstehle elstehle requested a review from a team as a code owner June 5, 2026 08:52
@elstehle elstehle requested a review from Jacobfaib June 5, 2026 08:52
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 5, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 5, 2026
@elstehle elstehle requested review from ericniebler and pciolkosz June 5, 2026 08:52
@coderabbitai
Contributor

coderabbitai Bot commented Jun 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f7e85098-e6f7-4f89-be58-54101ff7789e

📥 Commits

Reviewing files that changed from the base of the PR and between 38ff9c8 and 1a1e06b.

📒 Files selected for processing (1)
  • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgement when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

Adds a guarantees facility to libcudacxx and the first guarantee type, max_total_num_items, enabling callers to promise an upper bound on the total number of items an algorithm will process so algorithms can choose internal types or allocate temporary storage more efficiently.

Key Features

  • Guarantees mechanism

    • New base type exec::__guarantee.
    • exec::guarantee(...) factory to bundle guarantee objects into a cuda::std::execution::env and produce a prop keyed by __get_guarantees_t.
    • exec::__get_guarantees query token (forwarding query) to extract guarantees from an environment.
  • max_total_num_items guarantee

    • exec::__max_total_num_items_holder_t holds a compile-time static_highest and a runtime highest() with element_type inferred from provided integral values.
    • Query token exec::__get_max_total_num_items (forwarding query) returns the holder.
    • Factory overloads:
      • Compile-time: max_total_num_items<N>()
      • Runtime: max_total_num_items(n) (element type inferred from n; static_highest = numeric_limits<element_type>::max())
      • Hybrid: max_total_num_items<N>(n) (element type inferred from N; runtime n must be <= N)
    • Preserves narrow integral types via inference; supports explicit typed non-type template parameters (e.g., int16_t literal).
  • Design note: only an upper-bound guarantee is added initially; min_total_num_items considered later.
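For readers unfamiliar with the query pattern, the guarantees mechanism summarized above can be approximated in standalone C++17. This is only a sketch: the PR packages guarantees into a cuda::std::execution::prop keyed by __get_guarantees_t, whereas the version below swaps in a plain struct holding a std::tuple, and every name in the `sketch` namespace is hypothetical.

```cpp
#include <tuple>
#include <type_traits>
#include <utility>

namespace sketch {

// Hypothetical stand-in for the internal guarantee base type.
struct guarantee_base {};

// Hypothetical query tag: asks an environment for its bundled guarantees.
struct get_guarantees_t {
  template <class Env>
  constexpr decltype(auto) operator()(const Env& env) const {
    return env.query(*this);
  }
};
inline constexpr get_guarantees_t get_guarantees{};

// Minimal environment that answers the query with the stored bundle.
template <class... Gs>
struct guarantees_env {
  std::tuple<Gs...> guarantees;
  constexpr const std::tuple<Gs...>& query(get_guarantees_t) const noexcept {
    return guarantees;
  }
};

// Bundles guarantees, rejecting any argument that is not a guarantee.
template <class... Gs>
constexpr guarantees_env<Gs...> guarantee(Gs... gs) {
  static_assert((std::is_base_of_v<guarantee_base, Gs> && ...),
                "every argument must derive from guarantee_base");
  return {std::tuple<Gs...>{std::move(gs)...}};
}

// Example guarantee carrying an upper bound.
struct max_items : guarantee_base {
  long long bound;
  constexpr explicit max_items(long long b) : bound(b) {}
};

} // namespace sketch
```

An algorithm would call `sketch::get_guarantees(env)` and inspect the returned tuple; passing a type that does not derive from `guarantee_base` to `sketch::guarantee` fails the static_assert.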

Files Changed (high level)

  • New/updated implementation headers

    • libcudacxx/include/cuda/__execution/guarantee.h — guarantee mechanism, __get_guarantees_t, guarantee().
    • libcudacxx/include/cuda/__execution/max_total_num_items.h — holder type, __get_max_total_num_items_t, and three max_total_num_items overloads.
  • Public headers

    • libcudacxx/include/cuda/execution.guarantee.h — wrapper include.
    • libcudacxx/include/cuda/execution.max_total_num_items.h — wrapper include.
    • libcudacxx/include/cuda/execution — umbrella updated to include the new headers.
  • Tests

    • libcudacxx/test/libcudacxx/cuda/execution/guarantee.pass.cpp — validates environment construction, query extraction, and forwarding-query property.
    • libcudacxx/test/libcudacxx/cuda/execution/guarantee.fail.cpp — negative test: non-guarantee env rejected.
    • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp — verifies compile-time, runtime, hybrid bounds, type inference, and queryability.
    • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.fail.cpp — verifies non-integral bounds are rejected.
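The integral-only constraint that the negative test exercises can be demonstrated without relying on a compile failure. The sketch below is an assumption about one way to express such a constraint (SFINAE via std::enable_if_t, checked with std::is_invocable_v); `make_bound` and `runtime_bound` are hypothetical names, and the PR may enforce the rule differently, for instance with a static_assert.

```cpp
#include <type_traits>

namespace sketch {

// Hypothetical holder for a runtime upper bound.
template <class T>
struct runtime_bound {
  T value;
};

// Hypothetical factory that only participates in overload resolution
// for integral argument types.
struct make_bound_fn {
  template <class T, std::enable_if_t<std::is_integral_v<T>, int> = 0>
  constexpr runtime_bound<T> operator()(T n) const {
    return runtime_bound<T>{n};
  }
};
inline constexpr make_bound_fn make_bound{};

} // namespace sketch

// Integral bounds are accepted; floating-point bounds are rejected.
static_assert(std::is_invocable_v<sketch::make_bound_fn, int>);
static_assert(std::is_invocable_v<sketch::make_bound_fn, long long>);
static_assert(!std::is_invocable_v<sketch::make_bound_fn, double>);
```

Using SFINAE rather than a hard error makes the rejection observable from a passing test, at the cost of a less descriptive diagnostic for users.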

Summary

Introduces a composable, queryable guarantees API and the max_total_num_items guarantee to convey an upper bound on total items processed. The API supports static, runtime, and hybrid bounds, preserves narrow integral types via inference, integrates with existing execution query semantics (forwarding queries), and includes positive and negative tests covering the new functionality.

Walkthrough

Adds a guarantee facility (base type, query key, variadic guarantee(...)) and a max_total_num_items guarantee with static/runtime bounds, public wrapper headers, and tests exercising positive and negative cases.

Changes

Execution Guarantees Facility

| Layer | File(s) | Summary |
|---|---|---|
| Guarantee facility and query mechanism | libcudacxx/include/cuda/__execution/guarantee.h | __guarantee base class, __get_guarantees_t query object and global constant, and guarantee(...) variadic template that validates all arguments derive from __guarantee and packages them into a cuda::std::execution::prop. |
| Max total num items guarantee implementation | libcudacxx/include/cuda/__execution/max_total_num_items.h | __max_total_num_items_holder_t stores compile-time and runtime bounds with highest() and query(...) accessors. Three max_total_num_items(...) factories provide compile-time only, runtime only, and combined construction modes, with an assertion that the runtime bound does not exceed the static bound. |
| Public header exports and integration | libcudacxx/include/cuda/execution, libcudacxx/include/cuda/execution.guarantee.h, libcudacxx/include/cuda/execution.max_total_num_items.h | Wrapper headers forward to implementation files with system-header pragmas; main execution umbrella header includes both wrappers to expose the complete guarantee API. |
| Guarantee facility tests | libcudacxx/test/libcudacxx/cuda/execution/guarantee.pass.cpp, guarantee.fail.cpp | guarantee.pass.cpp validates that guarantee(max_total_num_items<1000>()) correctly exposes guarantees through environment queries with forwarding-query assertions; guarantee.fail.cpp exercises rejection of non-guarantee properties. |
| Max total num items tests | libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp, max_total_num_items.fail.cpp | max_total_num_items.pass.cpp validates compile-time type deduction, runtime inference, combined bound narrowing, element-type selection, and forwarding-query behavior via static_assert and assert. max_total_num_items.fail.cpp confirms integral-only constraints via a floating-point rejection test. |

Assessment against linked issues

Objective Addressed Explanation
Add an option to guarantee a static and runtime max_total_num_items to DeviceBatchedTopK (#9279)

Suggested labels

libcu++

Suggested reviewers

  • ericniebler
  • bernhardmgruber

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp

libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp:11:10: fatal error: 'cuda/execution.max_total_num_items.h' file not found
11 | #include <cuda/execution.max_total_num_items.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp:68:3-74:3: ERROR translating statement 'CompoundStmt'
Aborting translation of method 'test' in file 'libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp': "Assert_failure src/clang/cAst_utils.ml:249:53"
Uncaught Internal Error: "Assert_failure src/clang/cAst_utils.ml:249:53"
Error backtrace:
Raised at ClangFrontend__CAst_utils.get_decl_from_typ_ptr in file "src/clang/cAst_utils.ml", line 249, characters 53-65
Called from ClangFrontend__CTrans.CTrans_funct.get_destructor_decl_ref in file "src/clang/cTrans.ml", line 658, characters 12-59
Called from ClangFrontend__CTrans.CTrans_funct.destructor_calls.(fun) in file "src/clang/cTr

... [truncated 2200 characters] ...

e "src/clang/cTrans.ml" (inlined), line 4765, characters 38-71
Called from ClangFrontend__CTrans.CTrans_funct.exec_with_node_creation in file "src/clang/cTrans.ml" (inlined), line 104, characters 20-38
Called from ClangFrontend__CTrans.CTrans_funct.get_clang_stmt_trans in file "src/clang/cTrans.ml" (inlined), line 5395, characters 4-69
Called from ClangFrontend__CTrans.CTrans_funct.get_custom_stmt_trans in file "src/clang/cTrans.ml", line 5401, characters 8-55
Called from ClangFrontend__CTrans.CTrans_funct.exec_trans_instrs.exec_trans_instrs_rev in file "src/clang/cTrans.ml" (inlined), line 5365, characters 28-54
Called from ClangFrontend__CTrans.CTrans_funct.exec_trans_instrs in file "src/clang/cTrans.ml" (inlined), line 5389, characters 6-70
Called from ClangFrontend__CTrans.CTrans_funct


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1cc66350-916c-4788-a194-ec55f2aa4233

📥 Commits

Reviewing files that changed from the base of the PR and between 2b21bec and 90b7581.

📒 Files selected for processing (9)
  • libcudacxx/include/cuda/__execution/guarantee.h
  • libcudacxx/include/cuda/__execution/max_total_num_items.h
  • libcudacxx/include/cuda/execution
  • libcudacxx/include/cuda/execution.guarantee.h
  • libcudacxx/include/cuda/execution.max_total_num_items.h
  • libcudacxx/test/libcudacxx/cuda/execution/guarantee.fail.cpp
  • libcudacxx/test/libcudacxx/cuda/execution/guarantee.pass.cpp
  • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.fail.cpp
  • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp

Comment thread libcudacxx/test/libcudacxx/cuda/execution/guarantee.fail.cpp
@bernhardmgruber
Contributor

Why can't we use the argument annotation framework to put an upper bound on the num items?

How does the guarantee work for problems with multiple num items, like DeviceMerge?

@elstehle
Contributor Author

elstehle commented Jun 7, 2026

Why can't we use the argument annotation framework to put an upper bound on the num items?

This bound information is not always attachable to a specific parameter. E.g., for segmented top-k there is only a parameter for the segment sizes (for which we support the argument annotation) but no parameter for the total number of items. A bound on the total number of items is optional information here.

How does the guarantee work for problems with multiple num items, like DeviceMerge?

I think that for algorithms with required parameters to which bounds can be attached, the information should be attached to those parameters. So for single-problem algorithms, I would advise using the argument annotation rather than max_total_num_items. We could still support both for single-problem algorithms by taking the sharper of the two bounds (the bound on the specific parameter and the guarantee).

For DeviceMerge, a single-problem algorithm, this guarantee would still be useful: users can provide a bound on the total number of items (num_keys1 + num_keys2) in addition to bounds on each of the two parameters individually.

Edit: Obviously, this is all forward-looking, in preparation for a world where num_items could be a device-accessible-only value.
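As a rough illustration of "taking the sharper of the two bounds" for a merge-like algorithm, consider the following standalone C++17 sketch. The function name and signature are hypothetical and not part of CUB or libcu++, and the per-input bounds are assumed to be small enough that their sum does not overflow 64 bits.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

namespace sketch {

// Hypothetical helper: the upper bound on the merged size implied by the two
// per-input bounds, tightened by an optional total-items guarantee.
constexpr std::uint64_t effective_total_bound(std::uint64_t max_keys1,
                                              std::uint64_t max_keys2,
                                              std::optional<std::uint64_t> max_total = std::nullopt) {
  // Assumes max_keys1 + max_keys2 does not overflow.
  const std::uint64_t from_params = max_keys1 + max_keys2;
  return max_total ? std::min(from_params, *max_total) : from_params;
}

} // namespace sketch
```

For example, per-input bounds of 100 and 200 give 300 on their own, a total guarantee of 250 sharpens that to 250, and a looser guarantee of 1000 leaves it at 300.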

@github-actions
Contributor

github-actions Bot commented Jun 7, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 05m: Pass: 100%/115 | Total: 23h 57m | Max: 41m 53s | Hits: 99%/337638

See results here.

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Add an option to guarantee a static and runtime max_total_num_items to DeviceBatchedTopK

2 participants