[cudax] Update lane mask inside mappings only when unit is thread #9264

Merged
davebayer merged 1 commit into NVIDIA:main from davebayer:groups_update_lane_mask_only_for_thread_level
Jun 5, 2026

Conversation

@davebayer
Contributor

There was a bug in our mappings that made the map method update the lane mask regardless of the unit. We want to modify the lane mask only when the unit is a thread.

@davebayer davebayer requested a review from a team as a code owner June 4, 2026 20:29
@davebayer davebayer requested a review from andralex June 4, 2026 20:29
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Jun 4, 2026

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR fixes a bug in the CUDA Experimental Group mappings implementation where the map method incorrectly updated the lane mask regardless of the unit type. The fix restricts lane mask updates to occur only when the unit is thread_level, ensuring correct behavior in hierarchical group operations.

Changes Made

Core API Update

The map method signature across all mapping implementations has been updated to accept an additional leading _Unit parameter:

  • Before: map(const _ParentGroup&, const _PrevMappingResult&)
  • After: map(const _Unit&, const _ParentGroup&, const _PrevMappingResult&)
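The call-site migration can be sketched as follows. This is a minimal, host-only illustration: `thread_level`, `parent_group`, `mapping_result`, `pass_through_mapping`, and `mapped_mask` are hypothetical stand-ins, not the cudax types, and only the three-argument shape of `map` is taken from the summary above.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for the cudax types; the names are illustrative only.
struct thread_level {};
struct parent_group {};
struct mapping_result
{
  unsigned lane_mask;
};

// A pass-through mapping written against the new three-argument shape.
// Old shape (per the summary): map(parent, prev)
// New shape (per the summary): map(unit, parent, prev)
struct pass_through_mapping
{
  template <class Unit, class ParentGroup, class PrevMappingResult>
  constexpr PrevMappingResult
  map(const Unit&, const ParentGroup&, const PrevMappingResult& prev) const noexcept
  {
    return prev; // the unit is accepted but not needed by this mapping
  }
};

// Helper showing an updated call site: the unit is now the first argument.
constexpr unsigned mapped_mask(unsigned mask)
{
  return pass_through_mapping{}.map(thread_level{}, parent_group{}, mapping_result{mask}).lane_mask;
}

static_assert(mapped_mask(0xFu) == 0xFu, "pass-through keeps the mask");
```

A caller written against the two-argument form would fail to compile against this shape, which is why the change is described as breaking.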

This change propagates through:

  • cuda::experimental::group (base group implementation)
  • cuda::experimental::binary_partition
  • cuda::experimental::composite_mapping
  • cuda::experimental::group_as (both static and dynamic specializations)
  • cuda::experimental::group_by (both fixed-extent and dynamic-extent variants)
  • cuda::experimental::identity_mapping

Lane Mask Computation Fix

The critical fix for the lane mask issue is in the mapping implementations (group_as and group_by). Lane mask computation is now conditional at compile time:

  • When _Unit is thread_level: Lane mask is derived using __make_lane_mask_for_n with the previous mapping's lane mask plus computed lane indices and rank
  • Otherwise: The previous mapping's lane mask is reused unchanged

This ensures lane masks are only updated when performing thread-level grouping operations.
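The compile-time branch described above might look roughly like the sketch below. It assumes C++17 `if constexpr`; `narrow_lane_mask` is a hypothetical helper invented for this example and does not reproduce the signature or behavior of `__make_lane_mask_for_n`, and the level and result types are stand-ins.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for hierarchy levels and the mapping result.
struct thread_level {};
struct warp_level {};
struct mapping_result
{
  std::uint32_t lane_mask;
};

// Hypothetical helper (NOT __make_lane_mask_for_n): keep only the lanes of
// `prev` that fall in [first, first + count). Assumes first < 32.
constexpr std::uint32_t narrow_lane_mask(std::uint32_t prev, unsigned first, unsigned count)
{
  const std::uint32_t window = (count >= 32u ? 0xFFFFFFFFu : ((1u << count) - 1u)) << first;
  return prev & window;
}

// The lane mask is recomputed only when the unit is thread_level;
// for any other unit the previous result is forwarded unchanged.
template <class Unit>
constexpr mapping_result
map_subgroup(const Unit&, const mapping_result& prev, unsigned first, unsigned count)
{
  if constexpr (std::is_same_v<Unit, thread_level>)
  {
    return mapping_result{narrow_lane_mask(prev.lane_mask, first, count)};
  }
  else
  {
    return prev;
  }
}

static_assert(map_subgroup(thread_level{}, mapping_result{0xFFFFFFFFu}, 8, 8).lane_mask == 0x0000FF00u);
static_assert(map_subgroup(warp_level{}, mapping_result{0xFFFFFFFFu}, 8, 8).lane_mask == 0xFFFFFFFFu);
```

Because the branch is resolved at compile time, the non-thread path carries no runtime cost for the mask computation it skips.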

Supporting Updates

  • Group construction: The __mapping_result_ is now computed via the updated __do_mapping(__unit, ...) path, passing the group _Unit through to mapping calls
  • Synchronizer creation: Updated to accept and forward the provided _unit to __synchronizer.make_instance(...) instead of using default-constructed _Unit{}
  • Type inference helpers: Updated __group_mapping_result_t and __group_synchronizer_instance_t in traits.cuh to use cuda::std::declval expressions instead of default-constructed temporaries for more accurate type deduction
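As an illustration of the `declval`-based deduction mentioned in the last bullet, here is a small sketch using `std::declval` rather than `cuda::std::declval`. The alias `mapping_result_t` and all types are hypothetical and do not reproduce `__group_mapping_result_t`. One general property of this style is that it works even when an argument type has no default constructor.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins; parent_group deliberately has no default constructor.
struct thread_level {};
struct parent_group
{
  parent_group() = delete;
  explicit parent_group(int) {}
};
struct prev_result
{
  int rank;
};

struct sample_mapping
{
  template <class Unit, class Parent, class Prev>
  Prev map(const Unit&, const Parent&, const Prev& prev) const noexcept
  {
    return prev;
  }
};

// Deduce the result type of map(unit, parent, prev) without constructing
// any object: declval only names the argument types and value categories.
template <class Unit, class Mapping, class Parent, class Prev>
using mapping_result_t = decltype(std::declval<const Mapping&>().map(
  std::declval<const Unit&>(), std::declval<const Parent&>(), std::declval<const Prev&>()));

static_assert(
  std::is_same_v<mapping_result_t<thread_level, sample_mapping, parent_group, prev_result>, prev_result>);

// The same expression can be used to check the noexcept specification.
static_assert(noexcept(std::declval<const sample_mapping&>().map(
  std::declval<const thread_level&>(), std::declval<const parent_group&>(), std::declval<const prev_result&>())));
```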

Test Updates

All mapping tests have been updated to pass cuda::gpu_thread as the first argument to map invocations:

  • binary_partition.cu
  • composite_mapping.cu
  • group_as.cu
  • group_by.cu
  • identity_mapping.cu
  • barrier_synchronizer.cu
  • lane_synchronizer.cu

The test assertions for result types and noexcept specifications were updated to match the new map(...) signature.

Impact

This is a breaking API change for any code that directly calls the map method on mapping objects. All callers must be updated to pass a _Unit argument as the first parameter. The fix ensures correctness in hierarchical group operations where unit type determines whether lane mask tracking should be updated.

important: Walkthrough

This PR updates the CUDA group mapping pipeline to accept and propagate a _Unit template parameter as the first argument to all mapping map(...) method signatures and to forward that unit into synchronizer factory calls; lane-mask generation is made conditional on _Unit being thread_level.

important: Changes

Unit parameter integration across group mappings

| Layer / File(s) | Summary |
| --- | --- |
| Group mapping invocation contract (cudax/include/cuda/experimental/__group/group.cuh, cudax/include/cuda/experimental/__group/traits.cuh) | Group construction, mapping result inference, and synchronizer factory make_instance now use const _Unit& in decltype and calls, and the constructor forwards __unit into mapping and synchronizer creation. |
| Identity and binary partition mappings (cudax/include/cuda/experimental/__group/mapping/identity_mapping.cuh, cudax/include/cuda/experimental/__group/mapping/binary_partition.cuh) | map now takes const _Unit& first; identity returns the previous result unchanged, binary_partition adds a static_assert requiring _Unit == thread_level while preserving partition logic. |
| Composite mapping recursive forwarding (cudax/include/cuda/experimental/__group/mapping/composite_mapping.cuh) | Recursive __map_impl and public map now accept and forward a const _Unit& through each mapping step, threading intermediate mapping results. |
| Conditional lane-mask mappings (cudax/include/cuda/experimental/__group/mapping/group_as.cuh, cudax/include/cuda/experimental/__group/mapping/group_by.cuh) | map signatures accept const _Unit&; lane-mask computation is performed only when _Unit is thread_level, otherwise prior lane_mask() is forwarded. |
| Mapping and synchronizer test suite updates (cudax/test/group/mapping/*.cu, cudax/test/group/synchronizer/*.cu) | All tests updated to pass cuda::gpu_thread as the first argument to mapping.map(...); corresponding __group_mapping_result and noexcept static assertions were adjusted. |
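The recursive forwarding described for the composite mapping can be sketched as below. This is a host-only approximation under stated assumptions: `composite`, `map_impl`, `add_one`, and `times_two` are hypothetical, an `int` stands in for the mapping result, and the real `__map_impl` may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical stand-ins for the unit and the parent group.
struct thread_level {};
struct parent_group {};

// Two toy mappings; an int stands in for the mapping result.
struct add_one
{
  template <class Unit, class Parent>
  constexpr int map(const Unit&, const Parent&, const int& prev) const
  {
    return prev + 1;
  }
};
struct times_two
{
  template <class Unit, class Parent>
  constexpr int map(const Unit&, const Parent&, const int& prev) const
  {
    return prev * 2;
  }
};

template <class... Mappings>
struct composite
{
  std::tuple<Mappings...> mappings;

  // Apply mapping I, then recurse with its result as the next "previous" result.
  // The same unit and parent are forwarded unchanged to every step.
  template <std::size_t I, class Unit, class Parent, class Prev>
  constexpr auto map_impl(const Unit& unit, const Parent& parent, const Prev& prev) const
  {
    if constexpr (I == sizeof...(Mappings))
    {
      return prev;
    }
    else
    {
      return map_impl<I + 1>(unit, parent, std::get<I>(mappings).map(unit, parent, prev));
    }
  }

  template <class Unit, class Parent, class Prev>
  constexpr auto map(const Unit& unit, const Parent& parent, const Prev& prev) const
  {
    return map_impl<0>(unit, parent, prev);
  }
};
```

With `composite<add_one, times_two>`, an input of 3 passes through `add_one` first and `times_two` second, so the steps apply in declaration order.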

suggestion: Possibly related PRs

  • NVIDIA/cccl#8894: Related changes to binary_partition mapping API introducing _Unit parameter.
  • NVIDIA/cccl#9140: Related work touching lane-mask storage/usage in mapping results.

suggestion: Suggested labels

cudax

suggestion: Suggested reviewers

  • andralex
  • gevtushenko
  • griwes

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
cudax/include/cuda/experimental/__group/mapping/identity_mapping.cuh (1)

37-39: ⚡ Quick win

suggestion: The new _Unit parameter is unconstrained here, so direct map(...) callers can pass arbitrary types and bypass group's __is_hierarchy_level_v<_Unit> contract. Add the hierarchy-level constraint on the overload itself, and mirror it on the other updated mapping map(...) signatures.

As per coding guidelines: "Use C++20 concept macros instead of SFINAE, e.g., _CCCL_TEMPLATE(...) and _CCCL_REQUIRES(...), for template constraints."
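A constrained overload along the lines of this suggestion might look like the sketch below. To stay compilable as plain C++17 it uses `std::enable_if_t`; in the repository the guideline quoted above points to `_CCCL_TEMPLATE(...)` and `_CCCL_REQUIRES(...)` instead. The trait `is_hierarchy_level_v` is a hypothetical stand-in for `__is_hierarchy_level_v`, and the remaining names are illustrative.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical hierarchy levels and a stand-in for __is_hierarchy_level_v.
struct thread_level {};
struct warp_level {};

template <class T>
inline constexpr bool is_hierarchy_level_v = false;
template <>
inline constexpr bool is_hierarchy_level_v<thread_level> = true;
template <>
inline constexpr bool is_hierarchy_level_v<warp_level> = true;

// Identity-style mapping whose map overload only participates in overload
// resolution when Unit is a hierarchy level.
struct identity_like
{
  template <class Unit, class Parent, class Prev, std::enable_if_t<is_hierarchy_level_v<Unit>, int> = 0>
  constexpr Prev map(const Unit&, const Parent&, const Prev& prev) const noexcept
  {
    return prev;
  }
};

// Detection helper: true when identity_like::map accepts Unit as its unit.
template <class Unit, class = void>
inline constexpr bool can_map_v = false;
template <class Unit>
inline constexpr bool can_map_v<
  Unit,
  std::void_t<decltype(std::declval<const identity_like&>().map(std::declval<const Unit&>(), 0, 0))>> = true;

static_assert(can_map_v<thread_level>);
static_assert(can_map_v<warp_level>);
static_assert(!can_map_v<int>); // arbitrary types are rejected at the overload
```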

cudax/include/cuda/experimental/__group/mapping/group_as.cuh (1)

166-169: ⚡ Quick win

suggestion: Add a regression that instantiates this mapping with a non-thread unit and asserts lane_mask() is forwarded unchanged. The updated test cohort for this stack only passes cuda::gpu_thread, so the new false arm can still regress silently.

Also applies to: 292-295


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8c8dc5c7-7b33-4549-bb6e-da4e1566714c

📥 Commits

Reviewing files that changed from the base of the PR and between 89c81d7 and 7261ef1.

📒 Files selected for processing (13)
  • cudax/include/cuda/experimental/__group/group.cuh
  • cudax/include/cuda/experimental/__group/mapping/binary_partition.cuh
  • cudax/include/cuda/experimental/__group/mapping/composite_mapping.cuh
  • cudax/include/cuda/experimental/__group/mapping/group_as.cuh
  • cudax/include/cuda/experimental/__group/mapping/group_by.cuh
  • cudax/include/cuda/experimental/__group/mapping/identity_mapping.cuh
  • cudax/test/group/mapping/binary_partition.cu
  • cudax/test/group/mapping/composite_mapping.cu
  • cudax/test/group/mapping/group_as.cu
  • cudax/test/group/mapping/group_by.cu
  • cudax/test/group/mapping/identity_mapping.cu
  • cudax/test/group/synchronizer/barrier_synchronizer.cu
  • cudax/test/group/synchronizer/lane_synchronizer.cu

Comment thread cudax/include/cuda/experimental/__group/mapping/composite_mapping.cuh Outdated
@davebayer davebayer force-pushed the groups_update_lane_mask_only_for_thread_level branch from 7261ef1 to 6239984 Compare June 5, 2026 07:05
@davebayer davebayer requested a review from miscco June 5, 2026 07:08
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__group/traits.cuh (1)

35-36: suggestion: Consider removing or updating __group_mapping_result_t to avoid dead, stale 2-arg map inference. The only reference to __group_mapping_result_t is its definition in cudax/include/cuda/experimental/__group/traits.cuh (lines 35-36); no traits/concepts use it elsewhere, so the 2-arg contract mismatch with the current 3-arg map(unit, parent, initial_mapping_result) interface won’t impact builds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f50647ac-1ff9-4b69-9ab5-b79649a06662

📥 Commits

Reviewing files that changed from the base of the PR and between 7261ef1 and 6239984.

📒 Files selected for processing (14)
  • cudax/include/cuda/experimental/__group/group.cuh
  • cudax/include/cuda/experimental/__group/mapping/binary_partition.cuh
  • cudax/include/cuda/experimental/__group/mapping/composite_mapping.cuh
  • cudax/include/cuda/experimental/__group/mapping/group_as.cuh
  • cudax/include/cuda/experimental/__group/mapping/group_by.cuh
  • cudax/include/cuda/experimental/__group/mapping/identity_mapping.cuh
  • cudax/include/cuda/experimental/__group/traits.cuh
  • cudax/test/group/mapping/binary_partition.cu
  • cudax/test/group/mapping/composite_mapping.cu
  • cudax/test/group/mapping/group_as.cu
  • cudax/test/group/mapping/group_by.cu
  • cudax/test/group/mapping/identity_mapping.cu
  • cudax/test/group/synchronizer/barrier_synchronizer.cu
  • cudax/test/group/synchronizer/lane_synchronizer.cu
🚧 Files skipped from review as they are similar to previous changes (11)
  • cudax/test/group/synchronizer/lane_synchronizer.cu
  • cudax/test/group/synchronizer/barrier_synchronizer.cu
  • cudax/include/cuda/experimental/__group/mapping/binary_partition.cuh
  • cudax/include/cuda/experimental/__group/mapping/identity_mapping.cuh
  • cudax/test/group/mapping/group_as.cu
  • cudax/test/group/mapping/binary_partition.cu
  • cudax/include/cuda/experimental/__group/mapping/composite_mapping.cuh
  • cudax/test/group/mapping/identity_mapping.cu
  • cudax/test/group/mapping/composite_mapping.cu
  • cudax/test/group/mapping/group_by.cu
  • cudax/include/cuda/experimental/__group/mapping/group_by.cuh

@github-actions
Contributor

github-actions Bot commented Jun 5, 2026

🥳 CI Workflow Results

🟩 Finished in 34m 58s: Pass: 100%/55 | Total: 8h 16m | Max: 34m 58s | Hits: 69%/47114

See results here.

@davebayer davebayer merged commit 281a0e4 into NVIDIA:main Jun 5, 2026
78 checks passed
@github-project-automation github-project-automation Bot moved this from In Review to Done in CCCL Jun 5, 2026
@davebayer davebayer deleted the groups_update_lane_mask_only_for_thread_level branch June 5, 2026 12:11