Gtest roce infra by NirWolfer · Pull Request #11490 · openucx/ucx

NirWolfer · 2026-05-26T12:09:20Z

No description provided.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

Gtest failures when running in k8s env, as unlimites max_threads cause resource exhaustion. Setting the CPU affinity will limit max_threads to 2 dynamically. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

RoCE on Blossom hits a crash in rdma_get_cm_event during test_ucp_sockaddr.ep_query that doesn't reproduce on Azure RoCE. Land IB (CX8) coverage first; RoCE will return in a separate PR after the rdmacm path is triaged. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

NirWolfer · 2026-05-26T12:09:32Z

/build

NirWolfer · 2026-05-26T13:01:34Z

/build

The matrix runner pinned gtest to 2 CPUs to keep thread counts bounded in k8s. Under multi-threaded variants the UCS async thread (one per process) contended on those 2 cores with the gtest main thread, worker threads, and UCP progress threads. Async starvation grew the missed events queue and pushed rdmacm events through __ucs_async_poll_missed, exposing a residual race in uct_rdmacm_cm_event_handler that segfaulted inside rdma_get_cm_event. Widen the affinity mask to up to 4 CPUs (clamped to the cgroup's allowed range), still well under the pod's 10-core request. Keeps the thread-count rationale intact while giving async breathing room. Signed-off-by: NirWolfer <nwolfer@nvidia.com>

NirWolfer · 2026-05-27T11:08:44Z

@svc-nvidia-pr-review

svc-nvidia-pr-review · 2026-05-27T11:08:48Z

🤖 Starting review — findings will be posted here when done.

svc-nvidia-pr-review · 2026-05-27T11:11:10Z

OK. UCS_ASYNC_BLOCK is recursive in UCX. Good.

Now I have a complete picture. Let me also check whether the mixing of bash variable expansion with the Jenkins YAML interpolation in timeout: "${STEP_TIMEOUT_MINUTES}" works. That's a YAML/Jenkins concern, not a code concern.

One more concern about the bash CPU pinning in test_matrix.yaml: it uses [[ syntax requiring bash. if [[ "${first_chunk}" == *-* ]] etc. The shebang isn't shown but Jenkins typically runs under bash. Acceptable.

Now I'll compile findings. The PR mixes two distinct changes: CI infra (.ci/pipeline/test_matrix.yaml) and a real correctness fix in rdmacm. Per REVIEW.md, "Bug fix PRs — must contain only the fix and its tests". Mixing CI changes with a bug fix is borderline, though arguably the CI changes are needed to expose/test the fix (the comment in the YAML mentions "exposed a poll_missed race in rdmacm"). This is worth flagging as a scope question but not a blocker.

Let me also check the comment on line 488-490 "Hold the cm's async block over all librdmacm calls on this channel so they cannot race with rdma_get_cm_event() running in the cm event handler." The cm event handler is uct_rdmacm_cm_event_handler which calls rdma_get_cm_event BEFORE acquiring the block. So the block does NOT actually protect against rdma_get_cm_event racing — it protects against uct_rdmacm_cm_process_event which is what dispatches per-event handlers. The comment is slightly inaccurate.

Looking again at the event handler:

ret = rdma_get_cm_event(cm->ev_ch, &event);
...
UCS_ASYNC_BLOCK(uct_rdmacm_cm_get_async(cm));
uct_rdmacm_cm_process_event(cm, event);
UCS_ASYNC_UNBLOCK(...)

So rdma_get_cm_event itself runs unblocked. The block only protects event processing. The comments in the diff at lines 488-490, 598-600, 52-55 are technically misleading: blocking does NOT prevent races with rdma_get_cm_event. It prevents races with process_event (the handlers that mutate cep state). However, since librdmacm's own data structures are presumably thread-safe internally, the actual fix is preventing the EP/listener state from being mutated concurrently.

This is a meaningful documentation accuracy issue worth flagging.

Let me finalize findings.

[
{"file": ".ci/pipeline/test_matrix.yaml", "line": 45, "severity": "minor", "comment": "renaming ${axis_index} to ${worker} is fine for this matrix where the only axis is worker, but test_gpu_matrix.yaml already uses the same ${name}/${arch}/w${worker} form — pls double-check the matrix engine actually exposes the axis name as a placeholder here (the previous code used ${axis_index} which suggests it might not)."},
{"file": ".ci/pipeline/test_matrix.yaml", "line": 57, "severity": "minor", "comment": "the previous timeout was 360 (minutes, matching the 90 min job timeout x4), now it's 180. With timeout_minutes: 240 on the job and STEP_TIMEOUT_MINUTES: 180 on the step, a single slow worker can still consume most of the job budget. Is 180 enough headroom given valgrind/asan disabled but MT tests now exposed?"},
{"file": ".ci/pipeline/test_matrix.yaml", "line": 83, "severity": "minor", "comment": "the taskset -c mask is built from the first cpu range only; if Cpus_allowed_list looks like 2,5,7-10 we drop everything after the first comma and bind to a single cpu, which silently shrinks parallelism. Worth either documenting the assumption or handling the comma-separated case."},
{"file": ".ci/pipeline/test_matrix.yaml", "line": null, "severity": "minor", "comment": "scope: this PR mixes a real correctness fix in rdmacm with a CI infra rework (new HCA images, CPU pinning, timeout/limit changes, ASAN_CHECK removal, taskName rename). Per REVIEW.md bug-fix PRs should contain only the fix and its tests — can the CI changes be split into a separate PR, keeping only the minimum needed to reproduce the rdmacm race here?"},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 491, "severity": "info", "comment": "the comment says the block prevents races with rdma_get_cm_event(), but the cm event handler calls rdma_get_cm_event() before taking UCS_ASYNC_BLOCK (see uct_rdmacm_cm_event_handler in rdmacm_cm.c). What this block actually serializes is uct_rdmacm_cm_process_event() and the per-ep state it mutates — pls update the wording so future readers don't look for a guarantee that isn't there. Same comment applies at rdmacm_cm_ep.c:600 and rdmacm_listener.c:55."},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 637, "severity": "minor", "comment": "the comment on L598-600 says server_send_priv_data takes the same recursive block on the success path, suggesting continuous coverage — but here we UNBLOCK before calling it, so there is a small unblocked window between the two. Either move the UNBLOCK after the call (relying on recursion) or drop the misleading wording."},
{"file": "src/uct/ib/rdmacm/rdmacm_listener.c", "line": 140, "severity": "info", "comment": "minor: the introduced status local can be avoided by keeping the existing return uct_rdmacm_cm_ack_event(event); pattern if you UNBLOCK before the return — though as written it's clearer. Up to you."}
]

Wait, I need to re-check if worker is actually bound from the axis. Let me drop my speculation since I can't verify, and tighten findings.

Also, I should not include an item about ${worker} since test_gpu_matrix.yaml already uses it successfully on line 64. So this likely works.

[
{"file": ".ci/pipeline/test_matrix.yaml", "line": 83, "severity": "minor", "comment": "the affinity mask is built from first_chunk only — if Cpus_allowed_list is something like 2,5,7-10 we silently drop everything after the first comma and bind to a single cpu, shrinking parallelism. Either document the assumption (k8s gives a single contiguous range) or handle the comma case."},
{"file": ".ci/pipeline/test_matrix.yaml", "line": null, "severity": "minor", "comment": "scope: this PR bundles a real rdmacm race fix with a fairly large CI infra rework (new HCA RoCE/IB images, CPU pinning, timeout/limit changes, taskName rename, ASAN_CHECK removal). Per REVIEW.md, bug-fix PRs should contain only the fix and its tests — can the CI changes be split out, keeping only what's strictly needed to reproduce the race here?"},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 491, "severity": "info", "comment": "the comment says the block prevents races with rdma_get_cm_event(), but in uct_rdmacm_cm_event_handler rdma_get_cm_event() runs outside the block — only uct_rdmacm_cm_process_event() is covered. What we actually serialize against is the per-ep state mutated by the dispatched handlers. Pls reword so readers don't look for a guarantee that isn't there (same applies to the comment at rdmacm_cm_ep.c:600 and rdmacm_listener.c:55)."},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 637, "severity": "minor", "comment": "the comment above says server_send_priv_data takes the same recursive block on the success path, implying continuous coverage — but here we UNBLOCK before calling it, so there's a brief unblocked window. Either keep the block held across the call (relying on recursion in server_send_priv_data) or drop the misleading wording."}
]

[
  {"file": ".ci/pipeline/test_matrix.yaml", "line": 83, "severity": "minor", "comment": "the affinity mask is built from `first_chunk` only — if `Cpus_allowed_list` is e.g. `2,5,7-10` we silently drop everything after the first comma and bind to a single cpu, shrinking parallelism. Either document the assumption (k8s gives a single contiguous range) or handle the comma case."},
  {"file": ".ci/pipeline/test_matrix.yaml", "line": null, "severity": "minor", "comment": "scope: this PR bundles a real rdmacm race fix with a fairly large CI infra rework (new HCA RoCE/IB images, CPU pinning, timeout/limit changes, taskName rename, ASAN_CHECK removal). Per REVIEW.md, bug-fix PRs should contain only the fix and its tests — can the CI changes be split out, keeping only what's strictly needed to reproduce the race here?"},
  {"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 491, "severity": "info", "comment": "the comment says the block prevents races with `rdma_get_cm_event()`, but in `uct_rdmacm_cm_event_handler` the `rdma_get_cm_event()` call runs *outside* the block — only `uct_rdmacm_cm_process_event()` is covered. What we actually serialize against is the per-ep state mutated by the dispatched handlers. Pls reword so readers don't look for a guarantee that isn't there (same applies to the comment at rdmacm_cm_ep.c:600 and rdmacm_listener.c:55)."},
  {"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 637, "severity": "minor", "comment": "the comment above says `server_send_priv_data` takes the same recursive block on the success path, implying continuous coverage — but here we UNBLOCK before calling it, so there's a brief unblocked window. Either keep the block held across the call (relying on recursion in `server_send_priv_data`) or drop the misleading wording."}
]

Signed-off-by: NirWolfer <nwolfer@nvidia.com>

Alexey-Rivkin added 11 commits May 19, 2026 12:05

CI: Target IB-capable nodes for tests

ab6316b

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: Add Bluefield tests

a61f46d

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: Dynamiclayy set CPU affinity

5348ca5

Gtest failures when running in k8s env, as unlimites max_threads cause resource exhaustion. Setting the CPU affinity will limit max_threads to 2 dynamically. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: Test SRIOV settings

c24856b

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: Rename to IB as opposed to RoCE

a49719f

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: address PR comments

430f99f

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: remove ASAN-only override from gtest matrix run

4c3d5ce

CI: document CPU affinity limitation in gtest matrix

d45deed

CI: Add RoCE (CX7) test target alongside IB (CX8)

21fad89

Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: Add RoCE (CX7) test target alongside IB (CX8)

4d8ef2f

Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

NirWolfer force-pushed the gtest_roce_infra branch from 56cfcf8 to 85aa477 Compare May 26, 2026 12:59

NirWolfer force-pushed the gtest_roce_infra branch from 85aa477 to 5cac0af Compare May 26, 2026 14:26

NirWolfer force-pushed the gtest_roce_infra branch from d6bc784 to 36911bc Compare May 27, 2026 10:14

NirWolfer force-pushed the gtest_roce_infra branch from 36911bc to 4b3f6a2 Compare May 27, 2026 11:14

Upgrade builder docker to newer version

09f0b65

Signed-off-by: NirWolfer <nwolfer@nvidia.com>

NirWolfer force-pushed the gtest_roce_infra branch from 4b3f6a2 to 09f0b65 Compare May 27, 2026 11:44

CI: set ulimit -l unlimited for gtest

75f78f5

Signed-off-by: NirWolfer <nwolfer@nvidia.com>

NirWolfer force-pushed the gtest_roce_infra branch 2 times, most recently from 4ec8f8d to 75f78f5 Compare May 28, 2026 09:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gtest roce infra#11490

Gtest roce infra#11490
NirWolfer wants to merge 14 commits into
openucx:masterfrom
NirWolfer:gtest_roce_infra

NirWolfer commented May 26, 2026

Uh oh!

NirWolfer commented May 26, 2026

Uh oh!

NirWolfer commented May 26, 2026

Uh oh!

NirWolfer commented May 27, 2026

Uh oh!

svc-nvidia-pr-review commented May 27, 2026

Uh oh!

svc-nvidia-pr-review commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

NirWolfer commented May 26, 2026

Uh oh!

NirWolfer commented May 26, 2026

Uh oh!

NirWolfer commented May 26, 2026

Uh oh!

NirWolfer commented May 27, 2026

Uh oh!

svc-nvidia-pr-review commented May 27, 2026

Uh oh!

svc-nvidia-pr-review commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants