Skip to content

Gtest roce infra#11490

Draft
NirWolfer wants to merge 14 commits into
openucx:masterfrom
NirWolfer:gtest_roce_infra
Draft

Gtest roce infra#11490
NirWolfer wants to merge 14 commits into
openucx:masterfrom
NirWolfer:gtest_roce_infra

Conversation

@NirWolfer
Copy link
Copy Markdown
Contributor

No description provided.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Gtest failures when running in k8s env, as unlimites
max_threads cause resource exhaustion. Setting the CPU
affinity will limit max_threads to 2 dynamically.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Run RoCE first, then IB.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
RoCE on Blossom hits a crash in rdma_get_cm_event during
test_ucp_sockaddr.ep_query that doesn't reproduce on Azure RoCE.
Land IB (CX8) coverage first; RoCE will return in a separate PR
after the rdmacm path is triaged.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Run RoCE first, then IB.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
@NirWolfer
Copy link
Copy Markdown
Contributor Author

/build

@NirWolfer
Copy link
Copy Markdown
Contributor Author

/build

The matrix runner pinned gtest to 2 CPUs to keep thread counts bounded
in k8s. Under multi-threaded variants the UCS async thread (one per
process) contended on those 2 cores with the gtest main thread, worker
threads, and UCP progress threads. Async starvation grew the missed
events queue and pushed rdmacm events through __ucs_async_poll_missed,
exposing a residual race in uct_rdmacm_cm_event_handler that segfaulted
inside rdma_get_cm_event.

Widen the affinity mask to up to 4 CPUs (clamped to the cgroup's
allowed range), still well under the pod's 10-core request. Keeps the
thread-count rationale intact while giving async breathing room.

Signed-off-by: NirWolfer <nwolfer@nvidia.com>
@NirWolfer
Copy link
Copy Markdown
Contributor Author

@svc-nvidia-pr-review

@svc-nvidia-pr-review
Copy link
Copy Markdown

🤖 Starting review — findings will be posted here when done.

@svc-nvidia-pr-review
Copy link
Copy Markdown

OK. UCS_ASYNC_BLOCK is recursive in UCX. Good.

Now I have a complete picture. Let me also check whether the mixing of bash variable expansion with the Jenkins YAML interpolation in timeout: "${STEP_TIMEOUT_MINUTES}" works. That's a YAML/Jenkins concern, not a code concern.

One more concern about the bash CPU pinning in test_matrix.yaml: it uses [[ syntax requiring bash. if [[ "${first_chunk}" == *-* ]] etc. The shebang isn't shown but Jenkins typically runs under bash. Acceptable.

Now I'll compile findings. The PR mixes two distinct changes: CI infra (.ci/pipeline/test_matrix.yaml) and a real correctness fix in rdmacm. Per REVIEW.md, "Bug fix PRs — must contain only the fix and its tests". Mixing CI changes with a bug fix is borderline, though arguably the CI changes are needed to expose/test the fix (the comment in the YAML mentions "exposed a poll_missed race in rdmacm"). This is worth flagging as a scope question but not a blocker.

Let me also check the comment on line 488-490 "Hold the cm's async block over all librdmacm calls on this channel so they cannot race with rdma_get_cm_event() running in the cm event handler." The cm event handler is uct_rdmacm_cm_event_handler which calls rdma_get_cm_event BEFORE acquiring the block. So the block does NOT actually protect against rdma_get_cm_event racing — it protects against uct_rdmacm_cm_process_event which is what dispatches per-event handlers. The comment is slightly inaccurate.

Looking again at the event handler:

ret = rdma_get_cm_event(cm->ev_ch, &event);
...
UCS_ASYNC_BLOCK(uct_rdmacm_cm_get_async(cm));
uct_rdmacm_cm_process_event(cm, event);
UCS_ASYNC_UNBLOCK(...)

So rdma_get_cm_event itself runs unblocked. The block only protects event processing. The comments in the diff at lines 488-490, 598-600, 52-55 are technically misleading: blocking does NOT prevent races with rdma_get_cm_event. It prevents races with process_event (the handlers that mutate cep state). However, since librdmacm's own data structures are presumably thread-safe internally, the actual fix is preventing the EP/listener state from being mutated concurrently.

This is a meaningful documentation accuracy issue worth flagging.

Let me finalize findings.

[
{"file": ".ci/pipeline/test_matrix.yaml", "line": 45, "severity": "minor", "comment": "renaming ${axis_index} to ${worker} is fine for this matrix where the only axis is worker, but test_gpu_matrix.yaml already uses the same ${name}/${arch}/w${worker} form — pls double-check the matrix engine actually exposes the axis name as a placeholder here (the previous code used ${axis_index} which suggests it might not)."},
{"file": ".ci/pipeline/test_matrix.yaml", "line": 57, "severity": "minor", "comment": "the previous timeout was 360 (minutes, matching the 90 min job timeout x4), now it's 180. With timeout_minutes: 240 on the job and STEP_TIMEOUT_MINUTES: 180 on the step, a single slow worker can still consume most of the job budget. Is 180 enough headroom given valgrind/asan disabled but MT tests now exposed?"},
{"file": ".ci/pipeline/test_matrix.yaml", "line": 83, "severity": "minor", "comment": "the taskset -c mask is built from the first cpu range only; if Cpus_allowed_list looks like 2,5,7-10 we drop everything after the first comma and bind to a single cpu, which silently shrinks parallelism. Worth either documenting the assumption or handling the comma-separated case."},
{"file": ".ci/pipeline/test_matrix.yaml", "line": null, "severity": "minor", "comment": "scope: this PR mixes a real correctness fix in rdmacm with a CI infra rework (new HCA images, CPU pinning, timeout/limit changes, ASAN_CHECK removal, taskName rename). Per REVIEW.md bug-fix PRs should contain only the fix and its tests — can the CI changes be split into a separate PR, keeping only the minimum needed to reproduce the rdmacm race here?"},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 491, "severity": "info", "comment": "the comment says the block prevents races with rdma_get_cm_event(), but the cm event handler calls rdma_get_cm_event() before taking UCS_ASYNC_BLOCK (see uct_rdmacm_cm_event_handler in rdmacm_cm.c). What this block actually serializes is uct_rdmacm_cm_process_event() and the per-ep state it mutates — pls update the wording so future readers don't look for a guarantee that isn't there. Same comment applies at rdmacm_cm_ep.c:600 and rdmacm_listener.c:55."},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 637, "severity": "minor", "comment": "the comment on L598-600 says server_send_priv_data takes the same recursive block on the success path, suggesting continuous coverage — but here we UNBLOCK before calling it, so there is a small unblocked window between the two. Either move the UNBLOCK after the call (relying on recursion) or drop the misleading wording."},
{"file": "src/uct/ib/rdmacm/rdmacm_listener.c", "line": 140, "severity": "info", "comment": "minor: the introduced status local can be avoided by keeping the existing return uct_rdmacm_cm_ack_event(event); pattern if you UNBLOCK before the return — though as written it's clearer. Up to you."}
]

Wait, I need to re-check if worker is actually bound from the axis. Let me drop my speculation since I can't verify, and tighten findings.

Also, I should not include an item about ${worker} since test_gpu_matrix.yaml already uses it successfully on line 64. So this likely works.

[
{"file": ".ci/pipeline/test_matrix.yaml", "line": 83, "severity": "minor", "comment": "the affinity mask is built from first_chunk only — if Cpus_allowed_list is something like 2,5,7-10 we silently drop everything after the first comma and bind to a single cpu, shrinking parallelism. Either document the assumption (k8s gives a single contiguous range) or handle the comma case."},
{"file": ".ci/pipeline/test_matrix.yaml", "line": null, "severity": "minor", "comment": "scope: this PR bundles a real rdmacm race fix with a fairly large CI infra rework (new HCA RoCE/IB images, CPU pinning, timeout/limit changes, taskName rename, ASAN_CHECK removal). Per REVIEW.md, bug-fix PRs should contain only the fix and its tests — can the CI changes be split out, keeping only what's strictly needed to reproduce the race here?"},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 491, "severity": "info", "comment": "the comment says the block prevents races with rdma_get_cm_event(), but in uct_rdmacm_cm_event_handler rdma_get_cm_event() runs outside the block — only uct_rdmacm_cm_process_event() is covered. What we actually serialize against is the per-ep state mutated by the dispatched handlers. Pls reword so readers don't look for a guarantee that isn't there (same applies to the comment at rdmacm_cm_ep.c:600 and rdmacm_listener.c:55)."},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 637, "severity": "minor", "comment": "the comment above says server_send_priv_data takes the same recursive block on the success path, implying continuous coverage — but here we UNBLOCK before calling it, so there's a brief unblocked window. Either keep the block held across the call (relying on recursion in server_send_priv_data) or drop the misleading wording."}
]

[
  {"file": ".ci/pipeline/test_matrix.yaml", "line": 83, "severity": "minor", "comment": "the affinity mask is built from `first_chunk` only — if `Cpus_allowed_list` is e.g. `2,5,7-10` we silently drop everything after the first comma and bind to a single cpu, shrinking parallelism. Either document the assumption (k8s gives a single contiguous range) or handle the comma case."},
  {"file": ".ci/pipeline/test_matrix.yaml", "line": null, "severity": "minor", "comment": "scope: this PR bundles a real rdmacm race fix with a fairly large CI infra rework (new HCA RoCE/IB images, CPU pinning, timeout/limit changes, taskName rename, ASAN_CHECK removal). Per REVIEW.md, bug-fix PRs should contain only the fix and its tests — can the CI changes be split out, keeping only what's strictly needed to reproduce the race here?"},
  {"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 491, "severity": "info", "comment": "the comment says the block prevents races with `rdma_get_cm_event()`, but in `uct_rdmacm_cm_event_handler` the `rdma_get_cm_event()` call runs *outside* the block — only `uct_rdmacm_cm_process_event()` is covered. What we actually serialize against is the per-ep state mutated by the dispatched handlers. Pls reword so readers don't look for a guarantee that isn't there (same applies to the comment at rdmacm_cm_ep.c:600 and rdmacm_listener.c:55)."},
  {"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 637, "severity": "minor", "comment": "the comment above says `server_send_priv_data` takes the same recursive block on the success path, implying continuous coverage — but here we UNBLOCK before calling it, so there's a brief unblocked window. Either keep the block held across the call (relying on recursion in `server_send_priv_data`) or drop the misleading wording."}
]

Signed-off-by: NirWolfer <nwolfer@nvidia.com>
Signed-off-by: NirWolfer <nwolfer@nvidia.com>
@NirWolfer NirWolfer force-pushed the gtest_roce_infra branch 2 times, most recently from 4ec8f8d to 75f78f5 Compare May 28, 2026 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants