Gtest roce infra #11490
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Gtest fails when running in a k8s environment because an unlimited max_threads causes resource exhaustion. Setting the CPU affinity limits max_threads to 2 dynamically. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
RoCE on Blossom hits a crash in rdma_get_cm_event during test_ucp_sockaddr.ep_query that doesn't reproduce on Azure RoCE. Land IB (CX8) coverage first; RoCE will return in a separate PR after the rdmacm path is triaged. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
/build
56cfcf8 to 85aa477
/build
85aa477 to 5cac0af
The matrix runner pinned gtest to 2 CPUs to keep thread counts bounded in k8s. Under multi-threaded variants the UCS async thread (one per process) contended on those 2 cores with the gtest main thread, worker threads, and UCP progress threads. Async starvation grew the missed events queue and pushed rdmacm events through __ucs_async_poll_missed, exposing a residual race in uct_rdmacm_cm_event_handler that segfaulted inside rdma_get_cm_event. Widen the affinity mask to up to 4 CPUs (clamped to the cgroup's allowed range), still well under the pod's 10-core request. Keeps the thread-count rationale intact while giving async breathing room. Signed-off-by: NirWolfer <nwolfer@nvidia.com>
d6bc784 to 36911bc
🤖 Starting review; findings will be posted here when done.
OK. Now I have a complete picture. Let me also check whether the mixing of bash variable expansion with the Jenkins YAML interpolation in [...]

One more concern about the bash CPU pinning in test_matrix.yaml: it uses [...]

Now I'll compile findings. The PR mixes two distinct changes: CI infra ([...]

Let me also check the comment on lines 488-490: "Hold the cm's async block over all librdmacm calls on this channel so they cannot race with rdma_get_cm_event() running in the cm event handler." The cm event handler is [...]

Looking again at the event handler:

ret = rdma_get_cm_event(cm->ev_ch, &event);
...
UCS_ASYNC_BLOCK(uct_rdmacm_cm_get_async(cm));
uct_rdmacm_cm_process_event(cm, event);
UCS_ASYNC_UNBLOCK(...)

So [...] This is a meaningful documentation accuracy issue worth flagging. Let me finalize findings.

Wait, I need to re-check if [...] Also, I should not include an item about [...]

[
{"file": ".ci/pipeline/test_matrix.yaml", "line": 83, "severity": "minor", "comment": "the affinity mask is built from `first_chunk` only — if `Cpus_allowed_list` is e.g. `2,5,7-10` we silently drop everything after the first comma and bind to a single cpu, shrinking parallelism. Either document the assumption (k8s gives a single contiguous range) or handle the comma case."},
{"file": ".ci/pipeline/test_matrix.yaml", "line": null, "severity": "minor", "comment": "scope: this PR bundles a real rdmacm race fix with a fairly large CI infra rework (new HCA RoCE/IB images, CPU pinning, timeout/limit changes, taskName rename, ASAN_CHECK removal). Per REVIEW.md, bug-fix PRs should contain only the fix and its tests — can the CI changes be split out, keeping only what's strictly needed to reproduce the race here?"},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 491, "severity": "info", "comment": "the comment says the block prevents races with `rdma_get_cm_event()`, but in `uct_rdmacm_cm_event_handler` the `rdma_get_cm_event()` call runs *outside* the block — only `uct_rdmacm_cm_process_event()` is covered. What we actually serialize against is the per-ep state mutated by the dispatched handlers. Pls reword so readers don't look for a guarantee that isn't there (same applies to the comment at rdmacm_cm_ep.c:600 and rdmacm_listener.c:55)."},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 637, "severity": "minor", "comment": "the comment above says `server_send_priv_data` takes the same recursive block on the success path, implying continuous coverage — but here we UNBLOCK before calling it, so there's a brief unblocked window. Either keep the block held across the call (relying on recursion in `server_send_priv_data`) or drop the misleading wording."}
]
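To illustrate the comma case raised in the first finding, here is a minimal C sketch. It is not the script from test_matrix.yaml: it only shows how a full `Cpus_allowed_list`-style string such as `2,5,7-10` could be expanded into individual CPU ids while keeping at most a fixed number of them. The names `parse_cpu_list` and `MAX_PIN_CPUS` are hypothetical, and the cap of 4 is taken from the commit message above.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical cap, mirroring the "up to 4 CPUs" in the commit message. */
#define MAX_PIN_CPUS 4

/*
 * Expand a cpu list such as "2,5,7-10" into individual cpu ids.
 * Stores at most max_out ids in out[] and returns how many were stored,
 * or -1 if the string is malformed. A trailing newline is tolerated.
 */
static int parse_cpu_list(const char *list, int *out, int max_out)
{
    const char *p;
    char *end;
    int count = 0;

    assert(list != NULL && out != NULL);
    p = list;

    while (*p != '\0' && *p != '\n') {
        long lo, hi, cpu;

        errno = 0;
        lo    = strtol(p, &end, 10);
        if (end == p || errno != 0 || lo < 0) {
            return -1; /* no valid cpu id where one was expected */
        }
        hi = lo;
        p  = end;

        if (*p == '-') { /* range chunk of the form lo-hi */
            p++;
            errno = 0;
            hi    = strtol(p, &end, 10);
            if (end == p || errno != 0 || hi < lo) {
                return -1;
            }
            p = end;
        }
        if (hi > INT_MAX) {
            return -1;
        }

        /* Keep ids from every chunk, not only the first one. */
        for (cpu = lo; cpu <= hi && count < max_out; cpu++) {
            out[count++] = (int)cpu;
        }

        if (*p == ',') {
            p++; /* move on to the next chunk */
            if (*p == '\0' || *p == '\n') {
                return -1; /* trailing comma */
            }
        } else if (*p != '\0' && *p != '\n') {
            return -1; /* unexpected character */
        }
    }

    return count;
}
```

With `int cpus[MAX_PIN_CPUS]`, calling `parse_cpu_list("2,5,7-10", cpus, MAX_PIN_CPUS)` returns 4 and fills `cpus` with 2, 5, 7, 8, so a non-contiguous allowed list still yields four CPUs instead of collapsing to the first chunk.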
36911bc to 4b3f6a2
Signed-off-by: NirWolfer <nwolfer@nvidia.com>
4b3f6a2 to 09f0b65
Signed-off-by: NirWolfer <nwolfer@nvidia.com>
4ec8f8d to 75f78f5
No description provided.