Skip to content

UCT/EP: Enhance uct_ep_check#11495

Open
evgeny-leksikov wants to merge 4 commits into
openucx:masterfrom
evgeny-leksikov:uct_ud_ep_check_comp
Open

UCT/EP: Enhance uct_ep_check#11495
evgeny-leksikov wants to merge 4 commits into
openucx:masterfrom
evgeny-leksikov:uct_ud_ep_check_comp

Conversation

@evgeny-leksikov
Copy link
Copy Markdown
Contributor

What?

Enhance uct_ep_check to support asynchronous completion callbacks.

Why?

  • This change improves the flexibility and responsiveness of endpoint status checks in the UCP layer.
  • Required for effective recovery after failure

How?

  • Modified uct_ep_check to accept a completion callback, allowing for asynchronous reporting of peer status.
  • Updated related functions across various transport implementations to handle the new completion parameter.
  • Added tests to verify the correct behavior of the asynchronous completion mechanism during endpoint checks.

@roiedanino
Copy link
Copy Markdown
Contributor

@svc-nvidia-pr-review

@svc-nvidia-pr-review
Copy link
Copy Markdown

🤖 Starting review — findings will be posted here when done.

@roiedanino
Copy link
Copy Markdown
Contributor

@svc-nvidia-pr-review

@svc-nvidia-pr-review
Copy link
Copy Markdown

🤖 Starting review — findings will be posted here when done.

Copy link
Copy Markdown

@svc-nvidia-pr-review svc-nvidia-pr-review left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

8 findings — 🚨 1 blocker, ⚠️ 7 minors

7 findings posted as inline comments.

General Findings

⚠️ MINOR src/uct/api/uct.h

API doc for uct_ep_check is not updated to describe the new comp semantics. Public header still implies comp must be NULL (matching the just-removed UCT_CHECK_PARAM). pls add a docstring update covering UCS_OK vs UCS_INPROGRESS returns and when comp is invoked.

Comment thread src/uct/ib/ud/base/ud_ep.c Outdated
Comment thread src/uct/ib/ud/base/ud_ep.c
Comment thread src/ucp/core/ucp_worker.c
Comment thread src/uct/ib/rc/base/rc_ep.c
Comment thread test/gtest/uct/test_peer_failure.cc
Comment thread src/uct/tcp/tcp_ep.c
ep->flags &= ~UCT_RC_EP_FLAG_KEEPALIVE_PENDING;
if (status == UCS_OK) {
ep->flags &= ~UCT_RC_EP_FLAG_KEEPALIVE_PENDING;
ucs_assert(comp == NULL);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ MINOR

ucs_assert(comp == NULL) here is true by construction (uct_rc_ep_check_internal returns UCS_INPROGRESS when comp != NULL), but the early if (status == UCS_ERR_NO_RESOURCE) + late if (status == UCS_INPROGRESS) + final error branch reads easier as a switch. Minor readability.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer to keep the per-case if-chain. Each arm has a different post-action (release-and-return / leave-req-and-return / invoke-completion-then-release) and the early UCS_ERR_NO_RESOURCE return is semantically distinct (the req is re-queued by the dispatcher, not consumed). A switch would either need an early return outside it or a fall-through tail — neither makes the asymmetry clearer.

@evgeny-leksikov evgeny-leksikov changed the title UCP/EP: Enhance uct_ep_check UCT/EP: Enhance uct_ep_check May 28, 2026
Copy link
Copy Markdown
Contributor Author

@evgeny-leksikov evgeny-leksikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to fix commit comment UCP -> UCT

}

return status;
if (status == UCS_INPROGRESS) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a queued ep_check posts, uct_rc_ep_check_progress() returns UCS_INPROGRESS. uct_rc_ep_process_pending() maps that to NEXT_GROUP -> the same pending request stays queued. Further progress can repost w same comp -> purge can cancel a comp that the CQE path will also invoke. Free the pending request and return UCS_OK once the async check is armed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posted operations should be removed from pending queue and enqueued to transport level outstanding ops. So, pending purge won't take any effect on it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In uct_rc_ep_check after uct_rc_ep_check_internal, there's

ucs_assert((comp == NULL) ? (status == UCS_OK) :
                   (status == UCS_INPROGRESS));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user requests check with uct_completion, we return UCS_INPROGRESS with promise return actual status in uct_completion (can be used to acknowledge new connection), otherwise UCS_OK which means EP is fine at this exact moment, but in case if check WQE completes with ERR CQE, we report error by error callback (used to detect peer/link failure). What am I missing?

/* in case if no TX resources are available then there is at least
* one signaled operation which provides actual peer status, in this case
* just return without any actions */
UCT_RC_CHECK_TXQP_RET(iface, ep, UCS_OK);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns UCS_OK even when comp != NULL. In debug -> new assertion expects UCS_INPROGRESS; in release -> ignores the user comp. The async path should either attach comp to the relevant outstanding signaled op or keep/retry the check until it can arm a real comp.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In debug -> new assertion expects UCS_INPROGRESS

could you please point which one?

in release -> ignores the user comp

this is the common UCT/UCP design - don't invoke completion if the operation completed immediately.

Copy link
Copy Markdown
Contributor

@gleon99 gleon99 Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uct_rc_ep_process_pending() gets UCS_INPROGRESS from the callback -> returns UCS_ARBITER_CB_RESULT_NEXT_GROUP. Then, after uct_rc_ep_check_progress() returns UCS_INPROGRESS, the same uct_pending_req_t still remains on the arbiter, while its comp was already attached to the txqp outstanding op?

evgeny-leksikov added a commit to evgeny-leksikov/ucx that referenced this pull request Jun 5, 2026
Address PR openucx#11495 review r3318024007: in the wireup branch
uct_ud_ep_check() fell through to uct_ud_ep_comp_skb_add(), which
returns UCS_INPROGRESS for comp==NULL and broke the previous UCS_OK
contract observed by legacy NULL-comp callers.

Short-circuit the comp==NULL case in a unified tail block; the now
redundant "connected + in-flight + comp==NULL" else-if is folded into
the same check.
…acks

- Modified uct_ep_check to accept a completion callback, allowing for asynchronous reporting of peer status.
- Updated related functions across various transport implementations to handle the new completion parameter.
- Added tests to verify the correct behavior of the asynchronous completion mechanism during endpoint checks.

This change improves the flexibility and responsiveness of endpoint status checks in the UCT layer.
Address PR openucx#11495 review r3318024007: in the wireup branch
uct_ud_ep_check() fell through to uct_ud_ep_comp_skb_add(), which
returns UCS_INPROGRESS for comp==NULL and broke the previous UCS_OK
contract observed by legacy NULL-comp callers.

Short-circuit the comp==NULL case in a unified tail block; the now
redundant "connected + in-flight + comp==NULL" else-if is folded into
the same check.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants