UCT/EP: Enhance uct_ep_check#11495
Conversation
|
🤖 Starting review — findings will be posted here when done. |
|
🤖 Starting review — findings will be posted here when done. |
svc-nvidia-pr-review
left a comment
There was a problem hiding this comment.
Code Review
8 findings — 🚨 1 blocker,
7 findings posted as inline comments.
General Findings
src/uct/api/uct.h
API doc for uct_ep_check is not updated to describe the new comp semantics. Public header still implies comp must be NULL (matching the just-removed UCT_CHECK_PARAM). pls add a docstring update covering UCS_OK vs UCS_INPROGRESS returns and when comp is invoked.
| ep->flags &= ~UCT_RC_EP_FLAG_KEEPALIVE_PENDING; | ||
| if (status == UCS_OK) { | ||
| ep->flags &= ~UCT_RC_EP_FLAG_KEEPALIVE_PENDING; | ||
| ucs_assert(comp == NULL); |
There was a problem hiding this comment.
ucs_assert(comp == NULL) here is true by construction (uct_rc_ep_check_internal returns UCS_INPROGRESS when comp != NULL), but the early if (status == UCS_ERR_NO_RESOURCE) + late if (status == UCS_INPROGRESS) + final error branch reads easier as a switch. Minor readability.
There was a problem hiding this comment.
Prefer to keep the per-case if-chain. Each arm has a different post-action (release-and-return / leave-req-and-return / invoke-completion-then-release) and the early UCS_ERR_NO_RESOURCE return is semantically distinct (the req is re-queued by the dispatcher, not consumed). A switch would either need an early return outside it or a fall-through tail — neither makes the asymmetry clearer.
evgeny-leksikov
left a comment
There was a problem hiding this comment.
Need to fix commit comment UCP -> UCT
| } | ||
|
|
||
| return status; | ||
| if (status == UCS_INPROGRESS) { |
There was a problem hiding this comment.
When a queued ep_check posts, uct_rc_ep_check_progress() returns UCS_INPROGRESS. uct_rc_ep_process_pending() maps that to NEXT_GROUP -> the same pending request stays queued. Further progress can repost w same comp -> purge can cancel a comp that the CQE path will also invoke. Free the pending request and return UCS_OK once the async check is armed?
There was a problem hiding this comment.
Posted operations should be removed from pending queue and enqueued to transport level outstanding ops. So, pending purge won't take any effect on it.
There was a problem hiding this comment.
In uct_rc_ep_check after uct_rc_ep_check_internal, there's
ucs_assert((comp == NULL) ? (status == UCS_OK) :
(status == UCS_INPROGRESS));
There was a problem hiding this comment.
If user requests check with uct_completion, we return UCS_INPROGRESS with promise return actual status in uct_completion (can be used to acknowledge new connection), otherwise UCS_OK which means EP is fine at this exact moment, but in case if check WQE completes with ERR CQE, we report error by error callback (used to detect peer/link failure). What am I missing?
| /* in case if no TX resources are available then there is at least | ||
| * one signaled operation which provides actual peer status, in this case | ||
| * just return without any actions */ | ||
| UCT_RC_CHECK_TXQP_RET(iface, ep, UCS_OK); |
There was a problem hiding this comment.
Returns UCS_OK even when comp != NULL. In debug -> new assertion expects UCS_INPROGRESS; in release -> ignores the user comp. The async path should either attach comp to the relevant outstanding signaled op or keep/retry the check until it can arm a real comp.
There was a problem hiding this comment.
In debug -> new assertion expects UCS_INPROGRESS
could you please point which one?
in release -> ignores the user comp
this is the common UCT/UCP design - don't invoke completion if the operation completed immediately.
There was a problem hiding this comment.
uct_rc_ep_process_pending() gets UCS_INPROGRESS from the callback -> returns UCS_ARBITER_CB_RESULT_NEXT_GROUP. Then, after uct_rc_ep_check_progress() returns UCS_INPROGRESS, the same uct_pending_req_t still remains on the arbiter, while its comp was already attached to the txqp outstanding op?
d64a91a to
bbb094e
Compare
Address PR openucx#11495 review r3318024007: in the wireup branch uct_ud_ep_check() fell through to uct_ud_ep_comp_skb_add(), which returns UCS_INPROGRESS for comp==NULL and broke the previous UCS_OK contract observed by legacy NULL-comp callers. Short-circuit the comp==NULL case in a unified tail block; the now redundant "connected + in-flight + comp==NULL" else-if is folded into the same check.
…acks - Modified uct_ep_check to accept a completion callback, allowing for asynchronous reporting of peer status. - Updated related functions across various transport implementations to handle the new completion parameter. - Added tests to verify the correct behavior of the asynchronous completion mechanism during endpoint checks. This change improves the flexibility and responsiveness of endpoint status checks in the UCT layer.
Address PR openucx#11495 review r3318024007: in the wireup branch uct_ud_ep_check() fell through to uct_ud_ep_comp_skb_add(), which returns UCS_INPROGRESS for comp==NULL and broke the previous UCS_OK contract observed by legacy NULL-comp callers. Short-circuit the comp==NULL case in a unified tail block; the now redundant "connected + in-flight + comp==NULL" else-if is folded into the same check.
5a6410a to
421e02a
Compare
What?
Enhance uct_ep_check to support asynchronous completion callbacks.
Why?
How?