UCT/DEVICE: fix gda qp state #11517
Reuse of pooled GDAKI channel blocks left QPs in RTS, so reconnect failed when devx_connect_qp expected INIT. Reset QPs on cleanup and reconnect, and add a shared devx RST2INIT helper.
    }

    ucs_status_t
    uct_ib_mlx5_devx_qp_rst2init(uct_ib_iface_t *iface, uct_ib_mlx5_qp_t *qp)
Maybe we can reuse it in more places? It looks like this code is shared with uct_ib_mlx5_devx_create_qp_common.
    }

    static inline ucs_status_t
    uct_ib_mlx5_devx_qp_rst2init(uct_ib_iface_t *iface, uct_ib_mlx5_qp_t *qp)
Yes, it is needed so that compilation passes when building without devx.
    uct_ib_iface_t *ib_iface = &iface->super.super.super;
    ucs_status_t status;

    status = uct_rc_gdaki_channel_reset_qp(ib_iface, &channel->qp);
Why not normalize the QP back to INIT in uct_rc_gdaki_cleanup_channels_pooled (on return to the pool) instead of on connect? A freshly created QP is already left in INIT by create_qp_common, so the contract is "connect gets an INIT QP". The pool is the only thing that breaks it, by returning a QP in RTS. Resetting on release keeps the pool holding INIT QPs just like fresh ones, leaves connect unchanged, and avoids re-resetting fresh QPs on every first connect.
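To make the reset-on-release suggestion concrete, here is a minimal toy model in plain C. It is only a sketch: every `toy_*` name is hypothetical, the state enum is a simplification rather than the real mlx5 definitions, and `toy_qp_reset_to_init` merely stands in for whatever the RST and RST2INIT helpers do in the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified QP states; not the real mlx5/UCX definitions. */
typedef enum { QP_RESET, QP_INIT, QP_RTR, QP_RTS } toy_qp_state_t;

typedef struct {
    toy_qp_state_t state;
    int            in_use;
} toy_channel_t;

#define TOY_POOL_SIZE 4

typedef struct {
    toy_channel_t channels[TOY_POOL_SIZE];
} toy_pool_t;

/* Fresh channels start in INIT, as the comment above says of fresh QPs. */
static void toy_pool_init(toy_pool_t *pool)
{
    size_t i;

    for (i = 0; i < TOY_POOL_SIZE; ++i) {
        pool->channels[i].state  = QP_INIT;
        pool->channels[i].in_use = 0;
    }
}

/* Hand out the first free channel, or NULL if the pool is exhausted. */
static toy_channel_t *toy_pool_get(toy_pool_t *pool)
{
    size_t i;

    for (i = 0; i < TOY_POOL_SIZE; ++i) {
        if (!pool->channels[i].in_use) {
            pool->channels[i].in_use = 1;
            return &pool->channels[i];
        }
    }

    return NULL;
}

/* Hypothetical stand-in for the reset path: any state -> RESET -> INIT. */
static void toy_qp_reset_to_init(toy_channel_t *ch)
{
    ch->state = QP_RESET;
    ch->state = QP_INIT;
}

/* Release normalizes the QP, so the pool only ever holds INIT QPs. */
static void toy_pool_put(toy_channel_t *ch)
{
    assert(ch->in_use);
    toy_qp_reset_to_init(ch);
    ch->in_use = 0;
}

/* Connect is unchanged: it accepts only INIT and returns -1 otherwise. */
static int toy_connect(toy_channel_t *ch)
{
    if (ch->state != QP_INIT) {
        return -1;
    }

    ch->state = QP_RTR;
    ch->state = QP_RTS;
    return 0;
}
```

In this model a get, connect, put, get, connect sequence succeeds twice on the same channel, while a second connect without the intervening put fails, which is the failure mode the PR describes.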
Is it possible that a wireup reconfiguration reuses the uct_ep without destroying it, so that connect still sees an RTS QP?
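If that reuse-without-destroy case is real, one way to cover it without resetting fresh QPs is to normalize in connect only when the QP is not already in INIT. The sketch below illustrates the idea with hypothetical `ex_*` names. It assumes the state is tracked in software, which may not match how the driver actually knows the QP state, and it is not what the patch itself does.

```c
#include <assert.h>

/* Simplified QP states; not the real mlx5/UCX definitions. */
typedef enum { EX_QP_RESET, EX_QP_INIT, EX_QP_RTR, EX_QP_RTS } ex_qp_state_t;

typedef struct {
    ex_qp_state_t state;
    unsigned      reset_count; /* counts resets, for illustration only */
} ex_qp_t;

/* Hypothetical stand-in for the reset path: any state -> RESET -> INIT. */
static void ex_qp_reset_to_init(ex_qp_t *qp)
{
    qp->state = EX_QP_RESET;
    qp->state = EX_QP_INIT;
    qp->reset_count++;
}

/* Connect normalizes only when needed, then walks INIT -> RTR -> RTS. */
static void ex_connect(ex_qp_t *qp)
{
    if (qp->state != EX_QP_INIT) {
        /* Reused endpoint: the QP is still in RTS from the last connect. */
        ex_qp_reset_to_init(qp);
    }

    assert(qp->state == EX_QP_INIT);
    qp->state = EX_QP_RTR;
    qp->state = EX_QP_RTS;
}
```

With this shape, a first connect on a fresh QP performs no reset, and a second connect on the same endpoint performs exactly one.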
What?
Fix the wrong QP state when rc_gda reconnects.
Why?
When an rc_gda EP is destroyed and a new one is created, its channels may be reused from the pool with QPs still in the RTS state.
How?