Skip to content

UCT/DEVICE: fix gda qp state#11517

Open
jeynmann wants to merge 6 commits into
openucx:masterfrom
jeynmann:fix_gda_qp_state
Open

UCT/DEVICE: fix gda qp state#11517
jeynmann wants to merge 6 commits into
openucx:masterfrom
jeynmann:fix_gda_qp_state

Conversation

@jeynmann
Copy link
Copy Markdown
Contributor

@jeynmann jeynmann commented Jun 3, 2026

What?
Fix wrong qp state when rc_gda reconnect.

Why?
When a rc_gda EP is destroyed and a new one is created, its channels may be reused from the pool with QPs still in a RTS state.

How?

  • Introduce uct_rc_gdaki_channel_reset_qp() to forward QPs' states → ERR → RESET.
  • Introduce uct_rc_gdaki_channel_connect() performs QP reset before connect.
  • Add reconnect test.

JeynmannZ added 4 commits June 3, 2026 14:40
Reuse of pooled GDAKI channel blocks left QPs in RTS, so reconnect
failed when devx_connect_qp expected INIT. Reset QPs on cleanup and
reconnect, and add a shared devx RST2INIT helper.
}

ucs_status_t
uct_ib_mlx5_devx_qp_rst2init(uct_ib_iface_t *iface, uct_ib_mlx5_qp_t *qp)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can reuse it in more places? looks like code is common with code from uct_ib_mlx5_devx_create_qp_common

Comment thread src/uct/ib/mlx5/ib_mlx5.h
}

static inline ucs_status_t
uct_ib_mlx5_devx_qp_rst2init(uct_ib_iface_t *iface, uct_ib_mlx5_qp_t *qp)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this one?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, to make compile pass if without devx.

Comment thread src/uct/ib/mlx5/gdaki/gdaki.c Outdated
uct_ib_iface_t *ib_iface = &iface->super.super.super;
ucs_status_t status;

status = uct_rc_gdaki_channel_reset_qp(ib_iface, &channel->qp);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not normalize the QP back to INIT in uct_rc_gdaki_cleanup_channels_pooled (on return-to-pool) instead of on connect? a freshly created QP is already left in INIT by create_qp_common, so the contract is "connect gets an INIT QP" - the pool is the only thing that breaks it by returning QP in RTS. resetting on release keeps the pool holding INIT QPs like fresh ones, leaves connect unchanged, and avoids re-resetting fresh QPs on every first connect

Copy link
Copy Markdown
Contributor Author

@jeynmann jeynmann Jun 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that wireup reconfig reuse uct_ep without destroy, then connect still sees an RTS QP?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants