UCP/CORE: Print debugging tables on context and ep creation #11510

Draft
guy-ealey-morag wants to merge 12 commits into openucx:master from guy-ealey-morag:ctx-ep-tables

guy-ealey-morag (Contributor) commented Jun 1, 2026

What?

Print human-readable tables on context and endpoint creation that show which transports and devices are available and which of them are actually in use.

Why?

Users often run UCX (or NIXL with UCX) and get unexpected behavior due to misconfiguration (in env vars or in the container).
Printing the transport and device tables allows users to see which transports are available, which devices are visible, and which transports and devices are enabled or disabled in their current configuration. The tables can also report that a transport compiled into UCX is unsupported in the current environment.
On endpoint creation we print the transports and devices that were selected, along with the lane types.

How?

All information about the transports and devices is collected during initialization. It is then printed as a table at context init and at every endpoint init, depending on the UCX_PRINT_TRANSPORT_TABLES environment variable.
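
As a rough illustration of this flow (not the PR's actual C implementation), the Python sketch below gates printing on the environment variable and marks each transport and device with `+` or `-`. The `TransportInfo` record, the `render_table` and `maybe_print_tables` helpers, and the set of values treated as "on" for `UCX_PRINT_TRANSPORT_TABLES` are all assumptions made for the example; the description itself only shows the value `y`.

```python
import os
from dataclasses import dataclass, field


@dataclass
class TransportInfo:
    # Hypothetical record; the real UCX structures are not shown in this PR text.
    component: str
    name: str
    enabled: bool
    devices: list = field(default_factory=list)  # list of (device_name, enabled)


def mark(enabled):
    # '+' = enabled, '-' = disabled, matching the legend used in the tables.
    return "+" if enabled else "-"


def render_table(transports):
    """Render one row per transport with +/- markers on transports and devices."""
    rows = [("Component", "Transport", "Device")]
    for t in transports:
        devices = "  ".join(f"{mark(on)} {dev}" for dev, on in t.devices)
        rows.append((t.component, f"{mark(t.enabled)} {t.name}", devices))
    widths = [max(len(row[i]) for row in rows) for i in range(3)]
    border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"
    lines = [border]
    for idx, row in enumerate(rows):
        cells = " | ".join(cell.ljust(w) for cell, w in zip(row, widths))
        lines.append(f"| {cells} |")
        if idx == 0:
            lines.append(border)  # separator under the header row
    lines.append(border)
    return "\n".join(lines)


def maybe_print_tables(transports, env=os.environ):
    # Assumption: any of these values turns printing on.
    if env.get("UCX_PRINT_TRANSPORT_TABLES", "n").lower() in ("y", "yes", "1"):
        print(render_table(transports))
        return True
    return False
```

Keeping collection (building the `TransportInfo` list) separate from rendering mirrors the description: the data is gathered once during initialization and only formatted when the variable asks for it.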

Examples:

(The timestamp/process prefix added by ucs_log_print_compact was removed here for improved readability)

ucx_perftest with cuda

$ UCX_PRINT_TRANSPORT_TABLES=y ucx_perftest -l -t ucp_put_bw -n 500000 -s 8 -c 0 -E poll -m cuda

+---------------------------------------------------------------------------------------------------------+
| Available Transports and Devices (ctx: perftest)                                                        |
+---------------+-----------+-------------+---------------------------------------------------------------+
| Type          | Component | Transport   | Device (System device)                                        |
+---------------+-----------+-------------+---------------------------------------------------------------+
| network       | tcp       | + tcp       | + ens10f0 (ens10f0)  + ibs2 (mlx5_0)  + ibs7f0 (mlx5_1)       |
|               |           |             | + lo                                                          |
|               +-----------+-------------+---------------------------------------------------------------+
|               | ib        | + rc_verbs  | + mlx5_0:1 (mlx5_0)  + mlx5_1:1 (mlx5_1)  + mlx5_2:1 (mlx5_2) |
|               |           | + ud_verbs  | + mlx5_0:1 (mlx5_0)  + mlx5_1:1 (mlx5_1)  + mlx5_2:1 (mlx5_2) |
|               |           | + dc_mlx5   | + mlx5_0:1 (mlx5_0)  + mlx5_1:1 (mlx5_1)  + mlx5_2:1 (mlx5_2) |
|               |           | + rc_mlx5   | + mlx5_0:1 (mlx5_0)  + mlx5_1:1 (mlx5_1)  + mlx5_2:1 (mlx5_2) |
|               |           | + ud_mlx5   | + mlx5_0:1 (mlx5_0)  + mlx5_1:1 (mlx5_1)  + mlx5_2:1 (mlx5_2) |
|               |           | + rc_gda    | + cuda0-mlx5_1:1 (GPU0)                                       |
|               +-----------+-------------+---------------------------------------------------------------+
|               | gga       | + gga_mlx5  | + mlx5_1:1 (mlx5_1)                                           |
+---------------+-----------+-------------+---------------------------------------------------------------+
| intra-node    | sysv      | + sysv      | + memory                                                      |
|               +-----------+-------------+---------------------------------------------------------------+
|               | posix     | + posix     | + memory                                                      |
|               +-----------+-------------+---------------------------------------------------------------+
|               | cuda_ipc  | + cuda_ipc  | + cuda (GPU0)                                                 |
|               +-----------+-------------+---------------------------------------------------------------+
|               | cma       | + cma       | + memory                                                      |
+---------------+-----------+-------------+---------------------------------------------------------------+
| accelerator   | cuda_cpy  | + cuda_copy | + cuda (GPU0)                                                 |
+---------------+-----------+-------------+---------------------------------------------------------------+
| loopback      | self      | + self      | + memory                                                      |
+---------------+-----------+-------------+---------------------------------------------------------------+
| Legend: + = enabled, - = disabled                                                                       |
| All of the available transports are listed, some may be disabled or unsupported on your system.         |
| All of the visible devices are listed per transport, some may be disabled.                              |
+---------------+-----------+-----------+-----------------------------------------------------------------+

+--------------------------------------------------------------+
| Endpoint Config #1 (ctx: perftest, type: self)               |
+-----------+--------------------+---------+-------------------+
| Transport | Device (Sys. dev.) | # Lanes | Lane Types        |
+-----------+--------------------+---------+-------------------+
| self      | memory             |       1 | am, rma, rkey_ptr |
+-----------+--------------------+---------+-------------------+
| rc_mlx5   | mlx5_2:1 (mlx5_2)  |       2 | rma, rma_bw       |
|           | mlx5_1:1 (mlx5_1)  |       2 | rma_bw            |
+-----------+--------------------+---------+-------------------+
| cma       | memory             |       1 | rma_bw            |
+-----------+--------------------+---------+-------------------+
| cuda_copy | cuda (GPU0)        |       1 | rma_bw            |
|           | mlx5_0:1 (mlx5_0)  |       2 | rma_bw            |
+-----------+--------------------+---------+-------------------+
| cuda_ipc  | cuda (GPU0)        |       1 | rma_bw            |
+-----------+--------------------+---------+-------------------+

ucx_perftest with tcp

$ UCX_PRINT_TRANSPORT_TABLES=y UCX_TLS=tcp ucx_perftest -l -t ucp_put_bw -n 500000 -s 8 -c 0 -E poll

+---------------------------------------------------------------------------------------------------------+
| Available Transports and Devices (ctx: perftest)                                                        |
+---------------+-----------+-------------+---------------------------------------------------------------+
| Type          | Component | Transport   | Device (System device)                                        |
+---------------+-----------+-------------+---------------------------------------------------------------+
| network       | tcp       | + tcp       | + ens10f0 (ens10f0)  + ibs2 (mlx5_0)  + ibs7f0 (mlx5_1)       |
|               |           |             | + lo                                                          |
|               +-----------+-------------+---------------------------------------------------------------+
|               | ib        | - rc_verbs  | - mlx5_0:1 (mlx5_0)  - mlx5_1:1 (mlx5_1)  - mlx5_2:1 (mlx5_2) |
|               |           | - ud_verbs  | - mlx5_0:1 (mlx5_0)  - mlx5_1:1 (mlx5_1)  - mlx5_2:1 (mlx5_2) |
|               |           | - dc_mlx5   | - mlx5_0:1 (mlx5_0)  - mlx5_1:1 (mlx5_1)  - mlx5_2:1 (mlx5_2) |
|               |           | - rc_mlx5   | - mlx5_0:1 (mlx5_0)  - mlx5_1:1 (mlx5_1)  - mlx5_2:1 (mlx5_2) |
|               |           | - ud_mlx5   | - mlx5_0:1 (mlx5_0)  - mlx5_1:1 (mlx5_1)  - mlx5_2:1 (mlx5_2) |
|               |           | - rc_gda    | - cuda0-mlx5_1:1 (GPU0)                                       |
|               +-----------+-------------+---------------------------------------------------------------+
|               | gga       | - gga_mlx5  | - mlx5_1:1 (mlx5_1)                                           |
+---------------+-----------+-------------+---------------------------------------------------------------+
| intra-node    | sysv      | - sysv      | - memory                                                      |
|               +-----------+-------------+---------------------------------------------------------------+
|               | posix     | - posix     | - memory                                                      |
|               +-----------+-------------+---------------------------------------------------------------+
|               | cuda_ipc  | - cuda_ipc  | - cuda                                                        |
|               +-----------+-------------+---------------------------------------------------------------+
|               | cma       | - cma       | - memory                                                      |
+---------------+-----------+-------------+---------------------------------------------------------------+
| accelerator   | cuda_cpy  | - cuda_copy | - cuda                                                        |
+---------------+-----------+-------------+---------------------------------------------------------------+
| loopback      | self      | - self      | - memory                                                      |
+---------------+-----------+-------------+---------------------------------------------------------------+
| Legend: + = enabled, - = disabled                                                                       |
| All of the available transports are listed, some may be disabled or unsupported on your system.         |
| All of the visible devices are listed per transport, some may be disabled.                              |
+---------------+-----------+-----------+-----------------------------------------------------------------+

+-------------------------------------------------------+
| Endpoint Config #0 (ctx: perftest, type: self)        |
+-----------+--------------------+---------+------------+
| Transport | Device (Sys. dev.) | # Lanes | Lane Types |
+-----------+--------------------+---------+------------+
| tcp       | ibs2 (mlx5_0)      |       1 | am, rma_bw |
|           | ens10f0 (ens10f0)  |       1 | rma_bw     |
|           | ibs7f0 (mlx5_1)    |       1 | rma_bw     |
+-----------+--------------------+---------+------------+
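
The endpoint tables above appear to group lanes by transport and device, reporting a lane count and the combined lane types for each group. A minimal Python sketch of that aggregation follows. It assumes each lane is available as a `(transport, device, lane_types)` tuple, which is a hypothetical input format and not UCX's internal representation; `summarize_lanes` is likewise an illustrative name.

```python
from collections import OrderedDict


def summarize_lanes(lanes):
    """Group lanes by (transport, device); count them and merge their types."""
    summary = OrderedDict()
    for transport, device, lane_types in lanes:
        entry = summary.setdefault((transport, device), {"count": 0, "types": []})
        entry["count"] += 1
        for lane_type in lane_types:
            if lane_type not in entry["types"]:  # keep first-seen order
                entry["types"].append(lane_type)
    return [
        (transport, device, entry["count"], ", ".join(entry["types"]))
        for (transport, device), entry in summary.items()
    ]


# Hypothetical lane list shaped after the tcp endpoint table above.
lanes = [
    ("tcp", "ibs2 (mlx5_0)", ["am", "rma_bw"]),
    ("tcp", "ens10f0 (ens10f0)", ["rma_bw"]),
    ("tcp", "ibs7f0 (mlx5_1)", ["rma_bw"]),
]
for row in summarize_lanes(lanes):
    print("{:<9} | {:<18} | {:>7} | {}".format(*row))
```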

Unsupported transports:

+-------------------------------------------------------------------------------------------------+
| Available Transports and Devices (ctx: ucp_context_1)                                           |
+---------------+-----------+-------------+-------------------------------------------------------+
| Type          | Component | Transport   | Device (System device)                                |
+---------------+-----------+-------------+-------------------------------------------------------+
| network       | tcp       | + tcp       | + eno1 (eno1)  + lo                                   |
+---------------+-----------+-------------+-------------------------------------------------------+
| intra-node    | sysv      | + sysv      | + memory                                              |
|               +-----------+-------------+-------------------------------------------------------+
|               | posix     | + posix     | + memory                                              |
|               +-----------+-------------+-------------------------------------------------------+
|               | cuda_ipc  | + cuda_ipc  | + cuda (GPU0)                                         |
|               +-----------+-------------+-------------------------------------------------------+
|               | cma       | + cma       | + memory                                              |
+---------------+-----------+-------------+-------------------------------------------------------+
| accelerator   | cuda_cpy  | + cuda_copy | + cuda (GPU0)                                         |
+---------------+-----------+-------------+-------------------------------------------------------+
| loopback      | self      | + self      | + memory                                              |
+---------------+-----------+-------------+-------------------------------------------------------+
| <unavailable> | ib        |             |                                                       |
+               +-----------+-------------+-------------------------------------------------------+
|               | gga       |             |                                                       |
+               +-----------+-------------+-------------------------------------------------------+
|               | rdmacm    |             |                                                       |
+---------------+-----------+-------------+-------------------------------------------------------+
| Legend: + = enabled, - = disabled                                                               |
| All of the available transports are listed, some may be disabled or unsupported on your system. |
| All of the visible devices are listed per transport, some may be disabled.                      |
+---------------+-----------+-----------+---------------------------------------------------------+
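
For scripted checks (for example in a container health check), the table text can be scanned for disabled transports and unavailable components. The Python sketch below assumes the four-column layout shown in these examples. Since this is a draft PR, the format may change, and `parse_availability` is a hypothetical helper, not part of UCX.

```python
import re


def parse_availability(table_text):
    """Return (transport -> enabled flag, list of unavailable components)."""
    transports = {}
    unavailable = []
    in_unavailable = False
    for line in table_text.splitlines():
        if not line.startswith("|"):
            continue  # skip full-width border lines
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 4 or cells[0] == "Type":
            continue  # skip title, legend, header and partial-border lines
        type_cell, component, transport, _devices = cells
        if type_cell:
            # A non-empty Type cell starts a new section of the table.
            in_unavailable = type_cell == "<unavailable>"
        if in_unavailable and component:
            unavailable.append(component)
        match = re.match(r"([+-]) (\S+)$", transport)
        if match:
            transports[match.group(2)] = match.group(1) == "+"
    return transports, unavailable


# Shortened, hand-written sample in the same layout as the tables above.
sample = """\
| Type          | Component | Transport   | Device (System device) |
| network       | tcp       | + tcp       | + eno1 (eno1)  + lo    |
| intra-node    | sysv      | - sysv      | - memory               |
| <unavailable> | ib        |             |                        |
|               | gga       |             |                        |
"""
transports, unavailable = parse_availability(sample)
print(transports)
print(unavailable)
```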

Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>