Skip to content

UCS/DEBUG: Add table builder API#11477

Open
guy-ealey-morag wants to merge 17 commits into
openucx:masterfrom
guy-ealey-morag:ucx-debug-table-api
Open

UCS/DEBUG: Add table builder API#11477
guy-ealey-morag wants to merge 17 commits into
openucx:masterfrom
guy-ealey-morag:ucx-debug-table-api

Conversation

@guy-ealey-morag
Copy link
Copy Markdown
Contributor

@guy-ealey-morag guy-ealey-morag commented May 21, 2026

What?

Unify table printing logic under a common API, refactored existing table prints in sys_info.c and proto_debug.c.

Why?

Every feature that prints a table implements its own table formatting logic which is fragile and hard to extend.
A unified API would make printing new tables easier and would unify behavior across UCX.

This refactor was done to allow me print new tables as part of a new feature I'm working on.

Examples:

UCX_PROTO_INFO=used

+--------+-----------+-----------------------------------------------+-------------+
| perftest self cfg#0                                                              |
| active message by ucp_am_send*(multi) from host memory                           |
+--------+-----------+-----------------------------------------------+-------------+
| Count  |   Range   |                  Description                  |   Config    |
+--------+-----------+-----------------------------------------------+-------------+
|      0 |   0..2434 | short                                         | self/memory |
| 516383 | 2435..inf | (?) rendezvous copy from mapped remote memory | self/memory |
+--------+-----------+-----------------------------------------------+-------------+

UCX_PROTO_INFO=y

+--------------+-----------------------------------------------+-------------------+
| perftest self cfg#0                                                              |
| active message by ucp_am_send* with reply flag(fast-completion) from host memory |
+--------------+-----------------------------------------------+-------------------+
|    Range     |                  Description                  |      Config       |
+--------------+-----------------------------------------------+-------------------+
|      0..8176 | short                                         | self/memory       |
| 8177..262143 | multi-frag copy-in                            | self/memory       |
|    256K..inf | (?) rendezvous copy from mapped remote memory | self/memory       |
+--------------+-----------------------------------------------+-------------------+

ucx_info -T

#
# System topology
#
# +---------+---------+---------+---------+---------+---------+
# |         |         |         |         |         |         |
# |    MB/s | ens10f0 |  mlx5_0 |  mlx5_1 |    GPU0 |  mlx5_2 |
# |         |         |         |         |         |         |
# +---------+---------+---------+---------+---------+---------+
# |         |         |         |         |         |         |
# | ens10f0 |    -    | 17000.0 |  5100.0 |  5100.0 |  5100.0 |
# |         |         |         |         |         |         |
# +---------+---------+---------+---------+---------+---------+
# |         |         |         |         |         |         |
# |  mlx5_0 | 17000.0 |    -    |  5100.0 |  5100.0 |  5100.0 |
# |         |         |         |         |         |         |
# +---------+---------+---------+---------+---------+---------+
# |         |         |         |         |         |         |
# |  mlx5_1 |  5100.0 |  5100.0 |    -    | 17000.0 | 17000.0 |
# |         |         |         |         |         |         |
# +---------+---------+---------+---------+---------+---------+
# |         |         |         |         |         |         |
# |    GPU0 |  5100.0 |  5100.0 | 17000.0 |    -    | 17000.0 |
# |         |         |         |         |         |         |
# +---------+---------+---------+---------+---------+---------+
# |         |         |         |         |         |         |
# |  mlx5_2 |  5100.0 |  5100.0 | 17000.0 | 17000.0 |    -    |
# |         |         |         |         |         |         |
# +---------+---------+---------+---------+---------+---------+
#
# NUMA memory latency
#
# +---------+---------+---------+---------+---------+---------+
# |         |         |         |         |         |         |
# |  device | ens10f0 |  mlx5_0 |  mlx5_1 |    GPU0 |  mlx5_2 |
# |         |         |         |         |         |         |
# +---------+---------+---------+---------+---------+---------+
# |         |         |         |         |         |         |
# |    nsec |   210.0 |   210.0 |   210.0 |   210.0 |   210.0 |
# |         |         |         |         |         |         |
# +---------+---------+---------+---------+---------+---------+

Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Comment thread src/tools/info/sys_info.c
Comment thread src/tools/info/sys_info.c Outdated
Comment thread src/ucs/debug/table.c Outdated
Comment thread src/ucp/proto/proto_debug.c
Comment thread src/ucs/debug/table.h
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
tomerg-nvidia
tomerg-nvidia previously approved these changes May 26, 2026
@guy-ealey-morag
Copy link
Copy Markdown
Contributor Author

@svc-nvidia-pr-review

@svc-nvidia-pr-review
Copy link
Copy Markdown

🤖 Starting review — findings will be posted here when done.

Copy link
Copy Markdown

@svc-nvidia-pr-review svc-nvidia-pr-review left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

9 findings⚠️ 7 minors, ℹ️ 2 infos

9 findings posted as inline comments.

Comment thread src/ucp/proto/proto_debug.c
Comment thread src/ucp/proto/proto_debug.c
Comment thread src/ucp/proto/proto_common.c
Comment thread src/ucs/debug/table.c Outdated
Comment thread src/ucs/debug/table.c Outdated
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
@openucx openucx deleted a comment from svc-nvidia-pr-review May 29, 2026
@openucx openucx deleted a comment from svc-nvidia-pr-review May 29, 2026
tomerg-nvidia
tomerg-nvidia previously approved these changes May 29, 2026
Comment thread src/ucs/debug/table.h Outdated
Comment thread src/ucs/debug/table.h Outdated
Comment thread src/ucs/debug/table.h Outdated
Comment thread src/ucs/debug/table.h
Comment thread src/ucs/debug/table.h Outdated
Comment thread src/ucs/debug/table.c Outdated
Comment thread src/ucs/debug/table.c
Comment thread src/ucs/debug/table.c
Comment thread src/ucs/debug/table.c Outdated
Comment thread src/ucs/debug/table.c Outdated
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Comment thread src/ucs/debug/table.c
Comment thread src/ucs/debug/table.c
Comment thread src/ucs/debug/table.c Outdated
Comment thread src/ucs/debug/table.c Outdated
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
@guy-ealey-morag guy-ealey-morag force-pushed the ucx-debug-table-api branch 2 times, most recently from 0686c21 to f6c54fb Compare June 3, 2026 10:05
Comment thread src/ucs/debug/table.h Outdated
*/
void ucs_table_init(ucs_table_t *table, const ucs_table_config_t *config);
ucs_status_t
ucs_table_init(ucs_table_t *table, const ucs_table_config_t *config);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps a more compact API for tables still would be better? since tables are printed based on a special request, not as part of the regular workflow.

Copy link
Copy Markdown
Contributor Author

@guy-ealey-morag guy-ealey-morag Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is my preference, the error handling was added after the review of @iyastreb
I tend to think that failing with ucs_fatal or ucs_assertv_always is acceptable because it can be only caused by memory issues or developer error (bad config, bad args).

@iyastreb Can we open it for discussion again?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I just provided my feedback, if others think that compact API is better - we can do it. Personally I think that fatal in logging utility is not the best.

Another approach would be to add "ucs_status last_error" field to the table, so that:

  • each function checks this field in the beginning, if error is set - just return
  • first time we set an error - log ERROR

This way we keep API simple and avoid fatal

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iyastreb I implemented your suggestion, please review

Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Comment thread src/ucp/proto/proto_debug.h Outdated
Comment thread src/ucs/debug/table.c Outdated
Comment thread src/ucs/debug/table.c Outdated
Comment thread src/ucs/debug/table.c Outdated
Comment thread src/ucs/debug/table.c
Comment thread src/ucs/debug/table.c
@guy-ealey-morag guy-ealey-morag requested a review from iyastreb June 5, 2026 10:12
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
iyastreb
iyastreb previously approved these changes Jun 5, 2026
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants