Skip to content

UCP/CORE: Fix num_tls overflow detection#11530

Merged
brminich merged 4 commits into
openucx:masterfrom
guy-ealey-morag:fix-num-tls-overflow-detection
Jun 10, 2026
Merged

UCP/CORE: Fix num_tls overflow detection#11530
brminich merged 4 commits into
openucx:masterfrom
guy-ealey-morag:fix-num-tls-overflow-detection

Conversation

@guy-ealey-morag

@guy-ealey-morag guy-ealey-morag commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

What?

Fix num_tls overflow detection to correctly show an error even if the total number is equal or bigger than 256.

Why?

According to #11505 when the number is 256 or bigger, there is an infinite loop that happens before the overflow detection logic.

How?

  • Fix the loops to use unsigned for indexing
  • Allow num_tls to increase up to UCP_MAX_RESOURCES and show an error

@guy-ealey-morag guy-ealey-morag assigned iyastreb and unassigned iyastreb Jun 5, 2026
@guy-ealey-morag guy-ealey-morag requested a review from iyastreb June 5, 2026 11:32
@guy-ealey-morag guy-ealey-morag changed the title UCP/CORE: Fix num_tls overflow detection UCP/CORE: Fix num_tls overflow detection Jun 5, 2026
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
@guy-ealey-morag guy-ealey-morag force-pushed the fix-num-tls-overflow-detection branch from 7f40d5d to dbdec43 Compare June 5, 2026 11:38
Comment thread src/ucp/core/ucp_context.c Outdated
Comment thread src/ucp/core/ucp_context.c Outdated
Comment thread src/ucp/core/ucp_context.c Outdated
Comment thread src/ucp/core/ucp_context.c Outdated
"(%s%u requested, up to %d are supported)",
(context->num_tls == UINT8_MAX) ? ">=" : "", context->num_tls,
UCP_MAX_RESOURCES);
status = UCS_ERR_EXCEEDS_LIMIT;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this change cause ucp_context initialization to fail on a machine with 256 devices?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but as I explained on the issue thread, the infinite loop that you found prevented the error message from appearing, it was always meant to fail.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, this design seems a bit unreasonable to me. I'm only using one of the devices, yet I can't use it just because there are too many devices on the machine. Is there any workaround for this? After all, I can't ask the customers to remove their net device.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you can limit amount of devices with UCX_NET_DEVICES, or transports with UCX_TLS.
But I agree, that we should probably just print a warning and continue work.
Maybe @brminich @yosefe have opinion on that?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iyastreb Actually, I have restricted the devices and transport protocols using UCX_NET_DEVICES=mlx5_0:1 and UCX_TLS=rc,tcp, but the device scanning process doesn't seem to be limited by UCX_NET_DEVICES.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivanallen It seems that the restrictions from UCX_NET_DEVICES and UCX_TLS would come into effect when the infinite loop bug is resolved.

@ivanallen ivanallen Jun 9, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guy-ealey-morag But it returns error here because there are too many devices, so it won't even get the chance to take effect, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivanallen The function ucp_add_tl_resource_if_enabled would add a device only if it's included in the given filters (like those mentioned above), so in your case it would fail only if you don't use any filters.

iyastreb
iyastreb previously approved these changes Jun 10, 2026

@iyastreb iyastreb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM: this PR does not change existing behavior wrt device discovery, just fixes the overflow bug, that's ok
Maybe we can just emit the warning and continue operation, can be another PR

@rakhmets

rakhmets commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

LGTM: this PR does not change existing behavior wrt device discovery, just fixes the overflow bug, that's ok Maybe we can just emit the warning and continue operation, can be another PR

On the one hand, yes it seems to be a minimum fix for the issue. But on the other hand, I think it's better to fix it completely, as we've already touched the code, and found several controversial points related to the changed code.

Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
@guy-ealey-morag

Copy link
Copy Markdown
Contributor Author

@iyastreb @rakhmets I modified the logic to stop parsing at UCP_MAX_RESOURCES.
I kept it as an error for now because I don't want to change the behavior yet, we can consider it for later.

rakhmets
rakhmets previously approved these changes Jun 10, 2026
Comment thread src/ucp/core/ucp_context.c Outdated
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
rakhmets
rakhmets previously approved these changes Jun 10, 2026
Comment thread src/ucp/core/ucp_context.c Outdated
Comment thread src/ucp/core/ucp_context.c Outdated
Comment thread src/ucp/core/ucp_context.c Outdated
Comment thread src/ucp/core/ucp_context.c Outdated
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
@brminich brminich enabled auto-merge (squash) June 10, 2026 15:23
@brminich brminich merged commit 7735f82 into openucx:master Jun 10, 2026
159 of 160 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants