UCP/CORE: Fix num_tls overflow detection #11530
num_tls overflow detection
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
"(%s%u requested, up to %d are supported)",
(context->num_tls == UINT8_MAX) ? ">=" : "", context->num_tls,
UCP_MAX_RESOURCES);
status = UCS_ERR_EXCEEDS_LIMIT;
Will this change cause ucp_context initialization to fail on a machine with 256 devices?
Yes, but as I explained on the issue thread, the infinite loop that you found prevented the error message from appearing. Initialization was always meant to fail in this case.
However, this design seems a bit unreasonable to me. I'm only using one of the devices, yet I can't use it just because there are too many devices on the machine. Is there any workaround for this? After all, I can't ask the customers to remove their net device.
@iyastreb Actually, I have restricted the devices and transport protocols using UCX_NET_DEVICES=mlx5_0:1 and UCX_TLS=rc,tcp, but the device scanning process doesn't seem to be limited by UCX_NET_DEVICES.
@ivanallen It seems that the restrictions from UCX_NET_DEVICES and UCX_TLS would come into effect when the infinite loop bug is resolved.
@guy-ealey-morag But it returns error here because there are too many devices, so it won't even get the chance to take effect, right?
@ivanallen The function ucp_add_tl_resource_if_enabled would add a device only if it's included in the given filters (like those mentioned above), so in your case it would fail only if you don't use any filters.
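To illustrate the point about filtering, here is a minimal sketch of "filter first, then check the limit". It is not the UCX implementation: `sketch_is_enabled`, `sketch_count_enabled`, and the cap `FILTER_SKETCH_MAX_RESOURCES` are hypothetical names chosen for this example, and the allow-list simply stands in for whatever UCX_NET_DEVICES and UCX_TLS select.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical small cap for the demo; not the real UCP_MAX_RESOURCES value. */
#define FILTER_SKETCH_MAX_RESOURCES 4

/* Returns 1 if dev_name passes the allow-list, 0 otherwise.
 * An empty allow-list (num_allowed == 0) means "no filter": everything passes. */
static int sketch_is_enabled(const char *dev_name,
                             const char *const *allowed, size_t num_allowed)
{
    size_t i;

    if (num_allowed == 0) {
        return 1;
    }
    for (i = 0; i < num_allowed; ++i) {
        if (strcmp(dev_name, allowed[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Counts only the devices that pass the filter.
 * Returns the count, or -1 if the enabled devices exceed the cap. */
static int sketch_count_enabled(const char *const *devices, size_t num_devices,
                                const char *const *allowed, size_t num_allowed)
{
    int    count = 0;
    size_t i;

    assert(devices != NULL);
    for (i = 0; i < num_devices; ++i) {
        if (!sketch_is_enabled(devices[i], allowed, num_allowed)) {
            continue; /* filtered-out devices never count against the cap */
        }
        if (count >= FILTER_SKETCH_MAX_RESOURCES) {
            return -1; /* too many enabled resources */
        }
        ++count;
    }
    return count;
}
```

Under these assumptions, a machine with many devices only hits the limit when no filter is given; with a one-entry allow-list the count stays at one regardless of how many devices are scanned.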
iyastreb
left a comment
LGTM: this PR does not change existing behavior wrt device discovery, just fixes the overflow bug, that's ok
Maybe we can just emit the warning and continue operation, can be another PR
On the one hand, yes it seems to be a minimum fix for the issue. But on the other hand, I think it's better to fix it completely, as we've already touched the code, and found several controversial points related to the changed code.
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
What?
Fix num_tls overflow detection so that an error is reported correctly even when the total number is equal to or greater than 256.
Why?
According to #11505, when the number is 256 or greater, an infinite loop occurs before the overflow detection logic is reached.
How?
Use unsigned for indexing num_tls, so that it can increase up to UCP_MAX_RESOURCES and show an error.