CORE: build UCX with nixl SONAME suffix, load with RTLD_DEEPBIND to avoid collisions#1673
CORE: build UCX with nixl SONAME suffix, load with RTLD_DEEPBIND to avoid collisions#1673roiedanino wants to merge 4 commits into
Conversation
…void collisions Signed-off-by: Roie Danino <rdanino@nvidia.com>
|
👋 Hi roiedanino! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
Signed-off-by: Roie Danino <rdanino@nvidia.com>
|
/build |
Signed-off-by: Roie Danino <rdanino@nvidia.com>
|
👀 Investigating |
|
🤖 CI Triage Agent — All the evidence needed is in the logs. Here is the full diagnosis: Summary: Root cause: The
Black modified the file in-place and the hook exited 1 because the working-tree file changed during the check. The exact diff from the log: - module_suffix_pattern = re.compile(
- rf"-{escaped_suffix}(-[0-9a-f]{{8}})?\.so"
- )
+ module_suffix_pattern = re.compile(rf"-{escaped_suffix}(-[0-9a-f]{{8}})?\.so")
modules_without_suffix = [
- name for name in ucx_modules
+ name
+ for name in ucx_modules
if not module_suffix_pattern.search(name.rsplit("/", 1)[-1])
]All other hooks (mypy, isort, flake8, codespell, whitespace checks) passed. Implicated commit: File: Suggested fix: In the Related: none
|
Signed-off-by: Roie Danino <rdanino@nvidia.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
# Read the relevant section of the file to confirm current content
sed -n '40,65p' contrib/check_ucx_wheel_bundle.pyLength of output: 1112
It looks like the formatting was applied after the CI run on commit |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (10)
📝 WalkthroughWalkthroughThis pull request adds optional UCX source parameterization, SONAME suffix support, and RTLD_DEEPBIND dynamic binding throughout the build pipeline, validation infrastructure, and runtime plugin loading. Docker builds now accept custom UCX repositories with conditional configure flag detection. New validation tools verify symbol bindings and wheel bundle contents. Runtime plugin manager and extensions conditionally enable deep binding for more controlled symbol resolution. ChangesBuild and Validation Infrastructure
Runtime Plugin Loading with RTLD_DEEPBIND Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Infer (1.2.0)src/plugins/ucx/ucx_plugin.cppsrc/plugins/ucx/ucx_plugin.cpp:18:10: fatal error: 'backend/backend_plugin.h' file not found ... [truncated 2200 characters] ... ontend__CTrans.CTrans_funct.instruction_log.(fun) in file "src/clang/cTrans.ml", line 4784, characters 10-1023 src/core/nixl_plugin_manager.cppIn file included from src/core/nixl_plugin_manager.cpp:18: ... [truncated 2200 characters] ... Called from ClangFrontend__CFrontend_decl.CFrontend_decl_funct.add_method in file "src/clang/cFrontend_decl.ml" (inlined), line 54, characters 4-52 Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
What?
Add private-UCX loading support for the NIXL wheel by combining:
RTLD_DEEPBINDwhen loading the UCX backend plugin.RTLD_DEEPBINDduringnixl_ep_cppPython extension import.Why?
The NIXL wheel bundles Python code, NIXL core libraries, the NIXL UCX plugin, UCX core libraries, and UCX modules. In environments such as HPC-X/OpenMPI, another UCX version can already be loaded globally before NIXL is imported or before the NIXL UCX plugin is loaded.
Even if the NIXL wheel contains its own UCX libraries, normal ELF symbol resolution can bind NIXL’s UCX references to the globally loaded UCX instead. That can make NIXL silently use the wrong UCX version and can produce hard-to-debug runtime failures.
This change makes the intended private UCX path explicit and testable.
How?
Load the UCX backend plugin with
RTLD_DEEPBINDby default.NIXL_UCX_DEEPBIND.Import
nixl_ep_cppwithRTLD_DEEPBINDwhen supported by the platform.NIXL_UCX_DEEPBINDopt-out behavior.Add UCX build wiring to the container flow.
--ucx-soname-suffix <suffix>passes UCX--with-soname-suffix=<suffix>.--private-ucxis a shortcut for the NIXL private suffix.--ucx-repoallows building against a UCX branch/fork that contains the private SONAME support.Improve
wheel_add_ucx_plugins.pyso auditwheel-renamed libraries such aslibucp-nixl-<hash>.so...are mapped correctly.Add UCX backend diagnostics.
NIXL_UCX_EXPECTED_SONAMEcan be set to fail fast if the backend binds to an unexpected UCX library.Add validation helpers.
contrib/check_ucx_binding.pychecksDT_NEEDEDentries and glibc symbol binding behavior with/withoutRTLD_DEEPBIND.contrib/check_ucx_wheel_bundle.pychecks that a repaired wheel contains private UCX libraries, UCX modules, and the NIXL UCX plugin.This PR should only be merged (and perhaps reviewed) after CONFIGURE: added a configure option to add SONAME suffix openucx/ucx#11483 was merged
Summary by CodeRabbit