
CORE: build UCX with nixl SONAME suffix, load with RTLD_DEEPBIND to avoid collisions#1673

Draft
roiedanino wants to merge 4 commits into
ai-dynamo:mainfrom
roiedanino:deepbind


@roiedanino
Contributor

@roiedanino roiedanino commented May 24, 2026

What?

Add private-UCX loading support for the NIXL wheel by combining:

  • RTLD_DEEPBIND when loading the UCX backend plugin.
  • RTLD_DEEPBIND during nixl_ep_cpp Python extension import.
  • Container build support for UCX builds with a private SONAME suffix.
  • Wheel packaging fixes for auditwheel-renamed UCX libraries with suffixed names.
  • Runtime diagnostics and optional fail-fast validation for the UCX library actually bound by the UCX backend.
  • Helper scripts to validate wheel contents and loader symbol binding.

Why?

The NIXL wheel bundles Python code, NIXL core libraries, the NIXL UCX plugin, UCX core libraries, and UCX modules. In environments such as HPC-X/OpenMPI, another UCX version can already be loaded globally before NIXL is imported or before the NIXL UCX plugin is loaded.

Even if the NIXL wheel contains its own UCX libraries, normal ELF symbol resolution can bind NIXL’s UCX references to the globally loaded UCX instead. That can make NIXL silently use the wrong UCX version and can produce hard-to-debug runtime failures.

This change makes the intended private UCX path explicit and testable.
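To make the collision scenario concrete, here is a small diagnostic sketch (not part of this PR) that lists UCX libraries already mapped into the current process by scanning /proc/self/maps. The function names and the file-name pattern are illustrative assumptions, and the approach is Linux-only.

```python
import re

# Assumption: UCX core and module libraries are named libucp*, libucs*,
# libuct* or libucm*, e.g. libucp.so.0 or libucs-nixl-1a2b3c4d.so.0.
_UCX_LIB = re.compile(r"lib(ucp|ucs|uct|ucm)[^/]*\.so[^/]*$")


def ucx_paths_from_maps(maps_text):
    """Return the sorted, de-duplicated UCX library paths named in maps_text."""
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(maxsplit=5)
        # A file-backed mapping has six fields; the last one is the path.
        if len(fields) == 6 and _UCX_LIB.search(fields[5]):
            paths.add(fields[5])
    return sorted(paths)


def loaded_ucx_libraries():
    """Read this process's mappings; return an empty list if unavailable."""
    try:
        with open("/proc/self/maps") as maps:
            return ucx_paths_from_maps(maps.read())
    except OSError:
        return []
```

Calling loaded_ucx_libraries() before importing NIXL would show whether another UCX copy (for example one pulled in by an MPI runtime) is already present in the process.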

How?

  • Load the UCX backend plugin with RTLD_DEEPBIND by default.

    • Controlled by NIXL_UCX_DEEPBIND.
    • Non-UCX backend and telemetry plugins keep the existing loading behavior.
  • Import nixl_ep_cpp with RTLD_DEEPBIND when supported by the platform.

    • Uses the same NIXL_UCX_DEEPBIND opt-out behavior.
  • Add UCX build wiring to the container flow.

    • --ucx-soname-suffix <suffix> passes UCX --with-soname-suffix=<suffix>.
    • --private-ucx is a shortcut for the NIXL private suffix.
    • --ucx-repo allows building against a UCX branch/fork that contains the private SONAME support.
  • Improve wheel_add_ucx_plugins.py so auditwheel-renamed libraries such as libucp-nixl-<hash>.so... are mapped correctly.

  • Add UCX backend diagnostics.

    • Logs the UCX version and library path selected at plugin initialization.
    • NIXL_UCX_EXPECTED_SONAME can be set to fail fast if the backend binds to an unexpected UCX library.
  • Add validation helpers.

    • contrib/check_ucx_binding.py checks DT_NEEDED entries and glibc symbol binding behavior with/without RTLD_DEEPBIND.
    • contrib/check_ucx_wheel_bundle.py checks that a repaired wheel contains private UCX libraries, UCX modules, and the NIXL UCX plugin.

    This PR should be merged (and perhaps reviewed) only after openucx/ucx#11483 ("CONFIGURE: added a configure option to add SONAME suffix") has been merged.
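The default-on, opt-out behavior of NIXL_UCX_DEEPBIND described above could look roughly like the following sketch. The accepted "off" spellings and the plugin name "UCX" are assumptions for illustration; the PR description does not list them, and the real decision is made in the C++ plugin manager.

```python
import os

# Assumption: these spellings disable deep binding. The PR text only says
# the variable is an opt-out, not which values it accepts.
_FALSE_VALUES = {"0", "false", "no", "off"}


def deepbind_enabled(environ=None):
    """Return True unless NIXL_UCX_DEEPBIND is set to an 'off' value."""
    env = os.environ if environ is None else environ
    value = env.get("NIXL_UCX_DEEPBIND")
    if value is None:
        return True  # enabled by default, as the description states
    return value.strip().lower() not in _FALSE_VALUES


def should_deepbind(plugin_name, environ=None):
    """Deep-bind only the UCX backend; other plugins keep existing behavior."""
    # Assumption: the UCX backend plugin is identified by the name "UCX".
    return plugin_name == "UCX" and deepbind_enabled(environ)
```

Passing the environment as a parameter keeps the check easy to unit test without mutating os.environ.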

Summary by CodeRabbit

  • New Features
    • Build system now supports custom UCX source repositories and configurable SONAME suffixes.
    • Added validation tools for UCX library bindings, symbol resolution, and wheel bundle verification.
    • Plugin loading mechanism enhanced with optional deep binding support.

…void collisions

Signed-off-by: Roie Danino <rdanino@nvidia.com>
@github-actions

👋 Hi roiedanino! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

Signed-off-by: Roie Danino <rdanino@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@roiedanino
Contributor Author

/build

Signed-off-by: Roie Danino <rdanino@nvidia.com>
@pull-request-size pull-request-size Bot added size/XL and removed size/L labels Jun 2, 2026
@dpressle
Contributor

dpressle commented Jun 2, 2026

👀 Investigating Run Pre-Commit Hooks

@dpressle
Contributor

dpressle commented Jun 2, 2026

🤖 CI Triage Agent · Run Pre-Commit Hooks · commit 1ce87b60

All the evidence needed is in the logs. Here is the full diagnosis:


Summary: black reformatted contrib/check_ucx_wheel_bundle.py — the file was committed without running the code formatter first.

Root cause: The black pre-commit hook found that contrib/check_ucx_wheel_bundle.py does not conform to Black's style rules. Specifically, two constructs around lines 46–55 need reformatting:

  1. A three-line re.compile(…) assignment that Black collapses to a single line (it fits within Black's line-length limit).
  2. A list comprehension in which Black wants the "name for name in ucx_modules" clause split across two lines rather than kept on one.

Black modified the file in-place and the hook exited 1 because the working-tree file changed during the check. The exact diff from the log:

-    module_suffix_pattern = re.compile(
-        rf"-{escaped_suffix}(-[0-9a-f]{{8}})?\.so"
-    )
+    module_suffix_pattern = re.compile(rf"-{escaped_suffix}(-[0-9a-f]{{8}})?\.so")
     modules_without_suffix = [
-        name for name in ucx_modules
+        name
+        for name in ucx_modules
         if not module_suffix_pattern.search(name.rsplit("/", 1)[-1])
     ]

All other hooks (mypy, isort, flake8, codespell, whitespace checks) passed.
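For readers unfamiliar with what the reformatted check does, here is a self-contained sketch built around the two statements from the diff. The wrapper function, its parameters, and the sample file names in the usage note are hypothetical; only the regex and the comprehension come from the log.

```python
import re


def modules_without_suffix(ucx_modules, suffix):
    """Return module paths whose file name lacks the private SONAME suffix."""
    escaped_suffix = re.escape(suffix)
    # "-<suffix>" optionally followed by "-<8 hex digits>", then ".so".
    # The optional group matches the hash that a wheel repair step may
    # insert into the file name.
    module_suffix_pattern = re.compile(rf"-{escaped_suffix}(-[0-9a-f]{{8}})?\.so")
    return [
        name
        for name in ucx_modules
        # Only the file name (after the last "/") is checked.
        if not module_suffix_pattern.search(name.rsplit("/", 1)[-1])
    ]
```

With a suffix of "nixl", names such as libuct_ib-nixl.so.0 and libuct_cma-nixl-0a1b2c3d.so.0 pass the check, while libuct_rdmacm.so.0 would be reported.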

Implicated commit: [REDACTED:Hex High Entropy String] (branch deepbind, PR #1673)

File: contrib/check_ucx_wheel_bundle.py — lines 46–55 (the module_suffix_pattern assignment and the modules_without_suffix list comprehension)

Suggested fix: In the deepbind branch, run black contrib/check_ucx_wheel_bundle.py (or simply pre-commit run --files contrib/check_ucx_wheel_bundle.py) locally, stage the resulting change, and amend / push. The exact diff to apply is shown above — it's a pure formatting change with no semantic effect. To prevent recurrence, ensure pre-commit install is run once in the local clone so the hooks fire automatically on every commit.

Related: none

🛡️ This comment had 1 potential secret(s) redacted (Hex High Entropy String). See request_id 188cb7db-b1e7-4385-a223-740d06d94013 in the triage console for the audit trail.

Signed-off-by: Roie Danino <rdanino@nvidia.com>
@roiedanino
Contributor Author

roiedanino commented Jun 2, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the relevant section of the file to confirm current content
sed -n '40,65p' contrib/check_ucx_wheel_bundle.py

Length of output: 1112



@roiedanino The file already looks good! The current state of contrib/check_ucx_wheel_bundle.py on the branch already has both Black formatting fixes applied:

  1. re.compile(...) is already collapsed to a single line (Line 49)
  2. ✅ The list comprehension already has name and for name in ucx_modules on separate lines (Lines 50–53)

It looks like the formatting was applied after the CI run on commit 1ce87b60. The next CI run should pass the Run Pre-Commit Hooks check. You can trigger a new build to confirm. 🐇

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 96ad2e2e-3008-4adc-a189-191d66a1986e

📥 Commits

Reviewing files that changed from the base of the PR and between b293d9b and 2e92556.

📒 Files selected for processing (10)
  • contrib/Dockerfile
  • contrib/Dockerfile.manylinux
  • contrib/build-container.sh
  • contrib/check_ucx_binding.py
  • contrib/check_ucx_wheel_bundle.py
  • contrib/wheel_add_ucx_plugins.py
  • examples/device/ep/nixl_ep/__init__.py
  • src/core/nixl_plugin_manager.cpp
  • src/core/plugin_manager.h
  • src/plugins/ucx/ucx_plugin.cpp

📝 Walkthrough

This pull request adds optional UCX source parameterization, SONAME suffix support, and RTLD_DEEPBIND dynamic binding throughout the build pipeline, validation infrastructure, and runtime plugin loading. Docker builds now accept custom UCX repositories with conditional configure flag detection. New validation tools verify symbol bindings and wheel bundle contents. Runtime plugin manager and extensions conditionally enable deep binding for more controlled symbol resolution.
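The conditional configure flag detection mentioned in the walkthrough can be illustrated with a short sketch. In the PR this logic lives in the Dockerfiles as shell; the Python function below, its name, and its module_deepbind parameter are hypothetical and only mirror the described behavior: inspect the ./configure --help text and fail if a requested feature is missing.

```python
def ucx_configure_args(configure_help, soname_suffix="", module_deepbind=False):
    """Build extra UCX configure arguments, failing on unsupported requests.

    configure_help is the text printed by "./configure --help".
    """
    args = []
    if soname_suffix:
        if "--with-soname-suffix" not in configure_help:
            raise RuntimeError(
                "UCX source tree does not support --with-soname-suffix"
            )
        args.append(f"--with-soname-suffix={soname_suffix}")
    # Assumption: module deep binding is requested separately from the
    # suffix; the summary does not say how the two options are coupled.
    if module_deepbind:
        if "--enable-module-deepbind" not in configure_help:
            raise RuntimeError(
                "UCX source tree does not support --enable-module-deepbind"
            )
        args.append("--enable-module-deepbind")
    return args
```

Failing early here is what makes --ucx-repo useful: a build pointed at a UCX branch without the private SONAME support stops at configure time instead of silently producing unsuffixed libraries.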

Changes

Build and Validation Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| Docker container build parameterization (contrib/Dockerfile, contrib/Dockerfile.manylinux) | Docker builds now accept UCX_REPO and UCX_SONAME_SUFFIX build arguments. Both Dockerfiles detect support for --with-soname-suffix and --enable-module-deepbind configure flags via ./configure --help inspection and fail the build with an error if requested features are unavailable. Computed parameters are appended to the UCX configure-release-mt invocation. |
| Build container script CLI and argument passing (contrib/build-container.sh) | Adds --ucx-repo, --ucx-ref, --ucx-soname-suffix, and --private-ucx command-line options with default values. Build-option reporting displays selected UCX configuration and SONAME/deepbind status. Docker build invocation now passes UCX_REPO, UCX_REF, and UCX_SONAME_SUFFIX as build arguments. |
| Wheel bundle and plugin validation tools (contrib/check_ucx_wheel_bundle.py, contrib/check_ucx_binding.py) | New check_ucx_wheel_bundle.py validates repaired wheel contents: ensures private UCX libraries include expected SONAME suffix, verifies bundled UCX modules match the pattern, and confirms NIXL UCX plugin presence. New check_ucx_binding.py validates symbol bindings by running a child dlopen probe with configurable RTLD flags, optional preload and plugin initialization, and parses LD_DEBUG=bindings output to verify expected symbol→target mappings and soname behavior. |
| Wheel plugin addition with symlink and suffix handling (contrib/wheel_add_ucx_plugins.py) | Extends copytree and add_plugins with skip_symlinks and symlink_keep_suffix parameters to control symlinked plugin alias handling. Library name mapping is refactored to extract base names from already-repaired filenames using regex. CLI adds --skip-nixl-plugins, --skip-plugin-symlinks, and --plugin-soname-suffix flags; NIXL plugin addition is conditionally skipped based on --skip-nixl-plugins. |
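As a rough illustration of the LD_DEBUG=bindings parsing that check_ucx_binding.py is described as doing, the sketch below extracts (source, target, symbol) triples and reports symbols bound to an unexpected library. This is not the script's actual code: the function names are hypothetical, and the regex assumes the usual glibc line shape "binding file A [0] to B [0]: normal symbol `name'".

```python
import re

# Assumed glibc LD_DEBUG=bindings line shape:
#   <pid>: binding file <src> [0] to <dst> [0]: normal symbol `<sym>'
_BINDING = re.compile(
    r"binding file (?P<src>\S+) \[\d+\] to (?P<dst>\S+) \[\d+\]: "
    r"\w+ symbol `(?P<sym>[^']+)'"
)


def parse_bindings(debug_text):
    """Return a list of (source file, target file, symbol) tuples."""
    return [(m["src"], m["dst"], m["sym"]) for m in _BINDING.finditer(debug_text)]


def misbound_symbols(debug_text, symbol_prefix, expected_target):
    """Return sorted symbols with the prefix that bound outside expected_target."""
    return sorted(
        {
            sym
            for _src, dst, sym in parse_bindings(debug_text)
            if sym.startswith(symbol_prefix) and expected_target not in dst
        }
    )
```

A caller would typically capture stderr from a child process started with LD_DEBUG=bindings in its environment and pass that text to misbound_symbols, expecting an empty list when deep binding works.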

Runtime Plugin Loading with RTLD_DEEPBIND Support

| Layer / File(s) | Summary |
| --- | --- |
| Plugin manager deepbind infrastructure (src/core/plugin_manager.h, src/core/nixl_plugin_manager.cpp) | loadPluginFromPath declaration adds optional deepbind boolean parameter. New shouldDeepBindPlugin() helper reads UCX backend configuration (defaulting enabled) to decide deepbind enablement. loadPluginFromPath conditionally includes RTLD_DEEPBIND in dlopen flags (warns if unsupported). Backend plugin loading calls now compute and pass deepbind decisions for both explicit-path and directory-based plugin discovery. |
| Python extension import with conditional deepbind (examples/device/ep/nixl_ep/__init__.py) | Adds environment-controlled deepbind import support via NIXL_UCX_DEEPBIND flag (defaulting enabled). Checks availability of sys.RTLD_DEEPBIND and sys.getdlopenflags, then wraps native module import in a context manager that temporarily adjusts dlopen flags. Defers import and re-export of Buffer and EventOverlap until after native module loads. |
| UCX plugin soname runtime validation (src/plugins/ucx/ucx_plugin.cpp) | Adds runtime validation of bound UCX library via dladdr on ucp_get_version_string. Reads optional NIXL_UCX_EXPECTED_SONAME configuration and validates resolved symbol path; returns nullptr on mismatch. Plugin creation logic is refactored into internal createUcxPlugin() helper used by both static and dynamic plugin initialization entry points. |
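The "context manager that temporarily adjusts dlopen flags" pattern summarized above can be sketched as follows. This is an illustration, not the code in examples/device/ep/nixl_ep/__init__.py; the helper name is hypothetical. In CPython the RTLD_* constants are exposed by the os module, while getdlopenflags and setdlopenflags live in sys, so the sketch reads the constant from os and degrades to a no-op where it is missing.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def deepbind_import(enabled=True):
    """Temporarily add RTLD_DEEPBIND to the flags used for extension imports.

    Yields True if the flags were changed, False if this is a no-op.
    """
    deepbind = getattr(os, "RTLD_DEEPBIND", 0)
    can_adjust = hasattr(sys, "getdlopenflags") and hasattr(sys, "setdlopenflags")
    if not (enabled and deepbind and can_adjust):
        yield False
        return
    saved = sys.getdlopenflags()
    sys.setdlopenflags(saved | deepbind)
    try:
        yield True
    finally:
        # Restore the previous flags so later imports are unaffected.
        sys.setdlopenflags(saved)


# Usage sketch (the module name comes from the PR description):
#     with deepbind_import():
#         import nixl_ep_cpp
```

Restoring the flags in a finally block matters because the setting is process-wide: leaving RTLD_DEEPBIND enabled would change symbol resolution for every extension module imported afterwards.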

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

external-contribution, size/L

Suggested reviewers

  • dpressle
  • brminich
  • ovidiusm

Poem

🐰 A Docker dance with custom repos so fine,
Binding symbols deep with RTLD divine,
Validation wheels and plugins in place,
Your UCX build now runs at lightning pace! ⚡

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.04% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly and concisely summarizes the main change: building UCX with a private SONAME suffix and loading with RTLD_DEEPBIND to prevent symbol collisions. |
| Description check | ✅ Passed | The PR description comprehensively follows the template with well-structured What/Why/How sections that explain the feature, its motivation, and implementation details. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
src/plugins/ucx/ucx_plugin.cpp

src/plugins/ucx/ucx_plugin.cpp:18:10: fatal error: 'backend/backend_plugin.h' file not found
18 | #include "backend/backend_plugin.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
src/plugins/ucx/ucx_plugin.cpp:65:5-12: ERROR translating statement 'ReturnStmt'
Aborting translation of method 'anonymous_namespace_src_plugins_ucx_ucx_plugin.cpp::validateUcxBinding' in file 'src/plugins/ucx/ucx_plugin.cpp': "Assert_failure src/clang/cAst_utils.ml:249:53"
Uncaught Internal Error: "Assert_failure src/clang/cAst_utils.ml:249:53"
Error backtrace:
Raised at ClangFrontend__CAst_utils.get_decl_from_typ_ptr in file "src/clang/cAst_utils.ml", line 249, characters 53-65
Called from ClangFrontend__CTrans.CTrans_funct.get_destructor_decl_ref in file "src/clang/cTrans.ml", line 658, characters 12-59
Called from ClangFrontend__CTrans.CTrans_funct.destructor_calls.(fun) in file "src/clang/cTrans.ml", line 2048, characters 12-69
Called from Base__List.rev_filter_map.loop in file "src/l

... [truncated 2200 characters] ...

ontend__CTrans.CTrans_funct.instruction_log.(fun) in file "src/clang/cTrans.ml", line 4784, characters 10-1023
Called from ClangFrontend__CTrans.CTrans_funct.instruction in file "src/clang/cTrans.ml" (inlined), line 4765, characters 38-71
Called from ClangFrontend__CTrans.CTrans_funct.exec_with_node_creation in file "src/clang/cTrans.ml" (inlined), line 104, characters 20-38
Called from ClangFrontend__CTrans.CTrans_funct.get_clang_stmt_trans in file "src/clang/cTrans.ml" (inlined), line 5395, characters 4-69
Called from ClangFrontend__CTrans.CTrans_funct.get_custom_stmt_trans in file "src/clang/cTrans.ml", line 5401, characters 8-55
Called from ClangFrontend__CTrans.CTrans_funct.exec_trans_instrs.exec_trans_instrs_rev in file "src/clang/cTrans.ml" (inlined), line 5365, characters 28-54
Cal

src/core/nixl_plugin_manager.cpp

In file included from src/core/nixl_plugin_manager.cpp:18:
src/core/plugin_manager.h:28:10: fatal error: 'backend/backend_plugin.h' file not found
28 | #include "backend/backend_plugin.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
src/core/nixl_plugin_manager.cpp:287:73-306:1: ERROR translating statement 'CompoundStmt'
Aborting translation of method 'nixlPluginManager::discoverPluginsFromList' in file 'src/core/nixl_plugin_manager.cpp': "Assert_failure src/clang/cAst_utils.ml:249:53"
Uncaught Internal Error: "Assert_failure src/clang/cAst_utils.ml:249:53"
Error backtrace:
Raised at ClangFrontend__CAst_utils.get_decl_from_typ_ptr in file "src/clang/cAst_utils.ml", line 249, characters 53-65
Called from ClangFrontend__CTrans.CTrans_funct.get_destructor_decl_ref in file "src/clang/cTrans.ml", line 658, characters 12-59
Called from ClangFrontend__CTrans.CTrans_funct.destructor_calls.(fun) in file "src/clang/cTrans.ml", line 2048, characters 12-69
Called from Base__Li

... [truncated 2200 characters] ...

Called from ClangFrontend__CFrontend_decl.CFrontend_decl_funct.add_method in file "src/clang/cFrontend_decl.ml" (inlined), line 54, characters 4-52
Called from ClangFrontend__CFrontend_decl.CFrontend_decl_funct.process_method_decl.add_method_if_create_procdesc in file "src/clang/cFrontend_decl.ml" (inlined), line 123, characters 16-158
Called from ClangFrontend__CFrontend_decl.CFrontend_decl_funct.process_method_decl in file "src/clang/cFrontend_decl.ml", line 126, characters 17-97
Called from ClangFrontend__CFrontend_decl.CFrontend_decl_funct.process_methods in file "src/clang/cFrontend_decl.ml" (inlined), line 270, characters 8-122
Called from Stdlib__List.iter in file "list.ml" (inlined), line 110, characters 12-15
Called from Stdlib__List.iter in file "list.ml" (inlined), line 108, cha


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


2 participants