[https://nvbugs/6104831][fix] Port dataTransceiver shared_ptr<LlmRequest> lifetime fix #14979

Open
chienchunhung wants to merge 1 commit into NVIDIA:main from chienchunhung:nvbug6104831-dataTransceiver-lifetime

Conversation

@chienchunhung
Collaborator

@chienchunhung chienchunhung commented Jun 4, 2026

Summary

Closes a UAF / Broken-promise gap in the disagg KV cache transfer that survived PR #14768 (the always-on-baseline tier-1 PR). PR #14768 ported the shared_ptr<LlmRequest> lifetime fix at the outer CacheTransceiver layer (interface + mSenderFutures / mRequesterFutures) but not at the inner dataTransceiver layer where the actual std::promise<void> lives. This PR ports just the inner-layer lifetime piece from PR #13713.

What was missed in PR #14768

| Layer | What PR #14768 did | Was it enough? |
| --- | --- | --- |
| Outer (cacheTransceiver.{h,cpp}) | Changed transceiver interface + mSender/mRequesterFutures storage to shared_ptr<LlmRequest> | ✅ Keeps the LlmRequest object alive |
| Inner (dataTransceiver.{h,cpp}) | Unchanged. Response::mRequest and RequestAndPromise::mRequest still raw pointers; sendAsync(LlmRequest&) / receiveAsync(LlmRequest&) still by reference; std::async(&Impl::requestSync, this, std::ref(llmRequest)) captures by reference | ❌ The inner structures own the std::promise<void>; if they're destroyed during worker error cleanup or the async task races Python _terminate_request, the promise dies unfulfilled → std::future_error: Broken promise on the future side |

The outer-layer fix prevents the C++ worker from dereferencing a freed LlmRequest, but does nothing to prevent the inner Response/RequestAndPromise structures (which hold the promise) from being destroyed in the worker's error path. Under sustained real disagg load, peer drops and worker-side errors are routine; each destroys an inner structure and fires Broken promise on the corresponding future.
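The Broken promise mechanism described above can be reproduced in isolation with the standard library alone. The following is a minimal sketch, not code from this PR: the function name brokenPromiseObserved is hypothetical, and it only shows that destroying a std::promise<void> without calling set_value() or set_exception() surfaces as std::future_errc::broken_promise on the future side.

```cpp
#include <cassert>
#include <future>
#include <system_error>

// Hypothetical helper (not part of TensorRT-LLM): returns true when an
// abandoned promise is reported as broken_promise to the future holder.
bool brokenPromiseObserved()
{
    std::future<void> future;
    {
        std::promise<void> promise;
        future = promise.get_future();
        // The promise is destroyed here without set_value() or
        // set_exception(), mimicking an owning structure being torn
        // down in an error path.
    }
    try
    {
        future.get();
    }
    catch (std::future_error const& e)
    {
        return e.code() == std::future_errc::broken_promise;
    }
    return false;
}
```

Calling brokenPromiseObserved() should return true, which matches the std::future_error: Broken promise text seen in the log signature.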

Observed signature (internal incident report)

Recurring failures with a ~1.9 h MTTF on a Qwen3-Coder-480B 4P2D disagg shadow running a post-PR-#14768 image. One decode worker accumulates a std::future_error: Broken promise storm (concentrated on a single worker; peers idle), trtllm_num_requests_running climbs on that worker while peers stay at 0, the dynamo canary health check fails, kubelet restarts the worker, and serving recovers. The cycle then repeats. Across ~19 h: 19 decode-worker restarts on 2 workers, 4 NVCF instance replacements.

[TRT-LLM] [E][batchmgr][RANK 1] Error occurred during generation transfer for request <id>:
  std::future_error: Broken promise

What this PR changes

Pure lifetime fix at the inner layer — mirrors the relevant subset of PR #13713's dataTransceiver changes:

  • sendAsync(LlmRequest&) → sendAsync(std::shared_ptr<LlmRequest> const&) (both Impl and public CacheSender::sendAsync)
  • receiveAsync(LlmRequest&) → receiveAsync(std::shared_ptr<LlmRequest> const&) (both Impl and public CacheReceiver::receiveAsync)
    • Critical: the inner std::async now captures the shared_ptr by value in a lambda rather than std::ref(llmRequest). This closes the UAF where Python _terminate_request beats the async task
  • requestAndReceiveAsyncMultiThreads(LlmRequest&) → shared_ptr analog
  • struct Response: LlmRequest* → std::shared_ptr<LlmRequest>
  • struct RequestAndPromise: LlmRequest* → std::shared_ptr<LlmRequest> (move-semantics tightened; no more null-out-raw-pointer dance)
  • 4 callers in cacheTransceiver.cpp pass the already-shared_ptr llmRequest directly instead of dereferencing
  • cacheTransceiverTest.cpp WrappedLlmRequest switched from unique_ptr to shared_ptr so the test infra compiles against the new signatures
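The struct change in the list above can be illustrated with a small standalone sketch. Everything here is an assumption for illustration: Request stands in for LlmRequest, ResponseSketch stands in for the Response struct, and std::promise<int> replaces the real std::promise<void> only so that the result is observable. The point is that an entry which co-owns the request through a shared_ptr can be handed to a worker and completed even after the caller has released its own reference.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <utility>

// Hypothetical stand-in for LlmRequest.
struct Request
{
    int id;
};

// Hypothetical analog of Response: co-owns the request, owns the promise.
struct ResponseSketch
{
    std::shared_ptr<Request> request;
    std::promise<int> promise;
};

// Worker side: takes the entry by value and fulfils its promise.
void processResponse(ResponseSketch response)
{
    response.promise.set_value(response.request->id);
}

// Caller side: builds the entry, drops its own reference, then runs the worker.
int sendAndDropCaller()
{
    auto request = std::make_shared<Request>(Request{42});
    ResponseSketch response{request, std::promise<int>{}};
    auto future = response.promise.get_future();
    request.reset(); // caller gives up ownership, as an early terminate would

    // The entry is moved into the worker task; its shared_ptr keeps the
    // request alive until the worker is done with it.
    auto worker = std::async(std::launch::async, processResponse, std::move(response));
    worker.get();
    return future.get();
}
```

With a raw pointer in place of the shared_ptr member, the request.reset() call would leave the worker reading freed memory, which is the failure mode the list describes.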

Out of scope (intentionally — preserving PR #14768's "always-on baseline" scope)

  • In-flight cancel flag registry (mInFlightCancelFlags map, getOrCreateInFlightCancelFlag, per-request flag wiring) — that's the G* cancel surface, env-gated by TRTLLM_DISAGG_ENABLE_INFLIGHT_CANCEL
  • Defensive non-std::exception catch on response() worker — orthogonal safety improvement, separable
  • set_exception on cancelled-response promise destruction — cancel-path semantic improvement, lives with the cancel surface

These remain part of the broader cancellation follow-up tracked under TRTLLM-12721.

Summary by CodeRabbit

  • Refactor
    • Improved internal memory safety for asynchronous cache transfer operations by implementing stronger request lifetime guarantees. Background cache transfer tasks now maintain proper ownership of requests throughout the entire async operation cycle, reducing the risk of resource cleanup issues in multi-threaded scenarios.

@chienchunhung chienchunhung force-pushed the nvbug6104831-dataTransceiver-lifetime branch 3 times, most recently from 8da74f0 to fee06b4 Compare June 4, 2026 23:53
@chienchunhung
Collaborator Author

/bot run --disable-fail-fast

@chienchunhung chienchunhung marked this pull request as ready for review June 5, 2026 00:01
@chienchunhung chienchunhung requested a review from a team as a code owner June 5, 2026 00:01
@coderabbitai
Contributor

coderabbitai Bot commented Jun 5, 2026

Walkthrough

Cache transceiver async APIs transition from raw-pointer/reference lifetimes to std::shared_ptr<LlmRequest> ownership. Public signatures updated; implementation stores shared pointers in internal async state (Response, RequestAndPromise) and worker tasks capture strong references. Call sites and tests updated to match.

Changes

Cache Transceiver Lifetime Safety

| Layer / File(s) | Summary |
| --- | --- |
| Public API signature updates: cpp/tensorrt_llm/batch_manager/dataTransceiver.h | CacheSender::sendAsync() and CacheReceiver::receiveAsync() signatures updated to accept std::shared_ptr<LlmRequest> const& instead of LlmRequest&. Parameter documentation clarifies shared ownership/lifetime extension for async workers. |
| CacheSender ownership implementation: cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp | CacheSender::Impl::sendAsync() accepts shared pointer, records transfer start time, and enqueues a Response struct storing the shared pointer to extend request lifetime. Public wrapper forwards the shared pointer to implementation. |
| CacheReceiver ownership implementation: cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp | CacheReceiver::Impl::receiveAsync() and requestAndReceiveAsyncMultiThreads() accept shared pointer; async tasks capture strong references. RequestAndPromise struct replaces raw LlmRequest* with std::shared_ptr<LlmRequest> for safe cross-thread ownership. Constructor/move logic updated accordingly. |
| CacheTransceiver call site updates: cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp | Four methods (respondAndSendAsync, respondAndSendLayerWise, requestAndReceiveSync, requestAndReceiveAsync) update calls to sendAsync()/receiveAsync() to pass shared pointers directly instead of dereferencing. |
| Test infrastructure and usage updates: cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp | WrappedLlmRequest refactored to store shared_ptr instead of unique_ptr. Factories makeLlmRequest() and makeLlmRequestWithDP() construct shared pointers. Transport helpers call sendAsync()/receiveAsync() using shared pointers. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#14768: Both PRs update CacheTransceiver async send/receive APIs and related call sites to use std::shared_ptr<LlmRequest> instead of raw-pointer/reference lifetimes for ownership safety.

Suggested reviewers

  • reasonsolo
  • bo-nv
  • dongxuy04
  • pcastonguay
  • chuangz0
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 4.76% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: porting a shared_ptr lifetime fix to the inner dataTransceiver layer to fix use-after-free and broken promise issues. |
| Description check | ✅ Passed | The PR description is comprehensive, explaining the issue, what was missed in PR #14768, what changed, and what is intentionally out of scope. Test coverage is not explicitly detailed, but the description is largely complete against the template structure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

774-798: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard async-resource initialization with a mutex.

requestAndReceiveAsyncMultiThreads() mutates mInstanceToAsyncResource and mRequestFutures before it takes asyncResource->mMtxForQueue. Two concurrent receiveAsync() calls for the same processInfo can race on that unordered_map/vector, which is undefined behavior and can start duplicate request workers.

Possible fix
+    std::mutex mAsyncResourceMutex;
+
     [[nodiscard]] std::future<void> requestAndReceiveAsyncMultiThreads(std::shared_ptr<LlmRequest> const& llmRequest)
     {
         try
         {
             auto promise = std::make_unique<std::promise<void>>();
             auto future = promise->get_future();
             TLLM_CHECK(llmRequest->getDataTransceiverState().getCommState().has_value());
             std::string processInfo = kDefaultProcessInfo;
             if (common::getEnvRequestKVCacheConcurrent())
             {
                 processInfo = llmRequest->getDataTransceiverState().getCommState()->toString();
             }
-            if (mInstanceToAsyncResource.find(processInfo) == mInstanceToAsyncResource.end())
             {
-
-                mInstanceToAsyncResource.emplace(processInfo, std::make_unique<AsyncResource>());
-                auto requestFuture = std::async(std::launch::async, &CacheReceiver::Impl::request, this,
-                    std::ref(*mInstanceToAsyncResource.at(processInfo)));
-                mRequestFutures.emplace_back(std::move(requestFuture));
+                std::scoped_lock lk(mAsyncResourceMutex);
+                if (mInstanceToAsyncResource.find(processInfo) == mInstanceToAsyncResource.end())
+                {
+                    mInstanceToAsyncResource.emplace(processInfo, std::make_unique<AsyncResource>());
+                    auto requestFuture = std::async(std::launch::async, &CacheReceiver::Impl::request, this,
+                        std::ref(*mInstanceToAsyncResource.at(processInfo)));
+                    mRequestFutures.emplace_back(std::move(requestFuture));
+                }
             }
             auto& asyncResource = mInstanceToAsyncResource.at(processInfo);
             {
                 std::unique_lock<std::mutex> lck(asyncResource->mMtxForQueue);
                 asyncResource->mRequestsQueue.emplace_back(llmRequest, std::move(promise));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp` around lines 774 - 798,
requestAndReceiveAsyncMultiThreads mutates mInstanceToAsyncResource and
mRequestFutures without synchronization and can race between concurrent calls;
wrap the map/vector lookup and potential insertion (the block that checks
mInstanceToAsyncResource.find(processInfo), emplaces a new AsyncResource, starts
the async worker via std::async, and pushes into mRequestFutures) with a
dedicated mutex (e.g., add or reuse a member like mInstanceMutex) so the
creation of the AsyncResource and the requestFuture is atomic, then after
releasing that mutex you can lock asyncResource->mMtxForQueue to push the
(llmRequest, promise) into asyncResource->mRequestsQueue; ensure all accesses to
mInstanceToAsyncResource and mRequestFutures use the same mutex to avoid UB and
duplicate workers.
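The guarded get-or-create pattern this comment asks for can be sketched independently of the transceiver. This is a minimal illustration under stated assumptions: Resource is a hypothetical stand-in for AsyncResource, Registry and countCreationsUnderContention are invented names, and the worker launch via std::async is omitted. Lookup and insertion share one mutex, and the per-resource queue mutex is taken only after the registry mutex has been released.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for AsyncResource.
struct Resource
{
    std::mutex queueMutex;
    std::vector<int> queue;
};

class Registry
{
public:
    // Lookup and insertion happen under one lock, so concurrent callers
    // with the same key observe exactly one creation.
    Resource& getOrCreate(std::string const& key)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        auto it = mResources.find(key);
        if (it == mResources.end())
        {
            it = mResources.emplace(key, std::make_unique<Resource>()).first;
            ++mCreated;
        }
        // The Resource lives on the heap, so this reference stays valid
        // even if the map rehashes later.
        return *it->second;
    }

    int created() const
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mCreated;
    }

private:
    mutable std::mutex mMutex;
    std::unordered_map<std::string, std::unique_ptr<Resource>> mResources;
    int mCreated{0};
};

// Several threads race to register the same key and enqueue one item each.
int countCreationsUnderContention(int numThreads)
{
    Registry registry;
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i)
    {
        threads.emplace_back([&registry, i] {
            Resource& resource = registry.getOrCreate("peer");
            std::lock_guard<std::mutex> lock(resource.queueMutex);
            resource.queue.push_back(i);
        });
    }
    for (auto& thread : threads)
    {
        thread.join();
    }
    return registry.created();
}
```

Returning a reference from inside the locked section, rather than calling at() again after unlocking, avoids an unsynchronized read of the map while another thread may be inserting.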
🧹 Nitpick comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)

351-355: ⚡ Quick win

Add one regression that drops caller ownership before the future resolves.

These call sites now use the shared_ptr API, but the surrounding tests still keep the request alive until after future.get(). That means a regression back to raw/reference capture inside dataTransceiver would likely still pass. Please add at least one case that calls sendAsync() / receiveAsync(), immediately releases the caller-held shared_ptr, and then asserts the future still completes correctly.

Also applies to: 977-988
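One way to make such a regression deterministic is to gate the async task so that it cannot touch the request until the caller has already released its reference. The sketch below is an assumption-laden illustration, not the project's test code: Request, sendAsyncSketch, and the gate parameter are hypothetical, and a real test would go through the actual sendAsync() / receiveAsync() entry points.

```cpp
#include <cassert>
#include <future>
#include <memory>

// Hypothetical stand-in for LlmRequest.
struct Request
{
    int id;
};

// Hypothetical API under test. The gate holds the task back until the
// test has dropped the caller-side shared_ptr.
std::future<int> sendAsyncSketch(std::shared_ptr<Request> const& request, std::shared_future<void> gate)
{
    return std::async(std::launch::async, [request, gate] {
        gate.wait();
        return request->id;
    });
}

bool completesAfterCallerRelease()
{
    std::promise<void> release;
    auto request = std::make_shared<Request>(Request{5});
    std::weak_ptr<Request> observer = request;

    auto future = sendAsyncSketch(request, release.get_future().share());
    request.reset(); // caller drops ownership before the task reads the request

    // The task's captured copy must be the only thing keeping it alive.
    bool const aliveWhilePending = !observer.expired();

    release.set_value(); // let the task proceed
    bool const resultOk = future.get() == 5;
    return aliveWhilePending && resultOk;
}
```

If the task captured the request by reference or raw pointer instead, aliveWhilePending would be false, so the check fails without relying on a crash.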

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp` around lines 351 -
355, Add a regression test that verifies caller ownership of the llmRequest is
not required for the future to complete: after calling
mSender->sendAsync(llmRequest) or mRequester->receiveAsync(llmRequest) store the
returned future in mFutures (or a local future), immediately reset the caller's
shared_ptr (e.g. llmRequest.reset()), then call future.get()/wait() and assert
completion/success. Update the test block around
mFutures.emplace_back(mSender->sendAsync(llmRequest)) / auto future =
mRequester->receiveAsync(llmRequest) to include one scenario that drops the
shared_ptr before resolving the future; apply the same pattern for the other
occurrence mentioned (lines ~977-988) so the test will fail if dataTransceiver
reintroduces raw/reference captures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp`:
- Around line 477-482: The broken-promise occurs because promises are destroyed
when entries are erased; in CacheSender::Impl::sendResponse (and the other spots
around lines noted) ensure you resolve Response::mPromise before erasing from
mReadyResponses by calling mPromise.set_value() (or set_exception with an
appropriate exception on cancellation), and similarly in
CacheReceiver::Impl::cancelRequest resolve RequestAndPromise::mPromise before
removing from asyncResource->mRequestsQueue so the futures returned by
sendAsync() and requestAndReceiveAsyncMultiThreads() are not left broken; update
all indicated sites (including the other occurrences around 797-798 and
1089-1125) to fulfill the promise with success or a cancellation exception prior
to destroying the promise-containing object.
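The suggestion above, resolving the promise before the owning entry is erased, can be sketched as follows. The PR description lists set_exception on the cancel path as out of scope, so this only illustrates the reviewer's idea. Request, PendingEntry, cancelPending, and the "request cancelled" message are all hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <deque>
#include <exception>
#include <future>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for LlmRequest.
struct Request
{
    int id;
};

// Hypothetical analog of RequestAndPromise.
struct PendingEntry
{
    std::shared_ptr<Request> request;
    std::promise<void> promise;
};

// Resolves the promise with a cancellation error, then erases the entry,
// so the future never observes an abandoned promise.
bool cancelPending(std::deque<PendingEntry>& queue, int id)
{
    for (auto it = queue.begin(); it != queue.end(); ++it)
    {
        if (it->request->id == id)
        {
            it->promise.set_exception(std::make_exception_ptr(std::runtime_error("request cancelled")));
            queue.erase(it);
            return true;
        }
    }
    return false;
}

// Reports what the future holder sees after a cancellation.
std::string cancelAndReport()
{
    std::deque<PendingEntry> queue;
    PendingEntry entry{std::make_shared<Request>(Request{7}), std::promise<void>{}};
    auto future = entry.promise.get_future();
    queue.push_back(std::move(entry));

    cancelPending(queue, 7);
    try
    {
        future.get();
    }
    catch (std::future_error const&)
    {
        return "broken promise";
    }
    catch (std::runtime_error const& e)
    {
        return e.what();
    }
    return "ok";
}
```

The caller then receives an error that names the cancellation instead of a generic Broken promise, which makes the two cases distinguishable in logs.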

In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.h`:
- Around line 259-262: The new shared_ptr-based APIs (e.g.,
sendAsync(std::shared_ptr<LlmRequest> const& llmRequest)) accept nullptr but
implementations immediately dereference llmRequest; reject null at entry or
enforce non-null ownership to preserve the old non-null contract. Add a
precondition check in sendAsync (and the other changed APIs at the same area)
that throws or returns a failed future when llmRequest == nullptr (or
alternatively change the API to take std::shared_ptr<LlmRequest> by value and
assert/throw on null), and document the behaviour so callers cannot pass
nullptr; reference sendAsync and the other modified functions that previously
took LlmRequest& to locate and update their entry-point null handling.
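A precondition check of the kind described above might look like the sketch below. It is an illustration only: Request, sendAsyncChecked, and rejectsNull are hypothetical names, and throwing std::invalid_argument is one assumed policy; the project may prefer its own check macro or a failed future.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for LlmRequest.
struct Request
{
    int id;
};

// Rejects a null request at the entry point, before any dereference or
// task launch, restoring the non-null contract of the old reference API.
std::future<int> sendAsyncChecked(std::shared_ptr<Request> const& request)
{
    if (!request)
    {
        throw std::invalid_argument("sendAsyncChecked: request must not be null");
    }
    return std::async(std::launch::async, [request] { return request->id; });
}

// Confirms that a null argument is refused rather than dereferenced.
bool rejectsNull()
{
    try
    {
        (void) sendAsyncChecked(nullptr);
    }
    catch (std::invalid_argument const&)
    {
        return true;
    }
    return false;
}
```

Failing synchronously at the call site keeps the error close to the caller that passed the null pointer, at the cost of requiring callers to handle an exception from a function that otherwise reports errors through the future.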

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 47b75d87-4b8a-4b26-ac1e-fe04452ef7ab

📥 Commits

Reviewing files that changed from the base of the PR and between a50b5e2 and fee06b4.

📒 Files selected for processing (4)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp

Comment thread cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
Comment thread cpp/tensorrt_llm/batch_manager/dataTransceiver.h
@chienchunhung chienchunhung force-pushed the nvbug6104831-dataTransceiver-lifetime branch from fee06b4 to 3c445b2 Compare June 5, 2026 02:19
@chienchunhung
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #52210 [ run ] triggered by Bot. Commit: 3c445b2 Link to invocation

@chienchunhung chienchunhung enabled auto-merge (squash) June 5, 2026 06:15
@tensorrt-cicd
Collaborator

PR_Github #52210 [ run ] completed with state SUCCESS. Commit: 3c445b2
/LLM/main/L0_MergeRequest_PR pipeline #41529 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung chienchunhung force-pushed the nvbug6104831-dataTransceiver-lifetime branch from 3c445b2 to 42aa1ce Compare June 5, 2026 16:38
…est> lifetime fix

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung chienchunhung force-pushed the nvbug6104831-dataTransceiver-lifetime branch from 42aa1ce to f5ae093 Compare June 5, 2026 16:40
@chienchunhung
Collaborator Author

/bot run --disable-fail-fast --stage-list "A10-PackageSanityCheck-PY310-UB2204"

@tensorrt-cicd
Collaborator

PR_Github #52407 [ run ] triggered by Bot. Commit: f5ae093 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52407 [ run ] completed with state SUCCESS. Commit: f5ae093
/LLM/main/L0_MergeRequest_PR pipeline #41700 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@chienchunhung
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #52426 [ run ] triggered by Bot. Commit: f5ae093 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52426 [ run ] completed with state FAILURE. Commit: f5ae093
/LLM/main/L0_MergeRequest_PR pipeline #41718 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

3 participants