[https://nvbugs/6104831][fix] Port dataTransceiver shared_ptr<LlmRequest> lifetime fix by chienchunhung · Pull Request #14979 · NVIDIA/TensorRT-LLM

chienchunhung · 2026-06-04T23:16:40Z

Summary

Closes a UAF / Broken-promise gap in the disagg KV cache transfer that survived PR #14768 (the always-on-baseline tier-1 PR). PR #14768 ported the shared_ptr<LlmRequest> lifetime fix at the outer CacheTransceiver layer (interface + mSenderFutures / mRequesterFutures) but not at the inner dataTransceiver layer where the actual std::promise<void> lives. This PR ports just the inner-layer lifetime piece from PR #13713.

What was missed in PR #14768

| Layer | What PR #14768 did | Was it enough? |
| --- | --- | --- |
| Outer (`cacheTransceiver.{h,cpp}`) | Changed transceiver interface + `mSenderFutures`/`mRequesterFutures` storage to `shared_ptr<LlmRequest>` | ✅ Keeps the `LlmRequest` object alive |
| Inner (`dataTransceiver.{h,cpp}`) | Unchanged. `Response::mRequest` and `RequestAndPromise::mRequest` still raw pointers; `sendAsync(LlmRequest&)` / `receiveAsync(LlmRequest&)` still by reference; `std::async(&Impl::requestSync, this, std::ref(llmRequest))` captures by reference | ❌ The inner structures own the `std::promise<void>`; if they're destroyed during worker error cleanup or the async task races Python `_terminate_request`, the promise dies unfulfilled → `std::future_error: Broken promise` on the future side |

The outer-layer fix prevents the C++ worker from dereferencing a freed LlmRequest, but does nothing to prevent the inner Response/RequestAndPromise structures (which hold the promise) from being destroyed in the worker's error path. Under sustained real disagg load, peer drops and worker-side errors are routine; each destroys an inner structure and fires Broken promise on the corresponding future.
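
To make the failure mode above concrete, here is a minimal standalone sketch of how a promise that is destroyed before being fulfilled surfaces on the future side. `PendingTransfer` is a hypothetical stand-in for the inner structures that hold the promise; it is not the actual `Response` or `RequestAndPromise` definition.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <system_error>

// Hypothetical stand-in for an inner structure that owns the promise.
struct PendingTransfer
{
    std::promise<void> promise;
};

// Returns true if destroying the owner without fulfilling the promise is
// reported as std::future_errc::broken_promise when the future is read.
bool destroyedOwnerBreaksPromise()
{
    auto pending = std::make_unique<PendingTransfer>();
    std::future<void> future = pending->promise.get_future();
    assert(future.valid());

    // Simulates an error path that drops the structure before set_value().
    pending.reset();

    try
    {
        future.get();
    }
    catch (std::future_error const& e)
    {
        return e.code() == std::make_error_code(std::future_errc::broken_promise);
    }
    return false;
}
```

Calling `destroyedOwnerBreaksPromise()` exercises the same standard-library behavior that produces the `Broken promise` message in the worker log; the real error path in `dataTransceiver.cpp` is more involved.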

Observed signature (internal incident report)

Recurring ~1.9 h MTTF on a Qwen3-Coder-480B 4P2D disagg shadow running a post-PR-#14768 image. One decode worker accumulates a std::future_error: Broken promise storm (concentrated on a single worker; peers idle), trtllm_num_requests_running climbs while peers stay at 0, the dynamo canary health check fails, kubelet restarts the worker, and serving recovers. Repeat. Across ~19 h: 19 decode-worker restarts on 2 workers, 4 NVCF instance replacements.

[TRT-LLM] [E][batchmgr][RANK 1] Error occurred during generation transfer for request <id>:
  std::future_error: Broken promise

What this PR changes

Pure lifetime fix at the inner layer — mirrors the relevant subset of PR #13713's dataTransceiver changes:

- sendAsync(LlmRequest&) → sendAsync(std::shared_ptr<LlmRequest> const&) (both Impl and public CacheSender::sendAsync)
- receiveAsync(LlmRequest&) → receiveAsync(std::shared_ptr<LlmRequest> const&) (both Impl and public CacheReceiver::receiveAsync)
  - Critical: the inner std::async now captures the shared_ptr by value in a lambda rather than std::ref(llmRequest), which closes the UAF where Python _terminate_request beats the async task
- requestAndReceiveAsyncMultiThreads(LlmRequest&) → shared_ptr analog
- struct Response: LlmRequest* → std::shared_ptr<LlmRequest>
- struct RequestAndPromise: LlmRequest* → std::shared_ptr<LlmRequest> (move semantics tightened; no more null-out-raw-pointer dance)
- 4 callers in cacheTransceiver.cpp pass the already-shared_ptr llmRequest directly instead of dereferencing
- cacheTransceiverTest.cpp WrappedLlmRequest switched from unique_ptr to shared_ptr so the test infra compiles against the new signatures
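
The by-value capture described in the list above can be illustrated with a small standalone sketch. `Request` and `receiveAsyncSketch` are hypothetical names standing in for `LlmRequest` and the real `receiveAsync`; the sketch only shows the ownership pattern, not the actual transfer logic.

```cpp
#include <cassert>
#include <future>
#include <memory>

// Hypothetical stand-in for LlmRequest.
struct Request
{
    int id = 0;
};

// Hypothetical analog of the new signature. The lambda copies the shared_ptr,
// so the task holds its own strong reference for as long as it runs.
std::future<int> receiveAsyncSketch(std::shared_ptr<Request> const& request)
{
    assert(request != nullptr);
    return std::async(std::launch::async, [request]() { return request->id; });
}

// The caller drops its handle right after launching the task; the task still
// reads a live object because of the copy captured in the lambda.
int readIdAfterCallerReleases()
{
    auto request = std::make_shared<Request>();
    request->id = 42;
    std::future<int> future = receiveAsyncSketch(request);
    request.reset();
    return future.get();
}
```

With `std::ref(llmRequest)` in place of the captured copy, the same sequence would leave the task holding a dangling reference once the last owner released the request.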

Out of scope (intentionally — preserving PR #14768's "always-on baseline" scope)

- In-flight cancel flag registry (mInFlightCancelFlags map, getOrCreateInFlightCancelFlag, per-request flag wiring): that's the G* cancel surface, env-gated by TRTLLM_DISAGG_ENABLE_INFLIGHT_CANCEL
- Defensive non-std::exception catch on response() worker: orthogonal safety improvement, separable
- set_exception on cancelled-response promise destruction: cancel-path semantic improvement, lives with the cancel surface

These remain part of the broader cancellation follow-up tracked under TRTLLM-12721.
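
For readers unfamiliar with the `set_exception` item listed as out of scope, the following is a rough sketch of the general idea, assuming a hypothetical helper `cancelWithException`. It is not code from this PR or from PR #13713; it only shows that a promise resolved with an exception before destruction yields that exception, rather than `Broken promise`, on the future side.

```cpp
#include <cassert>
#include <exception>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical helper: resolve the promise with an explicit error before the
// structure that owns it is destroyed.
void cancelWithException(std::promise<void>& promise)
{
    promise.set_exception(std::make_exception_ptr(std::runtime_error("transfer cancelled")));
}

// Returns the message observed by the future's consumer.
std::string observeCancellation()
{
    std::future<void> future;
    {
        std::promise<void> promise;
        future = promise.get_future();
        assert(future.valid());
        cancelWithException(promise);
    } // The promise is destroyed here, but the shared state already holds a result.

    try
    {
        future.get();
    }
    catch (std::runtime_error const& e)
    {
        return e.what();
    }
    return "no error";
}
```

The consumer then sees a descriptive cancellation error it can handle, which is the semantic improvement the follow-up work is expected to address.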

Summary by CodeRabbit

Refactor
- Improved internal memory safety for asynchronous cache transfer operations by implementing stronger request lifetime guarantees. Background cache transfer tasks now maintain proper ownership of requests throughout the entire async operation cycle, reducing the risk of resource cleanup issues in multi-threaded scenarios.

chienchunhung · 2026-06-04T23:55:27Z

/bot run --disable-fail-fast

coderabbitai · 2026-06-05T00:11:35Z

📝 Walkthrough

Walkthrough

Cache transceiver async APIs transition from raw-pointer/reference lifetimes to std::shared_ptr<LlmRequest> ownership. Public signatures updated; implementation stores shared pointers in internal async state (Response, RequestAndPromise) and worker tasks capture strong references. Call sites and tests updated to match.
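
As a rough illustration of the pattern summarized above, the sketch below stores a `shared_ptr` next to its promise in a queue entry. `Request`, `RequestAndPromiseSketch`, and `QueueSketch` are hypothetical simplifications; the real `RequestAndPromise` and `AsyncResource` types carry more state.

```cpp
#include <cassert>
#include <deque>
#include <future>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for LlmRequest.
struct Request
{
    bool done = false;
};

// Simplified queue entry: the shared_ptr keeps the request alive while queued.
struct RequestAndPromiseSketch
{
    std::shared_ptr<Request> request;
    std::unique_ptr<std::promise<void>> promise;
};

class QueueSketch
{
public:
    std::future<void> enqueue(std::shared_ptr<Request> const& request)
    {
        auto promise = std::make_unique<std::promise<void>>();
        auto future = promise->get_future();
        std::lock_guard<std::mutex> lock(mMutex);
        mQueue.push_back(RequestAndPromiseSketch{request, std::move(promise)});
        return future;
    }

    // Processes one entry and fulfills its promise; returns false if empty.
    bool processOne()
    {
        RequestAndPromiseSketch item;
        {
            std::lock_guard<std::mutex> lock(mMutex);
            if (mQueue.empty())
            {
                return false;
            }
            item = std::move(mQueue.front());
            mQueue.pop_front();
        }
        assert(item.request && item.promise);
        item.request->done = true;
        item.promise->set_value();
        return true;
    }

private:
    std::mutex mMutex;
    std::deque<RequestAndPromiseSketch> mQueue;
};

// The request stays alive while queued even after the caller releases it, and
// is freed once the processed entry goes out of scope.
bool queuedRequestSurvivesCallerRelease()
{
    QueueSketch queue;
    auto request = std::make_shared<Request>();
    std::weak_ptr<Request> observer = request;
    std::future<void> future = queue.enqueue(request);
    request.reset();
    bool const aliveWhileQueued = !observer.expired();
    queue.processOne();
    future.get();
    return aliveWhileQueued && observer.expired();
}
```

Because the entry owns a strong reference, moving it out of the queue transfers ownership cleanly, with no raw pointer to null out afterwards.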

Changes

Cache Transceiver Lifetime Safety

| Layer / File(s) | Summary |
| --- | --- |
| Public API signature updates `cpp/tensorrt_llm/batch_manager/dataTransceiver.h` | `CacheSender::sendAsync()` and `CacheReceiver::receiveAsync()` signatures updated to accept `std::shared_ptr<LlmRequest> const&` instead of `LlmRequest&`. Parameter documentation clarifies shared ownership/lifetime extension for async workers. |
| CacheSender ownership implementation `cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp` | `CacheSender::Impl::sendAsync()` accepts shared pointer, records transfer start time, and enqueues a `Response` struct storing the shared pointer to extend request lifetime. Public wrapper forwards the shared pointer to implementation. |
| CacheReceiver ownership implementation `cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp` | `CacheReceiver::Impl::receiveAsync()` and `requestAndReceiveAsyncMultiThreads()` accept shared pointer; async tasks capture strong references. `RequestAndPromise` struct replaces raw `LlmRequest*` with `std::shared_ptr<LlmRequest>` for safe cross-thread ownership. Constructor/move logic updated accordingly. |
| CacheTransceiver call site updates `cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp` | Four methods (`respondAndSendAsync`, `respondAndSendLayerWise`, `requestAndReceiveSync`, `requestAndReceiveAsync`) update calls to `sendAsync()`/`receiveAsync()` to pass shared pointers directly instead of dereferencing. |
| Test infrastructure and usage updates `cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp` | `WrappedLlmRequest` refactored to store `shared_ptr` instead of `unique_ptr`. Factories `makeLlmRequest()` and `makeLlmRequestWithDP()` construct shared pointers. Transport helpers call `sendAsync()`/`receiveAsync()` using shared pointers. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#14768: Both PRs update CacheTransceiver async send/receive APIs and related call sites to use std::shared_ptr<LlmRequest> instead of raw-pointer/reference lifetimes for ownership safety.

Suggested reviewers

reasonsolo
bo-nv
dongxuy04
pcastonguay
chuangz0

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 4.76% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: porting a shared_ptr lifetime fix to the inner dataTransceiver layer to fix use-after-free and broken promise issues. |
| Description check | ✅ Passed | The PR description is comprehensive, explaining the issue, what was missed in PR `#14768`, what changed, and what is intentionally out of scope. Test coverage is not explicitly detailed, but the description is largely complete against the template structure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

774-798: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard async-resource initialization with a mutex.

requestAndReceiveAsyncMultiThreads() mutates mInstanceToAsyncResource and mRequestFutures before it takes asyncResource->mMtxForQueue. Two concurrent receiveAsync() calls for the same processInfo can race on that unordered_map/vector, which is undefined behavior and can start duplicate request workers.

Possible fix

+    std::mutex mAsyncResourceMutex;
+
     [[nodiscard]] std::future<void> requestAndReceiveAsyncMultiThreads(std::shared_ptr<LlmRequest> const& llmRequest)
     {
         try
         {
             auto promise = std::make_unique<std::promise<void>>();
             auto future = promise->get_future();
             TLLM_CHECK(llmRequest->getDataTransceiverState().getCommState().has_value());
             std::string processInfo = kDefaultProcessInfo;
             if (common::getEnvRequestKVCacheConcurrent())
             {
                 processInfo = llmRequest->getDataTransceiverState().getCommState()->toString();
             }
-            if (mInstanceToAsyncResource.find(processInfo) == mInstanceToAsyncResource.end())
             {
-
-                mInstanceToAsyncResource.emplace(processInfo, std::make_unique<AsyncResource>());
-                auto requestFuture = std::async(std::launch::async, &CacheReceiver::Impl::request, this,
-                    std::ref(*mInstanceToAsyncResource.at(processInfo)));
-                mRequestFutures.emplace_back(std::move(requestFuture));
+                std::scoped_lock lk(mAsyncResourceMutex);
+                if (mInstanceToAsyncResource.find(processInfo) == mInstanceToAsyncResource.end())
+                {
+                    mInstanceToAsyncResource.emplace(processInfo, std::make_unique<AsyncResource>());
+                    auto requestFuture = std::async(std::launch::async, &CacheReceiver::Impl::request, this,
+                        std::ref(*mInstanceToAsyncResource.at(processInfo)));
+                    mRequestFutures.emplace_back(std::move(requestFuture));
+                }
             }
             auto& asyncResource = mInstanceToAsyncResource.at(processInfo);
             {
                 std::unique_lock<std::mutex> lck(asyncResource->mMtxForQueue);
                 asyncResource->mRequestsQueue.emplace_back(llmRequest, std::move(promise));

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp` around lines 774 - 798,
requestAndReceiveAsyncMultiThreads mutates mInstanceToAsyncResource and
mRequestFutures without synchronization and can race between concurrent calls;
wrap the map/vector lookup and potential insertion (the block that checks
mInstanceToAsyncResource.find(processInfo), emplaces a new AsyncResource, starts
the async worker via std::async, and pushes into mRequestFutures) with a
dedicated mutex (e.g., add or reuse a member like mInstanceMutex) so the
creation of the AsyncResource and the requestFuture is atomic, then after
releasing that mutex you can lock asyncResource->mMtxForQueue to push the
(llmRequest, promise) into asyncResource->mRequestsQueue; ensure all accesses to
mInstanceToAsyncResource and mRequestFutures use the same mutex to avoid UB and
duplicate workers.

🧹 Nitpick comments (1)

cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
351-355: ⚡ Quick win

Add one regression that drops caller ownership before the future resolves.

These call sites now use the shared_ptr API, but the surrounding tests still keep the request alive until after future.get(). That means a regression back to raw/reference capture inside dataTransceiver would likely still pass. Please add at least one case that calls sendAsync() / receiveAsync(), immediately releases the caller-held shared_ptr, and then asserts the future still completes correctly.

Also applies to: 977-988
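
A regression check of the kind requested above might look roughly like the following standalone sketch. It does not use the real test fixtures (`mSender`, `mRequester`, `mFutures`) or GoogleTest; `Request` and `sendAsyncSketch` are hypothetical, and a gate is added so the task is guaranteed to read the request only after the caller has released it.

```cpp
#include <cassert>
#include <future>
#include <memory>

// Hypothetical stand-in for LlmRequest.
struct Request
{
    int tokens = 0;
};

// Hypothetical analog of sendAsync(): the task waits on a gate and only then
// reads the request through its own captured shared_ptr copy.
std::future<int> sendAsyncSketch(std::shared_ptr<Request> const& request, std::shared_future<void> gate)
{
    assert(request != nullptr);
    return std::async(std::launch::async,
        [request, gate]()
        {
            gate.wait();
            return request->tokens;
        });
}

// Drops caller ownership before the task is allowed to proceed, then checks
// that the request was kept alive and the future still completes correctly.
bool futureCompletesAfterCallerDropsOwnership()
{
    std::promise<void> gatePromise;
    auto request = std::make_shared<Request>();
    request->tokens = 128;
    std::weak_ptr<Request> observer = request;

    std::future<int> future = sendAsyncSketch(request, gatePromise.get_future().share());
    request.reset();
    bool const keptAlive = !observer.expired();

    gatePromise.set_value();
    int const tokens = future.get();
    return keptAlive && tokens == 128;
}
```

If the task captured a raw pointer or reference instead, `observer.expired()` would already be true after `request.reset()`, so the check would fail rather than silently pass.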
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp` around lines 351 -
355, Add a regression test that verifies caller ownership of the llmRequest is
not required for the future to complete: after calling
mSender->sendAsync(llmRequest) or mRequester->receiveAsync(llmRequest) store the
returned future in mFutures (or a local future), immediately reset the caller's
shared_ptr (e.g. llmRequest.reset()), then call future.get()/wait() and assert
completion/success. Update the test block around
mFutures.emplace_back(mSender->sendAsync(llmRequest)) / auto future =
mRequester->receiveAsync(llmRequest) to include one scenario that drops the
shared_ptr before resolving the future; apply the same pattern for the other
occurrence mentioned (lines ~977-988) so the test will fail if dataTransceiver
reintroduces raw/reference captures.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp`:
- Around line 477-482: The broken-promise occurs because promises are destroyed
when entries are erased; in CacheSender::Impl::sendResponse (and the other spots
around lines noted) ensure you resolve Response::mPromise before erasing from
mReadyResponses by calling mPromise.set_value() (or set_exception with an
appropriate exception on cancellation), and similarly in
CacheReceiver::Impl::cancelRequest resolve RequestAndPromise::mPromise before
removing from asyncResource->mRequestsQueue so the futures returned by
sendAsync() and requestAndReceiveAsyncMultiThreads() are not left broken; update
all indicated sites (including the other occurrences around 797-798 and
1089-1125) to fulfill the promise with success or a cancellation exception prior
to destroying the promise-containing object.

In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.h`:
- Around line 259-262: The new shared_ptr-based APIs (e.g.,
sendAsync(std::shared_ptr<LlmRequest> const& llmRequest)) accept nullptr but
implementations immediately dereference llmRequest; reject null at entry or
enforce non-null ownership to preserve the old non-null contract. Add a
precondition check in sendAsync (and the other changed APIs at the same area)
that throws or returns a failed future when llmRequest == nullptr (or
alternatively change the API to take std::shared_ptr<LlmRequest> by value and
assert/throw on null), and document the behaviour so callers cannot pass
nullptr; reference sendAsync and the other modified functions that previously
took LlmRequest& to locate and update their entry-point null handling.
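
The null-handling concern in the second inline comment can be sketched as an entry-point precondition. This is only an illustration with hypothetical names (`Request`, `sendAsyncChecked`) and `std::invalid_argument`; the real code base has its own check macros such as `TLLM_CHECK`, and the PR author may choose a different approach.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for LlmRequest.
struct Request
{
    int id = 0;
};

// Hypothetical analog of sendAsync() that rejects a null shared_ptr up front
// instead of dereferencing it later on a worker thread.
std::future<int> sendAsyncChecked(std::shared_ptr<Request> const& request)
{
    if (!request)
    {
        throw std::invalid_argument("sendAsync: llmRequest must not be null");
    }
    return std::async(std::launch::async, [request]() { return request->id; });
}

// Returns true if a null argument is rejected at the entry point.
bool rejectsNull()
{
    try
    {
        sendAsyncChecked(nullptr);
    }
    catch (std::invalid_argument const&)
    {
        return true;
    }
    return false;
}

// A valid request passes the precondition and completes normally.
int idFromValidRequest()
{
    auto request = std::make_shared<Request>();
    request->id = 5;
    std::future<int> future = sendAsyncChecked(request);
    assert(future.valid());
    return future.get();
}
```

Failing fast at the call site keeps the old non-null contract of `LlmRequest&` visible to callers even though the parameter type now admits `nullptr`.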



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 47b75d87-4b8a-4b26-ac1e-fe04452ef7ab

📥 Commits

Reviewing files that changed from the base of the PR and between a50b5e2 and fee06b4.

📒 Files selected for processing (4)

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp

chienchunhung · 2026-06-05T02:29:30Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-05T02:35:54Z

PR_Github #52210 [ run ] triggered by Bot. Commit: 3c445b2 Link to invocation

tensorrt-cicd · 2026-06-05T09:42:46Z

PR_Github #52210 [ run ] completed with state SUCCESS. Commit: 3c445b2
/LLM/main/L0_MergeRequest_PR pipeline #41529 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…est> lifetime fix Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung · 2026-06-05T18:23:17Z

/bot run --disable-fail-fast --stage-list "A10-PackageSanityCheck-PY310-UB2204"

tensorrt-cicd · 2026-06-05T18:29:33Z

PR_Github #52407 [ run ] triggered by Bot. Commit: f5ae093 Link to invocation

tensorrt-cicd · 2026-06-05T20:02:22Z

PR_Github #52407 [ run ] completed with state SUCCESS. Commit: f5ae093
/LLM/main/L0_MergeRequest_PR pipeline #41700 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

chienchunhung · 2026-06-05T20:27:14Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-05T20:33:12Z

PR_Github #52426 [ run ] triggered by Bot. Commit: f5ae093 Link to invocation

tensorrt-cicd · 2026-06-06T07:13:50Z

PR_Github #52426 [ run ] completed with state FAILURE. Commit: f5ae093
/LLM/main/L0_MergeRequest_PR pipeline #41718 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

github-actions Bot assigned chienchunhung Jun 4, 2026

chienchunhung force-pushed the nvbug6104831-dataTransceiver-lifetime branch 3 times, most recently from 8da74f0 to fee06b4 Compare June 4, 2026 23:53

chienchunhung requested review from Shixiaowei02 and pcastonguay June 5, 2026 00:01

chienchunhung marked this pull request as ready for review June 5, 2026 00:01

chienchunhung requested a review from a team as a code owner June 5, 2026 00:01

coderabbitai Bot reviewed Jun 5, 2026


Comment thread cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp

Comment thread cpp/tensorrt_llm/batch_manager/dataTransceiver.h

chienchunhung force-pushed the nvbug6104831-dataTransceiver-lifetime branch from fee06b4 to 3c445b2 Compare June 5, 2026 02:19

Shixiaowei02 approved these changes Jun 5, 2026


chienchunhung enabled auto-merge (squash) June 5, 2026 06:15

chienchunhung force-pushed the nvbug6104831-dataTransceiver-lifetime branch from 3c445b2 to 42aa1ce Compare June 5, 2026 16:38

[https://nvbugs/6104831][fix] Port dataTransceiver shared_ptr<LlmRequ…

f5ae093

…est> lifetime fix Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung force-pushed the nvbug6104831-dataTransceiver-lifetime branch from 42aa1ce to f5ae093 Compare June 5, 2026 16:40
