fix(rtpllm): adapt to RTP-LLM PyAttentionInputs host/device field rename #1412
Open
Jonathan-hwx wants to merge 2 commits into
RTP-LLM commit b1f8d50 ("refactor(attn): unify PyAttentionInputs
host/device tensors with _host/_device suffixes") renamed the canonical
device-side attention fields:
    sequence_lengths_plus_1_d  -> sequence_lengths_plus_1_device
    prefix_lengths_d           -> prefix_lengths_device
    cu_seqlens (device)        -> cu_seqlens_device
forward_context.py still read the old names via getattr(..., None), so on
post-rename RTP-LLM each lookup silently returned None. The decode path then
fell back to recomputing (sequence_lengths + input_lengths) into a transient
device tensor. Under CUDA/HIP graph capture that transient is baked into the
graph, while RTP-LLM updates sequence_lengths_plus_1_device in place on
replay. As a result the captured context_lens freeze at their capture-time
values, decode attention reads KV with stale sequence lengths, and accuracy
regresses. Only CUDA-graph mode is affected; eager mode recomputes the
lengths each step.
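The freeze described above can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions: `capture` is a hypothetical stand-in for graph capture (it is not a CUDA/HIP API), lists stand in for device tensors, and the fallback recompute is modelled as happening once at capture time.

```python
# Toy stand-in for graph capture: replay reads the exact buffer objects
# that were seen at capture time. This is not real CUDA/HIP graph code.

def capture(context_lens):
    # The "graph" remembers the buffer object, not how it was computed.
    return lambda: list(context_lens)

# Runtime-owned buffer, refreshed in place before every replay.
seq_lens_plus_1 = [5, 9]

# Fallback path: a fresh temporary computed once, at capture time.
sequence_lengths, input_lengths = [4, 8], [1, 1]
transient = [s + i for s, i in zip(sequence_lengths, input_lengths)]

replay_transient = capture(transient)        # graph bound to the temporary
replay_canonical = capture(seq_lens_plus_1)  # graph bound to the runtime buffer

# Next decode step: the runtime updates its buffer in place.
seq_lens_plus_1[:] = [6, 10]

stale = replay_transient()   # still the capture-time lengths
fresh = replay_canonical()   # sees the in-place update
```

The graph bound to the temporary keeps returning the capture-time lengths, while the graph bound to the runtime-owned buffer observes the in-place update.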
Point all getattr keys at the new _device names. _non_empty_int32 is a no-op
on an already-on-device contiguous int32 tensor, so the canonical buffer
identity is preserved and graph replay binds to RTP-LLM's in-place buffer.
This bug manifests as the GLM-5 ROCm CUDA-graph accuracy issue.
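A minimal sketch of why the stale keys failed silently, assuming a simplified stand-in for PyAttentionInputs (a `SimpleNamespace` holding lists instead of tensors). The `first_attr` helper is hypothetical and not part of this change; it only shows one way a caller could tolerate both naming schemes.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a post-rename PyAttentionInputs object.
attn_inputs = SimpleNamespace(
    sequence_lengths_plus_1_device=[6, 10],
    prefix_lengths_device=[0, 0],
    cu_seqlens_device=[0, 6, 16],
)

# Old key: the attribute no longer exists, so getattr returns the default
# without raising, and the caller silently takes its fallback path.
old = getattr(attn_inputs, "sequence_lengths_plus_1_d", None)

# New key: returns the runtime's own buffer object, identity preserved.
new = getattr(attn_inputs, "sequence_lengths_plus_1_device", None)


def first_attr(obj, *names):
    """Hypothetical helper: first attribute that exists and is not None."""
    for name in names:
        value = getattr(obj, name, None)
        if value is not None:
            return value
    return None


# Prefer the new name, fall back to the old one on pre-rename builds.
compat = first_attr(
    attn_inputs,
    "sequence_lengths_plus_1_device",
    "sequence_lengths_plus_1_d",
)
```

Because `getattr` with a default never raises, a rename on the producer side shows up only as a behaviour change downstream, which is why the regression was hard to spot.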
_resolve_plugin_block_table (both the base and MLA contexts) returned RTP-LLM's
kv_cache_block_id_device whenever it was non-empty -- including under CUDA/HIP
graph capture. That table is RTP-LLM's *cache-store* physical block table, and
by explicit design RTP-LLM does NOT refresh it inside the graph on replay:
    // kv_cache_block_id_{host,device} are physical block IDs dedicated for cache
    // store ... NOT consumed by any GPU attention kernel during CUDA graph replay;
    // attention kernels only use kv_cache_kernel_block_id_{host,device}. Cache
    // store operations run outside the CUDA graph and read the original inputs.
    -- RTP-LLM cuda_graph_runner.cc
Only kv_cache_kernel_block_id_device is D2D-refreshed on replay. So capturing a
graph off kv_cache_block_id_device bakes a stale block_table / slot_mapping into
the graph: on every decode replay step KV is read/written at the frozen
capture-time physical blocks, which corrupts the KV cache and produces garbled
output.
Gate the fast path on `not in_capture`. Under capture we now fall through to the
existing capture-safe path that rebuilds the physical table from the (refreshed)
kernel block table via _recover_physical_block_table_from_kernel + cg_bufs. This
matches the codebase's existing `not in_capture` idiom for capture-unsafe ops.
Together with the host/device field rename fix, this resolves the GLM-5 ROCm
CUDA-graph garbled-output (accuracy) issue.
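The gating described above can be sketched as follows. This is a simplified illustration, not the actual _resolve_plugin_block_table: the function name, its parameters, and the callback standing in for _recover_physical_block_table_from_kernel are all hypothetical, and lists stand in for device tensors.

```python
def resolve_block_table(cache_store_table, in_capture, rebuild_from_kernel_table):
    """Hypothetical simplification of the `not in_capture` gate.

    cache_store_table: stand-in for kv_cache_block_id_device.
    rebuild_from_kernel_table: stand-in for the capture-safe rebuild path.
    """
    has_table = cache_store_table is not None and len(cache_store_table) > 0
    # Fast path is allowed only outside graph capture, because the
    # cache-store table is not refreshed inside the graph on replay.
    if has_table and not in_capture:
        return cache_store_table
    # Under capture (or with no table), rebuild from the refreshed kernel table.
    return rebuild_from_kernel_table()


cache_store = [3, 7, 11]   # illustrative physical block IDs
rebuilt = [30, 70, 110]    # illustrative result of the rebuild path

eager_result = resolve_block_table(cache_store, False, lambda: rebuilt)
capture_result = resolve_block_table(cache_store, True, lambda: rebuilt)
empty_result = resolve_block_table([], False, lambda: rebuilt)
```

Outside capture the fast path still returns the cache-store table unchanged, so eager behaviour is unaffected; only the capture case is redirected to the rebuild path.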