fix(rtpllm): adapt to RTP-LLM PyAttentionInputs host/device field rename by Jonathan-hwx · Pull Request #1412 · ROCm/ATOM

Jonathan-hwx · 2026-06-30T07:54:37Z

RTP-LLM commit b1f8d50 ("refactor(attn): unify PyAttentionInputs host/device tensors with _host/_device suffixes") renamed the canonical device-side attention fields:

sequence_lengths_plus_1_d -> sequence_lengths_plus_1_device
prefix_lengths_d -> prefix_lengths_device
cu_seqlens (device) -> cu_seqlens_device

forward_context.py still read the old names via getattr(..., None), so on post-rename RTP-LLM these lookups silently returned None. The decode path then fell back to recomputing (sequence_lengths + input_lengths) into a transient device tensor. Under CUDA/HIP graph capture that transient is baked into the graph, while RTP-LLM updates sequence_lengths_plus_1_device in place on replay. The captured context_lens therefore freeze at their capture-time values, decode attention reads KV with stale sequence lengths, and accuracy regresses. Only CUDA-graph mode is affected; eager mode recomputes each step, so it was unaffected.
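The silent-None lookup can be reproduced with a stand-in object. This is a minimal sketch, not the real PyAttentionInputs: only the field names come from the description above, the values are arbitrary, and `read_device_field` is a hypothetical helper that is not part of this PR (the PR simply points the getattr keys at the new names).

```python
from types import SimpleNamespace

# Stand-in for a post-rename RTP-LLM PyAttentionInputs (hypothetical object;
# only the field names are taken from the PR description).
post_rename_inputs = SimpleNamespace(
    sequence_lengths_plus_1_device=[5, 9, 3],
    prefix_lengths_device=[0, 0, 0],
    cu_seqlens_device=[0, 5, 14, 17],
)

# Old lookup: the attribute no longer exists, so getattr quietly yields None
# and the caller would fall back to recomputing the value.
old_value = getattr(post_rename_inputs, "sequence_lengths_plus_1_d", None)

# New lookup: the canonical buffer is found.
new_value = getattr(post_rename_inputs, "sequence_lengths_plus_1_device", None)


def read_device_field(inputs, *names):
    """Hypothetical helper: return the first attribute that exists and is not None."""
    for name in names:
        value = getattr(inputs, name, None)
        if value is not None:
            return value
    return None


# Tries the post-rename name first, then the pre-rename name.
compat_value = read_device_field(
    post_rename_inputs,
    "sequence_lengths_plus_1_device",
    "sequence_lengths_plus_1_d",
)
```

Because the default of None is indistinguishable from a field that is legitimately unset, nothing raises here; the mismatch only shows up later as a behavior change.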

Point all getattr keys at the new _device names. _non_empty_int32 is a no-op on an already-on-device contiguous int32 tensor, so the canonical buffer identity is preserved and graph replay binds to RTP-LLM's in-place buffer.
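Why buffer identity matters can be shown with a toy model. This is an analogy in plain Python, not real CUDA/HIP graph capture: `FakeGraph` is a hypothetical class that, like a captured graph, keeps reading the same buffer object it saw at capture time, and the list values are arbitrary.

```python
class FakeGraph:
    """Toy stand-in for a captured graph: it remembers which buffer object
    it was given at capture time and re-reads that same object on replay."""

    def __init__(self, context_lens):
        self._captured = context_lens  # a reference, not a copy

    def replay(self):
        return list(self._captured)


# Buffers the runtime owns (plain lists as stand-ins for device tensors).
sequence_lengths = [4, 8]
input_lengths = [1, 1]
sequence_lengths_plus_1_device = [5, 9]

# Buggy path: the lookup returned None, so a transient is recomputed and
# that transient is what gets captured.
transient = [s + i for s, i in zip(sequence_lengths, input_lengths)]
stale_graph = FakeGraph(transient)

# Fixed path: the canonical buffer itself is captured.
fixed_graph = FakeGraph(sequence_lengths_plus_1_device)

# Next decode step: the runtime updates its canonical buffer in place.
sequence_lengths_plus_1_device[:] = [6, 10]

stale_result = stale_graph.replay()  # still the capture-time values
fixed_result = fixed_graph.replay()  # sees the in-place update
```

The transient is never written again after capture, so only the graph bound to the canonical buffer observes the new lengths.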

This bug manifests as the GLM-5 ROCm CUDA-graph accuracy issue.



_resolve_plugin_block_table (both the base and MLA contexts) returned RTP-LLM's kv_cache_block_id_device whenever it was non-empty -- including under CUDA/HIP graph capture. That table is RTP-LLM's *cache-store* physical block table, and by explicit design RTP-LLM does NOT refresh it inside the graph on replay:

    // kv_cache_block_id_{host,device} are physical block IDs dedicated for cache
    // store ... NOT consumed by any GPU attention kernel during CUDA graph replay;
    // attention kernels only use kv_cache_kernel_block_id_{host,device}. Cache
    // store operations run outside the CUDA graph and read the original inputs.
    -- RTP-LLM cuda_graph_runner.cc

Only kv_cache_kernel_block_id_device is D2D-refreshed on replay. So capturing a graph off kv_cache_block_id_device bakes a stale block_table / slot_mapping into the graph: on every decode replay step KV is read/written at the frozen capture-time physical blocks -> KV cache corruption -> garbled output.

Gate the fast path on `not in_capture`. Under capture we now fall through to the existing capture-safe path that rebuilds the physical table from the (refreshed) kernel block table via _recover_physical_block_table_from_kernel + cg_bufs. This matches the codebase's existing `not in_capture` idiom for capture-unsafe ops.

Together with the host/device field rename fix, this resolves the GLM-5 ROCm CUDA-graph garbled-output (accuracy) issue.
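The `not in_capture` gate can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual _resolve_plugin_block_table: the function name, its signature, and the `recover` callable are hypothetical, and the real recovery via _recover_physical_block_table_from_kernel and cg_bufs is not reproduced here.

```python
def resolve_block_table(cache_store_table, kernel_table, in_capture, recover):
    """Hypothetical simplification of the gating described above.

    cache_store_table: stand-in for kv_cache_block_id_device (may be empty)
    kernel_table:      stand-in for kv_cache_kernel_block_id_device
    in_capture:        True while a CUDA/HIP graph is being captured
    recover:           callable standing in for the capture-safe rebuild
    """
    # Fast path only outside capture: the cache-store table is not refreshed
    # on replay, so a graph must never be captured off it.
    if not in_capture and len(cache_store_table) > 0:
        return "fast", cache_store_table
    # Capture-safe path: derive the physical table from the kernel table,
    # which is the one that is refreshed on replay.
    return "recovered", recover(kernel_table)


# Placeholder recovery that just copies the kernel table; the real mapping
# from kernel blocks to physical blocks is not shown in the PR description.
eager_path = resolve_block_table([7, 8], [70, 80], in_capture=False, recover=list)
capture_path = resolve_block_table([7, 8], [70, 80], in_capture=True, recover=list)
empty_path = resolve_block_table([], [70, 80], in_capture=False, recover=list)
```

Eager execution keeps the fast path, while capture always goes through the rebuild, even when the cache-store table is non-empty.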

Jonathan-hwx added 2 commits June 30, 2026 15:48

Jonathan-hwx wants to merge 2 commits into ROCm:main from Jonathan-hwx:fix/rtp-host-device-rename-cuda-graph-acc
