fix(rtpllm): adapt to RTP-LLM PyAttentionInputs host/device field rename #1412

Open. Jonathan-hwx wants to merge 2 commits into ROCm:main from Jonathan-hwx:fix/rtp-host-device-rename-cuda-graph-acc

RTP-LLM commit b1f8d50 ("refactor(attn): unify PyAttentionInputs host/device tensors with _host/_device suffixes") renamed the canonical device-side attention fields:

sequence_lengths_plus_1_d -> sequence_lengths_plus_1_device
prefix_lengths_d -> prefix_lengths_device
cu_seqlens (device) -> cu_seqlens_device

forward_context.py still read the old names via getattr(..., None), so on post-rename RTP-LLM these lookups silently returned None. The decode path then fell back to recomputing (sequence_lengths + input_lengths) into a transient device tensor. Under CUDA/HIP graph capture that transient is baked into the graph, whereas RTP-LLM updates sequence_lengths_plus_1_device in place on replay. The captured context_lens therefore freeze at their capture-time values, decode attention reads KV with stale sequence lengths, and accuracy regresses. Only CUDA-graph mode is affected; eager mode recomputes the lengths each step.
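The staleness mechanism can be illustrated with a toy model in plain Python. This is only a sketch: it uses no GPU and no real graph API, the `capture` helper and all values are hypothetical, and a list copy stands in for the recomputed transient tensor.

```python
# Toy model of graph capture/replay (no GPU, no torch; all names hypothetical).
# A "captured graph" is a closure bound to the buffer object it saw at capture
# time, loosely like a real graph that records fixed device addresses.

def capture(read_buffer):
    """Return a replay function bound to the buffer object seen at capture."""
    def replay():
        # Reads whatever the bound buffer currently holds.
        return list(read_buffer)
    return replay

# Canonical buffer that the runtime updates in place before each replay.
seq_len_plus_1 = [5, 9]

# Buggy path: the field lookup returned None, so a transient was built instead.
# The copy below is a stand-in for the recomputed tensor; nobody updates it.
transient = list(seq_len_plus_1)
stale_graph = capture(transient)

# Fixed path: the graph binds to the canonical buffer itself.
fresh_graph = capture(seq_len_plus_1)

# One decode step later, the runtime writes new lengths in place.
seq_len_plus_1[:] = [6, 10]

stale_result = stale_graph()   # still the capture-time values
fresh_result = fresh_graph()   # sees the in-place update
```

In this model the stale graph keeps returning the capture-time lengths, which mirrors the frozen context_lens described above.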

Point all getattr keys at the new _device names. _non_empty_int32 is a no-op on an already-on-device contiguous int32 tensor, so the canonical buffer identity is preserved and graph replay binds to RTP-LLM's in-place buffer.
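A minimal sketch of why the old keys failed silently and why the new keys preserve buffer identity. `SimpleNamespace` stands in for a post-rename PyAttentionInputs, plain lists stand in for device tensors, and `read_device_field` is a hypothetical helper, not the code in forward_context.py.

```python
from types import SimpleNamespace

def read_device_field(attn_inputs, name):
    """Return the named field, or None if it is missing (getattr default)."""
    return getattr(attn_inputs, name, None)

# Stand-in for post-rename inputs; the values are placeholders.
canonical = [5, 9]
post_rename = SimpleNamespace(
    sequence_lengths_plus_1_device=canonical,
    prefix_lengths_device=[0, 0],
    cu_seqlens_device=[0, 5, 14],
)

# Old key: no AttributeError is raised, the lookup just yields None.
old_lookup = read_device_field(post_rename, "sequence_lengths_plus_1_d")

# New key: returns the very same object, so in-place updates stay visible.
new_lookup = read_device_field(post_rename, "sequence_lengths_plus_1_device")
```

Because a `None` default hides the missing attribute, a rename on the producer side produces no error on the consumer side, only a quiet fall-through to the recompute path.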

This manifests as the GLM-5 ROCm CUDA-graph accuracy issue.

_resolve_plugin_block_table (both the base and MLA contexts) returned RTP-LLM's
kv_cache_block_id_device whenever it was non-empty -- including under CUDA/HIP
graph capture. That table is RTP-LLM's *cache-store* physical block table, and
by explicit design RTP-LLM does NOT refresh it inside the graph on replay:

  // kv_cache_block_id_{host,device} are physical block IDs dedicated for cache
  // store ... NOT consumed by any GPU attention kernel during CUDA graph replay;
  // attention kernels only use kv_cache_kernel_block_id_{host,device}. Cache
  // store operations run outside the CUDA graph and read the original inputs.
  -- RTP-LLM cuda_graph_runner.cc

Only kv_cache_kernel_block_id_device is refreshed (device-to-device copy) on
replay. Capturing a graph off kv_cache_block_id_device therefore bakes a stale
block_table / slot_mapping into the graph: on every decode replay step KV is
read and written at the frozen capture-time physical blocks, which corrupts the
KV cache and produces garbled output.

Gate the fast path on `not in_capture`. Under capture we now fall through to the
existing capture-safe path that rebuilds the physical table from the (refreshed)
kernel block table via _recover_physical_block_table_from_kernel + cg_bufs. This
matches the codebase's existing `not in_capture` idiom for capture-unsafe ops.
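
The gating can be sketched as follows. This is a simplified illustration, not
the actual _resolve_plugin_block_table: `resolve_block_table` and
`rebuild_from_kernel_table` are hypothetical names, and plain lists stand in
for device tensors.

```python
from types import SimpleNamespace

def resolve_block_table(attn_inputs, in_capture, rebuild_from_kernel_table):
    """Hypothetical simplification of the `not in_capture` gate."""
    table = getattr(attn_inputs, "kv_cache_block_id_device", None)
    non_empty = table is not None and len(table) > 0
    if non_empty and not in_capture:
        # Fast path: fine in eager mode, where the table is read fresh
        # on every step.
        return table
    # Capture-safe path: derive the table from the kernel block table,
    # which is the one refreshed on replay.
    return rebuild_from_kernel_table(attn_inputs)

# Placeholder inputs and a stand-in rebuild function for illustration only.
inputs = SimpleNamespace(
    kv_cache_block_id_device=[7, 8],
    kv_cache_kernel_block_id_device=[70, 80],
)

def rebuild(attn_inputs):
    return ("rebuilt", attn_inputs.kv_cache_kernel_block_id_device)

eager_table = resolve_block_table(inputs, in_capture=False,
                                  rebuild_from_kernel_table=rebuild)
capture_table = resolve_block_table(inputs, in_capture=True,
                                    rebuild_from_kernel_table=rebuild)
```

The point of the gate is that the capture branch never returns the cache-store
table, even when that table is non-empty.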

Together with the host/device field rename fix, this resolves the GLM-5 ROCm
CUDA-graph garbled-output (accuracy) issue.