fix : (GLM-5-2-FP8)do not use buffer out of cudagraph when unnecessary by JiaoliangYu · Pull Request #1391 · ROCm/ATOM

JiaoliangYu · 2026-06-29T02:48:32Z

Motivation

GLM-5.2-FP8 benchmark/serving (isl=1024 osl=1024, higher concurrency) intermittently dies with a GPU Memory access fault ... Reason: Unknown (process exit -6). The faulting kernel is the sparse-MLA decode index converter _convert_req_index_to_global_index_kernel (MEMORY_VIOLATION). The failure is non-deterministic (the same commit passes and fails across CI runs) and only happens under CUDA/HIP-graph capture — with --enforce-eager it never reproduces (145k+ eager decode steps, zero faults).

Under CUDA-graph replay, the convert kernel reads a stale/cumulative cu_seqlens_q (qo_indptr) instead of the decode arange. For decode each request has exactly one query token, so cu_seqlens_q must be [0,1,2,...,bs] (qo_end[i] = i+1 ≤ bs). At the fault, the registers show qo_end values far larger than bs (e.g. 860/875 with bs ≤ 512), i.e. a prefill-shaped cumulative layout.
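The decode invariant described above can be written as a small host-side check. This is a minimal sketch, not code from the PR: `is_decode_arange` and `max_qo_end` are hypothetical helpers, and the prefill-shaped values are illustrative, loosely modeled on the 860/875 register values reported above.

```python
def is_decode_arange(cu_seqlens_q, bs):
    """Return True if cu_seqlens_q is the decode layout [0, 1, ..., bs]."""
    return list(cu_seqlens_q[: bs + 1]) == list(range(bs + 1))


def max_qo_end(cu_seqlens_q, bs):
    """Largest qo_end over the batch; in decode it must not exceed bs."""
    return max(cu_seqlens_q[i + 1] for i in range(bs))


bs = 4
decode_layout = [0, 1, 2, 3, 4]         # one query token per request
prefill_like = [0, 300, 560, 860, 875]  # cumulative, prefill-shaped (illustrative)

print(is_decode_arange(decode_layout, bs), max_qo_end(decode_layout, bs))
print(is_decode_arange(prefill_like, bs), max_qo_end(prefill_like, bs))
```

A check like this only validates what the host wrote. It would not catch the staleness described in this PR, where the captured kernel observes different device memory under replay.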

That drives the loop variable token_id out of range, so the kernel reads a row of token_indices (a per-step torch.empty buffer) that the indexer never filled. Those positions hold the -1 "invalid" sentinel, which leaks into a lane the row mask treats as valid; the unbounded kv_indices + kv_start + tok load with tok = -1 then underflows to kv_indices_base - 4 and hits an unmapped page. This is confirmed byte-exact from the debug-agent dump (faulting address == kv_indices base − 4, kv_start == 0, tok == -1).
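The address arithmetic behind the fault can be reproduced on the host. The sketch below assumes 4-byte (int32) `kv_indices` elements, which is consistent with the reported `base - 4` fault address but is not stated explicitly in the PR. The base address and the `load_address` helper are made up for illustration.

```python
ELEM_SIZE = 4  # assumption: kv_indices holds int32 entries


def load_address(kv_indices_base, kv_start, tok, elem_size=ELEM_SIZE):
    """Byte address touched by an unmasked load of kv_indices[kv_start + tok]."""
    return kv_indices_base + (kv_start + tok) * elem_size


base = 0x7F0000000000  # hypothetical device address of kv_indices
fault_addr = load_address(base, kv_start=0, tok=-1)

# With kv_start == 0 and tok == -1 the load lands one element before the buffer.
print(hex(fault_addr), fault_addr == base - 4)
```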

The Python metadata path writes cu_seqlens_q correctly (decode arange) every step on the main stream before replay; static analysis shows no logic bug. The staleness is a CUDA/HIP-graph runtime artifact on gfx950/MI355X — a captured kernel observing stale device memory under replay.

Technical Details

The convert kernel is decode-only (max_seqlen_q == 1), so token_id == batch_id by construction. Derive the row directly from tl.program_id(0) instead of reading qo_indptr/cu_seqlens_q. The kernel no longer depends on the buffer that goes stale under graph replay, so the fault cannot occur regardless of the underlying runtime root cause — it is a correctness-preserving change, not a band-aid.
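The effect of the change can be illustrated with a plain-Python model of the row selection. This is a simplified sketch under stated assumptions, not the Triton kernel itself: both functions are hypothetical stand-ins, `batch_id` plays the role of `tl.program_id(0)`, and the stale buffer contents are illustrative.

```python
def row_from_qo_indptr(qo_indptr, batch_id):
    """Old behavior (simplified): the row index is read from a device buffer."""
    return qo_indptr[batch_id]


def row_from_program_id(batch_id):
    """New behavior: decode has one query token per request, so row == batch_id."""
    return batch_id


bs = 4
fresh = [0, 1, 2, 3, 4]          # decode arange, as the metadata path writes it
stale = [0, 300, 560, 860, 875]  # prefill-shaped contents seen under replay

old_rows_fresh = [row_from_qo_indptr(fresh, b) for b in range(bs)]
old_rows_stale = [row_from_qo_indptr(stale, b) for b in range(bs)]
new_rows = [row_from_program_id(b) for b in range(bs)]

# The new derivation matches the old one whenever the buffer is correct,
# and stays in range when the buffer is stale.
print(old_rows_fresh == new_rows, max(old_rows_stale), max(new_rows))
```

In this model the two derivations agree whenever the buffer holds the decode arange, which is why the change preserves results in the decode-only case.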

Test Plan

Accuracy CI and nightly benchmark.

Test Result

Accuracy (g64): gsm8k 3-shot flexible-extract = 0.9606 (CI baseline 0.9447, threshold 0.92), 0 faults — accuracy preserved.
Crash: https://github.com/ROCm/ATOM/actions/runs/28325465813

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

valarLip · 2026-06-29T10:25:26Z

-            out_val,
-            mask=valid_mask,
-        )
+    # Decode-only: token_id == batch_id. Don't read qo_indptr (cu_seqlens_q),


token_id == batch_id: this is not true for MTP.

ATOM/atom/model_ops/attention_mla.py

Line 1381 in f797dd5

# NOTE: MTP (max_seqlen_q > 1) uses triton_convert_req_index_to_global_index_dsa_prefill instead

if attn_metadata.max_seqlen_q > 1:
    triton_gather_kv_indices_sparse(
        attn_metadata.sparse_kv_indptr,
        attn_metadata.token_to_seq_idxs,
        topk_indices,
        attn_metadata.kv_indices,
        attn_metadata.kv_indptr,
        NUM_TOPK_TOKENS=topk_tokens,
        out=sparse_kv_indices_buffer,
    )
else:
    triton_convert_req_index_to_global_index(
        attn_metadata.cu_seqlens_q,
        attn_metadata.kv_indptr,
        attn_metadata.sparse_kv_indptr,
        attn_metadata.kv_indices,
        topk_indices,
        NUM_TOPK_TOKENS=topk_tokens,
        out=sparse_kv_indices_buffer,
    )

MTP does not use this kernel, right?

zufayu requested a review from jiayyu June 29, 2026 02:54

JiaoliangYu changed the title ~~fix : do not use buffer out of cudagraph when unnecessary~~ fix : (GLM-5-2-FP8)do not use buffer out of cudagraph when unnecessary Jun 29, 2026

fix : do not use buffer out of cudagraph when unnecessary

1a0d5b8

JiaoliangYu force-pushed the fix/sparse-mla-convert-tok-bounds branch from 6d94ad1 to 1a0d5b8 Compare June 29, 2026 03:00

Merge branch 'main' into fix/sparse-mla-convert-tok-bounds

3739316

JiaoliangYu marked this pull request as ready for review June 29, 2026 03:01

JiaoliangYu marked this pull request as draft June 29, 2026 03:01

Merge branch 'main' into fix/sparse-mla-convert-tok-bounds

dbe8b93

JiaoliangYu marked this pull request as ready for review June 29, 2026 06:33

valarLip reviewed Jun 29, 2026

JiaoliangYu requested a review from valarLip June 30, 2026 03:52

JiaoliangYu wants to merge 3 commits into ROCm:main from JiaoliangYu:fix/sparse-mla-convert-tok-bounds