[GLM5.1/5.2] Fix acc drop for long prompt #1379

Open: zejunchen-zejun wants to merge 3 commits into main from zejun/fix_glm_20_shot_acc_issue_0627
zejunchen-zejun (Collaborator) commented Jun 27, 2026

Before fix:

| Model | Shot | Strict | Flexible |
| --- | --- | --- | --- |
| GLM5.1 FP8 | 5 | 0.945413 | 0.939348 |
| GLM5.1 FP8 | 20 | 0.780000 | 0.780000 |
| GLM5.2 FP8 | 5 | 0.943139 | 0.943897 |
| GLM5.2 FP8 | 20 | 0.001516 | 0.012130 |

After fix:

| Model | Shot | Strict | Flexible |
| --- | --- | --- | --- |
| GLM5.1 FP8 | 5 | 0.935557240333586 | 0.9317664897649734 |
| GLM5.1 FP8 | 20 | 0.9446550416982562 | 0.9454131918119788 |
| GLM5.2 FP8 | 5 | 0.9416224412433661 | 0.9416224412433661 |
| GLM5.2 FP8 | 20 | 0.9454131918119788 | 0.9454131918119788 |
| DeepSeek-V3.2 | 5 | 0.9514783927217589 | 0.9514783927217589 |
| DeepSeek-V3.2 | 20 | 0.9529946929492039 | 0.9537528430629265 |

How to fix:

  1. The persistent MLA metadata was not passed into mla_decode_fwd during sparse prefill. In sparse prefill, the prefill requests are split into multiple virtual decode requests with q_len=1, but the current code does not pass the corresponding metadata to mla_decode_fwd.
  2. Chunked prefill breaks causality.
  3. The indexer should compute the QK score after RoPE has been applied to q and k, but the current code computes the score without RoPE.
  4. GLM5.2 uses a shared indexer. On a shared layer the indexer was wrongly set to None, and the succeeding code checks `indexer is None` to decide whether the layer uses sparse MLA, so the wrong MLA path was chosen.
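
Items 1 and 2 can be illustrated with a minimal sketch; it is not the PR's code. It assumes that each query token of a prefill chunk becomes one q_len=1 virtual decode request, and that the KV length visible to that request is the number of tokens cached by earlier chunks plus the token's own position inside the chunk. The helper name `virtual_decode_kv_lens` and its parameters are hypothetical.

```python
def virtual_decode_kv_lens(num_cached_tokens: int, chunk_len: int) -> list[int]:
    """Causal KV length for each query token of one prefill chunk.

    Hypothetical helper: token i of the chunk (0-based) may attend to every
    token cached by earlier chunks plus tokens 0..i of the current chunk.
    """
    return [num_cached_tokens + i + 1 for i in range(chunk_len)]


# A 6-token prompt prefilled in two chunks of 3 tokens each.
first_chunk = virtual_decode_kv_lens(num_cached_tokens=0, chunk_len=3)
second_chunk = virtual_decode_kv_lens(num_cached_tokens=3, chunk_len=3)

print(first_chunk)   # [1, 2, 3]
print(second_chunk)  # [4, 5, 6]
```

Under this assumption, causality would be violated if the second chunk dropped the cached offset, or if every token in a chunk were given the KV length of the chunk's last token.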

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
@zufayu zufayu requested a review from jiayyu June 29, 2026 02:54
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
@zejunchen-zejun zejunchen-zejun marked this pull request as ready for review June 30, 2026 02:25
Copilot AI review requested due to automatic review settings June 30, 2026 02:25

Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses long-prompt accuracy drops for GLM-5.1/5.2 (and confirms no regression for DeepSeek-V3.2) by fixing multiple issues in sparse MLA prefill/indexing, including causality handling, RoPE application in the indexer path, and correct sparse-mode selection for GLM-5.2 IndexShare “shared” layers.

Changes:

  • Fix indexer scoring to apply RoPE to q/k before computing QK scores, and make RoPE style configurable via rope_interleave.
  • Fix sparse prefill metadata construction to preserve causality across chunked prefill (per-token virtual decode layout) and generate the required sparse-prefill MLA metadata.
  • Ensure GLM-5.2 IndexShare “shared” layers still run sparse MLA by deriving sparsity at the model level (not per-layer indexer is None).
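
Why the RoPE ordering matters for indexer scoring can be shown with a small plain-Python sketch. This is an illustration only, not the code in atom/models/deepseek_v2.py: the `rope` and `dot` helpers, the vector values, and the `interleave` argument (standing in for rope_interleave) are all assumptions made for the example. It shows that a raw q·k score is blind to token positions, while a score computed after RoPE depends on the relative offset between the two tokens, and that the two pairing styles give different scores.

```python
import math


def rope(x: list[float], pos: int, interleave: bool, base: float = 10000.0) -> list[float]:
    """Rotate pairs of x by position-dependent angles (plain-Python RoPE sketch)."""
    d = len(x)
    half = d // 2
    out = list(x)
    for i in range(half):
        # Interleaved style pairs (2i, 2i+1); half-split style pairs (i, i + d/2).
        a, b = (2 * i, 2 * i + 1) if interleave else (i, i + half)
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[a] = x[a] * c - x[b] * s
        out[b] = x[a] * s + x[b] * c
    return out


def dot(u: list[float], v: list[float]) -> float:
    return sum(p * q for p, q in zip(u, v))


q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.4]

# Without RoPE the score is the same wherever the two tokens sit.
score_no_rope = dot(q, k)

# With RoPE the score depends on the relative offset between query and key.
score_a = dot(rope(q, 5, False), rope(k, 2, False))      # offset 3
score_b = dot(rope(q, 13, False), rope(k, 10, False))    # offset 3 again
score_far = dot(rope(q, 13, False), rope(k, 2, False))   # offset 11

# The pairing style changes the result, so it has to match the model's convention.
score_interleaved = dot(rope(q, 5, True), rope(k, 2, True))
```

Here `score_a` and `score_b` agree up to floating-point error because both pairs are three positions apart, while `score_far` differs. A top-k selection built on `score_no_rope` would therefore rank keys without any positional signal.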

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| atom/models/deepseek_v2.py | Applies RoPE correctly in indexer q/k scoring and propagates model-level sparse settings for IndexShare layers. |
| atom/model_ops/attentions/aiter_mla.py | Builds sparse-prefill per-token causality metadata and allocates/publishes sparse-prefill MLA work buffers. |
| atom/model_ops/attention_mla.py | Adds model-level sparse flag/top-k plumbing and forwards sparse-prefill work metadata into mla_decode_fwd. |
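
The difference between a per-layer check and a model-level flag can be sketched as follows. This is a simplified illustration, not the plumbing added in attention_mla.py: the `Layer` class, both functions, and the `model_uses_sparse_mla` parameter are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layer:
    """Hypothetical stand-in for an attention layer."""
    name: str
    indexer: Optional[object]  # None on an IndexShare "shared" layer


def is_sparse_per_layer(layer: Layer) -> bool:
    # Fragile: a shared layer has no indexer of its own, so it looks dense.
    return layer.indexer is not None


def is_sparse_model_level(layer: Layer, model_uses_sparse_mla: bool) -> bool:
    # Sparsity is treated as a property of the model, not of one layer's indexer.
    return model_uses_sparse_mla


layers = [Layer("layer0", indexer=object()), Layer("layer1_shared", indexer=None)]

per_layer = [is_sparse_per_layer(layer) for layer in layers]
model_level = [is_sparse_model_level(layer, True) for layer in layers]

print(per_layer)    # [True, False]
print(model_level)  # [True, True]
```

With the per-layer check the shared layer falls back to the dense path, which matches the failure described for GLM5.2. The model-level flag keeps every layer on the sparse path regardless of which layer owns the indexer.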

Comment on lines +208 to +216
    ) = get_mla_metadata_info_v1(
        self.max_num_batched_tokens,
        1,  # sparse prefill treats each query token as q_len=1
        self.padded_num_attention_heads,
        self.dtype_q,
        self.dtype_kv,
        is_sparse=True,
        fast_mode=True,
    )