[MI455] MiniMax-M3 gfx1250 enabling by leonling-ll · Pull Request #1410 · ROCm/ATOM

leonling-ll · 2026-06-30T06:40:52Z

This PR enables MiniMax-M3 to run end to end on MI455 and fixes an accuracy issue.

There are 3 main changes:

1. Route MiniMax-M3 attention from pa_decode_gluon to unified_attention on gfx1250 (MI455).
   gfx1250 has no pa_decode_gluon kernel (gluon supports gfx942/gfx950 only).
2. Support SwiGLU-OAI activation in the dense shared-expert GEMM.
   Mxfp4MoEMethod._apply_shared_experts_dense previously hard-asserted the SiLU path, so MiniMax-M3 (ActivationType.Swiglu with fused shared experts) crashed. It now supports both activations.
3. Fix prefill weight layout: gguu→gugu interleave (thanks to @ganyi1996ppo).

Mxfp4MoEMethod._apply_shared_experts_dense hard-asserted the SiLU activation path, so MiniMax-M3 (ActivationType.Swiglu with fused shared experts) crashed with "dense shared-expert GEMM only supports the SiLU activation path".

MiniMax-M3 uses SwiGLU-OAI (gate*sigmoid(alpha*gate)*(up+beta)) and does not interleave gate/up, so the dense GEMM output is split [gate|up] -- exactly what swiglu_oai_split consumes. The dense shared expert now mirrors MiniMaxM3MLP.forward and the routed experts' alpha / swiglu_add_residual=True path.

- moe.py: drop the assert; branch the activation step. SiLU keeps fused_clamp_act_mul (DeepSeek, unchanged); SwiGLU uses swiglu_oai_split with alpha/beta/limit read from the layer.
- minimax_m3.py: stash swiglu_alpha/swiglu_beta on self.experts (from config, defaults 1.702/1.0) next to swiglu_limit.
- tests: numerical test that reuses the same kernel GEMM and varies only the activation, isolating the fix from mxfp4/bf16 GEMM precision.

Verified on MI350X: fixed path matches the SwiGLU-OAI reference exactly, old SiLU behaviour diverged ~20%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
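For reviewers who want to sanity-check the activation change by hand, here is a minimal pure-Python sketch of the SwiGLU-OAI formula quoted in the commit message, applied to one row laid out as [gate|up]. It is not the `swiglu_oai_split` kernel. The function names are hypothetical, and the `limit` clamp convention (gate clamped from above, up clamped symmetrically) is an assumption, since the PR text only says a limit is read from the layer.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def swiglu_oai_split_ref(row, alpha=1.702, beta=1.0, limit=None):
    """Hypothetical reference: row is [gate | up], not interleaved.

    Computes gate * sigmoid(alpha * gate) * (up + beta) per element.
    """
    if len(row) % 2 != 0:
        raise ValueError("row must hold an equal number of gate and up values")
    half = len(row) // 2
    gate, up = row[:half], row[half:]
    out = []
    for g, u in zip(gate, up):
        if limit is not None:
            # ASSUMPTION: clamp convention is not stated in the PR text.
            g = min(g, limit)
            u = max(-limit, min(u, limit))
        out.append(g * sigmoid(alpha * g) * (u + beta))
    return out


def silu_mul_ref(row):
    """Plain SiLU-and-multiply on the same [gate | up] layout, for comparison."""
    half = len(row) // 2
    return [g * sigmoid(g) * u for g, u in zip(row[:half], row[half:])]


if __name__ == "__main__":
    row = [1.0, -0.5, 2.0, 0.25]  # gate = [1.0, -0.5], up = [2.0, 0.25]
    print("SwiGLU-OAI:", swiglu_oai_split_ref(row))
    print("SiLU*up:   ", silu_mul_ref(row))
```

Feeding both functions the same row makes the difference between the two activation paths visible without involving any GEMM precision, which is the same isolation idea the commit's numerical test describes.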

gfx1250 has no pa_decode_gluon kernel (gluon supports gfx942 / gfx950 only), so MiniMax-M3 decode crashed with "pa_decode_gluon only supports gfx942 (CDNA3) and gfx950 (CDNA4)". The compacted sparse block table / context lengths the runners already build are exactly the (block_table, seqused_k) contract unified_attention consumes over the same SHUFFLE KV cache, so route both the full-attn and sparse decode paths through the triton unified_attention on gfx1250.

Full-attn (attention_mha.py):
- paged_attention_triton: add a use_unified flag that includes get_gfx() == "gfx1250" so decode takes the unified_attention branch instead of run_pa_decode_gluon. Gluon retained for CDNA3/CDNA4.

Sparse (minimax_m3/sparse_attn.py) -- note ATOM_USE_UNIFIED_ATTN does NOT gate this path; the sparse runners call run_pa_decode_gluon directly:
- add _sparse_decode_unified_attention helper feeding the kv-head collapsed SHUFFLE cache + sparse_bt + sparse_ctx into unified_attention(shuffled_kv_cache=True), each token a length-1 causal sequence (mirrors gluon max_seqlen_q=1 per-token-as-decode).
- gfx1250 branch in minimax_m3_sparse_attn_decode_asm and _run_prefill_fp8_gluon: bf16 -> helper; fp8 -> NotImplementedError (gluon per-page descale has no unified_attention equivalent yet).

Caveat: validated to compile/import on gfx950; the sparse path's GQA/block_table semantics still need MI455 numerical validation against the gfx950 gluon reference, and fp8 KV cache on gfx1250 is unsupported.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
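The sparse-path routing described in this commit message can be summarised as a small decision function. The sketch below is illustrative only: `pick_sparse_decode_backend` and `per_token_cu_seqlens` are hypothetical names, the backends are returned as strings rather than called, and the cumulative-offset layout for "each token a length-1 sequence" is an assumption based on the usual varlen attention convention, not on the actual `unified_attention` signature.

```python
# From the PR text: gluon supports gfx942 (CDNA3) and gfx950 (CDNA4) only.
GLUON_ARCHS = {"gfx942", "gfx950"}


def pick_sparse_decode_backend(gfx: str, kv_cache_dtype: str) -> str:
    """Hypothetical summary of the sparse decode routing in this PR."""
    if gfx == "gfx1250":
        if kv_cache_dtype == "fp8":
            # Matches the guard added in the PR: no per-page descale equivalent yet.
            raise NotImplementedError(
                "fp8 sparse decode is not supported on gfx1250; use a bf16 KV cache"
            )
        return "unified_attention"
    if gfx in GLUON_ARCHS:
        return "pa_decode_gluon"
    raise RuntimeError(f"no sparse decode kernel known for {gfx}")


def per_token_cu_seqlens(num_tokens: int) -> list:
    """ASSUMPTION: query sequences are described by cumulative offsets.

    Treating each token as its own length-1 sequence gives offsets
    [0, 1, ..., num_tokens], so every query length is exactly 1.
    """
    return list(range(num_tokens + 1))


if __name__ == "__main__":
    print(pick_sparse_decode_backend("gfx1250", "bf16"))
    print(pick_sparse_decode_backend("gfx950", "fp8"))
    print(per_token_cu_seqlens(4))
```

Note that this sketch covers only the sparse runners; per the commit message, the full-attention path is steered separately through the `use_unified` flag in `paged_attention_triton`.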

Signed-off-by: ganyi <ygan@amd.com>

Copilot

Pull request overview

Enables MiniMax-M3 end-to-end functionality on MI455 (gfx1250) by routing sparse/decode attention away from unsupported gluon kernels, fixing MoE SwiGLU shared-expert activation support, and correcting a weight-layout issue that impacted accuracy.

Changes:

Route MiniMax-M3 sparse decode/prefill attention to unified_attention on gfx1250 (and gate off fp8 sparse decode there).
Add SwiGLU-OAI (alpha/beta) support to the dense shared-expert GEMM path and ensure shared weights are safely detached before in-place layout transforms.
Fix routed expert w13 gate/up row ordering via in-place gguu→gugu interleave to match the triton SwiGLU kernel’s expectations.
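As a rough illustration of the gguu→gugu reordering, the sketch below interleaves rows under the assumption that "gguu" means all gate rows followed by all up rows, and "gugu" means gate and up rows alternating. That reading is inferred from the naming, not confirmed by the PR text. The helper name is hypothetical, and unlike the PR's in-place transform this version builds a new list.

```python
def interleave_gate_up(rows):
    """Hypothetical helper: [g0, g1, ..., u0, u1, ...] -> [g0, u0, g1, u1, ...].

    ASSUMPTION: the first half of `rows` holds gate rows and the second
    half holds the matching up rows, in the same order.
    """
    if len(rows) % 2 != 0:
        raise ValueError("expected an equal number of gate and up rows")
    half = len(rows) // 2
    gate, up = rows[:half], rows[half:]
    out = []
    for g, u in zip(gate, up):
        out.append(g)
        out.append(u)
    return out


if __name__ == "__main__":
    w13_rows = ["g0", "g1", "g2", "u0", "u1", "u2"]
    print(interleave_gate_up(w13_rows))
```

A kernel that expects alternating rows would read a gate row as an up row (and vice versa) if handed the blocked layout, which is consistent with the accuracy problem this PR reports fixing.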

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| atom/models/minimax_m3.py | Plumbs `swiglu_alpha`/`swiglu_beta` into the MoE experts module for shared-expert dense SwiGLU-OAI parity. |
| atom/model_ops/moe.py | Adds in-place gate/up row interleave and extends dense shared-expert GEMM to support SwiGLU-OAI; fixes shared-weight stashing to avoid aliasing. |
| atom/model_ops/minimax_m3/sparse_attn.py | Adds gfx1250 fallback routing for sparse decode/prefill to `unified_attention` and blocks fp8 sparse decode on gfx1250. |
| atom/model_ops/attention_mha.py | Forces `paged_attention_triton` to use `unified_attention` on gfx1250 to avoid unsupported gluon decode kernels. |


+            raise NotImplementedError(
+                "MiniMax-M3 fp8 sparse decode is not yet supported on gfx1250 "
+                "(MI455): the gluon per-page descale path has no unified_attention "
+                "equivalent here. Use a bf16 KV cache on gfx1250."
+            )


leonling-ll and others added 8 commits June 29, 2026 06:15

add dump for minimax

92f2201

Signed-off-by: ganyi <ygan@amd.com>

maybe acc right

e3f906d

Signed-off-by: ganyi <ygan@amd.com>

uint8 to view

cbd2abd

Signed-off-by: ganyi <ygan@amd.com>

reduce memory consumption

0b460f7

Signed-off-by: ganyi <ygan@amd.com>

prefill correct

a319eff

Signed-off-by: ganyi <ygan@amd.com>

Cleanup

c2efefb

leonling-ll self-assigned this Jun 30, 2026

leonling-ll requested review from Dewei-Wang-sh and ganyi1996ppo June 30, 2026 06:41

leonling-ll marked this pull request as ready for review June 30, 2026 11:13

Copilot AI review requested due to automatic review settings June 30, 2026 11:13

Copilot started reviewing on behalf of leonling-ll June 30, 2026 11:14 View session

Copilot AI reviewed Jun 30, 2026

leonling-ll wants to merge 8 commits into ROCm:main from leonling-ll:liyang/minimax-m3-455-support
