[MI455] MiniMax-M3 gfx1250 enabling#1410

Open
leonling-ll wants to merge 8 commits into ROCm:main from leonling-ll:liyang/minimax-m3-455-support

@leonling-ll

This PR enables MiniMax-M3 to run end to end on MI455 and fixes an accuracy issue.

There are 3 main changes:

  1. Route MiniMax-M3 attention from pa_decode_gluon to unified_attention on gfx1250 (MI455)
    gfx1250 has no pa_decode_gluon kernel (gluon supports gfx942/gfx950 only).

  2. Support SwiGLU-OAI activation in the dense shared-expert GEMM
    Mxfp4MoEMethod._apply_shared_experts_dense previously hard-asserted the SiLU path, so MiniMax-M3 (ActivationType.Swiglu with fused shared experts) crashed. It now supports both activations.

  3. Fix prefill weight layout: gguu→gugu interleave (thanks to @ganyi1996ppo)
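
The gguu→gugu change in item 3 can be pictured as a row reorder: all gate rows followed by all up rows become alternating gate/up rows. Below is a minimal sketch on plain Python lists, assuming the stacked weight holds an equal number of gate and up rows. `interleave_gate_up` is a hypothetical name used only for illustration, and the change in this PR works in place on the weight tensor rather than building a new list.

```python
def interleave_gate_up(rows):
    """Reorder [g0, g1, ..., u0, u1, ...] into [g0, u0, g1, u1, ...].

    Hypothetical helper for illustration only; `rows` is any sequence whose
    first half holds the gate rows and second half holds the up rows.
    """
    if len(rows) % 2 != 0:
        raise ValueError("expected an equal number of gate and up rows")
    half = len(rows) // 2
    gate_rows, up_rows = rows[:half], rows[half:]
    interleaved = []
    for gate_row, up_row in zip(gate_rows, up_rows):
        interleaved.append(gate_row)
        interleaved.append(up_row)
    return interleaved


# gguu layout in, gugu layout out
print(interleave_gate_up(["g0", "g1", "u0", "u1"]))
```

A kernel that expects the gugu layout reads each gate row next to its matching up row, so feeding it the gguu layout pairs the wrong rows and shows up as an accuracy problem rather than a crash.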

cc: @Dewei-Wang-sh

leonling-ll and others added 8 commits June 29, 2026 06:15
Mxfp4MoEMethod._apply_shared_experts_dense hard-asserted the SiLU
activation path, so MiniMax-M3 (ActivationType.Swiglu with fused shared
experts) crashed with "dense shared-expert GEMM only supports the SiLU
activation path".

MiniMax-M3 uses SwiGLU-OAI (gate*sigmoid(alpha*gate)*(up+beta)) and does
not interleave gate/up, so the dense GEMM output is split [gate|up] --
exactly what swiglu_oai_split consumes. The dense shared expert now
mirrors MiniMaxM3MLP.forward and the routed experts' alpha /
swiglu_add_residual=True path.
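
As a reading aid, the activation described above can be written out as a scalar reference. This is a minimal sketch in pure Python, not the kernel itself: `swiglu_oai_split_ref` is a hypothetical name, the input is assumed to be one output row laid out as [gate|up], and the optional clamp (gate capped from above, up clamped symmetrically) is an assumption about how `limit` is applied.

```python
import math


def swiglu_oai_split_ref(row, alpha=1.702, beta=1.0, limit=None):
    """Reference for gate * sigmoid(alpha * gate) * (up + beta).

    Hypothetical helper for illustration. `row` is a flat list laid out as
    [gate | up], i.e. the first half is gate and the second half is up.
    """
    if len(row) % 2 != 0:
        raise ValueError("row must hold equally sized gate and up halves")
    half = len(row) // 2
    gate, up = row[:half], row[half:]
    out = []
    for g, u in zip(gate, up):
        if limit is not None:
            # Assumed clamp convention: cap gate from above, clamp up both ways.
            g = min(g, limit)
            u = max(-limit, min(u, limit))
        sigmoid = 1.0 / (1.0 + math.exp(-alpha * g))
        out.append(g * sigmoid * (u + beta))
    return out


# gate = [0.0, 1.0], up = [2.0, 3.0]
print(swiglu_oai_split_ref([0.0, 1.0, 2.0, 3.0]))
```

With the defaults, a zero gate yields zero output regardless of up, which is a quick sanity check when comparing against the SiLU path.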

- moe.py: drop the assert; branch the activation step. SiLU keeps
  fused_clamp_act_mul (DeepSeek, unchanged); SwiGLU uses swiglu_oai_split
  with alpha/beta/limit read from the layer.
- minimax_m3.py: stash swiglu_alpha/swiglu_beta on self.experts (from
  config, defaults 1.702/1.0) next to swiglu_limit.
- tests: numerical test that reuses the same kernel GEMM and varies only
  the activation, isolating the fix from mxfp4/bf16 GEMM precision.
  Verified on MI350X: fixed path matches the SwiGLU-OAI reference exactly,
  old SiLU behaviour diverged ~20%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gfx1250 has no pa_decode_gluon kernel (gluon supports gfx942 / gfx950
only), so MiniMax-M3 decode crashed with "pa_decode_gluon only supports
gfx942 (CDNA3) and gfx950 (CDNA4)". The compacted sparse block table /
context lengths the runners already build are exactly the
(block_table, seqused_k) contract unified_attention consumes over the
same SHUFFLE KV cache, so route both the full-attn and sparse decode
paths through the triton unified_attention on gfx1250.

Full-attn (attention_mha.py):
- paged_attention_triton: add a use_unified flag that includes
  get_gfx() == "gfx1250" so decode takes the unified_attention branch
  instead of run_pa_decode_gluon. Gluon retained for CDNA3/CDNA4.

Sparse (minimax_m3/sparse_attn.py) -- note ATOM_USE_UNIFIED_ATTN does
NOT gate this path; the sparse runners call run_pa_decode_gluon directly:
- add _sparse_decode_unified_attention helper feeding the kv-head
  collapsed SHUFFLE cache + sparse_bt + sparse_ctx into
  unified_attention(shuffled_kv_cache=True), each token a length-1
  causal sequence (mirrors gluon max_seqlen_q=1 per-token-as-decode).
- gfx1250 branch in minimax_m3_sparse_attn_decode_asm and
  _run_prefill_fp8_gluon: bf16 -> helper; fp8 -> NotImplementedError
  (gluon per-page descale has no unified_attention equivalent yet).
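
The routing rule for the sparse path can be summarized as a small decision function. This is a sketch under stated assumptions, not the code in sparse_attn.py: `select_sparse_decode_backend` and the string dtype labels are hypothetical, and rejecting other architectures with a ValueError is an assumption added so the example is total.

```python
# Architectures with a gluon decode kernel, per the commit message above.
GLUON_ARCHS = {"gfx942", "gfx950"}


def select_sparse_decode_backend(gfx, kv_cache_dtype):
    """Return the backend name the sparse decode path would take.

    Hypothetical helper for illustration; `kv_cache_dtype` is "bf16" or "fp8".
    """
    if gfx in GLUON_ARCHS:
        return "pa_decode_gluon"
    if gfx == "gfx1250":
        if kv_cache_dtype == "fp8":
            # Gluon per-page descale has no unified_attention equivalent yet.
            raise NotImplementedError(
                "fp8 sparse decode is not supported on gfx1250; use a bf16 KV cache"
            )
        return "unified_attention"
    # Assumption: anything else is rejected in this sketch.
    raise ValueError(f"unsupported architecture: {gfx}")


print(select_sparse_decode_backend("gfx1250", "bf16"))
print(select_sparse_decode_backend("gfx950", "fp8"))
```

Keeping the fp8 case as an explicit error, rather than silently falling back, makes the unsupported configuration visible at the first decode step.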

Caveat: validated to compile/import on gfx950; the sparse path's
GQA/block_table semantics still need MI455 numerical validation against
the gfx950 gluon reference, and fp8 KV cache on gfx1250 is unsupported.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
@leonling-ll leonling-ll self-assigned this Jun 30, 2026
@leonling-ll leonling-ll marked this pull request as ready for review June 30, 2026 11:13
Copilot AI review requested due to automatic review settings June 30, 2026 11:13

Copilot AI left a comment

Contributor

Pull request overview

Enables MiniMax-M3 end-to-end functionality on MI455 (gfx1250) by routing sparse/decode attention away from unsupported gluon kernels, fixing MoE SwiGLU shared-expert activation support, and correcting a weight-layout issue that impacted accuracy.

Changes:

  • Route MiniMax-M3 sparse decode/prefill attention to unified_attention on gfx1250 (and gate off fp8 sparse decode there).
  • Add SwiGLU-OAI (alpha/beta) support to the dense shared-expert GEMM path and ensure shared weights are safely detached before in-place layout transforms.
  • Fix routed expert w13 gate/up row ordering via in-place gguu→gugu interleave to match the triton SwiGLU kernel’s expectations.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| atom/models/minimax_m3.py | Plumbs swiglu_alpha/swiglu_beta into the MoE experts module for shared-expert dense SwiGLU-OAI parity. |
| atom/model_ops/moe.py | Adds in-place gate/up row interleave and extends dense shared-expert GEMM to support SwiGLU-OAI; fixes shared-weight stashing to avoid aliasing. |
| atom/model_ops/minimax_m3/sparse_attn.py | Adds gfx1250 fallback routing for sparse decode/prefill to unified_attention and blocks fp8 sparse decode on gfx1250. |
| atom/model_ops/attention_mha.py | Forces paged_attention_triton to use unified_attention on gfx1250 to avoid unsupported gluon decode kernels. |

Comment on lines +1360 to +1364
    raise NotImplementedError(
        "MiniMax-M3 fp8 sparse decode is not yet supported on gfx1250 "
        "(MI455): the gluon per-page descale path has no unified_attention "
        "equivalent here. Use a bf16 KV cache on gfx1250."
    )

3 participants