[MI455] MiniMax-M3 gfx1250 enabling #1410
Open
leonling-ll wants to merge 8 commits into
Mxfp4MoEMethod._apply_shared_experts_dense hard-asserted the SiLU activation path, so MiniMax-M3 (ActivationType.Swiglu with fused shared experts) crashed with "dense shared-expert GEMM only supports the SiLU activation path".

MiniMax-M3 uses SwiGLU-OAI (gate*sigmoid(alpha*gate)*(up+beta)) and does not interleave gate/up, so the dense GEMM output is split [gate|up] -- exactly what swiglu_oai_split consumes. The dense shared expert now mirrors MiniMaxM3MLP.forward and the routed experts' alpha / swiglu_add_residual=True path.

- moe.py: drop the assert; branch the activation step. SiLU keeps fused_clamp_act_mul (DeepSeek, unchanged); SwiGLU uses swiglu_oai_split with alpha/beta/limit read from the layer.
- minimax_m3.py: stash swiglu_alpha/swiglu_beta on self.experts (from config, defaults 1.702/1.0) next to swiglu_limit.
- tests: numerical test that reuses the same kernel GEMM and varies only the activation, isolating the fix from mxfp4/bf16 GEMM precision.

Verified on MI350X: fixed path matches the SwiGLU-OAI reference exactly, old SiLU behaviour diverged ~20%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
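The activation described in this commit can be sketched as a scalar reference. This is a minimal illustration of the formula gate*sigmoid(alpha*gate)*(up+beta) applied to a non-interleaved [gate|up] row, not the kernel in the PR: `swiglu_oai_split_ref` is a hypothetical name, the defaults are the 1.702/1.0 values quoted above, and the `limit` clamp is deliberately left out because the commit message does not spell out its semantics.

```python
import math


def swiglu_oai_split_ref(row, alpha=1.702, beta=1.0):
    """Scalar reference for SwiGLU-OAI on a non-interleaved [gate | up] row.

    Hypothetical helper for illustration only; the swiglu_limit clamp
    mentioned in the commit message is omitted here.
    """
    if len(row) % 2 != 0:
        raise ValueError("row must hold an equal number of gate and up values")
    half = len(row) // 2
    gate, up = row[:half], row[half:]  # first half is gate, second half is up
    out = []
    for g, u in zip(gate, up):
        sig = 1.0 / (1.0 + math.exp(-alpha * g))  # sigmoid(alpha * gate)
        out.append(g * sig * (u + beta))
    return out
```

A reference of this shape is the kind of baseline a numerical test could compare against once the GEMM output is held fixed, so that only the activation varies.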
gfx1250 has no pa_decode_gluon kernel (gluon supports gfx942 / gfx950 only), so MiniMax-M3 decode crashed with "pa_decode_gluon only supports gfx942 (CDNA3) and gfx950 (CDNA4)". The compacted sparse block table / context lengths the runners already build are exactly the (block_table, seqused_k) contract unified_attention consumes over the same SHUFFLE KV cache, so route both the full-attn and sparse decode paths through the triton unified_attention on gfx1250.

Full-attn (attention_mha.py):
- paged_attention_triton: add a use_unified flag that includes get_gfx() == "gfx1250" so decode takes the unified_attention branch instead of run_pa_decode_gluon. Gluon retained for CDNA3/CDNA4.

Sparse (minimax_m3/sparse_attn.py) -- note ATOM_USE_UNIFIED_ATTN does NOT gate this path; the sparse runners call run_pa_decode_gluon directly:
- add _sparse_decode_unified_attention helper feeding the kv-head collapsed SHUFFLE cache + sparse_bt + sparse_ctx into unified_attention(shuffled_kv_cache=True), each token a length-1 causal sequence (mirrors gluon max_seqlen_q=1 per-token-as-decode).
- gfx1250 branch in minimax_m3_sparse_attn_decode_asm and _run_prefill_fp8_gluon: bf16 -> helper; fp8 -> NotImplementedError (gluon per-page descale has no unified_attention equivalent yet).

Caveat: validated to compile/import on gfx950; the sparse path's GQA/block_table semantics still need MI455 numerical validation against the gfx950 gluon reference, and fp8 KV cache on gfx1250 is unsupported.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
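The routing rule in this commit can be summarised as a small decision function. This is a sketch under stated assumptions, not code from the PR: `select_decode_backend` and its parameters are hypothetical, it returns backend names as strings rather than calling any kernel, and it ignores the ATOM_USE_UNIFIED_ATTN setting that can also select unified_attention on the full-attn path.

```python
# Architectures the commit message lists as supported by the gluon decode kernel.
GLUON_ARCHS = {"gfx942", "gfx950"}


def select_decode_backend(gfx, kv_cache_dtype="bf16", sparse=False):
    """Return the name of the decode backend for a given GPU architecture.

    Hypothetical helper: mirrors the routing described in the commit message
    and does not model ATOM_USE_UNIFIED_ATTN.
    """
    if gfx in GLUON_ARCHS:
        return "pa_decode_gluon"
    if gfx == "gfx1250":
        if sparse and kv_cache_dtype == "fp8":
            # The gluon per-page descale path has no unified_attention
            # equivalent yet, so the sparse fp8 case is rejected.
            raise NotImplementedError(
                "fp8 sparse decode is not supported on gfx1250; use a bf16 KV cache"
            )
        return "unified_attention"
    raise ValueError(f"unrecognised architecture: {gfx}")
```

Keeping the fp8 rejection explicit, rather than falling through to a kernel that would fail later, matches the intent of the NotImplementedError described above.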
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Pull request overview
Enables MiniMax-M3 end-to-end functionality on MI455 (gfx1250) by routing sparse/decode attention away from unsupported gluon kernels, fixing MoE SwiGLU shared-expert activation support, and correcting a weight-layout issue that impacted accuracy.
Changes:
- Route MiniMax-M3 sparse decode/prefill attention to `unified_attention` on gfx1250 (and gate off fp8 sparse decode there).
- Add SwiGLU-OAI (`alpha/beta`) support to the dense shared-expert GEMM path and ensure shared weights are safely detached before in-place layout transforms.
- Fix routed expert w13 gate/up row ordering via in-place gguu→gugu interleave to match the triton SwiGLU kernel’s expectations.
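The gguu→gugu reordering in the last item can be illustrated on a plain list of rows. This is a minimal sketch, assuming the first half of the rows are gate rows and the second half are up rows; `interleave_gate_up_rows` is a hypothetical name, and unlike the in-place transform in the PR it returns a new list.

```python
def interleave_gate_up_rows(rows):
    """Reorder [g0, g1, ..., u0, u1, ...] into [g0, u0, g1, u1, ...].

    Hypothetical helper for illustration; it copies rather than working in place.
    """
    if len(rows) % 2 != 0:
        raise ValueError("expected an equal number of gate and up rows")
    half = len(rows) // 2
    out = []
    for gate_row, up_row in zip(rows[:half], rows[half:]):
        out.append(gate_row)  # gate row i
        out.append(up_row)    # followed by its matching up row i
    return out
```

For example, `interleave_gate_up_rows(["g0", "g1", "u0", "u1"])` yields the rows in the order g0, u0, g1, u1.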
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| atom/models/minimax_m3.py | Plumbs swiglu_alpha/swiglu_beta into the MoE experts module for shared-expert dense SwiGLU-OAI parity. |
| atom/model_ops/moe.py | Adds in-place gate/up row interleave and extends dense shared-expert GEMM to support SwiGLU-OAI; fixes shared-weight stashing to avoid aliasing. |
| atom/model_ops/minimax_m3/sparse_attn.py | Adds gfx1250 fallback routing for sparse decode/prefill to unified_attention and blocks fp8 sparse decode on gfx1250. |
| atom/model_ops/attention_mha.py | Forces paged_attention_triton to use unified_attention on gfx1250 to avoid unsupported gluon decode kernels. |
Comment on lines +1360 to +1364
    raise NotImplementedError(
        "MiniMax-M3 fp8 sparse decode is not yet supported on gfx1250 "
        "(MI455): the gluon per-page descale path has no unified_attention "
        "equivalent here. Use a bf16 KV cache on gfx1250."
    )
This PR enables MiniMax-M3 to work end-to-end on MI455 and fixes an accuracy issue.
There are 3 main changes:
- Route MiniMax-M3 attention from `pa_decode_gluon` to `unified_attention` on gfx1250 (MI455)
  gfx1250 has no pa_decode_gluon kernel (gluon supports gfx942/gfx950 only).
- Support `SwiGLU-OAI` activation in the dense shared-expert GEMM
  `Mxfp4MoEMethod._apply_shared_experts_dense` previously hard-asserted the SiLU path, so MiniMax-M3 (ActivationType.Swiglu with fused shared experts) crashed. It now supports both activations.
- Fix prefill weight layout: gguu→gugu interleave (thanks to @ganyi1996ppo)
cc: @Dewei-Wang-sh