[atom-vllm] enable minimax_m3 for atom vllm by lirui927 · Pull Request #1408 · ROCm/ATOM

lirui927 · 2026-06-30T06:08:05Z

Motivation

Enable MiniMax-M3 to run through the ATOM vLLM plugin path, including dense attention and sparse attention support, so MiniMax-M3 can use vLLM serving, KV cache management, and cudagraph execution.

Technical Details

Add MiniMax-M3 vLLM attention adapters:
- MiniMaxM3DenseAttentionForVllm
- MiniMaxM3SparseAttentionForVllm
Route MiniMax-M3 dense attention through vLLM Attention while keeping ATOM-owned projection, Q/K norm, RoPE, and checkpoint layout unchanged.
Add MiniMax-M3 sparse attention support under the ATOM vLLM attention backend instead of relying on model-local/community backend code.
Support fp8 KV cache scale layout for MiniMax-M3 sparse attention.
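For readers unfamiliar with fp8 KV caches, the following is a minimal sketch of what a per-tensor scale does. It is not the scale layout this PR implements, which the description does not spell out. The helper names are hypothetical, `FP8_E4M3_MAX = 448.0` assumes the OCP e4m3 format (the e4m3fnuz variant found on some AMD GPUs tops out at 240), and rounding to the fp8 grid is deliberately omitted.

```python
# Illustrative only: per-tensor scaling for an fp8 KV cache.
# Assumption: OCP e4m3 range; e4m3fnuz would use 240.0 instead.
FP8_E4M3_MAX = 448.0


def compute_scale(values, fp8_max=FP8_E4M3_MAX):
    """Pick a scale so the largest magnitude maps onto the fp8 range."""
    amax = max((abs(v) for v in values), default=0.0)
    return amax / fp8_max if amax > 0 else 1.0


def quantize(values, scale, fp8_max=FP8_E4M3_MAX):
    """Divide by the scale and clamp to the representable range.

    A real kernel would also round each value to the nearest fp8 number.
    """
    return [max(-fp8_max, min(fp8_max, v / scale)) for v in values]


def dequantize(quantized, scale):
    """Recover approximate original values by multiplying the scale back."""
    return [q * scale for q in quantized]
```

The scale has to be stored next to the cache so the attention kernel can undo it at read time, which is why the layout of those scales matters for the sparse path.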

Test Plan

Launch MiniMax-M3 vLLM service with ATOM adapters enabled.
Run GSM8K accuracy validation:
- MXFP8
- MXFP4
- MXFP8 + kv_fp8
- MXFP4 + kv_fp8
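The evaluation harness used for the GSM8K runs is not named here. As a rough sketch of how exact-match accuracy is typically scored on GSM8K, the helpers below rely on the fact that each reference solution ends with `#### <answer>` and take the last number in a completion as the prediction. The function names are hypothetical, and the last-number heuristic is an assumption rather than the rule used for the results in this PR.

```python
import re

# Matches integers and decimals, allowing thousands separators such as 1,250.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def _to_float(text):
    return float(text.replace(",", "").replace("$", "").strip())


def reference_answer(solution):
    """GSM8K reference solutions end with '#### <answer>'."""
    return _to_float(solution.split("####")[-1])


def predicted_answer(completion):
    """Heuristic: treat the last number in the completion as the answer."""
    matches = _NUMBER.findall(completion)
    return _to_float(matches[-1]) if matches else None


def exact_match(completions, solutions):
    """Fraction of completions whose final number equals the reference."""
    if not solutions:
        return 0.0
    hits = sum(
        predicted_answer(c) == reference_answer(s)
        for c, s in zip(completions, solutions)
    )
    return hits / len(solutions)
```

Running the same scoring over each of the four configurations keeps the comparison between KV16 and KV8 runs consistent.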

MODEL=/shared/data/amd_int/models/MiniMax-M3-MXFP8

vllm serve "$MODEL" \
    --dtype auto \
    --load-format auto \
    --host localhost \
    --port 8001 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --no-async-scheduling \
    --kv-cache-dtype fp8 \
    --no-enable-prefix-caching \
    --language-model-only \
    --no-trust-remote-code \
    --additional-config '{"online_quant_config": {"global_quant_config": "ptpc_fp8", "exclude_layer": ["lm_head", "model.embed_tokens", "vision_tower", "multi_modal_projector", "patch_merge_mlp", "*block_sparse_moe"]}}' \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
    2>&1 | tee log_m3_mxfp8_vllm.log
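The JSON arguments in the command above are easy to break with shell quoting. One way around that, sketched here under the assumption that the server is launched from a Python script, is to build the argument list from dictionaries and let `json.dumps` and `shlex.join` handle the quoting. `build_serve_args` is a hypothetical helper; it only uses flags that appear in the command above and leaves the rest out for brevity.

```python
import json
import shlex


def build_serve_args(model, additional_config=None, compilation_config=None):
    """Assemble a `vllm serve` argv with JSON options serialized safely."""
    args = [
        "vllm", "serve", model,
        "--port", "8001",
        "--tensor-parallel-size", "4",
        "--kv-cache-dtype", "fp8",
    ]
    if additional_config is not None:
        args += ["--additional-config", json.dumps(additional_config)]
    if compilation_config is not None:
        args += ["--compilation-config", json.dumps(compilation_config)]
    return args


example_args = build_serve_args(
    "/shared/data/amd_int/models/MiniMax-M3-MXFP8",
    additional_config={
        "online_quant_config": {
            "global_quant_config": "ptpc_fp8",
            "exclude_layer": ["lm_head", "model.embed_tokens"],
        }
    },
    compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"},
)

# A copy-pasteable shell command with correct quoting.
command_line = shlex.join(example_args)
```

The list form can be passed directly to `subprocess.run`, while `command_line` is convenient for logging the exact invocation.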

MODEL=/shared/data/amd_int/models/MiniMax-M3-MXFP4

vllm serve "$MODEL" \
    --dtype auto \
    --load-format auto \
    --host localhost \
    --port 8001 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --kv-cache-dtype fp8 \
    --no-enable-prefix-caching \
    --language-model-only \
    --no-trust-remote-code \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
    2>&1 | tee log_m3_mxfp4_vllm_0625.log
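Once either server is up, a quick smoke test against the OpenAI-compatible `/v1/completions` endpoint can confirm that the model answers before starting a full GSM8K run. The sketch below uses only the standard library. The helper names are hypothetical, the prompt is arbitrary, and it assumes the served model name is the path passed to `vllm serve` (vLLM's default when no served model name is set).

```python
import json
import urllib.request


def build_completion_request(model, prompt,
                             base_url="http://localhost:8001", max_tokens=32):
    """Build a greedy-decoding completion request for the local server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

For example, `send(build_completion_request("/shared/data/amd_int/models/MiniMax-M3-MXFP4", "2 + 2 ="))` should return a short continuation if the service is healthy.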

Test Result

FP8-KV16

FP8-KV8

FP4-KV16

FP4-KV8

Accuracy results from multiple test runs for each configuration

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.


XiaobingSuper marked this pull request as draft June 30, 2026 07:34

lirui927 and others added 9 commits June 30, 2026 03:27

[MiniMax-M3] support sparse MHA serving in vLLM

2f70668

Fix PTPC FP8 MoE loading to preserve offline checkpoint bits and wire MiniMax-M3 sparse MHA metadata/backend support for vLLM serving. Co-authored-by: Cursor <cursoragent@cursor.com>

[MiniMax-M3] optimize sparse MHA vLLM output path

1aa9b54

Reuse vLLM-provided output buffers in sparse MHA prefill/decode and align the adapter with the page-shuffled KV cache layout used by MiniMax-M3 serving. Co-authored-by: Cursor <cursoragent@cursor.com>

[MiniMax-M3] initialize sparse MHA vLLM cache state

37358aa

Co-authored-by: Cursor <cursoragent@cursor.com>

[MiniMax-M3] register ATOM MXFP8 quant config

ca56978

Co-authored-by: Cursor <cursoragent@cursor.com>

[MiniMax-M3] fix sparse MHA fp8 metadata alignment

7fbf460

Keep mixed decode/prefill/extend batches phase-local and separate index-cache top-k metadata from main KV-cache sparse block emission to prevent cross-request fp8 accuracy drift.
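To make "phase-local" concrete, the sketch below splits a mixed batch into decode, prefill, and extend groups so that metadata for each phase can be built from its own requests only. It is an illustration, not the adapter's code: `ScheduledRequest`, `phase_of`, and `split_by_phase` are hypothetical names, and the classification rule (one new token with cached context is decode, no cached context is prefill, anything else is extend) is an assumption about how the phases are distinguished.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledRequest:
    req_id: str
    num_computed_tokens: int  # tokens already present in the KV cache
    num_new_tokens: int       # tokens scheduled in this step


def phase_of(req):
    """Classify a request by cached context and new-token count (assumed rule)."""
    if req.num_computed_tokens == 0:
        return "prefill"
    if req.num_new_tokens == 1:
        return "decode"
    return "extend"


def split_by_phase(batch):
    """Return batch indices grouped per phase, preserving batch order."""
    groups = {"decode": [], "prefill": [], "extend": []}
    for idx, req in enumerate(batch):
        groups[phase_of(req)].append(idx)
    return groups
```

Building top-k and sparse-block metadata per group, rather than over the whole batch, is one way to avoid indices from one request leaking into another.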

[MiniMax-M3] add vLLM attention correctness baseline

56ff7fd

Route MiniMax-M3 vLLM-plugin attention through the vendored community implementation by default so serving has a stable correctness baseline while the ATOM plugin attention integration is debugged.

[MiniMax-M3] route vLLM attention through ATOM adapters

e21ea86

Remove the community MiniMax-M3 vLLM backend and registry switches so serving uses ATOM-owned dense and sparse attention paths.

[MiniMax-M3] clean up vLLM sparse attention adapter

c14f222

Keep MiniMax-M3 sparse attention on the dedicated adapter path and rename the backend metadata pieces to match their model-specific ownership.

[MiniMax-M3] stabilize vLLM graph attention path

e06fa99

XiaobingSuper force-pushed the lirui/m3_vllm_0630 branch from 01e43cc to 456797b Compare June 30, 2026 08:31

[MiniMax-M3] restore core files and stabilize sparse capture

80a868b

XiaobingSuper force-pushed the lirui/m3_vllm_0630 branch from 456797b to 80a868b Compare June 30, 2026 08:46

lirui927 added 2 commits June 30, 2026 04:09

[MiniMax-M3] simplify vLLM attention adapter

a387c14

[MiniMax-M3] fix sparse fp8 KV cache scales

cf30919

lirui927 marked this pull request as ready for review June 30, 2026 09:56

XiaobingSuper requested review from ganyi1996ppo, valarLip and zejunchen-zejun June 30, 2026 09:57

[MiniMax-M3] add vLLM recipe

3c7261e

zufayu requested a review from ZhangLirong-amd July 1, 2026 01:43

use glue ps for sparse attention

6bc0bea

zejunchen-zejun changed the title ~~enable minimax_m3 for vllm atom~~ [atom-vllm] enable minimax_m3 for vllm atom Jul 1, 2026

[MiniMax-M3] restore sparse attention kernel

5ff1592

Restore the model-local sparse attention implementation to match main so the vLLM PR only carries the adapter-side changes needed for MiniMax-M3.

zejunchen-zejun approved these changes Jul 1, 2026

XiaobingSuper changed the title ~~[atom-vllm] enable minimax_m3 for vllm atom~~ [atom-vllm] enable minimax_m3 for atom vllm Jul 1, 2026

Merge branch 'main' into lirui/m3_vllm_0630

36612ea

lirui927 wants to merge 16 commits into main from lirui/m3_vllm_0630
