[atom-vllm] enable minimax_m3 for atom vllm #1408

Open
lirui927 wants to merge 16 commits into main from lirui/m3_vllm_0630

@lirui927 lirui927 commented Jun 30, 2026

Contributor

Motivation

Enable MiniMax-M3 to run through the ATOM vLLM plugin path, including dense attention and sparse attention support, so MiniMax-M3 can use vLLM serving, KV cache management, and cudagraph execution.

Technical Details

  • Add MiniMax-M3 vLLM attention adapters:
    • MiniMaxM3DenseAttentionForVllm
    • MiniMaxM3SparseAttentionForVllm
  • Route MiniMax-M3 dense attention through vLLM Attention while keeping ATOM-owned projection, Q/K norm, RoPE, and checkpoint layout unchanged.
  • Add MiniMax-M3 sparse attention support under the ATOM vLLM attention backend instead of relying on model-local/community backend code.
  • Support fp8 KV cache scale layout for MiniMax-M3 sparse attention.

Test Plan

  • Launch MiniMax-M3 vLLM service with ATOM adapters enabled.
  • Run GSM8K accuracy validation:
    • MIXFP8
    • MIXFP4
    • MIXFP8 + kv_fp8
    • MIXFP4 + kv_fp8

MODEL=/shared/data/amd_int/models/MiniMax-M3-MXFP8

vllm serve "$MODEL" \
    --dtype auto \
    --load-format auto \
    --host localhost \
    --port 8001 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --no-async-scheduling \
    --kv-cache-dtype fp8 \
    --no-enable-prefix-caching \
    --language-model-only \
    --no-trust-remote-code \
    --additional-config '{"online_quant_config": {"global_quant_config": "ptpc_fp8", "exclude_layer": ["lm_head", "model.embed_tokens", "vision_tower", "multi_modal_projector", "patch_merge_mlp", "*block_sparse_moe"]}}' \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
    2>&1 | tee log_m3_mxfp8_vllm.log

MODEL=/shared/data/amd_int/models/MiniMax-M3-MXFP4

vllm serve "$MODEL" \
    --dtype auto \
    --load-format auto \
    --host localhost \
    --port 8001 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --kv-cache-dtype fp8 \
    --no-enable-prefix-caching \
    --language-model-only \
    --no-trust-remote-code \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
    2>&1 | tee log_m3_mxfp4_vllm_0625.log
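
Once either server above is running, the GSM8K step of the test plan could be driven along the following lines. This is only a sketch: the PR does not say which scoring tool was used, so the use of lm-evaluation-harness (`lm_eval` with its `local-completions` backend), the 5-shot setting, the concurrency value, and the helper names `wait_for_server` and `gsm8k_eval_args` are all assumptions made for illustration. The only pieces taken from the commands above are the host, the port, and the fact that the served model name defaults to the path passed to `vllm serve`.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the server started above is listening on localhost:8001.
BASE_URL="${BASE_URL:-http://localhost:8001}"

# Poll the server's /health endpoint until it answers or the retry budget runs out.
# The retry count and the 5-second interval are arbitrary illustrative values.
wait_for_server() {
    local retries="${1:-120}"
    local i
    for ((i = 0; i < retries; i++)); do
        if curl -sf "${BASE_URL}/health" > /dev/null; then
            return 0
        fi
        sleep 5
    done
    echo "server at ${BASE_URL} did not become healthy" >&2
    return 1
}

# Print, one argument per line, an lm_eval argument list for a GSM8K run.
# Assumption: lm-evaluation-harness is the scoring tool; the PR does not name one.
gsm8k_eval_args() {
    local model="$1"
    printf '%s\n' \
        --model local-completions \
        --tasks gsm8k \
        --num_fewshot 5 \
        --model_args "model=${model},base_url=${BASE_URL}/v1/completions,num_concurrent=32,max_retries=3,tokenized_requests=False"
}

# Usage (nothing here runs when the file is sourced):
#   wait_for_server && mapfile -t args < <(gsm8k_eval_args "$MODEL") && lm_eval "${args[@]}"
```

Keeping the argument list in a function that only prints makes it easy to reuse the same invocation for each of the four quantization and KV-cache configurations, changing nothing but `MODEL` and the server flags.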

Test Result

FP8-KV16

image

FP8-KV8

image

FP4-KV16

image

FP4-KV8

image

Accuracy results from multiple test runs per configuration

image

Submission Checklist

@XiaobingSuper XiaobingSuper marked this pull request as draft June 30, 2026 07:34
lirui927 and others added 9 commits June 30, 2026 03:27
Fix PTPC FP8 MoE loading to preserve offline checkpoint bits and wire MiniMax-M3 sparse MHA metadata/backend support for vLLM serving.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse vLLM-provided output buffers in sparse MHA prefill/decode and align the adapter with the page-shuffled KV cache layout used by MiniMax-M3 serving.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Keep mixed decode/prefill/extend batches phase-local and separate index-cache top-k metadata from main KV-cache sparse block emission to prevent cross-request fp8 accuracy drift.
Route MiniMax-M3 vLLM-plugin attention through the vendored community implementation by default so serving has a stable correctness baseline while the ATOM plugin attention integration is debugged.
Remove the community MiniMax-M3 vLLM backend and registry switches so serving uses ATOM-owned dense and sparse attention paths.
Keep MiniMax-M3 sparse attention on the dedicated adapter path and rename the backend metadata pieces to match their model-specific ownership.
@lirui927 lirui927 marked this pull request as ready for review June 30, 2026 09:56
@zufayu zufayu requested a review from ZhangLirong-amd July 1, 2026 01:43
@zejunchen-zejun zejunchen-zejun changed the title from "enable minimax_m3 for vllm atom" to "[atom-vllm] enable minimax_m3 for vllm atom" Jul 1, 2026
Restore the model-local sparse attention implementation to match main so the vLLM PR only carries the adapter-side changes needed for MiniMax-M3.
@XiaobingSuper XiaobingSuper changed the title from "[atom-vllm] enable minimax_m3 for vllm atom" to "[atom-vllm] enable minimax_m3 for atom vllm" Jul 1, 2026