
[feat][sgl-atom] add qwen3 sglang dense model support#1416

Merged
zhangxinyuanliuhengyu merged 2 commits into main from fix/qwen3-sglang-dense-startup
Jun 30, 2026

Conversation

@zhangxinyuanliuhengyu

Contributor

Summary

  • Register Qwen3ForCausalLM in the SGLang+ATOM model adapter registry so Qwen3-32B-FP8 uses the ATOM plugin model path instead of falling back to the built-in SGLang model.
  • Set --page-size 16 for the Qwen3-32B-FP8 benchmark entries to avoid the invalid KV layout reshape that dense attention hits with page_size=1 during CUDA graph capture.
  • Keep the existing dense attention gating behavior unchanged.

Accuracy Validation

Validated Qwen3-32B-FP8 with the SGLang+ATOM service using the same --page-size 16 runtime configuration.

The initial GSM8K CI-style run looked low because lm_eval local-completions defaults to max_gen_toks=256, which truncates Qwen3 math reasoning outputs. A direct A/B confirmed this was an evaluation configuration issue, not a model accuracy regression:

  • GSM8K 3-shot, 100-sample smoke, default max_gen_toks=256: flexible-extract=0.56, strict-match=0.56
  • GSM8K 3-shot, 100-sample smoke, max_gen_toks=2048, max_length=8192: flexible-extract=0.95, strict-match=0.97
  • GSM8K 3-shot, full set, max_gen_toks=2048, max_length=8192: flexible-extract=0.9037, strict-match=0.9166

These full-set results are in the expected range for Qwen3-32B GSM8K accuracy and confirm the minimal Qwen3-32B support changes do not introduce an accuracy regression.
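
The truncation effect behind the 0.56 vs. 0.95+ gap can be reproduced with a toy example. The sketch below is an illustration under stated assumptions, not the lm_eval implementation: whitespace-separated words stand in for tokens, `ANSWER_RE` is a simplified GSM8K-style `#### <number>` pattern, and `truncate` and `extract_answer` are hypothetical helpers.

```python
import re
from typing import Optional

# Simplified stand-in for a GSM8K-style final-answer pattern.
ANSWER_RE = re.compile(r"#### (-?[0-9.,]+)")


def truncate(text: str, max_tokens: int) -> str:
    """Keep the first max_tokens whitespace-separated words (crude token proxy)."""
    return " ".join(text.split()[:max_tokens])


def extract_answer(text: str) -> Optional[str]:
    """Return the final answer if the pattern is present, else None."""
    match = ANSWER_RE.search(text)
    return match.group(1) if match else None


# A long reasoning trace whose answer only appears at the very end.
reasoning = " ".join(["step"] * 300)
completion = f"{reasoning} #### 42"

answer_short = extract_answer(truncate(completion, 256))   # answer cut off
answer_long = extract_answer(truncate(completion, 2048))   # answer preserved
```

With a 256-word budget the final `#### 42` is never generated, so the extractor finds nothing and the sample is scored wrong even though the reasoning was on track. Raising the budget to 2048 keeps the answer.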

Test Plan

  • Launched Qwen3-32B-FP8 with SGLang+ATOM, --page-size 16, and Qwen3 reasoning parser.
  • Verified the service starts successfully without the previous RoPE/CUDA graph crash.
  • Ran GSM8K A/B evaluation to isolate the low-score cause to generation length.
  • Ran full GSM8K 3-shot evaluation with max_gen_toks=2048,max_length=8192.
  • Confirmed SGLang service cleanup and GPU memory release after validation.
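
For reproducing the full GSM8K run, the evaluation command can be assembled as in the sketch below. The helper `build_lm_eval_argv` is hypothetical, the task name `gsm8k` is an assumption, and `base_url` and `model` are placeholders the caller must supply. Passing `max_gen_toks` and `max_length` through `--model_args` follows the comma-separated form used in the test plan.

```python
from typing import List, Optional


def build_lm_eval_argv(
    base_url: str,
    model: str,
    max_gen_toks: int = 2048,
    max_length: int = 8192,
    num_fewshot: int = 3,
    limit: Optional[int] = None,
) -> List[str]:
    """Build an lm_eval argv list for the local-completions backend."""
    # Generation length settings travel as comma-separated model args.
    model_args = ",".join([
        f"model={model}",
        f"base_url={base_url}",
        f"max_gen_toks={max_gen_toks}",
        f"max_length={max_length}",
    ])
    argv = [
        "lm_eval",
        "--model", "local-completions",
        "--model_args", model_args,
        "--tasks", "gsm8k",  # assumed task name
        "--num_fewshot", str(num_fewshot),
    ]
    if limit is not None:
        # For example, limit=100 for a smoke run.
        argv += ["--limit", str(limit)]
    return argv
```

The resulting list can be passed to `subprocess.run` once the service is up. Keeping the lengths as explicit parameters makes it harder to fall back to the 256-token default unnoticed.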

whx-sjtu and others added 2 commits June 30, 2026 16:17
Register Qwen3 dense for the SGLang ATOM wrapper and route page-size 1 dense decode through the existing native AITER path to avoid invalid KV layout reshapes during CUDA graph capture.

Co-authored-by: Cursor <cursoragent@cursor.com>
Restore the existing dense attention routing and use page-size 16 only for Qwen3-32B MI308 SGLang benchmark entries so the model stays on the ATOM attention path.

Co-authored-by: Cursor <cursoragent@cursor.com>
zhuyuhua-v changed the title from "Fix/qwen3 sglang dense startup" to "[feat][sgl-atom] add qwen3 sglang dense model support" on Jun 30, 2026
zhuyuhua-v self-requested a review on June 30, 2026 09:45
zhangxinyuanliuhengyu merged commit c8aa1a7 into main on Jun 30, 2026
34 checks passed
zhangxinyuanliuhengyu deleted the fix/qwen3-sglang-dense-startup branch on June 30, 2026 14:40

3 participants