[feat][sgl-atom] add qwen3 sglang dense model support #1416
Merged
Register Qwen3 dense for the SGLang ATOM wrapper and route page-size 1 dense decode through the existing native AITER path to avoid invalid KV layout reshapes during CUDA graph capture. Co-authored-by: Cursor <cursoragent@cursor.com>
Restore the existing dense attention routing and use page-size 16 only for Qwen3-32B MI308 SGLang benchmark entries so the model stays on the ATOM attention path. Co-authored-by: Cursor <cursoragent@cursor.com>
zhuyuhua-v approved these changes on Jun 30, 2026
Summary
- Register `Qwen3ForCausalLM` in the SGLang+ATOM model adapter registry so Qwen3-32B-FP8 uses the ATOM plugin model path instead of falling back to the built-in SGLang model.
- Use `--page-size 16` for the Qwen3-32B-FP8 benchmark entries to avoid the dense attention KV layout issue seen with `page_size=1`.

Accuracy Validation
Validated Qwen3-32B-FP8 with the SGLang+ATOM service using the same `--page-size 16` runtime configuration.

The initial GSM8K CI-style run looked low because `lm_eval local-completions` defaults to `max_gen_toks=256`, which truncates Qwen3 math reasoning outputs. A direct A/B confirmed this was an evaluation configuration issue, not a model accuracy regression:

- `max_gen_toks=256`: `flexible-extract=0.56`, `strict-match=0.56`
- `max_gen_toks=2048,max_length=8192`: `flexible-extract=0.95`, `strict-match=0.97`
- `max_gen_toks=2048,max_length=8192`: `flexible-extract=0.9037`, `strict-match=0.9166`

These full-set results are in the expected range for Qwen3-32B GSM8K accuracy and confirm the minimal Qwen3-32B support changes do not introduce an accuracy regression.
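The truncation effect described above can be illustrated with a toy sketch. It is not lm_eval's actual filter or tokenizer: the regex is a simplified stand-in for a "strict-match" style extractor, whitespace-separated words stand in for tokens, and the model output is hypothetical. It only shows why a generation cap shorter than the reasoning chain leaves no final answer to extract.

```python
import re

# Simplified stand-in for a GSM8K "strict-match" style filter: the final
# answer must follow a "#### " marker. Illustration only, not lm_eval's code.
STRICT = re.compile(r"#### (-?[0-9.,]+)")


def extract_answer(text):
    """Return the number after '#### ', or None if the marker is missing."""
    match = STRICT.search(text)
    return match.group(1).replace(",", "") if match else None


def truncate(tokens, max_gen_toks):
    """Keep at most max_gen_toks tokens, mimicking a generation length cap."""
    return tokens[:max_gen_toks]


# Hypothetical model output: 300 reasoning "tokens", then the final answer.
full_output = ["step"] * 300 + ["####", "42"]

short = " ".join(truncate(full_output, 256))   # cap hits before the answer
long = " ".join(truncate(full_output, 2048))   # cap is large enough

print(extract_answer(short))
print(extract_answer(long))
```

With the 256-token cap the answer marker is never generated, so the extractor finds nothing and the sample is scored as wrong even though the model's reasoning may have been correct.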
Test Plan
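- Launched the SGLang+ATOM service for Qwen3-32B-FP8 with `--page-size 16` and the Qwen3 reasoning parser.
- Ran GSM8K through `lm_eval local-completions` with `max_gen_toks=2048,max_length=8192`.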
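A minimal sketch of how the GSM8K run in the test plan might be assembled. The PR does not show the exact command, so several pieces are assumptions: that `max_gen_toks` and `max_length` are passed through `--model_args`, and that the endpoint URL and model name below (both placeholders) match the running service.

```python
import shlex


def build_gsm8k_command(base_url, model, max_gen_toks=2048, max_length=8192):
    """Build an lm_eval argv list for GSM8K against a completions endpoint.

    base_url and model are placeholders supplied by the caller; the argument
    layout is an assumption, not the command used in this PR.
    """
    model_args = ",".join([
        f"model={model}",
        f"base_url={base_url}",
        f"max_gen_toks={max_gen_toks}",   # raise the 256-token default
        f"max_length={max_length}",
    ])
    return [
        "lm_eval",
        "--model", "local-completions",
        "--tasks", "gsm8k",
        "--model_args", model_args,
    ]


# Hypothetical endpoint and model name, for illustration only.
cmd = build_gsm8k_command(
    "http://localhost:30000/v1/completions",
    "Qwen/Qwen3-32B-FP8",
)
print(shlex.join(cmd))
```

Building the command as a list keeps the generation limits explicit in one place, which makes it harder to fall back silently to the short default.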