[perf] fuse AllReduce + RMSNorm + FP8 quant for ds/kimi #1388

Open
gbyu-amd wants to merge 5 commits into main from fuse-allreduce-rmsnorm-quant

gbyu-amd commented Jun 28, 2026

Contributor

Motivation

Fuse AllReduce, RMSNorm, and FP8 quantization into a single HIP kernel. The combined kernel emits the quantized (fp8, scale) activation that the downstream qkv GEMM consumes directly, removing a standalone per-token/per-group quant kernel from the hot path.

  • layernorm.py: RMSNorm.forward gains a fused AR+RMS+quant branch dispatching to ..._rmsnorm_quant_per_group (per_1x128) or ..._rmsnorm_quant (per_Token), returning ((fp8, scale), residual).
  • deepseek_v2.py: enable fused_quant on input_layernorm when AR fusion is on and fused_qkv_a_proj is per_1x128/per_Token FP8. Mutually exclusive with the existing non-AR fuse_input_norm_quant path; the attention forward already unpacks the (fp8, scale) tuple.
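
For reviewers, the math the fused branch is expected to reproduce can be sketched as an unfused pure-Python reference. This is only an illustration under stated assumptions, not the HIP kernel or the aiter API: the function name `ar_rmsnorm_quant_ref`, the residual add before the norm, the `eps` default, and the FP8 maximum of 448 (OCP e4m3fn; the e4m3fnuz variant tops out at 240) are all assumptions. Values are rescaled and clamped but not rounded to a real fp8 grid.

```python
import math

FP8_MAX = 448.0  # assumption: OCP e4m3fn max; e4m3fnuz would use 240.0
GROUP = 128      # per_1x128: one scale per 128 contiguous hidden elements


def _quant(row, group):
    """Rescale each `group`-wide slice of `row` into [-FP8_MAX, FP8_MAX]."""
    q, scales = [], []
    for start in range(0, len(row), group):
        chunk = row[start:start + group]
        amax = max(abs(v) for v in chunk)
        scale = amax / FP8_MAX if amax > 0 else 1.0
        scales.append(scale)
        # A real kernel also rounds to the fp8 grid; here we only rescale and clamp.
        q.extend(max(-FP8_MAX, min(FP8_MAX, v / scale)) for v in chunk)
    return q, scales


def ar_rmsnorm_quant_ref(partials, residual, weight, eps=1e-6, per_group=True):
    """Hypothetical unfused reference: all-reduce -> residual add -> RMSNorm -> quant.

    partials: list over ranks of [tokens][hidden] partial outputs.
    Returns ((q, scale), new_residual), the tuple shape described in this PR.
    """
    hidden = len(weight)
    q_out, s_out, res_out = [], [], []
    for t in range(len(residual)):
        # All-reduce (sum over ranks) plus residual add (assumed to be fused in).
        x = [sum(p[t][h] for p in partials) + residual[t][h] for h in range(hidden)]
        res_out.append(x)
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / hidden + eps)
        y = [v * inv_rms * w for v, w in zip(x, weight)]
        # per_1x128 -> one scale per 128 elements; per_Token -> one scale per row.
        q, s = _quant(y, GROUP if per_group else hidden)
        q_out.append(q)
        s_out.append(s)
    return (q_out, s_out), res_out
```

With `per_group=True` each token carries `hidden / 128` scales, and with `per_group=False` it carries one. Multiplying `q` by its scale should recover the RMSNorm output up to fp8 rounding, which gives a simple tolerance check against the fused path.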

Depends on the aiter fix in ROCm/aiter#3977.

Technical Details

Test Plan

Test Result

Submission Checklist

…rm (#1226)

The combined HIP kernel emits the quantized (fp8, scale) activation that the
downstream qkv GEMM consumes directly, removing a standalone per-token/per-group
quant kernel from the hot path.

- layernorm.py: RMSNorm.forward gains a fused AR+RMS+quant branch dispatching to
  `..._rmsnorm_quant_per_group` (per_1x128) or `..._rmsnorm_quant` (per_Token),
  returning ((fp8, scale), residual).
- deepseek_v2.py: enable `fused_quant` on input_layernorm when AR fusion is on
  and fused_qkv_a_proj is per_1x128/per_Token FP8. Mutually exclusive with the
  existing non-AR `fuse_input_norm_quant` path; the attention forward already
  unpacks the (fp8, scale) tuple.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gbyu-amd requested a review from valarLip June 29, 2026 02:41
2 participants