[None][fix] tunable_fp4_quantize: rename misnamed kwarg + add real SF-swizzle control#15002
[None][fix] tunable_fp4_quantize: rename misnamed kwarg + add real SF-swizzle control#15002luyiyun1021 wants to merge 1 commit into
Conversation
…-swizzle control Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
|
/bot run --disable-fail-fast |
📝 WalkthroughWalkthroughUpdated FP4 quantization to make dispatch parameters explicit: the internal dispatch helper now accepts ChangesFP4 dispatch parameter handling
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
2424-2432:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
get_valid_tacticsshould exclude FlashInfer whensf_use_ue8m0=True.The assertion at line 2372-2374 enforces that FlashInfer cannot be used with
sf_use_ue8m0=True, butget_valid_tacticsunconditionally includesFp4QuantTactic.FLASHINFERwhen FlashInfer is available. During autotuning warmup,tuner.choose_onewill callforward()which invokes_fp4_quantize_dispatchfor each tactic, causing an assertion failure ifsf_use_ue8m0=True.🐛 Proposed fix to filter FlashInfer based on sf_use_ue8m0
def get_valid_tactics( self, inputs: List[torch.Tensor], profile: OptimizationProfile, ) -> List[int]: tactics = [Fp4QuantTactic.TRTLLM] - if IS_FLASHINFER_AVAILABLE: + # FlashInfer does not support MXFP4 (UE8M0) scaling + if IS_FLASHINFER_AVAILABLE and not self.sf_use_ue8m0: tactics.append(Fp4QuantTactic.FLASHINFER) return tactics🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/custom_ops/torch_custom_ops.py` around lines 2424 - 2432, The get_valid_tactics() method currently appends Fp4QuantTactic.FLASHINFER whenever IS_FLASHINFER_AVAILABLE is true, which conflicts with the earlier assertion forbidding FlashInfer when sf_use_ue8m0=True; update get_valid_tactics (the method name) to check the instance flag self.sf_use_ue8m0 and only append Fp4QuantTactic.FLASHINFER if IS_FLASHINFER_AVAILABLE is true AND self.sf_use_ue8m0 is False, so autotuning won't select FlashInfer when sf_use_ue8m0 is enabled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tensorrt_llm/_torch/custom_ops/torch_custom_ops.py`:
- Around line 2424-2432: The get_valid_tactics() method currently appends
Fp4QuantTactic.FLASHINFER whenever IS_FLASHINFER_AVAILABLE is true, which
conflicts with the earlier assertion forbidding FlashInfer when
sf_use_ue8m0=True; update get_valid_tactics (the method name) to check the
instance flag self.sf_use_ue8m0 and only append Fp4QuantTactic.FLASHINFER if
IS_FLASHINFER_AVAILABLE is true AND self.sf_use_ue8m0 is False, so autotuning
won't select FlashInfer when sf_use_ue8m0 is enabled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4a08cbb6-0708-4058-b909-c4a330d929de
📒 Files selected for processing (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
|
PR_Github #52319 [ run ] triggered by Bot. Commit: |
|
PR_Github #52319 [ run ] completed with state
|
There was a problem hiding this comment.
If FI is selected but sf_use_ue8m0 is somehow passed, would that cause a problem?
can we exclude FI in get_valid_tactics when self.sf_use_ue8m0 is True, so the tuner never selects or profiles it?
Summary by CodeRabbit
Release Notes
Bug Fixes
Enhancements
Description
Fixes a latent bug in
tunable_fp4_quantize(added in PR #12126) where the Python wrapper's 4th kwarg, namedis_sf_swizzled_layout, was misforwarded inside_fp4_quantize_dispatch. The TRTLLM dispatch branch passed it as the 4th positional argument to the 5-arg C++fp4_quantizeop, where the 4th slot is actuallysfUseUE8M0(the MXFP4 toggle) and the 5th isisSfSwizzledLayout. As a result, three things were wrong: (1) the wrapper kwarg name lied about what it controlled; (2) the FlashInfer branch interpreted the same kwarg correctly asdo_shuffle(swizzled), so the wrapper had divergent semantics across tactics; (3) callers had no way to actually controlisSfSwizzledLayout— the C++ defaultTruewas always used.The bug stayed latent because every existing call site passes positional
False(production NVFP4 Linear intensorrt_llm/_torch/modules/linear.py, plus the two cases intests/unittest/_torch/thop/parallel/test_fp4_quantize_flashinfer.py), which lands assfUseUE8M0=False(correct for NVFP4) and lets the C++ defaultisSfSwizzledLayout=Trueproduce SWIZZLED output that downstreamnvfp4_gemmconsumes. Trying to flip the swizzled flag by passingTrueinstead crashes withRuntimeError: sfVecSize can only be 32, when sfUseUE8M0 is true(the UE8M0 + sf_vec_size=16 combination is rejected by the C++ op).The fix renames the 4th wrapper kwarg to
sf_use_ue8m0(matching what the TRTLLM dispatch actually controls), adds a real 5th kwargis_sf_swizzled_layout: bool = True(matching the C++ default), and threads both through the dispatch helper,Fp4QuantKernelRunner, and the fake registration. The FlashInfer branch now assertsnot sf_use_ue8m0(FlashInfer has no MXFP4 path) and uses the newis_sf_swizzled_layoutfordo_shuffle. All existing call sites continue to pass positionalFalse, which now binds tosf_use_ue8m0=Falsewhileis_sf_swizzled_layoutfalls back to the new defaultTrue— so each existing caller's effective C++ call is byte-identical to pre-fix.Test Coverage
tests/unittest/_torch/thop/parallel/test_fp4_quantize_flashinfer.py(the wrapper's own op-level test, bothtest_tunable_fp4_quantize_opandtest_tunable_fp4_quantize_with_autotune) — pass.PR Checklist
PR description clearly explains what and why.
PR Follows TRT-LLM CODING GUIDELINES.
Test cases are provided for new code paths.
Any new dependencies have been scanned for license and vulnerabilities.
CODEOWNERS updated if ownership changes.
Documentation updated as needed.
Update tava architecture diagram if significant design change.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.