
[None][fix] tunable_fp4_quantize: rename misnamed kwarg + add real SF-swizzle control #15002

Open
luyiyun1021 wants to merge 1 commit into NVIDIA:main from luyiyun1021:fix-tunable-fp4-quantize-kwarg

Conversation

@luyiyun1021
Collaborator

@luyiyun1021 luyiyun1021 commented Jun 5, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Corrected FP4 quantization dispatch to properly align scale factors between different inference backends
  • Enhancements

    • Added configuration options for scale factor layout format and sizing to provide finer control over quantization behavior

Description

Fixes a latent bug in tunable_fp4_quantize (added in PR #12126) where the Python wrapper's 4th kwarg, named is_sf_swizzled_layout, was misforwarded inside _fp4_quantize_dispatch. The TRTLLM dispatch branch passed it as the 4th positional argument to the 5-arg C++ fp4_quantize op, where the 4th slot is actually sfUseUE8M0 (the MXFP4 toggle) and the 5th is isSfSwizzledLayout. As a result, three things were wrong: (1) the wrapper kwarg name lied about what it controlled; (2) the FlashInfer branch interpreted the same kwarg correctly as do_shuffle (swizzled), so the wrapper had divergent semantics across tactics; (3) callers had no way to actually control isSfSwizzledLayout — the C++ default True was always used.

The bug stayed latent because every existing call site passes positional False (production NVFP4 Linear in tensorrt_llm/_torch/modules/linear.py, plus the two cases in tests/unittest/_torch/thop/parallel/test_fp4_quantize_flashinfer.py), which lands as sfUseUE8M0=False (correct for NVFP4) and lets the C++ default isSfSwizzledLayout=True produce SWIZZLED output that downstream nvfp4_gemm consumes. Trying to flip the swizzled flag by passing True instead crashes with RuntimeError: sfVecSize can only be 32, when sfUseUE8M0 is true (the UE8M0 + sf_vec_size=16 combination is rejected by the C++ op).
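The mis-binding described above can be reproduced without a GPU by using pure-Python stand-ins. In this sketch, `fake_cpp_fp4_quantize` and `buggy_dispatch` are hypothetical stubs that only mimic the argument order and the error message quoted in this description; they are not the real TensorRT-LLM op or dispatch helper.

```python
def fake_cpp_fp4_quantize(x, global_scale, sf_vec_size,
                          sf_use_ue8m0=False, is_sf_swizzled_layout=True):
    """Hypothetical stand-in mirroring the 5-argument order described above."""
    if sf_use_ue8m0 and sf_vec_size != 32:
        raise RuntimeError("sfVecSize can only be 32, when sfUseUE8M0 is true")
    # Echo the flags so the caller can see where each argument landed.
    return {"sf_use_ue8m0": sf_use_ue8m0,
            "is_sf_swizzled_layout": is_sf_swizzled_layout}


def buggy_dispatch(x, global_scale, sf_vec_size, is_sf_swizzled_layout):
    # Pre-fix behaviour: the "swizzle" flag is forwarded positionally and
    # lands in the 4th slot, which the op reads as sf_use_ue8m0.
    return fake_cpp_fp4_quantize(x, global_scale, sf_vec_size,
                                 is_sf_swizzled_layout)


# Existing callers pass False: the result is right for NVFP4 only because
# False is also the correct value for sf_use_ue8m0.
latent = buggy_dispatch(None, None, 16, False)

# Asking for a swizzled layout explicitly trips the UE8M0 check instead.
try:
    buggy_dispatch(None, None, 16, True)
    crash_message = None
except RuntimeError as exc:
    crash_message = str(exc)
```

With the stub, `latent` shows that the swizzle flag the caller thought it was setting never reaches the 5th slot, and `crash_message` carries the same text as the error reported above.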

The fix renames the 4th wrapper kwarg to sf_use_ue8m0 (matching what the TRTLLM dispatch actually controls), adds a real 5th kwarg is_sf_swizzled_layout: bool = True (matching the C++ default), and threads both through the dispatch helper, Fp4QuantKernelRunner, and the fake registration. The FlashInfer branch now asserts not sf_use_ue8m0 (FlashInfer has no MXFP4 path) and uses the new is_sf_swizzled_layout for do_shuffle. All existing call sites continue to pass positional False, which now binds to sf_use_ue8m0=False while is_sf_swizzled_layout falls back to the new default True — so each existing caller's effective C++ call is byte-identical to pre-fix.
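A minimal sketch of the post-fix argument flow, again with hypothetical stubs (`fake_cpp_fp4_quantize`, `fixed_wrapper`) rather than the real code. Forwarding the flags by keyword is an illustrative choice made here so that neither flag can shift slots; the actual dispatch helper may forward them differently.

```python
def fake_cpp_fp4_quantize(x, global_scale, sf_vec_size,
                          sf_use_ue8m0=False, is_sf_swizzled_layout=True):
    # Hypothetical stub: returns the flags it received, in slot order.
    return (sf_use_ue8m0, is_sf_swizzled_layout)


def fixed_wrapper(x, global_scale, sf_vec_size,
                  sf_use_ue8m0=False, is_sf_swizzled_layout=True):
    # 4th parameter is now named for what it controls; the 5th is new and
    # defaults to True, matching the default described for the C++ op.
    return fake_cpp_fp4_quantize(
        x, global_scale, sf_vec_size,
        sf_use_ue8m0=sf_use_ue8m0,
        is_sf_swizzled_layout=is_sf_swizzled_layout,
    )


# A legacy call site passing positional False keeps its old effective call.
legacy_call = fixed_wrapper(None, None, 16, False)

# The swizzle flag can now be controlled independently of UE8M0.
linear_layout = fixed_wrapper(None, None, 16, False,
                              is_sf_swizzled_layout=False)
```

The first call illustrates the backward-compatibility claim: positional `False` binds to `sf_use_ue8m0` and the swizzle flag falls back to `True`.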

Test Coverage

  • tests/unittest/_torch/thop/parallel/test_fp4_quantize_flashinfer.py (the wrapper's own op-level test, both test_tunable_fp4_quantize_op and test_tunable_fp4_quantize_with_autotune) — pass.
  • LTX-2 transformer block tests (which exercise the production NVFP4 Linear path through this wrapper) — pass.

PR Checklist

  • PR description clearly explains what and why.

  • PR Follows TRT-LLM CODING GUIDELINES.

  • Test cases are provided for new code paths.

  • Any new dependencies have been scanned for license and vulnerabilities.

  • CODEOWNERS updated if ownership changes.

  • Documentation updated as needed.

  • Update tava architecture diagram if significant design change.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…-swizzle control

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
@luyiyun1021 luyiyun1021 requested a review from a team as a code owner June 5, 2026 09:12
@luyiyun1021 luyiyun1021 requested a review from hyukn June 5, 2026 09:12
@luyiyun1021
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai Bot commented Jun 5, 2026

Walkthrough

Updated FP4 quantization to make dispatch parameters explicit: the internal dispatch helper now accepts sf_use_ue8m0 (MXFP4 UE8M0 scaling) and is_sf_swizzled_layout (swizzled 128x4 scale layout), propagates them through Fp4QuantKernelRunner with caching, exposes them in the public tunable_fp4_quantize API, and updates the torch.compile fake implementation to handle them.

Changes

FP4 dispatch parameter handling

| Layer / File(s) | Summary |
| --- | --- |
| Dispatch helper parameter expansion (tensorrt_llm/_torch/custom_ops/torch_custom_ops.py) | _fp4_quantize_dispatch accepts sf_use_ue8m0 and is_sf_swizzled_layout parameters, documents their semantics relative to the C++ backend, enforces that the FlashInfer tactic cannot use sf_use_ue8m0=True, and the TRTLLM dispatch path passes sf_use_ue8m0 to the underlying op invocation. |
| KernelRunner parameter caching (tensorrt_llm/_torch/custom_ops/torch_custom_ops.py) | The Fp4QuantKernelRunner constructor stores both parameters, includes them in the cache key alongside scaling_vector_size, and updates the is_sf_swizzled_layout default to True; the forward method propagates both to the dispatch helper. |
| Public custom-op API signature and wiring (tensorrt_llm/_torch/custom_ops/torch_custom_ops.py) | The tunable_fp4_quantize signature is expanded to accept sf_use_ue8m0 (default False) and is_sf_swizzled_layout (default True), documentation and runner construction are updated accordingly, and both the fast-path and fallback dispatch calls pass sf_use_ue8m0 to the helper. |
| torch.compile fake implementation support (tensorrt_llm/_torch/custom_ops/torch_custom_ops.py) | The fake implementation signature accepts the new parameters with matching defaults, and shape computation explicitly ignores both flags so that the output shape remains independent of the control parameters. |
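To illustrate why the walkthrough mentions the cache key, here is a small sketch of a runner whose key includes both flags. `SketchFp4QuantRunner` and its `cache_key` method are hypothetical names used only for illustration; the real Fp4QuantKernelRunner and the autotuner's keying mechanism may look different.

```python
class SketchFp4QuantRunner:
    """Hypothetical runner; only the constructor and key logic are modelled."""

    def __init__(self, scaling_vector_size, sf_use_ue8m0=False,
                 is_sf_swizzled_layout=True):
        self.scaling_vector_size = scaling_vector_size
        self.sf_use_ue8m0 = sf_use_ue8m0
        self.is_sf_swizzled_layout = is_sf_swizzled_layout

    def cache_key(self):
        # Every flag that changes kernel behaviour is part of the key, so a
        # tuning result for one configuration is not reused for another.
        return (self.scaling_vector_size, self.sf_use_ue8m0,
                self.is_sf_swizzled_layout)


# Toy tuning cache keyed by runner configuration.
tuned = {}
tuned[SketchFp4QuantRunner(16).cache_key()] = "tactic for swizzled NVFP4"

# A runner that differs only in the swizzle flag gets a distinct key.
default_key = SketchFp4QuantRunner(16).cache_key()
linear_key = SketchFp4QuantRunner(16, is_sf_swizzled_layout=False).cache_key()
```

If either flag were left out of the key, the two configurations above would collide and could share a cached tactic that was profiled under different settings.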

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: renaming a misnamed kwarg and adding real control for SF-swizzle in the tunable_fp4_quantize function. |
| Description check | ✅ Passed | The PR description comprehensively explains the bug, its consequences, the fix, and test coverage. It follows the template structure and includes detailed technical context. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

2424-2432: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

get_valid_tactics should exclude FlashInfer when sf_use_ue8m0=True.

The assertion at lines 2372-2374 enforces that FlashInfer cannot be used with sf_use_ue8m0=True, but get_valid_tactics unconditionally includes Fp4QuantTactic.FLASHINFER when FlashInfer is available. During autotuning warmup, tuner.choose_one calls forward(), which invokes _fp4_quantize_dispatch for each tactic, so the assertion fails if sf_use_ue8m0=True.

🐛 Proposed fix to filter FlashInfer based on sf_use_ue8m0
     def get_valid_tactics(
         self,
         inputs: List[torch.Tensor],
         profile: OptimizationProfile,
     ) -> List[int]:
         tactics = [Fp4QuantTactic.TRTLLM]
-        if IS_FLASHINFER_AVAILABLE:
+        # FlashInfer does not support MXFP4 (UE8M0) scaling
+        if IS_FLASHINFER_AVAILABLE and not self.sf_use_ue8m0:
             tactics.append(Fp4QuantTactic.FLASHINFER)
         return tactics
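The filtering rule in the proposed fix can be checked in isolation with a small standalone sketch. `Tactic` and `valid_tactics` below are hypothetical stand-ins for Fp4QuantTactic and get_valid_tactics; the enum values are assumptions, and the availability flag is passed in as an argument instead of being read from IS_FLASHINFER_AVAILABLE.

```python
from enum import IntEnum


class Tactic(IntEnum):
    # Hypothetical stand-in for Fp4QuantTactic; the values are assumptions.
    TRTLLM = 0
    FLASHINFER = 1


def valid_tactics(flashinfer_available, sf_use_ue8m0):
    # TRTLLM is always offered; FlashInfer only when it is installed and the
    # caller has not requested UE8M0 (MXFP4) scale factors.
    tactics = [Tactic.TRTLLM]
    if flashinfer_available and not sf_use_ue8m0:
        tactics.append(Tactic.FLASHINFER)
    return tactics


nvfp4_tactics = valid_tactics(flashinfer_available=True, sf_use_ue8m0=False)
mxfp4_tactics = valid_tactics(flashinfer_available=True, sf_use_ue8m0=True)
```

With this rule, a tuner that only profiles the returned tactics would never reach the FlashInfer assertion when `sf_use_ue8m0` is set.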
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/custom_ops/torch_custom_ops.py` around lines 2424 - 2432,
The get_valid_tactics() method currently appends Fp4QuantTactic.FLASHINFER
whenever IS_FLASHINFER_AVAILABLE is true, which conflicts with the earlier
assertion forbidding FlashInfer when sf_use_ue8m0=True; update get_valid_tactics
(the method name) to check the instance flag self.sf_use_ue8m0 and only append
Fp4QuantTactic.FLASHINFER if IS_FLASHINFER_AVAILABLE is true AND
self.sf_use_ue8m0 is False, so autotuning won't select FlashInfer when
sf_use_ue8m0 is enabled.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4a08cbb6-0708-4058-b909-c4a330d929de

📥 Commits

Reviewing files that changed from the base of the PR and between fdcdcb3 and ebecfe5.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

@tensorrt-cicd
Collaborator

PR_Github #52319 [ run ] triggered by Bot. Commit: ebecfe5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52319 [ run ] completed with state SUCCESS. Commit: ebecfe5
/LLM/main/L0_MergeRequest_PR pipeline #41625 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Collaborator


If FlashInfer (FI) is selected but sf_use_ue8m0=True is somehow passed, would that cause a problem?
Can we exclude FI in get_valid_tactics when self.sf_use_ue8m0 is True, so the tuner never selects or profiles it?
