[None][fix] tunable_fp4_quantize: rename misnamed kwarg + add real SF-swizzle control by luyiyun1021 · Pull Request #15002 · NVIDIA/TensorRT-LLM

luyiyun1021 · 2026-06-05T09:12:17Z

Summary by CodeRabbit

Release Notes

Bug Fixes
- Corrected FP4 quantization dispatch so the scale-factor flags reach the intended C++ arguments and are interpreted consistently by the TRTLLM and FlashInfer backends
Enhancements
- Added configuration options for the scale-factor encoding (UE8M0) and scale-factor layout (swizzled or not) to provide finer control over quantization behavior

Description

Fixes a latent bug in tunable_fp4_quantize (added in PR #12126) where the Python wrapper's 4th kwarg, named is_sf_swizzled_layout, was misforwarded inside _fp4_quantize_dispatch. The TRTLLM dispatch branch passed it as the 4th positional argument to the 5-arg C++ fp4_quantize op, where the 4th slot is actually sfUseUE8M0 (the MXFP4 toggle) and the 5th is isSfSwizzledLayout. As a result, three things were wrong: (1) the wrapper kwarg name lied about what it controlled; (2) the FlashInfer branch interpreted the same kwarg correctly as do_shuffle (swizzled), so the wrapper had divergent semantics across tactics; (3) callers had no way to actually control isSfSwizzledLayout — the C++ default True was always used.
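To make the misbinding concrete, here is a minimal pure-Python sketch. `fake_cpp_fp4_quantize` is a hypothetical stand-in for the 5-argument C++ op: only the order of the last two slots (sfUseUE8M0, then isSfSwizzledLayout) and the swizzle default come from the description, and the first three parameter names are assumptions. `buggy_dispatch` is a hypothetical simplification of the pre-fix TRTLLM branch that forwards the wrapper's 4th kwarg positionally.

```python
# Hypothetical stand-in for the 5-arg C++ op; it only reports how each
# flag was bound instead of quantizing anything.
def fake_cpp_fp4_quantize(x, global_scale, sf_vec_size,
                          sf_use_ue8m0=False, is_sf_swizzled_layout=True):
    return {"sf_use_ue8m0": sf_use_ue8m0,
            "is_sf_swizzled_layout": is_sf_swizzled_layout}


# Pre-fix shape of the TRTLLM branch (sketch): the wrapper's 4th kwarg is
# *named* is_sf_swizzled_layout but lands in the op's 4th positional slot.
def buggy_dispatch(x, global_scale, sf_vec_size, is_sf_swizzled_layout):
    return fake_cpp_fp4_quantize(x, global_scale, sf_vec_size,
                                 is_sf_swizzled_layout)


bound = buggy_dispatch(None, 1.0, 16, is_sf_swizzled_layout=False)
# The caller asked for a non-swizzled layout, but the flag was bound to
# sf_use_ue8m0 and the C++ default kept the layout swizzled.
print(bound)  # {'sf_use_ue8m0': False, 'is_sf_swizzled_layout': True}
```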

The bug stayed latent because every existing call site passes positional False (production NVFP4 Linear in tensorrt_llm/_torch/modules/linear.py, plus the two cases in tests/unittest/_torch/thop/parallel/test_fp4_quantize_flashinfer.py), which lands as sfUseUE8M0=False (correct for NVFP4) and lets the C++ default isSfSwizzledLayout=True produce SWIZZLED output that downstream nvfp4_gemm consumes. Trying to flip the swizzled flag by passing True instead crashes with RuntimeError: sfVecSize can only be 32, when sfUseUE8M0 is true (the UE8M0 + sf_vec_size=16 combination is rejected by the C++ op).
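The sketch below reproduces why the bug stayed hidden and why flipping the flag crashed. The stand-in op is hypothetical; its validation only mirrors the error message quoted above (UE8M0 scales require a 32-element scale-factor block), and the exact condition in the real C++ op is an assumption.

```python
def fake_cpp_fp4_quantize(x, global_scale, sf_vec_size,
                          sf_use_ue8m0=False, is_sf_swizzled_layout=True):
    # Mirrors the quoted check: UE8M0 (MXFP4) scales are only accepted
    # with a 32-element scale-factor block.
    if sf_use_ue8m0 and sf_vec_size != 32:
        raise RuntimeError("sfVecSize can only be 32, when sfUseUE8M0 is true")
    return sf_use_ue8m0, is_sf_swizzled_layout


# Existing NVFP4 callers (sf_vec_size=16) pass positional False: UE8M0 stays
# off and the layout silently falls back to the swizzled default.
assert fake_cpp_fp4_quantize(None, 1.0, 16, False) == (False, True)

# Passing True in the same slot, hoping to toggle swizzling, enables UE8M0.
try:
    fake_cpp_fp4_quantize(None, 1.0, 16, True)
except RuntimeError as err:
    print(err)  # sfVecSize can only be 32, when sfUseUE8M0 is true
```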

The fix renames the 4th wrapper kwarg to sf_use_ue8m0 (matching what the TRTLLM dispatch actually controls), adds a real 5th kwarg is_sf_swizzled_layout: bool = True (matching the C++ default), and threads both through the dispatch helper, Fp4QuantKernelRunner, and the fake registration. The FlashInfer branch now asserts not sf_use_ue8m0 (FlashInfer has no MXFP4 path) and uses the new is_sf_swizzled_layout for do_shuffle. All existing call sites continue to pass positional False, which now binds to sf_use_ue8m0=False while is_sf_swizzled_layout falls back to the new default True, so each existing caller's effective C++ call is byte-identical to the pre-fix call.
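A simplified sketch of the post-fix dispatch shape follows. Every name here (`fp4_quantize_dispatch`, the two fake backends, the string tactics) is hypothetical and stands in for `_fp4_quantize_dispatch` and the real TRTLLM/FlashInfer calls; it only illustrates the argument routing and backward compatibility described above, not the actual implementation.

```python
TRTLLM = "trtllm"          # hypothetical tactic labels for this sketch
FLASHINFER = "flashinfer"


def fake_cpp_fp4_quantize(x, global_scale, sf_vec_size,
                          sf_use_ue8m0=False, is_sf_swizzled_layout=True):
    return (TRTLLM, sf_use_ue8m0, is_sf_swizzled_layout)


def fake_flashinfer_quantize(x, global_scale, do_shuffle):
    return (FLASHINFER, False, do_shuffle)


def fp4_quantize_dispatch(x, global_scale, sf_vec_size,
                          sf_use_ue8m0=False, is_sf_swizzled_layout=True,
                          tactic=TRTLLM):
    if tactic == FLASHINFER:
        # FlashInfer has no MXFP4 path, so UE8M0 scales are rejected up front.
        assert not sf_use_ue8m0, "FlashInfer does not support sf_use_ue8m0"
        return fake_flashinfer_quantize(x, global_scale,
                                        do_shuffle=is_sf_swizzled_layout)
    # In this sketch both flags are forwarded by keyword, so each one lands
    # in the C++ slot that matches its name.
    return fake_cpp_fp4_quantize(x, global_scale, sf_vec_size,
                                 sf_use_ue8m0=sf_use_ue8m0,
                                 is_sf_swizzled_layout=is_sf_swizzled_layout)


# Existing callers pass positional False: UE8M0 off, layout still swizzled.
print(fp4_quantize_dispatch(None, 1.0, 16, False))  # ('trtllm', False, True)
```

Because the new 5th kwarg defaults to `True`, the legacy positional call resolves to the same flags as before, while new callers can request a non-swizzled layout explicitly.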

Test Coverage

tests/unittest/_torch/thop/parallel/test_fp4_quantize_flashinfer.py (the wrapper's own op-level test, both test_tunable_fp4_quantize_op and test_tunable_fp4_quantize_with_autotune) — pass.
LTX-2 transformer block tests (which exercise the production NVFP4 Linear path through this wrapper) — pass.

PR Checklist

PR description clearly explains what and why.
PR Follows TRT-LLM CODING GUIDELINES.
Test cases are provided for new code paths.
Any new dependencies have been scanned for license and vulnerabilities.
CODEOWNERS updated if ownership changes.
Documentation updated as needed.
Update tava architecture diagram if significant design change.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.


luyiyun1021 · 2026-06-05T09:12:25Z

/bot run --disable-fail-fast

coderabbitai · 2026-06-05T09:16:46Z

📝 Walkthrough


Updated FP4 quantization to make dispatch parameters explicit: the internal dispatch helper now accepts sf_use_ue8m0 (MXFP4 UE8M0 scaling) and is_sf_swizzled_layout (swizzled 128x4 scale layout), propagates them through Fp4QuantKernelRunner with caching, exposes them in the public tunable_fp4_quantize API, and updates the torch.compile fake implementation to handle them.

Changes

FP4 dispatch parameter handling

Layer / File(s)	Summary
Dispatch helper parameter expansion `tensorrt_llm/_torch/custom_ops/torch_custom_ops.py`	`_fp4_quantize_dispatch` accepts `sf_use_ue8m0` and `is_sf_swizzled_layout` parameters, documents their semantics relative to the C++ backend, enforces that FlashInfer tactic cannot use `sf_use_ue8m0=True`, and the TRTLLM dispatch path passes `sf_use_ue8m0` to the underlying op invocation.
KernelRunner parameter caching `tensorrt_llm/_torch/custom_ops/torch_custom_ops.py`	`Fp4QuantKernelRunner` constructor stores both parameters, includes them in the cache key alongside `scaling_vector_size`, updates `is_sf_swizzled_layout` default to `True`, and forward method propagates both to the dispatch helper.
Public custom-op API signature and wiring `tensorrt_llm/_torch/custom_ops/torch_custom_ops.py`	`tunable_fp4_quantize` signature expanded to accept `sf_use_ue8m0` (default `False`) and `is_sf_swizzled_layout` (default `True`), documentation and runner construction updated accordingly, and both the fast-path and fallback dispatch calls pass `sf_use_ue8m0` to the helper.
torch.compile fake implementation support `tensorrt_llm/_torch/custom_ops/torch_custom_ops.py`	Fake implementation signature accepts the new parameters with matching defaults, and shape computation explicitly ignores both flags to ensure output shape remains independent of control parameters.
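The cache-key change in `Fp4QuantKernelRunner` can be illustrated with a small sketch. `Fp4QuantRunnerSketch` is a hypothetical stand-in, and the method name `unique_id` is an assumption about how the autotuner keys runners; the point is only that every behavior-changing flag belongs in the key alongside `scaling_vector_size`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fp4QuantRunnerSketch:
    # Hypothetical stand-in for Fp4QuantKernelRunner's constructor state.
    scaling_vector_size: int
    sf_use_ue8m0: bool = False
    is_sf_swizzled_layout: bool = True

    def unique_id(self):
        # Include every flag that changes kernel behavior, so a cached tuning
        # result for one configuration is never reused for another.
        return (self.scaling_vector_size, self.sf_use_ue8m0,
                self.is_sf_swizzled_layout)


swizzled = Fp4QuantRunnerSketch(16)
linear = Fp4QuantRunnerSketch(16, is_sf_swizzled_layout=False)
mxfp4 = Fp4QuantRunnerSketch(32, sf_use_ue8m0=True)
print(swizzled.unique_id())  # (16, False, True)
```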

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main change: renaming a misnamed kwarg and adding real control for SF-swizzle in the tunable_fp4_quantize function.
Description check	✅ Passed	The PR description comprehensively explains the bug, its consequences, the fix, and test coverage. It follows the template structure and includes detailed technical context.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
2424-2432: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

get_valid_tactics should exclude FlashInfer when sf_use_ue8m0=True.

The assertion at lines 2372-2374 enforces that FlashInfer cannot be used with sf_use_ue8m0=True, but get_valid_tactics unconditionally includes Fp4QuantTactic.FLASHINFER when FlashInfer is available. During autotuning warmup, tuner.choose_one will call forward(), which invokes _fp4_quantize_dispatch for each tactic, causing an assertion failure if sf_use_ue8m0=True.
🐛 Proposed fix to filter FlashInfer based on sf_use_ue8m0
     def get_valid_tactics(
         self,
         inputs: List[torch.Tensor],
         profile: OptimizationProfile,
     ) -> List[int]:
         tactics = [Fp4QuantTactic.TRTLLM]
-        if IS_FLASHINFER_AVAILABLE:
+        # FlashInfer does not support MXFP4 (UE8M0) scaling
+        if IS_FLASHINFER_AVAILABLE and not self.sf_use_ue8m0:
             tactics.append(Fp4QuantTactic.FLASHINFER)
         return tactics
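Here is a self-contained sketch of the proposed filter and why it matters during warmup. `RunnerSketch` and `profile_all` are hypothetical simplifications (the real `get_valid_tactics` also takes `inputs` and `profile`, and `profile_all` only approximates `tuner.choose_one` trying every valid tactic), and `IS_FLASHINFER_AVAILABLE` is hard-coded to `True` as an assumption.

```python
from enum import IntEnum
from typing import List

IS_FLASHINFER_AVAILABLE = True  # assumed installed for this sketch


class Fp4QuantTactic(IntEnum):
    TRTLLM = 0
    FLASHINFER = 1


class RunnerSketch:
    def __init__(self, sf_use_ue8m0: bool):
        self.sf_use_ue8m0 = sf_use_ue8m0

    def get_valid_tactics(self) -> List[Fp4QuantTactic]:
        tactics = [Fp4QuantTactic.TRTLLM]
        # FlashInfer does not support MXFP4 (UE8M0) scaling.
        if IS_FLASHINFER_AVAILABLE and not self.sf_use_ue8m0:
            tactics.append(Fp4QuantTactic.FLASHINFER)
        return tactics

    def forward(self, tactic: Fp4QuantTactic) -> str:
        # Same guard as the dispatch helper's assertion.
        if tactic == Fp4QuantTactic.FLASHINFER:
            assert not self.sf_use_ue8m0, "FlashInfer cannot use sf_use_ue8m0"
        return tactic.name


def profile_all(runner: RunnerSketch) -> List[str]:
    # Simplified warmup: run every tactic the runner reports as valid.
    return [runner.forward(t) for t in runner.get_valid_tactics()]


print(profile_all(RunnerSketch(sf_use_ue8m0=False)))  # ['TRTLLM', 'FLASHINFER']
print(profile_all(RunnerSketch(sf_use_ue8m0=True)))   # ['TRTLLM']
```

Without the `not self.sf_use_ue8m0` check, the UE8M0 runner would receive `Fp4QuantTactic.FLASHINFER` during warmup and `forward` would trip the assertion, which is the failure mode the review describes.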
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/custom_ops/torch_custom_ops.py` around lines 2424 - 2432,
The get_valid_tactics() method currently appends Fp4QuantTactic.FLASHINFER
whenever IS_FLASHINFER_AVAILABLE is true, which conflicts with the earlier
assertion forbidding FlashInfer when sf_use_ue8m0=True; update get_valid_tactics
(the method name) to check the instance flag self.sf_use_ue8m0 and only append
Fp4QuantTactic.FLASHINFER if IS_FLASHINFER_AVAILABLE is true AND
self.sf_use_ue8m0 is False, so autotuning won't select FlashInfer when
sf_use_ue8m0 is enabled.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4a08cbb6-0708-4058-b909-c4a330d929de

📥 Commits

Reviewing files that changed from the base of the PR and between fdcdcb3 and ebecfe5.

📒 Files selected for processing (1)

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

tensorrt-cicd · 2026-06-05T09:18:34Z

PR_Github #52319 [ run ] triggered by Bot. Commit: ebecfe5

tensorrt-cicd · 2026-06-05T16:16:42Z

PR_Github #52319 [ run ] completed with state SUCCESS. Commit: ebecfe5
/LLM/main/L0_MergeRequest_PR pipeline #41625 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again


chang-l · 2026-06-06T00:47:31Z

If FI (FlashInfer) is selected but sf_use_ue8m0 is somehow passed, would that cause a problem?
Can we exclude FI in get_valid_tactics when self.sf_use_ue8m0 is True, so the tuner never selects or profiles it?

Commit ebecfe5: [None][fix] tunable_fp4_quantize: rename misnamed kwarg + add real SF-swizzle control
Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>

luyiyun1021 requested a review from a team as a code owner June 5, 2026 09:12

luyiyun1021 requested a review from hyukn June 5, 2026 09:12

github-actions Bot assigned luyiyun1021 Jun 5, 2026

coderabbitai Bot reviewed Jun 5, 2026


chang-l approved these changes Jun 6, 2026


chang-l reviewed Jun 6, 2026

luyiyun1021 wants to merge 1 commit into NVIDIA:main from luyiyun1021:fix-tunable-fp4-quantize-kwarg