8 changes: 7 additions & 1 deletion docs/source/models/visual-generation.md
@@ -35,6 +35,9 @@ TensorRT-LLM **VisualGen** provides a unified inference stack for diffusion mode
| `Lightricks/LTX-2` | Text-to-Video (with Audio), Image-to-Video (with Audio) |
| `Qwen/Qwen-Image` | Text-to-Image |
| `Qwen/Qwen-Image-2512` | Text-to-Image |
| `Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers` | Text-to-Image |
| `Tencent-Hunyuan/HunyuanDiT-v1.1-Diffusers` | Text-to-Image |
| `Tencent-Hunyuan/HunyuanDiT-v1.0-Diffusers` | Text-to-Image |

Models are auto-detected from the checkpoint directory. Diffusers-format models are detected via `model_index.json`; LTX-2 monolithic safetensors checkpoints are detected via embedded metadata. The `AutoPipeline` registry selects the appropriate pipeline class automatically.
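
As a rough illustration of the Diffusers-format detection described above, the sketch below reads `model_index.json` and returns its `_class_name` field, which Diffusers checkpoints use to name their pipeline class. The helper name `read_diffusers_pipeline_class` is hypothetical, and this is not the `AutoPipeline` registry's actual logic, only one plausible way such a lookup could start.

```python
import json
from pathlib import Path
from typing import Optional


def read_diffusers_pipeline_class(checkpoint_dir: str) -> Optional[str]:
    """Return the pipeline class name recorded in model_index.json, if any.

    Hypothetical helper for illustration; the real registry may inspect
    more fields or fall back to other detection paths (e.g. LTX-2 metadata).
    """
    index_path = Path(checkpoint_dir) / "model_index.json"
    if not index_path.is_file():
        # Not a Diffusers-format checkpoint (or the index file is missing).
        return None
    with index_path.open(encoding="utf-8") as f:
        index = json.load(f)
    # Diffusers stores the pipeline class under the "_class_name" key.
    return index.get("_class_name")
```

A caller could map the returned string (for example `"HunyuanDiTPipeline"`) to a registered pipeline class and treat `None` as "try another detection path".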

@@ -48,10 +51,13 @@ Models are auto-detected from the checkpoint directory. Diffusers-format models
| **Wan 2.2** | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| **LTX-2** | Yes | Yes | No | Yes | Yes | No | No | Yes | Yes | Yes | Yes | No |
| **Qwen-Image** [^2] | Yes | Yes | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes | No |
| **HunyuanDiT** [^3] | No | No | No | No | Yes | No | No | No | Yes | Yes | No | No |

[^1]: FLUX models use embedded guidance and do not have a separate negative prompt path, so CFG parallelism is not applicable.

[^2]: Qwen-Image ships a native BF16 implementation with per-module numerical parity vs `diffusers.QwenImagePipeline` (cosine >= 0.999 on the full 20B transformer) and `trtllm-serve` / `/v1/images/generations` support. FP8 blockwise and NVFP4 use VisualGen dynamic quantization from BF16 checkpoints; no pre-quantized checkpoint is required.

[^3]: HunyuanDiT uses bilingual (Chinese/English) text conditioning via a `BertModel` CLIP encoder and an `MT5EncoderModel`. Ulysses sequence parallelism is supported: after patch embedding, the latent sequence is sharded across ranks, and a custom attention processor wraps self-attention in all-to-all collectives. Text cross-attention remains standard SDPA because the text tokens are replicated. Set `ulysses_size` to the desired number of sequence-parallel ranks; it must divide `num_attention_heads=16`. Quantization and ring-attention optimizations are planned for future releases.
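
The sharding arithmetic behind the `ulysses_size` constraint can be sketched as follows. This is a minimal illustration, not the pipeline's code: `ulysses_layout` and `UlyssesLayout` are hypothetical names, the sequence length of 4096 in the usage note is an arbitrary example, and the requirement that the sequence length divide evenly is an assumption (a real implementation might pad instead).

```python
from dataclasses import dataclass

NUM_ATTENTION_HEADS = 16  # value quoted in the HunyuanDiT footnote


@dataclass(frozen=True)
class UlyssesLayout:
    tokens_per_rank: int  # latent tokens a rank holds outside self-attention
    heads_per_rank: int   # heads a rank computes after the first all-to-all


def ulysses_layout(seq_len: int, ulysses_size: int,
                   num_heads: int = NUM_ATTENTION_HEADS) -> UlyssesLayout:
    """Compute per-rank shapes for Ulysses-style sequence parallelism."""
    if ulysses_size < 1:
        raise ValueError("ulysses_size must be >= 1")
    if num_heads % ulysses_size != 0:
        # The all-to-all trades sequence shards for head shards, so the
        # heads must split evenly across ranks.
        raise ValueError(f"ulysses_size must divide num_heads={num_heads}")
    if seq_len % ulysses_size != 0:
        # Assumption: even sharding with no padding.
        raise ValueError("seq_len must be divisible by ulysses_size")
    return UlyssesLayout(seq_len // ulysses_size, num_heads // ulysses_size)
```

With 16 heads, `ulysses_size` can be 1, 2, 4, 8, or 16. For instance, `ulysses_layout(4096, 4)` gives each rank 1024 tokens outside attention and 4 heads over the full sequence inside self-attention, while `ulysses_size=3` is rejected.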

## Quick Start

5 changes: 5 additions & 0 deletions examples/visual_gen/serve/configs/hunyuandit.yml
@@ -0,0 +1,5 @@
attention_config:
  backend: VANILLA
parallel_config:
  cfg_size: 1
  ulysses_size: 1
2 changes: 2 additions & 0 deletions tensorrt_llm/_torch/visual_gen/models/__init__.py
@@ -35,6 +35,7 @@
from ..pipeline_registry import AutoPipeline, register_pipeline
from .cosmos3 import Cosmos3OmniMoTPipeline
from .flux import Flux2Pipeline, FluxPipeline
from .hunyuandit import HunyuanDiTPipeline
from .ltx2 import LTX2Pipeline # noqa: F401
from .qwen_image import QwenImagePipeline
from .wan import WanImageToVideoPipeline, WanPipeline
@@ -44,6 +45,7 @@
    "BasePipeline",
    "FluxPipeline",
    "Flux2Pipeline",
    "HunyuanDiTPipeline",
    "QwenImagePipeline",
    "WanPipeline",
    "WanImageToVideoPipeline",
12 changes: 12 additions & 0 deletions tensorrt_llm/_torch/visual_gen/models/hunyuandit/__init__.py
@@ -0,0 +1,12 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""HunyuanDiT text-to-image pipeline exports."""

from .pipeline_hunyuandit import HunyuanDiTPipeline
from .transformer_hunyuandit import HunyuanDiT2DModelWrapper

__all__ = [
    "HunyuanDiTPipeline",
    "HunyuanDiT2DModelWrapper",
]
36 changes: 36 additions & 0 deletions tensorrt_llm/_torch/visual_gen/models/hunyuandit/defaults.py
@@ -0,0 +1,36 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""HunyuanDiT default generation parameters and extra-param schema."""

from tensorrt_llm._torch.visual_gen.pipeline import ExtraParamSchema

_HUNYUANDIT_DEFAULT_PARAMS = {
    "height": 1024,
    "width": 1024,
    "num_inference_steps": 50,
    "guidance_scale": 7.5,
    "max_sequence_length": 77,
}


def get_hunyuandit_default_params() -> dict:
    """Return a copy of the defaults so callers cannot mutate the module-level dict."""
    return dict(_HUNYUANDIT_DEFAULT_PARAMS)


def get_hunyuandit_extra_param_specs() -> dict:
    """Return the schema for HunyuanDiT-specific extra request parameters."""
    return {
        "negative_prompt": ExtraParamSchema(
            type="str",
            default="",
            description="Negative text prompt for classifier-free guidance.",
        ),
        "use_resolution_binning": ExtraParamSchema(
            type="bool",
            default=True,
            description=(
                "Snap resolution to the nearest HunyuanDiT training bucket "
                "(recommended for best quality)."
            ),
        ),
    }
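
To make the `use_resolution_binning` description concrete, here is a minimal sketch of snapping a requested size to the bucket with the closest aspect ratio. The bucket list and the `snap_to_bucket` helper are illustrative assumptions; the pipeline's real bucket table and selection rule may differ.

```python
from typing import List, Tuple

# Illustrative (width, height) buckets only; NOT the pipeline's actual table.
EXAMPLE_BUCKETS: List[Tuple[int, int]] = [
    (1024, 1024),
    (1280, 768),
    (768, 1280),
    (1024, 768),
    (768, 1024),
]


def snap_to_bucket(width: int, height: int,
                   buckets: List[Tuple[int, int]] = EXAMPLE_BUCKETS) -> Tuple[int, int]:
    """Return the bucket whose aspect ratio is closest to width / height."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = width / height
    # Pick the bucket that minimizes the absolute aspect-ratio difference.
    return min(buckets, key=lambda wh: abs(wh[0] / wh[1] - target))
```

Under these example buckets, a 1920x1080 request would be generated at 1280x768 and a 1000x1000 request at 1024x1024; a caller could then resize the result back to the requested size if needed.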