
Releases: amd/Quark

AMD Quark Release 0.12


@bill-teng bill-teng released this 03 Jul 23:29

AMD Quark for PyTorch

AMD Quark 0.12 is tested against PyTorch 2.10 and 2.11, and compatible with upstream transformers==4.57.6 and transformers==5.2.

New Features

  • Support for Python 3.11 up to 3.13
  • Support NVFP4 quantization (scheme: nvfp4).
  • Support FP4 quantization with E5M3 per-block scales, called AMDFP4 quantization (amdfp4, amdfp4_g32).
  • Support FP4 quantization with E5M3 per-block scales and a global FP32 scale (schemes: amdfp4_global16, amdfp4_global32).
  • Support native inference for xDiT and Diffusers workflows
  • Pre-quantized layers excluded from quantization (FP8Linear, compressed-tensors, HF-dequantized MXFP4) are now preserved in their original format on export instead of being dequantized to bf16/fp16.
  • Support for compressed-tensors==0.15 in PyTorch export/import and file-to-file quantization flows.
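As a rough illustration of what "FP4 with per-block scales" means in the schemes above, here is a simplified fake-quantization round trip. It is a sketch under stated assumptions, not Quark's kernel: the element grid is assumed to be E2M1 (as in MXFP4/NVFP4), the block size of 32 is an assumption, and scales are kept as plain Python floats instead of being rounded to E5M3. All function names are hypothetical.

```python
# Illustrative only: simplified block-scaled FP4 quantize + dequantize.
# Assumptions (not from the release notes): E2M1 element magnitudes,
# block size 32, and unrounded float scales (no E5M3 encoding).

FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0


def quantize_block(block):
    """Return (scale, codes); each code is a signed value on the FP4 grid."""
    amax = max(abs(v) for v in block)
    scale = amax / FP4_MAX if amax > 0 else 1.0
    codes = []
    for v in block:
        target = abs(v) / scale
        # Round the scaled magnitude to the nearest representable FP4 value.
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def quantize_dequantize(values, block_size=32):
    """Fake-quantize a flat list of floats, one scale per block."""
    out = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        out.extend(c * scale for c in codes)
    return out
```

Values that already sit on the scaled grid survive unchanged, for example `quantize_dequantize([0.0, 3.0, -6.0, 1.0])`; other values are rounded to the nearest grid point within their block.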

Model Support

Supported out-of-box model architectures:

  • DeepSeek-V4-Pro, DeepSeek-V4-Flash
  • GLM-5, GLM-5.1, GLM-5.2
  • Kimi-K2.5, Kimi-K2.6 (Reference)
  • MiniMax-M2.5, MiniMax-M2.7, MiniMax-M3
  • Qwen3.5-397B-A17B, Qwen3.5-35B-A3B

Bug fixes and minor improvements

  • Fixed MXFP4 dequantization kernel failures for large tensor shapes.
  • Fixed E5M3 Triton kernel dispatch so that it runs on the correct device in multi-device settings.
  • Fixed LLMTemplate validation to raise a clear error when a required algorithm configuration is missing.
  • Fixed a bug where MoE calibration diagnostics did not warn when static activation quantizers received no calibration tokens.
  • Fixed AWQ scaling for Qwen3.5-style RMSNorm.

Diffusion model quantization and Hugging Face Diffusers integration

  • AMD Quark now plugs directly into Hugging Face diffusers. Importing quark.integrations.diffusers self-registers Quark into the diffusers AUTO_QUANTIZER_MAPPING / AUTO_QUANTIZATION_CONFIG_MAPPING, so quantized diffusion models can be saved and reloaded through the standard save_pretrained / from_pretrained APIs:

    from quark.integrations import diffusers  # registers the "quark" quantizer
    from diffusers import DiffusionPipeline
    
    pipe = DiffusionPipeline.from_pretrained("<org>/<sdxl-or-flux-quark-checkpoint>")
  • Export: DiffusersSafetensorsExporter writes a quantized pipeline submodule via save_pretrained, embedding the serialized Quark QConfig under quantization_config in config.json. Reload reconstructs the quantized layers automatically (meta-device / low_cpu_mem_usage loading supported) and freezes them for inference.

  • On-the-fly quantization: a pipeline submodule (pipe.unet / pipe.transformer) can be quantized in-process, without a separate export/reload round-trip.

  • Calibration utilities promoted into the library: quark.torch.utils.diffusers.get_calib_dataloader(pipe, target_module, prompts, n_steps=...) runs the pipeline, captures the submodule's intermediate inputs, and returns a dataloader ready for ModelQuantizer.quantize_model — no more copying calibration code out of the examples.

  • Works with round-to-nearest, SmoothQuant, and SVDQuant. A PR to add Quark to diffusers upstream is planned; the self-registration path works today.

SVDQuant (SVD-based low-bit error correction)

  • Added SVDQuant (quark.torch.algorithm.svdquant, configured via SVDQuantConfig), from SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models. It pairs SmoothQuant-style smoothing with a high-precision low-rank correction branch, making INT4 / MXFP4 / NVFP4 weight quantization (optionally with 4-bit activations) viable. The algorithm applies to both diffusion models and LLMs.
  • Ready-made schemes via build_quant_layer_config: w4a16, w4a4, mxfp4, and nvfp4. Optional GPTQ residual quantization (use_gptq=True) and per-layer alpha search (search_alpha=True).
  • A calibration / grid-search helper, examples/torch/diffusers/svdquant_calibrate.py, sweeps the smoothing alpha, GPTQ on/off, and the number of calibration samples, scoring each configuration by reference-image quality (PSNR / MSE, plus lpips when installed) or quantized-submodule MSE, and reports the best.
  • On FLUX.1-dev, SVDQuant in W4A4 / MXFP4 / NVFP4 nearly matches the FP16 CLIP score.
  • Native inference: SVDQuant-MXFP4 models can run with real low-bit aiter GEMM kernels (ROCm) via quark.torch.enable_native_inference — the MXFP4 residual GEMM plus the low-rank correction branch — replacing the emulation/QDQ path for faster, lower-memory inference.
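The core idea described above (a high-precision low-rank branch plus a low-bit residual) can be illustrated with a toy sketch. This is not Quark's implementation and does not use SVDQuantConfig: it skips the SmoothQuant-style smoothing step, uses a rank-1 branch found by power iteration rather than a full SVD, and applies plain symmetric per-tensor INT4 fake quantization. All function names are hypothetical.

```python
# Toy sketch: W ~= L (rank-1, high precision) + Q(W - L) (INT4 residual).
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rank1_component(w, iters=50):
    """Approximate the top singular component of w via power iteration."""
    wt = transpose(w)
    v = [1.0] * len(w[0])
    for _ in range(iters):
        v = matvec(wt, matvec(w, v))
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    u = matvec(w, v)  # equals sigma * left singular vector
    return [[ui * vj for vj in v] for ui in u]


def int4_fake_quant(m):
    """Symmetric per-tensor INT4 quantize + dequantize; returns (matrix, scale)."""
    amax = max(abs(x) for row in m for x in row)
    scale = amax / 7.0 if amax > 0 else 1.0
    deq = [[max(-8, min(7, round(x / scale))) * scale for x in row] for row in m]
    return deq, scale


def svdquant_like(w):
    """Return (approximation of w, residual quantization scale)."""
    low_rank = rank1_component(w)
    residual = [[a - b for a, b in zip(rw, rl)] for rw, rl in zip(w, low_rank)]
    residual_q, scale = int4_fake_quant(residual)
    approx = [[a + b for a, b in zip(rl, rq)] for rl, rq in zip(low_rank, residual_q)]
    return approx, scale
```

Because the low-rank branch absorbs the dominant (often outlier-driven) component, the residual has a smaller range, so its INT4 step size `scale` is smaller than it would be when quantizing `w` directly. Each element of the approximation is then within half a step of the original.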

Agent Skills

Added a Claude Code skill suite for the PyTorch flow, auto-discovered from .claude/skills/ and routed by file type (HuggingFace / safetensors / PyTorch checkpoints → Torch skills, never silently mixed with the ONNX flow):

  • quark-torch-ptq — end-to-end PTQ pipeline for HF / safetensors models (FP8, INT4, MXFP4, etc.), stopping at the quantized output.
  • quark-torch-llm-ptq-eval — PTQ plus validation and perplexity evaluation in one flow.
  • quark-torch-file2file-quantization — file-to-file quantization for ultra-large models.
  • quark-torch-model-intake — inspects a model and assesses quantization support.
  • quark-torch-export — exports quantized models (e.g. GGUF, ONNX).
  • quark-torch-install / quark-torch-debug — set up the matching PyTorch build and diagnose failed PTQ runs.

The backend-neutral quark-env-preflight and quark-install skills apply to both the Torch and ONNX flows.

vLLM Online Quantization

Added quark.online_quantization.vllm (tested against vLLM 0.21), bringing Quark's online quantization flow into vLLM at load time. It extends vLLM's built-in online quantization and is designed to map Quark online quantization configs rather than being limited to a fixed list of schemes. The current release supports three schemes: per-channel FP8 (ptpc_fp8), MXFP4 (mxfp4), and a mixed linear-FP8/MoE-MXFP4 scheme (linear_ptpc_fp8_moe_mxfp4). It also supports re-quantizing offline-quantized checkpoints (e.g., DeepSeek-R1 FP8 block-scale) to a different online scheme at load time. Online versions of more Quark quantization algorithms are planned for future updates. See the runnable example.

AMD Quark Infrastructure

New Features

  • Support for Python 3.11 up to 3.13

  • Bumped the minimum required numpy to >= 2.0 across the ONNX and PyTorch flows.

  • Pre-built wheels are now published for PyTorch 2.10+ on the AMD package index (CPU, CUDA 12.8, ROCm 7.1, ROCm 7.2; Linux/Windows; Python 3.11–3.13). They ship pre-compiled C++ extensions, so no C++ compiler is needed and the first import quark no longer triggers a one-time kernel/custom-op build. To fetch a pre-built wheel, point pip at the matching index:

    pip install amd-quark --extra-index-url https://pypi.amd.com/quark/cpu/simple    # CPU
    pip install amd-quark --extra-index-url https://pypi.amd.com/quark/cu128/simple  # CUDA 12.8
    pip install amd-quark --extra-index-url https://pypi.amd.com/quark/rocm71/simple # ROCm 7.1 (Linux only)
    pip install amd-quark --extra-index-url https://pypi.amd.com/quark/rocm72/simple # ROCm 7.2 (Linux only)

Deprecations and breaking changes

  • The quark.testing module has been deprecated and removed. All testing utilities have been consolidated into quark.common.utils.testing_utils. Update your imports as follows:
    • from quark.testing import skip_if_no_gpu, slow_test, slow_test_if → from quark.common.utils.testing_utils import skip_if_no_gpu, slow_test, slow_test_if
    • from quark.testing.common_utils import TestCase → from quark.common.utils.testing_utils import TestCase
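For codebases with many test files, the import update above can be applied mechanically. The following is a minimal sketch of such a rewrite; the helper name `rewrite_imports` is hypothetical and not part of Quark, and it only handles single-line `from ... import ...` statements for the two module paths listed in the notes.

```python
import re

# Old -> new module paths, taken from the migration notes.
MODULE_RENAMES = {
    "quark.testing.common_utils": "quark.common.utils.testing_utils",
    "quark.testing": "quark.common.utils.testing_utils",
}

# Try longer module names first so that "quark.testing.common_utils"
# is not partially matched as "quark.testing".
_ALTERNATIVES = "|".join(
    re.escape(name) for name in sorted(MODULE_RENAMES, key=len, reverse=True)
)
_PATTERN = re.compile(
    r"^(\s*from\s+)(" + _ALTERNATIVES + r")(\s+import\s)",
    flags=re.MULTILINE,
)


def rewrite_imports(source: str) -> str:
    """Rewrite deprecated quark.testing imports in a source string."""
    return _PATTERN.sub(
        lambda m: m.group(1) + MODULE_RENAMES[m.group(2)] + m.group(3),
        source,
    )
```

Imports from other submodules (for example a hypothetical `quark.testing.other`) are left untouched, so they can be reviewed by hand.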

Quark Shapeshifter (formerly Quark ONNX Adapter)

New Features

  • Added support for Quark ONNX post-processing workflows, including Q/DQ cleanup, scale alignment, bfloat16 adaptation, and XINT8/NPU simulation.
  • Added support for Quark Torch workflows through PyTorch model transformation passes, including dropout removal and model tracing.
  • Added automatic pass discovery and registration for official passes in quark/shapeshifter/passes/ and optional community passes in quark/contrib/shapeshifter_community_passes/. Shapeshifter validates pass types so that each workflow contains either ONNX passes or PyTorch passes, but not both.

Deprecations and breaking changes

  • The quark-cli onnx-adapter command is deprecated and will be removed in a future release. Please use quark-cli shapeshifter instead. Both commands are functionally identical during this release to provide backward compatibility.

AMD Quark for ONNX

New Features

  • Added additional quantize/dequantize node pairs at the mixed‑precision tensors to simulate node-wise quantization under QDQ mode.
  • Added support for excluding specific nodes' outputs from quantization via setting a new extra option NodesToExcludeOutputQuantization.
  • Refined the block axis of BFP and MX to make sure it's always on the reduction dimension in matrix multiplication operations.

Enhancements

Enhancements for calibration:

  • Added Selective Calibration Propagation (SCP) via the CalibPassthroughOpTypes extra option. Distribution-preserving operators (e.g., Reshape, Transpose, Gather) are skipped during calibration, reducing calibration time and memory.
  • Added CalibOptimizeDisk option for LayerwisePercentile calibration. When True (default), activation tensors are processed on-the-fly and never written to disk or held in memory, eliminating disk usage at the cost of slightly increased runtime.
  • Reduced the peak memory of MinMax and NonOverflow calibration methods by disabling CPU memory ...

AMD Quark Release 0.11.1


@thiagocrepaldi thiagocrepaldi released this 19 Feb 16:55

AMD Quark for PyTorch

Model Support

Supported out-of-box model architectures:

  • Kimi-K2-Thinking, Kimi-K2-Instruct, Kimi-K2.5
  • Qwen3 MoE, Qwen3 Coder, Qwen3 Coder-Next
  • DeepSeek-V3.2, DeepSeek-OCR
  • GLM-4.7
  • Minimax-M2.1

New Features

  • Added File-to-File quantization for ultra-large models. This mode supports weight-only quantization and dynamic activation quantization + weight quantization, exports hf_format only, and can also accept pre-quantized inputs (deepseek-style FP8, compressed-tensors) and re-quantize them to a different format.

    For example, the command below runs file-to-file quantization to MXFP4:

    python3 quantize_quark.py --model_dir [model checkpoint folder] \
                              --output_dir [output folder] \
                              --quant_scheme mxfp4 \
                              --file2file_quantization \
                              --skip_evaluation
  • Added a pre-quantization compatibility check for transformers in LLM PTQ workflows, and enabled dry-run compatibility checking by default with clearer error messages when model loading fails.

Bug fixes and minor improvements

  • Fixed weight calibration coverage to ensure complete calibration even for weights outside the forward path, and added token distribution coverage warnings during calibration.

AMD Quark for ONNX

New Features

  • Support using a YAML file as input to perform custom preprocessing for float models before quantization.

Enhancements

  • Memory optimization has been extended to all calibration methods, particularly further reducing memory usage during activation data collection.

Bug fixes and minor improvements

  • Infer the kernel size from weights if the kernel_shape attribute of Conv nodes is not explicitly present during fast finetuning.
  • Fix scale extraction to handle int32_data in ONNX initializers.
  • Add optional config parameter to onnxslim optimization.
  • Update requirements.txt for onnxslim from 0.1.77 to 0.1.84.
  • Handle NaN and Inf values in model inference output.

Release 0.11


@thiagocrepaldi thiagocrepaldi released this 19 Feb 16:42

AMD Quark for PyTorch

AMD Quark 0.11 is tested against PyTorch 2.9, and compatible with upstream transformers==4.57.

Fused "rotation" and "quarot" algorithms in a single interface

The pre-quantization algorithms "rotation" and "quarot" are fused together into a single rotation algorithm. It can be configured using RotationConfig. By default, only R1 rotation is applied, corresponding to the previous quant_algo="rotation" behavior.

Quark Torch Quantization Config Refactor

  • The quantization configuration classes have been renamed for better clarity and consistency:

    • QuantizationSpec is deprecated in favor of QTensorConfig.
    • QuantizationConfig is deprecated in favor of QLayerConfig.
    • Config is deprecated in favor of QConfig.
  • The deprecated class names (QuantizationSpec, QuantizationConfig, Config) are still available as aliases for backward compatibility, but will be removed in a future release.

  • Before Refactor:

    from quark.torch.quantization.config.config import Config, QuantizationConfig, QuantizationSpec
    
    quant_spec = QuantizationSpec(dtype=Dtype.int8, ...)
    quant_config = QuantizationConfig(weight=quant_spec, ...)
    config = Config(global_quant_config=quant_config, ...)
  • After Refactor:

    from quark.torch.quantization.config.config import QConfig, QLayerConfig, QTensorConfig
    
    quant_spec = QTensorConfig(dtype=Dtype.int8, ...)
    quant_config = QLayerConfig(weight=quant_spec, ...)
    config = QConfig(global_quant_config=quant_config, ...)

quark torch-llm-ptq CLI Refactor and Simplification

The CLI has been significantly refactored to use the new LLMTemplate interface and remove redundant features:

  • Removed model-specific algorithm configuration files (e.g., awq_config.json, gptq_config.json, smooth_config.json). Algorithm configurations are now automatically handled by LLMTemplate.
  • Removed unnecessary CLI arguments, retaining only a dozen or so essential arguments.
  • Simplified export: The CLI now only exports to Hugging Face safetensors format.
  • Simplified evaluation: Evaluation now uses perplexity (PPL) on wikitext-2 dataset instead of the previous multi-task evaluation framework.

Code Organization and Examples Refactor

Moved common utilities to quark.torch.utils:

  • model_preparation.py and data_preparation.py are now available in quark.torch.utils for easier reuse across examples and applications.
  • module_replacement utilities are now located in quark.torch.utils.module_replacement.

Moved LLM evaluation code to quark.contrib:

  • The llm_eval module has been moved to quark.contrib.llm_eval and examples/contrib/llm_eval.
  • Perplexity evaluation (ppl_eval) is now shared between CLI and examples via quark.contrib.llm_eval.

Reorganized example scripts:

  • Removed model-specific algorithm configuration files (e.g., awq_config.json, gptq_config.json, smooth_config.json). Algorithm configurations are now automatically handled by LLMTemplate.

Extended quantize_quark.py example script and quark torch-llm-ptq CLI with new features:

  • Support for custom model templates and quantization schemes registration (example script only).
  • Support for per-layer quantization scheme configuration via --layer_quant_scheme argument.
  • Support for custom algorithm configurations via --quant_algo_config_file argument (example script only).
  • Simplified quantization scheme naming, directly use the built-in scheme names (see breaking changes below).
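The breaking changes for this release describe `--layer_quant_scheme` as a repeated argument taking a pattern and a scheme (for example `--layer_quant_scheme lm_head int8 --layer_quant_scheme '*down_proj' fp8`). The sketch below shows how such pairs could be parsed and applied with the standard library. It is not Quark's parser: the glob-style matching via `fnmatch`, the first-match-wins rule, and the function names are all assumptions made for illustration.

```python
import argparse
from fnmatch import fnmatchcase


def build_parser():
    parser = argparse.ArgumentParser()
    # Each occurrence consumes exactly two values: PATTERN and SCHEME.
    parser.add_argument(
        "--layer_quant_scheme",
        nargs=2,
        action="append",
        metavar=("PATTERN", "SCHEME"),
        default=None,
    )
    return parser


def resolve_scheme(layer_name, layer_schemes, default_scheme):
    """Return the scheme of the first pattern matching layer_name (assumed rule)."""
    for pattern, scheme in layer_schemes or []:
        if fnmatchcase(layer_name, pattern):
            return scheme
    return default_scheme


args = build_parser().parse_args(
    ["--layer_quant_scheme", "lm_head", "int8",
     "--layer_quant_scheme", "*down_proj", "fp8"]
)
```

With these arguments, `resolve_scheme("model.layers.0.mlp.down_proj", args.layer_quant_scheme, "mxfp4")` picks `fp8`, while a layer that matches no pattern falls back to the default scheme.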

Setting log level with QUARK_LOG_LEVEL

The logging level can now be set with the environment variable QUARK_LOG_LEVEL, e.g. QUARK_LOG_LEVEL=debug, QUARK_LOG_LEVEL=warning, QUARK_LOG_LEVEL=error, or QUARK_LOG_LEVEL=critical.
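When launching from Python rather than a shell, the variable can be set in the process environment. The sketch below assumes the variable should be set before Quark is imported, since libraries commonly read logging configuration at import time; that timing is an assumption, and the helper name is hypothetical.

```python
import os


def env_with_quark_log_level(level, base_env=None):
    """Return a copy of the environment with QUARK_LOG_LEVEL set.

    Useful for passing as env= to subprocess.run when launching a
    quantization script in a child process.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["QUARK_LOG_LEVEL"] = level
    return env


# For the current process: set the variable first, import Quark afterwards.
os.environ["QUARK_LOG_LEVEL"] = "debug"
# import quark  # left commented out so the sketch runs without Quark installed
```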

Support for online rotations (online hadamard transform)

The rotation algorithm supports online rotations, such that:

$$y = xRR^TW^T$$

where $x$ is the input activation, $W$ the weight, and $R$ an orthogonal matrix (e.g. a Hadamard transform). With the quantization operator $\mathcal{Q}$ added, this becomes $\mathcal{Q}(xR) \times \mathcal{Q}(WR)^T$. The activation quantization $\mathcal{Q}(xR)$ is done online: the rotation is applied during inference rather than fused into a preceding layer.

Online rotations can be enabled using online_r1_rotation=True in RotationConfig. Please refer to its documentation and to the user guide for more details.
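The identity behind online rotations can be checked numerically. The following pure-Python sketch builds a normalized Hadamard matrix and verifies that rotating both the activation and the weight leaves the layer output unchanged when no quantizer is applied. It does not use Quark or RotationConfig; the helper names and the toy tensors are made up for illustration.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def hadamard(n):
    """Normalized (orthogonal) Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Sylvester construction: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = n ** -0.5
    return [[v * s for v in row] for row in h]


x = [[1.0, -2.0, 0.5, 3.0]]          # activation, shape (1, 4)
w = [[0.2, -0.1, 0.4, 0.0],
     [1.0, 0.3, -0.5, 0.7]]          # weight, shape (out=2, in=4)
r = hadamard(4)

y_ref = matmul(x, transpose(w))                         # x W^T
y_rot = matmul(matmul(x, r), transpose(matmul(w, r)))   # (xR)(WR)^T
```

`y_ref` and `y_rot` agree up to floating-point error. In the quantized setting the two factors `xR` and `WR` are each passed through the quantizer, which is where the rotation helps by spreading outliers across channels.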

Support for rotation / SmoothQuant scales fine-tuning (SpinQuant/OSTQuant)

We support fine-tuning joint rotations and smoothing scales as a non-destructive transformation $O = DR$, where $R$ is an orthogonal matrix and $D$ is a diagonal matrix (SmoothQuant scales), such that:

$$\begin{aligned}
y &= xOO^{-1}W^T \\
&= xDRR^TD^{-1}W^T \\
&= xDR \times (WD^{-1}R)^T \\
&= \dots\ x'R \times (WD^{-1}R)^T
\end{aligned}$$
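A small numeric check makes the "non-destructive" property concrete. The sketch below uses a 2x2 rotation for $R$ and an arbitrary positive diagonal for $D$, then confirms that $xDR \times (WD^{-1}R)^T$ reproduces $xW^T$. The values and helper names are illustrative only and nothing here calls Quark.

```python
import math


def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


theta = 0.3
r = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]   # orthogonal matrix R
d = [2.0, 0.5]                              # diagonal of D (smoothing scales)

x = [[1.0, -4.0]]                           # activation, shape (1, 2)
w = [[0.5, 2.0], [-1.0, 0.25], [3.0, 1.0]]  # weight, shape (out=3, in=2)

x_d = [[v * s for v, s in zip(row, d)] for row in x]      # x D
w_dinv = [[v / s for v, s in zip(row, d)] for row in w]   # W D^{-1}

y_ref = matmul(x, transpose(w))                                  # x W^T
y_new = matmul(matmul(x_d, r), transpose(matmul(w_dinv, r)))     # xDR (W D^{-1} R)^T
```

The two outputs match up to rounding, so $D$ and $R$ can be tuned freely for quantization quality without changing the unquantized function.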

The support is well tested for llama, qwen3, qwen3_moe and gpt_oss architectures.

Rotation fine-tuning and online rotations are compatible with other algorithms such as GPTQ or Qronos.

Please refer to the documentation of RotationConfig, the example and the user guide for more details.

Minor changes and bug fixes

  • Fix memory duplication and OOM issues when loading gpt_oss models for quantization.
  • ModelQuantizer.freeze now permanently quantizes weights. Weights are still stored in high precision, but QDQ (quantize + dequantize) has already been applied to them, which avoids rerunning QDQ on static weights at each subsequent call.
  • scaled_fake_quantize operator, which is used for QDQ, is now by default compiled with torch.compile, allowing significant speedups depending on the quantization scheme (1x - 8x).
  • An efficient MXFP4 dynamic quantization kernel is used for activations when quantizing models, fusing scale computation and QDQ operations.
  • Batching support is fixed in lm-evaluation-harness integration in the examples, correctly passing the user-provided --eval_batch_size.
  • CPU/GPU communication is removed in quantization observers, allowing for faster quantization and runtime during e.g. the evaluation of models.

Deprecations and breaking changes

  • Quantization scheme names in examples/torch/language_modeling/llm_ptq/quantize_quark.py and quark torch-llm-ptq CLI have been simplified and renamed:

    • w_int4_per_group_sym is deprecated in favor of int4_wo_32, int4_wo_64, int4_wo_128 (depending on group size).
    • w_uint4_per_group_asym is deprecated in favor of uint4_wo_32, uint4_wo_64, uint4_wo_128 (depending on group size).
    • w_int8_a_int8_per_tensor_sym is deprecated in favor of int8.
    • w_fp8_a_fp8 is deprecated in favor of fp8.
    • w_mxfp4_a_mxfp4 is deprecated in favor of mxfp4.
    • w_mxfp4_a_fp8 is deprecated in favor of mxfp4_fp8.
    • w_mxfp6_e3m2_a_mxfp6_e3m2 is deprecated in favor of mxfp6_e3m2.
    • w_mxfp6_e2m3_a_mxfp6_e2m3 is deprecated in favor of mxfp6_e2m3.
    • w_bfp16_a_bfp16 is deprecated in favor of bfp16.
    • w_mx6_a_mx6 is deprecated in favor of mx6.
  • The --group_size and --group_size_per_layer arguments in examples/torch/language_modeling/llm_ptq/quantize_quark.py and quark torch-llm-ptq CLI have been removed. Group size is now embedded in the scheme name (e.g., int4_wo_32, int4_wo_64, int4_wo_128).

  • The --layer_quant_scheme argument format in examples/torch/language_modeling/llm_ptq/quantize_quark.py and quark torch-llm-ptq CLI has changed to repeated arguments with pattern and scheme pairs (e.g., --layer_quant_scheme lm_head int8 --layer_quant_scheme '*down_proj' fp8).

  • The token counter used to count the number of tokens seen by each expert during calibration is now disabled by default; enabling it requires setting the environment variable QUARK_COUNT_OBSERVED_SAMPLES=1.

  • The export format "quark_format" is removed, following deprecation in AMD Quark 0.10. Additionally, quark.torch.export.api.ModelExporter and quark.torch.export.api.ModelImporter are removed, please refer to the 0.10 release notes and to the documentation for the current API.
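Scripts that still pass the old scheme names can be migrated with a lookup built directly from the rename list above. The helper below is a hypothetical convenience, not part of Quark; only the name mappings and the group sizes 32, 64, and 128 come from the notes.

```python
# Old names whose replacement depends on the group size.
GROUPED_RENAMES = {
    "w_int4_per_group_sym": "int4_wo_{group_size}",
    "w_uint4_per_group_asym": "uint4_wo_{group_size}",
}

# Old names with a single fixed replacement.
FIXED_RENAMES = {
    "w_int8_a_int8_per_tensor_sym": "int8",
    "w_fp8_a_fp8": "fp8",
    "w_mxfp4_a_mxfp4": "mxfp4",
    "w_mxfp4_a_fp8": "mxfp4_fp8",
    "w_mxfp6_e3m2_a_mxfp6_e3m2": "mxfp6_e3m2",
    "w_mxfp6_e2m3_a_mxfp6_e2m3": "mxfp6_e2m3",
    "w_bfp16_a_bfp16": "bfp16",
    "w_mx6_a_mx6": "mx6",
}

SUPPORTED_GROUP_SIZES = (32, 64, 128)


def migrate_scheme(old_name, group_size=None):
    """Map a pre-0.11 scheme name (plus optional group size) to the new name."""
    if old_name in FIXED_RENAMES:
        return FIXED_RENAMES[old_name]
    if old_name in GROUPED_RENAMES:
        if group_size not in SUPPORTED_GROUP_SIZES:
            raise ValueError(f"group_size must be one of {SUPPORTED_GROUP_SIZES}")
        return GROUPED_RENAMES[old_name].format(group_size=group_size)
    raise KeyError(f"unknown legacy scheme name: {old_name}")
```

For instance, a former invocation with `w_int4_per_group_sym` and `--group_size 128` maps to the single scheme name `int4_wo_128`.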

AMD Quark for ONNX

New Features

  • Auto Search Pro

    • Hierarchical Search: Support for conditional and nested hyperparameter trees for advanced search strategies.
    • Custom Objectives: Support custom evaluation logic that perfectly aligns with specific needs.
    • Sampler Flexibility: Various samplers ('TPE', 'Grid Search', etc.) are available.
    • Parallel search: Take advantage of parallelization to run multiple searches simultaneously, reducing time to solution.
    • Checkpoint: Resume interrupted hyperparameter optimization from the last checkpoint.
    • Visualization: View real-time visualizations that show your optimization performance and feature importance, making it easier to interpret results.
    • Output Saving: Automatically save the best configuration, study database, and generated plots for your analysis.
  • Latency and memory usage profiling

    • Latency Profiling: Each quantization stage performs specific operations that contribute to the overall quantization pipeline, and the latency of each stage is reported in the profiling results.

    • Memory profiling

      • CPU Memory Profiling: By wrapping the Python script with mprof, we can record detailed memory traces during execution.
      • ROCM GPU Memory Profiling: For workflows involving ROCMExecutionProvider or any GPU-based quantization step, Quark ONNX offers a lightwe...

Release 0.10


@thiagocrepaldi thiagocrepaldi released this 26 Sep 22:24

Release Notes

Release 0.10

  • AMD Quark for PyTorch

    • New Features

      • Support PyTorch 2.7.1 and 2.8.0.
      • Support for int3 quantization and exporting of models.
      • Support the AWQ algorithm with Gemma3 and Phi4.
      • Support Qronos advanced quantization algorithm.
      • Applying the GPTQ algorithm runs 3x-4x faster compared to AMD Quark 0.9, using CUDA/HIP Graph by default. If required, CUDA Graph for GPTQ can be disabled using the environment variable QUARK_GRAPH_DEBUG=0.
      • Quarot algorithm supports a new configuration parameter rotation_size to define custom hadamard rotation sizes. Please refer to QuaRotConfig documentation.
      • Support the Qronos post-training quantization algorithm. Please refer to the arXiv paper and Quark documentation.
    • QuantizationSpec check:

      • A configuration check now runs automatically whenever a QuantizationSpec is initialized. If an invalid configuration is supplied, a warning or error message is raised immediately, so that problems surface early rather than as runtime errors during quantization.
    • LLM Depth-Wise Pruning tool:

      • A depth-wise pruning tool that reduces LLM size by deleting a run of consecutive decoder layers, sized according to a user-supplied pruning ratio.
      • Layers are selected by their influence on perplexity (PPL): the consecutive layers whose removal affects PPL the least are treated as the least important to the model and are deleted.
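The selection step can be pictured as a sliding-window search. The sketch below is a simplified stand-in, not the tool's actual procedure: it assumes a per-layer PPL influence score is already available and approximates the influence of a window by summing those scores, whereas the real tool may measure PPL with each candidate window removed. The function name and inputs are hypothetical.

```python
def least_influential_window(ppl_influence, pruning_ratio):
    """Pick the run of consecutive layers with the smallest summed influence.

    ppl_influence: one non-negative score per decoder layer (assumed input).
    pruning_ratio: fraction of layers to remove, between 0 and 1.
    Returns the indices of the layers to delete.
    """
    if not 0.0 <= pruning_ratio <= 1.0:
        raise ValueError("pruning_ratio must be between 0 and 1")
    n = len(ppl_influence)
    k = int(n * pruning_ratio)  # number of consecutive layers to drop
    if k <= 0:
        return []
    best_start = min(
        range(n - k + 1),
        key=lambda start: sum(ppl_influence[start:start + k]),
    )
    return list(range(best_start, best_start + k))
```

With scores `[5.0, 0.2, 0.1, 0.3, 4.0, 6.0]` and a ratio of 0.5, the window covering layers 1 to 3 has the lowest total and would be removed.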
    • Model Support:

      • Support OCP MXFP4, MXFP6, MXFP8 quantization of new models: DeepSeek-R1, Llama4-Scout, Llama4-Maverick, gpt-oss-20b, gpt-oss-120b.
    • Deprecations and breaking changes

      • OCP MXFP6 weight packing layout is modified to fit the expected layout by CDNA4 mfma_scale instruction.

      • In the examples/language_modeling/llm_ptq/quantize_quark.py example, the quantization scheme w_mxfp4_a_mxfp6 is removed and replaced by w_mxfp4_a_mxfp6_e2m3 and w_mxfp4_a_mxfp6_e3m2.

    • Important bug fixes

      • A bug in Quarot and Rotation algorithms where fused rotations were wrongly applied twice on input embeddings / LM head weights is fixed.

      • Sped up reloading of large quantized models such as DeepSeek-R1 when using Transformers + Quark.

  • AMD Quark for ONNX

    • New Features:

      • API Refactor (Introduced the new API design with improved consistency and usability)

        • Supported class-based algorithm usage.
        • Aligned data type both for Quark Torch and Quark ONNX.
        • Refactored quantization configs.
      • Auto Search Enhancements

        • Two-Stage Search: First identifies the best calibration config, then searches for the optimal FastFinetune config based on it. Expands the search space for higher efficiency.
        • Advanced-Fastft Search: Supports continuous search spaces, advanced algorithms (e.g., TPE), and parallel execution for faster, smarter searching.
        • Joint-Parameter Search: Combines coupled parameters into a unified space to avoid ineffective configurations and improve search quality.
      • Added support for ONNX 1.19 and ONNXRuntime 1.22.1

      • Added optimized weight-scale calculation with the MinMSE method to improve quantization accuracy.

      • Accelerated calibration with multi-process support, covering algorithms such as MinMSE, Percentile, Entropy, Distribution, and LayerwisePercentile.

      • Added progress bars for Percentile, Entropy, Distribution, and LayerwisePercentile algorithms.

      • Added support for specifying a directory in which to save cache files.

    • Enhancements:

      • Significantly reduced memory usage across various configurations, including calibration and FastFinetune stages, with optimizations for both CPU and GPU memory.
      • Improved clarity of error and warning outputs, helping users select better parameters based on memory and disk conditions.
    • Bug fixes and minor improvements:

      • Provided actionable hints when OOM or insufficient disk space issues occur in calibration and fast fine-tuning.
      • Fixed multi-GPU issues during FastFinetune.
      • Fixed a bug related to converting BatchNorm to Conv.
      • Fixed a bug in BF16 conversion on models larger than 2GB.
  • Quark Torch API Refactor

    • LLMTemplate for simplified quantization configuration:

      • Introduced the LLMTemplate class for convenient LLM quantization configuration
      • Built-in templates for popular LLM architectures (Llama4, Qwen, Mistral, Phi, DeepSeek, GPT-OSS, etc.)
      • Support for multiple quantization schemes: int4/uint4 (group sizes 32, 64, 128), int8, fp8, mxfp4, mxfp6e2m3, mxfp6e3m2, bfp16, mx6
      • Advanced features: layer-wise quantization, KV cache quantization, attention quantization
      • Algorithm support: AWQ, GPTQ, SmoothQuant, AutoSmoothQuant, Rotation
      • Custom template and scheme registration capabilities for users to define their own template and quantization schemes
            from quark.torch import LLMTemplate

            # List available templates
            templates = LLMTemplate.list_available()
            print(templates)  # ['llama', 'opt', 'qwen', 'mistral', ...]

            # Get a specific template
            llama_template = LLMTemplate.get("llama")

            # Create a basic configuration
            config = llama_template.get_config(scheme="fp8", kv_cache_scheme="fp8")
  • Export and import APIs are deprecated in favor of new ones:

    • ModelExporter.export_safetensors_model is deprecated in favor of export_safetensors:

      Before:

            from quark.torch import ModelExporter
            from quark.torch.export.config.config import ExporterConfig, JsonExporterConfig

            export_config = ExporterConfig(json_export_config=JsonExporterConfig())
            exporter = ModelExporter(config=export_config, export_dir=export_dir)
            exporter.export_safetensors_model(model, quant_config)
     After:
            from quark.torch import export_safetensors
            export_safetensors(model, output_dir=export_dir)
    • ModelImporter.import_model_info is deprecated in favor of import_model_from_safetensors:

     Before:
            from quark.torch.export.api import ModelImporter

            model_importer = ModelImporter(
               model_info_dir=export_dir,
               saved_format="safetensors"
            )
            quantized_model = model_importer.import_model_info(original_model)
     After:
            from quark.torch import import_model_from_safetensors
            quantized_model = import_model_from_safetensors(
               original_model,
               model_dir=export_dir
            )
  • Quark ONNX API Refactor

    • Before:

      • Basic Usage:
           from quark.onnx import ModelQuantizer
           from quark.onnx.quantization.config.config import Config
           from quark.onnx.quantization.config.custom_config import get_default_config

           input_model_path = "demo.onnx"
           quantized_model_path = "demo_quantized.onnx"
           calib_data_path = "calib_data"
           calib_data_reader = ImageDataReader(calib_data_path)

           a8w8_config = get_default_config("A8W8")
           quantization_config = Config(global_quant_config=a8w8_config)
           quantizer = ModelQuantizer(quantization_config)
           quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)
      • Advanced Usage:
	   from quark.onnx import ModelQuantizer
	   from quark.onnx.quantization.config.config import Config, QuantizationConfig
	   from onnxruntime.quantization.calibrate import CalibrationMethod
	   from onnxruntime.quantization.quant_utils import QuantFormat, QuantType, ExtendedQuantType

	   input_model_path = "demo.onnx"
	   quantized_model_path = "demo_quantized.onnx"
	   calib_data_path = "calib_data"
	   calib_data_reader = ImageDataReader(calib_data_path)

	   DEFAULT_ADAROUND_PARAMS = {
	       "DataSize": 1000,
	       "FixedSeed": 1705472343,
	       "BatchSize": 2,
	       "NumIterations": 1000,
	       "LearningRate": 0.1,
	       "OptimAlgorithm": "adaround",
	       "OptimDevice": "cpu",
	       "InferDevice": "cpu",
	       "EarlyStop": True,
	   }

	   quant_config = QuantizationConfig(
	       calibrate_method=CalibrationMethod.Percentile,
	       quant_format=QuantFormat.QDQ,
	       activation_type=QuantType.QInt8,
	       weight_type=QuantType.QInt8,
	       nodes_to_exclude=["/layer.2/Conv_1", "^/Conv/.*"],
	       subgraphs_to_exclude=[(["start_node_1", "start_node_2"], ["end_node_1", "end_node_2"])],
	       include_cle=True,
	       include_fast...