Integrate AutoRound into Diffusers #13552
@@ -0,0 +1,146 @@
<!-- Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoRound

[AutoRound](https://github.com/intel/auto-round) is an advanced quantization toolkit. It achieves high accuracy at ultra-low bit widths (2-4 bits) with minimal tuning by leveraging sign-gradient descent, and it provides broad hardware compatibility. See our papers [SignRoundV1](https://arxiv.org/pdf/2309.05516) and [SignRoundV2](https://arxiv.org/abs/2512.04746) for more details.
Install `auto-round` (version ≥ 0.13.0):

```bash
pip install "auto-round>=0.13.0"
```

To use the Marlin kernel for faster CUDA inference, install `gptqmodel`:

```bash
pip install "gptqmodel>=5.8.0"
```
## Load a quantized model

Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. The method works with any model that loads via [Accelerate](https://hf.co/docs/accelerate/index) and has `torch.nn.Linear` layers.

```python
import torch
from diffusers import ZImageTransformer2DModel, ZImagePipeline, AutoRoundConfig

model_id = "INCModel/Z-Image-W4A16-AutoRound"

quantization_config = AutoRoundConfig(backend="marlin")
transformer = ZImageTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

pipe = ZImagePipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
    device_map="cuda",  # review comment: device -> "auto"
)

image = pipe("a cat holding a sign that says hello").images[0]
image.save("output.png")
```
> [!NOTE]
> AutoRound in Diffusers only supports loading *pre-quantized* models. To quantize a model from scratch, use the [AutoRound CLI or Python API](https://github.com/intel/auto-round) directly, then load the result with Diffusers.

## Backends

AutoRound supports multiple inference backends. The backend controls which kernel handles dequantization during the forward pass. Set the `backend` parameter in [`AutoRoundConfig`] to choose one:

| Backend | Value | Device | Requirements | Notes |
|---------|-------|--------|--------------|-------|
| **Auto** | `"auto"` | Any | — | Default. Automatically selects the best available backend. |
| **PyTorch** | `"torch"` | CPU / CUDA | — | Pure PyTorch implementation. Broadest compatibility. |
| **Triton** | `"tritonv2"` | CUDA | `triton` | Triton-based kernel for GPU inference. |
| **ExllamaV2** | `"exllamav2"` | CUDA | `gptqmodel>=5.8.0` | Good CUDA performance via the ExllamaV2 kernel. |
| **Marlin** | `"marlin"` | CUDA | `gptqmodel>=5.8.0` | Best CUDA performance via the Marlin kernel. |
```python
from diffusers import AutoRoundConfig

# Auto-select (default)
config = AutoRoundConfig()

# Explicit Triton backend for CUDA
config = AutoRoundConfig(backend="tritonv2")

# Marlin backend for best CUDA performance (requires gptqmodel>=5.8.0)
config = AutoRoundConfig(backend="marlin")

# ExllamaV2 backend for good CUDA performance (requires gptqmodel>=5.8.0)
config = AutoRoundConfig(backend="exllamav2")

# PyTorch backend for CPU/CUDA inference
config = AutoRoundConfig(backend="torch")
```
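To make the `"auto"` row in the table concrete, here is a hypothetical sketch of what an automatic selection policy could look like. The function name and the exact preference order are illustrative assumptions, not the library's actual logic:

```python
import importlib.util

def pick_backend(device: str) -> str:
    """Illustrative "auto" policy: prefer the fastest kernel that is installed."""
    if device == "cuda":
        if importlib.util.find_spec("gptqmodel") is not None:
            return "marlin"  # best CUDA performance when gptqmodel is available
        if importlib.util.find_spec("triton") is not None:
            return "tritonv2"  # Triton kernel as the next-best CUDA option
    return "torch"  # pure-PyTorch fallback works on CPU and CUDA

print(pick_backend("cpu"))
```

The key design point is graceful degradation: the fastest kernel is used when its dependency is present, and the portable PyTorch path is the universal fallback.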
## Quantization configurations

AutoRound focuses on weight-only quantization. The primary configuration is W4A16 (4-bit weights, 16-bit activations), with flexibility in group size and symmetry:

| Configuration | `bits` | `group_size` | `sym` | Description |
|--------------|--------|-------------|-------|-------------|
| W4G128 asymmetric | `4` | `128` | `False` | Default. Good balance of accuracy and compression. |
| W4G128 symmetric | `4` | `128` | `True` | Faster dequantization, small accuracy trade-off. |
| W4G32 asymmetric | `4` | `32` | `False` | Higher accuracy at the cost of more metadata. |
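To make the table concrete, here is a minimal, self-contained sketch of what group-wise asymmetric weight quantization means. This is plain Python for illustration only; AutoRound's real implementation packs weights and optimizes the rounding with sign-gradient descent, which this omits:

```python
def quantize_group(weights, bits=4):
    """Asymmetric quantization of one group: map floats onto [0, 2**bits - 1]."""
    qmax = (1 << bits) - 1  # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)  # integer zero-point (asymmetric, i.e. sym=False)
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    """Recover approximate float weights from the packed integers."""
    return [(v - zero) * scale for v in q]

weights = [-0.5, 0.1, 0.3, 0.9]
q, scale, zero = quantize_group(weights)
restored = dequantize_group(q, scale, zero)
# Each restored weight is within half a quantization step of the original.
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
```

A smaller `group_size` means one `(scale, zero)` pair is stored per fewer weights: more metadata, but each group's range is tighter, so the rounding error shrinks. This is the trade-off the W4G32 row describes.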
## Save and load

<hfoptions id="save-and-load">
<hfoption id="save">

```python
from auto_round import AutoRound

autoround = AutoRound(
    tiny_z_image_model_path,  # path to the model to quantize
    num_inference_steps=3,
    guidance_scale=7.5,
    dataset="coco2014",
)
autoround.quantize_and_save("Z-Image-W4A16-AutoRound")
```

</hfoption>
<hfoption id="load">
```python
import torch
from diffusers import ZImageTransformer2DModel, ZImagePipeline

model_id = "INCModel/Z-Image-W4A16-AutoRound"

# The inference backend will be automatically selected.
pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",  # review comment: how about just setting the device_map to "auto"
)

image = pipe("a cat holding a sign that says hello").images[0]
image.save("output.png")
```

</hfoption>
</hfoptions>

## Resources

- [Pre-quantized AutoRound models on the Hub](https://huggingface.co/models?search=autoround)
@@ -0,0 +1 @@
from .autoround_quantizer import AutoRoundQuantizer
@@ -0,0 +1,128 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING

from ...utils import (
    is_auto_round_available,
    is_torch_available,
    logging,
)
from ..base import DiffusersQuantizer
if TYPE_CHECKING:
    from ...models.modeling_utils import ModelMixin


if is_torch_available():
    import torch

logger = logging.get_logger(__name__)


class AutoRoundQuantizer(DiffusersQuantizer):
    r"""
    Diffusers Quantizer for AutoRound (https://github.com/intel/auto-round).

    AutoRound is a weight-only quantization method that uses sign gradient descent to jointly optimize
    rounding values and min-max ranges for weights. It supports W4A16 (4-bit weight, 16-bit activation)
    quantization for efficient inference.

    This quantizer only supports loading pre-quantized AutoRound models. On-the-fly quantization
    (calibration) is not supported through this interface.
    """
    # AutoRound requires data calibration — we only support loading pre-quantized checkpoints.
    requires_calibration = True
    required_packages = ["auto_round"]

    def __init__(self, quantization_config, **kwargs):
        super().__init__(quantization_config, **kwargs)

    def validate_environment(self, *args, **kwargs):
        """
        Validates that the auto-round library (>= 0.5) is installed and captures the device_map
        for later use during model conversion.
        """
        self.device_map = kwargs.get("device_map", None)
        if not is_auto_round_available():
            raise ImportError(
                "Loading an AutoRound quantized model requires the auto-round library "
                "(`pip install 'auto-round>=0.5'`)"  # review comment: 0.10 ?
            )
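A version gate like the one `validate_environment` describes can be sketched with the standard library alone. The function name and the naive comparison policy below are illustrative assumptions, not Diffusers' actual helper:

```python
from importlib import metadata

def has_min_version(package: str, minimum: str) -> bool:
    """Return True if `package` is installed at version >= `minimum`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False

    def as_tuple(v: str):
        # Naive numeric comparison; production code should use packaging.version.
        return tuple(int(p) for p in v.split(".") if p.isdigit())

    return as_tuple(installed) >= as_tuple(minimum)

if not has_min_version("auto_round", "0.5"):
    print("pip install 'auto-round>=0.5'")
```

Raising `ImportError` with the exact pip command, as the quantizer does, turns a confusing downstream failure into a one-line fix for the user.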
    def _process_model_before_weight_loading(
        self,
        model: "ModelMixin",
        device_map,
        keep_in_fp32_modules: list[str] = [],
        **kwargs,
    ):
        """
        Replaces target nn.Linear layers with AutoRound's quantized QuantLinear layers before
        weights are loaded from the checkpoint.

        Uses `auto_round.inference.convert_model.convert_hf_model`, which:
        - Inspects the model architecture and the quantization config (bits, group_size, sym, backend).
        - Replaces eligible nn.Linear modules with the appropriate QuantLinear variant
          (the packed-weight layer that stores qweight, scales, and qzeros).
        - Returns the converted model and the set of backend names used.

        `infer_target_device` resolves the device_map into a single target device string
        that AutoRound uses to select the correct kernel backend (e.g. "cuda", "cpu").
        """
        from auto_round.inference.convert_model import convert_hf_model, infer_target_device

        if self.pre_quantized:
            target_device = infer_target_device(self.device_map)
            model, used_backends = convert_hf_model(model, target_device)
            self.used_backends = used_backends
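The module-swap step described above can be illustrated with a toy, torch-free sketch. All class names below are stand-ins for illustration, not AutoRound's actual types:

```python
class Linear:
    """Stand-in for torch.nn.Linear."""
    def __init__(self, name):
        self.name = name

class QuantLinear(Linear):
    """Stand-in for the packed-weight layer (would hold qweight, scales, qzeros)."""

class Block:
    """Toy module tree with one eligible layer and one that must be left alone."""
    def __init__(self):
        self.proj = Linear("proj")
        self.norm = object()  # non-Linear modules are not converted

def convert(module):
    """Recursively swap every Linear attribute for a QuantLinear."""
    for attr, child in list(vars(module).items()):
        if isinstance(child, Linear):
            setattr(module, attr, QuantLinear(child.name))
        elif hasattr(child, "__dict__"):
            convert(child)  # recurse into submodules
    return module

block = convert(Block())
assert isinstance(block.proj, QuantLinear)
```

The swap happens *before* weight loading so that the checkpoint's packed tensors (qweight, scales, qzeros) have matching destination parameters to load into.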
    def _process_model_after_weight_loading(self, model, **kwargs):
        """
        Finalizes the model after all quantized weights (qweight, scales, qzeros, etc.) have
        been loaded into the QuantLinear layers.

        Uses `auto_round.inference.convert_model.post_init`, which:
        - Performs backend-specific finalization (e.g. repacking weights into the kernel's
          expected memory layout, moving buffers to the correct device).
        - Freezes quantized parameters (requires_grad=False).
        - Prepares the model for inference.

        Raises ValueError if the model is not pre-quantized, since AutoRound does not support
        on-the-fly quantization through this loading path.
        """
        if self.pre_quantized:
            from auto_round.inference.convert_model import post_init

            post_init(model, self.used_backends)
        else:
            raise ValueError(
                "AutoRound quantizer in diffusers only supports pre-quantized models. "
                "Please provide a model that has already been quantized with AutoRound."
            )
        return model

    @property
    def is_trainable(self) -> bool:
        """AutoRound W4A16 pre-quantized models do not support training."""
        return False

    @property
    def is_serializable(self):
        """AutoRound quantized models can be serialized (the quantization config may be
        updated by the backend, e.g. for GPTQ/AWQ-compatible formats)."""
        return True
|
Review comment: In case there is a better backend in the future, we'd better not hard-code the backend choice like this. Besides, if the user has installed gptqmodel, we will use Marlin; otherwise, we will remind the user to install it.