
Integrate AutoRound into Diffusers#13552

Open
xin3he wants to merge 7 commits into huggingface:main from xin3he:auto_round

Conversation

@xin3he

@xin3he xin3he commented Apr 23, 2026

What does this PR do?

This pull request introduces support for the AutoRound quantization algorithm in the Diffusers library. AutoRound is a weight-only quantization method that enables efficient inference by optimizing weight rounding and min-max ranges, primarily targeting the W4A16 configuration (4-bit weights, 16-bit activations). The changes add a new quantization config, quantizer class, backend integration, and comprehensive documentation, while ensuring proper handling of optional dependencies.

Key changes:

AutoRound quantization support

  • Added a new AutoRoundConfig class to quantization_config.py for configuring AutoRound quantization parameters, including bits, group size, symmetry, backend, and modules to exclude from quantization.
  • Introduced the AutoRoundQuantizer class in quantizers/autoround/autoround_quantizer.py, implementing the logic for loading pre-quantized AutoRound models and integrating with the auto-round library.
  • Registered AutoRoundConfig and AutoRoundQuantizer in the quantization auto-mapping logic, enabling selection via the "auto-round" key.

Dependency management and import handling

  • Added is_auto_round_available utility and integrated it into the main import structure and conditional imports, ensuring that AutoRound features are only available if the dependency is installed. Dummy objects are provided otherwise.
  • Implemented a test utility require_auto_round_version_greater_or_equal for version-gated testing of AutoRound features.

Documentation

  • Added a comprehensive user guide at docs/source/en/quantization/autoround.md, including usage examples, backend options, configuration details, and resource links.

These changes collectively enable seamless integration of AutoRound quantization into Diffusers, with robust configuration, backend selection, and user guidance.

Existing Model

cc @wenhuach21 @thuang6 @hshen14

Before submitting

Who can review?

@yiyixuxu @asomoza @stevhliu @sayakpaul

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

xin3he added 3 commits April 10, 2026 15:19
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
@github-actions github-actions Bot added documentation Improvements or additions to documentation quantization tests utils size/L PR with diff > 200 LOC labels Apr 23, 2026
@xin3he xin3he changed the title Auto round Integrate AutoRound into Diffusers Apr 23, 2026
@xin3he
Author

xin3he commented Apr 23, 2026

  • [2025/05] AutoRound has been integrated into vLLM: Usage, Medium blog, Xiaohongshu (小红书).

  • [2025/05] AutoRound has been integrated into Transformers: Blog.

I would like to integrate AutoRound into Diffusers to support diffusion models. Although the performance improvement is not significant, memory usage is notably reduced.

@sayakpaul
Member

Thanks for this PR! Could you provide some example code using AutoRound and also some example outputs? Feel free to also report latency and memory consumption so that there's some signal into its effectiveness.

Cc: @SunMarc

Member

@stevhliu stevhliu left a comment


thanks for the integration!

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

@thuang6 thuang6 left a comment


LGTM

…ecified backend.

Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he
Author

xin3he commented Apr 24, 2026

Thank you all for your valuable feedback; I have updated the document and fixed the errors.

Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he
Author

xin3he commented Apr 24, 2026

Thanks for this PR! Could you provide some example code using AutoRound and also some example outputs? Feel free to also report latency and memory consumption so that there's some signal into its effectiveness.

Cc: @SunMarc

Apologies; due to policy restrictions, I am unable to provide specific latency data. Based on local testing, the runtime of the W4A16 Marlin backend is comparable to BF16.

For memory consumption, I observe that the original Z-Image model takes about 25 GB, while the quantized one takes about 16 GB.

You are welcome to use the code provided in the model card to reproduce these results. The generated image looks good.


model_id = "INCModel/Z-Image-W4A16-AutoRound"

quantization_config = AutoRoundConfig(backend="marlin")

@wenhuach21 wenhuach21 Apr 24, 2026


In case a better backend appears in the future, we'd better not hard-code it like this. Besides, if users have installed gptqmodel, we will use Marlin; otherwise, we will remind the user to install it.

pipe = ZImagePipeline.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="cuda",


How about just setting the device_map to "auto"?

model_id,
transformer=transformer,
torch_dtype=torch.bfloat16,
device_map="cuda",


device->auto

if not is_auto_round_available():
raise ImportError(
"Loading an AutoRound quantized model requires the auto-round library "
"(`pip install 'auto-round>=0.5'`)"


0.10 ?

@sayakpaul
Member

Thanks for this PR! Could you provide some example code using AutoRound and also some example outputs? Feel free to also report latency and memory consumption so that there's some signal into its effectiveness.
Cc: @SunMarc

Apologies; due to policy restrictions, I am unable to provide specific latency data. Based on local testing, the runtime of the W4A16 Marlin backend is comparable to BF16.

For memory consumption, I observe that the original Z-Image model takes about 25 GB, while the quantized one takes about 16 GB.

You are welcome to use the code provided in the model card to reproduce these results. The generated image looks good.

I don't understand. Won't it run on a CUDA GPU? You cannot expect maintainers to run the code to address the bare minimums, I am afraid. Having the output samples and having comparisons to other methods gives us confidence in the quantization method itself.
