PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models

Junxian Li; Kai Liu; Kaiwen Tao; Renjing Pei; Yongsen Cheng; Yulun Zhang; Zhikai Chen; Zhixin Wang

arxiv: 2605.09503 · v2 · pith:SCKZ2TKUnew · submitted 2026-05-10 · 💻 cs.CV

Yongsen Cheng, Kai Liu, Kaiwen Tao, Junxian Li, Zhixin Wang, Zhikai Chen, Renjing Pei, Yulun Zhang

Pith reviewed 2026-06-30 22:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords: post-training quantization · per-group quantization · channel reordering · diffusion models · low-bit inference · FLUX.1-dev · DiT · quantization error reduction

The pith

Reordering channels by joint second-moment similarity reduces per-group quantization error for low-bit diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that per-group quantization suffers large errors when channels with dissimilar activation and weight statistics share a scale, because outliers then dominate that scale. PermuQuant sorts channels according to a joint second-moment criterion to place similar ones together, then accepts the permutation only if it lowers measured error on calibration data. The chosen ordering is folded into adjacent weights or modules offline, so no explicit runtime permutation is needed. The paper reports lower quantization error than prior post-training methods and supports 4-bit weights and activations on large models such as FLUX.1-dev.
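The outlier effect described above can be reproduced with a toy example. This is a minimal sketch, not the paper's implementation: it assumes symmetric signed-integer quantization with one max-abs scale per contiguous group (the paper's headline results use NVFP4, a floating-point format), and the helper names `quantize_group` and `per_group_mse` and all numbers are invented for illustration.

```python
def quantize_group(values, bits=4):
    """Quantize-dequantize one group using a single shared max-abs scale."""
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit integers
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / qmax                   # the largest magnitude sets the scale
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_group_mse(row, group_size, bits=4):
    """Mean squared error after quantizing contiguous groups of `row`."""
    err = 0.0
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        dequantized = quantize_group(group, bits)
        err += sum((a - b) ** 2 for a, b in zip(group, dequantized))
    return err / len(row)


# Same four values, two channel orders (hand-picked toy numbers).
mixed = [0.10, 8.0, -0.12, -8.0]      # each group pairs a small and a large channel
grouped = [0.10, -0.12, 8.0, -8.0]    # small channels share a group, large ones too

mse_mixed = per_group_mse(mixed, group_size=2)
mse_grouped = per_group_mse(grouped, group_size=2)
```

In the mixed order the small values round to zero because the large channel sets the scale. With these hand-picked values the grouped order gives a far smaller error. Other values can go the other way, which is the case the calibration check is meant to catch.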

Core claim

PermuQuant shows that channel ordering is a controllable factor in per-group quantization; sorting channels by a joint second-moment criterion places statistically similar channels in the same group, and a calibration acceptance rule applies the reordering only when it reduces error on calibration data, with the permutation absorbed offline into the model.

What carries the argument

Joint second-moment criterion that ranks channels so those with comparable activation and weight statistics share a quantization scale.
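The ranking step might look like the sketch below. The exact form of the joint criterion is not given on this page, so the score used here (the activation second moment multiplied by the weight second moment of the same input channel) is purely an assumption, and `joint_scores` and `sort_permutation` are hypothetical names.

```python
def second_moment(values):
    """Mean of squares of a list of numbers."""
    return sum(v * v for v in values) / len(values)


def joint_scores(acts, weight):
    """Per-input-channel score from calibration activations and weights.

    acts:   list of calibration rows, each with C entries (one per channel)
    weight: list of output rows, each with C entries (one per input channel)
    """
    num_channels = len(weight[0])
    scores = []
    for j in range(num_channels):
        act_m2 = second_moment([row[j] for row in acts])
        w_m2 = second_moment([row[j] for row in weight])
        # ASSUMPTION: product of the two moments; the paper's criterion may differ.
        scores.append(act_m2 * w_m2)
    return scores


def sort_permutation(scores):
    """Channel indices ordered by ascending score."""
    return sorted(range(len(scores)), key=lambda j: scores[j])


acts = [[0.1, 5.0, 0.2, 4.0],
        [-0.1, -6.0, 0.1, 5.0]]
weight = [[1.0, 0.5, 1.0, 0.5],
          [-1.0, 0.5, 1.0, -0.5]]
perm = sort_permutation(joint_scores(acts, weight))
```

Sorting by a single scalar score places channels with similar scores next to each other, so contiguous groups of the sorted order contain channels of comparable magnitude.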

If this is right

Lower per-group quantization error at W4A4 and similar low-bit settings compared with existing PTQ baselines.
Up to 1.7× single-step speedup and 3.5× DiT memory reduction on FLUX.1-dev under W4A4 NVFP4.
Permutations absorbed offline so inference speed and memory gains require no extra runtime operations.
Consistent gains across multiple large diffusion models without retraining.
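Offline absorption of a permutation rests on a simple identity: reordering the output rows of a producing linear layer and the input columns of the consuming layer in the same way leaves the final result unchanged. The sketch below checks this on small integer matrices. It assumes nothing order-dependent sits between the two layers, and the helper names are illustrative rather than taken from the paper.

```python
def matvec(weight, x):
    """Dense matrix-vector product with plain lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def permute_rows(weight, perm):
    """Reorder output channels of the producing layer."""
    return [weight[j] for j in perm]


def permute_columns(weight, perm):
    """Reorder input channels of the consuming layer."""
    return [[row[j] for j in perm] for row in weight]


A = [[1, 2], [3, 4], [5, 6]]        # producer: 2 inputs -> 3 hidden channels
B = [[1, 0, 2], [0, 1, 1]]          # consumer: 3 hidden channels -> 2 outputs
x = [2, 1]
perm = [2, 0, 1]                    # hidden channel order chosen offline

h = matvec(A, x)
y_ref = matvec(B, h)

A_p = permute_rows(A, perm)         # done once, offline
B_p = permute_columns(B, perm)      # done once, offline
h_p = matvec(A_p, x)                # hidden activations now arrive reordered
y_perm = matvec(B_p, h_p)
```

Here `h_p` is exactly `h` read in the permuted order, so per-group quantization of the hidden activations sees the new grouping while `y_perm` equals `y_ref`.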

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same reordering idea could be tested on transformer-based models outside the diffusion family to check whether the second-moment grouping principle transfers.
Combining the channel permutation with existing outlier-suppression or mixed-precision techniques might produce additive error reductions.
The offline absorption step suggests the method can be applied once during model export and then used in any inference engine that supports static weight layouts.

Load-bearing premise

Channels grouped by similar second-moment statistics will incur lower error under a shared scale, and the calibration rule selects permutations that improve performance on unseen inputs.
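A calibration acceptance rule of the kind this premise refers to can be sketched as follows. It assumes the error being compared is the summed squared error of integer per-group activation quantization. The paper may measure a different quantity (for example layer-output error), and `group_error` and `accept_permutation` are hypothetical names.

```python
def group_error(rows, perm, group_size, bits=4):
    """Summed squared per-group quantization error with channels read in `perm` order."""
    qmax = 2 ** (bits - 1) - 1
    total = 0.0
    for row in rows:
        ordered = [row[j] for j in perm]
        for start in range(0, len(ordered), group_size):
            group = ordered[start:start + group_size]
            amax = max(abs(v) for v in group)
            if amax == 0.0:
                continue                      # an all-zero group has no error
            scale = amax / qmax
            for v in group:
                q = max(-qmax, min(qmax, round(v / scale)))
                total += (v - q * scale) ** 2
    return total


def accept_permutation(calib_rows, candidate, group_size, bits=4):
    """Keep `candidate` only if it lowers calibration error; otherwise keep the identity."""
    identity = list(range(len(candidate)))
    base = group_error(calib_rows, identity, group_size, bits)
    cand = group_error(calib_rows, candidate, group_size, bits)
    return candidate if cand < base else identity


calib = [[0.10, 8.0, -0.12, -8.0],
         [0.2, 4.0, 0.1, -4.0]]
chosen = accept_permutation(calib, [0, 2, 1, 3], group_size=2)

already_grouped = [[0.10, -0.12, 8.0, -8.0]]
kept = accept_permutation(already_grouped, [0, 2, 1, 3], group_size=2)
```

The first call accepts the candidate because it separates small and large channels. The second call rejects the same candidate because the layer was already well grouped. The rule only sees the calibration rows, so whether the accepted order helps on other inputs has to be checked separately.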

What would settle it

Measure the quantization error or generation quality (FID or similar) on a held-out test set after applying the accepted permutations versus the original ordering; absence of consistent reduction would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.09503 by Junxian Li, Kai Liu, Kaiwen Tao, Renjing Pei, Yongsen Cheng, Yulun Zhang, Zhikai Chen, Zhixin Wang.

**Figure 1.** PermuQuant is a post-training quantization framework for low-bit diffusion models. In the W4A4 setting, it achieves 3.5× DiT memory reduction and 6.3× speedup on a single RTX 5090 32GB GPU by eliminating CPU offloading. In the challenging W3A3 setting, PermuQuant still produces visually clean results with faithful details, significantly outperforming other baselines.

**Figure 2.** Example of per-group quantization. Quantization error is greatly affected by channel orders.

**Figure 3.** (a) and (b): Relative change in activation quantization error caused by random channel …

**Figure 4.** Overview of PermuQuant. (a) Channel reordering places channels with similar statistics …

**Figure 5.** Qualitative comparison on SANA-1.5-1.6B and FLUX.1-dev. All methods are evaluated …

**Figure 6.** Efficiency comparison on FLUX.1-dev using an RTX 5090 Desktop 32GB GPU. Reordering …

**Figure 7.** Qualitative comparison on SANA-1.5-1.6B. The dashed line separates BF16 from the …

**Figure 8.** Qualitative comparison on FLUX.1-dev. The dashed line separates BF16 from the quantized …

**Figure 9.** Qualitative comparison on Z-Image-Turbo. The dashed line separates BF16 from the …

Original abstract

Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ framework for low-bit diffusion models. PermuQuant sorts channels by a joint second-moment criterion before per-group quantization, placing channels with similar activation and weight statistics into the same group. It further uses a calibration-based acceptance rule to apply reordering only when the selected permutation reduces quantization error on calibration data. The selected permutations are absorbed into adjacent modules or applied to weights offline, avoiding explicit runtime permutation operations. Extensive experiments on multiple large diffusion models show that PermuQuant consistently reduces quantization error and outperforms existing PTQ baselines. On FLUX.1-dev with an RTX 5090, PermuQuant achieves up to a 1.7× single-step speedup and reduces the DiT memory footprint by 3.5× under W4A4 NVFP4 quantization. Code will be available at https://github.com/yscheng04/PermuQuant.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Channel reordering by second-moment stats is a straightforward PTQ tweak for diffusion models that can cut error in low-bit settings, but the fixed permutation chosen on calibration data risks failing when activations shift across timesteps.

The paper's main move is to reorder channels before per-group quantization so that ones with similar joint second-moment statistics land in the same group, then only keep the permutation if it lowers error on a calibration set. The reordering is absorbed into weights or nearby layers, so there is no extra runtime cost. This is presented as a simple addition to existing PTQ pipelines for diffusion models.

It does a reasonable job spelling out why mismatched channels in one group can let outliers set the scale and inflate error. The reported numbers on FLUX.1-dev under W4A4 NVFP4—1.7× single-step speedup and 3.5× DiT memory cut—are the kind of concrete deployment gains that matter for single-GPU work.

The soft spot is robustness across timesteps. Diffusion models change activation distributions strongly with timestep. A permutation picked from a finite calibration set at selected timesteps can put dissimilar channels together at other timesteps, and the acceptance rule only checks the calibration distribution. Nothing in the abstract shows they tested robustness across the full timestep range or ran ablations that would rule this out. Without those checks the claimed consistent outperformance is hard to trust.

The experimental claims are stated without protocol details, error bars, or dataset breakdowns in the abstract, so the strength of the evidence cannot be judged from what is here. No formal proofs or released artifacts are mentioned beyond a future code link.

This is for practitioners who need to squeeze large diffusion models onto limited hardware. A reader already working on PTQ for generative models would find the idea worth trying if the full experiments address the timestep issue. It is worth sending to a serious referee because the problem is real and the method is lightweight, even though the generalization claim needs more scrutiny.

Referee Report

3 major / 2 minor

Summary. The paper proposes PermuQuant, a post-training quantization (PTQ) method for diffusion models that reorders channels according to a joint second-moment criterion before applying per-group quantization. Channels with similar activation and weight statistics are grouped to avoid outlier-dominated scales. A calibration-based acceptance rule decides whether to apply the permutation, which is then absorbed into adjacent modules or weights offline. The central empirical claim is that this consistently lowers quantization error relative to existing PTQ baselines and yields up to 1.7× single-step speedup and 3.5× DiT memory reduction on FLUX.1-dev under W4A4 NVFP4 quantization.

Significance. If the reported gains hold under proper controls, the work supplies a lightweight, training-free technique that directly targets a known weakness of per-group quantization in large generative models. The offline absorption of permutations and the promise of public code are practical strengths that would aid reproducibility.

major comments (3)

§3 (method description): the joint second-moment criterion produces a single fixed permutation per layer derived from a finite calibration set at selected timesteps. Because diffusion models exhibit strong timestep-dependent variation in activation distributions, the paper must demonstrate that groupings optimal on calibration data do not increase error at other timesteps; no such analysis or cross-timestep ablation is supplied, leaving the generalization claim unsupported.
§4 (experiments): the acceptance rule only verifies error reduction on the calibration distribution, yet the central claim is that PermuQuant “consistently reduces quantization error” on held-out runs. Without reporting calibration-set size, timestep sampling strategy, or error metrics on a disjoint set of timesteps, it is impossible to rule out overfitting to the calibration distribution.
Table 1 / Figure 2 (quantitative results): the reported speed and memory gains on FLUX.1-dev are presented without error bars, multiple random seeds, or an explicit statement of the calibration data used; these omissions are load-bearing because the method’s benefit is defined by the acceptance rule’s ability to select permutations that generalize.

minor comments (2)

The abstract states that code will be released at a GitHub link, but the manuscript itself contains no pointer to supplementary material or a reproducibility checklist.
Notation for the joint second-moment criterion is introduced without an explicit equation number; readers must infer the precise formula from surrounding text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and will incorporate the requested analyses and clarifications in a revised version.

Referee: §3 (method description): the joint second-moment criterion produces a single fixed permutation per layer derived from a finite calibration set at selected timesteps. Because diffusion models exhibit strong timestep-dependent variation in activation distributions, the paper must demonstrate that groupings optimal on calibration data do not increase error at other timesteps; no such analysis or cross-timestep ablation is supplied, leaving the generalization claim unsupported.

Authors: We agree that an explicit cross-timestep analysis is needed to support generalization. Our calibration samples activations across a range of timesteps, but we did not provide a dedicated ablation on unseen timesteps. In the revision we will add such an ablation demonstrating that the selected permutations do not increase error outside the calibration timesteps.
revision: yes

Referee: §4 (experiments): the acceptance rule only verifies error reduction on the calibration distribution, yet the central claim is that PermuQuant “consistently reduces quantization error” on held-out runs. Without reporting calibration-set size, timestep sampling strategy, or error metrics on a disjoint set of timesteps, it is impossible to rule out overfitting to the calibration distribution.

Authors: We will revise the experimental section to report the exact calibration-set size, the timestep sampling strategy, and error metrics evaluated on a disjoint set of timesteps. This will directly address the overfitting concern and strengthen the consistency claim.
revision: yes

Referee: Table 1 / Figure 2 (quantitative results): the reported speed and memory gains on FLUX.1-dev are presented without error bars, multiple random seeds, or an explicit statement of the calibration data used; these omissions are load-bearing because the method’s benefit is defined by the acceptance rule’s ability to select permutations that generalize.

Authors: We acknowledge the value of reporting variability. The revised manuscript will add error bars from multiple random seeds to Table 1 and Figure 2 and will explicitly describe the calibration data used for the FLUX.1-dev experiments.
revision: yes

Circularity Check

0 steps flagged

Empirical PTQ heuristic with no load-bearing circular steps

The paper presents PermuQuant as an empirical post-training quantization method that reorders channels according to a joint second-moment criterion and applies a calibration acceptance rule to retain only permutations that reduce measured quantization error on the calibration set. All reported gains are obtained by direct experimental comparison against baselines on held-out model evaluations; no equation, uniqueness theorem, or self-citation is invoked to derive the improvement by construction from the inputs themselves. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that mismatched channel statistics within a quantization group are the dominant source of error and that a second-moment sort plus calibration check will reliably mitigate it. No free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Channel ordering materially affects per-group quantization error when statistics differ within a group. Stated as the key observation motivating the method.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers
cs.CV · 2026-07 · unverdicted · novelty 7.0

OrbitQuant is a data-agnostic PTQ technique for DiTs that uses RPBH rotation in a normalized basis to enable a single codebook across all inputs, achieving SOTA low-bit performance on FLUX.1, CogVideoX and similar models.

Appendix excerpt

The fused normalization kernel:

loads one activation row as x[π(1)], ..., x[π(K)] from global memory;
computes the RMS statistic or the mean and variance on the loaded values;
applies the corresponding channel-wise scale or modulation;
writes the reordered normalized output contiguously.

In this way, the reordering is absorbed into the mandatory input-read stage of normalization. The fused kernel avoids a standalone reorder pass over the activation tensor. The only additional work is reading the permutation indices and generating indexed memory addresses. This is much cheaper than mater...
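The kernel steps above can be emulated on the CPU to check that reading through the permutation gives the same values as normalizing first and reordering afterwards. This is a plain-Python sketch of the idea for the RMS case only, not the paper's GPU kernel. The names `rmsnorm` and `fused_reorder_rmsnorm`, the `eps` value, and indexing `gamma` by the original channel index are all assumptions.

```python
import math


def rmsnorm(x, gamma, eps=1e-6):
    """Reference RMS normalization in the original channel order."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gamma)]


def fused_reorder_rmsnorm(x, perm, gamma, eps=1e-6):
    """Gather-on-read RMS normalization; output position k holds channel perm[k]."""
    loaded = [x[j] for j in perm]                    # indexed read of one row
    rms = math.sqrt(sum(v * v for v in loaded) / len(loaded) + eps)
    # Scale each value with the gamma of its original channel, write contiguously.
    return [v / rms * gamma[j] for v, j in zip(loaded, perm)]


x = [1.0, -2.0, 3.0, 0.5]
gamma = [1.0, 0.5, 2.0, 1.5]
perm = [2, 0, 3, 1]

reference = [rmsnorm(x, gamma)[j] for j in perm]    # normalize, then reorder
fused = fused_reorder_rmsnorm(x, perm, gamma)       # reorder while reading
```

The RMS statistic does not depend on the order in which values are summed, up to floating-point rounding, which is why the reorder can be folded into the read without changing the normalized values.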

