arxiv: 2509.23582 · v2 · pith:4YPHQ4TYnew · submitted 2025-09-28 · 💻 cs.CV

RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization

Pith reviewed 2026-05-22 13:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion transformer · quantization aware training · low-bit quantization · image generation · activation quantization · ternary weights · mixed precision

The pith

A new training framework lets Diffusion Transformers generate competitive images on ImageNet using ternary weights and average 2-bit activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that activation quantization, not weight quantization, is the main obstacle to running large Diffusion Transformers at extremely low precision. It introduces a quantization-aware training method that first builds a solid ternary-weight baseline and then adds a RobustQuantizer step. This step applies the Hadamard transform to reshape per-token activation statistics into normal distributions that are easier to quantize. An additional activation-only mixed-precision scheme then assigns different bit widths to individual layers to avoid information loss. If the approach holds, it would make high-quality image generation feasible with far less memory and compute than current full-precision DiTs require.

Core claim

The central claim is that the Hadamard transform converts unknown per-token activation distributions into per-token normal distributions, which in turn supports reliable quantization to an average of 2 bits. When this is paired with an activation-only mixed-precision network that keeps ternary weights everywhere while varying activation precision layer by layer, the resulting model produces stable and competitive unconditional and conditional images on ImageNet-1K, establishing the first such result at this bit width.
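The distributional effect described above can be probed on synthetic data. The following is a minimal sketch, not the paper's code: the token vector, its three outlier channels, and the helper names `fwht` and `excess_kurtosis` are all assumptions made for illustration. It applies an orthonormal fast Walsh-Hadamard transform to one heavy-tailed "token" and compares excess kurtosis (0 for a normal distribution) before and after.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    scale = 1.0 / math.sqrt(n)  # with this scaling the transform is its own inverse
    return [v * scale for v in out]


def excess_kurtosis(values):
    """Excess kurtosis: 0 for a normal distribution, large for heavy tails."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


rng = random.Random(0)
# Hypothetical token: Laplace-like channels plus three large outlier channels.
token = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(1024)]
for channel in (3, 200, 777):
    token[channel] = 50.0

rotated = fwht(token)
print("excess kurtosis before:", excess_kurtosis(token))
print("excess kurtosis after: ", excess_kurtosis(rotated))
```

On this synthetic input the rotated vector has excess kurtosis much closer to zero than the original, while its energy is unchanged. Whether the same holds for real DiT activations is exactly the premise the review flags as unverified.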

What carries the argument

RobustQuantizer, which uses the Hadamard transform to normalize per-token activations before quantization, together with the AMPN pipeline that allocates different activation precisions per layer while holding all weights at ternary precision.
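To make the W1.58A2 notation concrete, here is a hedged sketch of the two quantizers such a scheme needs. It is not taken from the paper: the ternary step uses absmean rounding in the style of BitNet b1.58, and the activation step uses a plain uniform per-token grid with reconstruction at bin centres. The function names `ternarize` and `quantize_token` are hypothetical.

```python
import math


def ternarize(weights, eps=1e-8):
    """Map weights to codes in {-1, 0, +1} with one shared scale (absmean rounding)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, math.floor(w / (scale + eps) + 0.5))) for w in weights]
    return codes, scale


def quantize_token(token, bits=2, eps=1e-8):
    """Uniform per-token quantizer with 2**bits levels centred on zero.

    Returns the integer codes and the dequantized values (bin centres).
    """
    half_levels = 2 ** (bits - 1)
    absmax = max(abs(v) for v in token)
    step = max(absmax, eps) / half_levels
    codes = [max(-half_levels, min(half_levels - 1, math.floor(v / step))) for v in token]
    dequantized = [(c + 0.5) * step for c in codes]
    return codes, dequantized


# Toy inputs, chosen only to exercise both functions.
codes, scale = ternarize([0.9, -0.05, -1.2, 0.4])
act_codes, act_hat = quantize_token([0.3, -1.0, 0.8, -0.2], bits=2)
print(codes, scale)
print(act_codes, act_hat)
```

With 2 bits the activation grid has only four levels, so a single outlier that inflates `absmax` pushes most values into the two innermost bins. That sensitivity is the reason a reshaping step before quantization matters at this bit width.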

Load-bearing premise

The Hadamard transform can reliably turn arbitrary per-token activation distributions into normal distributions that quantize accurately at low bit widths.

What would settle it

A sharp rise in FID score or visible degradation in generated ImageNet images when the Hadamard transform is removed from the activation quantization path.
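A much cheaper proxy for this ablation can be run without any model. The sketch below is an assumption-laden stand-in, not the paper's experiment: it measures 2-bit reconstruction error (not FID) on one synthetic token with a single outlier channel, once quantized directly and once quantized in the Hadamard domain. The helpers `fwht`, `fake_quant`, and `mse` are hypothetical names.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def fake_quant(token, bits=2):
    """Quantize then dequantize on a uniform per-token grid (illustrative only)."""
    half_levels = 2 ** (bits - 1)
    step = max(max(abs(v) for v in token), 1e-8) / half_levels
    out = []
    for v in token:
        code = max(-half_levels, min(half_levels - 1, math.floor(v / step)))
        out.append((code + 0.5) * step)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
token = [rng.gauss(0.0, 1.0) for _ in range(256)]
token[17] = 60.0  # hypothetical outlier channel

# Path 1: quantize the raw token.
mse_plain = mse(token, fake_quant(token))

# Path 2: rotate, quantize, rotate back (the orthonormal transform is its own inverse).
restored = fwht(fake_quant(fwht(token)))
mse_hadamard = mse(token, restored)

print("MSE without Hadamard:", mse_plain)
print("MSE with Hadamard:   ", mse_hadamard)
```

In this toy setting the outlier stretches the raw grid so far that almost every channel lands in the same two bins, whereas the rotation spreads the outlier's energy across all coordinates and shrinks the grid. A low reconstruction error here does not imply good FID; it only illustrates the mechanism the falsification test targets.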

Figures

Figures reproduced from arXiv: 2509.23582 by Haotong Qin, Kaicheng Yang, Kaisen Yang, Xianglong Yan, Xun Zhang, Yucheng Lin, Yulun Zhang.

Figure 1: RobuQ enables DiTs to generate competitive results at ultra-low bit setting. We select 256×256 images from W1.58A3 quantized DiT-XL/2 trained on ImageNet-1K.
Figure 2: Overall Framework of Our Quantization Pipeline.
Figure 3: Illustration of how the Hadamard transforms per-token unknown distributions (left) into a …
Figure 4: An illustration of why PTQ sensitivity metrics fail for ultra-low-bit QAT mixed-precision.
Figure 5: Visualization of the performance and efficiency of RobuQ and comparative approaches.
Figure 6: FLOPs and Memory Breakdown in DiT-XL/2 Model.
Figure 7: Schematic diagram of actual deployment. For simplicity, we have omitted the AdaLN …
Figure 8: Visualization of Activation Bit-Width Distribution
Figure 9: Per-Block Activation Statistics. Top: average activation bits per block; Bottom: normal …
Figure 10: W1.58 DiT-XL/2 samples at 256×256. Labels = [360, 985, 309, 207, 387, 279, 417, 973]. Cfg = 4.0, sampling steps = 250.
Original abstract

Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating superior scalability and performance over U-Net architectures. However, their practical deployment is hindered by substantial computational and memory costs. While Quantization-Aware Training (QAT) has shown promise for U-Nets, its application to DiTs faces unique challenges, primarily due to the sensitivity and distributional complexity of activations. In this work, we identify activation quantization as the primary bottleneck for pushing DiTs to extremely low-bit settings. To address this, we propose a systematic QAT framework for DiTs, named RobuQ. We start by establishing a strong ternary weight (W1.58A4) DiT baseline. Building upon this, we propose RobustQuantizer to achieve robust activation quantization. Our theoretical analyses show that the Hadamard transform can convert unknown per-token distributions into per-token normal distributions, providing a strong foundation for this method. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs. This method applies ternary weights across the entire network while allocating different activation precisions to each layer to eliminate information bottlenecks. Through extensive experiments on unconditional and conditional image generation, our RobuQ framework achieves state-of-the-art performance for DiT quantization in sub-4-bit quantization configuration. To the best of our knowledge, RobuQ is the first achieving stable and competitive image generation on large datasets like ImageNet-1K with activations quantized to average 2 bits. The code and models will be available at https://github.com/racoonykc/RobuQ .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims to introduce RobuQ, a systematic QAT framework for DiTs that achieves W1.58A2 quantization. It starts with a strong ternary weight W1.58A4 baseline, proposes RobustQuantizer based on theoretical analysis that the Hadamard transform converts per-token distributions to normal distributions for robust activation quantization, and introduces AMPN for activation-only mixed-precision to allocate different precisions per layer. Through experiments on unconditional and conditional image generation on ImageNet-1K, it achieves SOTA for sub-4-bit DiT quantization and is the first to achieve stable competitive performance with average 2-bit activations.

Significance. If the results and theoretical justification hold, this is a significant advance in quantizing large generative models. Pushing DiTs to extremely low-bit settings with competitive performance on large datasets like ImageNet-1K could enable more efficient deployment of diffusion models. The use of Hadamard transform for activation normalization and the AMPN pipeline are potentially impactful contributions to the field of model quantization for vision transformers.

major comments (1)
  1. [Theoretical Analysis] Theoretical analysis section: The justification for RobustQuantizer rests on the Hadamard transform converting arbitrary per-token activation distributions into per-token normal distributions that are easy to quantize at 2 bits. This distributional property is invoked to argue that activation quantization is no longer the primary bottleneck, but the manuscript provides no direct empirical verification (e.g., histograms, QQ plots, or statistical tests) on activation statistics from actual DiT attention and MLP blocks. Without this, the quantization error bounds and the central W1.58A2 claim on ImageNet-1K rest on an unconfirmed assumption.
minor comments (2)
  1. The abstract states that code and models will be released; ensure the repository includes full reproduction scripts, exact hyperparameter settings for AMPN and QAT, and the per-layer bit-allocation tables.
  2. Experimental results would benefit from reporting standard deviations or multiple random seeds to substantiate claims of 'stable' performance at W1.58A2.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful comments, which have helped us identify areas to strengthen the manuscript. We address the major comment point by point below.

  1. Referee: Theoretical analysis section: The justification for RobustQuantizer rests on the Hadamard transform converting arbitrary per-token activation distributions into per-token normal distributions that are easy to quantize at 2 bits. This distributional property is invoked to argue that activation quantization is no longer the primary bottleneck, but the manuscript provides no direct empirical verification (e.g., histograms, QQ plots, or statistical tests) on activation statistics from actual DiT attention and MLP blocks. Without this, the quantization error bounds and the central W1.58A2 claim on ImageNet-1K rest on an unconfirmed assumption.

    Authors: We appreciate the referee's point that direct empirical verification would strengthen the theoretical justification. While the analysis derives the normalization property from the Hadamard matrix's orthogonality and its effect on per-token statistics, we agree that showing this on real DiT activations is valuable. In the revised manuscript we will add histograms, Q-Q plots, and summary statistics (skewness, kurtosis, and Shapiro-Wilk p-values) for activations extracted from both attention and MLP blocks of the DiT model, comparing distributions before and after the Hadamard transform. These additions will empirically support the claim that activation quantization ceases to be the dominant bottleneck and will reinforce the reported W1.58A2 results. revision: yes
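The normality diagnostics promised in the rebuttal can be approximated with a few lines of standard-library Python. This sketch is an assumption, not the authors' analysis: it computes skewness and excess kurtosis directly and swaps in the Jarque-Bera statistic (with its asymptotic chi-square p-value) for the Shapiro-Wilk test, because the latter needs a statistics library. The two synthetic samples stand in for activations and are not DiT data.

```python
import math
import random


def shape_stats(values):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / (m2 * m2) - 3.0


def jarque_bera(values):
    """Jarque-Bera statistic and its asymptotic p-value (chi-square, 2 d.o.f.)."""
    n = len(values)
    skew, ex_kurt = shape_stats(values)
    stat = n / 6.0 * (skew ** 2 + ex_kurt ** 2 / 4.0)
    p_value = math.exp(-stat / 2.0)  # survival function of chi-square with 2 d.o.f.
    return stat, p_value


rng = random.Random(0)
# Synthetic stand-ins: a normal sample and a heavy-tailed (Laplace-like) sample.
gauss_sample = [rng.gauss(0.0, 1.0) for _ in range(4096)]
laplace_sample = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(4096)]

jb_gauss, p_gauss = jarque_bera(gauss_sample)
jb_laplace, p_laplace = jarque_bera(laplace_sample)
print("normal sample:  JB =", jb_gauss, "p =", p_gauss)
print("Laplace sample: JB =", jb_laplace, "p =", p_laplace)
```

Applied to real activations, the same functions would be called on per-token vectors captured before and after the Hadamard transform. With thousands of channels per token, even small departures from normality tend to produce tiny p-values, so effect sizes such as skewness and kurtosis are likely more informative than the test alone.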

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent theoretical derivation and empirical results

The paper derives RobustQuantizer from a theoretical analysis of the Hadamard transform's effect on per-token activation distributions, presents this as a first-principles property within the manuscript, and validates the overall W1.58A2 framework through direct experiments on ImageNet-1K. No step reduces a claimed prediction or uniqueness result to a fitted parameter from the target data, a self-citation chain, or a renaming of known patterns; the central performance claims remain externally falsifiable via the reported generation metrics.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework introduces two new algorithmic components (RobustQuantizer and AMPN) whose correctness depends on the Hadamard normality claim and on the empirical observation that mixed activation precision removes bottlenecks. No new physical entities are postulated.

free parameters (1)
  • per-layer activation bit allocations
    Chosen to eliminate information bottlenecks; values are not stated in the abstract but must be selected or searched for each architecture.
axioms (1)
  • domain assumption: Hadamard transform converts unknown per-token distributions into per-token normal distributions
    Invoked to justify RobustQuantizer; treated as a mathematical property that holds for the activation statistics encountered in DiTs.
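The bit-width bookkeeping behind the ledger entries above is simple arithmetic. In this sketch the "1.58" in W1.58 is the information content of a three-valued weight, log2(3). The average activation width is computed as an element-weighted mean, which is an assumption: the source does not say how the paper weights layers, and the allocation below is invented for illustration.

```python
import math

# A ternary weight takes one of three values, so it carries log2(3) bits.
TERNARY_BITS = math.log2(3)


def average_activation_bits(layers):
    """Element-weighted mean bit width.

    `layers` is a list of (num_activation_elements, bits) pairs.
    """
    total_elements = sum(count for count, _ in layers)
    total_bits = sum(count * bits for count, bits in layers)
    return total_bits / total_elements


# Hypothetical allocation: some layers at 1 bit, most at 2, a few at 3.
allocation = [(1000, 1), (2000, 2), (1000, 3)]
avg_bits = average_activation_bits(allocation)

print(f"bits per ternary weight: {TERNARY_BITS:.3f}")
print(f"average activation bits: {avg_bits:.2f}")
```

A mixed allocation like this one reaches the same 2-bit average as a uniform 2-bit network while giving sensitive layers an extra bit. Choosing which layers get that bit is the free parameter the ledger records.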

pith-pipeline@v0.9.0 · 5848 in / 1393 out tokens · 31149 ms · 2026-05-22T13:08:27.848447+00:00 · methodology

