arxiv: 2605.16732 · v1 · pith:CDQRCQ33new · submitted 2026-05-16 · 💻 cs.CV · cs.LG

DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers

Pith reviewed 2026-05-19 21:43 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords diffusion transformers · post-training quantization · 4-bit quantization · activation rotation · PCA subspace · image generation · model compression · efficient inference

The pith

DiRotQ rotates activations into a PCA-derived basis to protect dominant variance directions during 4-bit quantization of diffusion transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion transformers produce high-quality images but require large amounts of memory and compute. DiRotQ is a post-training quantization technique that reduces both weights and activations to 4 bits while limiting quality loss. The method finds a low-rank subspace of activations using PCA on a calibration set and keeps that part at higher precision. Activations are rotated to align with this subspace before quantization, and the inverse rotation is merged into the model weights ahead of time. When combined with existing weight quantization, this yields an FID score of 15.9 and PSNR of 19.1 dB, beating the previous best 4-bit approach.

Core claim

DiRotQ identifies a low-rank subspace capturing dominant activation variance via PCA, preserves coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit, rotates activations into the PCA basis at inference using calibration-derived orthogonal transformations, fuses the inverse rotation into the layer weights offline, and combines this with GPTQ-based weight quantization to achieve an FID of 15.9 and PSNR of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset under the INT W4A4 setting.

What carries the argument

PCA-based identification of a low-rank subspace with rotation of activations into that basis at inference and offline fusion of the inverse rotation into weights.
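
The mechanism above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the synthetic activations, the uncentered second-moment PCA, and the per-row symmetric quantizer `quantize_int4` are all assumptions made for illustration, and weight quantization (GPTQ in the paper) is left out. The sketch only shows that fusing an orthogonal rotation into the weights leaves the layer output unchanged, and that keeping the top PCA coefficients in high precision reduces 4-bit activation error on data with a few dominant directions.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-row 4-bit fake quantization (hypothetical helper)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -8, 7) * scale

def rel_err(y, y_ref):
    return np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref)

rng = np.random.default_rng(0)
d_in, d_out, n_tokens, rank = 64, 32, 256, 8    # hypothetical sizes

# Synthetic activations with a few high-variance directions (stand-in for outliers).
q_mix = np.linalg.qr(rng.standard_normal((d_in, d_in)))[0]
stds = np.concatenate([np.full(rank, 20.0), np.ones(d_in - rank)])
X = (rng.standard_normal((n_tokens, d_in)) * stds) @ q_mix.T
W = rng.standard_normal((d_out, d_in))           # linear layer: y = x @ W.T

# PCA basis from the (uncentered) second-moment matrix, sorted by descending variance.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / n_tokens)   # eigh returns ascending order
U = eigvecs[:, ::-1]                                     # orthogonal, d_in x d_in

# Offline: fuse the rotation into the weights, since (x U)(W U)^T = x U U^T W^T = x W^T.
W_rot = W @ U

# Inference: rotate, keep the top-`rank` coefficients in high precision,
# fake-quantize the remaining coefficients to 4 bits.
Z = X @ U
Z_mixed = np.concatenate([Z[:, :rank], quantize_int4(Z[:, rank:])], axis=1)

y_ref = X @ W.T
y_rot = Z @ W_rot.T                      # rotation only, no quantization
y_mixed = Z_mixed @ W_rot.T              # rotated, mixed-precision activations
y_plain = quantize_int4(X) @ W.T         # plain 4-bit activations, no rotation

print("rotation only:", rel_err(y_rot, y_ref))
print("rotated + mixed precision:", rel_err(y_mixed, y_ref))
print("plain 4-bit:", rel_err(y_plain, y_ref))
```

In this toy setting the rotation-only output matches the reference up to floating-point error, and the mixed-precision path has a smaller relative error than plain 4-bit activation quantization. Real DiT activations need not behave like this synthetic data.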

If this is right

  • Outperforms SVDQuant with an FID of 15.9 and PSNR of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset under the same INT W4A4 setting.
  • Reduces memory usage of the 12B FLUX.1-dev model by 2.1x and delivers 2.3x speedup over the BF16 baseline on a 24 GB RTX 4090 GPU via a Triton-based custom kernel.
  • Introduces a VLM-as-a-Judge evaluation protocol for assessing perceptual quality and prompt alignment in quantized diffusion models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rotation and subspace protection approach could be tested on other large transformer generators such as video or audio diffusion models to check for similar compression gains.
  • Fusing rotations offline suggests the technique can slot into existing quantization toolchains without requiring changes to inference hardware.
  • The reported speedups on consumer GPUs indicate that custom kernels may be required to capture full benefits when mixing precision levels within layers.

Load-bearing premise

That a low-rank subspace identified via PCA on calibration data captures the dominant activation variance sufficiently to allow safe 4-bit quantization of the remaining components, with the rotation and fusion preserving overall model behavior.
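
One way to probe this premise is to measure how much activation variance the top principal directions capture. The sketch below is an assumption-laden illustration rather than the paper's procedure: `explained_variance_ratio` is a hypothetical helper, the data are synthetic, and the 10% fraction simply mirrors the r=10% setting mentioned in the Figure 7 caption.

```python
import numpy as np

def explained_variance_ratio(acts, frac):
    """Share of total variance captured by the top `frac` of principal directions."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the per-component variances.
    var = np.linalg.svd(centered, compute_uv=False) ** 2
    k = max(1, int(round(frac * acts.shape[1])))
    return var[:k].sum() / var.sum()

rng = np.random.default_rng(0)
n_tokens, d = 512, 64                      # hypothetical sizes

# Case 1: a few dominant directions, as the premise assumes.
stds = np.concatenate([np.full(6, 20.0), np.ones(d - 6)])
structured = rng.standard_normal((n_tokens, d)) * stds

# Case 2: isotropic activations, where no low-rank subspace dominates.
isotropic = rng.standard_normal((n_tokens, d))

structured_ratio = explained_variance_ratio(structured, 0.10)
isotropic_ratio = explained_variance_ratio(isotropic, 0.10)
print("structured:", structured_ratio)
print("isotropic:", isotropic_ratio)
```

If the ratio measured on held-out activations were closer to the isotropic case than to the structured one, the remaining 4-bit components would still carry most of the variance and the premise would be in doubt.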

What would settle it

Applying DiRotQ to PixArt-Σ and measuring an FID higher than 18.9 on the MJHQ-30K dataset would show that the approach does not outperform the prior state-of-the-art under the same W4A4 conditions.

Figures

Figures reproduced from arXiv: 2605.16732 by Hesham Mostafa, Mahsa Salmani, Sayeh Sharify.

Figure 1: PixArt-Σ [8] Block 2, self-attention input activation distributions across 20 denoising timesteps. (a) Temporal variance for channels 239 and 364. (b) Original activations show severe channel-wise variance with large outliers (max/median = 83×). (c, d) PCA isolates outliers in a 16-bit subspace (red), reducing the ratio in the remaining 4-bit channels (blue) to 16×. (e, f) Rotation further equalizes the 4-…
Figure 2: Per-block activation QSNR for the self-attention output projection across 28 PixArt-Σ transformer blocks at five timesteps. DiRotQ outperforms RTN and SVDQuant [37] by 5–10 dB consistently across blocks and timesteps.
Figure 3: Visual comparison on MJHQ-30K across three models. DiRotQ consistently outperforms …
Figure 4: Memory usage and one-step latency for FLUX.1-dev across batch sizes 1 and 2. DiRotQ …
Figure 5: VLM-as-a-Judge pairwise comparison between SVDQuant and DiRotQ on PixArt-…
Figure 6: Per-block activation QSNR (dB) for RTN, SVDQuant, and DiRotQ across …
Figure 7: Activation QSNR (dB) on PixArt-Σ as a function of FP16 tail fraction, averaged across all 28 transformer blocks, for FFN up, self-attention QKV, and cross-attention Q layers across five denoising timesteps. RTN baselines are shown as horizontal references. QSNR rises sharply up to r=10% and saturates beyond, motivating our choice of r=10% throughout the paper.
Figure 8: Per-category VLM-as-a-Judge pairwise comparison between SVDQuant and DiRotQ on …
Figure 9: Qualitative visual results of PixArt-Σ on the MJHQ-30K and sDCI datasets. Prompts are reproduced verbatim.
Figure 10: Qualitative visual results of FLUX.1-dev on the MJHQ-30K and sDCI datasets. Prompts …
Figure 11: Qualitative visual results of SANA-1.6B on the MJHQ-30K and sDCI datasets. Prompts …
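
Several of the captions above report activation QSNR in dB. The captions do not define the metric, so the sketch below assumes the common definition, 10·log10(signal energy / quantization-noise energy), and uses a hypothetical per-token round-to-nearest 4-bit quantizer (`rtn_int4`) on synthetic data. It is meant only to show how a single outlier channel can lower QSNR under plain RTN.

```python
import numpy as np

def qsnr_db(x, x_q):
    """Quantization signal-to-noise ratio in dB (assumed definition)."""
    x = np.asarray(x, dtype=np.float64)
    noise = x - np.asarray(x_q, dtype=np.float64)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))

def rtn_int4(x):
    """Round-to-nearest symmetric 4-bit fake quantization, one scale per token (row)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -8, 7) * scale

rng = np.random.default_rng(0)
clean = rng.standard_normal((256, 64))     # well-behaved activations
outlier = clean.copy()
outlier[:, 0] *= 20.0                      # hypothetical outlier channel

qsnr_clean = qsnr_db(clean, rtn_int4(clean))
qsnr_outlier = qsnr_db(outlier, rtn_int4(outlier))
print("QSNR without outlier channel (dB):", qsnr_clean)
print("QSNR with outlier channel (dB):", qsnr_outlier)
```

The outlier channel inflates the per-token scale, so the other channels are rounded more coarsely and the overall QSNR drops.
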
Original abstract

Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but incur substantial memory and computational costs at inference. While aggressive Post-Training Quantization (PTQ) to 4-bit precision offers significant efficiency gains, it typically results in severe quality degradation. Existing approaches, including smoothing-based methods, mixed-precision schemes, rotation techniques, and low-rank residual methods, partially mitigate this issue but still leave a noticeable gap to FP16/BF16 performance. In this work, we introduce DiRotQ, a W4A4 PTQ framework that mitigates this degradation through rotation-aware activation quantization. DiRotQ identifies a low-rank subspace capturing dominant activation variance via Principal Component Analysis (PCA), preserving coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit. Activations are rotated into the PCA basis at inference time using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, DiRotQ achieves an FID (lower is better) of 15.9 and PSNR (higher is better) of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting. Beyond standard metrics, we introduce a VLM-as-a-Judge evaluation protocol for diffusion model quantization, the first such evaluation in this setting, providing a more holistic assessment of perceptual quality and prompt alignment under aggressive compression. On the systems side, we implement a Triton-based custom kernel to enable efficient end-to-end inference, reducing memory usage of the 12B FLUX.1-dev model by 2.1x and delivering 2.3x speedup over the BF16 baseline, on a 24 GB RTX 4090 GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents DiRotQ, a rotation-aware post-training quantization (PTQ) framework for 4-bit Diffusion Transformers (DiTs). It identifies a low-rank subspace via PCA on calibration activations to preserve dominant variance at higher precision, quantizes the remaining components (those in the orthogonal complement) to 4 bits, applies calibration-derived rotations to activations at inference with the inverse rotation fused into the weights offline, and combines this with GPTQ for weights. Empirical results show FID 15.9 and PSNR 19.1 dB on PixArt-Σ with MJHQ-30K, outperforming SVDQuant (FID 18.9, PSNR 17.6) in the W4A4 setting. It also proposes a VLM-as-a-Judge protocol and demonstrates 2.1x memory reduction and 2.3x speedup on FLUX.1-dev using a Triton kernel.

Significance. If the central empirical claims hold under rigorous verification, this contributes a practical PTQ method that narrows the quality gap for 4-bit DiT inference, enabling deployment of large models like 12B FLUX on consumer GPUs. The VLM-as-a-Judge evaluation is a positive addition for assessing semantic fidelity beyond FID/PSNR. The fusion of rotation into weights is a standard efficiency trick but applied here in a novel combination with PCA subspace for activations.

major comments (1)
  1. [§3.2 (PCA Subspace Identification)] The description of the calibration procedure for determining the low-rank subspace does not specify the distribution or range of timesteps used in the calibration dataset. Since DiT activations vary significantly with the diffusion timestep t, a subspace derived from a limited t-range may leave substantial variance in the 4-bit path at other timesteps, potentially accumulating errors over the denoising trajectory and affecting the reported FID and PSNR improvements.
minor comments (2)
  1. [Abstract] The abstract mentions 'the first such evaluation in this setting' for VLM-as-a-Judge; a reference to prior VLM evaluations in other quantization contexts would strengthen this claim.
  2. [§5 (Efficiency Results)] The reported 2.3x speedup on RTX 4090 should include the batch size and resolution used for the measurement to allow reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment point by point below.

  1. Referee: [§3.2 (PCA Subspace Identification)] The description of the calibration procedure for determining the low-rank subspace does not specify the distribution or range of timesteps used in the calibration dataset. Since DiT activations vary significantly with the diffusion timestep t, a subspace derived from a limited t-range may leave substantial variance in the 4-bit path at other timesteps, potentially accumulating errors over the denoising trajectory and affecting the reported FID and PSNR improvements.

    Authors: We agree that the timestep distribution in calibration is an important detail given the strong timestep dependence of DiT activations. In our procedure, activations were collected by sampling timesteps uniformly at random from the full range [0, T] (T = 1000) across a calibration set of 256 images, ensuring the PCA subspace reflects variance across the entire denoising trajectory rather than a narrow interval. We will revise §3.2 to state this uniform full-range sampling explicitly, including the number of calibration samples and the exact timestep selection method. revision: yes
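
The referee's concern and the sampling scheme described in the simulated rebuttal can be illustrated with a toy calibration loop. Everything model-specific here is hypothetical: `toy_layer_input` stands in for hooking a real DiT layer, the timestep-dependent scaling is invented, and only T = 1000 and the 256-sample calibration size are taken from the rebuttal text. The sketch accumulates a second-moment matrix over sampled timesteps and compares full-range sampling against a narrow range.

```python
import numpy as np

T = 1000            # timestep upper bound, as stated in the simulated rebuttal
NUM_CALIB = 256     # calibration set size, as stated in the simulated rebuttal
HIDDEN = 32         # hypothetical channel width for this toy example

def toy_layer_input(rng, t, n_tokens=16):
    """Hypothetical stand-in for a DiT layer input captured at timestep t."""
    x = rng.standard_normal((n_tokens, HIDDEN))
    x[:, 0] *= 1.0 + 4.0 * t / T        # invented timestep-dependent channel scale
    return x

def calibrate_pca(rng, sample_timestep):
    """Accumulate the second-moment matrix over sampled timesteps, then eigendecompose."""
    second_moment = np.zeros((HIDDEN, HIDDEN))
    n_rows = 0
    for _ in range(NUM_CALIB):
        x = toy_layer_input(rng, sample_timestep(rng))
        second_moment += x.T @ x
        n_rows += x.shape[0]
    eigvals, eigvecs = np.linalg.eigh(second_moment / n_rows)   # ascending order
    return eigvals[::-1], eigvecs[:, ::-1]

rng = np.random.default_rng(0)
# Uniform sampling over the full range [0, T] (upper bound of integers is exclusive).
full_vals, full_basis = calibrate_pca(rng, lambda r: r.integers(0, T + 1))
# Narrow range [0, 100]: the limited-t scenario the referee warns about.
narrow_vals, narrow_basis = calibrate_pca(rng, lambda r: r.integers(0, 101))

print("top eigenvalue, full range:", full_vals[0])
print("top eigenvalue, narrow range:", narrow_vals[0])
```

In this toy setup the narrow-range calibration sees a much smaller dominant eigenvalue, so a subspace chosen from it would underestimate the variance that appears at later timesteps.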

Circularity Check

0 steps flagged

No significant circularity in DiRotQ empirical claims

The paper presents a post-training quantization method that computes a low-rank subspace via PCA on calibration data, rotates activations into that basis at inference, fuses the inverse rotation into weights offline, and applies GPTQ to weights. Reported FID of 15.9 and PSNR of 19.1 are measured on the held-out MJHQ-30K dataset for PixArt-Σ and compared against the external baseline SVDQuant under identical W4A4 settings. No equation or result reduces by construction to a fitted parameter or self-defined quantity within the paper; the performance numbers are independent empirical outcomes on data separate from calibration. The derivation chain relies on standard PTQ techniques and external benchmarks rather than self-citation chains or tautological redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters or axioms; the subspace dimension choice and calibration data selection are implicit but not quantified.

pith-pipeline@v0.9.0 · 5895 in / 1281 out tokens · 51764 ms · 2026-05-19T21:43:48.527133+00:00 · methodology

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 17 internal anchors

  1. [1]

    QuaRot: Outlier-free 4-bit inference in rotated LLMs

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In 38th Annual Conference on Neural Information Processing Systems, 2024

  2. [2]

    Blended diffusion for text-driven editing of natural images

    Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18208–18218, 2022

  3. [3]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

  4. [4]

    All are worth words: A ViT backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT backbone for diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023

  5. [5]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461, 2022

  6. [6]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

  7. [7]

    MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark

    Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. In Forty-first International Conference on Machine Learning, 2024

  8. [8]

    PixArt-Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In 18th European Conference on Computer Vision. Springer, 2024

  9. [9]

    PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

  10. [10]

    Q-DiT: Accurate post-training quantization for diffusion transformers

    Lei Chen, Yuan Meng, Chen Tang, Xinzhu Ma, Jingyan Jiang, Xin Wang, Zhi Wang, and Wenwu Zhu. Q-DiT: Accurate post-training quantization for diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28306–28315, 2025

  11. [11]

    FP4DiT: Towards effective floating point quantization for diffusion transformers

    Ruichen Chen, Keith G. Mills, and Di Niu. FP4DiT: Towards effective floating point quantization for diffusion transformers. Transactions on Machine Learning Research (TMLR), 2025

  12. [12]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

  13. [13]

    MJ-Bench: Is your multimodal reward model really a good judge for text-to-image generation?

    Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, et al. MJ-Bench: Is your multimodal reward model really a good judge for text-to-image generation? arXiv preprint arXiv:2407.04842, 2024

  14. [14]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024

  15. [15]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International conference on machine learning, pages 7480–7512. PMLR, 2023

  16. [16]

    LLM.int8(): 8-bit matrix multiplication for transformers at scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in neural information processing systems, 35:30318–30332, 2022

  17. [17]

    DiTAS: Quantizing diffusion transformers via enhanced activation smoothing

    Zhenyuan Dong and Sai Qian Zhang. DiTAS: Quantizing diffusion transformers via enhanced activation smoothing. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4606–4615. IEEE, 2025

  18. [18]

    The approximation of one matrix by another of lower rank

    Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936

  19. [19]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In 41st international conference on machine learning, 2024

  20. [20]

    OPTQ: Accurate post-training quantization for generative pre-trained transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan-Adrian Alistarh. OPTQ: Accurate post-training quantization for generative pre-trained transformers. In 11th International Conference on Learning Representations, 2023

  21. [21]

    Vector quantization and signal compression

    Allen Gersho and Robert M Gray. Vector quantization and signal compression, volume 159. Springer Science & Business Media, 2012

  22. [22]

    FineGRAIN: Evaluating failure modes of text-to-image models with vision language model judges

    Kevin David Hayes, Micah Goldblum, Vikash Sehwag, Gowthami Somepalli, Ashwinee Panda, and Tom Goldstein. FineGRAIN: Evaluating failure modes of text-to-image models with vision language model judges. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  23. [23]

    PTQD: Accurate post-training quantization for diffusion models

    Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. PTQD: Accurate post-training quantization for diffusion models. Advances in Neural Information Processing Systems, 36:13237–13249, 2023

  24. [24]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016

  25. [25]

    ClipScore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. ClipScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021

  26. [26]

    GANs trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  27. [27]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  28. [28]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  29. [29]

    ConvRot: Rotation-based plug-and-play 4-bit quantization for diffusion transformers

    Feice Huang et al. ConvRot: Rotation-based plug-and-play 4-bit quantization for diffusion transformers. arXiv preprint arXiv:2512.03673, 2025

  30. [30]

    Bk-SDM: A lightweight, fast, and cheap version of stable diffusion

    Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. Bk-SDM: A lightweight, fast, and cheap version of stable diffusion. In European Conference on Computer Vision, pages 381–399. Springer, 2024

  31. [31]

    DiffWave: A Versatile Diffusion Model for Audio Synthesis

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020

  32. [32]

    VIEScore: Towards explainable metrics for conditional image synthesis evaluation

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VIEScore: Towards explainable metrics for conditional image synthesis evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12268–12290, 2024

  33. [33]

    Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024. Accessed: 2026-04-10

  34. [34]

    Prometheus-vision: Vision-language model as a judge for fine-grained evaluation

    Seongyun Lee, Seungone Kim, Sue Park, Geewook Kim, and Minjoon Seo. Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11286–11315, 2024

  35. [35]

    Lhotse documentation: lhotse.cut.mixed module

    Lhotse Development Team. Lhotse documentation: lhotse.cut.mixed module. https://lhotse.readthedocs.io/en/v1.24.2/_modules/lhotse/cut/mixed.html, 2024. Accessed: 2026-04-10

  36. [36]

    Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024

  37. [37]

    SVDQuant: Absorbing outliers by low-rank component for 4-bit diffusion models

    Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Junxian Guo, Xiuyu Li, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. SVDQuant: Absorbing outliers by low-rank component for 4-bit diffusion models. In 13th International Conference on Learning Representations, 2025

  38. [38]

    Q-Diffusion: Quantizing diffusion models

    Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-Diffusion: Quantizing diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17535–17545, 2023

  39. [39]

    SnapFusion: Text-to-image diffusion model on mobile devices within two seconds

    Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. SnapFusion: Text-to-image diffusion model on mobile devices within two seconds. Advances in Neural Information Processing Systems, 36:20662–20678, 2023

  40. [40]

    AWQ: Activation-aware weight quantization for on-device llm compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6, 2024

  41. [41]

    QServe: W4A8KV4 quantization and system co-design for efficient LLM serving

    Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. Proceedings of Machine Learning and Systems, 7, 2025

  42. [42]

    SpinQuant: LLM quantization with learned rotations

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. In 13th International Conference on Learning Representations, 2025

  43. [43]

    DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in neural information processing systems, 35:5775–5787, 2022

  44. [44]

    DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, 22(4):730–751, 2025

  45. [45]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023

  46. [46]

    DeepCache: Accelerating diffusion models for free

    Xinyin Ma, Gongfan Fang, and Xinchao Wang. DeepCache: Accelerating diffusion models for free. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15762–15772, 2024

  47. [47]

    Midjourney: Text-to-image generation model

    Midjourney, Inc. Midjourney: Text-to-image generation model. https://www.midjourney.com, 2022. Accessed: 2026-04-10

  48. [48]

    Up or down? Adaptive rounding for post-training quantization

    Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? Adaptive rounding for post-training quantization. In International Conference on Machine Learning (ICML), 2020

  49. [49]

    NVIDIA GeForce RTX 4090

    NVIDIA Corporation. NVIDIA GeForce RTX 4090. https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4090/, 2022. Accessed: 2026-04-28

  50. [50]

    NVIDIA Blackwell Architecture

    NVIDIA Corporation. NVIDIA Blackwell Architecture. https://resources.nvidia.com/en-us-blackwell-architecture, 2024. Accessed: 2026-04-11

  51. [51]

    GPT-4o System Card

    OpenAI. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

  52. [52]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  53. [53]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  54. [54]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  55. [55]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  56. [56]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  57. [57]

    DGQ: Distribution-aware group quantization for text-to-image diffusion models

    Hyogon Ryu, NaHyeon Park, and Hyunjung Shim. DGQ: Distribution-aware group quantization for text-to-image diffusion models. In The Thirteenth International Conference on Learning Representations (ICLR), 2025

  58. [58]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

  59. [59]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

  60. [60]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In European Conference on Computer Vision, pages 87–103. Springer, 2024

  61. [61]

    ResQ: Mixed-precision quantization of large language models with low-rank residuals

    Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. ResQ: Mixed-precision quantization of large language models with low-rank residuals. In Proceedings of the 42nd International Conference on Machine Learning, 2025

  62. [62]

    Post-training quantization on diffusion models

    Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1972–1981, 2023

  63. [63]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015

  64. [64]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  65. [65]

    Improved Techniques for Training Consistency Models

    Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023

  66. [66]

    Triton: an intermediate language and compiler for tiled neural network computations

    Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019

  67. [67]

    A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions

    Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero-Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26700–26709, 2024

  68. [68]

    Diffusers: State-of-the-art diffusion models

    Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022. Accessed: 2026-04-10

  69. [69]

    Exploring clip for assessing the look and feel of images

    Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023

  70. [70]

    SparseDM: Toward sparse efficient diffusion models

    Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, and Jun Zhu. SparseDM: Toward sparse efficient diffusion models. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–7. IEEE, 2025

  71. [71]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  72. [72]

    Image Quality Assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image Quality Assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004

  73. [73]

    PTQ4DiT: Post-training quantization for diffusion transformers

    Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, and Yan Yan. PTQ4DiT: Post-training quantization for diffusion transformers. Advances in neural information processing systems, 37:62732–62755, 2024

  74. [74]

    SmoothQuant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023

  75. [75]

    SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers. In 13th International Conference on Learning Representations, 2025

  76. [76]

    ImageReward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  77. [77]

    LRQ-DiT: Log-rotation post-training quantization of diffusion transformers for image and video generation

    Lianwei Yang, Haokun Lin, Tianchen Zhao, Yichen Wu, Hongyu Zhu, Ruiqi Xie, Zhenan Sun, Yu Wang, and Qingyi Gu. LRQ-DiT: Log-rotation post-training quantization of diffusion transformers for image and video generation. arXiv preprint arXiv:2508.03485, 2025

  78. [78]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023

  79. [79]

    A Survey on Multimodal Large Language Models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A Survey on Multimodal Large Language Models. National Science Review, 11(12):nwae403, 2024

  80. [80]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

Showing first 80 references.