DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers
Pith reviewed 2026-05-19 21:43 UTC · model grok-4.3
The pith
DiRotQ rotates activations into a PCA-derived basis to protect dominant variance directions during 4-bit quantization of diffusion transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiRotQ identifies a low-rank subspace capturing dominant activation variance via PCA, preserves coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit, rotates activations into the PCA basis at inference using calibration-derived orthogonal transformations, fuses the inverse rotation into the layer weights offline, and combines this with GPTQ-based weight quantization to achieve an FID of 15.9 and PSNR of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset under the INT W4A4 setting.
What carries the argument
PCA-based identification of a low-rank subspace with rotation of activations into that basis at inference and offline fusion of the inverse rotation into weights.
If this is right
- Outperforms SVDQuant with an FID of 15.9 and PSNR of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset under the same INT W4A4 setting.
- Reduces memory usage of the 12B FLUX.1-dev model by 2.1x and delivers 2.3x speedup over the BF16 baseline on a 24 GB RTX 4090 GPU via a Triton-based custom kernel.
- Introduces a VLM-as-a-Judge evaluation protocol for assessing perceptual quality and prompt alignment in quantized diffusion models.
Where Pith is reading between the lines
- The rotation and subspace protection approach could be tested on other large transformer generators such as video or audio diffusion models to check for similar compression gains.
- Fusing rotations offline suggests the technique can slot into existing quantization toolchains without requiring changes to inference hardware.
- The reported speedups on consumer GPUs indicate that custom kernels may be required to capture full benefits when mixing precision levels within layers.
Load-bearing premise
That a low-rank subspace identified via PCA on calibration data captures the dominant activation variance sufficiently to allow safe 4-bit quantization of the remaining components, with the rotation and fusion preserving overall model behavior.
What would settle it
Applying DiRotQ to PixArt-Σ and measuring an FID higher than 18.9 on the MJHQ-30K dataset would show that the approach does not outperform the prior state-of-the-art under the same W4A4 conditions.
Figures
read the original abstract
Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but incur substantial memory and computational costs at inference. While aggressive Post-Training Quantization (PTQ) to 4-bit precision offers significant efficiency gains, it typically results in severe quality degradation. Existing approaches, including smoothing-based methods, mixed-precision schemes, rotation techniques, and low-rank residual methods, partially mitigate this issue but still leave a noticeable gap to FP16/BF16 performance. In this work, we introduce DiRotQ, a W4A4 PTQ framework that mitigates this degradation through rotation-aware activation quantization. DiRotQ identifies a low-rank subspace capturing dominant activation variance via Principal Component Analysis (PCA), preserving coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit. Activations are rotated into the PCA basis at inference time using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, DiRotQ achieves an FID (lower is better) of 15.9 and PSNR (higher is better) of 19.1 dB on PixArt-{\Sigma} over the MJHQ-30K dataset, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting. Beyond standard metrics, we introduce a VLM-as-a-Judge evaluation protocol for diffusion model quantization, the first such evaluation in this setting, providing a more holistic assessment of perceptual quality and prompt alignment under aggressive compression. On the systems side, we implement a Triton-based custom kernel to enable efficient end-to-end inference, reducing memory usage of the 12B FLUX.1-dev model by 2.1x and delivering 2.3x speedup over the BF16 baseline, on a 24 GB RTX 4090 GPU.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents DiRotQ, a rotation-aware post-training quantization (PTQ) framework for 4-bit Diffusion Transformers (DiTs). It identifies a low-rank subspace via PCA on calibration activations to preserve dominant variance at higher precision, quantizes the orthogonal components to 4 bits, applies calibration-derived rotations to activations at inference (fused into weights offline), and combines this with GPTQ for weights. Empirical results show FID 15.9 and PSNR 19.1 dB on PixArt-Σ with MJHQ-30K, outperforming SVDQuant (FID 18.9, PSNR 17.6) in W4A4 setting. It also proposes a VLM-as-a-Judge protocol and demonstrates 2.1x memory reduction and 2.3x speedup on FLUX.1-dev using a Triton kernel.
Significance. If the central empirical claims hold under rigorous verification, this contributes a practical PTQ method that narrows the quality gap for 4-bit DiT inference, enabling deployment of large models like 12B FLUX on consumer GPUs. The VLM-as-a-Judge evaluation is a positive addition for assessing semantic fidelity beyond FID/PSNR. The fusion of rotation into weights is a standard efficiency trick but applied here in a novel combination with PCA subspace for activations.
major comments (1)
- [§3.2 (PCA Subspace Identification)] The description of the calibration procedure for determining the low-rank subspace does not specify the distribution or range of timesteps used in the calibration dataset. Since DiT activations vary significantly with the diffusion timestep t, a subspace derived from a limited t-range may leave substantial variance in the 4-bit path at other timesteps, potentially accumulating errors over the denoising trajectory and affecting the reported FID and PSNR improvements.
minor comments (2)
- [Abstract] The abstract mentions 'the first such evaluation in this setting' for VLM-as-a-Judge; a reference to prior VLM evaluations in other quantization contexts would strengthen this claim.
- [§5 (Efficiency Results)] The reported 2.3x speedup on RTX 4090 should include the batch size and resolution used for the measurement to allow reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment point by point below.
read point-by-point responses
-
Referee: [§3.2 (PCA Subspace Identification)] The description of the calibration procedure for determining the low-rank subspace does not specify the distribution or range of timesteps used in the calibration dataset. Since DiT activations vary significantly with the diffusion timestep t, a subspace derived from a limited t-range may leave substantial variance in the 4-bit path at other timesteps, potentially accumulating errors over the denoising trajectory and affecting the reported FID and PSNR improvements.
Authors: We agree that the timestep distribution in calibration is an important detail given the strong timestep dependence of DiT activations. In our procedure, activations were collected by sampling timesteps uniformly at random from the full range [0, T] (T = 1000) across a calibration set of 256 images, ensuring the PCA subspace reflects variance across the entire denoising trajectory rather than a narrow interval. We will revise §3.2 to state this uniform full-range sampling explicitly, including the number of calibration samples and the exact timestep selection method. revision: yes
Circularity Check
No significant circularity in DiRotQ empirical claims
full rationale
The paper presents a post-training quantization method that computes a low-rank subspace via PCA on calibration data, rotates activations into that basis at inference, fuses the inverse rotation into weights offline, and applies GPTQ to weights. Reported FID of 15.9 and PSNR of 19.1 are measured on the held-out MJHQ-30K dataset for PixArt-Σ and compared against the external baseline SVDQuant under identical W4A4 settings. No equation or result reduces by construction to a fitted parameter or self-defined quantity within the paper; the performance numbers are independent empirical outcomes on data separate from calibration. The derivation chain relies on standard PTQ techniques and external benchmarks rather than self-citation chains or tautological redefinitions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
QuaRot: Outlier-free 4-bit inference in rotated LLMs
Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In38th Annual Conference on Neural Information Processing Systems, 2024
work page 2024
-
[2]
Blended diffusion for text-driven editing of natural images
Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18208–18218, 2022
work page 2022
-
[3]
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers.arXiv preprint arXiv:2211.01324, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
All are worth words: A ViT backbone for diffusion models
Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT backbone for diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023
work page 2023
-
[5]
Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster.arXiv preprint arXiv:2210.09461, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
Maskgit: Masked generative image transformer
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022
work page 2022
-
[7]
MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark
Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. InForty-first International Conference on Machine Learning, 2024
work page 2024
-
[8]
PixArt- Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt- Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In18th European Conference on Computer Vision. Springer, 2024
work page 2024
-
[9]
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
Q-DiT: Accurate post-training quantization for diffusion transformers
Lei Chen, Yuan Meng, Chen Tang, Xinzhu Ma, Jingyan Jiang, Xin Wang, Zhi Wang, and Wenwu Zhu. Q-DiT: Accurate post-training quantization for diffusion transformers. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28306–28315, 2025
work page 2025
-
[11]
Ruichen Chen, Keith G. Mills, and Di Niu. FP4DiT: Towards effective floating point quan- tization for diffusion transformers.Transactions on Machine Learning Research (TMLR), 2025
work page 2025
-
[12]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[13]
Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, et al. MJ-Bench: Is your multimodal reward model really a good judge for text-to-image generation?arXiv preprint arXiv:2407.04842, 2024
-
[14]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shen- glong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024. 10
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Scaling vision transformers to 22 billion parameters
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InInternational conference on machine learning, pages 7480–7512. PMLR, 2023
work page 2023
-
[16]
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022
work page 2022
-
[17]
DiTAS: Quantizing diffusion transformers via enhanced activation smoothing
Zhenyuan Dong and Sai Qian Zhang. DiTAS: Quantizing diffusion transformers via enhanced activation smoothing. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4606–4615. IEEE, 2025
work page 2025
-
[18]
The approximation of one matrix by another of lower rank
Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936
work page 1936
-
[19]
Scaling rectified flow transform- ers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. In41st international conference on machine learning, 2024
work page 2024
-
[20]
OPTQ: Accurate post- training quantization for generative pre-trained transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan-Adrian Alistarh. OPTQ: Accurate post- training quantization for generative pre-trained transformers. In11th International Conference on Learning Representations, 2023
work page 2023
-
[21]
Springer Science & Business Media, 2012
Allen Gersho and Robert M Gray.Vector quantization and signal compression, volume 159. Springer Science & Business Media, 2012
work page 2012
-
[22]
FineGRAIN: Evaluating failure modes of text-to-image models with vision language model judges
Kevin David Hayes, Micah Goldblum, Vikash Sehwag, Gowthami Somepalli, Ashwinee Panda, and Tom Goldstein. FineGRAIN: Evaluating failure modes of text-to-image models with vision language model judges. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025
work page 2025
-
[23]
Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. PTQD: Accurate post-training quantization for diffusion models.Advances in Neural Information Processing Systems, 36:13237–13249, 2023
work page 2023
-
[24]
Gaussian Error Linear Units (GELUs)
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs).arXiv preprint arXiv:1606.08415, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[25]
ClipScore: A reference-free evaluation metric for image captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. ClipScore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021
work page 2021
-
[26]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017
work page 2017
-
[27]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
work page 2020
-
[28]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[29]
Feice Huang et al. ConvRot: Rotation-based plug-and-play 4-bit quantization for diffusion transformers.arXiv preprint arXiv:2512.03673, 2025
-
[30]
Bk-SDM: A lightweight, fast, and cheap version of stable diffusion
Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. Bk-SDM: A lightweight, fast, and cheap version of stable diffusion. InEuropean Conference on Computer Vision, pages 381–399. Springer, 2024
work page 2024
-
[31]
DiffWave: A Versatile Diffusion Model for Audio Synthesis
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis.arXiv preprint arXiv:2009.09761, 2020. 11
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[32]
VIEScore: Towards explain- able metrics for conditional image synthesis evaluation
Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VIEScore: Towards explain- able metrics for conditional image synthesis evaluation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12268–12290, 2024
work page 2024
-
[33]
Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024. Ac- cessed: 2026-04-10
work page 2024
-
[34]
Prometheus-vision: Vision-language model as a judge for fine-grained evaluation
Seongyun Lee, Seungone Kim, Sue Park, Geewook Kim, and Minjoon Seo. Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 11286–11315, 2024
work page 2024
-
[35]
Lhotse documentation: lhotse.cut.mixed module
Lhotse Development Team. Lhotse documentation: lhotse.cut.mixed module. https:// lhotse.readthedocs.io/en/v1.24.2/_modules/lhotse/cut/mixed.html, 2024. Ac- cessed: 2026-04-10
work page 2024
-
[36]
Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation.arXiv preprint arXiv:2402.17245, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
SVDQuant: Absorbing outliers by low-rank component for 4-bit diffusion models
Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Junxian Guo, Xiuyu Li, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. SVDQuant: Absorbing outliers by low-rank component for 4-bit diffusion models. In13th International Conference on Learning Representations, 2025
work page 2025
-
[38]
Q-Diffusion: Quantizing diffusion models
Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-Diffusion: Quantizing diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17535–17545, 2023
work page 2023
-
[39]
Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. SnapFusion: Text-to-image diffusion model on mobile devices within two seconds.Advances in Neural Information Processing Systems, 36:20662–20678, 2023
work page 2023
-
[40]
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of Machine Learning and Systems, 6, 2024
work page 2024
-
[41]
QServe: W4A8KV4 quantization and system co-design for efficient LLM serving
Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. Proceedings of Machine Learning and Systems, 7, 2025
work page 2025
-
[42]
SpinQuant: LLM quantization with learned rotations
Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. In13th International Conference on Learning Representa- tions, 2025
work page 2025
-
[43]
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps.Advances in neural information processing systems, 35:5775–5787, 2022
work page 2022
-
[44]
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, 2025
work page 2025
-
[45]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
DeepCache: Accelerating diffusion models for free
Xinyin Ma, Gongfan Fang, and Xinchao Wang. DeepCache: Accelerating diffusion models for free. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15762–15772, 2024
work page 2024
-
[47]
Midjourney: Text-to-image generation model
Midjourney, Inc. Midjourney: Text-to-image generation model. https://www.midjourney. com, 2022. Accessed: 2026-04-10. 12
work page 2022
-
[48]
Up or down? Adaptive rounding for post-training quantization
Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? Adaptive rounding for post-training quantization. InInternational Conference on Machine Learning (ICML), 2020
work page 2020
-
[49]
NVIDIA Corporation. NVIDIA GeForce RTX 4090. https://www.nvidia.com/en-us/ geforce/graphics-cards/40-series/rtx-4090/, 2022. Accessed: 2026-04-28
work page 2022
-
[50]
NVIDIA Corporation. NVIDIA Blackwell Architecture. https://resources.nvidia.com/ en-us-blackwell-architecture, 2024. Accessed: 2026-04-11
work page 2024
-
[51]
OpenAI. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
work page 2023
-
[53]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021
work page 2021
-
[55]
High- resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
work page 2022
-
[56]
U-net: Convolutional networks for biomedical image segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015
work page 2015
-
[57]
DGQ: Distribution-aware group quantization for text-to-image diffusion models
Hyogon Ryu, NaHyeon Park, and Hyunjung Shim. DGQ: Distribution-aware group quantization for text-to-image diffusion models. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025
work page 2025
-
[58]
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022
work page 2022
-
[59]
Progressive Distillation for Fast Sampling of Diffusion Models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[60]
Adversarial diffusion distillation
Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, pages 87–103. Springer, 2024
work page 2024
-
[61]
ResQ: Mixed-precision quantiza- tion of large language models with low-rank residuals
Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. ResQ: Mixed-precision quantiza- tion of large language models with low-rank residuals. InProceedings of the 42nd International Conference on Machine Learning, 2025
work page 2025
-
[62]
Post-training quantization on diffusion models
Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1972–1981, 2023
work page 1972
-
[63]
Deep unsuper- vised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. PMLR, 2015
work page 2015
-
[64]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[65]
Improved Techniques for Training Consistency Models
Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models.arXiv preprint arXiv:2310.14189, 2023. 13
work page internal anchor Pith review arXiv 2023
-
[66]
Triton: an intermediate language and compiler for tiled neural network computations
Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019
work page 2019
-
[67]
A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions
Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero-Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26700–26709, 2024
work page 2024
-
[68]
Diffusers: State-of-the-art diffusion models
Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/ diffusers, 2022. Accessed: 2026-04-10
work page 2022
-
[69]
Exploring clip for assessing the look and feel of images
Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023
work page 2023
-
[70]
SparseDM: Toward sparse efficient diffusion models
Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, and Jun Zhu. SparseDM: Toward sparse efficient diffusion models. In2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–7. IEEE, 2025
work page 2025
-
[71]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[72]
Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image Quality Assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600– 612, 2004
work page 2004
-
[73]
Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, and Yan Yan. PTQ4DiT: Post- training quantization for diffusion transformers.Advances in neural information processing systems, 37:62732–62755, 2024
work page 2024
-
[74]
SmoothQuant: Accurate and efficient post-training quantization for large language models
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023
work page 2023
-
[75]
SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers
Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers. In13th International Conference on Learning Representations, 2025
work page 2025
-
[76]
ImageReward: Learning and evaluating human preferences for text-to-image generation
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023
work page 2023
-
[77]
Lianwei Yang, Haokun Lin, Tianchen Zhao, Yichen Wu, Hongyu Zhu, Ruiqi Xie, Zhenan Sun, Yu Wang, and Qingyi Gu. LRQ-DiT: Log-rotation post-training quantization of diffusion transformers for image and video generation.arXiv preprint arXiv:2508.03485, 2025
-
[78]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[79]
A Survey on Multimodal Large Language Models.National Science Review, 11(12):nwae403, 2024
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A Survey on Multimodal Large Language Models.National Science Review, 11(12):nwae403, 2024
work page 2024
-
[80]
One-step diffusion with distribution matching distillation
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024. 14
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.