DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers

Hesham Mostafa; Mahsa Salmani; Sayeh Sharify

arxiv: 2605.16732 · v1 · pith:CDQRCQ33new · submitted 2026-05-16 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-19 21:43 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords diffusion transformers · post-training quantization · 4-bit quantization · activation rotation · PCA subspace · image generation · model compression · efficient inference

The pith

DiRotQ rotates activations into a PCA-derived basis to protect dominant variance directions during 4-bit quantization of diffusion transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion transformers produce high-quality images but require large amounts of memory and compute. DiRotQ is a post-training quantization technique that reduces both weights and activations to 4 bits while limiting quality loss. The method uses PCA on a calibration set to find a low-rank subspace of activations and keeps the coefficients in that subspace at higher precision. Activations are rotated into the PCA basis before quantization, and the inverse rotation is merged into the model weights ahead of time. Combined with GPTQ-based weight quantization, this yields an FID of 15.9 and a PSNR of 19.1 dB on PixArt-Σ over MJHQ-30K, beating the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting.

Core claim

DiRotQ uses PCA to identify a low-rank subspace that captures the dominant activation variance. It preserves the coefficients in this subspace at higher precision and quantizes the remaining components to 4-bit. At inference, activations are rotated into the PCA basis using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, this achieves an FID of 15.9 and a PSNR of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset under the INT W4A4 setting.

What carries the argument

PCA-based identification of a low-rank subspace with rotation of activations into that basis at inference and offline fusion of the inverse rotation into weights.
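The mechanism described here can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pca_rotation`, `quantize_int4`, `mixed_precision_linear`), the toy data and sizes, the per-row symmetric round-to-nearest quantizer, and the use of an uncentered second-moment matrix for the PCA step are all choices made for the example. Weights are left unquantized (the paper pairs the method with GPTQ) so that only the activation path is exercised.

```python
import numpy as np

def pca_rotation(calib_acts):
    """Orthogonal basis from calibration activations of shape (n, d).
    Columns are sorted by decreasing variance. Assumption: uncentered
    second moment, so the rotation can be applied directly at inference."""
    second_moment = calib_acts.T @ calib_acts / len(calib_acts)
    _, eigvecs = np.linalg.eigh(second_moment)  # eigenvalues come back ascending
    return eigvecs[:, ::-1]

def quantize_int4(x):
    """Symmetric per-row round-to-nearest onto the INT4 grid [-8, 7],
    returned in dequantized form (an illustrative quantizer)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)
    return np.clip(np.round(x / scale), -8, 7) * scale

def mixed_precision_linear(x, weight, rotation, rank):
    """Approximate x @ weight: the top-`rank` PCA coefficients stay in full
    precision, the remaining coefficients go through the 4-bit quantizer."""
    fused_weight = rotation.T @ weight   # inverse rotation folded in offline
    coeffs = x @ rotation                # online rotation into the PCA basis
    high = coeffs[:, :rank]
    low = quantize_int4(coeffs[:, rank:])
    return high @ fused_weight[:rank] + low @ fused_weight[rank:]

# Toy data: a few dominant directions plus small noise (hypothetical sizes).
rng = np.random.default_rng(0)
n, d, d_out, k = 512, 64, 32, 4
x = 10.0 * rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))
weight = rng.normal(size=(d, d_out))

rotation = pca_rotation(x)
exact = x @ weight
err_rtn = np.linalg.norm(quantize_int4(x) @ weight - exact)
err_mixed = np.linalg.norm(mixed_precision_linear(x, weight, rotation, rank=6) - exact)
print(f"plain INT4 error: {err_rtn:.3f}, rotated mixed-precision error: {err_mixed:.3f}")
```

Because `rotation` is orthogonal, `(x @ rotation) @ (rotation.T @ weight)` equals `x @ weight` up to floating-point rounding, so in this sketch the only approximation error comes from quantizing the low-variance coefficients.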

If this is right

Outperforms SVDQuant (FID 18.9, PSNR 17.6) with an FID of 15.9 and a PSNR of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset under the same INT W4A4 setting.
Reduces memory usage of the 12B FLUX.1-dev model by 2.1x and delivers 2.3x speedup over the BF16 baseline on a 24 GB RTX 4090 GPU via a Triton-based custom kernel.
Introduces a VLM-as-a-Judge evaluation protocol for assessing perceptual quality and prompt alignment in quantized diffusion models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The rotation and subspace protection approach could be tested on other large transformer generators such as video or audio diffusion models to check for similar compression gains.
Fusing rotations offline suggests the technique can slot into existing quantization toolchains without requiring changes to inference hardware.
The reported speedups on consumer GPUs indicate that custom kernels may be required to capture full benefits when mixing precision levels within layers.

Load-bearing premise

That a low-rank subspace identified via PCA on calibration data captures the dominant activation variance sufficiently to allow safe 4-bit quantization of the remaining components, with the rotation and fusion preserving overall model behavior.

What would settle it

Applying DiRotQ to PixArt-Σ and measuring an FID higher than 18.9 on the MJHQ-30K dataset would show that the approach does not outperform the prior state-of-the-art under the same W4A4 conditions.

Figures

Figures reproduced from arXiv: 2605.16732 by Hesham Mostafa, Mahsa Salmani, Sayeh Sharify.

**Figure 1.** PixArt-Σ [8] Block 2, self-attention input activation distributions across 20 denoising timesteps. (a) Temporal variance for channels 239 and 364. (b) Original activations show severe channel-wise variance with large outliers (max/median = 83×). (c, d) PCA isolates outliers in a 16-bit subspace (red), reducing the ratio in the remaining 4-bit channels (blue) to 16×. (e, f) Rotation further equalizes the 4-…

**Figure 2.** Per-block activation QSNR for the self-attention output projection across 28 PixArt-Σ transformer blocks at five timesteps. DiRotQ outperforms RTN and SVDQuant [37] by 5–10 dB consistently across blocks and timesteps.

**Figure 3.** Visual comparison on MJHQ-30K across three models. DiRotQ consistently outperforms …

**Figure 4.** Memory usage and one-step latency for FLUX.1-dev across batch sizes 1 and 2. DiRotQ …

**Figure 5.** VLM-as-a-Judge pairwise comparison between SVDQuant and DiRotQ on PixArt-…

**Figure 6.** Per-block activation QSNR (dB) for RTN, SVDQuant, and DiRotQ across …

**Figure 7.** Activation QSNR (dB) on PixArt-Σ as a function of FP16 tail fraction, averaged across all 28 transformer blocks, for FFN up, self-attention QKV, and cross-attention Q layers across five denoising timesteps. RTN baselines are shown as horizontal references. QSNR rises sharply up to r=10% and saturates beyond, motivating our choice of r=10% throughout the paper.

**Figure 8.** Per-category VLM-as-a-Judge pairwise comparison between SVDQuant and DiRotQ on …

**Figure 9.** Qualitative visual results of PixArt-Σ on the MJHQ-30K and sDCI datasets. Prompts are reproduced verbatim.

**Figure 10.** Qualitative visual results of FLUX.1-dev on the MJHQ-30K and sDCI datasets. Prompts …

**Figure 11.** Qualitative visual results of SANA-1.6B on the MJHQ-30K and sDCI datasets. Prompts …
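Several captions above report activation QSNR in dB. The page does not give the formula, so the sketch below assumes the usual definition, 10·log10 of signal energy over quantization-noise energy, and pairs it with a simple symmetric per-row round-to-nearest (RTN) quantizer. The function names and the random toy activations are illustrative assumptions, not the paper's measurement code.

```python
import numpy as np

def qsnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB (assumed standard definition)."""
    reference = np.asarray(reference, dtype=np.float64)
    noise = reference - np.asarray(quantized, dtype=np.float64)
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def rtn_quantize(x, bits):
    """Symmetric per-row round-to-nearest quantization, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 64))   # toy activations, not model data
qsnr_4bit = qsnr_db(acts, rtn_quantize(acts, bits=4))
qsnr_8bit = qsnr_db(acts, rtn_quantize(acts, bits=8))
print(f"INT4 RTN: {qsnr_4bit:.1f} dB, INT8 RTN: {qsnr_8bit:.1f} dB")
```

Under this definition a gain of 10 dB corresponds to a tenfold reduction in quantization-noise energy relative to the signal.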

Original abstract

Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but incur substantial memory and computational costs at inference. While aggressive Post-Training Quantization (PTQ) to 4-bit precision offers significant efficiency gains, it typically results in severe quality degradation. Existing approaches, including smoothing-based methods, mixed-precision schemes, rotation techniques, and low-rank residual methods, partially mitigate this issue but still leave a noticeable gap to FP16/BF16 performance. In this work, we introduce DiRotQ, a W4A4 PTQ framework that mitigates this degradation through rotation-aware activation quantization. DiRotQ identifies a low-rank subspace capturing dominant activation variance via Principal Component Analysis (PCA), preserving coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit. Activations are rotated into the PCA basis at inference time using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, DiRotQ achieves an FID (lower is better) of 15.9 and PSNR (higher is better) of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting. Beyond standard metrics, we introduce a VLM-as-a-Judge evaluation protocol for diffusion model quantization, the first such evaluation in this setting, providing a more holistic assessment of perceptual quality and prompt alignment under aggressive compression. On the systems side, we implement a Triton-based custom kernel to enable efficient end-to-end inference, reducing memory usage of the 12B FLUX.1-dev model by 2.1x and delivering 2.3x speedup over the BF16 baseline, on a 24 GB RTX 4090 GPU.
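For readers unfamiliar with the PSNR figures quoted in the abstract, the following is a small sketch of the standard peak signal-to-noise ratio between two images. It assumes 8-bit pixel values (peak 255); which image serves as the reference in the paper (for example, the output of the 16-bit model) is not stated on this page, so that pairing is left to the caller. The function name and toy images are hypothetical.

```python
import numpy as np

def psnr_db(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    mse = np.mean((reference - candidate) ** 2)
    if mse == 0.0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy 8-bit "images": every pixel differs by exactly one grey level.
reference_image = np.full((8, 8, 3), 100, dtype=np.uint8)
candidate_image = np.full((8, 8, 3), 101, dtype=np.uint8)
psnr_value = psnr_db(reference_image, candidate_image)
print(f"PSNR: {psnr_value:.2f} dB")
```

Inputs are cast to floating point before subtraction so that unsigned 8-bit arrays do not wrap around.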

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DiRotQ gets better FID/PSNR than SVDQuant at W4A4 for DiTs by protecting a PCA subspace and rotating the rest, but the timestep-varying nature of activations makes the fixed calibration a real question mark.

The main thing to know is that DiRotQ claims clear wins over SVDQuant on PixArt-Σ at W4A4, with FID moving from 18.9 to 15.9 and PSNR from 17.6 to 19.1, by keeping a low-rank PCA subspace of activations in higher precision while rotating the orthogonal part into a basis that quantizes more cleanly to 4 bits. They fuse the inverse rotation into the weights offline and pair it with GPTQ on the weights. The VLM-as-a-Judge protocol and the Triton kernel for FLUX.1-dev are also new pieces here.

Referee Report

1 major / 2 minor

Summary. The manuscript presents DiRotQ, a rotation-aware post-training quantization (PTQ) framework for 4-bit Diffusion Transformers (DiTs). It identifies a low-rank subspace via PCA on calibration activations to preserve dominant variance at higher precision, quantizes the orthogonal components to 4 bits, applies calibration-derived rotations to activations at inference (with the inverse rotation fused into the weights offline), and combines this with GPTQ for weights. Empirical results show FID 15.9 and PSNR 19.1 dB on PixArt-Σ with MJHQ-30K, outperforming SVDQuant (FID 18.9, PSNR 17.6) in the W4A4 setting. It also proposes a VLM-as-a-Judge protocol and demonstrates 2.1x memory reduction and 2.3x speedup on FLUX.1-dev using a Triton kernel.

Significance. If the central empirical claims hold under rigorous verification, this contributes a practical PTQ method that narrows the quality gap for 4-bit DiT inference, enabling deployment of large models such as the 12B FLUX.1-dev on consumer GPUs. The VLM-as-a-Judge evaluation is a positive addition for assessing semantic fidelity beyond FID/PSNR. Fusing a rotation into the weights is a standard efficiency trick, applied here in a novel combination with a PCA subspace for activations.

major comments (1)

[§3.2 (PCA Subspace Identification)] The description of the calibration procedure for determining the low-rank subspace does not specify the distribution or range of timesteps used in the calibration dataset. Since DiT activations vary significantly with the diffusion timestep t, a subspace derived from a limited t-range may leave substantial variance in the 4-bit path at other timesteps, potentially accumulating errors over the denoising trajectory and affecting the reported FID and PSNR improvements.

minor comments (2)

[Abstract] The abstract mentions 'the first such evaluation in this setting' for VLM-as-a-Judge; a reference to prior VLM evaluations in other quantization contexts would strengthen this claim.
[§5 (Efficiency Results)] The reported 2.3x speedup on RTX 4090 should include the batch size and resolution used for the measurement to allow reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment point by point below.

Referee: [§3.2 (PCA Subspace Identification)] The description of the calibration procedure for determining the low-rank subspace does not specify the distribution or range of timesteps used in the calibration dataset. Since DiT activations vary significantly with the diffusion timestep t, a subspace derived from a limited t-range may leave substantial variance in the 4-bit path at other timesteps, potentially accumulating errors over the denoising trajectory and affecting the reported FID and PSNR improvements.

Authors: We agree that the timestep distribution in calibration is an important detail given the strong timestep dependence of DiT activations. In our procedure, activations were collected by sampling timesteps uniformly at random from the full range [0, T] (T = 1000) across a calibration set of 256 images, ensuring the PCA subspace reflects variance across the entire denoising trajectory rather than a narrow interval. We will revise §3.2 to state this uniform full-range sampling explicitly, including the number of calibration samples and the exact timestep selection method.

revision: yes
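The exchange above can be made concrete with a toy experiment. The sketch below is hypothetical throughout: `toy_activations` stands in for a forward hook on a real layer, the hard switch of the dominant direction at T/2 is an artificial construction, and only T = 1000 and the 256 calibration samples are taken from the rebuttal. It compares a basis calibrated on uniformly sampled timesteps with one calibrated on a narrow late-timestep range, then measures how much activation energy each basis captures at an early timestep.

```python
import numpy as np

T = 1000            # timestep range quoted in the rebuttal
NUM_SAMPLES = 256   # calibration set size quoted in the rebuttal
DIM = 32            # hypothetical channel count for the toy example

rng = np.random.default_rng(0)
# Two orthonormal toy directions (hypothetical): one dominates late
# timesteps, the other dominates early timesteps.
DIR_LATE = np.ones(DIM) / np.sqrt(DIM)
DIR_EARLY = np.tile([1.0, -1.0], DIM // 2) / np.sqrt(DIM)

def toy_activations(t, tokens=16):
    """Hypothetical stand-in for a forward hook on one layer at timestep t."""
    direction = DIR_LATE if t > T // 2 else DIR_EARLY
    dominant = 5.0 * rng.normal(size=(tokens, 1)) * direction
    return dominant + 0.1 * rng.normal(size=(tokens, DIM))

def calibrate_basis(timesteps):
    """Accumulate the second moment over the sampled timesteps, then eigendecompose."""
    second_moment = np.zeros((DIM, DIM))
    for t in timesteps:
        acts = toy_activations(t)
        second_moment += acts.T @ acts
    _, eigvecs = np.linalg.eigh(second_moment)  # ascending eigenvalues
    return eigvecs[:, ::-1]

def captured_fraction(basis, acts, rank):
    """Share of activation energy that falls in the top-`rank` basis directions."""
    coeffs = acts @ basis
    return np.sum(coeffs[:, :rank] ** 2) / np.sum(coeffs ** 2)

full_range = rng.integers(0, T + 1, size=NUM_SAMPLES)      # uniform over [0, T]
narrow_range = rng.integers(900, T + 1, size=NUM_SAMPLES)  # limited t-range
basis_full = calibrate_basis(full_range)
basis_narrow = calibrate_basis(narrow_range)

early_acts = toy_activations(t=50, tokens=256)
frac_full = captured_fraction(basis_full, early_acts, rank=2)
frac_narrow = captured_fraction(basis_narrow, early_acts, rank=2)
print(f"captured energy at t=50: full-range {frac_full:.3f}, narrow-range {frac_narrow:.3f}")
```

In this toy setting the narrow-range basis never observes the early-timestep direction, so most of that energy would fall on the 4-bit path. Whether real DiT activations behave this way is exactly what the referee's comment asks the authors to document.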

Circularity Check

0 steps flagged

No significant circularity in DiRotQ empirical claims

The paper presents a post-training quantization method that computes a low-rank subspace via PCA on calibration data, rotates activations into that basis at inference, fuses the inverse rotation into weights offline, and applies GPTQ to weights. Reported FID of 15.9 and PSNR of 19.1 are measured on the held-out MJHQ-30K dataset for PixArt-Σ and compared against the external baseline SVDQuant under identical W4A4 settings. No equation or result reduces by construction to a fitted parameter or self-defined quantity within the paper; the performance numbers are independent empirical outcomes on data separate from calibration. The derivation chain relies on standard PTQ techniques and external benchmarks rather than self-citation chains or tautological redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters or axioms; the subspace dimension choice and calibration data selection are implicit but not quantified.

pith-pipeline@v0.9.0 · 5895 in / 1281 out tokens · 51764 ms · 2026-05-19T21:43:48.527133+00:00 · methodology

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 17 internal anchors

[1]

QuaRot: Outlier-free 4-bit inference in rotated LLMs

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In 38th Annual Conference on Neural Information Processing Systems, 2024

[2]

Blended diffusion for text-driven editing of natural images

Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18208–18218, 2022

[3]

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

[4]

All are worth words: A ViT backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT backbone for diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023

[5]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461, 2022

[6]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11315–11325, 2022

[7]

MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. In Forty-first International Conference on Machine Learning, 2024

[8]

PixArt-Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In 18th European Conference on Computer Vision. Springer, 2024

[9]

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

[10]

Q-DiT: Accurate post-training quantization for diffusion transformers

Lei Chen, Yuan Meng, Chen Tang, Xinzhu Ma, Jingyan Jiang, Xin Wang, Zhi Wang, and Wenwu Zhu. Q-DiT: Accurate post-training quantization for diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28306–28315, 2025

[11]

FP4DiT: Towards effective floating point quantization for diffusion transformers

Ruichen Chen, Keith G. Mills, and Di Niu. FP4DiT: Towards effective floating point quantization for diffusion transformers. Transactions on Machine Learning Research (TMLR), 2025

[12]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

[13]

MJ-Bench: Is your multimodal reward model really a good judge for text-to-image generation?

Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, et al. MJ-Bench: Is your multimodal reward model really a good judge for text-to-image generation? arXiv preprint arXiv:2407.04842, 2024

[14]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024

[15]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International conference on machine learning, pages 7480–7512. PMLR, 2023

[16]

LLM.int8(): 8-bit matrix multiplication for transformers at scale

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in neural information processing systems, 35:30318–30332, 2022

[17]

DiTAS: Quantizing diffusion transformers via enhanced activation smoothing

Zhenyuan Dong and Sai Qian Zhang. DiTAS: Quantizing diffusion transformers via enhanced activation smoothing. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4606–4615. IEEE, 2025

[18]

The approximation of one matrix by another of lower rank

Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936

[19]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In 41st international conference on machine learning, 2024

[20]

OPTQ: Accurate post-training quantization for generative pre-trained transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan-Adrian Alistarh. OPTQ: Accurate post-training quantization for generative pre-trained transformers. In 11th International Conference on Learning Representations, 2023

[21]

Vector quantization and signal compression

Allen Gersho and Robert M Gray. Vector quantization and signal compression, volume 159. Springer Science & Business Media, 2012

[22]

FineGRAIN: Evaluating failure modes of text-to-image models with vision language model judges

Kevin David Hayes, Micah Goldblum, Vikash Sehwag, Gowthami Somepalli, Ashwinee Panda, and Tom Goldstein. FineGRAIN: Evaluating failure modes of text-to-image models with vision language model judges. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

[23]

PTQD: Accurate post-training quantization for diffusion models

Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. PTQD: Accurate post-training quantization for diffusion models. Advances in Neural Information Processing Systems, 36:13237–13249, 2023

[24]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016

[25]

ClipScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. ClipScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021

[26]

GANs trained by a two time-scale update rule converge to a local Nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30, 2017

[27]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

[28]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

[29]

ConvRot: Rotation-based plug-and-play 4-bit quantization for diffusion transformers

Feice Huang et al. ConvRot: Rotation-based plug-and-play 4-bit quantization for diffusion transformers. arXiv preprint arXiv:2512.03673, 2025

[30]

Bk-SDM: A lightweight, fast, and cheap version of stable diffusion

Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. Bk-SDM: A lightweight, fast, and cheap version of stable diffusion. In European Conference on Computer Vision, pages 381–399. Springer, 2024

[31]

DiffWave: A Versatile Diffusion Model for Audio Synthesis

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020

[32]

VIEScore: Towards explainable metrics for conditional image synthesis evaluation

Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VIEScore: Towards explainable metrics for conditional image synthesis evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12268–12290, 2024

[33]

Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024. Accessed: 2026-04-10

[34]

Prometheus-vision: Vision-language model as a judge for fine-grained evaluation

Seongyun Lee, Seungone Kim, Sue Park, Geewook Kim, and Minjoon Seo. Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11286–11315, 2024

[35]

Lhotse documentation: lhotse.cut.mixed module

Lhotse Development Team. Lhotse documentation: lhotse.cut.mixed module. https://lhotse.readthedocs.io/en/v1.24.2/_modules/lhotse/cut/mixed.html, 2024. Accessed: 2026-04-10

[36]

Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024

[37]

SVDQuant: Absorbing outliers by low-rank component for 4-bit diffusion models

Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Junxian Guo, Xiuyu Li, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. SVDQuant: Absorbing outliers by low-rank component for 4-bit diffusion models. In 13th International Conference on Learning Representations, 2025

[38]

Q-Diffusion: Quantizing diffusion models

Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-Diffusion: Quantizing diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17535–17545, 2023

[39]

SnapFusion: Text-to-image diffusion model on mobile devices within two seconds

Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. SnapFusion: Text-to-image diffusion model on mobile devices within two seconds. Advances in Neural Information Processing Systems, 36:20662–20678, 2023

[40]

AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems, 6, 2024

[41]

QServe: W4A8KV4 quantization and system co-design for efficient LLM serving

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. Proceedings of Machine Learning and Systems, 7, 2025

[42]

SpinQuant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. In 13th International Conference on Learning Representations, 2025

[43]

DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in neural information processing systems, 35:5775–5787, 2022

[44]

DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, 22(4):730–751, 2025

[45]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023

[46]

DeepCache: Accelerating diffusion models for free

Xinyin Ma, Gongfan Fang, and Xinchao Wang. DeepCache: Accelerating diffusion models for free. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15762–15772, 2024

[47]

Midjourney: Text-to-image generation model

Midjourney, Inc. Midjourney: Text-to-image generation model. https://www.midjourney.com, 2022. Accessed: 2026-04-10

[48]

Up or down? Adaptive rounding for post-training quantization

Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? Adaptive rounding for post-training quantization. In International Conference on Machine Learning (ICML), 2020

[49]

NVIDIA GeForce RTX 4090

NVIDIA Corporation. NVIDIA GeForce RTX 4090. https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4090/, 2022. Accessed: 2026-04-28

[50]

NVIDIA Blackwell Architecture

NVIDIA Corporation. NVIDIA Blackwell Architecture. https://resources.nvidia.com/en-us-blackwell-architecture, 2024. Accessed: 2026-04-11

[51]

GPT-4o System Card

OpenAI. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

[52]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

[53]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

[54]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

[55]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

[56] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

[57] Hyogon Ryu, NaHyeon Park, and Hyunjung Shim. DGQ: Distribution-aware group quantization for text-to-image diffusion models. In The Thirteenth International Conference on Learning Representations (ICLR), 2025

[58] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

[59] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

[60] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In European Conference on Computer Vision, pages 87–103. Springer, 2024

[61] Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. ResQ: Mixed-precision quantization of large language models with low-rank residuals. In Proceedings of the 42nd International Conference on Machine Learning, 2025

[62] Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1972–1981, 2023

[63] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015

[64] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

[65] Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023

[66] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019

[67] Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero-Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26700–26709, 2024

[68] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022. Accessed: 2026-04-10

[69] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023

[70] Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, and Jun Zhu. SparseDM: Toward sparse efficient diffusion models. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–7. IEEE, 2025

[71] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[72] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image Quality Assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004

[73] Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, and Yan Yan. PTQ4DiT: Post-training quantization for diffusion transformers. Advances in neural information processing systems, 37:62732–62755, 2024

[74] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023

[75] Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers. In 13th International Conference on Learning Representations, 2025

[76] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

[77] Lianwei Yang, Haokun Lin, Tianchen Zhao, Yichen Wu, Hongyu Zhu, Ruiqi Xie, Zhenan Sun, Yu Wang, and Qingyi Gu. LRQ-DiT: Log-rotation post-training quantization of diffusion transformers for image and video generation. arXiv preprint arXiv:2508.03485, 2025

[78] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023

[79] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A Survey on Multimodal Large Language Models. National Science Review, 11(12):nwae403, 2024

[80] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

[48] [48]

Up or down? Adaptive rounding for post-training quantization

Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? Adaptive rounding for post-training quantization. InInternational Conference on Machine Learning (ICML), 2020

work page 2020

[49] [49]

NVIDIA GeForce RTX 4090

NVIDIA Corporation. NVIDIA GeForce RTX 4090. https://www.nvidia.com/en-us/ geforce/graphics-cards/40-series/rtx-4090/, 2022. Accessed: 2026-04-28

work page 2022

[50] [50]

NVIDIA Blackwell Architecture

NVIDIA Corporation. NVIDIA Blackwell Architecture. https://resources.nvidia.com/ en-us-blackwell-architecture, 2024. Accessed: 2026-04-11

work page 2024

[51] [51]

GPT-4o System Card

OpenAI. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023

[53] [53]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[54] [54]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

work page 2021

[55] [55]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

work page 2022

[56] [56]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

work page 2015

[57] [57]

DGQ: Distribution-aware group quantization for text-to-image diffusion models

Hyogon Ryu, NaHyeon Park, and Hyunjung Shim. DGQ: Distribution-aware group quantization for text-to-image diffusion models. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025

work page 2025

[58] [58]

Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

work page 2022

[59] [59]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[60] [60]

Adversarial diffusion distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, pages 87–103. Springer, 2024

work page 2024

[61] [61]

ResQ: Mixed-precision quantiza- tion of large language models with low-rank residuals

Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. ResQ: Mixed-precision quantiza- tion of large language models with low-rank residuals. InProceedings of the 42nd International Conference on Machine Learning, 2025

work page 2025

[62] [62]

Post-training quantization on diffusion models

Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1972–1981, 2023

work page 1972

[63] [63]

Deep unsuper- vised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. PMLR, 2015

work page 2015

[64] [64]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[65] [65]

Improved Techniques for Training Consistency Models

Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models.arXiv preprint arXiv:2310.14189, 2023. 13

work page internal anchor Pith review arXiv 2023

[66] [66]

Triton: an intermediate language and compiler for tiled neural network computations

Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019

work page 2019

[67] [67]

A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions

Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero-Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26700–26709, 2024

work page 2024

[68] [68]

Diffusers: State-of-the-art diffusion models

Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/ diffusers, 2022. Accessed: 2026-04-10

work page 2022

[69] [69]

Exploring clip for assessing the look and feel of images

Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023

work page 2023

[70] [70]

SparseDM: Toward sparse efficient diffusion models

Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, and Jun Zhu. SparseDM: Toward sparse efficient diffusion models. In2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–7. IEEE, 2025

work page 2025

[71] [71]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[72] [72]

Image Quality Assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600– 612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image Quality Assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600– 612, 2004

work page 2004

[73] [73]

PTQ4DiT: Post-training quantization for diffusion transformers

Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, and Yan Yan. PTQ4DiT: Post-training quantization for diffusion transformers. Advances in neural information processing systems, 37:62732–62755, 2024

[74]

SmoothQuant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023

[75]

SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers. In 13th International Conference on Learning Representations, 2025

[76]

ImageReward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

[77]

LRQ-DiT: Log-rotation post-training quantization of diffusion transformers for image and video generation

Lianwei Yang, Haokun Lin, Tianchen Zhao, Yichen Wu, Hongyu Zhu, Ruiqi Xie, Zhenan Sun, Yu Wang, and Qingyi Gu. LRQ-DiT: Log-rotation post-training quantization of diffusion transformers for image and video generation. arXiv preprint arXiv:2508.03485, 2025

[78]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023

[79]

A Survey on Multimodal Large Language Models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A Survey on Multimodal Large Language Models. National Science Review, 11(12):nwae403, 2024

[80]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024
