pith. machine review for the scientific record.

arxiv: 2605.09503 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: no theorem link

PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords post-training quantization · diffusion models · channel reordering · per-group quantization · low-bit inference · quantization error · FLUX.1 · DiT models
0 comments

The pith

Reordering channels to group similar statistics reduces per-group quantization error in diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large diffusion models require heavy compute and memory, limiting their use on single GPUs or in interactive settings. Post-training quantization compresses them but suffers quality loss at low bits because per-group scaling lets outlier channels dominate shared scales when unlike channels sit together. The paper shows that a simple pre-sort of channels by a joint second-moment measure of activations and weights, followed by a calibration check that accepts the order only if it lowers error, consistently improves quantization. The chosen permutations are folded into weights or neighboring layers offline so inference incurs no extra cost. Experiments on several large models confirm lower error than prior PTQ methods and deliver measurable speed and memory gains.
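The failure mode described here is easy to reproduce on synthetic data. The sketch below is an illustration, not the paper's code: the channel values and the group size of eight are invented for the demo. It applies symmetric 4-bit absmax quantization with one shared scale per contiguous group, then compares an interleaved channel order (outliers scattered across groups) against a magnitude-sorted one.

```python
import numpy as np

def group_quant_error(x, order, group_size=8, bits=4):
    # symmetric absmax quantization, one shared scale per contiguous group
    # of channels; returns the total squared error under this channel order
    x = x[order].reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax) * scale
    return float(np.sum((x - q) ** 2))

rng = np.random.default_rng(0)
x = np.empty(64)
x[::2] = rng.uniform(0.1, 0.5, 32) * rng.choice([-1, 1], 32)  # small channels
x[1::2] = 8.0 * rng.choice([-1, 1], 32)                       # outlier channels

mixed = np.arange(64)                 # outliers and small channels interleaved
by_magnitude = np.argsort(np.abs(x))  # similar statistics end up grouped
assert group_quant_error(x, by_magnitude) < group_quant_error(x, mixed)
```

In the interleaved order every group's scale is set by an outlier, so the small channels all round to zero and their full magnitude becomes error; sorting lets the small-channel groups use a scale roughly 16× finer.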

Core claim

The central claim is that selecting a channel permutation on calibration data via a joint second-moment criterion places channels with similar activation and weight statistics into the same per-group quantization block, thereby lowering the shared scale's sensitivity to outliers and reducing overall quantization error for diffusion models. The method applies the permutation only when it improves calibration error and absorbs the reordering into adjacent modules or weights, preserving exact inference speed. This yields lower error than existing PTQ baselines across multiple large diffusion models.

What carries the argument

The joint second-moment criterion used to sort channels before grouping, paired with a calibration-data acceptance rule that applies the permutation only if it reduces measured quantization error.
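A minimal sketch of that machinery, under stated assumptions: the paper's exact joint second-moment formula is not reproduced here, so the product of per-channel activation and weight second moments stands in for it, and `per_group_error` is a generic absmax error metric rather than the authors' implementation.

```python
import numpy as np

def per_group_error(acts, order, group_size=8, bits=4):
    # generic absmax per-group quantization error of reordered activations;
    # a stand-in metric, not the authors' implementation
    a = acts[:, order]
    qmax = 2 ** (bits - 1) - 1
    err = 0.0
    for i in range(0, a.shape[1], group_size):
        g = a[:, i:i + group_size]
        scale = max(np.abs(g).max() / qmax, 1e-12)
        err += float(np.sum((g - np.round(g / scale) * scale) ** 2))
    return err

def select_permutation(calib_acts, W, group_size=8):
    # joint second-moment score per input channel; the paper's exact
    # combination is not given here, a product of the two moments is assumed
    score = np.mean(calib_acts ** 2, axis=0) * np.mean(W ** 2, axis=0)
    perm = np.argsort(score)
    identity = np.arange(W.shape[1])
    # acceptance rule: keep the sorted order only if calibration error drops
    if per_group_error(calib_acts, perm, group_size) < per_group_error(
            calib_acts, identity, group_size):
        return perm
    return identity

rng = np.random.default_rng(0)
channel_scale = rng.permutation(np.logspace(-2, 1, 64))  # synthetic statistics
calib_acts = rng.normal(size=(32, 64)) * channel_scale
W = rng.normal(size=(128, 64))
order = select_permutation(calib_acts, W)
```

By construction the returned order never increases calibration error: the acceptance rule falls back to the identity whenever the sort does not help.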

If this is right

  • Quantization error falls under W4A4 and similar low-bit regimes for diffusion models.
  • Models achieve up to 1.8× single-step speedup on hardware such as the RTX 5090.
  • DiT memory footprint shrinks by 3.5× under W4A4 NVFP4 quantization.
  • The approach outperforms prior post-training quantization baselines without requiring retraining.
  • Permutations can be applied offline to weights or absorbed into adjacent layers with no runtime overhead.
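The zero-overhead claim in the last bullet rests on a simple algebraic identity: a permutation of a linear layer's input columns cancels against the same permutation applied to its incoming activations, so once the upstream module is made to emit activations in the permuted order (which the paper does offline), this layer's output is unchanged. A quick check of the identity, on invented shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))   # linear layer weight, shape (out, in)
x = rng.normal(size=16)        # input activations
perm = rng.permutation(16)     # channel order selected offline

# sum_j W[:, perm[j]] * x[perm[j]] == sum_c W[:, c] * x[c],
# so reordering W's input columns while the producer emits x in the same
# permuted order reproduces the original output exactly
W_folded = W[:, perm]
assert np.allclose(W_folded @ x[perm], W @ x)
```

This snippet only verifies the algebra; absorbing the activation-side permutation into the preceding module so that no runtime gather remains is the part the paper engineers.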

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reordering principle could extend to per-group quantization in non-diffusion transformers if the second-moment criterion remains predictive.
  • If calibration sets are chosen to cover edge-case prompts, the method might support even lower bit widths such as 2-bit without visible degradation.
  • Applying the technique after weight pruning or alongside activation-aware scaling could compound compression gains.
  • The offline absorption step suggests similar reordering tactics are feasible in other compression pipelines where runtime cost must stay zero.

Load-bearing premise

A permutation chosen on calibration data will generalize to the full input distribution seen at inference time without creating new artifacts in generated images.

What would settle it

Compare quantization error or generation metrics such as FID on a held-out prompt set between the quantized model with the selected permutation and the same model without permutation; an increase in error or drop in quality falsifies the claim.
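A toy version of that settling experiment, with a generic per-group absmax error standing in for the paper's error measure and for generation metrics such as FID, and synthetic activations standing in for real diffusion features:

```python
import numpy as np

def per_group_error(acts, order, group_size=8, bits=4):
    # generic symmetric absmax per-group error over a batch of activations;
    # a stand-in for the paper's error measure and for metrics such as FID
    a = acts[:, order]
    qmax = 2 ** (bits - 1) - 1
    err = 0.0
    for i in range(0, a.shape[1], group_size):
        g = a[:, i:i + group_size]
        scale = np.abs(g).max() / qmax
        err += float(np.sum((g - np.round(g / scale) * scale) ** 2))
    return err

rng = np.random.default_rng(2)
base = np.empty(64)
base[::2] = rng.uniform(0.1, 0.5, 32) * rng.choice([-1, 1], 32)  # small channels
base[1::2] = 8.0 * rng.choice([-1, 1], 32)                       # outlier channels
calib = base + 0.05 * rng.normal(size=(32, 64))     # calibration activations
heldout = base + 0.05 * rng.normal(size=(256, 64))  # held-out activations

perm = np.argsort(np.mean(calib ** 2, axis=0))  # order chosen on calibration only
identity = np.arange(64)
# the premise survives only if the calibration-chosen order still lowers
# error on held-out activations; an increase here would falsify it
assert per_group_error(heldout, perm) < per_group_error(heldout, identity)
```

Here generalization holds because the per-channel statistics are stable between the two sets; the referee's concern is precisely that timestep-dependent shifts in real diffusion activations may break that stability.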

Figures

Figures reproduced from arXiv: 2605.09503 by Junxian Li, Kai Liu, Kaiwen Tao, Renjing Pei, Yongsen Cheng, Yulun Zhang, Zhikai Chen, Zhixin Wang.

Figure 1
Figure 1: PermuQuant is a post-training quantization framework for low-bit diffusion models. In the W4A4 setting, it achieves 3.5× DiT memory reduction and 6.3× speedup on a single RTX 5090 32GB GPU by eliminating CPU offloading. In the challenging W3A3 setting, PermuQuant still produces visually clean results with faithful details, significantly outperforming other baselines.
Figure 2
Figure 2: Example of per-group quantization. Quantization error is greatly affected by channel orders.
Figure 3
Figure 3: (a) and (b): Relative change in activation quantization error caused by random channel …
Figure 4
Figure 4: Overview of PermuQuant. (a) Channel reordering places channels with similar statistics …
Figure 5
Figure 5: Qualitative comparison on SANA-1.5-1.6B and FLUX.1-dev. All methods are evaluated …
Figure 6
Figure 6: Efficiency comparison on FLUX.1-dev using an RTX 5090 Desktop 32GB GPU. Reordering …
Figure 7
Figure 7: Qualitative comparison on SANA-1.5-1.6B. The dashed line separates BF16 from the …
Figure 8
Figure 8: Qualitative comparison on FLUX.1-dev. The dashed line separates BF16 from the quantized …
Figure 9
Figure 9: Qualitative comparison on Z-Image-Turbo. The dashed line separates BF16 from the …
read the original abstract

Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ framework for low-bit diffusion models. PermuQuant sorts channels by a joint second-moment criterion before per-group quantization, placing channels with similar activation and weight statistics into the same group. It further uses a calibration-based acceptance rule to apply reordering only when the selected permutation reduces quantization error on calibration data. The selected permutations are absorbed into adjacent modules or applied to weights offline, avoiding explicit runtime permutation operations. Extensive experiments on multiple large diffusion models show that PermuQuant consistently reduces quantization error and outperforms existing PTQ baselines. On FLUX.1-dev with an RTX 5090, PermuQuant achieves up to a 1.8$\times$ single step speedup and reduces the DiT memory footprint by 3.5$\times$ under W4A4 NVFP4 quantization. Code will be available at https://github.com/yscheng04/PermuQuant.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PermuQuant, a post-training quantization (PTQ) method for diffusion models that identifies channel ordering as a key factor in per-group quantization error. It proposes sorting channels by a joint second-moment criterion computed on calibration data to group channels with similar activation/weight statistics, applies a calibration-based acceptance rule to retain only error-reducing permutations, and absorbs the selected permutations offline into adjacent modules or weights to avoid runtime cost. The central claim is that this yields consistent quantization error reduction and outperforms existing PTQ baselines, with concrete gains reported on FLUX.1-dev (up to 1.8× single-step speedup and 3.5× DiT memory reduction under W4A4 NVFP4).

Significance. If the empirical claims hold, the work provides a lightweight, training-free improvement to low-bit quantization of large generative models, directly addressing deployment barriers on single-GPU or resource-constrained settings. The offline absorption of permutations is a practical strength that incurs no inference overhead. The focus on an underexplored aspect of per-group quantization (channel ordering) could influence future PTQ designs for diffusion and other generative architectures.

major comments (2)
  1. [Method (permutation selection and acceptance rule)] The central claim of consistent error reduction and outperformance rests on the assumption that a permutation chosen via the joint second-moment criterion on calibration data will generalize across the full range of activations encountered during diffusion sampling. Diffusion models exhibit substantial distribution shifts from noisy to clean latents; the manuscript should therefore report quantization error or generation quality (e.g., FID) measured on held-out activations spanning multiple timesteps, or provide an ablation comparing calibration-only vs. full-trajectory error to substantiate generalization.
  2. [Experiments] The abstract and introduction assert 'consistent' outperformance and error reduction with specific speedup/memory numbers, yet the manuscript text provides no quantitative tables, per-layer or per-model error comparisons, baseline implementation details, ablation studies on the acceptance rule, or error-bar statistics. Without these, the magnitude and reliability of the claimed improvements cannot be verified.
minor comments (2)
  1. [Method] Clarify the exact definition of the joint second-moment criterion (e.g., the mathematical formulation combining activation and weight statistics) and whether it is computed per-layer or globally.
  2. [Abstract] The statement 'Code will be available' should be accompanied by a concrete repository link or a note on reproducibility artifacts (e.g., calibration data splits).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, explaining our approach and planned revisions.

read point-by-point responses
  1. Referee: [Method (permutation selection and acceptance rule)] The central claim of consistent error reduction and outperformance rests on the assumption that a permutation chosen via the joint second-moment criterion on calibration data will generalize across the full range of activations encountered during diffusion sampling. Diffusion models exhibit substantial distribution shifts from noisy to clean latents; the manuscript should therefore report quantization error or generation quality (e.g., FID) measured on held-out activations spanning multiple timesteps, or provide an ablation comparing calibration-only vs. full-trajectory error to substantiate generalization.

    Authors: We agree that distribution shifts across timesteps in diffusion models warrant explicit verification of generalization. Our calibration set is constructed by sampling activations from multiple timesteps along the diffusion trajectory to capture a representative range of statistics. The acceptance rule further filters permutations to those that reduce error on this calibration data. To directly address the concern, we will add an ablation in the revised manuscript comparing per-group quantization error (and FID where feasible) on the calibration set versus held-out timesteps from the full sampling trajectory. revision: yes

  2. Referee: [Experiments] The abstract and introduction assert 'consistent' outperformance and error reduction with specific speedup/memory numbers, yet the manuscript text provides no quantitative tables, per-layer or per-model error comparisons, baseline implementation details, ablation studies on the acceptance rule, or error-bar statistics. Without these, the magnitude and reliability of the claimed improvements cannot be verified.

    Authors: We acknowledge that the initial manuscript text did not include the requested quantitative tables, per-layer breakdowns, or statistical details, which limits verifiability. The abstract reports aggregate gains, but we will revise the experiments section to add comprehensive tables showing per-layer and per-model quantization error reductions, full baseline implementation details, dedicated ablations on the acceptance rule, and error bars computed over multiple independent calibration runs. revision: yes

Circularity Check

0 steps flagged

No circularity: PermuQuant selects permutations via an explicit external criterion on calibration data

full rationale

The derivation defines a joint second-moment sorting criterion and a calibration-based acceptance rule that are applied offline to weights or adjacent modules. These steps are independent of the final inference-time error reduction claims; the paper measures improvement on separate test sets and models rather than re-using the selection criterion as its own output. No equations reduce claimed gains to a fitted parameter defined by the result itself, and no self-citation chain bears the central premise. The approach is externally falsifiable on held-out diffusion sampling trajectories.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that grouping channels by similar second-moment statistics reduces per-group quantization error; the acceptance rule is an empirical safeguard rather than a free parameter.

axioms (1)
  • domain assumption Channels with similar activation and weight second-moment statistics benefit from sharing a quantization scale.
    This is the core observation stated in the abstract that motivates the reordering step.

pith-pipeline@v0.9.0 · 5627 in / 1329 out tokens · 64553 ms · 2026-05-12T04:17:00.312464+00:00 · methodology

discussion (0)

