RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models
Pith reviewed 2026-06-29 19:38 UTC · model grok-4.3
The pith
Diffusion transformer activations tolerate N:M sparsity far better than weights, enabling 1.55x faster linear layers without quality loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Applying N:M sparsification to activations together with error-compensation techniques preserves generation quality while custom CUDA kernels deliver up to 1.55x speedup on average in linear layers.
What carries the argument
RT-Lynx, which applies N:M sparsification directly to activations, uses error compensation to restore accuracy, and supplies optimized CUDA kernels for the resulting sparse GEMM operations.
If this is right
- Linear-layer inference time in diffusion transformers drops without retraining or architectural changes.
- The same N:M activation pattern can be reused across different DiT variants while keeping generation fidelity.
- Hardware kernels that accelerate sparse activation multiplies become the practical path to lower latency rather than weight pruning.
- Error compensation can be applied on top of other semi-structured patterns without changing the rest of the model.
Where Pith is reading between the lines
- Activation sparsity may transfer to other transformer-based image or video generators that share similar attention and feed-forward blocks.
- Future accelerators could prioritize native support for dynamic sparse activations over static weight sparsity.
- The robustness gap between activation and weight pruning suggests that training-time regularization focused on activations might further enlarge the speed-quality trade-off.
Load-bearing premise
DiT activations remain sufficiently sparse and the error-compensation step fully restores output quality on every prompt, resolution, and model variant without hidden degradation.
What would settle it
Running the sparsified model on a new prompt distribution or higher resolution and observing a drop in FID or CLIP score relative to the dense baseline that error compensation does not close.
read the original abstract
Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study, however, shows that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving up to a 1.55x speedup on average in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. It proposes RT-Lynx to apply N:M sparsification to activations (with error-compensation), implements optimized CUDA kernels for linear layers, and reports up to 1.55x average speedup while preserving generation quality across multiple diffusion models.
Significance. If the experimental claims hold with proper verification, the work would support a useful paradigm shift toward activation sparsity in diffusion transformers, with practical value from the tailored kernels. The absence of parameter-free derivations or machine-checked elements means the contribution rests entirely on the empirical results.
major comments (3)
- [Abstract] Abstract: the central claim that activations are 'intrinsically sparse and significantly more robust' to N:M sparsification than weights is presented without any quantitative sparsity ratios (e.g., fraction of values below threshold per layer), per-layer statistics, or direct comparison of activation vs. weight sensitivity at the same N:M ratios.
- [Abstract] Abstract: the assertion of 'extensive experiments' that 'preserve the generation quality' supplies no error bars, dataset details, timestep breakdowns, or ablation on the error-compensation step, leaving the quality-preservation claim unverified and load-bearing for the overall result.
- [Abstract] Abstract: the reported 1.55x speedup on linear layers is given as an empirical measurement but without variance across models/timesteps, measurement methodology, or explicit comparison against weight-sparsification baselines under identical conditions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback focused on the abstract. We address each major comment below, clarifying where supporting details appear in the manuscript and indicating revisions to better substantiate the claims within abstract constraints.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that activations are 'intrinsically sparse and significantly more robust' to N:M sparsification than weights is presented without any quantitative sparsity ratios (e.g., fraction of values below threshold per layer), per-layer statistics, or direct comparison of activation vs. weight sensitivity at the same N:M ratios.
Authors: Section 3.1 reports per-layer activation sparsity statistics (typically 45-65% of values below the 2:4 threshold across DiT blocks) and Figure 2 directly compares sensitivity, showing activations incur <0.8 FID increase at 2:4 sparsity while weights cause 4-12 FID degradation under identical ratios. We will revise the abstract to include representative quantitative ratios and a concise robustness comparison. revision: yes
-
Referee: [Abstract] Abstract: the assertion of 'extensive experiments' that 'preserve the generation quality' supplies no error bars, dataset details, timestep breakdowns, or ablation on the error-compensation step, leaving the quality-preservation claim unverified and load-bearing for the overall result.
Authors: Table 2 reports error bars (std. dev. over 3 random seeds), Section 4 specifies datasets (ImageNet-256, MS-COCO) and timestep breakdowns via per-timestep FID curves in Figure 5, and Section 4.3 ablates error compensation (showing 1.2-2.1 FID improvement). Abstract length limits preclude full inclusion; we will add a brief clause referencing the evaluation protocol and error-compensation role. revision: partial
-
Referee: [Abstract] Abstract: the reported 1.55x speedup on linear layers is given as an empirical measurement but without variance across models/timesteps, measurement methodology, or explicit comparison against weight-sparsification baselines under identical conditions.
Authors: Section 5.1 details the methodology (CUDA event timing on A100, batch size 1, averaged over 50 inference steps) with per-model variance in Table 4 (1.42-1.68x range); Section 5.3 provides head-to-head comparison against weight-sparsity kernels under matched sparsity patterns. We will revise the abstract to note the average is across models with baseline comparison. revision: yes
Circularity Check
No circularity: empirical observation and measured speedup
full rationale
The paper's central claims rest on an empirical observation that DiT activations are more robust to N:M sparsification than weights, followed by a proposed method (RT-Lynx) with error compensation and custom CUDA kernels whose performance is reported as measured runtime. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs; the 1.55x speedup is an empirical benchmark result. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is self-contained against external benchmarks (measured inference time on standard models) and receives the default non-finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption DiT activations exhibit intrinsic N:M semi-structured sparsity that is robust to pruning
Reference graph
Works this paper leans on
-
[1]
Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, and Gongyi Wang. Amber pruner: Leveraging n: M activation sparsity for efficient prefill in large language models.arXiv preprint arXiv:2508.02128, 2025
-
[2]
Structured sparsity in the nvidia ampere architecture and applications in search engines, Jul 2023
Hongxiao Bai and Yun Li. Structured sparsity in the nvidia ampere architecture and applications in search engines, Jul 2023. NVIDIA Developer Blog,https://developer.nvidia.com/blog/
2023
-
[3]
Meissonic: Revitalizing masked generative transformers for efficient high-resolution text-to-image synthesis
Jinbin Bai, Tian Ye, Wei Chow, Enxin Song, Qing-Guo Chen, Xiangtai Li, Zhen Dong, Lei Zhu, and Shuicheng Yan. Meissonic: Revitalizing masked generative transformers for efficient high-resolution text-to-image synthesis. InThe Thirteenth International Conference on Learning Representations, 2024
2024
-
[4]
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network.arXiv preprint arXiv:2504.13181, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Accelerating neural network training with semi-structured (2:4) sparsity.https://pytorch.org/blog/accelerating-neural-network-training/, Jun 2024
Jesse Cai, Daniel Haziza, and Supriya Rao. Accelerating neural network training with semi-structured (2:4) sparsity.https://pytorch.org/blog/accelerating-neural-network-training/, Jun 2024. PyTorch Blog
2024
-
[6]
Sana-video: Efficient video generation with block linear diffusion transformer, 2025
Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, and Enze Xie. Sana-video: Efficient video generation with block linear diffusion transformer, 2025. URLhttps://arxiv.org/abs/2...
-
[7]
Sharegpt4v: Improving large multi-modal models with better captions
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision, pages 370–387. Springer, 2024
2024
-
[8]
δ-dit: A training-free acceleration method tailored for diffusion transformers, 2024
Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, and Tao Chen. δ-dit: A training-free acceleration method tailored for diffusion transformers, 2024. URL https://arxiv.org/abs/2406.01125
-
[9]
Z-image model details.https://github.com/modelscope/DiffSynth-Studio/blob/ main/docs/en/Model_Details/Z-Image.md, 2026
ModelScope Community. Z-image model details.https://github.com/modelscope/DiffSynth-Studio/blob/ main/docs/en/Model_Details/Z-Image.md, 2026. Accessed: 2026-01-14
2026
-
[10]
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023. URLhttps://arxiv.org/abs/2309.08600
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Beyond size: How gradients shape pruning decisions in large language models, 2024
Rocktim Jyoti Das, Mingjie Sun, Liqun Ma, and Zhiqiang Shen. Beyond size: How gradients shape pruning decisions in large language models, 2024. URLhttps://arxiv.org/abs/2311.04902
-
[12]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021
2021
-
[13]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024
2024
-
[14]
Maskllm: Learnable semi-structured sparsity for large language models.Advances in Neural Information Processing Systems, 37:7736–7758, 2024
Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, and Xinchao Wang. Maskllm: Learnable semi-structured sparsity for large language models.Advances in Neural Information Processing Systems, 37:7736–7758, 2024
2024
-
[15]
Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, and Yu Wang. Salad: Achieve high-sparsity attention via efficient linear attention tuning for video diffusion transformer, 2026. URLhttps://arxiv.org/abs/2601.16515
-
[16]
Dit4edit: Diffusion transformer for image editing
Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, and Zeyu Wang. Dit4edit: Diffusion transformer for image editing. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 2969–2977, 2025
2025
-
[17]
Sparsegpt: Massive language models can be accurately pruned in one-shot
Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International conference on machine learning, pages 10323–10337. PMLR, 2023. 12
2023
-
[18]
Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis
Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025
2025
-
[19]
Daniel Haziza, Timothy Chou, Dhruv Choudhary, Luca Wehrstedt, Francisco Massa, Jiecao Yu, Geonhwa Jeong, Supriya Rao, Patrick Labatut, and Jesse Cai. Accelerating transformer inference and training with 2: 4 activation sparsity.arXiv preprint arXiv:2503.16672, 2025
-
[20]
Clipscore: A reference-free evaluation metric for image captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021
2021
-
[21]
Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017
2017
-
[22]
Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
2020
-
[23]
Pruning large language models with semi- structural adaptive sparse training
Weiyu Huang, Yuezhou Hu, Guohao Jian, Jun Zhu, and Jianfei Chen. Pruning large language models with semi- structural adaptive sparse training. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24167–24175, 2025
2025
-
[24]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Donghyun Lee, Je-Yong Lee, Genghan Zhang, Mo Tiwari, and Azalia Mirhoseini. Cats: Contextually-aware thresholding for sparsity in large language models.arXiv preprint arXiv:2404.08763, 2024
-
[26]
Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024. URLhttps://arxiv.org/abs/ 2402.17245
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models, 2025
Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models, 2025. URL https://arxiv.org/abs/2411.05007
-
[29]
E-sparse: Boosting the large language model inference through entropy-based n:m sparsity, 2024
Yun Li, Lin Niu, Xipeng Zhang, Kai Liu, Jianchen Zhu, and Zhanhui Kang. E-sparse: Boosting the large language model inference through entropy-based n:m sparsity, 2024. URLhttps://arxiv.org/abs/2310.15929
-
[30]
Efficient gpu kernels for n: M-sparse weights in deep learning.Proceedings of Machine Learning and Systems, 5:513–525, 2023
Bin Lin, Ningxin Zheng, Lei Wang, Shijie Cao, Lingxiao Ma, Quanlu Zhang, Yi Zhu, Ting Cao, Jilong Xue, Yuqing Yang, et al. Efficient gpu kernels for n: M-sparse weights in deep learning.Proceedings of Machine Learning and Systems, 5:513–525, 2023
2023
-
[31]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. URLhttps://arxiv.org/abs/2210.02747
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Timestep embedding tells: It’s time to cache for video diffusion model, 2025
Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model, 2025. URL https://arxiv.org/abs/2411.19108
-
[34]
Hongyi Liu, Rajarshi Saha, Zhen Jia, Youngsuk Park, Jiaji Huang, Shoham Sabach, Yu-Xiang Wang, and George Karypis. Proxsparse: Regularized learning of semi-structured sparsity masks for pretrained llms.arXiv preprint arXiv:2502.00258, 2025. 13
-
[35]
Training-free activation sparsity in large language models.arXiv preprint arXiv:2408.14690, 2024
James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, and Ben Athiwaratkun. Training-free activation sparsity in large language models.arXiv preprint arXiv:2408.14690, 2024
-
[36]
From reusing to forecasting: Accelerating diffusion models with taylorseers, 2025
Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. From reusing to forecasting: Accelerating diffusion models with taylorseers, 2025. URLhttps://arxiv.org/abs/2503.06923
-
[37]
Speca: Accelerating diffusion transformers with speculative feature caching
Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Fei Ren, Shaobo Wang, Kaixin Li, and Linfeng Zhang. Speca: Accelerating diffusion transformers with speculative feature caching. InProceedings of the 33rd ACM International Conference on Multimedia, page 10024–10033. ACM, October 2025. doi: 10.1145/3746027.3755331. URL http://dx.doi.org/10.1145/3746027.3755331
-
[38]
Kai Liu, Bowen Xu, Shaoyu Wu, Xin Chen, Hao Zhou, Yongliang Tao, and Lulu Hu. La rosa: Enhancing llm efficiency via layerwise rotated sparse activation.arXiv preprint arXiv:2507.01299, 2025
-
[39]
Bawa: Automatic optimizing pruning metric for large language models with balanced weight and activation
Lian Liu, Xiandong Zhao, Guanchen Li, Dong Li, Mengdi Wang, Yinhe Han, Xiaowei Li, et al. Bawa: Automatic optimizing pruning metric for large language models with balanced weight and activation. InForty-second International Conference on Machine Learning, 2025
2025
-
[40]
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models, 2025. URL https://arxiv.org/abs/2410.11081
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Deepcache: Accelerating diffusion models for free, 2023
Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free, 2023. URL https://arxiv.org/abs/2312.00858
-
[42]
Model reveals what to cache: Profiling-based feature reuse for video diffusion models, 2025
Xuran Ma, Yexin Liu, Yaofu Liu, Xianfeng Wu, Mingzhe Zheng, Zihao Wang, Ser-Nam Lim, and Harry Yang. Model reveals what to cache: Profiling-based feature reuse for video diffusion models, 2025. URL https://arxiv.org/abs/2504.03140
-
[43]
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations.arXiv preprint arXiv:2108.01073, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[44]
Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. Relu strikes back: Exploiting activation sparsity in large language models. arXiv preprint arXiv:2310.04564, 2023
-
[45]
Diffsynth-studio: An open-source diffusion model engine
ModelScope Community. Diffsynth-studio: An open-source diffusion model engine. GitHub repository, 2025. URL https://github.com/modelscope/DiffSynth-Studio. Accessed 2026
2025
-
[46]
Mohammad Mozaffari, Amir Yazdanbakhsh, Zhao Zhang, and Maryam Mehri Dehnavi. Slope: Double-pruned sparse plus lazy low-rank adapter pretraining of llms.arXiv preprint arXiv:2405.16325, 2024
-
[47]
Mohammad Mozaffari, Amir Yazdanbakhsh, and Maryam Mehri Dehnavi. Slim: One-shot quantization and sparsity with low-rank approximation for llm weight compression.arXiv preprint arXiv:2410.09615, 2025
-
[48]
cusparselt: A high-performance cuda library for sparse matrix-matrix multiplication
NVIDIA Corporation. cusparselt: A high-performance cuda library for sparse matrix-matrix multiplication. https://docs.nvidia.com/cuda/cusparselt/, 2025. Official NVIDIA CUDA Documentation
2025
-
[49]
Cutlass: Cuda templates and python dsls for high-performance linear algebra.https: //github.com/NVIDIA/cutlass, 2026
NVIDIA Corporation. Cutlass: Cuda templates and python dsls for high-performance linear algebra.https: //github.com/NVIDIA/cutlass, 2026. GitHub repository (accessed 2026)
2026
-
[50]
On aliased resizing and surprising subtleties in gan evaluation
Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11410–11420, 2022
2022
-
[51]
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performa...
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[52]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
2023
-
[53]
Using llms as prompt modifier to avoid biases in ai image generators, 2025
René Peinl. Using llms as prompt modifier to avoid biases in ai image generators, 2025. URLhttps://arxiv. org/abs/2504.11104
-
[54]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023. 14
2023
-
[55]
Fora: Fast-forward caching in diffusion transformer acceleration, 2024
Pratheba Selvaraju, Tianyu Ding, Tianyi Chen, Ilya Zharkov, and Luming Liang. Fora: Fast-forward caching in diffusion transformer acceleration, 2024. URLhttps://arxiv.org/abs/2407.01425
-
[56]
Efficient post-training quantization with fp8 formats, 2024
Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, and Mengni Wang. Efficient post-training quantization with fp8 formats, 2024. URLhttps://arxiv.org/abs/2309.14592
-
[57]
Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models
Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, et al. Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models. InProceedings of the 31st International Conference on Computational Linguistics, pages 2626–2644, 2025
2025
-
[58]
Powerinfer: Fast large language model serving with a consumer-grade gpu
Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 590–606, 2024
2024
-
[59]
A Simple and Effective Pruning Approach for Large Language Models
Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models.arXiv preprint arXiv:2306.11695, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image Team. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[61]
A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions, 2024
Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero-Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions, 2024. URL https://arxiv.org/abs/2312.08578
-
[62]
Hongyu Wang, Shuming Ma, Ruiping Wang, and Furu Wei. Q-sparse: All large language models can be fully sparsely-activated.arXiv preprint arXiv:2407.10969, 2024
-
[63]
Exploring clip for assessing the look and feel of images
Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023
2023
-
[64]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[65]
Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity, 2025
Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, and Song Han. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity, 2025. URLhttps://arxiv.org/abs/2502.01776
-
[66]
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[67]
Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:15903–15935, 2023
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:15903–15935, 2023
2023
-
[68]
Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, and Liang-Chieh Chen. 1.58-bit flux, 2024. URLhttps://arxiv.org/abs/2412.18653
-
[69]
Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, and Helen Meng. Simplespeech 2: Towards simple and efficient text-to-speech with flow-based scalar latent transformer diffusion models.IEEE Transactions on Audio, Speech and Language Processing, 2025
2025
- [70]
-
[71]
Ditfastattn: Attention compression for diffusion transformer models, 2024
Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, and Yu Wang. Ditfastattn: Attention compression for diffusion transformer models, 2024. URLhttps: //arxiv.org/abs/2406.08552
-
[72]
Adversarial attacks and defenses on text-to-image diffusion models: A survey.Information Fusion, 114:102701, 2025
Chenyu Zhang, Mingwang Hu, Wenhui Li, and Lanjun Wang. Adversarial attacks and defenses on text-to-image diffusion models: A survey.Information Fusion, 114:102701, 2025. 15
2025
-
[73]
Ditfastattnv2: Head-wise attention compression for multi-modality diffusion transformers, 2025
Hanling Zhang, Rundong Su, Zhihang Yuan, Pengtao Chen, Mingzhu Shen Yibo Fan, Shengen Yan, Guohao Dai, and Yu Wang. Ditfastattnv2: Head-wise attention compression for multi-modality diffusion transformers, 2025. URLhttps://arxiv.org/abs/2503.22796
-
[74]
Gonzalez, Jun Zhu, and Jianfei Chen
Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, and Jianfei Chen. Sla: Beyond sparsity in diffusion transformers via fine-tunable sparse-linear attention, 2025. URLhttps://arxiv.org/abs/2509.24006
-
[75]
Spargeattention: Accurate and training-free sparse attention accelerating any model inference, 2025
Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattention: Accurate and training-free sparse attention accelerating any model inference, 2025. URLhttps://arxiv.org/ abs/2502.18137
-
[76]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023
2023
-
[77]
Vsa: Faster video diffusion with trainable sparse attention, 2025
Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, and Hao Zhang. Vsa: Faster video diffusion with trainable sparse attention, 2025. URLhttps://arxiv.org/abs/2505.13389
-
[78]
Oats: Outlier-aware pruning through sparse and low rank decomposition
Stephen Zhang and Vardan Papyan. Oats: Outlier-aware pruning through sparse and low rank decomposition. arXiv preprint arXiv:2409.13652, 2024
-
[79]
Plug-and-play: An efficient post-training pruning method for large language models
Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, and Carlo Vittorio Cannistraci. Plug-and-play: An efficient post-training pruning method for large language models. 2024
2024
-
[80]
Dynamic sparse no training: Training-free fine-tuning for sparse llms, 2024
Yuxin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, and Rongrong Ji. Dynamic sparse no training: Training-free fine-tuning for sparse llms, 2024. URLhttps: //arxiv.org/abs/2310.08915
-
[81]
Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, and Maosong Sun. ReLU2 wins: Discovering efficient activation functions for sparse llms.arXiv preprint arXiv:2402.03804, 2024
-
[82]
Large scale diffusion distillation via score-regularized continuous-time consistency,
Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.