Pith · machine review for the scientific record

arXiv:2605.14191 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: no theorem link

CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion transformers · token pruning · spatial coherence · efficient inference · self-attention optimization · image generation · video generation

The pith

CoReDiT prunes redundant tokens in diffusion transformers via spatial coherence scores and reconstructs their outputs from neighbors to cut computation while preserving quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoReDiT as a structured pruning method for diffusion transformers that estimates local redundancy in the token lattice with a linear-time spatial coherence score. It skips high-coherence tokens during self-attention and rebuilds the skipped outputs by aggregating information from spatially neighboring retained tokens. A progressive block-adaptive schedule gradually increases pruning and gives larger budgets to steps and blocks with higher redundancy. On backbones such as PixArt-α and MagicDrive-V2, the approach is reported to deliver substantial FLOPs savings and speedups on both cloud and mobile hardware without degrading visual quality.

Core claim

CoReDiT establishes that spatial coherence in the latent token lattice reliably marks redundant tokens that can be skipped in DiT self-attention, with coherence-guided aggregation of neighboring retained tokens restoring dense outputs and avoiding visual discontinuities. A progressive, block-adaptive pruning schedule that ramps up pruning and allocates larger budgets to high-redundancy blocks and denoising steps further improves efficiency. Across state-of-the-art diffusion backbones, this yields up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs while maintaining high visual quality and increasing on-device memory headroom.

What carries the argument

The spatial coherence score, a linear-time metric of local redundancy in the token lattice, decides which tokens to prune from self-attention and guides neighbor-based aggregation to reconstruct their attention outputs.
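The paper's exact formula is not reproduced on this page, so the following is only a minimal sketch of one plausible linear-time score: each token is compared against the mean of its 3×3 spatial neighborhood by cosine similarity. The window size, the similarity measure, and all names here are assumptions, not the authors' definition.

```python
# Sketch of a linear-time spatial coherence score over a latent token lattice.
# Cosine similarity to the 3x3 neighborhood mean is an assumption standing in
# for the paper's unspecified metric.
import torch
import torch.nn.functional as F

def spatial_coherence(tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """tokens: (B, H*W, C) latent tokens. Returns (B, H*W) coherence scores.

    Each token is compared with a local neighborhood mean via average pooling,
    so the cost is O(H*W) per block rather than the O((H*W)^2) of attention.
    """
    b, n, c = tokens.shape
    grid = tokens.transpose(1, 2).reshape(b, c, h, w)         # (B, C, H, W)
    # Neighborhood mean (includes the center token; the paper may exclude it).
    neighbor_mean = F.avg_pool2d(grid, kernel_size=3, stride=1, padding=1)
    sim = F.cosine_similarity(grid, neighbor_mean, dim=1)     # (B, H, W)
    return sim.reshape(b, n)                                  # high score = likely redundant
```

Tokens scoring above a threshold would be the ones skipped in self-attention; that threshold is one of the tuning knobs flagged in the ledger below.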

If this is right

  • Up to 55% reduction in self-attention FLOPs across diffusion transformer backbones.
  • Inference speedups reaching 1.33x on cloud GPUs and 1.72x on mobile NPUs.
  • No loss in visual quality for image and video generation tasks.
  • Increased on-device memory headroom that supports higher-resolution outputs.
  • Works across multiple state-of-the-art models including PixArt-α and MagicDrive-V2.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The reconstruction step from spatial neighbors could preserve local image structures more reliably than global token averaging methods used in other pruning schemes.
  • Progressive pruning schedules that adapt per block and per denoising step may transfer to efficiency gains in non-diffusion transformer architectures for vision or language.
  • Memory savings on device could open real-time generation applications that current full models cannot support at high resolutions.

Load-bearing premise

The linear-time spatial coherence score accurately flags redundant tokens so that their removal and neighbor-based reconstruction introduce no perceptible visual artifacts or quality loss across prompts and resolutions.
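To make the premise concrete, here is a minimal sketch of neighbor-based reconstruction: each pruned position takes a distance-weighted average of the attention outputs of its k nearest retained tokens. The Gaussian weighting and the choice of k are assumptions standing in for the paper's coherence-guided aggregation.

```python
# Sketch of reconstructing pruned tokens' attention outputs from retained
# spatial neighbors. Gaussian distance weights are an assumption standing in
# for the paper's coherence-guided weights.
import torch

def reconstruct_pruned(attn_out_kept: torch.Tensor,
                       kept_xy: torch.Tensor,
                       pruned_xy: torch.Tensor,
                       k: int = 4,
                       sigma: float = 1.0) -> torch.Tensor:
    """attn_out_kept: (M, C) outputs of retained tokens.
    kept_xy: (M, 2) grid coordinates of retained tokens; pruned_xy: (P, 2).
    Returns (P, C) reconstructed outputs for the pruned positions."""
    d2 = torch.cdist(pruned_xy.float(), kept_xy.float()) ** 2  # (P, M) squared distances
    k = min(k, kept_xy.shape[0])
    topv, topi = torch.topk(-d2, k=k, dim=1)                   # k nearest retained tokens
    weights = torch.softmax(topv / (2 * sigma ** 2), dim=1)    # nearer neighbors weigh more
    neighbors = attn_out_kept[topi]                            # (P, k, C)
    return (weights.unsqueeze(-1) * neighbors).sum(dim=1)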

What would settle it

Generate images with and without CoReDiT pruning on a large, diverse prompt set at multiple resolutions and compare perceptual metrics such as FID scores or side-by-side human ratings for visible artifacts.
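A minimal sketch of that protocol, assuming a hypothetical generate(prompt, pruned=...) callable that returns uint8 image batches; torchmetrics supplies the FID implementation, and seeds are fixed so both variants start from the same noise.

```python
# Sketch of the settling experiment: identical prompts and seeds with and
# without pruning, scored by FID between the two output distributions.
# `generate` is a hypothetical callable returning uint8 (B, 3, H, W) images.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_between_variants(generate, prompts, seed: int = 0) -> float:
    fid = FrechetInceptionDistance(feature=2048)
    for prompt in prompts:
        torch.manual_seed(seed)                  # same initial noise for both runs
        full = generate(prompt, pruned=False)    # baseline outputs
        torch.manual_seed(seed)
        pruned = generate(prompt, pruned=True)   # CoReDiT-style outputs
        fid.update(full, real=True)              # baseline as the reference set
        fid.update(pruned, real=False)
    return fid.compute().item()
```

A low score between variants would support the premise, though side-by-side human ratings would still be needed to catch localized artifacts that FID averages away.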

Figures

Figures reproduced from arXiv: 2605.14191 by Fatih Porikli, Hong Cai, Hsin-Pai Cheng, Shizhong Han, Zhuojin Li.

Figure 1. Workflow of CoReDiT. In each transformer block, input … [image]
Figure 2. Token selection motivation and visualization. [image]
Figure 3. Issues and proposed improvements. (a) Naively skip … [image]
Figure 4. SC score evolution across transformer blocks and de… [image]
Figure 5. Coherence-guided progressive pruning across trans… [image]
Figure 6. Progressive pruning dynamics for PixArt-… [image]
Figure 7. Qualitative comparison to PixArt-α-1024. [image]

Table recovered from the Figure 7 extraction:

| Model (ratio)                    | Self-Attn FLOPs | End-to-end FLOPs | Self-Attn latency | End-to-end latency | FID ↓ | CLIP ↑ | IS ↑  |
|----------------------------------|-----------------|------------------|-------------------|--------------------|-------|--------|-------|
| PixArt-Σ-2048                    | –               | –                | 4.62s             | 6.68s              | 26.0  | 31.4   | 37.49 |
| CoReDiT (r = 23%) + distillation | -32%            | -25%             | 3.61s (-22%)      | 5.67s (-15%)       | 28.0  | 31.4   | 36.52 |

Figure 8. Per-block latency comparison on Qualcomm Snap… [image]
Original abstract

Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high coherence (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy. Across state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory headroom, enabling higher-resolution generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CoReDiT, a structured token pruning method for Diffusion Transformers. It computes a linear-time spatial coherence score over the latent token lattice to identify and skip high-coherence (redundant) tokens during self-attention, then reconstructs the skipped outputs by coherence-guided aggregation of neighboring retained tokens. A progressive, block-adaptive pruning schedule gradually increases the pruning budget across blocks and denoising timesteps. Experiments on PixArt-α and MagicDrive-V2 report up to 55% self-attention FLOPs reduction with 1.33× GPU and 1.72× NPU speedups, with the claim that visual quality is preserved and higher-resolution on-device generation is enabled.

Significance. If the central efficiency-quality tradeoff holds under rigorous validation, CoReDiT would offer a practical, training-free acceleration path for DiT inference that directly addresses on-device memory and latency constraints. The combination of local coherence scoring with neighbor reconstruction is a concrete algorithmic contribution that could be adopted by existing DiT pipelines.

major comments (3)
  1. [§3.2] Coherence score definition: the linear-time local spatial coherence metric is presented as sufficient to identify tokens whose removal does not degrade global self-attention outputs, yet no error bound or correlation analysis with attention importance (especially in mid-to-late denoising steps) is provided; this directly underpins the 55% FLOPs claim. A sketch of such a correlation check follows the minor comments below.
  2. [§4] Experimental results: the reported 1.33×/1.72× speedups and “high visual quality” are stated without error bars, multiple random seeds, exact per-block pruning ratios, or ablations isolating the reconstruction step; the abstract numbers therefore lack the quantitative support needed to substantiate the central efficiency claim.
  3. [§3.3] Progressive schedule: the block-adaptive budget allocation is described as increasing pruning where redundancy is higher, but the manuscript supplies neither the precise redundancy estimator used for allocation nor any sensitivity analysis showing that the schedule remains stable across different prompt distributions or resolutions.
minor comments (2)
  1. [Figure 3] The coherence-map visualization would benefit from an explicit color scale and a side-by-side comparison of attention maps before and after pruning.
  2. [Related Work] The discussion of prior token-pruning methods (ToMe, etc.) is brief; adding a quantitative comparison table would clarify the novelty of the coherence-guided reconstruction.
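The correlation analysis requested in major comment 1 could look like the sketch below, where "attention importance" is proxied by the attention mass each token receives; the authors' intended measure is not specified, so the proxy is an assumption.

```python
# Sketch of the coherence-vs-importance check from major comment 1. Importance
# is proxied by incoming attention mass (column sums of the attention matrix).
import torch

def coherence_importance_corr(attn: torch.Tensor, coherence: torch.Tensor) -> float:
    """attn: (heads, N, N) post-softmax attention; coherence: (N,) scores.
    Returns the Pearson correlation between coherence and importance."""
    importance = attn.mean(dim=0).sum(dim=0)      # (N,) attention each token receives
    stacked = torch.stack([coherence, importance])
    return torch.corrcoef(stacked)[0, 1].item()
```

If coherence really flags redundancy, this correlation should be strongly negative, and it should stay negative in mid-to-late denoising steps.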

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and proposed revisions to strengthen the empirical and analytical support in the manuscript.

Point-by-point responses
  1. Referee: [§3.2] Coherence score definition: the linear-time local spatial coherence metric is presented as sufficient to identify tokens whose removal does not degrade global self-attention outputs, yet no error bound or correlation analysis with attention importance (especially in mid-to-late denoising steps) is provided; this directly underpins the 55% FLOPs claim.

    Authors: We agree that a formal error bound is absent and would strengthen the theoretical grounding. The current approach is heuristic and validated empirically via quality metrics. In revision we will add correlation plots between coherence scores and attention importance (including mid-to-late timesteps) to better justify the 55% FLOPs reduction. revision: yes

  2. Referee: [§4] Experimental results: the reported 1.33×/1.72× speedups and “high visual quality” are stated without error bars, multiple random seeds, exact per-block pruning ratios, or ablations isolating the reconstruction step; the abstract numbers therefore lack the quantitative support needed to substantiate the central efficiency claim.

    Authors: We concur that additional statistical rigor and ablations are required. The revision will report means and standard deviations over multiple seeds, include a table of exact per-block pruning ratios, and add an ablation isolating the reconstruction step's contribution to quality and speedup. revision: yes

  3. Referee: [§3.3] Progressive schedule: the block-adaptive budget allocation is described as increasing pruning where redundancy is higher, but the manuscript supplies neither the precise redundancy estimator used for allocation nor any sensitivity analysis showing that the schedule remains stable across different prompt distributions or resolutions.

    Authors: The redundancy estimator is the block-wise average coherence score (Section 3.3). We will make this explicit and add sensitivity experiments across prompt sets and resolutions in the supplement to confirm schedule stability. revision: partial
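Given the estimator named in response 3 (block-wise average coherence), here is a minimal sketch of one plausible progressive, block-adaptive budget allocation; the linear ramp over denoising steps, the normalization, and the cap are all assumptions.

```python
# Sketch of a progressive, block-adaptive pruning budget. Per-block ratios are
# proportional to block-wise mean coherence (the estimator named in response 3);
# the ramp shape and the 0.8 cap are assumptions.
import torch

def pruning_budgets(block_coherence: torch.Tensor,
                    step: int, num_steps: int,
                    target_ratio: float = 0.55,
                    max_ratio: float = 0.8) -> torch.Tensor:
    """block_coherence: (num_blocks,) mean coherence per block.
    Returns per-block pruning ratios for this denoising step."""
    ramp = (step + 1) / num_steps                          # prune more as denoising proceeds
    weights = block_coherence / block_coherence.mean()     # redundancy-proportional shares
    return (target_ratio * ramp * weights).clamp(0.0, max_ratio)
```

The sensitivity experiments promised in the rebuttal would amount to sweeping this function's inputs across prompt sets and resolutions.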

Circularity Check

0 steps flagged

CoReDiT's pruning and reconstruction method is an independent algorithmic construction with no circular reduction to its inputs.

Full rationale

The paper defines a spatial coherence score explicitly from the latent token lattice in linear time, uses it to identify and skip redundant tokens, and specifies neighbor-based aggregation for reconstruction. These steps are presented as a new structured pruning framework without any self-citations that bear the central claim, without fitted parameters renamed as predictions, and without ansatzes or uniqueness theorems imported from prior author work. The derivation chain consists of concrete, computable operations (coherence estimation, progressive block-adaptive scheduling, and coherence-guided aggregation) that do not reduce by construction to the target speedups or quality metrics; the method is self-contained as a proposed technique rather than a tautological restatement of its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the domain assumption that spatial coherence predicts redundancy and that neighbor aggregation can faithfully reconstruct skipped attention outputs; no free parameters are explicitly named in the abstract, but pruning budgets and coherence thresholds are implied tuning knobs.

free parameters (1)
  • pruning budget and coherence threshold
    The progressive schedule and decision to skip tokens depend on tunable budgets and thresholds that are not detailed in the abstract.
axioms (1)
  • domain assumption: High spatial coherence indicates redundant tokens whose attention outputs can be reconstructed from neighbors without quality loss
    This premise underpins both the pruning decision and the reconstruction step.
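For concreteness, the ledger's implied knobs written out as an explicit config; every name and default is hypothetical, since the abstract specifies none.

```python
# The ledger's implied tuning knobs made explicit. All names and defaults are
# hypothetical; the abstract does not state values for any of them.
from dataclasses import dataclass

@dataclass
class CoReDiTConfig:
    coherence_threshold: float = 0.9   # skip tokens scoring above this
    target_prune_ratio: float = 0.55   # overall self-attention budget cut
    neighbor_k: int = 4                # retained neighbors used in reconstruction
    ramp_steps: int = 10               # denoising steps over which pruning ramps up
```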

pith-pipeline@v0.9.0 · 5510 in / 1263 out tokens · 65106 ms · 2026-05-15T04:44:28.788273+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 3 internal anchors

  1. [1] Token Merging for Fast Stable Diffusion
     Daniel Bolya and Judy Hoffman. Token merging for fast stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4599–4603, 2023.

  2. [2] Token Merging: Your ViT but Faster
     Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461, 2022.

  3. [3] nuScenes: A Multimodal Dataset for Autonomous Driving
     Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

  4. [4] Exploring Diffusion Transformer Designs via Grafting
     Keshigeyan Chandrasegaran, Michael Poli, Daniel Y Fu, Dongjun Kim, Lea M Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, et al. Exploring diffusion transformer designs via grafting. arXiv preprint arXiv:2506.05340, 2025.

  5. [5] FlexDiT: Dynamic Token Density Control for Diffusion Transformer
     Shuning Chang, Pichao Wang, Jiasheng Tang, and Yi Yang. FlexDiT: Dynamic token density control for diffusion transformer. arXiv preprint arXiv:2412.06028, 2024.

  6. [6] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
     Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.

  7. [7] PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
     Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2024.

  8. [8] Structural Pruning for Diffusion Models
     Gongfan Fang, Xinyin Ma, and Xinchao Wang. Structural pruning for diffusion models. In Advances in Neural Information Processing Systems, 2023.

  9. [9] MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
     Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. MagicDrive-V2: High-resolution long video generation for autonomous driving with adaptive control. arXiv preprint arXiv:2411.13807, 2024.

  10. [10] GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
      Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

  11. [11] Denoising Diffusion Probabilistic Models
      Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

  12. [12] Learned Token Pruning for Transformers
      Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, and Kurt Keutzer. Learned token pruning for transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 784–794, 2022.

  13. [13] FLUX
      Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.

  14. [14] BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
      Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022.

  15. [15] Not All Patches Are What You Need: Expediting Vision Transformers via Token Reorganizations
      Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800, 2022.

  16. [16] Microsoft COCO: Common Objects in Context
      Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

  17. [17] CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
      Songhua Liu, Zhenxiong Tan, and Xinchao Wang. CLEAR: Conv-like linearization revs pre-trained diffusion transformers up. arXiv preprint arXiv:2412.16112, 2024.

  18. [18] ToMA: Token Merge with Attention for Diffusion Models
      Wenbo Lu, Shaoyi Zheng, Yuxuan Xia, and Shengjie Wang. ToMA: Token merge with attention for diffusion models. arXiv preprint arXiv:2509.10918, 2025.

  19. [19] FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality
      Zhengyao Lv, Chenyang Si, Junhao Song, Zhenyu Yang, Yu Qiao, Ziwei Liu, and Kwan-Yee K Wong. FasterCache: Training-free video diffusion model acceleration with high quality. arXiv preprint arXiv:2410.19355, 2024.

  20. [20] DeepCache: Accelerating Diffusion Models for Free
      Xinyin Ma, Gongfan Fang, and Xinchao Wang. DeepCache: Accelerating diffusion models for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15762–15772, 2024.

  21. [21] Scalable Diffusion Models with Transformers
      William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

  22. [22] Learning Transferable Visual Models from Natural Language Supervision
      Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  23. [23] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
      Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems, 34:13937–13949, 2021.

  24. [24] High-Resolution Image Synthesis with Latent Diffusion Models
      Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  25. [25] U-Net: Convolutional Networks for Biomedical Image Segmentation
      Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

  26. [26] Improved Techniques for Training GANs
      Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.

  27. [27] ToSA: Token Selective Attention for Efficient Vision Transformers
      Manish Kumar Singh, Rajeev Yasarla, Hong Cai, Mingu Lee, and Fatih Porikli. ToSA: Token selective attention for efficient vision transformers. arXiv preprint arXiv:2406.08816, 2024.

  28. [28] ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
      Ethan Smith, Nayan Saxena, and Aninda Saha. ToDo: Token downsampling for efficient generation of high-resolution images. arXiv preprint arXiv:2402.13573, 2024.

  29. [29] Towards Accurate Generative Models of Video: A New Metric & Challenges
      Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

  30. [30] Image Quality Assessment: From Error Visibility to Structural Similarity
      Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  31. [31] Cache Me If You Can: Accelerating Diffusion Models through Block Caching
      Felix Wimbauer, Bichen Wu, Edgar Schoenfeld, et al. Cache me if you can: Accelerating diffusion models through block caching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6211–6220, 2024.

  32. [32] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
      Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse VideoGen: Accelerating video diffusion transformers with spatial-temporal sparsity. arXiv preprint arXiv:2502.01776, 2025.

  33. [33] Global Vision Transformer Pruning with Hessian-Aware Saliency
      Huanrui Yang, Hongxu Yin, Maying Shen, Pavlo Molchanov, Hai Li, and Jan Kautz. Global vision transformer pruning with Hessian-aware saliency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18547–18557, 2023.

  34. [34] Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers
      Haoran You, Connelly Barnes, Yuqian Zhou, Yan Kang, Zhenbang Du, Wei Zhou, Lingzhi Zhang, Yotam Nitzan, Xiaoyang Liu, Zhe Lin, et al. Layer- and timestep-adaptive differentiable token compression ratios for efficient diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18072–18082, 2025.

  35. [35] VSA: Faster Video Diffusion with Trainable Sparse Attention
      Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, and Hao Zhang. VSA: Faster video diffusion with trainable sparse attention. arXiv preprint arXiv:2505.13389, 2025.

  36. [36] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
      Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

  37. [37] Effortless Efficiency: Low-Cost Pruning of Diffusion Models
      Yang Zhang, Er Jin, Yanfei Dong, Ashkan Khakzar, Philip Torr, Johannes Stegmaier, and Kenji Kawaguchi. Effortless efficiency: Low-cost pruning of diffusion models. arXiv preprint arXiv:2412.02852, 2024.

  38. [38] Dynamic Diffusion Transformer
      Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, and Yang You. Dynamic diffusion transformer. ICLR, 2025.

  39. [39] Accelerating Diffusion Transformers with Token-Wise Feature Caching
      Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, and Linfeng Zhang. Accelerating diffusion transformers with token-wise feature caching. arXiv preprint arXiv:2410.05317, 2024.