Pith · machine review for the scientific record

arXiv:2605.14191 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: no theorem link

CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion transformers · token pruning · spatial coherence · efficient inference · self-attention optimization · image generation · video generation

The pith

CoReDiT prunes redundant tokens in diffusion transformers via spatial coherence scores and reconstructs their outputs from neighbors to cut computation while preserving quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoReDiT as a structured pruning method for diffusion transformers that estimates local redundancy in the token lattice with a linear-time spatial coherence score. It skips high-coherence tokens during self-attention and rebuilds the skipped outputs by aggregating information from spatially neighboring retained tokens. A progressive block-adaptive schedule gradually increases pruning and gives larger budgets to steps and blocks with higher redundancy. On backbones such as PixArt-α and MagicDrive-V2, the approach is reported to deliver substantial FLOPs savings and speedups on both cloud and mobile hardware without degrading visual quality.

Core claim

CoReDiT establishes that spatial coherence in the latent token lattice reliably marks redundant tokens that can be skipped in DiT self-attention, with coherence-guided aggregation of neighboring retained tokens restoring dense outputs and avoiding visual discontinuities. A progressive, block-adaptive pruning schedule that ramps up pruning and allocates larger budgets to high-redundancy blocks and denoising steps further improves efficiency. Across state-of-the-art diffusion backbones, this yields up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs while maintaining high visual quality and increasing on-device memory headroom.

What carries the argument

The spatial coherence score, a linear-time metric of local redundancy in the token lattice, decides which tokens to prune from self-attention and guides neighbor-based aggregation to reconstruct their attention outputs.
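The paper's exact formula is not reproduced on this page, so the following is only a minimal sketch of one plausible linear-time score: each token is compared against the mean of its 3×3 spatial neighborhood by cosine similarity. The window size, the similarity measure, and all names here are assumptions, not the authors' definition.

```python
# Sketch of a linear-time spatial coherence score over a latent token lattice.
# Cosine similarity to the 3x3 neighborhood mean is an assumption standing in
# for the paper's unspecified metric.
import torch
import torch.nn.functional as F

def spatial_coherence(tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """tokens: (B, H*W, C) latent tokens. Returns (B, H*W) coherence scores.

    Each token is compared with a local neighborhood mean via average pooling,
    so the cost is O(H*W) per block rather than the O((H*W)^2) of attention.
    """
    b, n, c = tokens.shape
    grid = tokens.transpose(1, 2).reshape(b, c, h, w)         # (B, C, H, W)
    # Neighborhood mean (includes the center token; the paper may exclude it).
    neighbor_mean = F.avg_pool2d(grid, kernel_size=3, stride=1, padding=1)
    sim = F.cosine_similarity(grid, neighbor_mean, dim=1)     # (B, H, W)
    return sim.reshape(b, n)                                  # high score = likely redundant
```

Tokens scoring above a threshold would be the ones skipped in self-attention; that threshold is one of the tuning knobs flagged in the ledger below.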

If this is right

  • Up to 55% reduction in self-attention FLOPs across diffusion transformer backbones.
  • Inference speedups reaching 1.33x on cloud GPUs and 1.72x on mobile NPUs.
  • No loss in visual quality for image and video generation tasks.
  • Increased on-device memory headroom that supports higher-resolution outputs.
  • Works across multiple state-of-the-art models including PixArt-α and MagicDrive-V2.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The reconstruction step from spatial neighbors could preserve local image structures more reliably than global token averaging methods used in other pruning schemes.
  • Progressive pruning schedules that adapt per block and per denoising step may transfer to efficiency gains in non-diffusion transformer architectures for vision or language.
  • Memory savings on device could open real-time generation applications that current full models cannot support at high resolutions.

Load-bearing premise

The linear-time spatial coherence score accurately flags redundant tokens so that their removal and neighbor-based reconstruction introduce no perceptible visual artifacts or quality loss across prompts and resolutions.
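To make the premise concrete, here is a minimal sketch of neighbor-based reconstruction: each pruned position takes a distance-weighted average of the attention outputs of its k nearest retained tokens. The Gaussian weighting and the choice of k are assumptions standing in for the paper's coherence-guided aggregation.

```python
# Sketch of reconstructing pruned tokens' attention outputs from retained
# spatial neighbors. Gaussian distance weights are an assumption standing in
# for the paper's coherence-guided weights.
import torch

def reconstruct_pruned(attn_out_kept: torch.Tensor,
                       kept_xy: torch.Tensor,
                       pruned_xy: torch.Tensor,
                       k: int = 4,
                       sigma: float = 1.0) -> torch.Tensor:
    """attn_out_kept: (M, C) outputs of retained tokens.
    kept_xy: (M, 2) grid coordinates of retained tokens; pruned_xy: (P, 2).
    Returns (P, C) reconstructed outputs for the pruned positions."""
    d2 = torch.cdist(pruned_xy.float(), kept_xy.float()) ** 2  # (P, M) squared distances
    k = min(k, kept_xy.shape[0])
    topv, topi = torch.topk(-d2, k=k, dim=1)                   # k nearest retained tokens
    weights = torch.softmax(topv / (2 * sigma ** 2), dim=1)    # nearer neighbors weigh more
    neighbors = attn_out_kept[topi]                            # (P, k, C)
    return (weights.unsqueeze(-1) * neighbors).sum(dim=1)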

What would settle it

Generate images with and without CoReDiT pruning on a large, diverse prompt set at multiple resolutions and compare perceptual metrics such as FID scores or side-by-side human ratings for visible artifacts.
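A minimal sketch of that protocol, assuming a hypothetical generate(prompt, pruned=...) callable that returns uint8 image batches; torchmetrics supplies the FID implementation, and seeds are fixed so both variants start from the same noise.

```python
# Sketch of the settling experiment: identical prompts and seeds with and
# without pruning, scored by FID between the two output distributions.
# `generate` is a hypothetical callable returning uint8 (B, 3, H, W) images.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_between_variants(generate, prompts, seed: int = 0) -> float:
    fid = FrechetInceptionDistance(feature=2048)
    for prompt in prompts:
        torch.manual_seed(seed)                  # same initial noise for both runs
        full = generate(prompt, pruned=False)    # baseline outputs
        torch.manual_seed(seed)
        pruned = generate(prompt, pruned=True)   # CoReDiT-style outputs
        fid.update(full, real=True)              # baseline as the reference set
        fid.update(pruned, real=False)
    return fid.compute().item()
```

A low score between variants would support the premise, though side-by-side human ratings would still be needed to catch localized artifacts that FID averages away.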

Figures

Figures reproduced from arXiv: 2605.14191 by Fatih Porikli, Hong Cai, Hsin-Pai Cheng, Shizhong Han, Zhuojin Li.

Figure 1. Workflow of CoReDiT. In each transformer block, input … [image]
Figure 2. Token selection motivation and visualization. [image]
Figure 3. Issues and proposed improvements. (a) Naively skip … [image]
Figure 4. SC score evolution across transformer blocks and de… [image]
Figure 5. Coherence-guided progressive pruning across trans… [image]
Figure 6. Progressive pruning dynamics for PixArt-… [image]
Figure 7. Qualitative comparison to PixArt-α-1024. [image]

Table recovered from the Figure 7 extraction:

| Model (ratio)                    | Self-Attn FLOPs | End-to-end FLOPs | Self-Attn latency | End-to-end latency | FID ↓ | CLIP ↑ | IS ↑  |
|----------------------------------|-----------------|------------------|-------------------|--------------------|-------|--------|-------|
| PixArt-Σ-2048                    | –               | –                | 4.62s             | 6.68s              | 26.0  | 31.4   | 37.49 |
| CoReDiT (r = 23%) + distillation | -32%            | -25%             | 3.61s (-22%)      | 5.67s (-15%)       | 28.0  | 31.4   | 36.52 |

Figure 8. Per-block latency comparison on Qualcomm Snap… [image]
Original abstract

Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high coherence (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy. Across state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory headroom, enabling higher-resolution generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CoReDiT, a structured token pruning method for Diffusion Transformers. It computes a linear-time spatial coherence score over the latent token lattice to identify and skip high-coherence (redundant) tokens during self-attention, then reconstructs the skipped outputs by coherence-guided aggregation of neighboring retained tokens. A progressive, block-adaptive pruning schedule gradually increases the pruning budget across blocks and denoising timesteps. Experiments on PixArt-α and MagicDrive-V2 report up to 55% self-attention FLOPs reduction with 1.33× GPU and 1.72× NPU speedups, with the claim that visual quality is preserved and higher-resolution on-device generation is enabled.

Significance. If the central efficiency-quality tradeoff holds under rigorous validation, CoReDiT would offer a practical, training-free acceleration path for DiT inference that directly addresses on-device memory and latency constraints. The combination of local coherence scoring with neighbor reconstruction is a concrete algorithmic contribution that could be adopted by existing DiT pipelines.

major comments (3)
  1. [§3.2] Coherence score definition: the linear-time local spatial coherence metric is presented as sufficient to identify tokens whose removal does not degrade global self-attention outputs, yet no error bound or correlation analysis with attention importance (especially in mid-to-late denoising steps) is provided; this directly underpins the 55% FLOPs claim. A sketch of such a correlation check follows the minor comments below.
  2. [§4] Experimental results: the reported 1.33×/1.72× speedups and “high visual quality” are stated without error bars, multiple random seeds, exact per-block pruning ratios, or ablations isolating the reconstruction step; the abstract numbers therefore lack the quantitative support needed to substantiate the central efficiency claim.
  3. [§3.3] Progressive schedule: the block-adaptive budget allocation is described as increasing pruning where redundancy is higher, but the manuscript supplies neither the precise redundancy estimator used for allocation nor any sensitivity analysis showing that the schedule remains stable across different prompt distributions or resolutions.
minor comments (2)
  1. [Figure 3] The coherence-map visualization would benefit from an explicit color scale and a side-by-side comparison of attention maps before and after pruning.
  2. [Related Work] The discussion of prior token-pruning methods (ToMe, etc.) is brief; adding a quantitative comparison table would clarify the novelty of the coherence-guided reconstruction.
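The correlation analysis requested in major comment 1 could look like the sketch below, where "attention importance" is proxied by the attention mass each token receives; the authors' intended measure is not specified, so the proxy is an assumption.

```python
# Sketch of the coherence-vs-importance check from major comment 1. Importance
# is proxied by incoming attention mass (column sums of the attention matrix).
import torch

def coherence_importance_corr(attn: torch.Tensor, coherence: torch.Tensor) -> float:
    """attn: (heads, N, N) post-softmax attention; coherence: (N,) scores.
    Returns the Pearson correlation between coherence and importance."""
    importance = attn.mean(dim=0).sum(dim=0)      # (N,) attention each token receives
    stacked = torch.stack([coherence, importance])
    return torch.corrcoef(stacked)[0, 1].item()
```

If coherence really flags redundancy, this correlation should be strongly negative, and it should stay negative in mid-to-late denoising steps.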

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and proposed revisions to strengthen the empirical and analytical support in the manuscript.

Point-by-point responses
  1. Referee: [§3.2] Coherence score definition: the linear-time local spatial coherence metric is presented as sufficient to identify tokens whose removal does not degrade global self-attention outputs, yet no error bound or correlation analysis with attention importance (especially in mid-to-late denoising steps) is provided; this directly underpins the 55% FLOPs claim.

    Authors: We agree that a formal error bound is absent and would strengthen the theoretical grounding. The current approach is heuristic and validated empirically via quality metrics. In revision we will add correlation plots between coherence scores and attention importance (including mid-to-late timesteps) to better justify the 55% FLOPs reduction. revision: yes

  2. Referee: [§4] Experimental results: the reported 1.33×/1.72× speedups and “high visual quality” are stated without error bars, multiple random seeds, exact per-block pruning ratios, or ablations isolating the reconstruction step; the abstract numbers therefore lack the quantitative support needed to substantiate the central efficiency claim.

    Authors: We concur that additional statistical rigor and ablations are required. The revision will report means and standard deviations over multiple seeds, include a table of exact per-block pruning ratios, and add an ablation isolating the reconstruction step's contribution to quality and speedup. revision: yes

  3. Referee: [§3.3] Progressive schedule: the block-adaptive budget allocation is described as increasing pruning where redundancy is higher, but the manuscript supplies neither the precise redundancy estimator used for allocation nor any sensitivity analysis showing that the schedule remains stable across different prompt distributions or resolutions.

    Authors: The redundancy estimator is the block-wise average coherence score (Section 3.3). We will make this explicit and add sensitivity experiments across prompt sets and resolutions in the supplement to confirm schedule stability. revision: partial
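Given the estimator named in response 3 (block-wise average coherence), here is a minimal sketch of one plausible progressive, block-adaptive budget allocation; the linear ramp over denoising steps, the normalization, and the cap are all assumptions.

```python
# Sketch of a progressive, block-adaptive pruning budget. Per-block ratios are
# proportional to block-wise mean coherence (the estimator named in response 3);
# the ramp shape and the 0.8 cap are assumptions.
import torch

def pruning_budgets(block_coherence: torch.Tensor,
                    step: int, num_steps: int,
                    target_ratio: float = 0.55,
                    max_ratio: float = 0.8) -> torch.Tensor:
    """block_coherence: (num_blocks,) mean coherence per block.
    Returns per-block pruning ratios for this denoising step."""
    ramp = (step + 1) / num_steps                          # prune more as denoising proceeds
    weights = block_coherence / block_coherence.mean()     # redundancy-proportional shares
    return (target_ratio * ramp * weights).clamp(0.0, max_ratio)
```

The sensitivity experiments promised in the rebuttal would amount to sweeping this function's inputs across prompt sets and resolutions.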

Circularity Check

0 steps flagged

CoReDiT's pruning and reconstruction method is an independent algorithmic construction with no circular reduction to its inputs.

Full rationale

The paper defines a spatial coherence score explicitly from the latent token lattice in linear time, uses it to identify and skip redundant tokens, and specifies neighbor-based aggregation for reconstruction. These steps are presented as a new structured pruning framework without any self-citations that bear the central claim, without fitted parameters renamed as predictions, and without ansatzes or uniqueness theorems imported from prior author work. The derivation chain consists of concrete, computable operations (coherence estimation, progressive block-adaptive scheduling, and coherence-guided aggregation) that do not reduce by construction to the target speedups or quality metrics; the method is self-contained as a proposed technique rather than a tautological restatement of its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the domain assumption that spatial coherence predicts redundancy and that neighbor aggregation can faithfully reconstruct skipped attention outputs; no free parameters are explicitly named in the abstract, but pruning budgets and coherence thresholds are implied tuning knobs.

free parameters (1)
  • pruning budget and coherence threshold
    The progressive schedule and decision to skip tokens depend on tunable budgets and thresholds that are not detailed in the abstract.
axioms (1)
  • domain assumption: High spatial coherence indicates redundant tokens whose attention outputs can be reconstructed from neighbors without quality loss
    This premise underpins both the pruning decision and the reconstruction step.
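For concreteness, the ledger's implied knobs written out as an explicit config; every name and default is hypothetical, since the abstract specifies none.

```python
# The ledger's implied tuning knobs made explicit. All names and defaults are
# hypothetical; the abstract does not state values for any of them.
from dataclasses import dataclass

@dataclass
class CoReDiTConfig:
    coherence_threshold: float = 0.9   # skip tokens scoring above this
    target_prune_ratio: float = 0.55   # overall self-attention budget cut
    neighbor_k: int = 4                # retained neighbors used in reconstruction
    ramp_steps: int = 10               # denoising steps over which pruning ramps up
```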

pith-pipeline@v0.9.0 · 5510 in / 1263 out tokens · 65106 ms · 2026-05-15T04:44:28.788273+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 3 internal anchors

  1. [1] Token Merging for Fast Stable Diffusion
     Daniel Bolya and Judy Hoffman. Token merging for fast stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4599–4603, 2023.

  2. [2] Token Merging: Your ViT but Faster
     Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461, 2022.

  3. [3] nuScenes: A Multimodal Dataset for Autonomous Driving
     Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

  4. [4] Exploring Diffusion Transformer Designs via Grafting
     Keshigeyan Chandrasegaran, Michael Poli, Daniel Y Fu, Dongjun Kim, Lea M Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, et al. Exploring diffusion transformer designs via grafting. arXiv preprint arXiv:2506.05340, 2025.

  5. [5] FlexDiT: Dynamic Token Density Control for Diffusion Transformer
     Shuning Chang, Pichao Wang, Jiasheng Tang, and Yi Yang. FlexDiT: Dynamic token density control for diffusion transformer. arXiv preprint arXiv:2412.06028, 2024.

  6. [6] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
     Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.

  7. [7] PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
     Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2024.

  8. [8] Structural Pruning for Diffusion Models
     Gongfan Fang, Xinyin Ma, and Xinchao Wang. Structural pruning for diffusion models. In Advances in Neural Information Processing Systems, 2023.

  9. [9] MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
     Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. MagicDrive-V2: High-resolution long video generation for autonomous driving with adaptive control. arXiv preprint arXiv:2411.13807, 2024.

  10. [10] GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
      Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

  11. [11] Denoising Diffusion Probabilistic Models
      Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

  12. [12] Learned Token Pruning for Transformers
      Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, and Kurt Keutzer. Learned token pruning for transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 784–794, 2022.

  13. [13] FLUX
      Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.

  14. [14] BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
      Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022.

  15. [15] Not All Patches Are What You Need: Expediting Vision Transformers via Token Reorganizations
      Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800, 2022.

  16. [16] Microsoft COCO: Common Objects in Context
      Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

  17. [17] CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
      Songhua Liu, Zhenxiong Tan, and Xinchao Wang. CLEAR: Conv-like linearization revs pre-trained diffusion transformers up. arXiv preprint arXiv:2412.16112, 2024.

  18. [18] ToMA: Token Merge with Attention for Diffusion Models
      Wenbo Lu, Shaoyi Zheng, Yuxuan Xia, and Shengjie Wang. ToMA: Token merge with attention for diffusion models. arXiv preprint arXiv:2509.10918, 2025.

  19. [19] FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality
      Zhengyao Lv, Chenyang Si, Junhao Song, Zhenyu Yang, Yu Qiao, Ziwei Liu, and Kwan-Yee K Wong. FasterCache: Training-free video diffusion model acceleration with high quality. arXiv preprint arXiv:2410.19355, 2024.

  20. [20] DeepCache: Accelerating Diffusion Models for Free
      Xinyin Ma, Gongfan Fang, and Xinchao Wang. DeepCache: Accelerating diffusion models for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15762–15772, 2024.

  21. [21] Scalable Diffusion Models with Transformers
      William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

  22. [22] Learning Transferable Visual Models from Natural Language Supervision
      Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  23. [23] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
      Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems, 34:13937–13949, 2021.

  24. [24] High-Resolution Image Synthesis with Latent Diffusion Models
      Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  25. [25] U-Net: Convolutional Networks for Biomedical Image Segmentation
      Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

  26. [26] Improved Techniques for Training GANs
      Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.

  27. [27] ToSA: Token Selective Attention for Efficient Vision Transformers
      Manish Kumar Singh, Rajeev Yasarla, Hong Cai, Mingu Lee, and Fatih Porikli. ToSA: Token selective attention for efficient vision transformers. arXiv preprint arXiv:2406.08816, 2024.

  28. [28] ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
      Ethan Smith, Nayan Saxena, and Aninda Saha. ToDo: Token downsampling for efficient generation of high-resolution images. arXiv preprint arXiv:2402.13573, 2024.

  29. [29] Towards Accurate Generative Models of Video: A New Metric & Challenges
      Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

  30. [30] Image Quality Assessment: From Error Visibility to Structural Similarity
      Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  31. [31] Cache Me If You Can: Accelerating Diffusion Models through Block Caching
      Felix Wimbauer, Bichen Wu, Edgar Schoenfeld, et al. Cache me if you can: Accelerating diffusion models through block caching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6211–6220, 2024.

  32. [32] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
      Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse VideoGen: Accelerating video diffusion transformers with spatial-temporal sparsity. arXiv preprint arXiv:2502.01776, 2025.

  33. [33] Global Vision Transformer Pruning with Hessian-Aware Saliency
      Huanrui Yang, Hongxu Yin, Maying Shen, Pavlo Molchanov, Hai Li, and Jan Kautz. Global vision transformer pruning with Hessian-aware saliency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18547–18557, 2023.

  34. [34] Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers
      Haoran You, Connelly Barnes, Yuqian Zhou, Yan Kang, Zhenbang Du, Wei Zhou, Lingzhi Zhang, Yotam Nitzan, Xiaoyang Liu, Zhe Lin, et al. Layer- and timestep-adaptive differentiable token compression ratios for efficient diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18072–18082, 2025.

  35. [35] VSA: Faster Video Diffusion with Trainable Sparse Attention
      Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, and Hao Zhang. VSA: Faster video diffusion with trainable sparse attention. arXiv preprint arXiv:2505.13389, 2025.

  36. [36] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
      Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

  37. [37] Effortless Efficiency: Low-Cost Pruning of Diffusion Models
      Yang Zhang, Er Jin, Yanfei Dong, Ashkan Khakzar, Philip Torr, Johannes Stegmaier, and Kenji Kawaguchi. Effortless efficiency: Low-cost pruning of diffusion models. arXiv preprint arXiv:2412.02852, 2024.

  38. [38] Dynamic Diffusion Transformer
      Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, and Yang You. Dynamic diffusion transformer. ICLR, 2025.

  39. [39] Accelerating Diffusion Transformers with Token-Wise Feature Caching
      Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, and Linfeng Zhang. Accelerating diffusion transformers with token-wise feature caching. arXiv preprint arXiv:2410.05317, 2024.