pith. machine review for the scientific record.

arxiv: 2604.02979 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: no theorem link

Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive video generation · video diffusion models · training-free acceleration · Taylor extrapolation · selective computation · cache and recompute

The pith

Autoregressive video diffusion models can run up to 4.73 times faster by deciding per frame whether to cache, extrapolate, or fully recompute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the high cost of autoregressive video diffusion, where each new frame requires many denoising steps and prior methods only allow all-or-nothing reuse of earlier results. SCOPE adds an intermediate prediction option that uses Taylor extrapolation on the noise level to approximate the missing computation, backed by an error analysis that keeps the approximation stable. It also limits work to the current active frame interval rather than processing every frame uniformly. The result is substantially lower wall-clock time on real models while the output quality stays close to the unaccelerated baseline.
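To make the predict mode concrete, here is a minimal sketch of noise-level Taylor extrapolation from two cached model outputs, assuming torch tensors and a first-order expansion; the function name and the finite-difference derivative are illustrative assumptions rather than the paper's implementation, which additionally applies the stability controls from its error analysis.

```python
import torch

def taylor_predict(f_prev: torch.Tensor, f_curr: torch.Tensor,
                   sigma_prev: float, sigma_curr: float,
                   sigma_target: float) -> torch.Tensor:
    """First-order Taylor extrapolation of a cached model output to a new
    noise level. f_prev and f_curr are outputs cached at sigma_prev and
    sigma_curr (e.g., predicted noise or intermediate features); the finite
    difference stands in for the derivative with respect to the noise level."""
    slope = (f_curr - f_prev) / (sigma_curr - sigma_prev)
    return f_curr + (sigma_target - sigma_curr) * slope
```

In a denoising loop, a call like this would replace the full network evaluation for frames whose outputs are changing slowly enough, at the cost of exactly the extrapolation error the paper's analysis is meant to bound.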

Core claim

SCOPE is a training-free framework that replaces binary cache-or-recompute decisions with a tri-modal scheduler over cache, predict, and recompute. Prediction is performed by noise-level Taylor extrapolation equipped with explicit stability controls derived from error propagation analysis. Selective computation further restricts all operations to the active frame interval under the asynchronous AR noise schedule. On MAGI-1 and SkyReels-V2 this produces up to 4.73x speedup with quality comparable to the original full-computation output.

What carries the argument

Tri-modal scheduler over cache, predict via noise-level Taylor extrapolation, and recompute, paired with selective computation restricted to the active frame interval.
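A hedged sketch of how that decision rule and the interval restriction could fit together; the change-magnitude proxy, the two thresholds, and the data layout are assumptions made for illustration, not the paper's scheduler.

```python
import torch

def schedule_step(active_interval, cached, tau_cache=0.01, tau_predict=0.05):
    """Per-frame tri-modal decision (cache / predict / recompute), restricted to
    the active frame interval of an asynchronous AR noise schedule.
    cached[i] = (prev_output, curr_output): torch tensors saved for frame i at
    the two most recently computed steps."""
    decisions = {}
    lo, hi = active_interval                  # frames outside this window are skipped
    for i in range(lo, hi):
        prev_out, curr_out = cached[i]
        # proxy for how quickly frame i's output is changing across steps
        change = (curr_out - prev_out).abs().mean().item()
        if change < tau_cache:
            decisions[i] = "cache"            # reuse the cached output directly
        elif change < tau_predict:
            decisions[i] = "predict"          # Taylor-extrapolate to the new noise level
        else:
            decisions[i] = "recompute"        # run the full denoiser for this frame
    return decisions
```

The point of the tri-modal split is visible here: frames that a binary scheme would have forced into recompute because plain reuse was too coarse can now fall into the predict branch instead.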

If this is right

  • Longer video sequences become feasible within the same compute budget.
  • Quality metrics stay comparable to the baseline full model on the evaluated datasets.
  • The method outperforms all prior training-free acceleration baselines on the tested models.
  • No retraining or architecture changes are required for the reported speedups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-frame decision logic could be tested on other sequential generation tasks such as audio or 3D scene synthesis.
  • Combining the scheduler with existing model distillation techniques might produce multiplicative speed gains.
  • If the extrapolation error remains bounded at higher frame counts, the approach could support interactive video editing loops.

Load-bearing premise

Noise-level Taylor extrapolation with the given stability controls produces intermediate frames whose visual quality matches full recomputation under the asynchronous noise schedules of autoregressive video models.
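Stated generically, the premise amounts to a low-order expansion in the noise level whose remainder stays small; the expansion variable, the order, and the finite-difference derivative below are assumptions for orientation, not necessarily the paper's exact formulation.

```latex
% First-order extrapolation of a cached model output F in the noise level \sigma,
% with the derivative replaced by a finite difference of two cached outputs.
% The load-bearing premise is that the remainder R stays small under the
% paper's stability controls.
F(\sigma_{\mathrm{target}}) \approx F(\sigma_k)
  + (\sigma_{\mathrm{target}} - \sigma_k)\,
    \frac{F(\sigma_k) - F(\sigma_{k-1})}{\sigma_k - \sigma_{k-1}},
\qquad
R = \mathcal{O}\!\left((\sigma_{\mathrm{target}} - \sigma_k)^{2}\right)
```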

What would settle it

Side-by-side generation of the same long video prompt with SCOPE and the unmodified model, checking whether perceptual quality scores or human preference rates drop by more than the paper's reported margin.

Figures

Figures reproduced from arXiv: 2604.02979 by Fanshuai Meng, Hanshuai Cui, Weijia Jia, Wei Zhao, Zhiqing Tang, Zhi Yao.

Figure 1: SCOPE achieves large speedup while preserving …
Figure 2: At Frame 13 over steps 54–59: (a) Taylor prediction …
Figure 3: Conceptual step-matrix view of asynchronous autoregressive …
Figure 4: Overview of SCOPE for accelerating autoregressive video diffusion by reducing redundant computation in both the …
Figure 5: Qualitative comparison of all methods on representative prompts from the two models. The left two groups show …
Figure 6: Tri-modal decision timeline comparison between …
Figure 9: Peak GPU memory comparison under short and …
Figure 8: Denoising time of Original versus SCOPE at varying …
Figure 11: Few-step reduced-step Original comparison at matched runtime targets …
Original abstract

Autoregressive (AR) video diffusion models enable long-form video generation but remain expensive due to repeated multi-step denoising. Existing training-free acceleration methods rely on binary cache-or-recompute decisions, overlooking intermediate cases where direct reuse is too coarse yet full recomputation is unnecessary. Moreover, asynchronous AR schedules assign different noise levels to co-generated frames, yet existing methods process the entire valid interval uniformly. To address these AR-specific inefficiencies, we present SCOPE, a training-free framework for efficient AR video diffusion. SCOPE introduces a tri-modal scheduler over cache, predict, and recompute, where prediction via noise-level Taylor extrapolation fills the gap between reuse and recomputation with explicit stability controls backed by error propagation analysis. It further introduces selective computation that restricts execution to the active frame interval. On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SCOPE, a training-free acceleration method for autoregressive video diffusion models. It introduces a tri-modal scheduler (cache/predict/recompute) that uses noise-level Taylor extrapolation with stability controls derived from error-propagation analysis to handle intermediate cases between reuse and full recomputation, plus selective computation restricted to active frame intervals under asynchronous AR noise schedules. Experiments on MAGI-1 and SkyReels-V2 report up to 4.73× speedup while claiming quality comparable to the full baseline, outperforming prior training-free methods.

Significance. If the extrapolation step reliably preserves temporal coherence without visible artifacts, the work would offer a practical, training-free route to faster long-form video generation that directly targets AR-specific inefficiencies such as varying per-frame noise levels. The explicit error analysis and absence of learned parameters are positive features that distinguish it from heuristic caching approaches.

major comments (2)
  1. [§4.2] §4.2 (Error Propagation Analysis): The stability controls for noise-level Taylor extrapolation are derived under per-frame error assumptions; the analysis does not explicitly bound compounding temporal inconsistencies that arise when asynchronous noise levels are assigned to simultaneously generated frames, which directly threatens the central claim of comparable quality.
  2. [§5.1] §5.1 and Table 3: Aggregate metrics (FVD, CLIP) are reported as comparable, yet no per-video qualitative inspection or temporal coherence metric (e.g., frame-to-frame optical flow consistency) is provided to verify that extrapolation does not introduce flickering or coherence loss on the reported 4.73× speedup runs.
minor comments (2)
  1. The abstract states that SCOPE 'outperforms all training-free baselines,' but the introduction and experimental section should explicitly enumerate the exact baselines and their reported speedups for direct comparison.
  2. Figure captions for the qualitative results should include the exact noise schedule parameters and cache/predict/recompute decisions used for the displayed frames.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The two major comments identify important gaps in our error analysis and evaluation that we can address through targeted extensions. Below we respond point by point and indicate the planned revisions.

Point-by-point responses
  1. Referee: [§4.2] §4.2 (Error Propagation Analysis): The stability controls for noise-level Taylor extrapolation are derived under per-frame error assumptions; the analysis does not explicitly bound compounding temporal inconsistencies that arise when asynchronous noise levels are assigned to simultaneously generated frames, which directly threatens the central claim of comparable quality.

    Authors: We appreciate this observation. Section 4.2 derives per-frame stability bounds for the Taylor extrapolation step under the assumption of independent denoising trajectories. We acknowledge that the current derivation does not explicitly quantify error accumulation across frames that receive different noise levels within the same asynchronous generation step. In the revised manuscript we will extend the analysis to derive an upper bound on temporal inconsistency under asynchronous schedules, incorporating the maximum noise-level difference within a co-generated block and showing that the stability controls remain sufficient to keep accumulated error below the threshold used for the recompute decision. revision: yes

  2. Referee: [§5.1] §5.1 and Table 3: Aggregate metrics (FVD, CLIP) are reported as comparable, yet no per-video qualitative inspection or temporal coherence metric (e.g., frame-to-frame optical flow consistency) is provided to verify that extrapolation does not introduce flickering or coherence loss on the reported 4.73× speedup runs.

    Authors: We agree that aggregate metrics alone leave open the possibility of localized temporal artifacts. In the revised version we will add (i) a per-video qualitative inspection section with representative frames and difference maps at the 4.73× operating point, (ii) a new temporal coherence metric based on frame-to-frame optical-flow consistency (mean endpoint error and percentage of inconsistent pixels), and (iii) side-by-side visual comparisons against the full baseline and prior training-free methods. These additions will be placed in §5.1 and the supplementary material. revision: yes
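For concreteness, one plausible form of such a frame-to-frame consistency score, using OpenCV's Farneback flow and a warping-error definition as stand-ins; the authors do not commit to this particular estimator or threshold, so this is a sketch of the metric family rather than the metric they will report.

```python
import cv2
import numpy as np

def temporal_consistency(frame_a: np.ndarray, frame_b: np.ndarray,
                         thresh: float = 10.0):
    """Warping-error consistency between consecutive uint8 RGB frames.
    Estimates optical flow from frame_a to frame_b, warps frame_b back onto
    frame_a, and returns the mean absolute warping error plus the fraction of
    pixels whose error exceeds `thresh` (a stand-in for 'inconsistent pixels')."""
    g_a = cv2.cvtColor(frame_a, cv2.COLOR_RGB2GRAY)
    g_b = cv2.cvtColor(frame_b, cv2.COLOR_RGB2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g_a, g_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = g_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped_b = cv2.remap(frame_b, map_x, map_y, cv2.INTER_LINEAR)
    err = np.abs(warped_b.astype(np.float32) - frame_a.astype(np.float32)).mean(axis=-1)
    return float(err.mean()), float((err > thresh).mean())
```

Averaged over all consecutive frame pairs, a gap between the SCOPE run and the full-computation run on this kind of score would be the localized-artifact signal the referee asks to see.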

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents SCOPE as a training-free framework whose tri-modal scheduler (cache/predict/recompute) and noise-level Taylor extrapolation with stability controls are derived from explicit error propagation analysis rather than from fitted parameters or self-referential definitions. No load-bearing step reduces by construction to its own inputs; selective computation restricts execution to active intervals via stated rules, and extrapolation fills gaps with mathematically justified controls. Empirical results on external models (MAGI-1, SkyReels-V2) provide independent validation of the 4.73x speedup claim without circular reduction to internal fits or self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that Taylor extrapolation can safely approximate denoising steps across noise levels; no free parameters or new entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Taylor series expansion around noise levels can approximate denoising outputs with bounded error under stability controls
    Invoked to enable the predict mode between cache and recompute.

pith-pipeline@v0.9.0 · 5486 in / 1194 out tokens · 45718 ms · 2026-05-13T20:40:17.417835+00:00 · methodology

discussion (0)

