pith. machine review for the scientific record.

arxiv: 2604.02979 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: no theorem link

Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive video generation · video diffusion models · training-free acceleration · Taylor extrapolation · selective computation · cache and recompute

The pith

Autoregressive video diffusion models can run up to 4.73 times faster by deciding per frame whether to cache, extrapolate, or fully recompute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the high cost of autoregressive video diffusion, where each new frame requires many denoising steps and prior methods only allow all-or-nothing reuse of earlier results. SCOPE adds an intermediate prediction option that uses Taylor extrapolation on the noise level to approximate the missing computation, backed by an error analysis that keeps the approximation stable. It also limits work to the current active frame interval rather than processing every frame uniformly. The result is substantially lower wall-clock time on real models while the output quality stays close to the unaccelerated baseline.
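To make the predict mode concrete, here is a minimal sketch of noise-level Taylor extrapolation from two cached model outputs, assuming torch tensors and a first-order expansion; the function name and the finite-difference derivative are illustrative assumptions rather than the paper's implementation, which additionally applies the stability controls from its error analysis.

```python
import torch

def taylor_predict(f_prev: torch.Tensor, f_curr: torch.Tensor,
                   sigma_prev: float, sigma_curr: float,
                   sigma_target: float) -> torch.Tensor:
    """First-order Taylor extrapolation of a cached model output to a new
    noise level. f_prev and f_curr are outputs cached at sigma_prev and
    sigma_curr (e.g., predicted noise or intermediate features); the finite
    difference stands in for the derivative with respect to the noise level."""
    slope = (f_curr - f_prev) / (sigma_curr - sigma_prev)
    return f_curr + (sigma_target - sigma_curr) * slope
```

In a denoising loop, a call like this would replace the full network evaluation for frames whose outputs are changing slowly enough, at the cost of exactly the extrapolation error the paper's analysis is meant to bound.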

Core claim

SCOPE is a training-free framework that replaces binary cache-or-recompute decisions with a tri-modal scheduler over cache, predict, and recompute. Prediction is performed by noise-level Taylor extrapolation equipped with explicit stability controls derived from error propagation analysis. Selective computation further restricts all operations to the active frame interval under the asynchronous AR noise schedule. On MAGI-1 and SkyReels-V2 this produces up to 4.73x speedup with quality comparable to the original full-computation output.

What carries the argument

Tri-modal scheduler over cache, predict via noise-level Taylor extrapolation, and recompute, paired with selective computation restricted to the active frame interval.
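A hedged sketch of how that decision rule and the interval restriction could fit together; the change-magnitude proxy, the two thresholds, and the data layout are assumptions made for illustration, not the paper's scheduler.

```python
import torch

def schedule_step(active_interval, cached, tau_cache=0.01, tau_predict=0.05):
    """Per-frame tri-modal decision (cache / predict / recompute), restricted to
    the active frame interval of an asynchronous AR noise schedule.
    cached[i] = (prev_output, curr_output): torch tensors saved for frame i at
    the two most recently computed steps."""
    decisions = {}
    lo, hi = active_interval                  # frames outside this window are skipped
    for i in range(lo, hi):
        prev_out, curr_out = cached[i]
        # proxy for how quickly frame i's output is changing across steps
        change = (curr_out - prev_out).abs().mean().item()
        if change < tau_cache:
            decisions[i] = "cache"            # reuse the cached output directly
        elif change < tau_predict:
            decisions[i] = "predict"          # Taylor-extrapolate to the new noise level
        else:
            decisions[i] = "recompute"        # run the full denoiser for this frame
    return decisions
```

The point of the tri-modal split is visible here: frames that a binary scheme would have forced into recompute because plain reuse was too coarse can now fall into the predict branch instead.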

If this is right

  • Longer video sequences become feasible within the same compute budget.
  • Quality metrics stay comparable to the baseline full model on the evaluated datasets.
  • The method outperforms all prior training-free acceleration baselines on the tested models.
  • No retraining or architecture changes are required for the reported speedups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-frame decision logic could be tested on other sequential generation tasks such as audio or 3D scene synthesis.
  • Combining the scheduler with existing model distillation techniques might produce multiplicative speed gains.
  • If the extrapolation error remains bounded at higher frame counts, the approach could support interactive video editing loops.

Load-bearing premise

Noise-level Taylor extrapolation with the given stability controls produces intermediate frames whose visual quality matches full recomputation under the asynchronous noise schedules of autoregressive video models.
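Stated generically, the premise amounts to a low-order expansion in the noise level whose remainder stays small; the expansion variable, the order, and the finite-difference derivative below are assumptions for orientation, not necessarily the paper's exact formulation.

```latex
% First-order extrapolation of a cached model output F in the noise level \sigma,
% with the derivative replaced by a finite difference of two cached outputs.
% The load-bearing premise is that the remainder R stays small under the
% paper's stability controls.
F(\sigma_{\mathrm{target}}) \approx F(\sigma_k)
  + (\sigma_{\mathrm{target}} - \sigma_k)\,
    \frac{F(\sigma_k) - F(\sigma_{k-1})}{\sigma_k - \sigma_{k-1}},
\qquad
R = \mathcal{O}\!\left((\sigma_{\mathrm{target}} - \sigma_k)^{2}\right)
```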

What would settle it

Side-by-side generation of the same long video prompt with SCOPE and the unmodified model, checking whether perceptual quality scores or human preference rates drop by more than the paper's reported margin.

Figures

Figures reproduced from arXiv: 2604.02979 by Fanshuai Meng, Hanshuai Cui, Weijia Jia, Wei Zhao, Zhiqing Tang, Zhi Yao.

Figure 1: SCOPE achieves large speedup while preserving …
Figure 2: At Frame 13 over steps 54–59: (a) Taylor prediction …
Figure 3: Conceptual step-matrix view of asynchronous autoregressive …
Figure 4: Overview of SCOPE for accelerating autoregressive video diffusion by reducing redundant computation in both the …
Figure 5: Qualitative comparison of all methods on representative prompts from the two models. The left two groups show …
Figure 6: Tri-modal decision timeline comparison between …
Figure 9: Peak GPU memory comparison under short and …
Figure 8: Denoising time of Original versus SCOPE at varying …
Figure 11: Few-step reduced-step Original comparison at matched runtime targets …
Original abstract

Autoregressive (AR) video diffusion models enable long-form video generation but remain expensive due to repeated multi-step denoising. Existing training-free acceleration methods rely on binary cache-or-recompute decisions, overlooking intermediate cases where direct reuse is too coarse yet full recomputation is unnecessary. Moreover, asynchronous AR schedules assign different noise levels to co-generated frames, yet existing methods process the entire valid interval uniformly. To address these AR-specific inefficiencies, we present SCOPE, a training-free framework for efficient AR video diffusion. SCOPE introduces a tri-modal scheduler over cache, predict, and recompute, where prediction via noise-level Taylor extrapolation fills the gap between reuse and recomputation with explicit stability controls backed by error propagation analysis. It further introduces selective computation that restricts execution to the active frame interval. On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SCOPE, a training-free acceleration method for autoregressive video diffusion models. It introduces a tri-modal scheduler (cache/predict/recompute) that uses noise-level Taylor extrapolation with stability controls derived from error-propagation analysis to handle intermediate cases between reuse and full recomputation, plus selective computation restricted to active frame intervals under asynchronous AR noise schedules. Experiments on MAGI-1 and SkyReels-V2 report up to 4.73× speedup while claiming quality comparable to the full baseline, outperforming prior training-free methods.

Significance. If the extrapolation step reliably preserves temporal coherence without visible artifacts, the work would offer a practical, training-free route to faster long-form video generation that directly targets AR-specific inefficiencies such as varying per-frame noise levels. The explicit error analysis and absence of learned parameters are positive features that distinguish it from heuristic caching approaches.

major comments (2)
  1. [§4.2] §4.2 (Error Propagation Analysis): The stability controls for noise-level Taylor extrapolation are derived under per-frame error assumptions; the analysis does not explicitly bound compounding temporal inconsistencies that arise when asynchronous noise levels are assigned to simultaneously generated frames, which directly threatens the central claim of comparable quality.
  2. [§5.1] §5.1 and Table 3: Aggregate metrics (FVD, CLIP) are reported as comparable, yet no per-video qualitative inspection or temporal coherence metric (e.g., frame-to-frame optical flow consistency) is provided to verify that extrapolation does not introduce flickering or coherence loss on the reported 4.73× speedup runs.
minor comments (2)
  1. The abstract states that SCOPE 'outperforms all training-free baselines,' but the introduction and experimental section should explicitly enumerate the exact baselines and their reported speedups for direct comparison.
  2. Figure captions for the qualitative results should include the exact noise schedule parameters and cache/predict/recompute decisions used for the displayed frames.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The two major comments identify important gaps in our error analysis and evaluation that we can address through targeted extensions. Below we respond point by point and indicate the planned revisions.

Point-by-point responses
  1. Referee: [§4.2] §4.2 (Error Propagation Analysis): The stability controls for noise-level Taylor extrapolation are derived under per-frame error assumptions; the analysis does not explicitly bound compounding temporal inconsistencies that arise when asynchronous noise levels are assigned to simultaneously generated frames, which directly threatens the central claim of comparable quality.

    Authors: We appreciate this observation. Section 4.2 derives per-frame stability bounds for the Taylor extrapolation step under the assumption of independent denoising trajectories. We acknowledge that the current derivation does not explicitly quantify error accumulation across frames that receive different noise levels within the same asynchronous generation step. In the revised manuscript we will extend the analysis to derive an upper bound on temporal inconsistency under asynchronous schedules, incorporating the maximum noise-level difference within a co-generated block and showing that the stability controls remain sufficient to keep accumulated error below the threshold used for the recompute decision. revision: yes

  2. Referee: [§5.1] §5.1 and Table 3: Aggregate metrics (FVD, CLIP) are reported as comparable, yet no per-video qualitative inspection or temporal coherence metric (e.g., frame-to-frame optical flow consistency) is provided to verify that extrapolation does not introduce flickering or coherence loss on the reported 4.73× speedup runs.

    Authors: We agree that aggregate metrics alone leave open the possibility of localized temporal artifacts. In the revised version we will add (i) a per-video qualitative inspection section with representative frames and difference maps at the 4.73× operating point, (ii) a new temporal coherence metric based on frame-to-frame optical-flow consistency (mean endpoint error and percentage of inconsistent pixels), and (iii) side-by-side visual comparisons against the full baseline and prior training-free methods. These additions will be placed in §5.1 and the supplementary material. revision: yes
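For concreteness, one plausible form of such a frame-to-frame consistency score, using OpenCV's Farneback flow and a warping-error definition as stand-ins; the authors do not commit to this particular estimator or threshold, so this is a sketch of the metric family rather than the metric they will report.

```python
import cv2
import numpy as np

def temporal_consistency(frame_a: np.ndarray, frame_b: np.ndarray,
                         thresh: float = 10.0):
    """Warping-error consistency between consecutive uint8 RGB frames.
    Estimates optical flow from frame_a to frame_b, warps frame_b back onto
    frame_a, and returns the mean absolute warping error plus the fraction of
    pixels whose error exceeds `thresh` (a stand-in for 'inconsistent pixels')."""
    g_a = cv2.cvtColor(frame_a, cv2.COLOR_RGB2GRAY)
    g_b = cv2.cvtColor(frame_b, cv2.COLOR_RGB2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g_a, g_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = g_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped_b = cv2.remap(frame_b, map_x, map_y, cv2.INTER_LINEAR)
    err = np.abs(warped_b.astype(np.float32) - frame_a.astype(np.float32)).mean(axis=-1)
    return float(err.mean()), float((err > thresh).mean())
```

Averaged over all consecutive frame pairs, a gap between the SCOPE run and the full-computation run on this kind of score would be the localized-artifact signal the referee asks to see.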

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents SCOPE as a training-free framework whose tri-modal scheduler (cache/predict/recompute) and noise-level Taylor extrapolation with stability controls are derived from explicit error propagation analysis rather than from fitted parameters or self-referential definitions. No load-bearing step reduces by construction to its own inputs; selective computation restricts execution to active intervals via stated rules, and extrapolation fills gaps with mathematically justified controls. Empirical results on external models (MAGI-1, SkyReels-V2) provide independent validation of the 4.73x speedup claim without circular reduction to internal fits or self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that Taylor extrapolation can safely approximate denoising steps across noise levels; no free parameters or new entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Taylor series expansion around noise levels can approximate denoising outputs with bounded error under stability controls
    Invoked to enable the predict mode between cache and recompute.

pith-pipeline@v0.9.0 · 5486 in / 1194 out tokens · 45718 ms · 2026-05-13T20:40:17.417835+00:00 · methodology

discussion (0)

