arxiv: 2605.18346 · v1 · pith:PBZWH4EXnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

Pith reviewed 2026-05-20 11:16 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords autoregressive video diffusion · KV cache compression · attention-based selection · training-free acceleration · per-frame selection · head importance estimation

The pith

Focused Forcing selects per-frame and per-head KV caches to accelerate autoregressive video diffusion without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the growing KV cache size problem in long-horizon autoregressive video diffusion by replacing coarse, shared history selection with fine-grained choices that vary for each newly generated frame and each attention head. It shows that frames generated together can rely on different past frames, that attention scores shift with temporal distance, and that heads suffer unequal quality loss when masked, so a combined attention-plus-diversity score plus head-importance budgeting should keep more useful context while discarding the rest. If correct, this yields faster generation that also improves visual quality and text alignment across different autoregressive setups.

Core claim

Focused Forcing is a training-free method that, for each generated frame, keeps the most relevant and distinctive historical frames by merging their attention scores with diversity scores, then gives larger cache budgets to heads whose masking causes greater generation degradation.

What carries the argument

Focused Forcing: per-generated-frame selection that combines attention scores with diversity scores of historical frames, paired with explicit estimation of per-head importance to set unequal cache budgets.
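
The per-frame scoring described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the min-max normalization, the mixing weight `alpha`, and the diversity proxy (distance of a frame feature from the mean historical feature) are all hypothetical choices made here for clarity. The source only states that attention scores are combined with diversity scores for each generated frame.

```python
from math import sqrt

def diversity_scores(features):
    """Assumed proxy: Euclidean distance of each historical frame feature
    from the mean historical feature."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    return [sqrt(sum((f[d] - mean[d]) ** 2 for d in range(dim))) for f in features]

def normalize(xs):
    """Min-max normalize to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

def select_history(attn, features, budget, alpha=0.5):
    """attn[q][h] is the attention mass from query frame q to historical frame h.
    Returns, for each query frame, the sorted indices of the kept historical frames.
    alpha is a hypothetical weight: 1.0 means attention-only selection."""
    div = normalize(diversity_scores(features))
    kept = []
    for row in attn:
        a = normalize(row)
        score = [alpha * a[h] + (1 - alpha) * div[h] for h in range(len(row))]
        # Stable sort, so ties are broken in favor of the earlier frame.
        top = sorted(range(len(row)), key=lambda h: score[h], reverse=True)[:budget]
        kept.append(sorted(top))
    return kept

# Two query frames in one chunk, four historical frames, toy 2-D features.
attn = [[0.7, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.7]]
features = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0], [0.0, 0.0]]
per_frame = select_history(attn, features, budget=2)
```

In this toy input the two query frames keep different histories, and both keep frame 2 because it is the only frame that differs from the average, even though neither attends to it strongly. Setting `alpha=1.0` recovers attention-only selection, which drops that frame.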

If this is right

  • Up to 1.48× end-to-end acceleration is obtained across multiple autoregressive video diffusion paradigms.
  • Visual quality and text alignment improve rather than degrade.
  • The method works without any retraining or fine-tuning.
  • Selection decisions are made separately for each frame inside a generation chunk instead of once for the whole chunk.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same per-frame and per-head logic could be tested on autoregressive image or audio diffusion models that also maintain growing context caches.
  • Further speed gains might appear if the diversity scoring step is approximated with cheaper features for very long videos.
  • Combining Focused Forcing with existing quantization or pruning of the kept KV entries could compound the efficiency benefit.

Load-bearing premise

Combining attention scores with diversity scores and assigning budgets by estimated head importance will preserve quality better than uniform or attention-only selection.
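
The head-budgeting half of this premise can be illustrated with a small allocation routine. It is a sketch under assumptions, not the paper's procedure: the proportional rule, the per-head floor `min_per_head`, and the largest-remainder rounding are choices made here. The source only says that heads with higher estimated importance (greater degradation when masked) receive larger budgets.

```python
def allocate_head_budgets(importance, total_budget, min_per_head=1):
    """Split total_budget cache slots across heads in proportion to importance.
    importance[h] is a non-negative score, e.g. the loss increase observed
    when head h is masked (how that score is measured is outside this sketch)."""
    n = len(importance)
    if any(x < 0 for x in importance):
        raise ValueError("importance scores must be non-negative")
    if total_budget < n * min_per_head:
        raise ValueError("total_budget is too small for the per-head minimum")
    spare = total_budget - n * min_per_head
    total_imp = sum(importance)
    if total_imp == 0:
        shares = [spare / n] * n  # no signal: fall back to a uniform split
    else:
        shares = [spare * x / total_imp for x in importance]
    budgets = [min_per_head + int(s) for s in shares]
    # Largest-remainder rounding: hand leftover slots to the largest fractions.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda h: shares[h] - int(shares[h]), reverse=True)
    for h in order[:leftover]:
        budgets[h] += 1
    return budgets

budgets = allocate_head_budgets([4.0, 2.0, 1.0, 1.0], total_budget=12)
```

The floor keeps every head from losing its history entirely, which is one way to hedge against a noisy importance estimate. Uniform budgeting corresponds to passing equal importance scores.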

What would settle it

Running the same long video generation task with Focused Forcing and with uniform attention-based selection at a matched KV budget would settle it. If Focused Forcing scored worse on visual quality or text alignment (for example lower PSNR, higher LPIPS, or a lower text-alignment score), the claim would be false.

Figures

Figures reproduced from arXiv: 2605.18346 by Dongrui Liu, Evelyn Zhang, Hao Lin, Jiacheng Liu, Jiehang Huang, Linfeng Zhang, Peiliang Cai, Ruiqi Zhang, Shikang Zheng, Weile Mo, Yue Ma.

Figure 1. Overview of Focused Forcing. (a) Existing methods use shared history selection, attention-only scoring, and coarse head budgeting, illustrated here with uniform budgets. (b) Focused Forcing uses per-frame history selection, content-aware scoring, and head-adaptive budgets. (c) Focused Forcing achieves up to 1.48× speedup across multiple paradigms while preserving quality.
Figure 2. Attention is generated-frame-dependent and relative-temporal-distance-sensitive. Rows denote query frames in the current generated chunk, and columns denote historical key frames. Attention varies across query frames and changes with relative temporal distance.
Figure 3. Overview of Focused Forcing. (a) We estimate head importance by masking each head and measuring the DM loss, then allocate larger KV budgets to more important heads. (b) For each query frame and head, we score historical frames by combining attention scores with their diversity scores. (c) The selected QKV rows are packed and computed by variable-length FlashAttention.
Figure 4. Head-wise importance is non-uniform. Each bar counts the number of heads falling into its DM-loss interval, and the vertical lines mark the minimum, median, and maximum values.
Figure 5. Diversity scores highlight changing regions. Historical frames with high diversity scores often contain regions that differ from the average historical representation.
Figure 6. Qualitative comparison on 30s video generation across different autoregressive paradigms.
Figure 7. Qualitative comparison of inference acceleration methods on 30s video generation.
Figure 9. Ablation study on the KV budget.
Figure 12. Qualitative examples comparing long-horizon consistency across different paradigms.
Figure 13. Qualitative examples comparing long-horizon consistency across different paradigms.
Figure 14. Qualitative examples comparing long-horizon consistency across different paradigms.
Figure 15. Qualitative examples comparing long-horizon consistency across different paradigms.
Figure 16. Qualitative examples comparing quality preservation across acceleration methods.
Figure 17. Qualitative examples comparing quality preservation across acceleration methods.
Figure 18. Qualitative examples comparing quality preservation across acceleration methods.
Figure 19. Qualitative examples comparing quality preservation across acceleration methods.
Original abstract

Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires increasingly large KV caches, making efficient compression without sacrificing quality challenging. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods often apply a shared history selection to the whole chunk, score historical frames solely by attention, and assign head-wise budgets either uniformly or by attention-pattern heuristics rather than explicit head-importance estimation. We show that frames within the same generated chunk can depend on distinct historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads induces unequal generation degradation. Motivated by these findings, we propose Focused Forcing, a training-free KV selection method that focuses cached history along both generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, while assigning larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to 1.48× end-to-end acceleration without training, while improving visual quality and text alignment. Our code will be released on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Focused Forcing, a training-free KV cache compression method for autoregressive video diffusion. Motivated by observations that frames within a generation chunk depend on distinct history, attention scores vary with relative temporal distance, and heads degrade unequally under masking, the method selects historical frames per generated frame by combining attention scores with diversity scores and allocates per-head budgets according to estimated importance. It reports up to 1.48× end-to-end acceleration while improving visual quality and text alignment across multiple autoregressive paradigms.

Significance. If the empirical claims hold under rigorous controls, the work would offer a practical, training-free route to scaling long-horizon autoregressive video generation by reducing KV cache footprint without quality regression. The per-frame, per-head granularity and explicit use of diversity alongside attention distinguish it from prior uniform or attention-only selection heuristics.

major comments (2)
  1. [Abstract and results] The headline performance claim (1.48× acceleration with quality gains) is presented in the abstract without any description of experimental setup, datasets, baselines, chunk sizes, or statistical testing. This information is load-bearing for assessing whether the attention-plus-diversity selection plus head-importance budgeting actually outperforms uniform/attention-only alternatives or merely reflects particular test conditions.
  2. [Motivation and method] The motivation section establishes that frames in the same chunk can depend on distinct history and that heads show unequal degradation, yet the manuscript does not quantify whether the chosen diversity metric (feature variance or similar) correlates with downstream generation impact better than attention alone. Without an ablation isolating this combination, the superiority claim risks being an artifact of the evaluation protocol.
minor comments (2)
  1. [Method] Notation for the diversity score and head-importance estimator should be introduced with explicit formulas rather than descriptive text only.
  2. [Figures] Figure captions should state the exact metrics (e.g., FID, CLIP score) and number of samples used for the reported quality improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback, which has helped us identify areas for improvement in the presentation of our work. Below, we respond to each major comment and indicate the revisions we plan to make.

Point-by-point responses
  1. Referee: [Abstract and results] The headline performance claim (1.48× acceleration with quality gains) is presented in the abstract without any description of experimental setup, datasets, baselines, chunk sizes, or statistical testing. This information is load-bearing for assessing whether the attention-plus-diversity selection plus head-importance budgeting actually outperforms uniform/attention-only alternatives or merely reflects particular test conditions.

    Authors: We thank the referee for this observation. The abstract is designed as a concise summary, while comprehensive details on datasets, baselines, chunk sizes, and statistical testing (including multiple random seeds with reported variance) are provided in Section 4. To improve accessibility, we will revise the abstract to include a brief clause referencing the evaluation across standard video benchmarks and multiple autoregressive paradigms. This provides necessary context without exceeding typical abstract length constraints. revision: partial

  2. Referee: [Motivation and method] The motivation section establishes that frames in the same chunk can depend on distinct history and that heads show unequal degradation, yet the manuscript does not quantify whether the chosen diversity metric (feature variance or similar) correlates with downstream generation impact better than attention alone. Without an ablation isolating this combination, the superiority claim risks being an artifact of the evaluation protocol.

    Authors: We agree that further quantification would strengthen the motivation. The section already includes quantitative masking experiments demonstrating unequal head degradation and frame-specific history dependencies. In the revision, we will add a dedicated ablation study comparing attention-only, diversity-only, and combined selection strategies, along with correlation analysis between diversity scores and downstream quality metrics (e.g., visual and alignment scores). This will substantiate that the combination provides benefits beyond attention alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained and empirically motivated

Full rationale

The paper reports direct observations on frame-specific dependencies within chunks, distance-dependent attention scores, and unequal head degradation under masking. It then defines Focused Forcing as a training-free heuristic that combines per-frame attention scores with diversity scores and allocates budgets according to estimated head importance. No equations, fitted parameters, or self-citation chains are shown that reduce the selection rule or the reported 1.48× speedup-plus-quality claim back to the motivating observations by construction. The central results are presented as outcomes of empirical evaluation across multiple autoregressive paradigms, leaving the method independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard attention mechanisms in diffusion transformers and the empirical observations stated in the abstract; no new entities or fitted parameters are introduced in the provided text.

axioms (1)
  • Domain assumption: Attention scores and diversity metrics can be computed from the model's existing forward pass without additional training.
    Implicit in the training-free claim and the use of attention scores for selection.



Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 17 internal anchors

  1. [1]

    Monarchrt: Efficient attention for real-time video generation, 2026

    Krish Agarwal, Zhuoming Chen, Cheng Luo, Yongqi Chen, Haizhong Zheng, Xun Huang, Atri Rudra, and Beidi Chen. Monarchrt: Efficient attention for real-time video generation, 2026. URLhttps://arxiv.org/abs/2602.12271

  2. [2]

    MAGI-1: Autoregressive Video Generation at Scale

    Sand. ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shuchen...

  3. [3]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  4. [4]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

  5. [5]

    Lesa: Learnable stage-aware predictors for diffusion model acceleration, 2026

    Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou, and Linfeng Zhang. Lesa: Learnable stage-aware predictors for diffusion model acceleration, 2026. URL https:// arxiv.org/abs/2602.20497

  6. [6]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  7. [7]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model...

  8. [8]

    Context forcing: Consistent autoregressive video generation with long context,

    Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, and Wenhu Chen. Context forcing: Consistent autoregressive video generation with long context.arXiv preprint arXiv:2602.06028, 2026

  9. [9]

    Self-forcing++: Towards minute-scale high-quality video generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/forum?id=DzvPiqh23f

  10. [10]

    Flashattention-2: Faster attention with better parallelism and work partition- ing

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partition- ing. In B. Kim, Y . Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y . Sun, edi- tors,International Conference on Learning Representations, volume 2024, pages 35549– 35562, 2024. URL https://proceedings.iclr.cc/paper_files/paper/2024/file/ 98ed250b203d1ac6b24bbcf263e3...

  11. [11]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In S. Koyejo, S. Mo- hamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neu- ral Information Processing Systems, volume 35, pages 16344–16359. Curran Associates, Inc., 2022. URL https://proceeding...

  12. [12]

    Autoregressive video generation without vector quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=JE9tCwe3lp. 10

  13. [13]

    Pipefusion: Patch-level pipeline parallelism for diffusion transformers inference

    Jiarui Fang, Jinzhe Pan, Aoyu Li, Xibo Sun, and WANG Jiannan. Pipefusion: Patch-level pipeline parallelism for diffusion transformers inference. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  14. [14]

    One Step Diffusion via Shortcut Models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models, 2025. URLhttps://arxiv.org/abs/2410.12557

  15. [15]

    Mean Flows for One-step Generative Modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling, 2025. URLhttps://arxiv.org/abs/2505.13447

  16. [16]

    Efficient autoregressive video diffusion with dummy head.arXiv preprint arXiv:2601.20499, 2026

    Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, and Yan Lu. Efficient autoregressive video diffusion with dummy head.arXiv preprint arXiv:2601.20499, 2026

  17. [17]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

  18. [18]

    Ltx-2: Efficient joint audio-visual foundation model, 2026

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, V...

  19. [19]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. URLhttps://arxiv.org/abs/2006.11239

  20. [20]

    arXiv preprint arXiv:2401.08671 (2024)

    Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yux- iong He. Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed- inference, 2024. URLhttps://arxiv.org/abs/2401.08671

  21. [21]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview. net/forum?id=mSiN7i0BYH

  22. [22]

    VBench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelli...

  23. [23]

    Distrifusion: Distributed parallel inference for high-resolution diffusion models

    Muyang Li, Tianle Cai, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming- Yu Liu, Kai Li, and Song Han. Distrifusion: Distributed parallel inference for high-resolution diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  24. [24]

    Timestep embedding tells: It’s time to cache for video diffusion model, 2024

    Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model.arXiv preprint arXiv:2411.19108, 2024

  25. [25]

    From reusing to forecasting: Accelerating diffusion models with taylorseers

    Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. From reusing to forecasting: Accelerating diffusion models with taylorseers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15853–15863, October 2025

  26. [26]

    Rolling forcing: Autoregressive long video diffusion in real time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=IAyzXjbfwo. 11

  27. [27]

    Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, June 2025

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, June 2025. ISSN 2731-5398. doi: 10.1007/s11633-025-1562-4. URL http://dx.doi.org/10.1007/s11633-025-1562-4

  28. [28]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023

  29. [29]

    Deepcache: Accelerating diffusion models for free

    Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. InThe IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  30. [30]

    Follow your pose: Pose-guided text-to-video generation using pose-free videos

    Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4117–4125, 2024

  31. [31]

    Con- trollable video generation: A survey.arXiv preprint arXiv:2507.16869, 2025

    Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Bingyuan Wang, Qinghe Wang, Xuanhua He, Hongfa Wang, et al. Controllable video generation: A survey.arXiv preprint arXiv:2507.16869, 2025

  32. [32]

    Follow- your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning.arXiv preprint arXiv:2506.05207, 2025

    Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zexuan Yan, Zhifeng Li, Sirui Han, Chenyang Qi, et al. Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning.arXiv preprint arXiv:2506.05207, 2025

  33. [33]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

  34. [34]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Ja- gadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Si...

  35. [35]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. InInternational Conference on Learning Representations (ICLR), 2022

  36. [36]

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hon...

  37. [37]

    Fora: Fast-forward caching in diffusion transformer acceleration.arXiv preprint arXiv:2407.01425,

    Pratheba Selvaraju, Tianyu Ding, Tianyi Chen, Ilya Zharkov, and Luming Liang. Fora: Fast- forward caching in diffusion transformer acceleration, 2024. URL https://arxiv.org/abs/ 2407.01425

  38. [38]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021

  39. [39]

    Hunyuan-gamecraft-2: Instruction- following interactive game world model, 2026

    Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Linfeng Zhang, and Qinglin Lu. Hunyuan-gamecraft-2: Instruction- following interactive game world model, 2026. URLhttps://arxiv.org/abs/2511.23429

  40. [40]

    Advancing Open-source World Models

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models.arXiv preprint arXiv:26...

  41. [41]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  42. [42]

    HunyuanVideo 1.5 Technical Report

    Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, Linus, Patrol, Peizhen Zhang, Peng Chen, Penghao Zhao, Qi Tian, Songtao Liu, Weijie Kong, Weiyan Wang, Xiao He, Xin Li, Xinchi Deng, Xuefei Zhe, Yang Li, Yanxin Long, Yuanbo Peng, Yue Wu, Yuhong Liu, Zhenyu Wang, Zuozhuo Dai, Bo Peng, Coo...

  43. [43]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In B. Kim, Y . Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y . Sun, editors,International Conference on Learning Representations, volume 2024, pages 21875–21895, 2024. URL https://proceedings.iclr.cc/paper_files/ paper/2024/fi...

  44. [44]

    Duoattention: Efficient long-context LLM inference with retrieval and streaming 13 heads

    Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, junxian guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context LLM inference with retrieval and streaming 13 heads. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=cFu7ze7xUm

  45. [45]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  46. [46]

    Longlive: Real-time interactive long video generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Ying-Cong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=nCAODkpsPJ

  47. [47]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  48. [48]

    Deep forcing: Training-free long video generation with deep sink and participative compression

    Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025

  49. [49]

    Improved distribution matching distillation for fast image synthesis

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Frédo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 47455–4748...

  50. [50]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6613–6623, June 2024

  51. [51]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22963–22974, June 2025

  52. [52]

    Helios: Real real-time long video generation model

    Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model. arXiv preprint arXiv:2603.04379, 2026

  53. [53]

    Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization

    Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization. URL https://arxiv.org/abs/2411.10958

  55. [55]

    Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration

    Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, and Jianfei Chen. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration, 2025. URL https://arxiv.org/abs/2410.02367

  56. [56]

    Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models

    Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, and Anyi Rao. Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models. arXiv preprint arXiv:2603.17051, 2026

  57. [57]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang "Atlas" Wang, and Beidi Chen. H2o: Heavy-hitter oracle for efficient generative inference of large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in...

  58. [58]

    Unipc: A unified predictor-corrector framework for fast sampling of diffusion models

    Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  59. [59]

    Relax forcing: Relaxed kv-memory for consistent long video generation

    Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, and Ioannis Patras. Relax forcing: Relaxed kv-memory for consistent long video generation, 2026. URL https://arxiv.org/abs/2603.21366

  60. [60]

    Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

    Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214, 2026

  61. [61]

    Accelerating diffusion transformers with token-wise feature caching

    Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, and Linfeng Zhang. Accelerating diffusion transformers with token-wise feature caching. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=yYZbZGo4ei.

A Limitations

Focused Forcing is designed as a training-free KV selection method for efficient...