arxiv: 2605.30083 · v1 · pith:RGULKJBYnew · submitted 2026-05-28 · 💻 cs.CV

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

Pith reviewed 2026-06-29 08:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords KV cache · autoregressive video generation · training-free · future-aware · query distribution · token merging · long video synthesis · RoPE

The pith

The pre-RoPE query distribution stays stable enough during autoregressive video generation that historical statistics can predict future query needs for KV cache management.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive video models generate frames one by one while reusing past computations in a KV cache. The cache grows large and can drop important tokens if importance is judged only by current context. The paper finds that the base query distribution before rotary embeddings barely changes over time. This lets the method build a proxy for what future queries will look like using only past data. Scoring and merging tokens against this proxy keeps the cache focused on what will matter later without any retraining.
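
The distinction between pre-RoPE and RoPE-modulated queries can be illustrated with a small sketch. It assumes a plain 1-D rotary embedding with the usual base of 10000 (video models typically use multi-axis variants), and the helper names `rope` and `dot` are hypothetical. It shows that a fixed pre-RoPE query yields different rotated vectors at different positions, while the attention logit depends only on the relative offset.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply a 1-D rotary position embedding to an even-length vector.

    Assumption: standard pairwise rotation with frequencies base**(-i/d);
    this is a simplification of what video models actually use.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 0.25, 2.0]   # the same pre-RoPE query at every step
k = [1.0, 0.5, -0.5, 1.5]    # a pre-RoPE key

# The rotated query changes with position even though q itself does not.
q_at_3, q_at_40 = rope(q, 3), rope(q, 40)

# The logit depends only on the offset between query and key positions.
logit_a = dot(rope(q, 10), rope(k, 7))
logit_b = dot(rope(q, 110), rope(k, 107))
```

Here `logit_a` and `logit_b` agree up to floating-point error because both pairs are three positions apart, which is why statistics gathered before the rotation are the natural place to look for stability.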

Core claim

Although RoPE-modulated queries evolve across autoregressive steps, the underlying canonical pre-RoPE query distribution remains remarkably stable throughout the video generation process. This approximate stationarity implies that future query distributions are estimable from historical statistics, enabling principled future-aware cache decisions without any additional training. Future Forcing constructs a future query proxy from historical statistics, scores KV cache tokens by their importance under this proxy, and merges redundant token pairs within the affine subspace induced by the future query.
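
A minimal sketch of the scoring step, under assumptions that are not taken from the paper: the proxy is the plain mean of past pre-RoPE queries, importance is the softmax attention weight the proxy assigns to each cached key, positional rotation is ignored, and low-scoring tokens are dropped rather than merged. The helper names `mean_query`, `importance`, and `keep_top` are hypothetical.

```python
import math

def mean_query(history):
    """Average past pre-RoPE query vectors into one proxy (assumed statistic)."""
    n, d = len(history), len(history[0])
    return [sum(q[j] for q in history) / n for j in range(d)]

def importance(proxy, keys):
    """Softmax attention weights of the proxy over cached keys (assumed score)."""
    d = len(proxy)
    logits = [sum(p * k for p, k in zip(proxy, key)) / math.sqrt(d) for key in keys]
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def keep_top(scores, budget):
    """Indices of the `budget` highest-scoring cache entries, in cache order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Toy data: three historical queries and four cached keys, all 2-D.
history = [[1.0, 0.0], [0.8, 0.2], [1.2, -0.2]]
keys = [[2.0, 0.0], [-1.0, 0.0], [0.0, 3.0], [1.5, 0.5]]

proxy = mean_query(history)
scores = importance(proxy, keys)
kept = keep_top(scores, budget=2)   # keys 0 and 3 align best with the proxy
```

The point of the sketch is only the data flow: nothing from the current step is needed to rank the cache, so the decision can be made before the future queries exist.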

What carries the argument

Future query proxy constructed from historical pre-RoPE query statistics that scores KV tokens and defines merging subspaces.

Load-bearing premise

The canonical pre-RoPE query distribution must stay close enough to stationary that early statistics reliably estimate later query requirements.

What would settle it

Compare the distribution of pre-RoPE queries computed at the start versus the end of a long video generation; large shifts would mean the historical proxy no longer matches future needs.
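
One way such a comparison could be set up is sketched below. The drift measure (largest per-dimension mean shift in units of pooled standard deviation) is an assumption chosen for simplicity, not the paper's diagnostic, and the queries are synthetic Gaussian samples rather than model activations. `max_standardized_shift` and `sample` are hypothetical helpers.

```python
import random
import statistics

def max_standardized_shift(early, late):
    """Largest per-dimension |mean shift| measured in pooled standard deviations."""
    worst = 0.0
    for j in range(len(early[0])):
        a = [q[j] for q in early]
        b = [q[j] for q in late]
        pooled = ((statistics.pvariance(a) + statistics.pvariance(b)) / 2) ** 0.5
        shift = abs(statistics.fmean(a) - statistics.fmean(b)) / (pooled + 1e-12)
        worst = max(worst, shift)
    return worst

rng = random.Random(0)

def sample(n, dim, offset=0.0):
    """Synthetic stand-in for pre-RoPE queries collected over a window of frames."""
    return [[rng.gauss(offset, 1.0) for _ in range(dim)] for _ in range(n)]

early = sample(500, 4)                      # queries from the first frames
late_stable = sample(500, 4)                # same distribution, later frames
late_drifted = sample(500, 4, offset=2.0)   # deliberately shifted distribution

stable_shift = max_standardized_shift(early, late_stable)
drifted_shift = max_standardized_shift(early, late_drifted)
```

A small `stable_shift` alongside a large `drifted_shift` is the pattern that would separate the two cases. A check on means alone would miss changes in shape or correlation, so a real test would likely need a richer distance.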

Figures

Figures reproduced from arXiv: 2605.30083 by Chen Gao, Cong Wang, Hanxin Zhu, Jiayi Luo, Jiayu Chen, Junhao Liu, Qingyun Sun, Qiyan Liu, Tengyang Wang, Xiaobin Hu, Zhibo Chen.

Figure 1: Qualitative comparison of 30s long video generation across different KV cache strategies.
Figure 2: Query distribution across latent frames for different autoregressive video generation models.
Figure 3: Overview of Future Forcing, which constructs future query proxies from stable pre-RoPE
Figure 5: Efficiency and memory study of our Future Forcing. (Panel labels captured with this caption: CF + DF, CF + FF (Ours), RF + DF, RF + FF (Ours), at 0s and 60s. Prompts: "A corgi wearing sunglasses walks on the beach of a tropical island"; "A toy robot wearing blue jeans and a white t shirt taking a pleasant stroll in Mumbai India during a winter storm".)
Figure 6: Visual comparison of Future Forcing and baselines for 60-second long-video generation.
Figure 7: Query distribution visualizations on LongLive across two representative query dimensions.
Figure 8: Query distribution visualizations on Reward-Forcing across two representative query
Figure 9: Query distribution visualizations on Rolling-Forcing across two representative query
Figure 10: Additional ablation results. … Reward-Forcing [Lu et al., 2025], and Rolling-Forcing [Liu et al., 2025a]. For each model, we visualize query distributions at two representative query dimensions, as shown in …
Figure 11: Additional efficiency and memory consumption analysis under different AR video
Figure 12: Contribution of the custom Triton kernel to inference efficiency.
Figure 13: Additional visualization results for Causal-Forcing in 30-second video generation.
Figure 14: Additional visualization results for Reward-Forcing in 30-second video generation.
Figure 15: Additional visualization results for Self-Forcing in 30-second video generation.
Figure 16: Additional visualization results for Reward-Forcing in 60-second video generation.
Figure 17: Additional visualization results for Self-Forcing in 60-second video generation.
Figure 18: Pre-RoPE and RoPE-modulated query distributions under
Original abstract

Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to avoid redundant recomputation across generation steps. Nevertheless, its growth with generation length introduces increasing memory and error accumulation, limiting the scalability of AR models to even longer sequences. Existing KV cache compression methods mitigate this issue by selectively retaining only video tokens deemed important. However, most existing methods assess token importance using short-horizon signals derived from the current or historical generation context, making these methods prone to overlooking tokens that appear unimportant at early steps but later become critical for future frames. In this work, we identify an important property of trained AR video models: although RoPE-modulated queries evolve across autoregressive steps, the underlying canonical pre-RoPE query distribution remains remarkably stable throughout the video generation process. This approximate stationarity implies that future query distributions are estimable from historical statistics, enabling principled future-aware cache decisions without any additional training. Building on this insight, we propose Future Forcing, a training-free future-aware KV cache policy for AR video generation. Specifically, Future Forcing first constructs a future query proxy from historical statistics, then scores KV cache tokens by their importance under this proxy, and finally merges redundant token pairs within the affine subspace induced by the future query. Extensive experiments show that Future Forcing improves long-horizon consistency under limited KV caches, achieving up to 1.49 improvement in subject consistency on VBench-Long for 60s generation over existing AR video KV cache policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies an approximate stationarity property in the canonical (pre-RoPE) query distribution of trained autoregressive video models, despite the evolution of RoPE-modulated queries across generation steps. This property is used to construct a future query proxy from historical statistics, enabling a training-free KV cache policy (Future Forcing) that scores token importance under the proxy and merges redundant pairs in the induced affine subspace. Experiments on VBench-Long report improvements in long-horizon consistency (up to 1.49 in subject consistency for 60s generation) over prior AR video KV cache policies.

Significance. If the stationarity holds across models and the proxy-derived importance scores correlate with those from true future queries, the method offers a practical way to improve memory efficiency and reduce error accumulation in long AR video synthesis without retraining. The training-free aspect and use of an empirical distributional property are notable strengths if the proxy is shown to preserve relevant attention patterns.

major comments (2)
  1. [Abstract] The central claim requires that the historical proxy not only matches marginal statistics but also produces token-importance rankings close to those obtained from actual future queries (i.e., that inner products or attention patterns with cached keys are preserved). The abstract presents stationarity as an empirical observation enabling the proxy but provides no quantitative evidence (e.g., correlation coefficients or ranking agreement on long sequences) that marginal stability suffices for this ranking task.
  2. [Abstract (method description)] Without reported ablations or verification on the proxy construction (e.g., how historical statistics are aggregated into the future query proxy and whether the affine-subspace merge preserves the necessary geometry), it is unclear whether the reported consistency gains are attributable to the future-aware component or to other implementation choices.
minor comments (1)
  1. [Abstract] The abstract states 'extensive experiments' but does not reference specific tables, figures, or metrics beyond the single 1.49 subject-consistency number; adding cross-model and cross-length results would strengthen the stationarity claim.
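
The ranking-agreement evidence requested in major comment 1 could be computed along the following lines. This is a sketch under stated assumptions: Spearman correlation and top-k overlap are chosen here as example metrics, ties are not handled, and the two score lists are toy values for illustration, not results from the paper. The function names are hypothetical.

```python
def ranks(values):
    """Ordinal ranks (0 = smallest). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r

def spearman(a, b):
    """Spearman rank correlation via the no-ties closed form."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def topk_overlap(a, b, k):
    """Fraction of the top-k tokens under `a` that are also top-k under `b`."""
    def top(v):
        return set(sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k])
    return len(top(a) & top(b)) / k

# Toy importance scores for five cached tokens (illustrative only).
proxy_scores = [0.40, 0.05, 0.10, 0.30, 0.15]    # scored with the historical proxy
future_scores = [0.35, 0.02, 0.20, 0.33, 0.10]   # scored with true future queries

rho = spearman(proxy_scores, future_scores)
overlap = topk_overlap(proxy_scores, future_scores, k=2)
```

Top-k overlap is arguably the more relevant of the two for a cache policy, since only the tokens near the retention boundary change what is kept.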

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of major revision. We address the two major comments point by point below and will revise the manuscript to strengthen the presentation of evidence for the proxy.

Point-by-point responses
  1. Referee: [Abstract] The central claim requires that the historical proxy not only matches marginal statistics but also produces token-importance rankings close to those obtained from actual future queries (i.e., that inner products or attention patterns with cached keys are preserved). The abstract presents stationarity as an empirical observation enabling the proxy but provides no quantitative evidence (e.g., correlation coefficients or ranking agreement on long sequences) that marginal stability suffices for this ranking task.

    Authors: We agree that the abstract does not contain quantitative metrics such as correlation coefficients or ranking agreement to link the observed stationarity directly to preserved token-importance rankings. The manuscript reports end-to-end consistency gains but does not include these specific proxy-validation statistics in the abstract or main text. In the revision we will add a concise reference in the abstract and a new paragraph (with correlation and rank-agreement numbers computed on held-out long sequences) in Section 3 to demonstrate that the proxy rankings align with those from true future queries. revision: yes

  2. Referee: [Abstract (method description)] Without reported ablations or verification on the proxy construction (e.g., how historical statistics are aggregated into the future query proxy and whether the affine-subspace merge preserves the necessary geometry), it is unclear whether the reported consistency gains are attributable to the future-aware component or to other implementation choices.

    Authors: We concur that the absence of ablations on proxy aggregation and the affine-subspace merge leaves the source of the gains ambiguous. The current manuscript presents only the final policy and overall results. We will add a dedicated ablation subsection in the experiments that varies the aggregation window, the choice of historical moments, and the merge geometry, thereby isolating the contribution of the future-aware proxy. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper presents the stationarity of the canonical pre-RoPE query distribution as an empirical observation identified in trained AR video models, which then motivates the construction of a future query proxy for the KV cache policy. This observation is treated as an external property rather than derived from or reduced to the proposed Future Forcing method itself. The subsequent steps (proxy construction, token scoring, and merging) follow from this observation without any self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The derivation chain remains self-contained against external benchmarks, with the stationarity serving as an independent premise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption of query stationarity; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: The canonical pre-RoPE query distribution remains approximately stationary during autoregressive video generation.
    This stationarity is the load-bearing observation that allows historical statistics to serve as a future proxy.

Reference graph

Works this paper leans on

62 extracted references · 39 canonical work pages · 16 internal anchors

  1. [1]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

  2. [2]

    Genie: Generative interactive environments

    Jake Bruce et al. Genie: Generative interactive environments. In ICML, 2024

  3. [3]

    Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. COLM, 2025

  4. [4]

    Past-and future-informed kv cache policy with salience estimation in autoregressive video diffusion

    Hanmo Chen, Chenghao Xu, Xu Yang, Xuan Chen, and Cheng Deng. Past-and future-informed kv cache policy with salience estimation in autoregressive video diffusion. arXiv preprint arXiv:2601.21896, 2026a

  5. [5]

    Context forcing: Consistent autoregressive video generation with long context

    Shuo Chen et al. Context forcing: Consistent autoregressive video generation with long context. arXiv preprint arXiv:2602.06028, 2026b

  6. [6]

    Hunyuanvideo-avatar: High-fidelity audio-driven human animation for multiple characters

    Yi Chen, Sen Liang, Zixiang Zhou, Ziyao Huang, Yifeng Ma, Junshu Tang, Qin Lin, Yuan Zhou, and Qinglin Lu. Hunyuanvideo-avatar: High-fidelity audio-driven human animation for multiple characters. arXiv preprint arXiv:2505.20156, 2025

  7. [7]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024

  8. [8]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022

  9. [9]

    Veo 3 technical report

    Google DeepMind. Veo 3 technical report. Technical report, Google DeepMind, 2024

  10. [10]

    Autoregressive Video Generation without Vector Quantization

    Haoge Deng et al. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169, 2024

  11. [11]

    Efficient autoregressive video diffusion with dummy head

    Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, and Yan Lu. Efficient autoregressive video diffusion with dummy head. arXiv preprint arXiv:2601.20499, 2026

  12. [12]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024

  13. [13]

    Cogvideo: Large-scale pretraining for text-to-video generation via transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. ICLR, 2023

  14. [14]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. NeurIPS, 2025

  15. [15]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. CVPR, 2024

  16. [16]

    Videoar: Autoregressive video generation via next-frame & scale prediction

    Longbin Ji et al. Videoar: Autoregressive video generation via next-frame & scale prediction. arXiv preprint arXiv:2601.05966, 2026

  17. [17]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin et al. Pyramidal flow matching for efficient video generative modeling. ICLR, 2025

  18. [18]

    Videopoet: A large language model for zero-shot video generation

    Dan Kondratyuk et al. Videopoet: A large language model for zero-shot video generation. ICML, 2024

  19. [19]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  20. [20]

    Train short, inference long: Training-free horizon extension for autoregressive video generation

    Jia Li et al. Train short, inference long: Training-free horizon extension for autoregressive video generation. arXiv preprint arXiv:2602.14027, 2026a

  21. [21]

    Packcache: A training-free acceleration method for unified autoregressive video generation via compact kv-cache

    Kunyang Li, Mubarak Shah, and Yuzhang Shang. Packcache: A training-free acceleration method for unified autoregressive video generation via compact kv-cache. arXiv preprint arXiv:2601.04359, 2026b

  22. [22]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025a

  23. [23]

    Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

    Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. NeurIPS, 2023

  24. [24]

    Vrope: Rotary position embedding for video large language models

    Zikang Liu, Longteng Guo, Yepeng Tang, Junxian Cai, Kai Ma, Xi Chen, and Jing Liu. Vrope: Rotary position embedding for video large language models. arXiv preprint arXiv:2502.11664, 2025b

  25. [25]

    Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

    Yunhong Lu et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678, 2025

  26. [26]

    Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    Guoqing Ma et al. Step-Video-T2V technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248, 2025a

  27. [27]

    Latte: Latent diffusion transformer for video generation

    Xin Ma et al. Latte: Latent diffusion transformer for video generation. TMLR, 2025b

  28. [28]

    Flow caching for autoregressive video generation

    Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, et al. Flow caching for autoregressive video generation. arXiv preprint arXiv:2602.10825, 2026

  29. [29]

    Packforcing: Short video training suffices for long video sampling and long context inference

    Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, and Kaipeng Zhang. Packforcing: Short video training suffices for long video sampling and long context inference. arXiv preprint arXiv:2603.25730, 2026

  30. [30]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

  31. [31]

    Kv cache quantization for self-forcing video generation: A 33-method empirical study

    Suraj Ranganath, Vaishak Menon, and Anish Patnaik. Kv cache quantization for self-forcing video generation: A 33-method empirical study. arXiv preprint arXiv:2603.27469, 2026

  32. [32]

    Fast autoregressive video diffusion and world models with temporal cache compression and sparse attention

    Dvir Samuel, Issar Tzachor, Matan Levy, Micahel Green, Gal Chechik, and Rami Ben-Ari. Fast autoregressive video diffusion and world models with temporal cache compression and sparse attention. arXiv preprint arXiv:2602.01801, 2026

  33. [33]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2021

  34. [34]

    Ar-diffusion: Asynchronous video generation with auto-regressive diffusion

    Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video generation with auto-regressive diffusion. arXiv preprint arXiv:2503.07418, 2025

  35. [35]

    Genmo Team. Mochi 1. https://github.com/genmoai/models, 2024

  36. [36]

    Kling-Omni Technical Report

    Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report. arXiv preprint arXiv:2512.16776, 2025

  37. [37]

    Phenaki: Variable length video generation from open domain textual descriptions

    Ruben Villegas et al. Phenaki: Variable length video generation from open domain textual descriptions. In ICLR, 2023

  38. [38]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  39. [39]

    Image quality assessment: From error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600-612, 2004

  40. [40]

    Videorope: What makes for good video rotary position embedding?

    Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. Videorope: What makes for good video rotary position embedding? ICML, 2025

  41. [41]

    HunyuanVideo 1.5 Technical Report

    Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870, 2025a

  42. [42]

    Pack and force your memory: Long-form and consistent video generation

    Xiaofei Wu et al. Pack and force your memory: Long-form and consistent video generation. arXiv preprint arXiv:2510.01784, 2025b

  43. [43]

    Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

    Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, Zhiying Xu, Jun Wu, Chenfeng Xu, Ion Stoica, Song Han, and Kurt Keutzer. Quant videogen: Auto-regressive long video generation via 2-bit kv-cache quantization. arXiv preprint arXiv:2602.02958, 2026

  44. [44]

    Progressive autoregressive video diffusion models

    Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, and Yang Zhou. Progressive autoregressive video diffusion models. arXiv preprint arXiv:2410.08151, 2024

  45. [45]

    Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

    Boxun Xu, Yuming Du, Zichang Liu, Siyu Yang, Ziyang Jiang, Siqi Yan, Rajasi Saha, Albert Pumarola, Wenchen Wang, and Peng Li. Sparse forcing: Native trainable sparse attention for real-time autoregressive diffusion video generation. arXiv preprint arXiv:2604.21221, 2026

  46. [46]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

  47. [47]

    LongLive: Real-time Interactive Long Video Generation

    Shuai Yang et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025a

  48. [48]

    Anchor forcing: Anchor memory and tri-region rope for interactive streaming video diffusion

    Yang Yang, Tianyi Zhang, Wei Huang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, and Peng-Tao Jiang. Anchor forcing: Anchor memory and tri-region rope for interactive streaming video diffusion. arXiv preprint arXiv:2603.13405, 2026

  49. [49]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. ICLR, 2025b

  50. [50]

    Deep forcing: Training-free long video generation with deep sink and participative compression

    Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025

  51. [51]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin et al. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025

  52. [52]

    Videomar: Autoregressive video generation with continuous tokens

    Hu Yu et al. Videomar: Autoregressive video generation with continuous tokens. arXiv preprint arXiv:2506.14168, 2025a

  53. [53]

    Videossm: Autoregressive long video generation with hybrid state-space memory

    Yifei Yu et al. Videossm: Autoregressive long video generation with hybrid state-space memory. arXiv preprint arXiv:2512.04519, 2025b

  54. [54]

    Helios: Real real-time long video generation model

    Shenghai Yuan et al. Helios: Real real-time long video generation model. arXiv preprint arXiv:2603.04379, 2026

  55. [55]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

  56. [56]

    Cam: Cache merging for memory-efficient llms inference

    Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. Cam: Cache merging for memory-efficient llms inference. In ICML, 2024

  57. [57]

    H2O: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. NeurIPS, 2023

  58. [58]

    Relax forcing: Relaxed kv-memory for consistent long video generation

    Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, and Ioannis Patras. Relax forcing: Relaxed kv-memory for consistent long video generation. arXiv preprint arXiv:2603.21366, 2026

  59. [59]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng et al. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

  60. [60]

    Taming teacher forcing for masked autoregressive video generation

    Deyu Zhou et al. Taming teacher forcing for masked autoregressive video generation. arXiv preprint arXiv:2501.12389, 2025

  61. [61]

    Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

    Hongzhou Zhu et al. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214, 2026

  62. [62]

    Hiar: Efficient autoregressive long video generation via hierarchical denoising

    Kai Zou et al. Hiar: Efficient autoregressive long video generation via hierarchical denoising. arXiv preprint arXiv:2603.08703, 2026