pith. machine review for the scientific record.

arxiv: 2605.09681 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords KV cache compression · autoregressive video diffusion · attention head specialization · hybrid pruning · efficient generation · Self Forcing · LongLive

The pith

Dividing attention heads into static and dynamic categories enables hybrid KV cache compression that cuts cache memory by 30% and accelerates autoregressive video diffusion by up to 2.82x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive video diffusion models generate frames sequentially but accumulate large KV caches from past frames, creating memory and speed bottlenecks that hinder real-time use. The paper establishes that attention heads show stable, distinct patterns: some focus on frame transitions and fidelity while others handle motion consistency. This allows a hybrid compression method called Forcing-KV that applies structured pruning to static heads and segment-wise similarity pruning to dynamic heads. If correct, the approach maintains generation quality while delivering over 29 frames per second on one H200 GPU and substantial speedups at both 480P and 1080P resolutions. A sympathetic reader would care because it directly addresses the scalability barrier for long-horizon, streaming video synthesis.
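
To make the bottleneck concrete, a back-of-the-envelope accounting helps (our illustration, assuming a standard transformer KV layout; none of these symbols come from the paper). A cache holding $F$ frames of $T$ tokens each, across $L$ layers with $H$ heads of dimension $d$, stored at $b$ bytes per element, occupies

$$M_{\mathrm{KV}} = 2 \cdot L \cdot F \cdot T \cdot H \cdot d \cdot b,$$

where the factor 2 covers keys and values. The cost grows linearly in the retained history $F$, so pruning cached frames or entire heads cuts it proportionally; that is the lever the reported 30% reduction pulls.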

Core claim

Observing that attention heads in mainstream AR diffusion models exhibit markedly distinct, stable attention patterns across samples and denoising steps, the authors divide heads into static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. Forcing-KV then applies structured static pruning to the former and segment-wise similarity-based dynamic pruning to the latter. The reported results are over 29 fps generation on a single NVIDIA H200 GPU with a 30% cache memory reduction, 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P, and a 2.82x speedup at 1080P, all while preserving output quality.

What carries the argument

The hybrid KV cache compression strategy that classifies heads by functional specialization and applies tailored pruning: structured static pruning for static heads and segment-wise similarity pruning for dynamic heads.
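
A minimal sketch of how such a dispatch could be wired up, assuming per-head labels are computed offline; the function names, the keep-recent rule, and the 0.9 threshold are our own illustration, since the paper's exact structural pattern and similarity criterion are not reproduced on this page:

```python
import torch

def compress_kv(kv_cache, head_labels, keep_recent=2, sim_threshold=0.9):
    """Hybrid KV cache compression sketch (illustrative, not the paper's code).

    kv_cache: dict mapping head index -> (K, V), each of shape [frames, tokens, dim]
    head_labels: dict mapping head index -> "static" or "dynamic"
    """
    out = {}
    for h, (K, V) in kv_cache.items():
        if head_labels[h] == "static":
            # Structured static pruning: a fixed rule standing in for the
            # structural pattern the paper exploits; here, keep only the
            # most recent frames carrying the chunk transition.
            keep = list(range(max(0, K.shape[0] - keep_recent), K.shape[0]))
        else:
            # Dynamic pruning: drop a frame whenever its averaged key state
            # is nearly identical to its predecessor's (segment-wise
            # similarity, with one-frame segments for simplicity).
            means = K.mean(dim=1)  # [frames, dim]
            sim = torch.cosine_similarity(means[1:], means[:-1], dim=-1)
            keep = [0] + [i + 1 for i in range(sim.shape[0]) if sim[i] < sim_threshold]
        out[h] = (K[keep], V[keep])
    return out
```

The appeal of the split is that the static branch costs nothing at inference time (its mask is fixed offline), while the dynamic branch pays only a cheap similarity test to adapt to content.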

If this is right

  • Generation reaches over 29 frames per second on a single H200 GPU while cutting KV cache memory by 30%.
  • Speedups of 1.35x and 1.50x are realized on LongLive and Self Forcing at 480P resolution.
  • The speedup scales to 2.82x at 1080P resolution with no reported quality loss.
  • The method preserves output quality across the tested autoregressive video diffusion setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same head-specialization principle could be tested on non-video autoregressive diffusion models to see whether similar memory savings appear.
  • If the classification holds across training runs, it might allow pre-computed pruning masks that further reduce runtime overhead.
  • Extending the dynamic pruning to longer context windows could support even higher-resolution or longer video sequences on fixed hardware.

Load-bearing premise

Attention heads in mainstream AR diffusion models have markedly distinct patterns and roles that stay stable across samples and denoising steps, allowing reliable division into static and dynamic categories.
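
One direct way to probe this premise (our sketch with assumed tensor shapes, not the paper's profiling procedure) is the mean pairwise similarity of each head's attention map across samples and denoising steps:

```python
import torch

def head_pattern_stability(attn):
    """attn: [samples, steps, heads, queries, keys] attention weights.

    Returns each head's mean pairwise cosine similarity across all
    (sample, step) pairs; values near 1 for every head would support
    the stability premise, widely varying values would undercut it.
    """
    s, t, h, q, k = attn.shape
    flat = attn.permute(2, 0, 1, 3, 4).reshape(h, s * t, q * k)
    flat = torch.nn.functional.normalize(flat, dim=-1)
    gram = flat @ flat.transpose(1, 2)  # [heads, s*t, s*t] of cosines
    n = s * t
    # Average the off-diagonal entries; the diagonal is trivially 1.
    return (gram.sum(dim=(1, 2)) - n) / (n * (n - 1))
```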

What would settle it

Running the hybrid pruning on multiple AR video models at different resolutions and observing consistent drops in motion consistency or visual fidelity would falsify the claim that the head classification supports lossless compression.
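
Spelled out as a harness (`generate` and `evaluate` are hypothetical callables supplied by the experimenter; the metric names are placeholders for whatever motion-consistency and fidelity scores the evaluation adopts):

```python
def falsification_sweep(models, resolutions, generate, evaluate):
    """Sketch of the settling experiment. Assumes generate(model, res, prune)
    renders clips and evaluate(clips) returns a dict with
    'motion_consistency' and 'fidelity' scores (higher is better)."""
    deltas = []
    for model in models:
        for res in resolutions:
            base = evaluate(generate(model, res, prune=False))
            pruned = evaluate(generate(model, res, prune=True))
            deltas.append({k: pruned[k] - base[k] for k in base})
    # Consistent degradation on every model/resolution pair would falsify
    # the claim that the head classification supports lossless compression.
    return all(d["motion_consistency"] < 0 and d["fidelity"] < 0 for d in deltas)
```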

Figures

Figures reproduced from arXiv: 2605.09681 by Huan Li, Jun Zhang, Qin Yang, Shuiyang Mao, Wei Liu, Wenhan Luo, XiTai Jin, Yicheng Ji, Ying Qin, Zhizhou Zhong.

Figure 1
Figure 1: Overview of FORCING-KV. We apply static structural pruning and dynamic similarity pruning to different heads, accelerating inference and reducing cache memory while improving quality. view at source ↗
Figure 2
Figure 2: Attention head patterns in AR video diffusion models. Static heads focus on intra-frame … view at source ↗
Figure 3
Figure 3: Left: (a-c) Gradually masking contextual information for dynamic heads leads to a progressive decline in dynamic degree and consistency, while masking the transition frame for static heads causes a sharp rise in chunk discontinuity, revealing different functional emphases. (d) The cosine similarity of key states of adjacent frames across different autoregressive steps and different frame segments. Right: (… view at source ↗
Figure 4
Figure 4: Overview of FORCING-KV. We perform offline head profiling to classify attention heads into Static and Dynamic. During inference, static heads are pruned leveraging the structural pattern, while dynamic heads are pruned adaptively based on segment-wise similarity of adjacent frames. For simplicity, we use one frame per chunk as an example. view at source ↗
Figure 5
Figure 5: User study. view at source ↗
Figure 6
Figure 6: Scaling FORCING-KV on Self Forcing with attention window size and resolution. view at source ↗
Figure 7
Figure 7: Dynamic score comparison of random token pruning, uniform token pruning, and our proposed similarity pruning. view at source ↗
Figure 8
Figure 8: Case study of optical flow difference variations for a 30-second (~480 frames) video with … view at source ↗
Figure 9
Figure 9: Attention patterns of Wan2.1 [10], SkyReels-V2 [5], LongLive [2], and Self Forcing [1]. view at source ↗
Figure 10
Figure 10: Protocol and screenshot of the user study. view at source ↗
Figure 11
Figure 11: Visualization of VBench [46] scores. We compare FORCING-KV with its base model. FORCING-KV achieves near-lossless performance in terms of total score, and further outperforms the baseline on metrics such as dynamic degree, overall consistency, and appearance style, demonstrating its advantage. view at source ↗
Figure 12
Figure 12: Head distribution across layers. view at source ↗
Figure 13
Figure 13: Variation in the proportion of total runtime occupied by self-attention. view at source ↗
Figure 14
Figure 14: Quality example of 60-second interactive video on LongLive. view at source ↗
Figure 15
Figure 15: Quality examples on LongLive 5s and 30s. view at source ↗
Figure 16
Figure 16: Quality examples on Self Forcing 5s and 30s. view at source ↗
read the original abstract

Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution, and further scaling to 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Forcing-KV, a hybrid KV cache compression method for autoregressive video diffusion models. It reports an empirical observation that attention heads exhibit markedly distinct patterns and functional roles that remain stable across samples and denoising steps, allowing division into static heads (focused on transitions and intra-frame fidelity) and dynamic heads (governing inter-frame motion and consistency). Structured static pruning is applied to the former and segment-wise similarity-based dynamic pruning to the latter, yielding reported results of over 29 FPS on a single H200 GPU, 30% cache memory reduction, speedups of 1.35x/1.50x at 480P and 2.82x at 1080P, while maintaining output quality. Code and demos are provided.

Significance. If the stability of the head categorization and quality preservation hold under quantitative scrutiny, the work could meaningfully advance scalable real-time long video generation in AR diffusion models by mitigating KV cache memory and compute bottlenecks, with direct applicability to models like LongLive and Self Forcing.

major comments (3)
  1. [§3] §3 (Empirical Study of Head-wise Functional Specialization): The central assumption that attention head patterns 'remain stable across samples and denoising steps' enabling reliable static/dynamic division is load-bearing for the hybrid strategy, yet the manuscript provides only qualitative observations without quantitative support such as cross-sample consistency scores, categorization variance statistics, or timestep-robustness metrics.
  2. [§4] §4 (Forcing-KV Hybrid Compression): The exact criteria, thresholds, and similarity metric (e.g., no explicit equation for segment-wise similarity or pruning ratio selection) used to categorize heads and apply pruning are insufficiently formalized, making it impossible to verify how the reported 30% cache reduction and speedups are achieved without under-compression or quality loss.
  3. [§5] §5 (Experiments): The claim of 'maintaining output quality' is not supported by specific quantitative metrics (e.g., no reported FVD, FID, or perceptual scores), ablation tables on pruning ratios per head category, or failure-case analysis, leaving the tradeoff between compression and fidelity unverifiable despite the speed/memory numbers.
minor comments (2)
  1. [Abstract] Abstract: The speedup figures (1.35x on LongLive, 1.50x on Self Forcing) would benefit from explicit baseline model versions and resolution settings to allow direct comparison.
  2. [§4] Notation: The distinction between 'static pruning' and 'dynamic pruning' could be clarified with a small table summarizing the two strategies side-by-side.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the quantitative foundations, formalization, and experimental validation of Forcing-KV.

read point-by-point responses
  1. Referee: [§3] §3 (Empirical Study of Head-wise Functional Specialization): The central assumption that attention head patterns 'remain stable across samples and denoising steps' enabling reliable static/dynamic division is load-bearing for the hybrid strategy, yet the manuscript provides only qualitative observations without quantitative support such as cross-sample consistency scores, categorization variance statistics, or timestep-robustness metrics.

    Authors: We agree that the stability claim in §3 would benefit from quantitative backing. In the revised manuscript we will add (i) cross-sample consistency scores measuring the fraction of heads that receive the same static/dynamic label across 50+ diverse inputs, (ii) categorization variance statistics (mean and std of label flips), and (iii) timestep-robustness metrics that track label stability over the full denoising trajectory. These metrics will be reported in a new table and will directly support the hybrid pruning design. revision: yes

  2. Referee: [§4] §4 (Forcing-KV Hybrid Compression): The exact criteria, thresholds, and similarity metric (e.g., no explicit equation for segment-wise similarity or pruning ratio selection) used to categorize heads and apply pruning are insufficiently formalized, making it impossible to verify how the reported 30% cache reduction and speedups are achieved without under-compression or quality loss.

    Authors: We acknowledge that §4 lacks explicit equations. We will insert the precise mathematical definitions: the segment-wise similarity metric (cosine similarity between averaged KV features of consecutive segments), the head-classification threshold (derived from attention entropy and motion magnitude), and the per-category pruning-ratio selection rule. These additions will make the 30% memory reduction and reported speedups fully reproducible and verifiable. revision: yes

  3. Referee: [§5] §5 (Experiments): The claim of 'maintaining output quality' is not supported by specific quantitative metrics (e.g., no reported FVD, FID, or perceptual scores), ablation tables on pruning ratios per head category, or failure-case analysis, leaving the tradeoff between compression and fidelity unverifiable despite the speed/memory numbers.

    Authors: We accept that the quality-preservation claim requires stronger quantitative evidence. In the revision we will report Fréchet Video Distance (FVD) and Fréchet Inception Distance (FID) on standard benchmarks, add ablation tables that vary pruning ratios independently for static and dynamic heads, and include a failure-case analysis section that discusses edge cases where quality degrades. These changes will make the speed–quality trade-off transparent. revision: yes
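
Two of the commitments above can be made concrete. For the first response, the promised cross-sample statistics reduce to simple label bookkeeping (a sketch under our own assumptions; the 50+ inputs and the static/dynamic labeling rule are the rebuttal's, the code is not):

```python
import numpy as np

def label_consistency(labels):
    """labels: [samples, heads] array of 0 (static) / 1 (dynamic),
    assigned independently for each input sample.

    Returns the fraction of heads labeled unanimously across samples,
    plus the mean and std of per-head disagreement -- the kind of
    numbers the promised stability table would need to report.
    """
    labels = np.asarray(labels)
    unanimous = np.mean(labels.min(axis=0) == labels.max(axis=0))
    disagreement = np.abs(labels - labels.mean(axis=0)).mean(axis=0)  # per head
    return unanimous, disagreement.mean(), disagreement.std()
```

For the second response, the promised metric, read literally from the rebuttal's wording, could be written as (segment sets $S_i$, averaged KV features $\bar{z}_i$, and threshold $\tau$ are our assumed notation):

$$\mathrm{sim}(i) = \cos\!\left(\bar{z}_i, \bar{z}_{i+1}\right), \qquad \bar{z}_i = \frac{1}{|S_i|} \sum_{t \in S_i} [K_t; V_t],$$

with segment $i+1$'s cache eligible for pruning whenever $\mathrm{sim}(i) > \tau$.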

Circularity Check

0 steps flagged

No significant circularity; empirical observation and measured results remain independent

full rationale

The paper's core chain consists of an empirical observation of head-wise attention patterns, followed by a categorization into static/dynamic heads and a hybrid pruning strategy whose performance (FPS, cache reduction, speedups) is reported via direct measurement on benchmarks. No equations, fitted parameters, or self-citations are shown that would make the reported outcomes equivalent to the inputs by construction. The stability claim is presented as an observation supporting the method rather than a tautological redefinition, and the speed claims are external experimental outcomes rather than derived predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one key domain assumption about stable head specialization; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Attention heads exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps.
    This assumption underpins the division into static and dynamic heads and the choice of pruning strategies.

pith-pipeline@v0.9.0 · 5605 in / 1308 out tokens · 90009 ms · 2026-05-12T03:50:56.653537+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion,

    X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman, “Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion,” NeurIPS, 2025

  2. [2]

    LongLive: Real-time Interactive Long Video Generation,

    S. Yang, W. Huang, R. Chu, Y. Xiao, Y. Zhao, X. Wang, M. Li, E. Xie, Y. Chen, Y. Lu et al., “LongLive: Real-time Interactive Long Video Generation,” ICLR, 2026

  3. [3]

    From Slow Bidirectional to Fast Autoregressive Video Diffusion Models,

    T. Yin, Q. Zhang, R. Zhang, W. T. Freeman, F. Durand, E. Shechtman, and X. Huang, “From Slow Bidirectional to Fast Autoregressive Video Diffusion Models,” in 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2025, pp. 22963–22974

  4. [4]

    MAGI-1: Autoregressive Video Generation at Scale

    H. Teng, H. Jia, L. Sun, L. Li, M. Li, M. Tang, S. Han, T. Zhang, W. Zhang, W. Luo et al., “Magi-1: Autoregressive video generation at scale,” arXiv preprint arXiv:2505.13211, 2025

  5. [5]

    SkyReels-V2: Infinite-length Film Generative Model

    G. Chen, D. Lin, J. Yang, C. Lin, J. Zhu, M. Fan, H. Zhang, S. Chen, Z. Chen, C. Ma et al., “Skyreels-v2: Infinite-length film generative model,” arXiv preprint arXiv:2504.13074, 2025

  6. [6]

    Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models,

    L. Zhang, S. Cai, M. Li, G. Wetzstein, and M. Agrawala, “Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  7. [7]

    Scalable Diffusion Models with Transformers,

    W. Peebles and S. Xie, “Scalable Diffusion Models with Transformers,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 2023, pp. 4172–4182

  8. [8]

    Open-Sora Plan: Open-Source Large Video Generation Model,

    B. Lin, Y. Ge, X. Cheng, Z. Li, B. Zhu, S. Wang, X. He, Y. Ye, S. Yuan, L. Chen, T. Jia, J. Zhang, Z. Tang, Y. Pang, B. She, C. Yan, Z. Hu, X. Dong, L. Chen, Z. Pan, X. Zhou, S. Dong, Y. Tian, and L. Yuan, “Open-Sora Plan: Open-Source Large Video Generation Model,” 2024

  9. [9]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models,

    W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y. Long, A. Wang, A. Wang, C. Li, D. Huang, F. Yang, H. Tan, H. Wang, J. Song, J. Bai, J. Wu, J. Xue, J. Wang, K. Wang, M. Liu, P. Li, S. Li, W. Wang, W. Yu, X. Deng, Y. Li, Y. Chen, Y. Cui, Y. Peng, Z. Yu, Z. He, Z. Xu, Z. Zhou, Z. Xu, Y. ...

  10. [10]

    Wan: Open and Advanced Large-Scale Video Generative Models

    T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025

  11. [11]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation,

    J. Cui, J. Wu, M. Li, T. Yang, X. Li, R. Wang, A. Bai, Y. Ban, and C.-J. Hsieh, “Self-Forcing++: Towards Minute-Scale High-Quality Video Generation,” ICLR, 2026

  12. [12]

    Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation,

    Y. Lu, Y. Zeng, H. Li, H. Ouyang, Q. Wang, K. L. Cheng, J. Zhu, H. Cao, Z. Zhang, X. Zhu et al., “Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation,” CVPR, 2026

  13. [13]

    Rolling forcing: Autoregressive long video diffusion in real time,

    K. Liu, W. Hu, J. Xu, Y. Shan, and S. Lu, “Rolling forcing: Autoregressive long video diffusion in real time,” ICLR, 2026

  14. [14]

    Deep forcing: Training-free long video generation with deep sink and participative compression,

    J. Yi, W. Jang, P. H. Cho, J. Nam, H. Yoon, and S. Kim, “Deep forcing: Training-free long video generation with deep sink and participative compression,” arXiv preprint arXiv:2512.05081, 2025

  15. [15]

    Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout,

    H. Yesiltepe, T. H. S. Meral, A. K. Akan, K. Oktay, and P. Yanardag, “Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout,” CVPR, 2025

  16. [16]

    Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation,

    S. Xiao, X. Zhang, D. Meng, Q. Wang, P. Zhang, and B. Zhang, “Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation,” arXiv preprint arXiv:2512.21734, 2025

  17. [17]

    Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

    Y. Huang, H. Guo, F. Wu, S. Zhang, S. Huang, Q. Gan, L. Liu, S. Zhao, E. Chen, J. Liu et al., “Live avatar: Streaming real-time audio-driven avatar generation with infinite length,” arXiv preprint arXiv:2512.04677, 2025

  18. [18]

    Stable Video Infinity: Infinite-Length Video Generation with Error Recycling,

    W. Li, W. Pan, P.-C. Luan, Y. Gao, and A. Alahi, “Stable Video Infinity: Infinite-Length Video Generation with Error Recycling,” in International Conference on Learning Representations, 2026

  19. [19]

    LoL: Longer than Longer, Scaling Video Generation to Hour,

    J. Cui, J. Wu, M. Li, T. Yang, X. Li, R. Wang, A. Bai, Y. Ban, and C.-J. Hsieh, “LoL: Longer than Longer, Scaling Video Generation to Hour,” arXiv preprint arXiv:2601.16914, 2026

  20. [20]

    Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion,

    H. Chen, C. Xu, X. Yang, X. Chen, and C. Deng, “Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion,” arXiv preprint arXiv:2601.21896, 2026

  21. [21]

    Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention,

    C. Lv, Y. Shi, Y. Huang, R. Gong, S. Ren, and W. Wang, “Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention,” arXiv preprint arXiv:2602.04789, 2026

  22. [22]

    MonarchRT: Efficient Attention for Real-Time Video Generation,

    K. Agarwal, Z. Chen, C. Luo, Y. Chen, H. Zheng, X. Huang, A. Rudra, and B. Chen, “MonarchRT: Efficient Attention for Real-Time Video Generation,” arXiv preprint arXiv:2602.12271, 2026

  23. [23]

    Flow Caching for Autoregressive Video Generation,

    Y. Ma, X. Zheng, J. Xu, X. Xu, F. Ling, X. Zheng, H. Kuang, H. Li, X. Wang, X. Xiao et al., “Flow Caching for Autoregressive Video Generation,” in The Fourteenth International Conference on Learning Representations, 2026

  24. [24]

    Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention,

    D. Samuel, I. Tzachor, M. Levy, M. Green, G. Chechik, and R. Ben-Ari, “Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention,” arXiv preprint arXiv:2602.01801, 2026

  25. [25]

    Efficient Autoregressive Video Diffusion with Dummy Head,

    H. Guo, Z. Jia, J. Li, B. Li, Y. Cai, J. Wang, Y. Li, and Y. Lu, “Efficient Autoregressive Video Diffusion with Dummy Head,” arXiv preprint arXiv:2601.20499, 2026

  26. [26]

    From Slow Bidirectional to Fast Autoregressive Video Diffusion Models,

    T. Yin, Q. Zhang, R. Zhang, W. T. Freeman, F. Durand, E. Shechtman, and X. Huang, “From Slow Bidirectional to Fast Autoregressive Video Diffusion Models,” in CVPR, 2025

  27. [27]

    Krea Realtime 14B: Real-time Video Generation,

    E. Millon, “Krea Realtime 14B: Real-time Video Generation,” 2025

  28. [28]

    Pathwise Test-Time Correction for Autoregressive Long Video Generation,

    X. Xiang, Z. Duan, G. Zhang, H. Zhang, Z. Gao, J. Wu, S. Zhang, T. Wang, Q. Fan, and C. Guo, “Pathwise Test-Time Correction for Autoregressive Long Video Generation,” arXiv preprint arXiv:2602.05871, 2026

  29. [29]

    Mode Seeking meets Mean Seeking for Fast Long Video Generation,

    S. Cai, W. Nie, C. Liu, J. Berner, L. Zhang, N. Ma, H. Chen, M. Agrawala, L. Guibas, G. Wetzstein et al., “Mode Seeking meets Mean Seeking for Fast Long Video Generation,” arXiv preprint arXiv:2602.24289, 2026

  30. [30]

    LongCat-Video Technical Report

    M. L. Team, X. Cai, Q. Huang, Z. Kang, H. Li, S. Liang, L. Ma, S. Ren, X. Wei, R. Xie, and T. Zhang, “LongCat-Video Technical Report,” arXiv preprint arXiv:2510.22200, 2025

  31. [31]

    Helios: Real Real-Time Long Video Generation Model,

    S. Yuan, Y. Yin, Z. Li, X. Huang, X. Yang, and L. Yuan, “Helios: Real Real-Time Long Video Generation Model,” arXiv preprint arXiv:2603.04379, 2026

  32. [32]

    Sparse Video-Gen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity,

    H. Xi, S. Yang, Y. Zhao, C. Xu, M. Li, X. Li, Y. Lin, H. Cai, J. Zhang, D. Li et al., “Sparse Video-Gen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity,” in Forty-second International Conference on Machine Learning, 2025

  33. [33]

    Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation,

    S. Yang, H. Xi, Y. Zhao, M. Li, J. Zhang, H. Cai, Y. Lin, X. Li, C. Xu, K. Peng et al., “Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  34. [34]

    Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

    B. Xu, Y. Du, Z. Liu, S. Yang, Z. Jiang, S. Yan, R. Saha, A. Pumarola, W. Wang, and P. Li, “Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation,” arXiv preprint arXiv:2604.21221, 2026

  35. [35]

    Sana-video: Efficient video generation with block linear diffusion transformer,

    J. Chen, Y. Zhao, J. Yu, R. Chu, J. Chen, S. Yang, X. Wang, Y. Pan, D. Zhou, H. Ling et al., “Sana-video: Efficient video generation with block linear diffusion transformer,” ICLR, 2026

  36. [36]

    Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization,

    J. Zhang, H. Huang, P. Zhang, J. Wei, J. Zhu, and J. Chen, “Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization,” in International Conference on Machine Learning (ICML), 2025

  37. [37]

    Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model,

    F. Liu, S. Zhang, X. Wang, Y. Wei, H. Qiu, Y. Zhao, Y. Zhang, Q. Ye, and F. Wan, “Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model,” in 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2025, pp. 7353–7363

  38. [38]

    Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing,

    K. Gao, J. Shi, H. Zhang, C. Wang, J. Xiao, and L. Chen, “Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing,” in International Conference on Machine Learning, PMLR, 2025, pp. 18550–18565

  39. [39]

    KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study,

    S. Ranganath, V. Menon, and A. Patnaik, “KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study,” 2026

  40. [40]

    Efficient Streaming Language Models with Attention Sinks,

    G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient Streaming Language Models with Attention Sinks,” in The Twelfth International Conference on Learning Representations, 2023

  41. [41]

    Atlas", and C. Beidi, “H2o: Heavy-hitter oracle for efficient generative inference of large language models,

    Z. Zhang, Y . Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y . Tian, C. Ré, C. Barrett, W. Zhangyang, "Atlas", and C. Beidi, “H2o: Heavy-hitter oracle for efficient generative inference of large language models,”Advances in Neural Information Processing Systems, vol. 36, pp. 34 661–34 710, 2023

  42. [42]

    DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads,

    G. Xiao, J. Tang, J. Zuo, S. Yang, H. Tang, Y. Fu, S. Han et al., “DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads,” in The Thirteenth International Conference on Learning Representations, 2024

  43. [43]

    HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

    B. Zeng, F. Ren, J. Zhang, X. Gu, K. Chen, L. Shou, and H. Li, “HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference,” arXiv preprint arXiv:2604.05887, 2026

  44. [44]

    Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression,

    K. Li, Z. Chen, C.-Y. Yang, and J.-N. Hwang, “Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  45. [45]

    Head-aware KV cache compression for efficient visual autoregressive modeling,

    Z. Qin, Y. Lv, M. Lin, H. Guo, Z. Zhang, D. Zou, and W. Lin, “Head-aware KV cache compression for efficient visual autoregressive modeling,” arXiv preprint arXiv:2504.09261, 2025

  46. [46]

    VBench: Comprehensive benchmark suite for video generative models,

    Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, Y. Wang, X. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu, “VBench: Comprehensive benchmark suite for video generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  47. [47]

    Movie Gen: A Cast of Media Foundation Models

    A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y. Ma, C.-Y. Chuang et al., “Movie gen: A cast of media foundation models,” arXiv preprint arXiv:2410.13720, 2024

  48. [48]

    VBench++: Comprehensive and versatile benchmark suite for video generative models,

    Z. Huang, F. Zhang, X. Xu, Y. He, J. Yu, Z. Dong, Q. Ma, N. Chanpaisit, C. Si, Y. Jiang, Y. Wang, X. Chen, Y.-C. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu, “VBench++: Comprehensive and versatile benchmark suite for video generative models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  49. [49]

    Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation,

    H. Zhu, M. Zhao, G. He, H. Su, C. Li, and J. Zhu, “Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation,” arXiv preprint arXiv:2602.02214, 2026

  50. [50]

    RAFT: Recurrent All-Pairs Field Transforms for Optical Flow,

    Z. Teed and J. Deng, “RAFT: Recurrent All-Pairs Field Transforms for Optical Flow,” in European Conference on Computer Vision, 2020, pp. 402–419