arxiv: 2607.00712 · v1 · pith:OTNR6V3Inew · submitted 2026-07-01 · 💻 cs.CV · cs.MM

Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption

Pith reviewed 2026-07-02 14:10 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords autoregressive video generation · KV cache compression · local attention · parametric absorption · memory efficiency · streaming generative models · video generation

The pith

Instance-specific weight modulations let autoregressive video models switch layers to local attention and halve their KV cache with near-lossless quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive streaming models for long video generation face memory overload from linearly growing KV caches. Rather than dropping tokens and breaking dependencies, the method absorbs historical context directly into a subset of the model's weights. A brief warmup phase tracks output differences between full and local attention, after which a closed-form least-squares solve produces an instance-specific adjustment that lets layers transition to local attention. Experiments on models from 1.3B to 14B parameters show this removes up to half the cache while keeping visual quality intact. The work reframes cache compression as parametric distillation instead of token pruning.
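
To make the shift from full to local attention concrete, here is a minimal sketch of a frame-level attention mask. It assumes a sliding window over recent frames plus a few always-visible initial frames (the "sink" KV pairs mentioned in the Figure 3 caption). The name `local_attention_mask`, the parameters `window` and `n_sink`, and the frame-level granularity are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def local_attention_mask(n_frames, window, n_sink):
    """Boolean mask[q, k]: may query frame q attend to key frame k?

    Hypothetical layout: causal attention restricted to the last `window`
    frames, plus the first `n_sink` frames, which stay visible throughout.
    """
    q = np.arange(n_frames)[:, None]
    k = np.arange(n_frames)[None, :]
    causal = k <= q              # never look at future frames
    recent = (q - k) < window    # sliding local window
    sink = k < n_sink            # initial frames kept as attention sinks
    return causal & (recent | sink)

mask = local_attention_mask(n_frames=8, window=3, n_sink=1)
keys_per_query = mask.sum(axis=1)
```

Under these assumptions each query frame sees at most `window + n_sink` key frames, so the cache of a layer using this mask stops growing with video length.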

Core claim

During a warmup phase, the method monitors output discrepancies between global and local attention. At the transition point, a closed-form least-squares problem yields an instance-specific weight modulation that compensates for the missing historical context. Selected layers can then switch from full attention to local attention, removing up to 50% of the KV cache without significant temporal flickering or identity loss.

What carries the argument

Instance-Specific Parametric Absorption (ISPA), which uses a least-squares solution on warmup discrepancies to modulate weights and enable the shift to local attention.
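
The abstract does not state the least-squares objective, so the following is only a sketch under stated assumptions. It assumes the modulation ∆W (the Figure 1 caption mentions W + ∆W) is added to the linear projection that follows attention, and that it is fitted with a small ridge term so the normal equations stay well conditioned. The function `absorb_delta_w`, the choice of projection, and the `ridge` parameter are hypothetical.

```python
import numpy as np

def absorb_delta_w(a_loc, a_full, w_out, ridge=1e-6):
    """Closed-form solve of
        min_dW ||a_loc @ (w_out + dW) - a_full @ w_out||_F^2 + ridge * ||dW||_F^2

    a_loc, a_full: (n_tokens, d) local / full attention outputs from warmup.
    w_out: (d, d_out) projection applied after attention (assumed target).
    """
    target = (a_full - a_loc) @ w_out                   # what dW must absorb
    gram = a_loc.T @ a_loc + ridge * np.eye(a_loc.shape[1])
    return np.linalg.solve(gram, a_loc.T @ target)     # ridge normal equations

# Synthetic warmup data: the full output is an exact linear map of the local one.
rng = np.random.default_rng(0)
n, d = 256, 16
a_loc = rng.normal(size=(n, d))
a_full = a_loc @ (np.eye(d) + 0.1 * rng.normal(size=(d, d)))
w_out = rng.normal(size=(d, d))

delta_w = absorb_delta_w(a_loc, a_full, w_out)
err_before = np.linalg.norm(a_loc @ w_out - a_full @ w_out)
err_after = np.linalg.norm(a_loc @ (w_out + delta_w) - a_full @ w_out)
```

On this synthetic data the residual collapses because the gap is linear by construction. On real attention outputs the fit is only as good as the linear approximation, which is the premise the review questions.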

If this is right

  • KV cache size can be reduced by up to 50% on models ranging from 1.3B to 14B parameters.
  • Visual quality remains near-lossless in autoregressive video generation.
  • The approach applies across different autoregressive architectures for streaming video.
  • Cache management can shift from discarding tokens to distilling context into weights.
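
As a rough check on the "up to 50%" figure, the sketch below counts KV cache bytes when some layers keep only a local window plus sink frames. All numbers (layer count, frames, tokens per frame, heads) are made-up illustration values, and the assumption that the remaining full-attention layers keep every frame is ours, not the paper's.

```python
def kv_cache_bytes(n_layers, n_local_layers, n_frames, window_frames, sink_frames,
                   tokens_per_frame, n_heads, head_dim, bytes_per_elem=2):
    """KV cache size when `n_local_layers` of `n_layers` use local attention."""
    per_token = 2 * n_heads * head_dim * bytes_per_elem        # keys + values
    full_tokens = n_frames * tokens_per_frame
    local_tokens = min(n_frames, window_frames + sink_frames) * tokens_per_frame
    full_layers = n_layers - n_local_layers
    return per_token * (full_layers * full_tokens + n_local_layers * local_tokens)

# Hypothetical configuration; none of these values come from the paper.
cfg = dict(n_layers=30, n_frames=120, window_frames=6, sink_frames=1,
           tokens_per_frame=1560, n_heads=12, head_dim=128)
baseline = kv_cache_bytes(n_local_layers=0, **cfg)
ispa = kv_cache_bytes(n_local_layers=15, **cfg)
saving = 1 - ispa / baseline
```

With half the layers converted, the saving stays just below 50% because the converted layers still hold their window and sink frames. It approaches 50% as the video gets longer.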

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same absorption step could be tested on non-video long-context tasks such as audio or text generation.
  • Allowing the transition point to move dynamically during inference rather than at a fixed warmup end might further reduce average cache size.
  • Repeating the least-squares adjustment at multiple points in a very long sequence could handle cases where early compensation drifts.

Load-bearing premise

The closed-form least-squares weight adjustment derived from the warmup phase will continue to compensate for missing historical context throughout generation without introducing new temporal artifacts or identity drift.

What would settle it

Generate long videos with the method and check whether temporal flickering or identity changes appear after the warmup phase on varied content.
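
One way to run such a check is sketched below: compare each post-warmup frame with the warmup frames and report the first frame whose similarity falls below a threshold. The per-frame embeddings are assumed to come from some feature extractor chosen by the evaluator. The function `first_drift_frame` and the 0.9 threshold are hypothetical, and the data here is synthetic.

```python
import numpy as np

def first_drift_frame(frame_embeddings, t_warm, threshold=0.9):
    """First post-warmup frame whose cosine similarity to the mean warmup
    embedding drops below `threshold`; None if no frame does."""
    emb = np.asarray(frame_embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    ref = emb[:t_warm].mean(axis=0)
    ref = ref / np.linalg.norm(ref)
    sims = emb @ ref
    below = np.nonzero(sims[t_warm:] < threshold)[0]
    return None if below.size == 0 else int(t_warm + below[0])

t_warm = 12  # the Figure 5 caption reports T_warm >= 12 as sufficient
# Synthetic identity vectors: fixed for 20 frames, then rotating away slowly.
angles = np.concatenate([np.zeros(20), 0.05 * np.arange(1, 41)])
drifting = np.stack([np.cos(angles), np.sin(angles)], axis=1)
stable = np.tile([1.0, 0.0], (60, 1))

drift_at = first_drift_frame(drifting, t_warm)
stable_at = first_drift_frame(stable, t_warm)
```

A real evaluation would run this on embeddings from both the full-cache baseline and the compressed model, over varied content, so that drift caused by compression can be separated from drift the baseline also shows.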

Figures

Figures reproduced from arXiv: 2607.00712 by Hayden Kwok-Hay So, Jia Li, Jiao Dai, Jizhong Han, Xiangxiang Chu, Xiaomeng Fu, Yiming Hu, Yong Wang.

Figure 1: The overall framework of ISPA. ISPA consists of three phases: (1) Dual-Stream Warmup: During the first T_warm frames, F-Layers collect full attention outputs A_full and local outputs A_loc. (2) Layer Absorb: At frame T_warm, we select K layers and convert them to L-Layers (Local Layers). (3) Memory-Efficient Inference: The selected layers permanently evict historical KV caches and utilize W + ∆W to compensate f…
Figure 2: Left: Qualitative comparison of 30-second long-form video generation. Right: GPU memory usage comparison.
… outputs. The combination between parametric absorption (for identity) and sink tokens (for stability) allows ISPA to match the quality of full-cache inference while maintaining a constant memory footprint. The impact of the number of L-Layers K …
Figure 3: Ablation study on core components. Top: Our full method maintains identity consistency and visual sharpness. Middle: Removing parametric adaptation (No Absorb) leads to subject morphing over time. Bottom: Omitting initial KV pairs (No Sink) causes oversmoothness.
[Plot panels: Aesthetic Quality, Image Quality, Quality Score; x-axis: Time, 0s to 25s]
Figure 4: Ablation on KV cache eviction ratios over time.
Figure 5: Impact of warm-up duration (T_warm). We compare parametric adaptation using different initial frame counts. T_warm ≥ 12 effectively prevents information starvation, matching or exceeding the full-cache baseline’s long-term stability.
[Image labels: Input Frame; Input Speech (“Once upon a time, there exists a ……”) with phoneme track; rows LiveAvatar and ISPA]
Figure 6: Generalization to Speech-to-Video (S2V) model. We integrate ISPA into the LiveAvatar for streaming talking-head synthesis. Given a single reference image and a continuous speech input, ISPA (bottom row) maintains nearly identical visual quality and viseme accuracy compared to the full-cache baseline (top row).
… to-video model (LiveAvatar). Speech-to-video synthesis is a demanding streaming task where the m…
Figure 7: Compatibility with Post-Training Quantization (PTQ). (a) Visual comparison between the standard ISPA and its quantized version (ISPA+W8A8). (b) Overall latency breakdown. (c) Quantitative stability analysis based on generation quality.
… attention blocks (achieving 1.89× within the operator). By further integrating W8A8 quantization for the linear layers, we reach an aggregate speedup of 1.86× compared to…
Original abstract

Autoregressive (AR) streaming models have emerged as a powerful paradigm for long video generation. However, the linearly growing Key-Value (KV) cache poses a significant bottleneck, leading to memory overload and degraded inference throughput. A common compression method is to drop redundant KV tokens, which often breaks long-range dependencies, resulting in temporal flickering and identity loss. In this paper, we propose Instance-Specific Parametric Absorption (ISPA), a novel framework that shifts the KV cache compression from discarding to distilling. The core idea is to transit a subset of layers from Full-Attention (F-Layers) to memory-efficient Local-Attention (L-Layers) by "absorbing" historical context into the model's weights. Specifically, during a brief warmup phase, ISPA monitors the output discrepancy between global and local attention. At the transition point, we solve a closed-form least-squares problem to compute an instance-specific weight modulation that compensates for the missing history. Experiments across architectures (1.3B to 14B) demonstrate that ISPA can remove up to 50% of the KV cache with near-lossless visual quality. We hope this perspective encourages future work to explore parametric memory consolidation beyond external token-level cache management for streaming generative models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Instance-Specific Parametric Absorption (ISPA) to address KV cache growth in autoregressive video generation. During a brief warmup, output discrepancies between full-attention and local-attention layers are monitored; at transition, a closed-form least-squares problem yields an instance-specific weight modulation that allows selected layers to switch to local attention while absorbing historical context. Experiments on 1.3B–14B models report up to 50% KV cache reduction with near-lossless visual quality.

Significance. If the central claim holds, ISPA offers a parametric alternative to token-dropping KV compression that preserves long-range dependencies without external cache management. The closed-form least-squares computation and the scaling experiments across model sizes (1.3B to 14B) are strengths, and the results could be reproduced once the formulation is supplied.

major comments (2)
  1. [Abstract] Abstract and method description: the central claim that a single closed-form least-squares modulation computed from warmup-phase discrepancies suffices to compensate for missing KV history for the remainder of an arbitrarily long video lacks any derivation, stationarity bound, or post-transition validation. This directly undermines the 'near-lossless' guarantee when content, motion, or identity statistics evolve after the transition point.
  2. [Abstract] Abstract: no explicit formulation of the least-squares objective, the precise discrepancy metric, the variables being solved for, or how the resulting modulation is injected into the layer weights is provided. Without these, the reported results cannot be verified or reproduced, rendering the 'closed-form' and 'parameter-free' aspects unverifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater clarity and validation in the ISPA method. We address each point below and will revise the manuscript to improve reproducibility and strengthen the empirical support for the central claims.

  1. Referee: [Abstract] Abstract and method description: the central claim that a single closed-form least-squares modulation computed from warmup-phase discrepancies suffices to compensate for missing KV history for the remainder of an arbitrarily long video lacks any derivation, stationarity bound, or post-transition validation. This directly undermines the 'near-lossless' guarantee when content, motion, or identity statistics evolve after the transition point.

    Authors: We agree that the manuscript does not include a formal derivation or stationarity bound proving the modulation remains effective for arbitrarily long videos under evolving statistics. The provided experiments demonstrate near-lossless quality on sequences substantially longer than the warmup phase across multiple model scales, but these are empirical. We will add explicit post-transition validation on extended videos (e.g., 2x–4x the original lengths) with varying motion and identity changes, plus a short discussion of the implicit stationarity assumption during the brief warmup. A theoretical bound is beyond the current scope but could be explored in future work. revision: partial

  2. Referee: [Abstract] Abstract: no explicit formulation of the least-squares objective, the precise discrepancy metric, the variables being solved for, or how the resulting modulation is injected into the layer weights is provided. Without these, the reported results cannot be verified or reproduced, rendering the 'closed-form' and 'parameter-free' aspects unverifiable.

    Authors: The method section derives the closed-form solution by solving a least-squares problem that minimizes the L2 output discrepancy between full-attention and local-attention layers over the warmup tokens; the variables solved are an instance-specific diagonal scaling matrix applied multiplicatively to the query and key projection weights of the transitioned layer. We acknowledge the abstract omits these details and the method description could be more self-contained. We will revise the abstract to include a concise formulation statement and expand the method section with the exact objective, discrepancy metric, solved variables, and injection procedure to enable direct reproduction. revision: yes
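
The simulated rebuttal above describes a diagonal scaling applied to the query and key projections, which makes the attention output nonlinear in the solved variables. As a deliberately simplified illustration of why a diagonal parameterization admits a closed form, the sketch below instead scales each output channel independently. This is an assumption made for clarity and is not the formulation the rebuttal describes; `diagonal_scale` and `eps` are hypothetical names.

```python
import numpy as np

def diagonal_scale(a_loc, a_full, eps=1e-12):
    """Per-channel least squares: for each channel j, minimize
    ||s_j * a_loc[:, j] - a_full[:, j]||^2, giving
    s_j = <a_loc_j, a_full_j> / <a_loc_j, a_loc_j>.
    """
    num = np.sum(a_loc * a_full, axis=0)
    den = np.sum(a_loc * a_loc, axis=0) + eps   # eps guards all-zero channels
    return num / den

# Synthetic warmup data where the gap really is a per-channel rescaling.
rng = np.random.default_rng(0)
a_loc = rng.normal(size=(128, 8))
true_scale = np.linspace(0.5, 1.5, 8)
a_full = a_loc * true_scale

scale = diagonal_scale(a_loc, a_full)
```

Because the channels decouple, the solve needs only two reductions over the warmup tokens and no matrix inversion.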

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central mechanism computes an instance-specific weight modulation via closed-form least-squares on output discrepancies observed only during an explicit warmup phase, then applies the fixed modulation to enable local attention for the remainder of generation. This is an empirical proposal whose reported success (near-lossless quality after KV cache reduction) is measured on held-out video content rather than being identical to the warmup discrepancies by construction. No equations reduce the final quality metric to the fitted parameters, no self-citations are invoked as load-bearing uniqueness theorems, and the derivation chain contains no self-definitional steps or renamed known results. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is abstract-only; the least-squares compensation step implicitly assumes a linear relationship between attention outputs that may require instance-specific fitting, but no explicit free parameters or invented entities are stated.

axioms (1)
  • domain assumption: Output discrepancy measured during warmup is a sufficient signal to compute a weight modulation that fully compensates for history loss.
    Invoked at the transition point from F-Layers to L-Layers.

pith-pipeline@v0.9.1-grok · 5773 in / 1263 out tokens · 29663 ms · 2026-07-02T14:10:39.194865+00:00 · methodology

