Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption

Hayden Kwok-Hay So; Jia Li; Jiao Dai; Jizhong Han; Xiangxiang Chu; Xiaomeng Fu; Yiming Hu; Yong Wang

arxiv: 2607.00712 · v1 · pith:OTNR6V3Inew · submitted 2026-07-01 · 💻 cs.CV · cs.MM

Xiaomeng Fu, Jia Li, Yiming Hu, Yong Wang, Hayden Kwok-Hay So, Jiao Dai, Xiangxiang Chu, Jizhong Han

Pith reviewed 2026-07-02 14:10 UTC · model grok-4.3

classification 💻 cs.CV cs.MM

keywords: autoregressive video generation · KV cache compression · local attention · parametric absorption · memory efficiency · streaming generative models · video generation

The pith

Instance-specific weight modulations let autoregressive video models switch layers to local attention and halve their KV cache with near-lossless quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive streaming models for long video generation face memory overload from linearly growing KV caches. Rather than dropping tokens and breaking dependencies, the method absorbs historical context directly into a subset of the model's weights. A brief warmup phase tracks output differences between full and local attention, after which a closed-form least-squares solve produces an instance-specific adjustment that lets layers transition to local attention. Experiments on models from 1.3B to 14B parameters show this removes up to half the cache while keeping visual quality intact. The work reframes cache compression as parametric distillation instead of token pruning.

Core claim

During a warmup phase, output discrepancies between global and local attention are monitored; at the transition point a closed-form least-squares problem yields an instance-specific weight modulation that compensates for the missing historical context, allowing selected layers to switch from full attention to local attention and thereby removing up to 50% of the KV cache without significant temporal flickering or identity loss.

What carries the argument

Instance-Specific Parametric Absorption (ISPA), which uses a least-squares solution on warmup discrepancies to modulate weights and enable the shift to local attention.
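
To make the absorption step concrete, here is a minimal NumPy sketch of one way a closed-form least-squares solve could produce a weight update. It assumes the compensated weight is a linear projection `w` applied to the attention output, that the warmup yields paired matrices `a_loc` and `a_full`, and that a small ridge term keeps the solve well-conditioned. The function name `absorb_delta_w`, the ridge term, and the synthetic data are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def absorb_delta_w(a_loc, a_full, w, ridge=1e-6):
    """Solve min_dW ||a_loc @ (w + dW) - a_full @ w||_F^2 + ridge * ||dW||_F^2.

    a_loc, a_full: (n, d) attention outputs collected during warmup (assumed).
    w: (d, d_out) projection applied after attention (assumed).
    """
    d = a_loc.shape[1]
    gram = a_loc.T @ a_loc + ridge * np.eye(d)   # regularized normal matrix
    rhs = a_loc.T @ ((a_full - a_loc) @ w)       # residual the update must explain
    return np.linalg.solve(gram, rhs)

# Synthetic warmup data: the missing history acts as a linear correction.
rng = np.random.default_rng(0)
n, d = 256, 16
a_loc = rng.normal(size=(n, d))
mix = np.eye(d) + 0.1 * rng.normal(size=(d, d))
a_full = a_loc @ mix
w = rng.normal(size=(d, d))

dw = absorb_delta_w(a_loc, a_full, w)
err_before = np.linalg.norm(a_loc @ w - a_full @ w)
err_after = np.linalg.norm(a_loc @ (w + dw) - a_full @ w)
```

On this toy data the gap is linear by construction, so the update removes almost all of it. Real attention discrepancies need not be linear in the local output, which is exactly the premise the review flags as load-bearing.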

If this is right

KV cache size can be reduced by up to 50% on models ranging from 1.3B to 14B parameters.
Visual quality remains near-lossless in autoregressive video generation.
The approach applies across different autoregressive architectures for streaming video.
Cache management can shift from discarding tokens to distilling context into weights.
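
The "up to 50%" figure can be sanity-checked with simple counting. The sketch below counts cached token slots when some layers keep the full history and others keep only a recent window plus a few sink frames. Every number in it (30 layers, 120 frames, 1560 tokens per frame, a 6-frame window, 1 sink frame) is a hypothetical placeholder, not a configuration taken from the paper.

```python
def kv_cache_tokens(num_layers, num_local_layers, frames, tokens_per_frame,
                    window_frames, sink_frames=1):
    """Count cached token slots summed over layers.

    Full layers keep every frame; local layers keep only the most recent
    window plus the sink frames. Head count and the K/V factor are omitted
    because they scale both cases equally.
    """
    full_layers = num_layers - num_local_layers
    full = full_layers * frames * tokens_per_frame
    kept_frames = min(frames, window_frames + sink_frames)
    local = num_local_layers * kept_frames * tokens_per_frame
    return full + local

baseline = kv_cache_tokens(30, 0, frames=120, tokens_per_frame=1560, window_frames=6)
mixed = kv_cache_tokens(30, 15, frames=120, tokens_per_frame=1560, window_frames=6)
saving = 1 - mixed / baseline
```

With these made-up numbers the saving is roughly 47%, and it approaches the 50% ceiling as the frame count grows, because the local layers' share of the cache becomes negligible.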

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same absorption step could be tested on non-video long-context tasks such as audio or text generation.
Allowing the transition point to move dynamically during inference rather than at a fixed warmup end might further reduce average cache size.
Repeating the least-squares adjustment at multiple points in a very long sequence could handle cases where early compensation drifts.

Load-bearing premise

The closed-form least-squares weight adjustment derived from the warmup phase will continue to compensate for missing historical context throughout generation without introducing new temporal artifacts or identity drift.

What would settle it

Generate long videos with the method and check whether temporal flickering or identity changes appear after the warmup phase on varied content.
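
A check of this kind could be scripted along the following lines. The sketch assumes decoded frames are available as an array and that per-frame identity features come from some external encoder; both inputs, and the three helper names, are hypothetical. It compares simple flicker and identity statistics before and after the transition frame.

```python
import numpy as np

def identity_similarity(features):
    """Cosine similarity of each frame's feature vector to the first frame's."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return unit @ unit[0]

def flicker_score(frames):
    """Mean absolute difference between consecutive frames, one value per step."""
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))
    return diffs.reshape(diffs.shape[0], -1).mean(axis=1)

def post_transition_gap(series, transition_index):
    """Mean of the series after the transition minus its mean before it."""
    return series[transition_index:].mean() - series[:transition_index].mean()

# Placeholder video: random frames stand in for generated output.
rng = np.random.default_rng(0)
video = rng.random((20, 8, 8, 3))
gap = post_transition_gap(flicker_score(video), transition_index=10)
```

A gap near zero on varied prompts would support the premise; a consistently positive flicker gap or a negative identity gap after the transition would count against it. Frame differences are a crude proxy, so a benchmark-grade metric would be preferable in practice.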

Figures

Figures reproduced from arXiv: 2607.00712 by Hayden Kwok-Hay So, Jia Li, Jiao Dai, Jizhong Han, Xiangxiang Chu, Xiaomeng Fu, Yiming Hu, Yong Wang.

**Figure 1.** The overall framework of ISPA. ISPA consists of three phases: (1) DualStream Warmup: During the first T_warm frames, F-Layers collect full attention outputs A_full and local outputs A_loc. (2) Layer Absorb: At frame T_warm, we select K layers and convert them to L-Layers (Local Layers). (3) Memory-Efficient Inference: The selected layers permanently evict historical KV caches and utilize W + ∆W to compensate f…

**Figure 2.** Left: Qualitative comparison of 30-second long-form video generation. Right: GPU memory usage comparison.

**Figure 3.** Ablation study on core components. Top: Our full method maintains identity consistency and visual sharpness. Middle: Removing parametric adaptation (No Absorb) leads to subject morphing over time. Bottom: Omitting initial KV pairs (No Sink) causes oversmoothness.

**Figure 4.** Ablation on KV cache eviction ratios over time.

**Figure 5.** Impact of warm-up duration (T_warm). We compare parametric adaptation using different initial frame counts. T_warm ≥ 12 effectively prevents information starvation, matching or exceeding the full-cache baseline's long-term stability.

**Figure 6.** Generalization to a Speech-to-Video (S2V) model. We integrate ISPA into LiveAvatar for streaming talking-head synthesis. Given a single reference image and a continuous speech input, ISPA (bottom row) maintains nearly identical visual quality and viseme accuracy compared to the full-cache baseline (top row).

**Figure 7.** Compatibility with Post-Training Quantization (PTQ). (a) Visual comparison between the standard ISPA and its quantized version (ISPA+W8A8). (b) Overall latency breakdown. (c) Quantitative stability analysis based on generation quality.

Original abstract

Autoregressive (AR) streaming models have emerged as a powerful paradigm for long video generation. However, the linearly growing Key-Value (KV) cache poses a significant bottleneck, leading to memory overload and degraded inference throughput. A common compression method is to drop redundant KV tokens, which often breaks long-range dependencies, resulting in temporal flickering and identity loss. In this paper, we propose Instance-Specific Parametric Absorption (ISPA), a novel framework that shifts KV cache compression from discarding to distilling. The core idea is to transition a subset of layers from Full-Attention (F-Layers) to memory-efficient Local-Attention (L-Layers) by "absorbing" historical context into the model's weights. Specifically, during a brief warmup phase, ISPA monitors the output discrepancy between global and local attention. At the transition point, we solve a closed-form least-squares problem to compute an instance-specific weight modulation that compensates for the missing history. Experiments across architectures (1.3B to 14B) demonstrate that ISPA can remove up to 50% of the KV cache with near-lossless visual quality. We hope this perspective encourages future work to explore parametric memory consolidation beyond external token-level cache management for streaming generative models.
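
The difference between an F-Layer and an L-Layer can be pictured as a frame-level attention mask. The sketch below builds a causal mask for full attention and a windowed mask that also keeps a few initial "sink" frames, as the Figure 3 ablation suggests. The function name `frame_mask`, the window of 2 frames, and the single sink frame are assumptions chosen for readability, not values from the paper.

```python
import numpy as np

def frame_mask(num_frames, window=None, sink=0):
    """Boolean [query_frame, key_frame] mask; True means the key is visible.

    window=None gives full causal attention (F-Layer style). Otherwise each
    query sees only the last `window` frames plus the first `sink` frames
    (L-Layer style, under the assumptions stated above).
    """
    q = np.arange(num_frames)[:, None]
    k = np.arange(num_frames)[None, :]
    causal = k <= q
    if window is None:
        return causal
    recent = (q - k) < window
    is_sink = k < sink
    return causal & (recent | is_sink)

full = frame_mask(6)
local = frame_mask(6, window=2, sink=1)
```

Row sums give the number of frames each query must keep cached: the local mask plateaus at `window + sink` while the full mask keeps growing with the frame index.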

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ISPA's closed-form weight modulation from warmup discrepancies is a distinct angle on KV compression, but the abstract leaves the math and generalization unverified.

The paper's main point is that instead of dropping KV tokens in autoregressive video models, you can run a short warmup, measure output gaps between full and local attention, then solve a closed-form least-squares problem to bake the missing history into instance-specific weight changes. This lets some layers switch to local attention and cut the cache by up to 50% while keeping visual quality close to the original.

What is new is the move from external token management to this parametric absorption step at layer transition. The experiments run the idea on models from 1.3B to 14B parameters and report the compression works with little visible degradation, which is a step beyond the token-dropping baselines mentioned.

The soft spots are real. The central claim rests on the warmup discrepancies being representative for the entire video. If motion or identity features shift after the transition point, the fixed modulation has no way to adapt, and the paper supplies no bound or test showing the solution stays valid. The abstract also gives no explicit least-squares equation, no definition of the quality metric, and no controls for post-transition content changes, so the "near-lossless" result cannot be checked from what is written.

This work is for people building streaming video generators who already fight KV memory limits. A reader working on efficient transformer inference would find the parametric angle worth examining, even if the current write-up is thin on verification.

I would send it to peer review. The idea is distinct enough and the scale of models tested is decent, but the authors need to supply the missing formulation and targeted failure cases before the claim can be taken as solid.

Referee Report

2 major / 0 minor

Summary. The paper proposes Instance-Specific Parametric Absorption (ISPA) to address KV cache growth in autoregressive video generation. During a brief warmup, output discrepancies between full-attention and local-attention layers are monitored; at transition, a closed-form least-squares problem yields an instance-specific weight modulation that allows selected layers to switch to local attention while absorbing historical context. Experiments on 1.3B–14B models report up to 50% KV cache reduction with near-lossless visual quality.

Significance. If the central claim holds, ISPA offers a parametric alternative to token-dropping KV compression that preserves long-range dependencies without external cache management. The closed-form least-squares computation and scaling experiments across model sizes (1.3B to 14B) are positive features that could be reproducible if the formulation is supplied.

major comments (2)

[Abstract] Abstract and method description: the central claim that a single closed-form least-squares modulation computed from warmup-phase discrepancies suffices to compensate for missing KV history for the remainder of an arbitrarily long video lacks any derivation, stationarity bound, or post-transition validation. This directly undermines the 'near-lossless' guarantee when content, motion, or identity statistics evolve after the transition point.
[Abstract] Abstract: no explicit formulation of the least-squares objective, the precise discrepancy metric, the variables being solved for, or how the resulting modulation is injected into the layer weights is provided. Without these, the reported results cannot be verified or reproduced, rendering the 'closed-form' and 'parameter-free' aspects unverifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater clarity and validation in the ISPA method. We address each point below and will revise the manuscript to improve reproducibility and strengthen the empirical support for the central claims.

Referee: [Abstract] Abstract and method description: the central claim that a single closed-form least-squares modulation computed from warmup-phase discrepancies suffices to compensate for missing KV history for the remainder of an arbitrarily long video lacks any derivation, stationarity bound, or post-transition validation. This directly undermines the 'near-lossless' guarantee when content, motion, or identity statistics evolve after the transition point.

Authors: We agree that the manuscript does not include a formal derivation or stationarity bound proving the modulation remains effective for arbitrarily long videos under evolving statistics. The provided experiments demonstrate near-lossless quality on sequences substantially longer than the warmup phase across multiple model scales, but these are empirical. We will add explicit post-transition validation on extended videos (e.g., 2x–4x the original lengths) with varying motion and identity changes, plus a short discussion of the implicit stationarity assumption during the brief warmup. A theoretical bound is beyond the current scope but could be explored in future work.

Revision: partial.

Referee: [Abstract] Abstract: no explicit formulation of the least-squares objective, the precise discrepancy metric, the variables being solved for, or how the resulting modulation is injected into the layer weights is provided. Without these, the reported results cannot be verified or reproduced, rendering the 'closed-form' and 'parameter-free' aspects unverifiable.

Authors: The method section derives the closed-form solution by solving a least-squares problem that minimizes the L2 output discrepancy between full-attention and local-attention layers over the warmup tokens; the variables solved are an instance-specific diagonal scaling matrix applied multiplicatively to the query and key projection weights of the transitioned layer. We acknowledge the abstract omits these details and the method description could be more self-contained. We will revise the abstract to include a concise formulation statement and expand the method section with the exact objective, discrepancy metric, solved variables, and injection procedure to enable direct reproduction.

Revision: yes.
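
For readers who want to see what a closed-form diagonal solve looks like, the following is a simplified worked example. It is an assumption made for illustration only: the scaling vector $s$ is applied to the local attention output rather than to the query and key projections described in the response, and the symbols $a^{\mathrm{loc}}_t$ and $a^{\mathrm{full}}_t$ (local and full outputs at warmup token $t$) are not taken from the paper.

```latex
% Per-channel scaling fitted on n warmup tokens (illustrative simplification).
\min_{s \in \mathbb{R}^{d}} \;
  \sum_{t=1}^{n} \bigl\lVert s \odot a^{\mathrm{loc}}_{t} - a^{\mathrm{full}}_{t} \bigr\rVert_2^2

% The objective decouples across channels j. Setting the derivative with
% respect to s_j to zero gives
\sum_{t=1}^{n} \bigl( s_j \, a^{\mathrm{loc}}_{t,j} - a^{\mathrm{full}}_{t,j} \bigr) a^{\mathrm{loc}}_{t,j} = 0
\quad\Longrightarrow\quad
s_j^{\star} = \frac{\sum_{t=1}^{n} a^{\mathrm{loc}}_{t,j} \, a^{\mathrm{full}}_{t,j}}
                   {\sum_{t=1}^{n} \bigl( a^{\mathrm{loc}}_{t,j} \bigr)^{2}}
```

Scaling inside the query and key projections passes through the softmax, so the output is not linear in the scale and a closed form of this kind would need a further argument. That is part of what the referee's request for an explicit objective would resolve.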

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central mechanism computes an instance-specific weight modulation via closed-form least-squares on output discrepancies observed only during an explicit warmup phase, then applies the fixed modulation to enable local attention for the remainder of generation. This is an empirical proposal whose reported success (near-lossless quality after KV cache reduction) is measured on held-out video content rather than being identical to the warmup discrepancies by construction. No equations reduce the final quality metric to the fitted parameters, no self-citations are invoked as load-bearing uniqueness theorems, and the derivation chain contains no self-definitional steps or renamed known results. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is abstract-only; the least-squares compensation step implicitly assumes a linear relationship between attention outputs that may require instance-specific fitting, but no explicit free parameters or invented entities are stated.

axioms (1)

domain assumption: Output discrepancy measured during warmup is a sufficient signal to compute a weight modulation that fully compensates for history loss. Invoked at the transition point from F-Layers to L-Layers.


[1] Behnam, P., Fu, Y., Zhao, R., Tsai, P.A., Yu, Z., Tumanov, A.: Rocketkv: Accelerating long-context llm inference via two-stage kv cache compression. In: International Conference on Machine Learning. pp. 3358–3392. PMLR (2025)

[2] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

[3] Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025)

[4] Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025)

[5] Dao, T.: Flashattention-2: Faster attention with better parallelism and work partitioning. In: The Twelfth International Conference on Learning Representations

[6] Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems 35, 16344–16359 (2022)

[7] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

[8] Fu, X., Li, J.: Tcfg: Truncated classifier-free guidance for efficient and scalable text-to-image acceleration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18552–18562 (2025)

[9] Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., Lin, M.: When attention sink emerges in language models: An empirical view. In: The Thirteenth International Conference on Learning Representations

[10] Guo, H., Jia, Z., Li, J., Li, B., Cai, Y., Wang, J., Li, Y., Lu, Y.: Efficient autoregressive video diffusion with dummy head. arXiv preprint arXiv:2601.20499 (2026)

[11] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)

[12] Hooper, C.R.C., Kim, S., Mohammadzadeh, H., Maheswaran, M., Zhao, S., Paik, J., Mahoney, M.W., Keutzer, K., Gholami, A.: Squeezed attention: Accelerating long context length llm inference. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 32631–32652 (2025)

[13] Huang, J., Ye, Z., Hu, X., He, T., Zhang, G., Shi, S., Bian, J., Jiang, L.: Live: Long-horizon interactive video world modeling. arXiv preprint arXiv:2602.03747 (2026)

[14] Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

[15] Huang, Y., Guo, H., Wu, F., Zhang, S., Huang, S., Gan, Q., Liu, L., Zhao, S., Chen, E., Liu, J., et al.: Live avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677 (2025)

[16] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)

[17] Li, J., Fu, X., Gao, Y., Wang, J., Wang, X., So, H.K.H.: Rethinking conditioning in diffusion models: Dynamic token scheduling for efficient and aligned text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4160–4169 (2026)

[18] Li, J., Fu, X., Peng, X., Chen, W., Zheng, Y., Zhao, T., Wang, J., Chen, F., Wang, X., So, H.K.H.: Train short, inference long: Training-free horizon extension for autoregressive video generation. arXiv preprint arXiv:2602.14027 (2026)

[19] Li, W., Pan, W., Luan, P.C., Gao, Y., Alahi, A.: Stable video infinity: Infinite-length video generation with error recycling. arXiv preprint arXiv:2510.09212 (2025)

[20] Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., Chen, D.: Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems 37, 22947–22970 (2024)

[21] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations

[22] Liu, A., Liu, J., Pan, Z., He, Y., Haffari, G., Zhuang, B.: Minicache: Kv cache compression in depth dimension for large language models. Advances in Neural Information Processing Systems 37, 139997–140031 (2024)

[23] Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161 (2025)

[24] Liu, X., Gong, C., et al.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: The Eleventh International Conference on Learning Representations

[25] Lu, Y., Zeng, Y., Li, H., Ouyang, H., Wang, Q., Cheng, K.L., Zhu, J., Cao, H., Zhang, Z., Zhu, X., et al.: Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678 (2025)

[26] Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)

[27] Ma, X., Wang, Y., Chen, X., Jia, G., Liu, Z., Li, Y.F., Chen, C., Qiao, Y.: Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048 (2024)

[28] Ma, Y., Zheng, X., Xu, J., Xu, X., Ling, F., Zheng, X., Kuang, H., Li, H., Wang, X., Xiao, X., et al.: Flow caching for autoregressive video generation. arXiv preprint arXiv:2602.10825 (2026)

[29] Millon, E.: Krea realtime 14b: Real-time video generation (2025), https://github.com/krea-ai/realtime-video

[30] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

[31] Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.Y., Chuang, C.Y., et al.: Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720 (2024)

[32] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

[33] Song, Y., Dhariwal, P.: Improved techniques for training consistency models. In: The Twelfth International Conference on Learning Representations

[34] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: International Conference on Machine Learning. pp. 32211–32252. PMLR (2023)

[35] Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., Han, S.: Quest: Query-aware sparsity for efficient long-context llm inference. In: Forty-first International Conference on Machine Learning

[36] Team, R., Gao, Z., Wang, Q., Zeng, Y., Zhu, J., Cheng, K.L., Li, Y., Wang, H., Xu, Y., Ma, S., et al.: Advancing open-source world models. arXiv preprint arXiv:2601.20540 (2026)

[37] Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W., Luo, W., et al.: Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211 (2025)

[38] Villegas, R., Babaeizadeh, M., Kindermans, P.J., Moraldo, H., Zhang, H., Saffar, M.T., Castro, S., Kunze, J., Erhan, D.: Phenaki: Variable length video generation from open domain textual descriptions. In: International Conference on Learning Representations

[39] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

[40] Wang, A., Chen, H., Tan, J., Zhang, K., Cai, X., Lin, Z., Han, J., Ding, G.: Prefixkv: Adaptive prefix kv cache is what vision instruction-following models need for efficient generation. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems

[41] Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982 (2025)

[42] Wu, X., Zhang, G., Xu, Z., Zhou, Y., Lu, Q., He, X.: Pack and force your memory: Long-form and consistent video generation. arXiv preprint arXiv:2510.01784 (2025)

[43] Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023)

[44] Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., et al.: Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622 (2025)

[45] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

[46] Yesiltepe, H., Meral, T.H.S., Akan, A.K., Oktay, K., Yanardag, P.: Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649 (2025)

[47] Yi, J., Jang, W., Cho, P.H., Nam, J., Yoon, H., Kim, S.: Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081 (2025)

[48] Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, B.: Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems 37, 47455–47487 (2024)

[49] Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6613–6623 (2024)

[50] Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22963–22974 (2025)

[51] Yu, Z., Wang, Z., Fu, Y., Shi, H., Shaikh, K., Lin, Y.C.: Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration. In: International Conference on Machine Learning. pp. 57659–57677. PMLR (2024)

[52] Zhang, J., Huang, H., Zhang, P., Wei, J., Zhu, J., Chen, J.: Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization. In: International Conference on Machine Learning. pp. 75097–75119. PMLR (2025)

[53] Zhang, J., Zhang, P., Zhu, J., Chen, J., et al.: Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration. In: The Thirteenth International Conference on Learning Representations

[54] Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., et al.: H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36, 34661–34710 (2023)

[55] Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Gu, L., Zhang, Y., He, J., Zheng, W.S., et al.: Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025)

[56] Zhu, H., Zhao, M., He, G., Su, H., Li, C., Zhu, J.: Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214 (2026)