Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
Pith reviewed 2026-05-12 03:50 UTC · model grok-4.3
The pith
Dividing attention heads into static and dynamic categories enables hybrid KV cache compression that cuts KV cache memory by 30% and speeds up autoregressive video diffusion by up to 2.82x.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns that stay stable across samples and denoising steps. On that basis they divide heads into static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. Forcing-KV then applies structured static pruning to the former and dynamic pruning based on segment-wise similarity to the latter. The reported results are a generation speed of over 29 fps on a single NVIDIA H200 GPU with a 30% cache memory reduction, speedups of 1.35x on LongLive and 1.50x on Self Forcing at 480P, and a 2.82x speedup at 1080P, all while preserving output quality.
What carries the argument
The hybrid KV cache compression strategy that classifies heads by functional specialization and applies tailored pruning: structured static pruning for static heads and segment-wise similarity pruning for dynamic heads.
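The two pruning modes can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the segment length, and the "keep the first `sink` and last `recent` entries" static pattern are all hypothetical, since the page does not specify them. The dynamic branch follows the description quoted further down this page (segment-wise cosine similarity, evict the top-k most similar segments), applied here to per-segment mean features.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def segment_means(cache, seg_len):
    """Split a list of feature vectors into segments and average each segment."""
    means = []
    for start in range(0, len(cache), seg_len):
        seg = cache[start:start + seg_len]
        dim = len(seg[0])
        means.append([sum(vec[d] for vec in seg) / len(seg) for d in range(dim)])
    return means

def prune_dynamic(cache, seg_len, k):
    """Assumed rule: evict the k segments most similar to their predecessor."""
    means = segment_means(cache, seg_len)
    # Segment 0 has no predecessor, so it is never a candidate for eviction.
    sims = [(cosine(means[i - 1], means[i]), i) for i in range(1, len(means))]
    evict = {i for _, i in sorted(sims, reverse=True)[:k]}
    kept = []
    for i in range(len(means)):
        if i not in evict:
            kept.extend(cache[i * seg_len:(i + 1) * seg_len])
    return kept, sorted(evict)

def prune_static(cache, sink, recent):
    """Hypothetical fixed pattern: keep the first `sink` and last `recent` entries."""
    if sink + recent >= len(cache):
        return list(cache)
    return cache[:sink] + cache[len(cache) - recent:]

# Toy cache for one dynamic head: segments 0 and 1 are identical, segment 2 differs.
toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
kept, evicted = prune_dynamic(toy, seg_len=2, k=1)
```

On the toy cache, the redundant middle segment is the one evicted, which is the intended behaviour: entries that repeat what the cache already holds are dropped first. The static branch needs no per-step computation, which is why a fixed mask is attractive for heads whose pattern does not change.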
If this is right
- Generation reaches over 29 frames per second on a single H200 GPU while cutting KV cache memory by 30%.
- Speedups of 1.35x and 1.50x are realized on LongLive and Self Forcing at 480P resolution.
- The speedup scales to 2.82x at 1080P resolution with no reported quality loss.
- The method preserves output quality across the tested autoregressive video diffusion setups.
Where Pith is reading between the lines
- The same head-specialization principle could be tested on non-video autoregressive diffusion models to see whether similar memory savings appear.
- If the classification holds across training runs, it might allow pre-computed pruning masks that further reduce runtime overhead.
- Extending the dynamic pruning to longer context windows could support even higher-resolution or longer video sequences on fixed hardware.
Load-bearing premise
Attention heads in mainstream AR diffusion models have markedly distinct patterns and roles that stay stable across samples and denoising steps, allowing reliable division into static and dynamic categories.
What would settle it
Running the hybrid pruning on multiple AR video models at different resolutions and observing consistent drops in motion consistency or visual fidelity would falsify the claim that the head classification supports lossless compression.
Original abstract
Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution, and further scaling to 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Forcing-KV, a hybrid KV cache compression method for autoregressive video diffusion models. It reports an empirical observation that attention heads exhibit markedly distinct patterns and functional roles that remain stable across samples and denoising steps, allowing division into static heads (focused on transitions and intra-frame fidelity) and dynamic heads (governing inter-frame motion and consistency). Structured static pruning is applied to the former and segment-wise similarity-based dynamic pruning to the latter, yielding reported results of over 29 FPS on a single H200 GPU, 30% cache memory reduction, speedups of 1.35x/1.50x at 480P and 2.82x at 1080P, while maintaining output quality. Code and demos are provided.
Significance. If the stability of the head categorization and quality preservation hold under quantitative scrutiny, the work could meaningfully advance scalable real-time long video generation in AR diffusion models by mitigating KV cache memory and compute bottlenecks, with direct applicability to models like LongLive and Self Forcing.
major comments (3)
- [§3] §3 (Empirical Study of Head-wise Functional Specialization): The central assumption that attention head patterns 'remain stable across samples and denoising steps' enabling reliable static/dynamic division is load-bearing for the hybrid strategy, yet the manuscript provides only qualitative observations without quantitative support such as cross-sample consistency scores, categorization variance statistics, or timestep-robustness metrics.
- [§4] §4 (Forcing-KV Hybrid Compression): The exact criteria, thresholds, and similarity metric (e.g., no explicit equation for segment-wise similarity or pruning ratio selection) used to categorize heads and apply pruning are insufficiently formalized, making it impossible to verify how the reported 30% cache reduction and speedups are achieved without under-compression or quality loss.
- [§5] §5 (Experiments): The claim of 'maintaining output quality' is not supported by specific quantitative metrics (e.g., no reported FVD, FID, or perceptual scores), ablation tables on pruning ratios per head category, or failure-case analysis, leaving the tradeoff between compression and fidelity unverifiable despite the speed/memory numbers.
minor comments (2)
- [Abstract] Abstract: The speedup figures (1.35x on LongLive, 1.50x on Self Forcing) would benefit from explicit baseline model versions and resolution settings to allow direct comparison.
- [§4] Notation: The distinction between 'static pruning' and 'dynamic pruning' could be clarified with a small table summarizing the two strategies side-by-side.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the quantitative foundations, formalization, and experimental validation of Forcing-KV.
-
Referee: [§3] §3 (Empirical Study of Head-wise Functional Specialization): The central assumption that attention head patterns 'remain stable across samples and denoising steps' enabling reliable static/dynamic division is load-bearing for the hybrid strategy, yet the manuscript provides only qualitative observations without quantitative support such as cross-sample consistency scores, categorization variance statistics, or timestep-robustness metrics.
Authors: We agree that the stability claim in §3 would benefit from quantitative backing. In the revised manuscript we will add (i) cross-sample consistency scores measuring the fraction of heads that receive the same static/dynamic label across 50+ diverse inputs, (ii) categorization variance statistics (mean and std of label flips), and (iii) timestep-robustness metrics that track label stability over the full denoising trajectory. These metrics will be reported in a new table and will directly support the hybrid pruning design.
Revision: yes
-
Referee: [§4] §4 (Forcing-KV Hybrid Compression): The exact criteria, thresholds, and similarity metric (e.g., no explicit equation for segment-wise similarity or pruning ratio selection) used to categorize heads and apply pruning are insufficiently formalized, making it impossible to verify how the reported 30% cache reduction and speedups are achieved without under-compression or quality loss.
Authors: We acknowledge that §4 lacks explicit equations. We will insert the precise mathematical definitions: the segment-wise similarity metric (cosine similarity between averaged KV features of consecutive segments), the head-classification threshold (derived from attention entropy and motion magnitude), and the per-category pruning-ratio selection rule. These additions will make the 30% memory reduction and reported speedups fully reproducible and verifiable.
Revision: yes
-
Referee: [§5] §5 (Experiments): The claim of 'maintaining output quality' is not supported by specific quantitative metrics (e.g., no reported FVD, FID, or perceptual scores), ablation tables on pruning ratios per head category, or failure-case analysis, leaving the tradeoff between compression and fidelity unverifiable despite the speed/memory numbers.
Authors: We accept that the quality-preservation claim requires stronger quantitative evidence. In the revision we will report Fréchet Video Distance (FVD) and Fréchet Inception Distance (FID) on standard benchmarks, add ablation tables that vary pruning ratios independently for static and dynamic heads, and include a failure-case analysis section that discusses edge cases where quality degrades. These changes will make the speed–quality trade-off transparent.
Revision: yes
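The stability statistics promised in the first response can be made concrete with a small sketch. It assumes a table `labels[s][h]` holding the static/dynamic label assigned to head `h` on sample `s`, and it defines a "label flip" as a sample whose label disagrees with that head's majority label. Both the data layout and the flip definition are assumptions made here for illustration, since the rebuttal does not define them.

```python
from collections import Counter
from statistics import mean, pstdev

def label_stability(labels):
    """Return (consistency, mean_flips, std_flips) for labels[sample][head]."""
    num_heads = len(labels[0])
    flips = []
    for h in range(num_heads):
        column = [sample[h] for sample in labels]
        majority_count = Counter(column).most_common(1)[0][1]
        # Flips: samples whose label differs from this head's majority label.
        flips.append(len(column) - majority_count)
    # Consistency: fraction of heads with the same label on every sample.
    consistency = sum(1 for f in flips if f == 0) / num_heads
    return consistency, mean(flips), pstdev(flips)

# Four samples, three heads; head 0 never changes label, heads 1 and 2 flip once.
toy_labels = [
    ["static", "dynamic", "static"],
    ["static", "dynamic", "dynamic"],
    ["static", "dynamic", "static"],
    ["static", "static", "static"],
]
consistency, mean_flips, std_flips = label_stability(toy_labels)
```

The same function could be run over denoising steps instead of samples to obtain a timestep-robustness figure, by passing one row of labels per step.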
Circularity Check
No significant circularity; empirical observation and measured results remain independent
The paper's core chain consists of an empirical observation of head-wise attention patterns, followed by a categorization into static/dynamic heads and a hybrid pruning strategy whose performance (FPS, cache reduction, speedups) is reported via direct measurement on benchmarks. No equations, fitted parameters, or self-citations are shown that would make the reported outcomes equivalent to the inputs by construction. The stability claim is presented as an observation supporting the method rather than a tautological redefinition, and the speed claims are external experimental outcomes rather than derived predictions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Attention heads exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
We observe that attention heads ... exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. ... static heads ... dynamic heads ... hybrid KV cache compression strategy
-
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
segment-wise cosine similarity between corresponding segments ... evict the top-k segments ... with the highest similarity values
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.