
arxiv: 2605.22269 · v1 · pith:J2YWMWRQnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI · cs.MM

MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

Pith reviewed 2026-05-22 07:12 UTC · model grok-4.3

keywords: KV cache compression, long streaming video, VideoQA, multi-grained representation, semi-hierarchical retrieval, visual token compression, LLM memory efficiency, streaming question answering

The pith

MuKV compresses KV caches at patch, frame and segment levels to raise accuracy in long streaming video question answering while holding memory and speed steady.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve the memory and context problems that arise when large language models answer questions about long streaming videos, where the number of visual tokens quickly exceeds practical limits. It does so by storing compressed key-value pairs at three different visual scales offline and then retrieving the most relevant ones online through a semi-hierarchical process. A sympathetic reader would care because the approach promises to keep fine spatial details inside frames and longer temporal patterns across frames without paying extra memory or latency costs. The central mechanism uses self-attention and frequency signals to decide which tokens to keep, showing that this selective compression alone improves accuracy, memory footprint, and efficiency over standard full-frame caching. If the claim holds, real-time video question answering becomes feasible on longer streams without retraining the underlying language model.
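To make the memory pressure concrete, the standard KV cache size formula (keys plus values, per layer, per KV head) can be sketched as below. The model configuration, frame count, tokens per frame, and the 25% keep ratio are hypothetical values chosen for illustration; none of them come from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # The factor 2 covers one key vector and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


# Hypothetical configuration (assumption, not from the paper): fp16 cache.
config = dict(num_layers=28, num_kv_heads=4, head_dim=128)
tokens_per_frame = 196   # assumed visual tokens per frame
frames = 1800            # assumed number of sampled frames

full = kv_cache_bytes(frames * tokens_per_frame, **config)
compressed = kv_cache_bytes(frames * tokens_per_frame // 4, **config)  # keep 25%
print(f"full: {full / 2**30:.2f} GiB, 25% kept: {compressed / 2**30:.2f} GiB")
```

The cache grows linearly with the number of stored tokens, so any token-level compression ratio translates directly into the same ratio of memory saved.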

Core claim

MuKV extracts visual representations at patch-, frame-, and segment-levels for the offline KV cache, applies a dual signal token compression mechanism guided by self-attention and frequency to reduce redundancy, and employs a semi-hierarchical retrieval method during online QA; experiments on long-streaming VideoQA benchmarks demonstrate that this combination raises answer accuracy without increasing memory usage or lowering online efficiency, and that the compression step by itself delivers consistent gains across all three measures.
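The dual-signal idea can be illustrated with a minimal sketch. It assumes that the attention signal is the mean attention a token receives, that the frequency signal is the energy left after removing the lowest frequencies along the token axis, and that the two are combined by a weighted sum. The function names, the `alpha` weight, and the `cutoff` parameter are hypothetical; the paper's actual scoring rule may differ.

```python
import cmath
import math


def attention_received(attn):
    """Mean attention each token receives; attn[i][j] is the weight from query i to key j."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) / n for j in range(n)]


def high_frequency_energy(tokens, cutoff):
    """Per-token energy after zeroing the `cutoff` lowest frequencies along the token axis."""
    n, d = len(tokens), len(tokens[0])
    energy = [0.0] * n
    for c in range(d):
        signal = [tokens[t][c] for t in range(n)]
        # Naive O(n^2) DFT, adequate for a small illustration.
        spectrum = [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                    for k in range(n)]
        for k in range(n):
            if min(k, n - k) < cutoff:   # low-frequency bins, including DC
                spectrum[k] = 0
        for t in range(n):
            value = sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            energy[t] += value.real ** 2
    return energy


def normalise(xs):
    """Min-max normalise; a (near-)constant signal carries no ranking information."""
    lo, hi = min(xs), max(xs)
    if hi - lo < 1e-12:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def select_tokens(tokens, attn, keep, alpha=0.5, cutoff=1):
    """Return indices of the `keep` highest-scoring tokens, in their original order."""
    a = normalise(attention_received(attn))
    f = normalise(high_frequency_energy(tokens, cutoff))
    score = [alpha * x + (1 - alpha) * y for x, y in zip(a, f)]  # assumed combination rule
    ranked = sorted(range(len(tokens)), key=lambda i: score[i], reverse=True)
    return sorted(ranked[:keep])
```

Under these assumptions, a token survives either because many other tokens attend to it or because it deviates from the slowly varying background, which is one way two signals could complement each other.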

What carries the argument

Multi-grained KV cache compression module that extracts and compresses representations at patch, frame, and segment levels using self-attention and frequency signals, paired with semi-hierarchical retrieval for online use.
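A rough picture of the three granularities, assuming (purely for illustration) that frame-level and segment-level representations are obtained by mean pooling over patches and over frames respectively. The source only states that representations are extracted at patch, frame, and segment levels; the pooling operator, the `segment_len` parameter, and the function names are assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]


def build_multigrained(video, segment_len):
    """video[f][p] is the feature vector of patch p in frame f.

    Returns one list of vectors per granularity. Mean pooling is an
    illustrative stand-in, not the paper's confirmed extraction method.
    """
    patch_level = [patch for frame in video for patch in frame]
    frame_level = [mean_vector(frame) for frame in video]
    segment_level = [
        mean_vector(frame_level[start:start + segment_len])
        for start in range(0, len(frame_level), segment_len)
    ]
    return {"patch": patch_level, "frame": frame_level, "segment": segment_level}


# Toy input: 4 frames, 2 patches per frame, 2-dimensional features.
video = [
    [[0, 0], [2, 2]],
    [[2, 2], [4, 4]],
    [[4, 4], [6, 6]],
    [[6, 6], [8, 8]],
]
levels = build_multigrained(video, segment_len=2)
print({name: len(vectors) for name, vectors in levels.items()})
```

Coarser levels hold far fewer entries than the patch level, which is why they are cheap to keep alongside a compressed set of fine-grained tokens.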

If this is right

  • Answer accuracy rises on long-streaming VideoQA benchmarks while memory stays at or below the level of caching every frame or two.
  • Online question-answering latency remains comparable to or better than prior KV-cache methods.
  • The compression step alone produces measurable gains in accuracy, memory, and efficiency even when the rest of the pipeline is unchanged.
  • Local patch-level cues and segment-level temporal context are both available for retrieval without storing every token.
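One plausible reading of retrieval over several granularities is a coarse-to-fine search: rank segments against the question first, then rank only the frames inside the selected segments. The sketch below implements that reading with cosine similarity. It is an assumption about what "semi-hierarchical" could look like, not the paper's confirmed procedure, and `retrieve`, `k_segments`, and `k_frames` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, segment_keys, frame_keys, segment_len, k_segments, k_frames):
    """Pick the top segments, then the top frames within those segments."""
    segments = sorted(range(len(segment_keys)),
                      key=lambda s: cosine(query, segment_keys[s]),
                      reverse=True)[:k_segments]
    # Only frames belonging to a selected segment remain candidates.
    candidates = [
        f
        for s in segments
        for f in range(s * segment_len, min((s + 1) * segment_len, len(frame_keys)))
    ]
    ranked = sorted(candidates, key=lambda f: cosine(query, frame_keys[f]), reverse=True)
    return sorted(ranked[:k_frames])


# Toy keys: two segments of two frames each.
frame_keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segment_keys = [[0.95, 0.05], [0.05, 0.95]]
print(retrieve([0.0, 1.0], segment_keys, frame_keys, segment_len=2, k_segments=1, k_frames=1))
```

The design trade-off in any such scheme is that a wrong choice at the coarse level hides every fine-grained entry beneath it, so the coarse keys must summarise their segment well.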

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-level compression pattern could be tested on long audio or multimodal streams where token counts also grow rapidly.
  • Adding one more hierarchy level for entire video chapters might allow still longer contexts without further memory growth.
  • The method could be combined with existing token-pruning techniques inside the language model itself to produce additive savings.
  • Deployment on edge devices would benefit if the offline compression can run once and the retrieval stays lightweight.

Load-bearing premise

Compressing visual tokens at multiple granularity levels will keep both local spatial details and global temporal context intact enough that retrieval still supplies the information needed for correct answers.

What would settle it

Measuring answer accuracy on the same long-streaming VideoQA benchmarks and finding that MuKV scores lower than an uncompressed full-frame KV cache baseline.

Figures

Figures reproduced from arXiv: 2605.22269 by Angela Yao, Jiajun Chen, Junbin Xiao, Tianxiang Sun, Xun Yang.

Figure 1: Comparison between MuKV and previous arts. (a) …
Figure 2: Illustration of different approaches for streaming video QA. (a) The end-to-end approach trades off visual details for long-ranged …
Figure 3: Illustration of multi-grained video KV cache compres…
Figure 4: Illustration of semi-hierarchical retrieval.
Figure 5: Distribution of tokens’ self-attention scores (top), fre…
Figure 6: Prediction visualization on StreamingBench […
Figure 7: The answer spans and their ratios relative to the cor…
Original abstract

Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) of the historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frame or temporal contexts across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at patch-, frame-, and segment-levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression mechanism guided by self-attention and frequency. For online QA, MuKV designs a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long-streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy, without sacrificing memory and online QA efficiency. Moreover, our compression mechanism alone brings consistent benefits across answer accuracy, memory, and QA efficiency over baselines, showcasing highly effective contribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MuKV, a multi-grained KV cache compression method for long streaming VideoQA. It extracts visual KV representations at patch-, frame-, and segment-levels to preserve local spatial cues and global temporal context, applies dual-signal token compression guided by self-attention and frequency signals, and employs a semi-hierarchical retriever for online QA. Experiments on long-streaming VideoQA benchmarks report significant accuracy gains without increased memory or reduced efficiency, with the compression mechanism alone claimed to deliver consistent benefits across all three metrics.

Significance. If the empirical claims hold under fair baselines, this approach could meaningfully advance practical deployment of LLM-based video QA in streaming settings by addressing KV cache growth. The multi-grained design is a reasonable attempt to balance detail retention with compression, and the dual-signal guidance is a concrete algorithmic contribution. Reproducible benchmark results would strengthen the case for adoption in resource-constrained multimodal systems.

major comments (2)
  1. [Method (compression and retrieval subsections)] The central claim that the compression mechanism alone yields consistent accuracy, memory, and efficiency gains rests on the untested assumption that self-attention plus frequency guidance discards only irrelevant tokens. No retrieval-precision metrics, per-question-type error analysis, or ablation isolating the dual-signal pruning from the multi-grained extraction and semi-hierarchical retriever are described, leaving open the possibility that low-amplitude but answer-critical patterns are lost.
  2. [Experiments] The abstract and results assert 'consistent benefits' and 'significantly improves answer accuracy' across benchmarks, yet the manuscript provides no tables or sections reporting multiple runs, error bars, or explicit isolation of the compression component (e.g., full KV vs. compressed KV under identical retrieval). This weakens attribution of gains specifically to the proposed compression.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief definition of 'long streaming' (e.g., typical frame count or token length) to set expectations for the reported efficiency numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below. Where the feedback identifies gaps in empirical validation, we have revised the manuscript to provide additional ablations, statistical reporting, and isolation experiments.

Point-by-point responses
  1. Referee: [Method (compression and retrieval subsections)] The central claim that the compression mechanism alone yields consistent accuracy, memory, and efficiency gains rests on the untested assumption that self-attention plus frequency guidance discards only irrelevant tokens. No retrieval-precision metrics, per-question-type error analysis, or ablation isolating the dual-signal pruning from the multi-grained extraction and semi-hierarchical retriever are described, leaving open the possibility that low-amplitude but answer-critical patterns are lost.

    Authors: We agree that stronger isolation of the dual-signal compression is valuable. In the revised manuscript we add a dedicated ablation that fixes the multi-grained extraction and semi-hierarchical retriever while varying only the pruning signals (self-attention only, frequency only, and both). We also report retrieval precision@K for the online stage on the long-streaming benchmarks. A full per-question-type error breakdown is not added, as it would require new human annotations outside the current experimental scope; instead we include qualitative case studies of retained versus discarded tokens in the appendix to illustrate that answer-critical content is preserved. revision: partial

  2. Referee: [Experiments] The abstract and results assert 'consistent benefits' and 'significantly improves answer accuracy' across benchmarks, yet the manuscript provides no tables or sections reporting multiple runs, error bars, or explicit isolation of the compression component (e.g., full KV vs. compressed KV under identical retrieval). This weakens attribution of gains specifically to the proposed compression.

    Authors: We accept this criticism. The revised version now includes results averaged over three random seeds with standard deviations for all main tables. We have also inserted a new subsection that directly compares (i) full KV cache, (ii) our compressed KV cache, and (iii) baseline compression methods, all using the identical semi-hierarchical retriever and LLM backbone. These controlled comparisons isolate the contribution of the dual-signal compression and confirm consistent gains across accuracy, memory footprint, and online inference speed. revision: yes
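The two reporting practices mentioned in these responses, retrieval precision@K and averaging over seeds with a standard deviation, can be sketched as follows. The retrieved lists, the relevant set, and the per-seed values are made-up placeholders, not results from the paper.

```python
from statistics import mean, stdev


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant (divides by k)."""
    top = retrieved[:k]
    return sum(1 for item in top if item in relevant) / k


def summarise(values):
    """Mean and sample standard deviation (n - 1 denominator) across runs."""
    return mean(values), stdev(values)


# Placeholder data: frame indices retrieved under three hypothetical seeds.
relevant_frames = {7, 9}
runs = [
    [3, 7, 1, 9],
    [7, 3, 9, 1],
    [9, 1, 7, 3],
]
per_seed = [precision_at_k(run, relevant_frames, k=2) for run in runs]
avg, spread = summarise(per_seed)
print(f"precision@2 = {avg:.3f} ± {spread:.3f} over {len(runs)} seeds")
```

With only three seeds the sample standard deviation is a coarse estimate, so it is best read as an indication of run-to-run variation rather than a tight confidence bound.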

Circularity Check

0 steps flagged

No circularity: MuKV is an algorithmic design validated on external benchmarks

Full rationale

The paper describes MuKV as a multi-grained KV cache compression module (patch/frame/segment extraction plus dual-signal self-attention/frequency pruning) paired with semi-hierarchical retrieval. All performance claims (accuracy gains, memory/efficiency wins) are presented as empirical outcomes measured on long-streaming VideoQA benchmarks rather than as first-principles derivations or predictions. No equations reduce a result to a fitted parameter by construction, no load-bearing self-citations justify uniqueness, and no ansatz is smuggled in. The central mechanism is an explicit design choice whose correctness is tested externally, satisfying the criteria for a self-contained, non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that multi-level visual token compression can be performed without losing task-critical information and that the semi-hierarchical retriever can locate relevant caches at low cost. No explicit free parameters or invented entities are named in the abstract.

pith-pipeline@v0.9.0 · 5757 in / 1154 out tokens · 37347 ms · 2026-05-22T07:12:30.242988+00:00 · methodology



Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 17 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022.

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

  4. [4]

    Memory-efficient streaming videollms for real-time procedural video understanding. arXiv preprint arXiv:2504.13915, 2025

    Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Cihan Camgöz, Shreyas Hampali, Eric Sauser, Shugao Ma, et al. Memory-efficient streaming videollms for real-time procedural video understanding. arXiv preprint arXiv:2504.13915, 2025.

  5. [5]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024.

  6. [6]

    Streamingtom: Streaming token compression for efficient video understanding. arXiv preprint arXiv:2510.18269, 2025

    Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. Streamingtom: Streaming token compression for efficient video understanding. arXiv preprint arXiv:2510.18269, 2025.

  7. [7]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.

  8. [8]

    An algorithm for the machine calculation of complex fourier series. Mathematics of computation, 19(90):297–301, 1965

    James W Cooley and John W Tukey. An algorithm for the machine calculation of complex fourier series. Mathematics of computation, 19(90):297–301, 1965.

  9. [9]

    Streaming video question-answering with in-context video kv-cache retrieval

    Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang, et al. Streaming video question-answering with in-context video kv-cache retrieval. In ICLR, 2025.

  10. [10]

    The llama 3 herd of models. arXiv e-prints, pages arXiv–2407,

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407,

  11. [11]

    Videoagent: A memory-augmented multimodal agent for video understanding

    Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory-augmented multimodal agent for video understanding. In ECCV, pages 75–

  12. [12]

    Springer, 2024.

  13. [13]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025.

  14. [14]

    Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning. ICLR, 2025

    Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning. ICLR, 2025.

  15. [15]

    Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37:1270–1303, 2024

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37:1270–1303, 2024.

  16. [16]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  17. [17]

    FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

    Dongwon Jo, Jiwon Song, Yulhwa Kim, and Jae-Joon Kim. Fastkv: Kv cache compression for fast long-context processing with token-selective propagation. arXiv preprint arXiv:2502.01068, 2025.

  18. [18]

    Freqkv: Frequency domain key-value compression for efficient context window extension

    Jushi Kai, Boyi Zeng, Yixuan Wang, Haoli Bai, Ziwei He, Bo Jiang, and Zhouhan Lin. Freqkv: Frequency domain key-value compression for efficient context window extension. arXiv preprint arXiv:2505.00570, 2025.

  19. [19]

    Infinipot-v: Memory-constrained kv cache compression for streaming video understanding. NeurIPS, 2025

    Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. Infinipot-v: Memory-constrained kv cache compression for streaming video understanding. NeurIPS, 2025.

  20. [20]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

  21. [21]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.

  22. [22]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.

  23. [23]

    Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. arXiv preprint arXiv:2411.03628, 2024

    Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. arXiv preprint arXiv:2411.03628, 2024.

  24. [24]

    Freekv: Boosting kv cache retrieval for efficient llm inference. arXiv preprint arXiv:2505.13109, 2025

    Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, and Jieru Zhao. Freekv: Boosting kv cache retrieval for efficient llm inference. arXiv preprint arXiv:2505.13109, 2025.

  25. [25]

    Cachegen: Kv cache compression and streaming for fast large language model serving

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pages 38–56, 2024.

  26. [26]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.

  27. [27]

    Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023.

  28. [28]

    Morevqa: Exploring modular reasoning models for video question answering

    Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, and Cordelia Schmid. Morevqa: Exploring modular reasoning models for video question answering. In CVPR, pages 13235–13245, 2024.

  29. [29]

    LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

    Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, and Jieru Zhao. Livevlm: Efficient online video understanding via streaming-oriented kv cache and retrieval. arXiv preprint arXiv:2505.15269, 2025.

  30. [30]

    Streaming long video understanding with large language models. NeurIPS, 37:119336–119360, 2024

    Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models. NeurIPS, 37:119336–119360, 2024.

  31. [31]

    Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction

    Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24045–24055, 2025.

  32. [32]

    Question-answering dense video events

    Hangyu Qin, Junbin Xiao, and Angela Yao. Question-answering dense video events. In SIGIR, pages 884–894,

  33. [33]

    LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024.

  34. [34]

    Video-xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169,

  35. [35]

    Moviechat+: Question-aware sparse memory for long video question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, and Gaoang Wang. Moviechat+: Question-aware sparse memory for long video question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  36. [36]

    Razorattention: Efficient kv cache compression through retrieval heads. ICLR,

    Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, and Gongyi Wang. Razorattention: Efficient kv cache compression through retrieval heads. ICLR,

  37. [37]

    Fier: Fine-grained and efficient kv cache retrieval for long-context llm inference. arXiv preprint arXiv:2508.08256, 2025

    Dongwei Wang, Zijie Liu, Song Wang, Yuxin Ren, Jianing Deng, Jingtong Hu, Tianlong Chen, and Huanrui Yang. Fier: Fine-grained and efficient kv cache retrieval for long-context llm inference. arXiv preprint arXiv:2508.08256, 2025.

  38. [38]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

  39. [39]

    Videoagent: Long-form video understanding with large language model as agent

    Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. In European Conference on Computer Vision, pages 58–76. Springer, 2024.

  40. [40]

    InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

    Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2.5: Empowering video mllms with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025.

  41. [41]

    Videochat-a1: Thinking with long videos by chain-of-shot reasoning. arXiv preprint arXiv:2506.06097, 2025

    Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, and Yali Wang. Videochat-a1: Thinking with long videos by chain-of-shot reasoning. arXiv preprint arXiv:2506.06097, 2025.

  42. [42]

    Videotree: Adaptive tree-based video representation for llm reasoning on long videos

    Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. Videotree: Adaptive tree-based video representation for llm reasoning on long videos. In CVPR, pages 3272–3283, 2025.

  43. [43]

    Longvlm: Efficient long video understanding via large language models

    Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video understanding via large language models. In European Conference on Computer Vision, pages 453–470. Springer, 2024.

  44. [44]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024.

  45. [45]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021.

  46. [46]

    Videoqa in the era of llms: An empirical study. International Journal of Computer Vision, 133(7):3970–3993, 2025

    Junbin Xiao, Nanxin Huang, Hangyu Qin, Dongyang Li, Yicong Li, Fengbin Zhu, Zhulin Tao, Jianxing Yu, Liang Lin, Tat-Seng Chua, and Angela Yao. Videoqa in the era of llms: An empirical study. International Journal of Computer Vision, 133(7):3970–3993, 2025.

  47. [47]

    Unleashing the power of llms for medical video answer localization

    Junbin Xiao, Qingyun Li, Yusen Yang, Liang Qiu, and Angela Yao. Unleashing the power of llms for medical video answer localization. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 669–679. Springer, 2025.

  48. [48]

    Video question answering via gradually refined attention over appearance and motion

    Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, pages 1645–1653, 2017.

  49. [49]

    PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

    Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024.

  50. [50]

    StreamingVLM: Real-Time Understanding for Infinite Video Streams

    Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608, 2025.

  51. [51]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  52. [52]

    Streammem: Query-agnostic kv cache memory for streaming video understanding. arXiv preprint arXiv:2508.15717,

    Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, and Mengye Ren. Streammem: Query-agnostic kv cache memory for streaming video understanding. arXiv preprint arXiv:2508.15717,

  53. [53]

    Activitynet-qa: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019.

  54. [54]

    Socratic models: Composing zero-shot multimodal reasoning with language

    Andy Zeng, Maria Attarian, Krzysztof Marcin Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael S Ryoo, Vikas Sindhwani, Johnny Lee, et al. Socratic models: Composing zero-shot multimodal reasoning with language. In ICLR.

  55. [55]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.

  56. [56]

    A simple llm framework for long-range video question-answering

    Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple llm framework for long-range video question-answering. In EMNLP, pages 21715–21737, 2024.

  57. [57]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.

  58. [58]

    Flash-vstream: Memory-based real-time understanding for long video streams. ICCV,

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. ICCV,

  59. [59]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.

  60. [60]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13691–13701, 2025.

  61. [61]

    What is the person holding right now?

    Dataset Introduction. VStream-QA [57] comprises two long-video datasets: RVS-Ego and RVS-Movie. RVS-Ego contains 10 egocentric videos with an average duration of 30 minutes, while RVS-Movie includes 22 movie videos averaging 1 hour. The distributions of the temporal answer spans and their ratios relative to the question timestamps of both datasets are pre...

  62. [62]

    Experiments. 7.1. Offline VideoQA and Different Backbones. We also extend our method MuKV to the popular offline long VideoQA datasets: Video-MME [12], MLVU [59] and ... [Figure residue: histograms of question counts (# Questions) by Time Interval (min) and by Time Ratio.]