MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

Angela Yao; Jiajun Chen; Junbin Xiao; Tianxiang Sun; Xun Yang

arxiv: 2605.22269 · v1 · pith:J2YWMWRQnew · submitted 2026-05-21 · cs.CV · cs.AI · cs.MM

Junbin Xiao, Jiajun Chen, Tianxiang Sun, Xun Yang, Angela Yao

Pith reviewed 2026-05-22 07:12 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.MM

keywords KV cache compression · long streaming video · VideoQA · multi-grained representation · semi-hierarchical retrieval · visual token compression · LLM memory efficiency · streaming question answering

The pith

MuKV compresses KV caches at patch, frame and segment levels to raise accuracy in long streaming video question answering while holding memory and speed steady.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve the memory and context problems that arise when large language models answer questions about long streaming videos, where the number of visual tokens quickly exceeds practical limits. It does so by storing compressed key-value pairs at three different visual scales offline and then retrieving the most relevant ones online through a semi-hierarchical process. A sympathetic reader would care because the approach promises to keep fine spatial details inside frames and longer temporal patterns across frames without paying extra memory or latency costs. The central mechanism uses self-attention and frequency signals to decide which tokens to keep, showing that this selective compression alone improves accuracy, memory footprint, and efficiency over standard full-frame caching. If the claim holds, real-time video question answering becomes feasible on longer streams without retraining the underlying language model.
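To make the token-growth problem concrete, here is a back-of-the-envelope estimate of how an uncompressed KV cache scales with stream length. It is a minimal sketch: the layer count, number of KV heads, head dimension, frame rate, and tokens per frame are hypothetical values chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for num_tokens cached tokens."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def stream_tokens(minutes, fps, tokens_per_frame):
    """Visual tokens produced by a stream sampled at a fixed frame rate."""
    return int(minutes * 60 * fps) * tokens_per_frame


# Hypothetical configuration, chosen only for illustration.
tokens = stream_tokens(minutes=60, fps=1, tokens_per_frame=196)
size = kv_cache_bytes(tokens, num_layers=28, num_kv_heads=4, head_dim=128)
print(f"{tokens} tokens -> {size / 2**30:.1f} GiB of KV cache")
```

Because the cost is linear in the number of cached tokens, any reduction in retained tokens translates directly into the same relative memory saving.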

Core claim

MuKV extracts visual representations at patch-, frame-, and segment-levels for the offline KV cache, applies a dual signal token compression mechanism guided by self-attention and frequency to reduce redundancy, and employs a semi-hierarchical retrieval method during online QA; experiments on long-streaming VideoQA benchmarks demonstrate that this combination raises answer accuracy without increasing memory usage or lowering online efficiency, and that the compression step by itself delivers consistent gains across all three measures.

What carries the argument

Multi-grained KV cache compression module that extracts and compresses representations at patch, frame, and segment levels using self-attention and frequency signals, paired with semi-hierarchical retrieval for online use.
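The dual-signal idea can be sketched as a weighted fusion of two per-token importance scores, following the form quoted later on this page (a weight α mixing an attention term and a frequency term). Everything else here is an assumption made for illustration: the frequency score is taken as the mean DFT magnitude of each token's feature vector, both signals are min-max normalized before mixing, and the function names are hypothetical. This is not the authors' implementation.

```python
import cmath


def attention_importance(attn):
    """Average attention each token receives over heads and queries.

    attn[h][q][t] is the attention weight from query q to token t in head h.
    """
    num_heads, num_queries = len(attn), len(attn[0])
    totals = [0.0] * len(attn[0][0])
    for head in attn:
        for row in head:
            for t, weight in enumerate(row):
                totals[t] += weight
    return [s / (num_heads * num_queries) for s in totals]


def frequency_importance(features):
    """Mean DFT magnitude of each token's feature vector (assumed definition)."""
    scores = []
    for vec in features:
        n = len(vec)
        mags = [
            abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n) for i, x in enumerate(vec)))
            for k in range(n)
        ]
        scores.append(sum(mags) / n)
    return scores


def min_max(values):
    """Rescale to [0, 1] so the two signals are comparable (assumption)."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def select_tokens(attn, features, alpha=0.5, keep=2):
    """Indices of the `keep` tokens with the highest fused score, in order."""
    i_att = min_max(attention_importance(attn))
    i_fft = min_max(frequency_importance(features))
    fused = [alpha * a + (1 - alpha) * f for a, f in zip(i_att, i_fft)]
    ranked = sorted(range(len(fused)), key=lambda t: fused[t], reverse=True)
    return sorted(ranked[:keep])
```

Setting `alpha=1.0` reduces the sketch to attention-only pruning and `alpha=0.0` to frequency-only pruning, which is the kind of comparison an ablation of the two signals would need.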

If this is right

Answer accuracy rises on long-streaming VideoQA benchmarks while memory stays at or below the level of caching every frame or two.
Online question-answering latency remains comparable to or better than prior KV-cache methods.
The compression step alone produces measurable gains in accuracy, memory, and efficiency even when the rest of the pipeline is unchanged.
Local patch-level cues and segment-level temporal context are both available for retrieval without storing every token.
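One simple way to picture three granularity levels is to pool patch features into frame features and frame features into segment features. The sketch below assumes mean pooling and fixed-length segments; both are illustrative choices, since this page does not say how the paper builds each level, and `build_levels` is a hypothetical name.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_levels(frames, segment_len):
    """Return patch-, frame-, and segment-level features.

    frames[f][p] is the feature vector of patch p in frame f.
    """
    # Patch level: every patch vector, flattened in temporal order.
    patch_level = [patch for frame in frames for patch in frame]
    # Frame level: one pooled vector per frame.
    frame_level = [mean_vector(frame) for frame in frames]
    # Segment level: one pooled vector per window of consecutive frames.
    segment_level = [
        mean_vector(frame_level[start:start + segment_len])
        for start in range(0, len(frame_level), segment_len)
    ]
    return patch_level, frame_level, segment_level
```

The coarser levels are much smaller than the patch level, so they are cheap to search first and cheap to keep alongside a pruned set of patch tokens.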

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-level compression pattern could be tested on long audio or multimodal streams where token counts also grow rapidly.
Adding one more hierarchy level for entire video chapters might allow still longer contexts without further memory growth.
The method could be combined with existing token-pruning techniques inside the language model itself to produce additive savings.
Deployment on edge devices would benefit if the offline compression can run once and the retrieval stays lightweight.

Load-bearing premise

Compressing visual tokens at multiple granularity levels will keep both local spatial details and global temporal context intact enough that retrieval still supplies the information needed for correct answers.

What would settle it

Measuring answer accuracy on the same long-streaming VideoQA benchmarks and finding that MuKV scores lower than an uncompressed full-frame KV cache baseline.

Figures

Figures reproduced from arXiv: 2605.22269 by Angela Yao, Jiajun Chen, Junbin Xiao, Tianxiang Sun, Xun Yang.

**Figure 2.** Illustration of different approaches for streaming video QA. (a) The end-to-end approach trades off visual details for long-ranged …

**Figure 3.** Illustration of multi-grained video KV cache compres…

**Figure 4.** Illustration of semi-hierarchical retrieval.

**Figure 5.** Distribution of tokens’ self-attention scores (top), fre…

**Figure 6.** Prediction visualization on StreamingBench […

**Figure 7.** The answer spans and their ratios relative to the cor…

Original abstract

Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) of the historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frame or temporal contexts across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at patch-, frame-, and segment-levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression mechanism guided by self-attention and frequency. For online QA, MuKV designs a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long-streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy, without sacrificing memory and online QA efficiency. Moreover, our compression mechanism alone brings consistent benefits across answer accuracy, memory, and QA efficiency over baselines, showcasing highly effective contribution.
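For the online stage, a coarse-to-fine lookup gives a rough picture of retrieving cached entries without scanning everything. The sketch below is a plain two-stage search (pick segments, then frames inside them) using cosine similarity; the abstract does not spell out what makes the paper's retrieval "semi"-hierarchical, so the scoring function, the two-stage structure, and all names here are assumptions rather than the authors' procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def retrieve(query, segment_keys, frame_keys, segment_len, top_segments=1, top_frames=2):
    """Coarse-to-fine lookup: choose segments, then frames inside them."""
    seg_ids = top_k([cosine(query, key) for key in segment_keys], top_segments)
    # Only frames that belong to the selected segments are scored.
    candidates = [
        f
        for s in seg_ids
        for f in range(s * segment_len, min((s + 1) * segment_len, len(frame_keys)))
    ]
    frame_scores = [cosine(query, frame_keys[f]) for f in candidates]
    chosen = [candidates[i] for i in top_k(frame_scores, top_frames)]
    return seg_ids, sorted(chosen)
```

The number of similarity computations grows with the number of segments plus the frames in the selected segments, rather than with the full stream length.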

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MuKV adds a three-level KV cache with attention-plus-frequency pruning and semi-hierarchical retrieval that reports accuracy lifts on long video QA at no extra memory cost, but the claim that pruning keeps every answer-critical token rests on an assumption that still needs direct checks.

MuKV's main contribution is a three-level KV cache that compresses at patch, frame, and segment scales using both attention and frequency cues, paired with a semi-hierarchical retriever for streaming video question answering. The abstract shows it lifting accuracy while holding memory and speed steady, and the compression step alone drives gains on the benchmarks.

The multi-grained design is the fresh part. Earlier KV cache methods for video mostly cache whole frames or individual tokens, which either wastes space or drops spatial detail inside frames. By keeping representations at three different grains, the method tries to hold onto both fine local patterns and broader temporal flow. Adding frequency analysis to the usual attention-based pruning gives a second signal for deciding what to keep, which makes sense for video where some important changes might not stand out in attention scores alone. The semi-hierarchical retrieval then pulls the right cached pieces during online QA without scanning everything. The experiments back this up with consistent improvements across the three metrics. That is useful for anyone running long video streams through LLMs.

The weaker point is whether the compression really preserves the tokens that matter for the final answer. Frequency pruning risks removing low-amplitude but decisive visual cues, such as small movements or rare objects, and attention scores in extended streaming contexts can be noisy. If those get dropped, the retriever has no way to recover them. The abstract does not mention retrieval precision metrics or per-question ablations that would test this directly. It also leaves open whether the baselines are the strongest current methods and whether the reported gains include error bars from repeated runs.

This paper is aimed at engineers and researchers focused on efficient multimodal inference for video. Anyone dealing with memory limits in deployed video QA systems could pick up practical ideas from the compression and retrieval design. It is solid enough to warrant a full referee process, since the core engineering is clear and the results suggest a genuine step forward even if some validation details need tightening. I would recommend sending it out for review, with specific asks for more analysis on what the dual-signal compressor actually discards and how that affects different question types.

Referee Report

2 major / 1 minor

Summary. The paper proposes MuKV, a multi-grained KV cache compression method for long streaming VideoQA. It extracts visual KV representations at patch-, frame-, and segment-levels to preserve local spatial cues and global temporal context, applies dual-signal token compression guided by self-attention and frequency signals, and employs a semi-hierarchical retriever for online QA. Experiments on long-streaming VideoQA benchmarks report significant accuracy gains without increased memory or reduced efficiency, with the compression mechanism alone claimed to deliver consistent benefits across all three metrics.

Significance. If the empirical claims hold under fair baselines, this approach could meaningfully advance practical deployment of LLM-based video QA in streaming settings by addressing KV cache growth. The multi-grained design is a reasonable attempt to balance detail retention with compression, and the dual-signal guidance is a concrete algorithmic contribution. Reproducible benchmark results would strengthen the case for adoption in resource-constrained multimodal systems.

major comments (2)

[Method (compression and retrieval subsections)] The central claim that the compression mechanism alone yields consistent accuracy, memory, and efficiency gains rests on the untested assumption that self-attention plus frequency guidance discards only irrelevant tokens. No retrieval-precision metrics, per-question-type error analysis, or ablation isolating the dual-signal pruning from the multi-grained extraction and semi-hierarchical retriever are described, leaving open the possibility that low-amplitude but answer-critical patterns are lost.
[Experiments] The abstract and results assert 'consistent benefits' and 'significantly improves answer accuracy' across benchmarks, yet the manuscript provides no tables or sections reporting multiple runs, error bars, or explicit isolation of the compression component (e.g., full KV vs. compressed KV under identical retrieval). This weakens attribution of gains specifically to the proposed compression.
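The retrieval-precision metric requested in the first major comment has a standard form: the fraction of the top-K retrieved items that are actually relevant. A minimal sketch follows; the function name and the choice to identify cache entries by integer ids are illustrative assumptions.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = retrieved[:k]
    hits = sum(1 for item in top if item in relevant)
    # Dividing by k (not len(top)) penalizes returning fewer than k items.
    return hits / k
```

Here `retrieved` would be the ranked ids returned by the retriever for one question and `relevant` the set of ids that overlap the annotated answer span; averaging over questions gives a benchmark-level number.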

minor comments (1)

[Abstract] The abstract would benefit from a brief definition of 'long streaming' (e.g., typical frame count or token length) to set expectations for the reported efficiency numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below. Where the feedback identifies gaps in empirical validation, we have revised the manuscript to provide additional ablations, statistical reporting, and isolation experiments.

Referee: [Method (compression and retrieval subsections)] The central claim that the compression mechanism alone yields consistent accuracy, memory, and efficiency gains rests on the untested assumption that self-attention plus frequency guidance discards only irrelevant tokens. No retrieval-precision metrics, per-question-type error analysis, or ablation isolating the dual-signal pruning from the multi-grained extraction and semi-hierarchical retriever are described, leaving open the possibility that low-amplitude but answer-critical patterns are lost.

Authors: We agree that stronger isolation of the dual-signal compression is valuable. In the revised manuscript we add a dedicated ablation that fixes the multi-grained extraction and semi-hierarchical retriever while varying only the pruning signals (self-attention only, frequency only, and both). We also report retrieval precision@K for the online stage on the long-streaming benchmarks. A full per-question-type error breakdown is not added, as it would require new human annotations outside the current experimental scope; instead we include qualitative case studies of retained versus discarded tokens in the appendix to illustrate that answer-critical content is preserved.

revision: partial

Referee: [Experiments] The abstract and results assert 'consistent benefits' and 'significantly improves answer accuracy' across benchmarks, yet the manuscript provides no tables or sections reporting multiple runs, error bars, or explicit isolation of the compression component (e.g., full KV vs. compressed KV under identical retrieval). This weakens attribution of gains specifically to the proposed compression.

Authors: We accept this criticism. The revised version now includes results averaged over three random seeds with standard deviations for all main tables. We have also inserted a new subsection that directly compares (i) full KV cache, (ii) our compressed KV cache, and (iii) baseline compression methods, all using the identical semi-hierarchical retriever and LLM backbone. These controlled comparisons isolate the contribution of the dual-signal compression and confirm consistent gains across accuracy, memory footprint, and online inference speed.

revision: yes
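Reporting a mean with a standard deviation over a handful of seeds, as described in the response above, can be done with the Python standard library. The accuracies in this sketch are placeholder numbers for illustration only and are not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder accuracies for three seeds; NOT results from the paper.
mean, std = summarize_runs([60.0, 62.0, 61.0])
print(f"{mean:.1f} ± {std:.1f}")  # prints "61.0 ± 1.0"
```

With only three seeds the sample standard deviation is a coarse estimate, so it is best read as a sanity check on run-to-run variation rather than a tight confidence bound.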

Circularity Check

0 steps flagged

No circularity: MuKV is an algorithmic design validated on external benchmarks

The paper describes MuKV as a multi-grained KV cache compression module (patch/frame/segment extraction plus dual-signal self-attention/frequency pruning) paired with semi-hierarchical retrieval. All performance claims (accuracy gains, memory/efficiency wins) are presented as empirical outcomes measured on long-streaming VideoQA benchmarks rather than as first-principles derivations or predictions. No equations reduce a result to a fitted parameter by construction, no load-bearing self-citations justify uniqueness, and no ansatz is smuggled in. The central mechanism is an explicit design choice whose correctness is tested externally, satisfying the criteria for a self-contained, non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that multi-level visual token compression can be performed without losing task-critical information and that the semi-hierarchical retriever can locate relevant caches at low cost. No explicit free parameters or invented entities are named in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

dual signal token compression mechanism guided by self-attention and frequency... $I_{att} = \frac{1}{H \cdot P} \sum A^{(L)}$ ... $I_{fft} = \mathrm{Mean}(Z_{fft})$ ... $I_{ft} = \alpha I_{att} + (1-\alpha) I_{fft}$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 17 internal anchors

[1]

Flamingo: a visual language model for few-shot learning.NeurIPS, 35: 23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.NeurIPS, 35: 23716–23736, 2022. 1

work page 2022
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 1, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Memory-efficient streaming videollms for real-time procedural video understanding.arXiv preprint arXiv:2504.13915, 2025

Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bu- gra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Ci- han Camg ˜Ak ¸z, Shreyas Hampali, Eric Sauser, Shugao Ma, et al. Memory-efficient streaming videollms for real-time procedural video understanding.arXiv preprint arXiv:2504.13915, 2025. 1

work page arXiv 2025
[5]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024. 3

work page 2024
[6]

Stream- ingtom: Streaming token compression for efficient video un- derstanding.arXiv preprint arXiv:2510.18269, 2025

Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. Stream- ingtom: Streaming token compression for efficient video un- derstanding.arXiv preprint arXiv:2510.18269, 2025. 2

work page arXiv 2025
[7]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yong- hao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 2

work page 2023
[8]

An algorithm for the machine calculation of complex fourier series.Mathematics of computation, 19(90):297–301, 1965

James W Cooley and John W Tukey. An algorithm for the machine calculation of complex fourier series.Mathematics of computation, 19(90):297–301, 1965. 4

work page 1965
[9]

Streaming video question-answering with in-context video kv-cache retrieval

Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang, et al. Streaming video question-answering with in-context video kv-cache retrieval. InICLR, 2025. 1, 2, 3, 5, 6, 7, 8

work page 2025
[10]

The llama 3 herd of models.arXiv e-prints, pages arXiv–2407,

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407,

work page
[11]

Videoagent: A memory-augmented mul- timodal agent for video understanding

Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory-augmented mul- timodal agent for video understanding. InECCV, pages 75–

work page
[12]

Springer, 2024. 1, 2

work page 2024
[13]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 1, 5

work page 2025
[14]

Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning.ICLR, 2025

Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning.ICLR, 2025. 3

work page 2025
[15]

Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024. 2

work page 2024
[16]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

Dongwon Jo, Jiwon Song, Yulhwa Kim, and Jae-Joon Kim. Fastkv: Kv cache compression for fast long-context pro- cessing with token-selective propagation.arXiv preprint arXiv:2502.01068, 2025. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Freqkv: Frequency domain key- value compression for efficient context window extension

Jushi Kai, Boyi Zeng, Yixuan Wang, Haoli Bai, Ziwei He, Bo Jiang, and Zhouhan Lin. Freqkv: Frequency domain key- value compression for efficient context window extension. arXiv preprint arXiv:2505.00570, 2025. 3

work page arXiv 2025
[19]

Infinipot-v: Memory-constrained kv cache compres- sion for streaming video understanding.NeurIPS, 2025

Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. Infinipot-v: Memory-constrained kv cache compres- sion for streaming video understanding.NeurIPS, 2025. 1, 2, 3, 7

work page 2025
[20]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 2, 3, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Mvbench: A comprehensive multi-modal video understand- ing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 1

work page 2024
[22]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Streamingbench: Assessing the gap for mllms to achieve streaming video un- derstanding.arXiv preprint arXiv:2411.03628, 2024

Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video un- derstanding.arXiv preprint arXiv:2411.03628, 2024. 5, 8, 1

work page arXiv 2024
[24]

Freekv: Boosting kv cache retrieval for efficient llm inference.arXiv preprint arXiv:2505.13109, 2025

Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, and Jieru Zhao. Freekv: Boosting kv cache retrieval for efficient llm inference.arXiv preprint arXiv:2505.13109, 2025. 3

work page arXiv 2025
[25]

Cachegen: Kv cache com- pression and streaming for fast large language model serv- ing

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache com- pression and streaming for fast large language model serv- ing. InProceedings of the ACM SIGCOMM 2024 Confer- ence, pages 38–56, 2024. 2

work page 2024
[26]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Egoschema: A diagnostic benchmark for very long- form video language understanding.Advances in Neural In- formation Processing Systems, 36:46212–46244, 2023

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding.Advances in Neural In- formation Processing Systems, 36:46212–46244, 2023. 5, 1

work page 2023
[28]

Morevqa: Exploring modular reason- ing models for video question answering

Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, and Cordelia Schmid. Morevqa: Exploring modular reason- ing models for video question answering. InCVPR, pages 13235–13245, 2024. 2

work page 2024
[29]

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, and Jieru Zhao. Livevlm: Efficient online video understanding via streaming-oriented kv cache and retrieval. arXiv preprint arXiv:2505.15269, 2025. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Streaming long video understanding with large language models.NeurIPS, 37: 119336–119360, 2024

Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuan- grui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models.NeurIPS, 37: 119336–119360, 2024. 1

work page 2024
[31]

Dispider: Enabling video llms with active real-time interaction via dis- entangled perception, decision, and reaction

Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video llms with active real-time interaction via dis- entangled perception, decision, and reaction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24045–24055, 2025. 2

work page 2025
[32]

Question- answering dense video events

Hangyu Qin, Junbin Xiao, and Angela Yao. Question- answering dense video events. InSIGIR, pages 884–894,

work page
[33]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Bal- akrishnan Varadarajan, Florian Bordes, et al. Longvu: Spa- tiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Video-xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Jun- jie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169,

work page
[35]

Moviechat+: Question-aware sparse memory for long video question answering.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 2025

Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, and Gaoang Wang. Moviechat+: Question-aware sparse memory for long video question answering.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 2025. 2

work page 2025
[36]

Razorattention: Ef- ficient kv cache compression through retrieval heads.ICLR,

Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, and Gongyi Wang. Razorattention: Ef- ficient kv cache compression through retrieval heads.ICLR,

work page
[37]

Fier: Fine-grained and efficient kv cache retrieval for long-context llm inference.arXiv preprint arXiv:2508.08256, 2025

Dongwei Wang, Zijie Liu, Song Wang, Yuxin Ren, Jianing Deng, Jingtong Hu, Tianlong Chen, and Huanrui Yang. Fier: Fine-grained and efficient kv cache retrieval for long-context llm inference.arXiv preprint arXiv:2508.08256, 2025. 3

work page arXiv 2025
[38]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Videoagent: Long-form video understanding with large language model as agent

Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung- Levy. Videoagent: Long-form video understanding with large language model as agent. InEuropean Conference on Computer Vision, pages 58–76. Springer, 2024. 1, 2

work page 2024
[40]

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xi- angyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2. 5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Videochat-a1: Thinking with long videos by chain-of-shot reasoning.arXiv preprint arXiv:2506.06097, 2025

Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, and Yali Wang. Videochat-a1: Thinking with long videos by chain-of-shot reasoning.arXiv preprint arXiv:2506.06097, 2025. 1, 2

work page arXiv 2025
[42]

Videotree: Adaptive tree-based video representation for llm reasoning on long videos

Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. Videotree: Adaptive tree-based video representation for llm reasoning on long videos. InCVPR, pages 3272–3283, 2025. 1, 3

work page 2025
[43]

Longvlm: Efficient long video understand- ing via large language models

Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video understand- ing via large language models. InEuropean Conference on Computer Vision, pages 453–470. Springer, 2024. 1, 2

work page 2024
[44]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InThe Twelfth International Conference on Learning Representations, 2024. 2

[45] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021.

[46] Junbin Xiao, Nanxin Huang, Hangyu Qin, Dongyang Li, Yicong Li, Fengbin Zhu, Zhulin Tao, Jianxing Yu, Liang Lin, Tat-Seng Chua, and Angela Yao. Videoqa in the era of llms: An empirical study. International Journal of Computer Vision, 133(7):3970–3993, 2025.

[47] Junbin Xiao, Qingyun Li, Yusen Yang, Liang Qiu, and Angela Yao. Unleashing the power of llms for medical video answer localization. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 669–679. Springer, 2025.

[48] Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, pages 1645–1653, 2017.

[49] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024.

[50] Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608, 2025.

[51] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[52] Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, and Mengye Ren. Streammem: Query-agnostic kv cache memory for streaming video understanding. arXiv preprint arXiv:2508.15717.

[53] Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019.

[54] Andy Zeng, Maria Attarian, Krzysztof Marcin Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael S Ryoo, Vikas Sindhwani, Johnny Lee, et al. Socratic models: Composing zero-shot multimodal reasoning with language. In ICLR.

[55] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.

[56] Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple llm framework for long-range video question-answering. In EMNLP, pages 21715–21737, 2024.

[57] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.

[58] Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. ICCV.

[59] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.

[60] Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13691–13701, 2025.

What is the person holding right now?

Dataset Introduction

VStream-QA [57] comprises two long-video datasets: RVS-Ego and RVS-Movie. RVS-Ego contains 10 egocentric videos with an average duration of 30 minutes, while RVS-Movie includes 22 movie videos averaging 1 hour. The distributions of the temporal answer spans and their ratios relative to the question timestamps of both datasets are pre...

Experiments

7.1. Offline VideoQA and Different Backbones

We also extend our method MuKV to the popular offline long VideoQA datasets: Video-MME [12], MLVU [59] and ...

[Figure: histograms of the number of questions over Time Interval (min), with bins 0-3, 3-6, 6-9, 9-12, 12-15 and >15, and over Time Ratio, with bins from 0-0.1 to 0.9-1.]
