MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
Pith reviewed 2026-05-22 07:12 UTC · model grok-4.3
The pith
MuKV compresses KV caches at patch, frame and segment levels to raise accuracy in long streaming video question answering while holding memory and speed steady.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MuKV extracts visual representations at patch-, frame-, and segment-levels for the offline KV cache, applies a dual signal token compression mechanism guided by self-attention and frequency to reduce redundancy, and employs a semi-hierarchical retrieval method during online QA; experiments on long-streaming VideoQA benchmarks demonstrate that this combination raises answer accuracy without increasing memory usage or lowering online efficiency, and that the compression step by itself delivers consistent gains across all three measures.
What carries the argument
Multi-grained KV cache compression module that extracts and compresses representations at patch, frame, and segment levels using self-attention and frequency signals, paired with semi-hierarchical retrieval for online use.
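To picture what "patch, frame, and segment levels" could mean in practice, here is a minimal sketch. It assumes each frame arrives as a list of patch vectors and that coarser levels are built by mean pooling; the pooling choice, the `segment_len` parameter, and the function names are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def multi_grained(
    frames: List[List[Vector]], segment_len: int
) -> Tuple[List[Vector], List[Vector], List[Vector]]:
    """Build three granularities from per-frame patch vectors (assumed scheme)."""
    # Patch level: every patch vector, flattened across frames.
    patch_level = [patch for frame in frames for patch in frame]
    # Frame level: one pooled vector per frame.
    frame_level = [mean_pool(frame) for frame in frames]
    # Segment level: one pooled vector per run of `segment_len` frames.
    segment_level = [
        mean_pool(frame_level[i:i + segment_len])
        for i in range(0, len(frame_level), segment_len)
    ]
    return patch_level, frame_level, segment_level


# Two frames, two patches each, two dimensions per patch (toy data).
frames = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[0.0, 4.0], [2.0, 0.0]],
]
patches, per_frame, per_segment = multi_grained(frames, segment_len=2)
```

The finer levels keep spatial detail and the coarser ones summarize temporal context, which is the trade-off the review describes; how MuKV actually derives each level may differ.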
If this is right
- Answer accuracy rises on long-streaming VideoQA benchmarks while memory stays at or below the level of caching every frame or two.
- Online question-answering latency remains comparable to or better than prior KV-cache methods.
- The compression step alone produces measurable gains in accuracy, memory, and efficiency even when the rest of the pipeline is unchanged.
- Local patch-level cues and segment-level temporal context are both available for retrieval without storing every token.
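The retrieval side of these bullets can be pictured as a coarse-to-fine lookup. The source does not spell out how the semi-hierarchical retriever works, so the sketch below shows only a generic two-stage pattern: rank segments by similarity to the query, then rank frames inside the surviving segments. The data layout, the dot-product scoring, and the names `coarse_to_fine`, `top_segments`, and `top_frames` are all assumptions made for illustration.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product used as a stand-in similarity score."""
    return sum(x * y for x, y in zip(a, b))


def coarse_to_fine(query, segments, top_segments, top_frames):
    """Pick frames in two stages: best segments first, then best frames within them.

    `segments` is a list of dicts with a segment-level "key" vector and a
    "frames" list of (frame_id, frame_vector) pairs (hypothetical layout).
    """
    # Stage 1: rank segments by similarity between the query and the segment key.
    seg_order = sorted(
        range(len(segments)),
        key=lambda i: dot(query, segments[i]["key"]),
        reverse=True,
    )
    # Stage 2: only frames from the kept segments compete for the final slots.
    candidates = [
        frame for i in seg_order[:top_segments] for frame in segments[i]["frames"]
    ]
    candidates.sort(key=lambda frame: dot(query, frame[1]), reverse=True)
    return [frame_id for frame_id, _ in candidates[:top_frames]]


segments = [
    {"key": [0.9, 0.1], "frames": [("a0", [1.0, 0.0]), ("a1", [0.2, 0.8])]},
    {"key": [0.1, 0.9], "frames": [("b0", [0.95, 0.0]), ("b1", [0.0, 1.0])]},
]
narrow = coarse_to_fine([1.0, 0.0], segments, top_segments=1, top_frames=2)
wide = coarse_to_fine([1.0, 0.0], segments, top_segments=2, top_frames=2)
```

In the toy data, frame `b0` matches the query well but is only reachable when its segment survives the first stage. That is the kind of loss the load-bearing premise below is concerned with.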
Where Pith is reading between the lines
- The same multi-level compression pattern could be tested on long audio or multimodal streams where token counts also grow rapidly.
- Adding one more hierarchy level for entire video chapters might allow still longer contexts without further memory growth.
- The method could be combined with existing token-pruning techniques inside the language model itself to produce additive savings.
- Deployment on edge devices would benefit if the offline compression can run once and the retrieval stays lightweight.
Load-bearing premise
Compressing visual tokens at multiple granularity levels will keep both local spatial details and global temporal context intact enough that retrieval still supplies the information needed for correct answers.
What would settle it
Measuring answer accuracy on the same long-streaming VideoQA benchmarks and finding that MuKV scores lower than an uncompressed full-frame KV cache baseline.
Original abstract
Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) of the historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frame or temporal contexts across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at patch-, frame-, and segment-levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression mechanism guided by self-attention and frequency. For online QA, MuKV designs a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long-streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy, without sacrificing memory and online QA efficiency. Moreover, our compression mechanism alone brings consistent benefits across answer accuracy, memory, and QA efficiency over baselines, showcasing highly effective contribution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MuKV, a multi-grained KV cache compression method for long streaming VideoQA. It extracts visual KV representations at patch-, frame-, and segment-levels to preserve local spatial cues and global temporal context, applies dual-signal token compression guided by self-attention and frequency signals, and employs a semi-hierarchical retriever for online QA. Experiments on long-streaming VideoQA benchmarks report significant accuracy gains without increased memory or reduced efficiency, with the compression mechanism alone claimed to deliver consistent benefits across all three metrics.
Significance. If the empirical claims hold under fair baselines, this approach could meaningfully advance practical deployment of LLM-based video QA in streaming settings by addressing KV cache growth. The multi-grained design is a reasonable attempt to balance detail retention with compression, and the dual-signal guidance is a concrete algorithmic contribution. Reproducible benchmark results would strengthen the case for adoption in resource-constrained multimodal systems.
major comments (2)
- [Method (compression and retrieval subsections)] The central claim that the compression mechanism alone yields consistent accuracy, memory, and efficiency gains rests on the untested assumption that self-attention plus frequency guidance discards only irrelevant tokens. No retrieval-precision metrics, per-question-type error analysis, or ablation isolating the dual-signal pruning from the multi-grained extraction and semi-hierarchical retriever are described, leaving open the possibility that low-amplitude but answer-critical patterns are lost.
- [Experiments] The abstract and results assert 'consistent benefits' and 'significantly improves answer accuracy' across benchmarks, yet the manuscript provides no tables or sections reporting multiple runs, error bars, or explicit isolation of the compression component (e.g., full KV vs. compressed KV under identical retrieval). This weakens attribution of gains specifically to the proposed compression.
minor comments (1)
- [Abstract] The abstract would benefit from a brief definition of 'long streaming' (e.g., typical frame count or token length) to set expectations for the reported efficiency numbers.
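A back-of-envelope estimate helps show why a definition of "long streaming" matters for the memory claims. The sketch below uses the standard size of a transformer KV cache (keys plus values, per layer, per KV head). Every number in the example configuration is hypothetical and chosen only for illustration; none comes from the paper.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes per value corresponds to fp16/bf16
) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical setup: one hour of video at 1 frame per second,
# 196 visual tokens per frame, and a made-up model shape.
tokens_per_frame = 196
num_frames = 60 * 60
total_tokens = tokens_per_frame * num_frames

size = kv_cache_bytes(total_tokens, num_layers=28, num_kv_heads=4, head_dim=128)
print(f"{total_tokens} tokens -> {size / 2**30:.1f} GiB")
```

Because the cache grows linearly with the number of cached tokens, stating the typical frame count or token length would let readers put the reported memory figures in context.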
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below. Where the feedback identifies gaps in empirical validation, we have revised the manuscript to provide additional ablations, statistical reporting, and isolation experiments.
Point-by-point responses
-
Referee: [Method (compression and retrieval subsections)] The central claim that the compression mechanism alone yields consistent accuracy, memory, and efficiency gains rests on the untested assumption that self-attention plus frequency guidance discards only irrelevant tokens. No retrieval-precision metrics, per-question-type error analysis, or ablation isolating the dual-signal pruning from the multi-grained extraction and semi-hierarchical retriever are described, leaving open the possibility that low-amplitude but answer-critical patterns are lost.
Authors: We agree that stronger isolation of the dual-signal compression is valuable. In the revised manuscript we add a dedicated ablation that fixes the multi-grained extraction and semi-hierarchical retriever while varying only the pruning signals (self-attention only, frequency only, and both). We also report retrieval precision@K for the online stage on the long-streaming benchmarks. A full per-question-type error breakdown is not added, as it would require new human annotations outside the current experimental scope; instead we include qualitative case studies of retained versus discarded tokens in the appendix to illustrate that answer-critical content is preserved. revision: partial
-
Referee: [Experiments] The abstract and results assert 'consistent benefits' and 'significantly improves answer accuracy' across benchmarks, yet the manuscript provides no tables or sections reporting multiple runs, error bars, or explicit isolation of the compression component (e.g., full KV vs. compressed KV under identical retrieval). This weakens attribution of gains specifically to the proposed compression.
Authors: We accept this criticism. The revised version now includes results averaged over three random seeds with standard deviations for all main tables. We have also inserted a new subsection that directly compares (i) full KV cache, (ii) our compressed KV cache, and (iii) baseline compression methods, all using the identical semi-hierarchical retriever and LLM backbone. These controlled comparisons isolate the contribution of the dual-signal compression and confirm consistent gains across accuracy, memory footprint, and online inference speed. revision: yes
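The two reporting additions promised in the rebuttal, retrieval precision@K and results averaged over seeds with standard deviations, can be computed as in the following minimal sketch. The item identifiers and accuracy values are made up for illustration. The use of the sample standard deviation is an assumption, since the rebuttal does not say which estimator is used.

```python
from statistics import mean, stdev
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


def summarize_runs(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs (needs >= 2 runs)."""
    return mean(scores), stdev(scores)


# Toy retrieval result: ranked cache entries vs. the entries that hold the answer.
retrieved = ["seg3", "seg1", "seg7", "seg2"]
relevant = {"seg1", "seg2"}
p_at_2 = precision_at_k(retrieved, relevant, k=2)

# Made-up accuracies from three seeds.
seed_accuracies = [61.0, 62.0, 63.0]
avg, sd = summarize_runs(seed_accuracies)
```

With only three seeds the standard deviation is a rough indicator, so it is best read alongside the controlled full-KV versus compressed-KV comparison rather than in place of it.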
Circularity Check
No circularity: MuKV is an algorithmic design validated on external benchmarks
full rationale
The paper describes MuKV as a multi-grained KV cache compression module (patch/frame/segment extraction plus dual-signal self-attention/frequency pruning) paired with semi-hierarchical retrieval. All performance claims (accuracy gains, memory/efficiency wins) are presented as empirical outcomes measured on long-streaming VideoQA benchmarks rather than as first-principles derivations or predictions. No equations reduce a result to a fitted parameter by construction, no load-bearing self-citations justify uniqueness, and no ansatz is smuggled in. The central mechanism is an explicit design choice whose correctness is tested externally, satisfying the criteria for a self-contained, non-circular contribution.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: dual signal token compression mechanism guided by self-attention and frequency... $I_{att} = \frac{1}{H \cdot P} \sum A^{(L)}$ ... $I_{fft} = \mathrm{Mean}(Z_{fft})$ ... $I_{ft} = \alpha I_{att} + (1-\alpha) I_{fft}$
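The last formula in the quoted passage blends an attention-based score and a frequency-based score with a weight α. A minimal sketch of that blend follows, assuming both per-token scores have already been computed and are on comparable scales. Keeping the top-k tokens afterwards is an illustrative assumption about how the fused score might be used, and the function names are hypothetical.

```python
from typing import List


def fuse_scores(i_att: List[float], i_fft: List[float], alpha: float = 0.5) -> List[float]:
    """Per-token fused importance: alpha * I_att + (1 - alpha) * I_fft."""
    if len(i_att) != len(i_fft):
        raise ValueError("score lists must have the same length")
    return [alpha * a + (1.0 - alpha) * f for a, f in zip(i_att, i_fft)]


def keep_top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring tokens, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy scores for four tokens (made-up values).
i_att = [0.9, 0.1, 0.4, 0.2]
i_fft = [0.1, 0.8, 0.9, 0.1]

fused = fuse_scores(i_att, i_fft, alpha=0.5)
kept = keep_top_k(fused, k=2)
```

In the toy data, token 2 has only a middling attention score but survives because of its frequency score. That is the complementary behavior a dual-signal design is meant to provide, and the ablation requested by the referee (attention only, frequency only, both) would test whether it holds on real data.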
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022.
[2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.
[3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[4] Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Cihan Camgöz, Shreyas Hampali, Eric Sauser, Shugao Ma, et al. Memory-efficient streaming videollms for real-time procedural video understanding. arXiv preprint arXiv:2504.13915, 2025.
[5] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024.
[6] Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. Streamingtom: Streaming token compression for efficient video understanding. arXiv preprint arXiv:2510.18269, 2025.
[7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
[8] James W Cooley and John W Tukey. An algorithm for the machine calculation of complex fourier series. Mathematics of computation, 19(90):297–301, 1965.
[9] Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang, et al. Streaming video question-answering with in-context video kv-cache retrieval. In ICLR, 2025.
[10] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407,
[11] Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory-augmented multimodal agent for video understanding. In ECCV, pages 75–. Springer, 2024.
[12] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025.
[13] Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning. ICLR, 2025.
[14] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37:1270–1303, 2024.
[15] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[16] Dongwon Jo, Jiwon Song, Yulhwa Kim, and Jae-Joon Kim. Fastkv: Kv cache compression for fast long-context processing with token-selective propagation. arXiv preprint arXiv:2502.01068, 2025.
[17] Jushi Kai, Boyi Zeng, Yixuan Wang, Haoli Bai, Ziwei He, Bo Jiang, and Zhouhan Lin. Freqkv: Frequency domain key-value compression for efficient context window extension. arXiv preprint arXiv:2505.00570, 2025.
[18] Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. Infinipot-v: Memory-constrained kv cache compression for streaming video understanding. NeurIPS, 2025.
[19] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[20] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.
[21] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
[22] Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. arXiv preprint arXiv:2411.03628, 2024.
[23] Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, and Jieru Zhao. Freekv: Boosting kv cache retrieval for efficient llm inference. arXiv preprint arXiv:2505.13109, 2025.
[24] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pages 38–56, 2024.
[25] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
[26] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023.
[27] Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, and Cordelia Schmid. Morevqa: Exploring modular reasoning models for video question answering. In CVPR, pages 13235–13245, 2024.
[28] Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, and Jieru Zhao. Livevlm: Efficient online video understanding via streaming-oriented kv cache and retrieval. arXiv preprint arXiv:2505.15269, 2025.
[29] Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models. NeurIPS, 37:119336–119360, 2024.
[30] Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24045–24055, 2025.
[31] Hangyu Qin, Junbin Xiao, and Angela Yao. Question-answering dense video events. In SIGIR, pages 884–894,
[32] Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024.
[33] Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169,
[34] Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, and Gaoang Wang. Moviechat+: Question-aware sparse memory for long video question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[35] Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, and Gongyi Wang. Razorattention: Efficient kv cache compression through retrieval heads. ICLR,
[36] Dongwei Wang, Zijie Liu, Song Wang, Yuxin Ren, Jianing Deng, Jingtong Hu, Tianlong Chen, and Huanrui Yang. Fier: Fine-grained and efficient kv cache retrieval for long-context llm inference. arXiv preprint arXiv:2508.08256, 2025.
[37] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
[38] Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. In European Conference on Computer Vision, pages 58–76. Springer, 2024.
[39] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2.5: Empowering video mllms with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025.
[40] Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, and Yali Wang. Videochat-a1: Thinking with long videos by chain-of-shot reasoning. arXiv preprint arXiv:2506.06097, 2025.
[41] Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. Videotree: Adaptive tree-based video representation for llm reasoning on long videos. In CVPR, pages 3272–3283, 2025.
[42] Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video understanding via large language models. In European Conference on Computer Vision, pages 453–470. Springer, 2024.
[43] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024.
[44] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021.
[45] Junbin Xiao, Nanxin Huang, Hangyu Qin, Dongyang Li, Yicong Li, Fengbin Zhu, Zhulin Tao, Jianxing Yu, Liang Lin, Tat-Seng Chua, and Angela Yao. Videoqa in the era of llms: An empirical study. International Journal of Computer Vision, 133(7):3970–3993, 2025.
[46] Junbin Xiao, Qingyun Li, Yusen Yang, Liang Qiu, and Angela Yao. Unleashing the power of llms for medical video answer localization. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 669–679. Springer, 2025.
[47] Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, pages 1645–1653, 2017.
[48] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024.
[49] Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608, 2025.
[50] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[51] Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, and Mengye Ren. Streammem: Query-agnostic kv cache memory for streaming video understanding. arXiv preprint arXiv:2508.15717,
[52] Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019.
[53] Andy Zeng, Maria Attarian, Krzysztof Marcin Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael S Ryoo, Vikas Sindhwani, Johnny Lee, et al. Socratic models: Composing zero-shot multimodal reasoning with language. In ICLR.
[54] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.
[55] Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple llm framework for long-range video question-answering. In EMNLP, pages 21715–21737, 2024.
[56] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
[57] Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. ICCV,
[58] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.
[59] Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13691–13701, 2025.
Excerpts from the paper
"What is the person holding right now?"
Dataset Introduction. VStream-QA [57] comprises two long-video datasets: RVS-Ego and RVS-Movie. RVS-Ego contains 10 egocentric videos with an average duration of 30 minutes, while RVS-Movie includes 22 movie videos averaging 1 hour. The distributions of the temporal answer spans and their ratios relative to the question timestamps of both datasets are pre...
Experiments. 7.1. Offline VideoQA and Different Backbones. We also extend our method MuKV to the popular offline long VideoQA datasets: Video-MME [12], MLVU [59] and ...
[Figure: histograms of # Questions by Time Interval (min) and by Time Ratio]