HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
Pith reviewed 2026-05-16 12:52 UTC · model grok-4.3
The pith
HERMES structures the KV cache as a hierarchical memory to support real-time streaming video understanding in multimodal models without any training or query-time overhead.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HERMES conceptualizes the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference the system reuses a compact cache to deliver efficient streaming understanding under tight resource limits. No auxiliary computations are performed upon query arrival, which guarantees real-time responses and yields a tenfold reduction in time-to-first-token compared with prior state-of-the-art approaches. Even with up to 68 percent fewer video tokens than uniform sampling, the method matches or exceeds accuracy on all tested benchmarks and improves results by as much as 11.4 percent on streaming datasets.
What carries the argument
The hierarchical KV cache memory framework, which organizes retained video tokens at multiple granularities according to observed attention patterns and allows direct reuse during inference.
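To make granularity-dependent retention concrete, here is a minimal Python sketch. It follows the layer roles named in the paper passage excerpted further down this page: recency-driven forgetting in shallow layers, attention magnitude in deep layers, and an interpolation in between. The function names, the linear mixing weight `alpha`, and the `half_life` parameter are illustrative assumptions, not the authors' selection rule, which the referee report notes is not given in the manuscript.

```python
import math


def retention_scores(ages, attn, layer, num_layers, half_life=32.0):
    """Score cached tokens for one layer; a higher score means more worth keeping.

    ages: frames elapsed since each token was cached (0 = newest).
    attn: accumulated attention mass per token, any non-negative scale.
    """
    # Hypothetical mixing weight: 0 at the shallowest layer, 1 at the deepest.
    alpha = layer / (num_layers - 1) if num_layers > 1 else 1.0
    peak = max(attn) or 1.0  # normalise attention to [0, 1]; guards all-zero input
    scores = []
    for age, a in zip(ages, attn):
        # Exponential forgetting curve: the score halves every `half_life` frames.
        recency = math.exp(-math.log(2) * age / half_life)
        scores.append((1 - alpha) * recency + alpha * (a / peak))
    return scores


def keep_top(scores, budget):
    """Indices of the `budget` highest-scoring tokens, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

Under these assumptions, the shallowest layer of a three-layer toy model keeps the newest token, while the deepest layer keeps the most-attended token regardless of its age.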
If this is right
- Real-time responses become possible for continuous video streams because no extra work is needed when a query arrives.
- Memory footprint shrinks substantially while accuracy on streaming tasks rises by up to 11.4 percent.
- Up to 68 percent of video tokens can be dropped relative to uniform sampling without loss of performance on standard benchmarks.
- The same compact cache supports repeated queries on the same stream without retraining or auxiliary passes.
- Deployment on resource-limited hardware becomes feasible for live video understanding.
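The first and fourth consequences above rest on one design property: all cache compaction is paid for while frames arrive, so a query only reads. A toy sketch of that property follows. The `StreamingCache` class, its scalar `score` input, and the evict-the-weakest rule are hypothetical stand-ins, not the HERMES implementation.

```python
from dataclasses import dataclass, field


@dataclass
class StreamingCache:
    """Toy fixed-budget cache: all compaction happens while frames arrive."""

    budget: int
    entries: list = field(default_factory=list)  # (frame_index, score, payload)
    maintenance_ops: int = 0  # counts evictions; only ingest() increments it

    def ingest(self, frame_index, score, payload):
        """Add one frame's entry, then evict the weakest entries if over budget."""
        self.entries.append((frame_index, score, payload))
        while len(self.entries) > self.budget:
            weakest = min(range(len(self.entries)), key=lambda i: self.entries[i][1])
            self.entries.pop(weakest)
            self.maintenance_ops += 1

    def query(self):
        """Return retained payloads in temporal order; no scoring, no eviction."""
        return [payload for _, _, payload in self.entries]
```

Because `query()` never mutates the cache, calling it repeatedly on the same stream returns the same retained entries and adds nothing to `maintenance_ops`.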
Where Pith is reading between the lines
- The same hierarchical cache organization could be tested on streaming audio or sensor sequences to see whether attention patterns support similar compression.
- Longer video contexts might become practical if the hierarchy is allowed to grow dynamically rather than staying fixed.
- Integration with existing video encoders could reduce the need for separate compression stages in multimodal pipelines.
- The approach implies that attention statistics alone may suffice for temporal compression in other large models beyond video.
Load-bearing premise
The mechanistic attention investigation supplies a reliable way to group video information into stable hierarchical levels that keep every critical detail needed for later queries.
What would settle it
A direct comparison on a streaming video benchmark in which accuracy falls below the uniform-sampling baseline once token count is reduced by 68 percent, or in which measured time-to-first-token fails to show a tenfold improvement.
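The test described above reduces to three numbers: a token-reduction ratio, a time-to-first-token ratio, and an accuracy difference. The sketch below spells out that arithmetic. The function name, the `tolerance` parameter (a reading of "comparable accuracy"), and every input value in the usage note are hypothetical; none are measurements from the paper.

```python
def settle_check(acc_method, acc_uniform, baseline_tokens, method_tokens,
                 baseline_ttft_s, method_ttft_s,
                 min_reduction=0.68, min_speedup=10.0, tolerance=0.0):
    """Evaluate the headline claims against one set of measurements."""
    # Fraction of video tokens removed relative to uniform sampling.
    reduction = 1.0 - method_tokens / baseline_tokens
    # How many times faster the first token arrives than with the baseline.
    speedup = baseline_ttft_s / method_ttft_s
    reduction_reached = reduction >= min_reduction - 1e-9  # absorb float rounding
    accuracy_holds = acc_method + tolerance >= acc_uniform
    speedup_holds = speedup >= min_speedup
    return {
        "reduction": reduction,
        "speedup": speedup,
        "reduction_reached": reduction_reached,
        "accuracy_holds": accuracy_holds,
        "speedup_holds": speedup_holds,
        "all_claims_hold": reduction_reached and accuracy_holds and speedup_holds,
    }
```

For example, with made-up inputs of 10,000 baseline tokens, 3,000 retained tokens, a 2.0 s baseline TTFT and a 0.15 s method TTFT, the reduction is 70 percent and the speedup is about 13.3 times, so both thresholds are met. The accuracy comparison then decides the outcome.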
Original abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions, which achieves 10$\times$ faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HERMES, a training-free architecture for real-time streaming video understanding in MLLMs. It conceptualizes the KV cache as a hierarchical memory framework derived from a mechanistic attention investigation, enabling reuse of a compact cache that encapsulates video information across multiple granularities. This yields 10× faster TTFT than prior SOTA, with up to 68% video token reduction versus uniform sampling while achieving superior or comparable accuracy (gains up to 11.4% on streaming datasets) and no auxiliary computations on query arrival.
Significance. If the central claims hold, HERMES could meaningfully advance efficient inference for streaming multimodal models by reducing memory footprint and latency without retraining. The training-free hierarchical KV cache approach, if mechanistically justified and stable, would be a practical contribution to real-time video MLLM deployment and could inspire similar memory hierarchies in other long-context settings.
major comments (3)
- [§3] Mechanistic attention investigation (likely §3): the manuscript provides no equations, selection rules, or algorithm for constructing the hierarchy levels or for token promotion/demotion during continuous streaming input. Without these, it is impossible to verify whether the claimed stability across granularities holds under distribution shift or is an artifact of offline statistics.
- [§4] Experimental section (likely §4 and tables): accuracy gains (up to 11.4%) and the 68% token reduction are reported without ablations on the number of hierarchy levels, without error bars or statistical significance tests, and without explicit comparison of how the hierarchy is maintained versus uniform sampling baselines.
- [Abstract, §3] Claim of zero auxiliary computation on query arrival (abstract and §3): the paper asserts the compact cache enables real-time responses with no extra work, but supplies no timing breakdown or pseudocode confirming that hierarchy maintenance itself incurs no per-frame overhead during streaming.
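The second major comment asks for error bars and significance tests. A minimal sketch of what such reporting could look like for per-seed accuracy pairs is shown here, using only the Python standard library. The exact sign-flip permutation test is one reasonable choice among several, and the function names are illustrative; nothing here is taken from the paper's evaluation code.

```python
import itertools
import statistics


def mean_and_std(values):
    """Mean and sample standard deviation across seeds (needs two or more values)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_sign_flip_pvalue(method_scores, baseline_scores):
    """Exact two-sided sign-flip permutation test on per-seed paired differences."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to be +d or -d,
    # so enumerate every sign assignment and count those at least as extreme.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total
```

One design consequence is worth noting: with five seeds the smallest attainable two-sided p-value is 2/32 = 0.0625, so at least six paired runs are needed before this test can fall below 0.05.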
minor comments (2)
- Notation for hierarchy levels and granularity parameters is introduced without a clear table or diagram; a single figure summarizing level sizes and update rules would improve readability.
- [Introduction] The abstract and introduction cite prior KV-cache compression work only lightly; adding 2–3 key references (e.g., recent streaming or memory-efficient attention papers) would better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our work. We address each of the major comments point-by-point below, providing clarifications and indicating revisions where necessary to strengthen the manuscript.
Point-by-point responses
-
Referee: [§3] Mechanistic attention investigation (likely §3): the manuscript provides no equations, selection rules, or algorithm for constructing the hierarchy levels or for token promotion/demotion during continuous streaming input. Without these, it is impossible to verify whether the claimed stability across granularities holds under distribution shift or is an artifact of offline statistics.
Authors: We appreciate this observation. Section 3 presents the mechanistic attention investigation that motivates the hierarchical KV cache design, detailing how attention distributions across video frames inform the multi-granularity structure. To address the concern, we will include explicit equations for attention-based token selection, the rules for constructing hierarchy levels, and an algorithm for promotion/demotion in the streaming setting. This will be added to §3 and the appendix to allow verification of stability. revision: yes
-
Referee: [§4] Experimental section (likely §4 and tables): accuracy gains (up to 11.4%) and the 68% token reduction are reported without ablations on the number of hierarchy levels, without error bars or statistical significance tests, and without explicit comparison of how the hierarchy is maintained versus uniform sampling baselines.
Authors: We agree that additional ablations would strengthen the results. In the revision, we will add an ablation study on the number of hierarchy levels and include error bars from repeated runs with different seeds. We will also perform and report statistical significance tests for the accuracy gains. The comparison to uniform sampling is presented in the main tables, but we will expand the discussion in §4 to explicitly describe how the hierarchical maintenance differs from uniform sampling and why it leads to better performance. revision: yes
-
Referee: [Abstract, §3] Claim of zero auxiliary computation on query arrival (abstract and §3): the paper asserts the compact cache enables real-time responses with no extra work, but supplies no timing breakdown or pseudocode confirming that hierarchy maintenance itself incurs no per-frame overhead during streaming.
Authors: The claim refers specifically to no auxiliary computations triggered by the user query arrival, as the hierarchy is maintained incrementally during streaming. However, we acknowledge the need for more detail. We will provide a timing breakdown in the experimental section and include pseudocode in the appendix showing the per-frame streaming process, demonstrating that maintenance overhead is constant and does not affect query-time latency. revision: yes
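The timing breakdown promised in the last response would need to separate two costs: work done per incoming frame and work done when a query arrives. A small harness along the following lines could do that. The `maintain` and `answer` callables are caller-supplied placeholders and the function name is hypothetical. On a GPU, the timed callables would also have to synchronise the device before the clock is read, which this sketch does not do.

```python
import time


def timing_breakdown(frames, maintain, answer, query_at):
    """Separate per-frame maintenance time from query-time latency.

    maintain(frame) runs for every frame; answer() runs after each frame whose
    index is in query_at. Both are stand-ins for the real model calls.
    """
    per_frame, per_query = [], []
    for index, frame in enumerate(frames):
        start = time.perf_counter()
        maintain(frame)
        per_frame.append(time.perf_counter() - start)
        if index in query_at:
            start = time.perf_counter()
            answer()
            per_query.append(time.perf_counter() - start)
    return {
        "frames": len(per_frame),
        "queries": len(per_query),
        "mean_frame_s": sum(per_frame) / len(per_frame) if per_frame else 0.0,
        "max_frame_s": max(per_frame, default=0.0),
        "mean_query_s": sum(per_query) / len(per_query) if per_query else 0.0,
    }
```

Reporting `max_frame_s` alongside the mean matters for the constant-overhead claim, since an occasional expensive compaction step would be hidden by an average.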
Circularity Check
No significant circularity; hierarchy motivated by investigation but results are empirically benchmarked
full rationale
The paper motivates the hierarchical KV cache from a mechanistic attention investigation described in the manuscript, then evaluates the resulting training-free system on streaming and offline benchmarks. No equations, fitted parameters, or self-citations reduce the reported TTFT gains or accuracy improvements to the investigation outputs by construction. The central claims rest on external empirical measurements rather than self-referential definitions or renamed patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Mechanistic attention patterns allow the KV cache to be organized into stable hierarchical levels that preserve video semantics across granularities.
invented entities (1)
- Hierarchical memory framework for KV cache: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities... Shallow Layers: ... exponential forgetting curve ... Deep Layers: ... attention magnitude ... Middle Layers: ... interpolating recency and attention
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints... no auxiliary computations upon the arrival of user queries
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
  SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
- VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
  Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.