
arxiv: 2601.14724 · v4 · submitted 2026-01-21 · 💻 cs.CV · cs.AI · cs.CL

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Pith reviewed 2026-05-16 12:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords streaming video understanding, KV cache, hierarchical memory, multimodal large language models, real-time inference, training-free method, attention mechanism, token reduction

The pith

HERMES structures the KV cache as a hierarchical memory to support real-time streaming video understanding in multimodal models without any training or query-time overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HERMES as a training-free architecture that reuses a compact KV cache organized by levels of video detail to handle continuous video streams. It draws from an analysis of attention behavior to keep necessary information across coarse and fine scales while discarding redundant tokens. This setup removes the need for extra computations when a user query arrives, which directly cuts the time to first token by a factor of ten relative to earlier methods. Accuracy stays the same or improves on standard benchmarks even after dropping up to 68 percent of the video tokens, with the largest gains appearing on streaming-specific tests.

Core claim

HERMES conceptualizes the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference the system reuses a compact cache to deliver efficient streaming understanding under tight resource limits. No auxiliary computations are performed upon query arrival, which guarantees real-time responses and yields a tenfold reduction in time-to-first-token compared with prior state-of-the-art approaches. Even with up to 68 percent fewer video tokens than uniform sampling, the method matches or exceeds accuracy on all tested benchmarks and improves results by as much as 11.4 percent on streaming datasets.

What carries the argument

The hierarchical KV cache memory framework, which organizes retained video tokens at multiple granularities according to observed attention patterns and allows direct reuse during inference.
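
To make the idea of a multi-level cache concrete, here is a minimal sketch. It is not the paper's algorithm: this summary gives no selection rules, so the three levels, their capacities, the `score` field (a stand-in for an attention-derived importance value), and the "keep the better-scoring half on demotion" rule are all assumptions made for illustration. The one property it does mirror is that all rebalancing happens when a frame is ingested, so a query only reads the cache.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Entry:
    frame: int    # index of the frame this cached token came from
    score: float  # hypothetical attention-derived importance


@dataclass
class HierarchicalCache:
    # Per-level capacities, finest (most recent) level first. Hypothetical values.
    capacities: tuple = (8, 4, 2)
    levels: list = field(default_factory=list)

    def __post_init__(self):
        self.levels = [deque() for _ in self.capacities]

    def add_frame(self, frame, scores):
        """Ingest one frame's tokens and rebalance immediately."""
        self.levels[0].extend(Entry(frame, s) for s in scores)
        for i, cap in enumerate(self.capacities):
            level = self.levels[i]
            overflow = []
            while len(level) > cap:
                overflow.append(level.popleft())  # oldest entries leave first
            if not overflow:
                break  # coarser levels were not touched
            if i + 1 < len(self.levels):
                # Assumed rule: demote only the better-scoring half.
                overflow.sort(key=lambda e: e.score, reverse=True)
                keep = overflow[: max(1, len(overflow) // 2)]
                keep.sort(key=lambda e: e.frame)
                self.levels[i + 1].extend(keep)
            # Overflow from the coarsest level is simply dropped.

    def snapshot(self):
        """Cache contents a query would read, oldest (coarsest) first."""
        return [e for level in reversed(self.levels) for e in level]


cache = HierarchicalCache()
for f in range(20):
    cache.add_frame(f, [0.1, 0.9, 0.5, 0.3])
```

In this toy version recent frames are kept densely, older frames survive only through their highest-scoring tokens, and the total size never exceeds the sum of the level capacities.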

If this is right

  • Real-time responses become possible for continuous video streams because no extra work is needed when a query arrives.
  • Memory footprint shrinks substantially while accuracy on streaming tasks rises by up to 11.4 percent.
  • Up to 68 percent of video tokens can be dropped relative to uniform sampling without loss of performance on standard benchmarks.
  • The same compact cache supports repeated queries on the same stream without retraining or auxiliary passes.
  • Deployment on resource-limited hardware becomes feasible for live video understanding.
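
A back-of-the-envelope calculation shows why dropping tokens shrinks memory. The KV cache stores one key and one value vector per layer, per KV head, per cached token, so its size grows linearly with the token count. The model shape and the 16,000-token baseline below are hypothetical numbers picked for illustration; only the 68 percent figure comes from the paper's claim, and the paper states it relative to uniform sampling rather than as a cache size.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed for keys plus values of `tokens` cached positions."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Hypothetical decoder shape (fp16 values) and token budget.
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128
baseline_tokens = 16_000
reduced_tokens = baseline_tokens * (100 - 68) // 100  # 68% fewer tokens

baseline_mib = kv_cache_bytes(baseline_tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**20
reduced_mib = kv_cache_bytes(reduced_tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**20
print(f"{baseline_mib:.0f} MiB -> {reduced_mib:.0f} MiB")
```

Under these assumed dimensions the cache falls from 875 MiB to 280 MiB, the same 68 percent saving as the token count, because every other factor in the product is fixed by the model.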

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hierarchical cache organization could be tested on streaming audio or sensor sequences to see whether attention patterns support similar compression.
  • Longer video contexts might become practical if the hierarchy is allowed to grow dynamically rather than staying fixed.
  • Integration with existing video encoders could reduce the need for separate compression stages in multimodal pipelines.
  • The approach implies that attention statistics alone may suffice for temporal compression in other large models beyond video.

Load-bearing premise

The mechanistic attention investigation supplies a reliable way to group video information into stable hierarchical levels that keep every critical detail needed for later queries.

What would settle it

A direct comparison on a streaming video benchmark in which accuracy falls below the uniform-sampling baseline once token count is reduced by 68 percent, or in which measured time-to-first-token fails to show a tenfold improvement.
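
The test described above can be written as a small predicate. This is a sketch of the decision rule only; the function name, the argument names, and the optional `tolerance` for "comparable" accuracy are assumptions, and any numbers passed in would have to come from an actual benchmark run.

```python
def claim_survives(acc_method, acc_uniform, ttft_method_ms, ttft_prior_ms,
                   required_speedup=10.0, tolerance=0.0):
    """Return True if both headline claims hold for one benchmark.

    acc_method:     accuracy with the reduced (68% fewer) token budget
    acc_uniform:    accuracy of the uniform-sampling baseline
    ttft_method_ms: measured time-to-first-token of the method
    ttft_prior_ms:  measured time-to-first-token of the prior approach
    """
    accuracy_holds = acc_method >= acc_uniform - tolerance
    speedup = ttft_prior_ms / ttft_method_ms
    return accuracy_holds and speedup >= required_speedup
```

A single benchmark on which this returns False, measured under the same hardware and token budget as the paper, would be enough to undercut the corresponding claim.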

Original abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions, which achieves 10× faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes HERMES, a training-free architecture for real-time streaming video understanding in MLLMs. It conceptualizes the KV cache as a hierarchical memory framework derived from a mechanistic attention investigation, enabling reuse of a compact cache that encapsulates video information across multiple granularities. This yields 10× faster TTFT than prior SOTA, with up to 68% video token reduction versus uniform sampling while achieving superior or comparable accuracy (gains up to 11.4% on streaming datasets) and no auxiliary computations on query arrival.

Significance. If the central claims hold, HERMES could meaningfully advance efficient inference for streaming multimodal models by reducing memory footprint and latency without retraining. The training-free hierarchical KV cache approach, if mechanistically justified and stable, would be a practical contribution to real-time video MLLM deployment and could inspire similar memory hierarchies in other long-context settings.

major comments (3)
  1. [§3] Mechanistic attention investigation (likely §3): the manuscript provides no equations, selection rules, or algorithm for constructing the hierarchy levels or for token promotion/demotion during continuous streaming input. Without these, it is impossible to verify whether the claimed stability across granularities holds under distribution shift or is an artifact of offline statistics.
  2. [§4] Experimental section (likely §4 and tables): accuracy gains (up to 11.4%) and the 68% token reduction are reported without ablations on the number of hierarchy levels, without error bars or statistical significance tests, and without explicit comparison of how the hierarchy is maintained versus uniform sampling baselines.
  3. [Abstract, §3] Claim of zero auxiliary computation on query arrival (abstract and §3): the paper asserts the compact cache enables real-time responses with no extra work, but supplies no timing breakdown or pseudocode confirming that hierarchy maintenance itself incurs no per-frame overhead during streaming.
minor comments (2)
  1. Notation for hierarchy levels and granularity parameters is introduced without a clear table or diagram; a single figure summarizing level sizes and update rules would improve readability.
  2. [Introduction] The abstract and introduction cite prior KV-cache compression work only lightly; adding 2–3 key references (e.g., recent streaming or memory-efficient attention papers) would better situate the contribution.
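
Major comment 2 asks for error bars and significance tests. As a hedged illustration of what that could look like, the sketch below reports mean and sample standard deviation over seeds and runs an exact paired sign-flip permutation test. The accuracy values are made up for the example and are not results from the paper; the function names are likewise hypothetical.

```python
from itertools import product
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return mean(scores), stdev(scores)


def paired_sign_flip_p(a, b):
    """Exact two-sided paired permutation test on per-seed differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Under the null hypothesis each difference is equally likely to flip sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


# Made-up per-seed accuracies, for illustration only.
hierarchical = [61.2, 60.8, 61.5, 60.9, 61.1]
uniform = [60.0, 60.1, 59.8, 60.2, 59.9]

m, s = summarize(hierarchical)
p = paired_sign_flip_p(hierarchical, uniform)
print(f"hierarchical: {m:.2f} ± {s:.2f}, p = {p}")
```

Note that with five seeds the smallest two-sided p-value this test can produce is 2/32 = 0.0625, so a revision aiming at the usual 0.05 threshold would need at least six paired runs.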

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We address each of the major comments point-by-point below, providing clarifications and indicating revisions where necessary to strengthen the manuscript.

  1. Referee: [§3] Mechanistic attention investigation (likely §3): the manuscript provides no equations, selection rules, or algorithm for constructing the hierarchy levels or for token promotion/demotion during continuous streaming input. Without these, it is impossible to verify whether the claimed stability across granularities holds under distribution shift or is an artifact of offline statistics.

    Authors: We appreciate this observation. Section 3 presents the mechanistic attention investigation that motivates the hierarchical KV cache design, detailing how attention distributions across video frames inform the multi-granularity structure. To address the concern, we will include explicit equations for attention-based token selection, the rules for constructing hierarchy levels, and an algorithm for promotion/demotion in the streaming setting. This will be added to §3 and the appendix to allow verification of stability. revision: yes

  2. Referee: [§4] Experimental section (likely §4 and tables): accuracy gains (up to 11.4%) and the 68% token reduction are reported without ablations on the number of hierarchy levels, without error bars or statistical significance tests, and without explicit comparison of how the hierarchy is maintained versus uniform sampling baselines.

    Authors: We agree that additional ablations would strengthen the results. In the revision, we will add an ablation study on the number of hierarchy levels and include error bars from repeated runs with different seeds. We will also perform and report statistical significance tests for the accuracy gains. The comparison to uniform sampling is presented in the main tables, but we will expand the discussion in §4 to explicitly describe how the hierarchical maintenance differs from uniform sampling and why it leads to better performance. revision: yes

  3. Referee: [Abstract, §3] Claim of zero auxiliary computation on query arrival (abstract and §3): the paper asserts the compact cache enables real-time responses with no extra work, but supplies no timing breakdown or pseudocode confirming that hierarchy maintenance itself incurs no per-frame overhead during streaming.

    Authors: The claim refers specifically to no auxiliary computations triggered by the user query arrival, as the hierarchy is maintained incrementally during streaming. However, we acknowledge the need for more detail. We will provide a timing breakdown in the experimental section and include pseudocode in the appendix showing the per-frame streaming process, demonstrating that maintenance overhead is constant and does not affect query-time latency. revision: yes
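
The timing breakdown promised in the third response could be collected with a harness along these lines. This is a sketch under assumptions: `ingest` and `first_token` are hypothetical callables standing in for the model's per-frame cache update and its query path, and the harness only separates the two costs; it does not reproduce any measurement from the paper.

```python
import time


def profile_stream(frames, ingest, first_token, query):
    """Time per-frame maintenance separately from query-time latency.

    ingest(frame):      hypothetical per-frame cache update
    first_token(query): hypothetical call that returns the first output token
    """
    per_frame_s = []
    for frame in frames:
        t0 = time.perf_counter()
        ingest(frame)
        per_frame_s.append(time.perf_counter() - t0)

    t0 = time.perf_counter()
    token = first_token(query)
    ttft_s = time.perf_counter() - t0
    return per_frame_s, ttft_s, token


# Toy stand-ins so the harness can be run end to end.
seen = []
per_frame_s, ttft_s, token = profile_stream(
    range(100), seen.append, lambda q: q.split()[0], "what happened last?"
)
```

Plotting `per_frame_s` against the frame index would show whether maintenance cost stays flat as the stream grows, and comparing `ttft_s` across stream lengths would show whether query latency is independent of how much video has been seen.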

Circularity Check

0 steps flagged

No significant circularity; hierarchy motivated by investigation but results are empirically benchmarked

full rationale

The paper motivates the hierarchical KV cache from a mechanistic attention investigation described in the manuscript, then evaluates the resulting training-free system on streaming and offline benchmarks. No equations, fitted parameters, or self-citations reduce the reported TTFT gains or accuracy improvements to the investigation outputs by construction. The central claims rest on external empirical measurements rather than self-referential definitions or renamed patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that attention patterns naturally support a stable multi-granularity memory organization that can be reused without auxiliary computation or information loss; no free parameters or invented entities are explicitly named in the abstract.

axioms (1)
  • domain assumption: Mechanistic attention patterns allow KV cache to be organized into stable hierarchical levels that preserve video semantics across granularities
    Invoked to justify the training-free reuse of a compact cache for streaming inputs
invented entities (1)
  • Hierarchical memory framework for KV cache (no independent evidence)
    purpose: Encapsulate video information at multiple granularities for efficient streaming reuse
    Conceptualized from the attention investigation; no independent falsifiable handle supplied in the abstract

pith-pipeline@v0.9.0 · 5497 in / 1289 out tokens · 34815 ms · 2026-05-16T12:52:26.669195+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.

  2. VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
