HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Haowei Zhang; Jinlan Fu; See-kiong Ng; Shudong Yang; Xipeng Qiu

arXiv: 2601.14724 · v4 · submitted 2026-01-21 · cs.CV · cs.AI · cs.CL

Pith reviewed 2026-05-16 12:52 UTC · model grok-4.3

keywords: streaming video understanding · KV cache · hierarchical memory · multimodal large language models · real-time inference · training-free method · attention mechanism · token reduction

The pith

HERMES structures the KV cache as a hierarchical memory to support real-time streaming video understanding in multimodal models without any training or query-time overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HERMES as a training-free architecture that reuses a compact KV cache organized by levels of video detail to handle continuous video streams. It draws from an analysis of attention behavior to keep necessary information across coarse and fine scales while discarding redundant tokens. This setup removes the need for extra computations when a user query arrives, which directly cuts the time to first token by a factor of ten relative to earlier methods. Accuracy stays the same or improves on standard benchmarks even after dropping up to 68 percent of the video tokens, with the largest gains appearing on streaming-specific tests.

Core claim

HERMES conceptualizes the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference the system reuses a compact cache to deliver efficient streaming understanding under tight resource limits. No auxiliary computations are performed upon query arrival, which guarantees real-time responses and yields a tenfold reduction in time-to-first-token compared with prior state-of-the-art approaches. Even with up to 68 percent fewer video tokens than uniform sampling, the method matches or exceeds accuracy on all tested benchmarks and improves results by as much as 11.4 percent on streaming datasets.

What carries the argument

The hierarchical KV cache memory framework, which organizes retained video tokens at multiple granularities according to observed attention patterns and allows direct reuse during inference.
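
As a rough illustration of what organizing retained tokens "according to observed attention patterns" could look like, the sketch below scores cached tokens differently by layer depth. It follows only the outline in the paper passage quoted further down this page (shallow layers: exponential forgetting; deep layers: attention magnitude; middle layers: an interpolation of recency and attention). The function names, the decay rate, the linear interpolation weight, and the top-k eviction rule are all assumptions made for illustration, not the authors' algorithm.

```python
import math

def recency_score(age, decay=0.1):
    """Exponential forgetting: newer tokens (small age) score near 1. `decay` is an assumed rate."""
    return math.exp(-decay * age)

def retention_scores(ages, attn, layer, num_layers):
    """Blend recency and attention by layer depth (assumed linear interpolation).

    ages: frames elapsed since each token was cached
    attn: accumulated attention mass per token, assumed normalized to [0, 1]
    """
    # depth is 0.0 at the shallowest layer and 1.0 at the deepest
    depth = layer / (num_layers - 1) if num_layers > 1 else 0.0
    return [(1.0 - depth) * recency_score(a) + depth * w for a, w in zip(ages, attn)]

def keep_indices(scores, budget):
    """Indices of the `budget` highest-scoring tokens, returned in original order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)

ages = [30, 20, 10, 0]        # oldest token first
attn = [0.9, 0.1, 0.2, 0.3]   # the oldest token is heavily attended

shallow = keep_indices(retention_scores(ages, attn, layer=0, num_layers=3), budget=2)
deep = keep_indices(retention_scores(ages, attn, layer=2, num_layers=3), budget=2)
print(shallow)  # [2, 3]: the shallow layer keeps the two most recent tokens
print(deep)     # [0, 3]: the deep layer keeps the two most attended tokens
```

Under these assumptions the same budget retains different tokens at different depths, which is one way a single cache could hold both recent detail and long-lived anchors.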

If this is right

Real-time responses become possible for continuous video streams because no extra work is needed when a query arrives.
Memory footprint shrinks substantially while accuracy on streaming tasks rises by up to 11.4 percent.
Up to 68 percent of video tokens can be dropped relative to uniform sampling without loss of performance on standard benchmarks.
The same compact cache supports repeated queries on the same stream without retraining or auxiliary passes.
Deployment on resource-limited hardware becomes feasible for live video understanding.
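
The memory claim can be made concrete with the standard KV cache size formula: each cached token stores one key and one value vector per layer, so the footprint is 2 × layers × KV heads × head dimension × bytes per element × tokens. The sketch below applies it to a 68 percent token reduction. The model shape and the token count are hypothetical placeholders, not values from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens (factor 2 = K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# Hypothetical model shape (not taken from the paper): 28 layers, 4 KV heads, head_dim 128, fp16.
shape = dict(num_layers=28, num_kv_heads=4, head_dim=128, bytes_per_elem=2)

uniform_tokens = 10_000                      # assumed video-token count under uniform sampling
reduced_tokens = uniform_tokens * 32 // 100  # 68 percent fewer tokens

full = kv_cache_bytes(uniform_tokens, **shape)
compact = kv_cache_bytes(reduced_tokens, **shape)
print(f"uniform: {full / 2**20:.1f} MiB, compact: {compact / 2**20:.1f} MiB")
print(f"saved: {1 - compact / full:.0%}")
```

Because the formula is linear in the token count, the video-token share of the cache shrinks by the same fraction as the tokens, whatever the model shape.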

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hierarchical cache organization could be tested on streaming audio or sensor sequences to see whether attention patterns support similar compression.
Longer video contexts might become practical if the hierarchy is allowed to grow dynamically rather than staying fixed.
Integration with existing video encoders could reduce the need for separate compression stages in multimodal pipelines.
The approach implies that attention statistics alone may suffice for temporal compression in other large models beyond video.

Load-bearing premise

The mechanistic attention investigation supplies a reliable way to group video information into stable hierarchical levels that keep every critical detail needed for later queries.

What would settle it

A direct comparison on a streaming video benchmark in which accuracy falls below the uniform-sampling baseline once token count is reduced by 68 percent, or in which measured time-to-first-token fails to show a tenfold improvement.

Original abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions, which achieves 10$\times$ faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HERMES frames KV cache as hierarchical memory for streaming video, delivering 10x TTFT cuts and 68% token reduction with accuracy parity or gains, but the hierarchy construction lacks visible stability checks.

HERMES frames the KV cache as a hierarchical memory for streaming video in MLLMs. The key results are a 10x reduction in time-to-first-token and up to 68% fewer video tokens while matching or exceeding accuracy on benchmarks, including gains of up to 11.4% on streaming data.

What is new is the explicit use of attention analysis to structure the cache into levels that capture different video granularities, enabling compact reuse during continuous input without extra query-time costs. This moves beyond uniform sampling or basic compression by making the cache query-independent and training-free.

The paper does well on the empirical side, reporting consistent performance across benchmarks with clear speed advantages for real-time use cases. The main soft spot is the thin description of how the hierarchy is built and maintained. The mechanistic investigation is mentioned, but without equations, selection criteria, or tests on distribution shifts, it is difficult to assess whether the levels reliably preserve critical details over long streams. Minor concerns include the absence of error bars and limited ablations on level count.

This work is aimed at engineers and researchers building low-latency video understanding systems. Anyone dealing with continuous multimodal streams would find the efficiency claims relevant. It deserves a serious referee because the problem is practical and the approach is novel enough to test. Reviewers can verify the attention study and run additional streaming experiments. I would recommend sending this to peer review. The reported gains are substantial enough to merit closer examination, even with the current gaps in methodological detail.

Referee Report

3 major / 2 minor

Summary. The paper proposes HERMES, a training-free architecture for real-time streaming video understanding in MLLMs. It conceptualizes the KV cache as a hierarchical memory framework derived from a mechanistic attention investigation, enabling reuse of a compact cache that encapsulates video information across multiple granularities. This yields 10× faster TTFT than prior SOTA, with up to 68% video token reduction versus uniform sampling while achieving superior or comparable accuracy (gains up to 11.4% on streaming datasets) and no auxiliary computations on query arrival.

Significance. If the central claims hold, HERMES could meaningfully advance efficient inference for streaming multimodal models by reducing memory footprint and latency without retraining. The training-free hierarchical KV cache approach, if mechanistically justified and stable, would be a practical contribution to real-time video MLLM deployment and could inspire similar memory hierarchies in other long-context settings.

major comments (3)

[§3] Mechanistic attention investigation (likely §3): the manuscript provides no equations, selection rules, or algorithm for constructing the hierarchy levels or for token promotion/demotion during continuous streaming input. Without these, it is impossible to verify whether the claimed stability across granularities holds under distribution shift or is an artifact of offline statistics.
[§4] Experimental section (likely §4 and tables): accuracy gains (up to 11.4%) and the 68% token reduction are reported without ablations on the number of hierarchy levels, without error bars or statistical significance tests, and without explicit comparison of how the hierarchy is maintained versus uniform sampling baselines.
[Abstract, §3] Claim of zero auxiliary computation on query arrival (abstract and §3): the paper asserts the compact cache enables real-time responses with no extra work, but supplies no timing breakdown or pseudocode confirming that hierarchy maintenance itself incurs no per-frame overhead during streaming.
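
The statistical reporting requested in the second major comment could take a form like the sketch below: mean and sample standard deviation across seeds, plus a one-sided paired bootstrap over per-question correctness. The accuracy values and correctness vectors are made-up placeholders for illustration, not results from the paper, and the choice of a paired bootstrap is an assumption about which test would suit this setting.

```python
import random
import statistics

def mean_and_std(values):
    """Mean and sample standard deviation across repeated runs (e.g. different seeds)."""
    return statistics.mean(values), statistics.stdev(values)

def paired_bootstrap_pvalue(correct_a, correct_b, num_resamples=2000, seed=0):
    """One-sided paired bootstrap: fraction of resamples in which A does not beat B.

    correct_a, correct_b: 0/1 correctness per question, aligned by question.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    not_better = 0
    for _ in range(num_resamples):
        # Resample questions with replacement, keeping the A/B pairing intact
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / num_resamples

# Placeholder numbers only; not taken from the paper.
seed_accuracies = [61.2, 60.8, 61.5]
mean, std = mean_and_std(seed_accuracies)
print(f"accuracy: {mean:.2f} +/- {std:.2f}")

method = [1] * 70 + [0] * 30     # hypothetical per-question correctness
baseline = [1] * 60 + [0] * 40
p_value = paired_bootstrap_pvalue(method, baseline)
print("approximate p-value:", p_value)
```

Pairing by question matters here because both systems are scored on the same benchmark items, so per-item differences carry less noise than two independent accuracy estimates.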

minor comments (2)

Notation for hierarchy levels and granularity parameters is introduced without a clear table or diagram; a single figure summarizing level sizes and update rules would improve readability.
[Introduction] The abstract and introduction cite prior KV-cache compression work only lightly; adding 2–3 key references (e.g., recent streaming or memory-efficient attention papers) would better situate the contribution.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We address each of the major comments point-by-point below, providing clarifications and indicating revisions where necessary to strengthen the manuscript.

Referee: [§3] Mechanistic attention investigation (likely §3): the manuscript provides no equations, selection rules, or algorithm for constructing the hierarchy levels or for token promotion/demotion during continuous streaming input. Without these, it is impossible to verify whether the claimed stability across granularities holds under distribution shift or is an artifact of offline statistics.

Authors: We appreciate this observation. Section 3 presents the mechanistic attention investigation that motivates the hierarchical KV cache design, detailing how attention distributions across video frames inform the multi-granularity structure. To address the concern, we will include explicit equations for attention-based token selection, the rules for constructing hierarchy levels, and an algorithm for promotion/demotion in the streaming setting. This will be added to §3 and the appendix to allow verification of stability.
revision: yes

Referee: [§4] Experimental section (likely §4 and tables): accuracy gains (up to 11.4%) and the 68% token reduction are reported without ablations on the number of hierarchy levels, without error bars or statistical significance tests, and without explicit comparison of how the hierarchy is maintained versus uniform sampling baselines.

Authors: We agree that additional ablations would strengthen the results. In the revision, we will add an ablation study on the number of hierarchy levels and include error bars from repeated runs with different seeds. We will also perform and report statistical significance tests for the accuracy gains. The comparison to uniform sampling is presented in the main tables, but we will expand the discussion in §4 to explicitly describe how the hierarchical maintenance differs from uniform sampling and why it leads to better performance.
revision: yes

Referee: [Abstract, §3] Claim of zero auxiliary computation on query arrival (abstract and §3): the paper asserts the compact cache enables real-time responses with no extra work, but supplies no timing breakdown or pseudocode confirming that hierarchy maintenance itself incurs no per-frame overhead during streaming.

Authors: The claim refers specifically to no auxiliary computations triggered by the user query arrival, as the hierarchy is maintained incrementally during streaming. However, we acknowledge the need for more detail. We will provide a timing breakdown in the experimental section and include pseudocode in the appendix showing the per-frame streaming process, demonstrating that maintenance overhead is constant and does not affect query-time latency.
revision: yes
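
A timing breakdown of the kind discussed in this exchange could separate per-frame maintenance cost from query-time cost. The sketch below is a toy stand-in, not the authors' pseudocode: the cache is a fixed-capacity deque that evicts the oldest entry, `encode_frame` and `answer` are hypothetical placeholders, and eviction by age alone is an assumption chosen only to keep per-frame work constant.

```python
import time
from collections import deque

def encode_frame(frame_id):
    """Hypothetical placeholder for encoding and prefilling one frame."""
    return {"frame": frame_id}

def answer(cache, query):
    """Hypothetical placeholder for decoding directly against the existing cache."""
    return f"{query} (answered from {len(cache)} cached entries)"

class StreamingCache:
    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest entry automatically: O(1) work per frame
        self.entries = deque(maxlen=capacity)
        self.maintenance_seconds = []

    def on_frame(self, frame_id):
        start = time.perf_counter()
        self.entries.append(encode_frame(frame_id))
        self.maintenance_seconds.append(time.perf_counter() - start)

    def on_query(self, query):
        # No retrieval, re-encoding, or cache rebuild here: the query reads the cache as-is.
        start = time.perf_counter()
        reply = answer(self.entries, query)
        return reply, time.perf_counter() - start

cache = StreamingCache(capacity=64)
for frame_id in range(1000):
    cache.on_frame(frame_id)

reply, query_seconds = cache.on_query("What just happened?")
print(reply)
print(f"max per-frame maintenance: {max(cache.maintenance_seconds):.6f} s")
print(f"query-time latency (placeholder decode): {query_seconds:.6f} s")
```

Reporting the two timers separately is what would let a reader check both halves of the claim: constant maintenance cost per frame, and no extra work triggered by the query itself.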

Circularity Check

0 steps flagged

No significant circularity; hierarchy motivated by investigation but results are empirically benchmarked

The paper motivates the hierarchical KV cache from a mechanistic attention investigation described in the manuscript, then evaluates the resulting training-free system on streaming and offline benchmarks. No equations, fitted parameters, or self-citations reduce the reported TTFT gains or accuracy improvements to the investigation outputs by construction. The central claims rest on external empirical measurements rather than self-referential definitions or renamed patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that attention patterns naturally support a stable multi-granularity memory organization that can be reused without auxiliary computation or information loss; no free parameters or invented entities are explicitly named in the abstract.

axioms (1)

Domain assumption: mechanistic attention patterns allow the KV cache to be organized into stable hierarchical levels that preserve video semantics across granularities.
Invoked to justify the training-free reuse of a compact cache for streaming inputs.

invented entities (1)

Hierarchical memory framework for KV cache (no independent evidence)
Purpose: encapsulate video information at multiple granularities for efficient streaming reuse.
Conceptualized from the attention investigation; no independent falsifiable handle supplied in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities... Shallow Layers: ... exponential forgetting curve ... Deep Layers: ... attention magnitude ... Middle Layers: ... interpolating recency and attention"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints... no auxiliary computations upon the arrival of user queries"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
cs.CV · 2026-05 · unverdicted · novelty 7.0
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.

VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
cs.CV · 2026-05 · unverdicted · novelty 5.0
Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 2 Pith papers · 20 internal anchors

[1]

Claude 3.5 sonnet, 2024

Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet

work page 2024
[2]

Atkinson and R.M

R.C. Atkinson and R.M. Shiffrin. Human memory: A proposed system and its control processes, 1968. ISSN 0079-

work page 1968
[3]

URL https://www.sciencedirect.com/science/article/pii/S0079742108604223

work page
[4]

Baddeley and Graham Hitch

Alan D. Baddeley and Graham Hitch. Working memory , 1974. ISSN 0079-7421. URL https://www.sciencedirect. com/science/article/pii/S0079742108604521

work page 1974
[5]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shĳie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 20...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time, 2024. URL https: //arxiv.org/abs/2501.00663

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Videollm-online: Online video large language model for streaming video, 2024

Joya Chen, Zhaoyang Lv , Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongx- ing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video, 2024. URL https://arxiv.org/abs/2406.11816

work page arXiv 2024
[9]

Stream- ingtom: Streaming token compression for efficient video un- derstanding.arXiv preprint arXiv:2510.18269, 2025

Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. Streamingtom: Streaming token compression for efficient video understanding, 2025. URL https://arxiv.org/abs/2510.18269

work page arXiv 2025
[10]

arXiv preprint arXiv:2511.07278 , year=

Yilong Chen, Xiang Bai, Zhibin Wang, Chengyu Bai, Yuhan Dai, Ming Lu, and Shanghang Zhang. Streamkv: Streaming video question-answering with segment-based kv cache retrieval and compression, 2025. URL https: //arxiv.org/abs/2511.07278

work page arXiv 2025
[11]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen T ong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, T ong ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms, 2024. URL https://arxiv.org/abs/2406.07476

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney , Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav M...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Streaming video question-answering with in-context video kv-cache retrieval.arXiv preprint arXiv:2503.00540, 2025

Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, and Hao Jiang. Streaming video question-answering with in-context video kv-cache retrieval, 2025. URL https://arxiv.org/abs/2503.00540

work page arXiv 2025
[15]

Memory: A contribution to experimental psychology

Hermann Ebbinghaus. Memory: A contribution to experimental psychology. Annals of neurosciences, 20(4):155, 2013

work page 2013
[16]

Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos, 2024

Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos, 2024. URL https://arxiv.org/abs/2408. 14023

work page 2024
[17]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, T ong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 20...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Kristen Grauman, Andrew Westbury , Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, T ushar Nagarajan, Ilĳa Radosavovic, Santhosh Ku- mar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray , Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Sid- dhant Bansal, Dhruv Batra, Vincent...

work page arXiv 2022
[19]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge T u, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. Minicpm: Unveiling the potential of small lang...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhe...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Movienet: A holistic dataset for movie under- standing, 2020

Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie under- standing, 2020. URL https://arxiv.org/abs/2007.10937

work page arXiv 2020
[22]

Infinipot: Infinite context processing on memory-constrained llms, 2024

Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. Infinipot: Infinite context processing on memory-constrained llms, 2024. URL https://arxiv.org/abs/2410.01518

work page arXiv 2024
[23]

Infinipot-v: Memory-constrained kv cache com- pression for streaming video understanding.arXiv preprint arXiv:2506.15745, 2025

Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. Infinipot-v: Memory-constrained kv cache compression for streaming video understanding, 2025. URL https://arxiv.org/abs/2506.15745. 21

work page arXiv 2025
[24]

Llava-onevision: Easy visual task transfer, 2024

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer, 2024. URL https://arxiv.org/abs/2408. 03326

work page 2024
[25]

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark, 2024. URL https: //arxiv.org/abs/2311.17005

work page internal anchor Pith review arXiv 2024
[26]

Ovo-bench: How far is your video-llms from real-world online video understanding?arXiv preprint arXiv:2501.05510, 2025

Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuan- grui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, and Jiaqi Wang. Ovo-bench: How far is your video-llms from real-world online video understanding?, 2025. URL https://arxiv.org/abs/2501.05510

work page arXiv 2025
[27]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov , Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models, 2024. URL https://arxiv.org/abs/2312.07533

work page arXiv 2024
[28]

Streamingbench: Assessing the gap for mllms to achieve streaming video understanding, 2024

Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding, 2024. URL https://arxiv.org/abs/2411. 03628

work page 2024
[29]

Llava-next: Improved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024

work page 2024
[30]

arXiv preprint arXiv:2408.15542 , year=

Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input, 2024. URL https://arxiv. org/abs/2408.15542

work page arXiv 2024
[31]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov , and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023. URL https://arxiv.org/abs/2308.09126

work page arXiv 2023
[32]

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, and Jieru Zhao. Livevlm: Efficient online video understanding via streaming-oriented kv cache and retrieval, 2025. URL https://arxiv.org/abs/2505.15269

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

OpenAI, :, Aaron Hurst, Adam Lerer, Adam P . Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow , Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry , Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov , Alex Carney , Alex Chow , Alex Kirillov , Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kir- illov , Alexi Christ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: T owards llms as operating systems, 2024. URL https://arxiv.org/abs/2310.08560

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yezhou Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alexander Fr \'e chette, Hanna Klimczak, R

Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky , Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, An...

work page arXiv 2023
[36]

Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction.arXiv:2501.03218, 2025

Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispi- der: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction, 2025. URL https://arxiv.org/abs/2501.03218

work page arXiv 2025
[37]

NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis

Amir Shahroudy , Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+d: A large scale dataset for 3d human activity analysis, 2016. URL https://arxiv.org/abs/1604.02808

work page internal anchor Pith review Pith/arXiv arXiv 2016
[38]

Theodore R

Lianlei Shan, Shixian Luo, Zezhou Zhu, Yu Yuan, and Yong Wu. Cognitive memory in large language models, 2025. URL https://arxiv.org/abs/2504.02441

work page arXiv 2025
[39]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny , and Vikas Chandra. Longvu: Spatiotemporal adaptive compression for long video-language understanding, 2024. URL ht...

work page internal anchor Pith review arXiv 2024
[40]

Hierarchical memory for high-efficiency long-term reasoning in llm agents, 2025

Haoran Sun and Shaoning Zeng. Hierarchical memory for high-efficiency long-term reasoning in llm agents, 2025. URL https://arxiv.org/abs/2507.22925

work page arXiv 2025
[41]

Dycoke: Dynamic compression of tokens for fast video large language models, 2025

Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Dycoke: Dynamic compression of tokens for fast video large language models, 2025. URL https://arxiv.org/abs/2411.15024. 23

work page arXiv 2025
[42]

Streambridge: Turning your offline video large language model into a proactive streaming assistant,

Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, and Ping Huang. Streambridge: T urning your offline video large language model into a proactive streaming assistant, 2025. URL https://arxiv.org/abs/2505.05467

work page arXiv 2025
[43]

Dynamic-vlm: Simple dynamic visual token compression for videollm, 2024

Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, and Can Huang. Dynamic-vlm: Simple dynamic visual token compression for videollm, 2024. URL https://arxiv.org/abs/2412. 09530

work page 2024
[44]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024. URL https://arxiv.org/abs/2409.12191

[45]

Videotree: Adaptive tree-based video representation for llm reasoning on long videos

Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. Videotree: Adaptive tree-based video representation for llm reasoning on long videos, 2025. URL https://arxiv.org/abs/2405.19209

[46]

Streaming video understanding and multi-round interaction with memory-enhanced knowledge

Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, and Huchuan Lu. Streaming video understanding and multi-round interaction with memory-enhanced knowledge, 2025. URL https://arxiv.org/abs/2501.13468

[47]

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams, 2025. URL https://arxiv.org/abs/2510.09608

[48]

StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding

Haolin Yang, Feilong Tang, Lingxiao Zhao, Xiang An, Ming Hu, Huifa Li, Xinlin Zhuang, Yifan Lu, Xiaofeng Zhang, Abdalla Swikir, Junjun He, Zongyuan Ge, and Imran Razzak. Streamagent: Towards anticipatory agents for streaming video understanding, 2025. URL https://arxiv.org/abs/2508.01875

[49]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models, 2024. URL https://arxiv.org/abs/2412.04467

[50]

Streammem: Query-agnostic kv cache memory for streaming video understanding

Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, and Mengye Ren. Streammem: Query-agnostic kv cache memory for streaming video understanding, 2025. URL https://arxiv.org/abs/2508.15717

[51]

Timechat-online: 80% visual tokens are naturally redundant in streaming videos

Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, and Xu Sun. Timechat-online: 80% visual tokens are naturally redundant in streaming videos, 2025. URL https://arxiv.org/abs/2504.17343

[52]

Streamforest: Efficient online video understanding with persistent event memory

Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, Yi Wang, and Limin Wang. Streamforest: Efficient online video understanding with persistent event memory. URL https://arxiv.org/abs/2509.24871

[55]

Flash-vstream: Memory-based real-time understanding for long video streams

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams, 2024. URL https://arxiv.org/abs/2406.08085

[56]

Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision, 2024. URL https://arxiv.org/abs/2406.16852

[57]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data, 2025. URL https://arxiv.org/abs/2410.02713

Appendix Contents: A More Attention Visualization; B Guidance Prompt ...

