Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Guangliang Cheng; Haochen Wang; Jason Li; Jiahao Meng; Jiangning Zhang; Kuan Gao; Lingdong Kong; Lu Qi; Minghsuan Yang; Qianyu Zhou

arxiv: 2606.07433 · v1 · pith:W7NYHLPEnew · submitted 2026-06-05 · 💻 cs.CV · cs.AI· cs.MM

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Jiahao Meng , Yue Tan , Qi Xu , Kuan Gao , Weisong Liu , Yanwei Li , Jason Li , Lingdong Kong

show 7 more authors

Haochen Wang Qianyu Zhou Jiangning Zhang Guangliang Cheng Yunhai Tong Lu Qi Minghsuan Yang

This is my paper

Pith reviewed 2026-06-27 21:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.MM

keywords video understandingmultimodal large language modelswatching remembering reasoninglong video processingmemory modelingegocentric videostreaming understandingfaithful reasoning

0 comments

The pith

Video MLLMs acquire evidence, preserve context, and produce outputs through watching, remembering, and reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a unified human-view framework for LLM-based video understanding built around three functional abilities: watching for evidence acquisition, remembering for context preservation, and reasoning for grounded outputs. Instead of isolated task benchmarks, this structure analyzes how models handle sparse evidence, long-range dependencies, and multimodal alignment under computational limits. It formulates video systems through perceptual representations, memory states, reasoning traces, and final predictions, then maps representative methods, challenges, application domains, datasets, and benchmarks onto these elements. The work reviews current approaches across perception, memory, and reasoning while highlighting open problems for scalable video intelligence.

Core claim

Video understanding with MLLMs is best characterized by a formulation that decomposes systems into perceptual representations, memory states, reasoning traces, and final predictions, which in turn map onto the three abilities of watching, remembering, and reasoning; this decomposition supplies a single lens for organizing methods, identifying challenges in spatio-temporal perception and memory modeling, and covering domains from egocentric to narrative videos.

What carries the argument

The three functional abilities—watching, remembering, and reasoning—that partition video MLLM behavior and link the four system components (perceptual representations, memory states, reasoning traces, final predictions) into a unified analysis structure.

Load-bearing premise

Every video understanding system can be usefully described by four fixed components and partitioned without overlap or remainder into the three abilities of watching, remembering, and reasoning.

What would settle it

A deployed video MLLM whose accuracy, efficiency, or failure modes on long videos cannot be improved or explained by separately measuring or modifying its watching, remembering, or reasoning components.

Figures

Figures reproduced from arXiv: 2606.07433 by Guangliang Cheng, Haochen Wang, Jason Li, Jiahao Meng, Jiangning Zhang, Kuan Gao, Lingdong Kong, Lu Qi, Minghsuan Yang, Qianyu Zhou, Qi Xu, Weisong Liu, Yanwei Li, Yue Tan, Yunhai Tong.

**Figure 1.** Figure 1: Overview of our survey. Left: the survey pipeline. Right: our Watch–Remember–Reason taxonomy for MLLM-based video understanding. Watch (Sec. 3.1) covers fine-grained grounding, captioning, audio-visual perception, and efficient processing. Remember (Sec. 3.2) includes offline and streaming memory. Reason (Sec. 3.3) covers text-only reasoning and thinking with videos, with both agentic and non-agent approac… view at source ↗

**Figure 2.** Figure 2: Overview of methods related to ”How to Watch?”. Fine-grained watching localizes task-relevant evidence in time and space. Comprehensive watching abstracts videos into summaries, and segment-level or region-level descriptions. Audio-visual watching aligns visual and acoustic streams for omni-modal perception. Efficient watching reduces redundancy through frame selection, token compression, and efficient mod… view at source ↗

**Figure 3.** Figure 3: Overview of methods related to ”How to Remember?”. Agentic offline memory constructs and updates external memory through LLM/VLM agents. Non-agentic offline memory builds structured short-term and long-term memory via event extraction, frame selection, token compression, and event clustering. Streaming memory maintains and retrieves memory online through sliding windows, recent memory, and long-term memory… view at source ↗

**Figure 4.** Figure 4: Overview of methods related to ”How to Reason?”. Agentic text-only reasoning methods decompose reasoning into modular steps such as clip summarization, adaptive search, memory retrieval, reflection, and answer verification. Nonagent text-only reasoning methods perform a single MLLM forward pass and produces textual chain-of-thought with the final answer. Agentic thinking with videos methods actively inter… view at source ↗

read the original abstract

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a survey that organizes video MLLM work around watching, remembering, and reasoning but introduces no new results or derivations.

read the letter

This survey organizes recent work on multimodal LLMs for video around three functional abilities—watching, remembering, and reasoning—plus four system components: perceptual representations, memory states, reasoning traces, and final predictions. The abstract and structure make clear that the goal is to give a unified view of how these models handle evidence, context, and output rather than to report fresh experiments.

It does a reasonable job pulling together challenges in long-video perception, memory modeling, streaming, and faithful reasoning, then mapping representative methods, datasets, and benchmarks across domains like egocentric and medical video. The GitHub repo for ongoing updates is a practical addition that many readers will use.

The soft spots are predictable for a survey. The three-ability partition is offered as a useful lens, yet the paper does not demonstrate that the categories are exhaustive or non-overlapping; any overlaps or gaps will only become visible once the full categorization is checked against the cited papers. Soundness therefore depends on accurate mapping of existing methods, which cannot be verified from the abstract alone. No new empirical claims or formal derivations are present, so there is nothing to reproduce or falsify.

This paper is mainly for people already active in video MLLMs who want a structured overview and a list of open problems. It is not aimed at readers seeking a new model or benchmark numbers. It deserves a serious referee because the subfield moves quickly and a balanced survey can help keep the literature usable, even if the framing itself is incremental.

Referee Report

0 major / 3 minor

Summary. This survey paper proposes a human-view organizational framework for video understanding with multimodal large language models (MLLMs), structured around three functional abilities (watching, remembering, reasoning) and four system components (perceptual representations, memory states, reasoning traces, final predictions). It uses this framing to categorize methods for fine-grained perception, memory modeling (offline/streaming), and reasoning (text-only or video-grounded), while surveying challenges in spatio-temporal perception and long-video processing, application domains (egocentric, sports, medical, etc.), datasets, benchmarks, and open problems.

Significance. If the proposed partition proves useful for synthesis, the work could help researchers map existing methods onto a common structure for identifying gaps in memory-aware and evidence-grounded video intelligence, particularly as the field shifts toward long, multimodal scenarios. Its value is primarily in organization and coverage rather than new derivations or measurements.

minor comments (3)

The formulation characterizing systems by perceptual representations, memory states, reasoning traces, and final predictions is introduced in the abstract and presumably detailed early in the manuscript; if this is presented only descriptively without a diagram or explicit mapping table to the three abilities, it risks remaining informal for readers attempting to apply the framework to new papers.
The abstract states that representative methods are 'organized by their roles in video MLLM systems' under the three abilities, but without an explicit cross-reference table or section that lists which cited works map to which component, the organizational claim is harder to verify.
The GitHub link for continuously traced related works is mentioned but not cited as a reference; adding a formal citation or footnote would improve traceability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our survey and the recommendation of minor revision. No specific major comments were provided in the report, so we have no individual points requiring point-by-point rebuttal. We will incorporate minor improvements for clarity and completeness in the revised version.

Circularity Check

0 steps flagged

No significant circularity; organizational survey framing is self-contained

full rationale

The paper is a literature survey that proposes an organizational perspective on video MLLMs structured around three functional abilities (watching, remembering, reasoning) and four system components (perceptual representations, memory states, reasoning traces, final predictions). The central claim is that this supplies a unified structure for analyzing existing methods, challenges, datasets, and benchmarks rather than introducing new empirical results or a falsifiable model. The partition is presented as a useful framing for the survey; no stronger claim of exhaustiveness, disjointness, or predictive power is required for the work to fulfill its stated purpose. No equations, fitted parameters, predictions, or self-citation chains appear in the derivation chain. All cited works are external and the framework does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a survey and introduces no new free parameters, mathematical axioms, or invented physical entities; it synthesizes prior literature under an organizational framing.

pith-pipeline@v0.9.1-grok · 5856 in / 1148 out tokens · 15229 ms · 2026-06-27T21:59:04.020071+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

298 extracted references · 39 linked inside Pith

[1]

Qwen3.5: Towards native multimodal agents,

Qwen Team, “Qwen3.5: Towards native multimodal agents,” February 2026. [Online]. Available: https://qwen.ai/blog?id= qwen3.5

2026
[2]

Qwen3- omni technical report,

J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, Y. Lv, Y. Wang, D. Guo, H. Wang, L. Ma, P . Zhang, X. Zhang, H. Hao, Z. Guo, B. Yang, B. Zhang, Z. Ma, X. Wei, S. Bai, K. Chen, X. Liu, P . Wang, M. Yang, D. Liu, X. Ren, B. Zheng, R. Men, F. Zhou, B. Yu, J. Yang, L. Yu, J. Zhou, and J. Lin, “Qwen3- omni technical report,” arXiv...

Pith/arXiv arXiv 2025
[3]

Qwen3-vl technical report,

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chenet al., “Qwen3-vl technical report,” Nov. 2025. IEEE TRANSACTIONS ON PATTERN ANAL YSIS AND MACHINE INTELLIGENCE 24

2025
[4]

Qwen2. 5-omni technical report,

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Danget al., “Qwen2. 5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

Pith/arXiv arXiv 2025
[5]

Video-xl-2: Towards very long-video under- standing through task-aware kv sparsification,

M. Qin, X. Liu, Z. Liang, Y. Shu, H. Yuan, J. Zhou, S. Xiao, B. Zhao, and Z. Liu, “Video-xl-2: Towards very long-video under- standing through task-aware kv sparsification,” arXiv preprint arXiv:2506.19225, 2025

arXiv 2025
[6]

Msr-vtt: A large video descrip- tion dataset for bridging video and language,

J. Xu, T. Mei, T. Yao, and Y. Rui, “Msr-vtt: A large video descrip- tion dataset for bridging video and language,” inProceedings of the IEEE conference on computer vision and pattern recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2016, pp. 5288–5296

2016
[7]

Video question answering via gradually refined attention over appearance and motion,

D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang, “Video question answering via gradually refined attention over appearance and motion,” inProceedings of the 25th ACM interna- tional conference on Multimedia. New York, NY, USA: Association for Computing Machinery, 2017, pp. 1645–1653

2017
[8]

Tgif-qa: Toward spatio-temporal reasoning in visual question answering,

Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, “Tgif-qa: Toward spatio-temporal reasoning in visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2017, pp. 2758–2766

2017
[9]

LongVU: Spatiotemporal adaptive compression for long video- language understanding,

X. Shen, Y. Xiong, C. Zhao, L. Wu, J. Chen, C. Zhu, Z. Liu, F. Xiao, B. Varadarajan, F. Bordes, Z. Liu, H. Xu, H. J. Kim, B. Soran, R. Krishnamoorthi, M. Elhoseiny, and V . Chandra, “LongVU: Spatiotemporal adaptive compression for long video- language understanding,” inForty-second International Conference on Machine Learning, 2025

2025
[10]

Longvideo-r1: Smart navigation for low-cost long video understanding,

J. Qiu, L. Xie, X. Huo, Q. Tian, and Q. Ye, “Longvideo-r1: Smart navigation for low-cost long video understanding,” arXiv preprint arXiv:2602.20913, 2026

Pith/arXiv arXiv 2026
[11]

Video-o3: Native inter- leaved clue seeking for long video multi-hop reasoning,

X. Zeng, Z. Zhang, Y. Zhu, X. Li, Z. Wang, C. Ma, Q. Zhang, Z. Huang, K. Ouyang, T. Jianget al., “Video-o3: Native inter- leaved clue seeking for long video multi-hop reasoning,” arXiv preprint arXiv:2601.23224, 2026

Pith/arXiv arXiv 2026
[12]

Ego4d: Around the world in 3,000 hours of egocentric video,

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liuet al., “Ego4d: Around the world in 3,000 hours of egocentric video,” inCVPR. Los Alamitos, CA, USA: IEEE Computer Society, 2022

2022
[14]

A visually grounded language model for fetal ultrasound understanding,

X. Guo, M. Alsharid, H. Zhao, Y. Wang, J. Lander, A. T. Pa- pageorghiou, and J. A. Noble, “A visually grounded language model for fetal ultrasound understanding,” Nature Biomedical Engineering, advance online publication, 2026

2026
[15]

Streamingvlm: Real-time understanding for infinite video streams,

R. Xu, G. Xiao, Y. Chen, L. He, K. Peng, Y. Lu, and S. Han, “Streamingvlm: Real-time understanding for infinite video streams,” 2025. [Online]. Available: https://arxiv.org/abs/2510. 09608

2025
[16]

Timechat: A time- sensitive multimodal large language model for long video un- derstanding,

S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time- sensitive multimodal large language model for long video un- derstanding,” arXiv preprint arXiv:2312.02051, 2023

arXiv 2023
[17]

Adaptive keyframe sampling for long video understanding,

X. Tang, J. Qiu, L. Xie, Y. Tian, J. Jiao, and Q. Ye, “Adaptive keyframe sampling for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition. Los Alamitos, CA, USA: IEEE Computer Society, 2025, pp. 29 118–29 128

2025
[18]

FrameFusion: Combining similarity and importance for video token reduction on large vision language models,

T. Fu, T. Liu, Q. Han, G. Dai, S. Yan, H. Yang, X. Ning, and Y. Wang, “FrameFusion: Combining similarity and importance for video token reduction on large vision language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision. Los Alamitos, CA, USA: IEEE Computer Society, 2025, pp. 22 654–22 663

2025
[19]

Moviechat: From dense token to sparse memory for long video understanding,

E. Song, W. Chai, G. Wang, Y. Zhang, H. Zhou, F. Wu, H. Chi, X. Guo, T. Ye, Y. Zhanget al., “Moviechat: From dense token to sparse memory for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2024, pp. 18 221–18 232

2024
[20]

Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,

B. He, H. Li, Y. K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim, “Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 504–13 514

2024
[21]

Memory-enhanced retrieval augmentation for long video understanding,

H. Yuan, Z. Liu, M. Qin, H. Qian, Y. Shu, Z. Dou, J.-R. Wen, and N. Sebe, “Memory-enhanced retrieval augmentation for long video understanding,” 2025. [Online]. Available: https://arxiv.org/abs/2503.09149

arXiv 2025
[22]

Flash-vstream: Memory-based real-time understanding for long video streams,

H. Zhang, Y. Wang, Y. Tang, Y. Liu, J. Feng, J. Dai, and X. Jin, “Flash-vstream: Memory-based real-time understanding for long video streams,” 2024. [Online]. Available: https: //arxiv.org/abs/2406.08085

arXiv 2024
[23]

Streammem: Query-agnostic kv cache memory for streaming video understanding,

Y. Yang, Z. Zhao, S. N. Shukla, A. Singh, S. K. Mishra, L. Zhang, and M. Ren, “Streammem: Query-agnostic kv cache memory for streaming video understanding,” 2025. [Online]. Available: https://arxiv.org/abs/2508.15717

arXiv 2025
[24]

Video-r1: Reinforcing video reasoning in mllms,

K. Feng, K. Gong, B. Li, Z. Guo, Y. Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue, “Video-r1: Reinforcing video reasoning in mllms,” arXiv preprint arXiv:2503.21776, 2025

Pith/arXiv arXiv 2025
[25]

Videorft: In- centivizing video reasoning capability in mllms via reinforced fine-tuning,

Q. Wang, Y. Yu, Y. Yuan, R. Mao, and T. Zhou, “Videorft: In- centivizing video reasoning capability in mllms via reinforced fine-tuning,” arXiv preprint arXiv:2505.12434, 2025

arXiv 2025
[26]

Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence,

J. Meng, X. Li, H. Wang, Y. Tan, T. Zhang, L. Kong, Y. Tong, A. Wang, Z. Teng, Y. Wanget al., “Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence,” arXiv preprint arXiv:2510.20579, 2025

arXiv 2025
[27]

Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning,

H. Zhang, X. Gu, J. Li, C. Ma, S. Bai, C. Zhang, B. Zhang, Z. Zhou, D. He, and Y. Tang, “Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning,” arXiv preprint arXiv:2508.04416, 2025

arXiv 2025
[28]

Video-language understanding: A survey from model architecture, model training, and data per- spectives,

T. Nguyen, Y. Bin, J. Xiao, L. Qu, Y. Li, J. Z. Wu, C.-D. Nguyen, S. K. Ng, and L. A. Tuan, “Video-language understanding: A survey from model architecture, model training, and data per- spectives,” inFindings of the Association for Computational Lin- guistics: ACL 2024. Stroudsburg, PA, USA: Association for Computational Linguistics, 2024, pp. 3636–3657

2024
[29]

Video understanding with large language models: A survey,

Y. Tang, J. Bi, S. Xu, L. Song, S. Liang, T. Wang, D. Zhang, J. An, J. Lin, R. Zhu, A. Vosoughi, C. Huang, Z. Zhang, P . Liu, M. Feng, F. Zheng, J.-L. Gaudiot, P . Luo, J. Luo, and C. Xu, “Video understanding with large language models: A survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 2, pp. 1355–1376, Feb. 2026

2026
[30]

A survey on video temporal grounding with multimodal large lan- guage model,

J. Wu, W. Liu, Y. Liu, M. Liu, L. Nie, Z. Lin, and C. W. Chen, “A survey on video temporal grounding with multimodal large lan- guage model,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 2, pp. 1521–1541, 2026

2026
[31]

Video-LMM post-training: A deep dive into video reasoning with large multimodal models,

Y. Tang, J. Bi, P . Liu, Z. Pan, Z. Tan, Q. Shen, J. Liu, H. Hua, J. Guo, Y. Xiaoet al., “Video-LMM post-training: A deep dive into video reasoning with large multimodal models,” arXiv preprint arXiv:2510.05034, 2025

arXiv 2025
[32]

A survey of reinforcement learning for large reasoning models,

K. Zhang, Y. Zuo, B. He, Y. Sun, R. Liu, C. Jiang, Y. Fan, K. Tian, G. Jia, P . Liet al., “A survey of reinforcement learning for large reasoning models,” arXiv preprint arXiv:2509.08827, 2025

Pith/arXiv arXiv 2025
[33]

Memory in the age of ai agents,

Y. Hu, S. Liu, Y. Yue, G. Zhang, B. Liu, F. Zhu, J. Lin, H. Guo, S. Dou, Z. Xiet al., “Memory in the age of ai agents,” arXiv preprint arXiv:2512.13564, 2025

Pith/arXiv arXiv 2025
[34]

Token reduction should go beyond efficiency in generative models–from vision, language to multimodality,

Z. Kong, Y. Li, F. Zeng, L. Xin, S. Messica, X. Lin, P . Zhao, M. Kellis, H. Tang, and M. Zitnik, “Token reduction should go beyond efficiency in generative models–from vision, language to multimodality,” arXiv preprint arXiv:2505.18227, 2025

arXiv 2025
[35]

Perception, reason, think, and plan: A survey on large multimodal reasoning models,

Y. Li, Z. Liu, Z. Li, X. Zhang, Z. Xu, X. Chen, H. Shi, S. Jiang, X. Wang, J. Wanget al., “Perception, reason, think, and plan: A survey on large multimodal reasoning models,” arXiv preprint arXiv:2505.04921, 2025

arXiv 2025
[36]

Lita: Language instructed temporal- localization assistant,

D.-A. Huang, S. Liao, S. Radhakrishnan, H. Yin, P . Molchanov, Z. Yu, and J. Kautz, “Lita: Language instructed temporal- localization assistant,” inEuropean Conference on Computer Vision (ECCV). Cham, Switzerland: Springer, 2024

2024
[37]

Universal video temporal grounding with generative multi-modal large language models,

Z. Li, S. Di, Z. Zhai, W. Huang, Y. Wang, and W. Xie, “Universal video temporal grounding with generative multi-modal large language models,” inAdvances in Neural Information Processing Systems (NeurIPS). Red Hook, NY, USA: Curran Associates, Inc., 2025, affiliations: Shanghai Jiao Tong University; ByteDance Seed

2025
[38]

Timelens: Rethinking video temporal grounding with multi- modal llms,

J. Zhang, T. Wang, Y. Ge, Y. Ge, X. Li, Y. Shan, and L. Wang, “Timelens: Rethinking video temporal grounding with multi- modal llms,” arXiv preprint arXiv:2512.14698, 2025, affiliations: Nanjing University; ARC Lab, Tencent PCG; Shanghai AI Lab

arXiv 2025
[39]

Towards one-to-many temporal grounding,

Q. Xu, T. Yue, S. Chen, J. Meng, A. Wang, S. Ji, H. Fei, and X. Li, “Towards one-to-many temporal grounding,” inProceedings of the 43rd International Conference on Machine Learning (ICML). Brookline, MA, USA: PMLR, 2026

2026
[40]

Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos,

H. Yuan, X. Li, T. Zhang, Y. Sun, Z. Huang, S. Xu, S. Ji, Y. Tong, L. Qi, J. Fenget al., “Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos,” arXiv preprint arXiv:2501.04001, 2025. IEEE TRANSACTIONS ON PATTERN ANAL YSIS AND MACHINE INTELLIGENCE 25

Pith/arXiv arXiv 2025
[41]

Sama: Towards multi-turn referential grounded video chat with large language models,

Y. Sun, H. Zhang, H. Ding, T. Zhang, X. Ma, and Y.-G. Jiang, “Sama: Towards multi-turn referential grounded video chat with large language models,” inAdvances in Neural Information Pro- cessing Systems. Red Hook, NY, USA: Curran Associates, Inc., 2025

2025
[42]

Streaming dense video captioning,

G. Zhou, X. Xiong, A. Bhattacharyya, and J. J. Corso, “Streaming dense video captioning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2024, pp. 18 486–18 496

2024
[43]

Do you remember? dense video captioning with cross-modal memory retrieval,

M. Kim, H. B. Kim, J. Moon, J. Choi, and S. T. Kim, “Do you remember? dense video captioning with cross-modal memory retrieval,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2024, pp. 13 894–13 904

2024
[44]

Dibs: Enhancing dense video captioning with unlabeled videos via pseudo boundary en- richment and online refinement,

H. Wu, H. Liu, Y. Qiao, and X. Sun, “Dibs: Enhancing dense video captioning with unlabeled videos via pseudo boundary en- richment and online refinement,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2024, pp. 18 699–18 708

2024
[45]

PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning,

L. Xu, Y. Huang, S. Xie, W. Wei, T. Li, B. Pan, Y. Zhao, and J. Yuan, “PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning,” arXiv preprint arXiv:2404.16994, 2024

Pith/arXiv arXiv 2024
[46]

AuroraCap: Efficient, performant video detailed captioning and a new benchmark,

W. Chai, E. Song, Y. Du, C. Meng, V . Madhavan, O. Bar-Tal, J.- N. Hwang, S. Xie, and C. D. Manning, “AuroraCap: Efficient, performant video detailed captioning and a new benchmark,” arXiv preprint arXiv:2410.03051, 2024

arXiv 2024
[47]

Tarsier2: Advancing large vision-language models from detailed video de- scription to comprehensive video understanding,

L. Yuan, J. Wang, H. Sun, Y. Zhang, and Y. Lin, “Tarsier2: Advancing large vision-language models from detailed video de- scription to comprehensive video understanding,” arXiv preprint arXiv:2501.07888, 2025

arXiv 2025
[48]

Baichuan-omni technical report,

Y. Li, H. Sun, M. Lin, T. Li, G. Dong, T. Zhang, B. Ding, W. Song, Z. Cheng, Y. Huoet al., “Baichuan-omni technical report,” arXiv preprint arXiv:2410.08565, 2024

arXiv 2024
[49]

Ming-omni: A unified mul- timodal model for perception and generation,

I. AI, B. Gong, C. Zou, C. Zheng, C. Zhou, C. Yan, C. Jin, C. Shen, D. Zheng, F. Wanget al., “Ming-omni: A unified mul- timodal model for perception and generation,” arXiv preprint arXiv:2506.09344, 2025

arXiv 2025
[50]

Llama- omni: Seamless speech interaction with large language models,

Q. Fang, S. Guo, Y. Zhou, Z. Ma, S. Zhang, and Y. Feng, “Llama- omni: Seamless speech interaction with large language models,” arXiv preprint arXiv:2409.06666, 2024

arXiv 2024
[51]

Stream-omni: Simultaneous multimodal interactions with large language- vision-speech model,

S. Zhang, S. Guo, Q. Fang, Y. Zhou, and Y. Feng, “Stream-omni: Simultaneous multimodal interactions with large language- vision-speech model,” arXiv preprint arXiv:2506.13642, 2025

arXiv 2025
[52]

Omnicaptioner: One captioner to rule them all,

Y. Lu, J. Yuan, Z. Li, S. Zhao, Q. Qin, X. Li, L. Zhuo, L. Wen, D. Liu, Y. Caoet al., “Omnicaptioner: One captioner to rule them all,” arXiv preprint arXiv:2504.07089, 2025

arXiv 2025
[53]

Omnivinci: Enhancing ar- chitecture and data for omni-modal understanding llm,

H. Ye, C.-H. H. Yang, A. Goel, W. Huang, L. Zhu, Y. Su, S. Lin, A.-C. Cheng, Z. Wan, J. Tianet al., “Omnivinci: Enhancing ar- chitecture and data for omni-modal understanding llm,” arXiv preprint arXiv:2510.15870, 2025

arXiv 2025
[54]

Q-Frame: Query- aware frame selection and multi-resolution adaptation for video- LLMs,

S. Zhang, J. Yang, J. Yin, Z. Luo, and J. Luan, “Q-Frame: Query- aware frame selection and multi-resolution adaptation for video- LLMs,” arXiv preprint arXiv:2506.22139, 2025

arXiv 2025
[55]

DyCoke: Dynamic compression of tokens for fast video large language models,

K. Tao, C. Qin, H. You, Y. Sui, and H. Wang, “DyCoke: Dynamic compression of tokens for fast video large language models,” in Proceedings of the Computer Vision and Pattern Recognition Confer- ence. Los Alamitos, CA, USA: IEEE Computer Society, 2025, pp. 18 992–19 001

2025
[56]

Videonsa: Native sparse attention scales video un- derstanding,

E. Song, W. Chai, S. Yang, E. Armand, X. Shan, H. Xu, J. Xie, and Z. Tu, “Videonsa: Native sparse attention scales video un- derstanding,” arXiv preprint arXiv:2510.02295, 2025

arXiv 2025
[57]

Vtimellm: Empower llm to grasp video moments,

B. Huang, X. Wang, H. Chen, Z. Song, and W. Zhu, “Vtimellm: Empower llm to grasp video moments,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA, USA: IEEE Computer Society, 2024

2024
[58]

Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,

Y. Guo, J. Liu, M. Li, D. Cheng, X. Tang, D. Sui, Q. Liu, X. Chen, and K. Zhao, “Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,” inProceed- ings of the AAAI Conference on Artificial Intelligence. Palo Alto, CA, USA: AAAI Press, 2025

2025
[59]

Distime: Distribution-based time representation for video large language models,

Y. Zeng, Z. Huang, Y. Zhong, C. Feng, J. Hu, L. Ma, and Y. Liu, “Distime: Distribution-based time representation for video large language models,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA, USA: IEEE Computer Society, 2025

2025
[60]

Self-chained image- language model for video localization and question answering,

S. Yu, J. Cho, P . Yadav, and M. Bansal, “Self-chained image- language model for video localization and question answering,” inAdvances in Neural Information Processing Systems (NeurIPS). Red Hook, NY, USA: Curran Associates, Inc., 2023, affiliation: UNC Chapel Hill

2023
[61]

Llava-mr: Large language-and-vision assistant for video moment retrieval,

W. Lu, J. Li, A. Yu, M.-C. Chang, S. Ji, and M. Xia, “Llava-mr: Large language-and-vision assistant for video moment retrieval,” arXiv preprint arXiv:2411.14505, 2024, affiliations: Peking Univer- sity; Tencent Youtu; University at Albany; Zhejiang University. (* indicates corresponding author in the paper.)

arXiv 2024
[62]

Timesuite: Improving mllms for long video understanding via grounded tuning,

X. Zeng, K. Li, C. Wang, X. Li, T. Jiang, Z. Yan, S. Li, Y. Shi, Z. Yue, Y. Wang, Y. Wang, Y. Qiao, and L. Wang, “Timesuite: Improving mllms for long video understanding via grounded tuning,” inInternational Conference on Learning Representations (ICLR). Online: OpenReview.net, 2025

2025
[63]

Scanning only once: An end-to-end framework for fast temporal grounding in long videos,

Y. Pan, X. He, B. Gong, Y. Lv, Y. Shen, Y. Peng, and D. Zhao, “Scanning only once: An end-to-end framework for fast temporal grounding in long videos,” inProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV). Los Alamitos, CA, USA: IEEE Computer Society, 2023, affiliations: Alibaba Group; Wangxuan Institute of Computer Technolog...

2023
[64]

Trace: Temporal grounding video llm via causal event modeling,

Y. Guo, J. Liu, M. Li, Q. Liu, X. Chen, and X. Tang, “Trace: Temporal grounding video llm via causal event modeling,” in International Conference on Learning Representations (ICLR). On- line: OpenReview.net, 2025, affiliations: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; Tencent PCG; Shenzhen Institute of Artificial I...

2025
[65]

Tar-tvg: Enhancing vlms with timestamp anchor-constrained reasoning for temporal video grounding,

C. Guo, X. Mo, Y. Nie, X. Xu, C. Xu, F. Yu, and C. Long, “Tar-tvg: Enhancing vlms with timestamp anchor-constrained reasoning for temporal video grounding,” arXiv preprint arXiv:2508.07683, 2025

arXiv 2025
[66]

Grounded-videollm: Sharpening fine- grained temporal grounding in video large language models,

H. Wang, Z. Xu, Y. Cheng, S. Diao, Y. Zhou, Y. Cao, Q. Wang, W. Ge, and L. Huang, “Grounded-videollm: Sharpening fine- grained temporal grounding in video large language models,” arXiv preprint arXiv:2410.03290, 2024

arXiv 2024
[67]

Momentor: Advancing video large language model with fine-grained temporal reasoning,

L. Qian, J. Li, Y. Wu, Y. Ye, H. Fei, T.-S. Chua, Y. Zhuang, and S. Tang, “Momentor: Advancing video large language model with fine-grained temporal reasoning,” inProceedings of the 41st International Conference on Machine Learning (ICML). Brookline, MA, USA: PMLR, 2024

2024
[68]

Videoperceiver: Enhancing fine-grained temporal perception in video multimodal large language models,

F. Zhao, L. Zhang, D. Shi, Y. Gao, C. Ye, Y. Cai, J. Gao, and D. Yan, “Videoperceiver: Enhancing fine-grained temporal perception in video multimodal large language models,” arXiv preprint arXiv:2511.18823, 2025

arXiv 2025
[69]

Time-r1: Post-training large vision language model for temporal video grounding,

Y. Wang, Z. Wang, B. Xu, Y. Du, K. Lin, Z. Xiao, Z. Yue, J. Ju, L. Zhang, D. Yanget al., “Time-r1: Post-training large vision language model for temporal video grounding,” arXiv preprint arXiv:2503.13377, 2025

Pith/arXiv arXiv 2025
[70]

Video-opd: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation,

J. Li, H. Yin, H. Xu, B. Xu, W. Tan, Z. He, J. Ju, Z. Luo, and J. Luan, “Video-opd: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation,” arXiv preprint arXiv:2602.02994, 2026

Pith/arXiv arXiv 2026
[71]

Videozoomer: Reinforcement-learned temporal focusing for long video reason- ing,

Y. Ding, Y. Zhang, X. Lai, R. Chu, and Y. Yang, “Videozoomer: Reinforcement-learned temporal focusing for long video reason- ing,” arXiv preprint arXiv:2512.22315, 2025

arXiv 2025
[72]

Datasets and recipes for video temporal grounding via reinforcement learning,

R. Chen, T. Luo, Z. Fan, H. Zou, Z. Feng, G. Xie, H. Zhang, Z. Wang, Z. Liu, and Z. Huaijian, “Datasets and recipes for video temporal grounding via reinforcement learning,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. Stroudsburg, PA, USA: Association for Computational Linguistics, 2025, pp. 983–992

2025
[73]

Museg: Reinforc- ing video temporal understanding via timestamp-aware multi- segment grounding,

F. Luo, S. Lou, C. Chen, Z. Wang, C. Li, W. Shen, J. Guo, P . Li, M. Yan, J. Zhang, F. Huang, and Y. Liu, “Museg: Reinforc- ing video temporal understanding via timestamp-aware multi- segment grounding,” arXiv preprint arXiv:2505.20715, 2025

Pith/arXiv arXiv 2025
[74]

Detect anything via next point prediction,

Q. Jiang, J. Huo, X. Chen, Y. Xiong, Z. Zeng, Y. Chen, T. Ren, J. Yu, and L. Zhang, “Detect anything via next point prediction,” arXiv preprint arXiv:2510.12798, 2025

arXiv 2025
[75]

Thinking with bounding boxes: Enhancing spatio-temporal video grounding via reinforcement fine-tuning,

X. Gu, H. Zhang, Q. Fan, J. Niu, Z. Zhang, L. Zhang, G. Chen, F. Chen, L. Wen, and S. Zhu, “Thinking with bounding boxes: Enhancing spatio-temporal video grounding via reinforcement fine-tuning,” arXiv preprint arXiv:2511.21375, 2025

arXiv 2025
[76]

Universal instance perception as object discovery and retrieval,

B. Yan, Y. Jiang, J. Wu, D. Wang, Z. Yuan, P . Luo, and H. Lu, IEEE TRANSACTIONS ON PATTERN ANAL YSIS AND MACHINE INTELLIGENCE 26 “Universal instance perception as object discovery and retrieval,” inCVPR. Los Alamitos, CA, USA: IEEE Computer Society, 2023

2023
[77]

Multimodal referring segmentation: A survey,

H. Ding, S. Tang, S. He, C. Liu, Z. Wu, and Y.-G. Jiang, “Multimodal referring segmentation: A survey,” arXiv preprint arXiv:2508.00265, 2025

arXiv 2025
[78]

Deformable detr: Deformable transformers for end-to-end object detection,

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020

Pith/arXiv arXiv 2010
[79]

Lisa: Reasoning segmentation via large language model,

X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia, “Lisa: Reasoning segmentation via large language model,” inCVPR. Los Alamitos, CA, USA: IEEE Computer Society, 2024

2024
[80]

Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding,

T. Zhang, X. Li, H. Fei, H. Yuan, S. Wu, S. Ji, C. L. Chen, and S. Yan, “Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding,” inNeurIPS. Red Hook, NY, USA: Curran Associates, Inc., 2024

2024
[81]

Generalizable entity grounding via assistance of large language model,

L. Qi, Y.-W. Chen, L. Yang, T. Shen, X. Li, W. Guo, Y. Xu, and M.- H. Yang, “Generalizable entity grounding via assistance of large language model,” arXiv preprint arXiv:2402.02555, 2024

Pith/arXiv arXiv 2024

Showing first 80 references.

[1] [1]

Qwen3.5: Towards native multimodal agents,

Qwen Team, “Qwen3.5: Towards native multimodal agents,” February 2026. [Online]. Available: https://qwen.ai/blog?id= qwen3.5

2026

[2] [2]

Qwen3- omni technical report,

J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, Y. Lv, Y. Wang, D. Guo, H. Wang, L. Ma, P . Zhang, X. Zhang, H. Hao, Z. Guo, B. Yang, B. Zhang, Z. Ma, X. Wei, S. Bai, K. Chen, X. Liu, P . Wang, M. Yang, D. Liu, X. Ren, B. Zheng, R. Men, F. Zhou, B. Yu, J. Yang, L. Yu, J. Zhou, and J. Lin, “Qwen3- omni technical report,” arXiv...

Pith/arXiv arXiv 2025

[3] [3]

Qwen3-vl technical report,

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chenet al., “Qwen3-vl technical report,” Nov. 2025. IEEE TRANSACTIONS ON PATTERN ANAL YSIS AND MACHINE INTELLIGENCE 24

2025

[4] [4]

Qwen2. 5-omni technical report,

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Danget al., “Qwen2. 5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

Pith/arXiv arXiv 2025

[5] [5]

Video-xl-2: Towards very long-video under- standing through task-aware kv sparsification,

M. Qin, X. Liu, Z. Liang, Y. Shu, H. Yuan, J. Zhou, S. Xiao, B. Zhao, and Z. Liu, “Video-xl-2: Towards very long-video under- standing through task-aware kv sparsification,” arXiv preprint arXiv:2506.19225, 2025

arXiv 2025

[6] [6]

Msr-vtt: A large video descrip- tion dataset for bridging video and language,

J. Xu, T. Mei, T. Yao, and Y. Rui, “Msr-vtt: A large video descrip- tion dataset for bridging video and language,” inProceedings of the IEEE conference on computer vision and pattern recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2016, pp. 5288–5296

2016

[7] [7]

Video question answering via gradually refined attention over appearance and motion,

D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang, “Video question answering via gradually refined attention over appearance and motion,” inProceedings of the 25th ACM interna- tional conference on Multimedia. New York, NY, USA: Association for Computing Machinery, 2017, pp. 1645–1653

2017

[8] [8]

Tgif-qa: Toward spatio-temporal reasoning in visual question answering,

Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, “Tgif-qa: Toward spatio-temporal reasoning in visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2017, pp. 2758–2766

2017

[9] [9]

LongVU: Spatiotemporal adaptive compression for long video- language understanding,

X. Shen, Y. Xiong, C. Zhao, L. Wu, J. Chen, C. Zhu, Z. Liu, F. Xiao, B. Varadarajan, F. Bordes, Z. Liu, H. Xu, H. J. Kim, B. Soran, R. Krishnamoorthi, M. Elhoseiny, and V . Chandra, “LongVU: Spatiotemporal adaptive compression for long video- language understanding,” inForty-second International Conference on Machine Learning, 2025

2025

[10] [10]

Longvideo-r1: Smart navigation for low-cost long video understanding,

J. Qiu, L. Xie, X. Huo, Q. Tian, and Q. Ye, “Longvideo-r1: Smart navigation for low-cost long video understanding,” arXiv preprint arXiv:2602.20913, 2026

Pith/arXiv arXiv 2026

[11] [11]

Video-o3: Native inter- leaved clue seeking for long video multi-hop reasoning,

X. Zeng, Z. Zhang, Y. Zhu, X. Li, Z. Wang, C. Ma, Q. Zhang, Z. Huang, K. Ouyang, T. Jianget al., “Video-o3: Native inter- leaved clue seeking for long video multi-hop reasoning,” arXiv preprint arXiv:2601.23224, 2026

Pith/arXiv arXiv 2026

[12] [12]

Ego4d: Around the world in 3,000 hours of egocentric video,

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liuet al., “Ego4d: Around the world in 3,000 hours of egocentric video,” inCVPR. Los Alamitos, CA, USA: IEEE Computer Society, 2022

2022

[13] [14]

A visually grounded language model for fetal ultrasound understanding,

X. Guo, M. Alsharid, H. Zhao, Y. Wang, J. Lander, A. T. Pa- pageorghiou, and J. A. Noble, “A visually grounded language model for fetal ultrasound understanding,” Nature Biomedical Engineering, advance online publication, 2026

2026

[14] [15]

Streamingvlm: Real-time understanding for infinite video streams,

R. Xu, G. Xiao, Y. Chen, L. He, K. Peng, Y. Lu, and S. Han, “Streamingvlm: Real-time understanding for infinite video streams,” 2025. [Online]. Available: https://arxiv.org/abs/2510. 09608

2025

[15] [16]

Timechat: A time- sensitive multimodal large language model for long video un- derstanding,

S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time- sensitive multimodal large language model for long video un- derstanding,” arXiv preprint arXiv:2312.02051, 2023

arXiv 2023

[16] [17]

Adaptive keyframe sampling for long video understanding,

X. Tang, J. Qiu, L. Xie, Y. Tian, J. Jiao, and Q. Ye, “Adaptive keyframe sampling for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition. Los Alamitos, CA, USA: IEEE Computer Society, 2025, pp. 29 118–29 128

2025

[17] [18]

FrameFusion: Combining similarity and importance for video token reduction on large vision language models,

T. Fu, T. Liu, Q. Han, G. Dai, S. Yan, H. Yang, X. Ning, and Y. Wang, “FrameFusion: Combining similarity and importance for video token reduction on large vision language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision. Los Alamitos, CA, USA: IEEE Computer Society, 2025, pp. 22 654–22 663

2025

[18] [19]

Moviechat: From dense token to sparse memory for long video understanding,

E. Song, W. Chai, G. Wang, Y. Zhang, H. Zhou, F. Wu, H. Chi, X. Guo, T. Ye, Y. Zhanget al., “Moviechat: From dense token to sparse memory for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2024, pp. 18 221–18 232

2024

[19] [20]

Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,

B. He, H. Li, Y. K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim, “Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 504–13 514

2024

[20] [21]

Memory-enhanced retrieval augmentation for long video understanding,

H. Yuan, Z. Liu, M. Qin, H. Qian, Y. Shu, Z. Dou, J.-R. Wen, and N. Sebe, “Memory-enhanced retrieval augmentation for long video understanding,” 2025. [Online]. Available: https://arxiv.org/abs/2503.09149

arXiv 2025

[21] [22]

Flash-vstream: Memory-based real-time understanding for long video streams,

H. Zhang, Y. Wang, Y. Tang, Y. Liu, J. Feng, J. Dai, and X. Jin, “Flash-vstream: Memory-based real-time understanding for long video streams,” 2024. [Online]. Available: https: //arxiv.org/abs/2406.08085

arXiv 2024

[22] [23]

Streammem: Query-agnostic kv cache memory for streaming video understanding,

Y. Yang, Z. Zhao, S. N. Shukla, A. Singh, S. K. Mishra, L. Zhang, and M. Ren, “Streammem: Query-agnostic kv cache memory for streaming video understanding,” 2025. [Online]. Available: https://arxiv.org/abs/2508.15717

arXiv 2025

[23] [24]

Video-r1: Reinforcing video reasoning in mllms,

K. Feng, K. Gong, B. Li, Z. Guo, Y. Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue, “Video-r1: Reinforcing video reasoning in mllms,” arXiv preprint arXiv:2503.21776, 2025

Pith/arXiv arXiv 2025

[24] [25]

Videorft: In- centivizing video reasoning capability in mllms via reinforced fine-tuning,

Q. Wang, Y. Yu, Y. Yuan, R. Mao, and T. Zhou, “Videorft: In- centivizing video reasoning capability in mllms via reinforced fine-tuning,” arXiv preprint arXiv:2505.12434, 2025

arXiv 2025

[25] [26]

Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence,

J. Meng, X. Li, H. Wang, Y. Tan, T. Zhang, L. Kong, Y. Tong, A. Wang, Z. Teng, Y. Wanget al., “Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence,” arXiv preprint arXiv:2510.20579, 2025

arXiv 2025

[26] [27]

Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning,

H. Zhang, X. Gu, J. Li, C. Ma, S. Bai, C. Zhang, B. Zhang, Z. Zhou, D. He, and Y. Tang, “Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning,” arXiv preprint arXiv:2508.04416, 2025

arXiv 2025

[27] [28]

Video-language understanding: A survey from model architecture, model training, and data per- spectives,

T. Nguyen, Y. Bin, J. Xiao, L. Qu, Y. Li, J. Z. Wu, C.-D. Nguyen, S. K. Ng, and L. A. Tuan, “Video-language understanding: A survey from model architecture, model training, and data per- spectives,” inFindings of the Association for Computational Lin- guistics: ACL 2024. Stroudsburg, PA, USA: Association for Computational Linguistics, 2024, pp. 3636–3657

2024

[28] [29]

Video understanding with large language models: A survey,

Y. Tang, J. Bi, S. Xu, L. Song, S. Liang, T. Wang, D. Zhang, J. An, J. Lin, R. Zhu, A. Vosoughi, C. Huang, Z. Zhang, P . Liu, M. Feng, F. Zheng, J.-L. Gaudiot, P . Luo, J. Luo, and C. Xu, “Video understanding with large language models: A survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 2, pp. 1355–1376, Feb. 2026

2026

[29] [30]

A survey on video temporal grounding with multimodal large lan- guage model,

J. Wu, W. Liu, Y. Liu, M. Liu, L. Nie, Z. Lin, and C. W. Chen, “A survey on video temporal grounding with multimodal large lan- guage model,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 2, pp. 1521–1541, 2026

2026

[30] [31]

Video-LMM post-training: A deep dive into video reasoning with large multimodal models,

Y. Tang, J. Bi, P . Liu, Z. Pan, Z. Tan, Q. Shen, J. Liu, H. Hua, J. Guo, Y. Xiaoet al., “Video-LMM post-training: A deep dive into video reasoning with large multimodal models,” arXiv preprint arXiv:2510.05034, 2025

arXiv 2025

[31] [32]

A survey of reinforcement learning for large reasoning models,

K. Zhang, Y. Zuo, B. He, Y. Sun, R. Liu, C. Jiang, Y. Fan, K. Tian, G. Jia, P . Liet al., “A survey of reinforcement learning for large reasoning models,” arXiv preprint arXiv:2509.08827, 2025

Pith/arXiv arXiv 2025

[32] [33]

Memory in the age of ai agents,

Y. Hu, S. Liu, Y. Yue, G. Zhang, B. Liu, F. Zhu, J. Lin, H. Guo, S. Dou, Z. Xiet al., “Memory in the age of ai agents,” arXiv preprint arXiv:2512.13564, 2025

Pith/arXiv arXiv 2025

[33] [34]

Token reduction should go beyond efficiency in generative models–from vision, language to multimodality,

Z. Kong, Y. Li, F. Zeng, L. Xin, S. Messica, X. Lin, P . Zhao, M. Kellis, H. Tang, and M. Zitnik, “Token reduction should go beyond efficiency in generative models–from vision, language to multimodality,” arXiv preprint arXiv:2505.18227, 2025

arXiv 2025

[34] [35]

Perception, reason, think, and plan: A survey on large multimodal reasoning models,

Y. Li, Z. Liu, Z. Li, X. Zhang, Z. Xu, X. Chen, H. Shi, S. Jiang, X. Wang, J. Wanget al., “Perception, reason, think, and plan: A survey on large multimodal reasoning models,” arXiv preprint arXiv:2505.04921, 2025

arXiv 2025

[35] [36]

Lita: Language instructed temporal- localization assistant,

D.-A. Huang, S. Liao, S. Radhakrishnan, H. Yin, P . Molchanov, Z. Yu, and J. Kautz, “Lita: Language instructed temporal- localization assistant,” inEuropean Conference on Computer Vision (ECCV). Cham, Switzerland: Springer, 2024

2024

[36] [37]

Universal video temporal grounding with generative multi-modal large language models,

Z. Li, S. Di, Z. Zhai, W. Huang, Y. Wang, and W. Xie, “Universal video temporal grounding with generative multi-modal large language models,” inAdvances in Neural Information Processing Systems (NeurIPS). Red Hook, NY, USA: Curran Associates, Inc., 2025, affiliations: Shanghai Jiao Tong University; ByteDance Seed

2025

[37] [38]

Timelens: Rethinking video temporal grounding with multi- modal llms,

J. Zhang, T. Wang, Y. Ge, Y. Ge, X. Li, Y. Shan, and L. Wang, “Timelens: Rethinking video temporal grounding with multi- modal llms,” arXiv preprint arXiv:2512.14698, 2025, affiliations: Nanjing University; ARC Lab, Tencent PCG; Shanghai AI Lab

arXiv 2025

[38] [39]

Towards one-to-many temporal grounding,

Q. Xu, T. Yue, S. Chen, J. Meng, A. Wang, S. Ji, H. Fei, and X. Li, “Towards one-to-many temporal grounding,” inProceedings of the 43rd International Conference on Machine Learning (ICML). Brookline, MA, USA: PMLR, 2026

2026

[39] [40]

Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos,

H. Yuan, X. Li, T. Zhang, Y. Sun, Z. Huang, S. Xu, S. Ji, Y. Tong, L. Qi, J. Fenget al., “Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos,” arXiv preprint arXiv:2501.04001, 2025. IEEE TRANSACTIONS ON PATTERN ANAL YSIS AND MACHINE INTELLIGENCE 25

Pith/arXiv arXiv 2025

[40] [41]

Sama: Towards multi-turn referential grounded video chat with large language models,

Y. Sun, H. Zhang, H. Ding, T. Zhang, X. Ma, and Y.-G. Jiang, “Sama: Towards multi-turn referential grounded video chat with large language models,” inAdvances in Neural Information Pro- cessing Systems. Red Hook, NY, USA: Curran Associates, Inc., 2025

2025

[41] [42]

Streaming dense video captioning,

G. Zhou, X. Xiong, A. Bhattacharyya, and J. J. Corso, “Streaming dense video captioning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2024, pp. 18 486–18 496

2024

[42] [43]

Do you remember? dense video captioning with cross-modal memory retrieval,

M. Kim, H. B. Kim, J. Moon, J. Choi, and S. T. Kim, “Do you remember? dense video captioning with cross-modal memory retrieval,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2024, pp. 13 894–13 904

2024

[43] [44]

Dibs: Enhancing dense video captioning with unlabeled videos via pseudo boundary en- richment and online refinement,

H. Wu, H. Liu, Y. Qiao, and X. Sun, “Dibs: Enhancing dense video captioning with unlabeled videos via pseudo boundary en- richment and online refinement,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2024, pp. 18 699–18 708

2024

[44] [45]

PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning,

L. Xu, Y. Huang, S. Xie, W. Wei, T. Li, B. Pan, Y. Zhao, and J. Yuan, “PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning,” arXiv preprint arXiv:2404.16994, 2024

Pith/arXiv arXiv 2024

[45] [46]

AuroraCap: Efficient, performant video detailed captioning and a new benchmark,

W. Chai, E. Song, Y. Du, C. Meng, V . Madhavan, O. Bar-Tal, J.- N. Hwang, S. Xie, and C. D. Manning, “AuroraCap: Efficient, performant video detailed captioning and a new benchmark,” arXiv preprint arXiv:2410.03051, 2024

arXiv 2024

[46] [47]

Tarsier2: Advancing large vision-language models from detailed video de- scription to comprehensive video understanding,

L. Yuan, J. Wang, H. Sun, Y. Zhang, and Y. Lin, “Tarsier2: Advancing large vision-language models from detailed video de- scription to comprehensive video understanding,” arXiv preprint arXiv:2501.07888, 2025

arXiv 2025

[47] [48]

Baichuan-omni technical report,

Y. Li, H. Sun, M. Lin, T. Li, G. Dong, T. Zhang, B. Ding, W. Song, Z. Cheng, Y. Huoet al., “Baichuan-omni technical report,” arXiv preprint arXiv:2410.08565, 2024

arXiv 2024

[48] [49]

Ming-omni: A unified mul- timodal model for perception and generation,

I. AI, B. Gong, C. Zou, C. Zheng, C. Zhou, C. Yan, C. Jin, C. Shen, D. Zheng, F. Wanget al., “Ming-omni: A unified mul- timodal model for perception and generation,” arXiv preprint arXiv:2506.09344, 2025

arXiv 2025

[49] [50]

Llama- omni: Seamless speech interaction with large language models,

Q. Fang, S. Guo, Y. Zhou, Z. Ma, S. Zhang, and Y. Feng, “Llama- omni: Seamless speech interaction with large language models,” arXiv preprint arXiv:2409.06666, 2024

arXiv 2024

[50] [51]

Stream-omni: Simultaneous multimodal interactions with large language- vision-speech model,

S. Zhang, S. Guo, Q. Fang, Y. Zhou, and Y. Feng, “Stream-omni: Simultaneous multimodal interactions with large language- vision-speech model,” arXiv preprint arXiv:2506.13642, 2025

arXiv 2025

[51] [52]

Omnicaptioner: One captioner to rule them all,

Y. Lu, J. Yuan, Z. Li, S. Zhao, Q. Qin, X. Li, L. Zhuo, L. Wen, D. Liu, Y. Caoet al., “Omnicaptioner: One captioner to rule them all,” arXiv preprint arXiv:2504.07089, 2025

arXiv 2025

[52] [53]

Omnivinci: Enhancing ar- chitecture and data for omni-modal understanding llm,

H. Ye, C.-H. H. Yang, A. Goel, W. Huang, L. Zhu, Y. Su, S. Lin, A.-C. Cheng, Z. Wan, J. Tianet al., “Omnivinci: Enhancing ar- chitecture and data for omni-modal understanding llm,” arXiv preprint arXiv:2510.15870, 2025

arXiv 2025

[53] [54]

Q-Frame: Query- aware frame selection and multi-resolution adaptation for video- LLMs,

S. Zhang, J. Yang, J. Yin, Z. Luo, and J. Luan, “Q-Frame: Query- aware frame selection and multi-resolution adaptation for video- LLMs,” arXiv preprint arXiv:2506.22139, 2025

arXiv 2025

[54] [55]

DyCoke: Dynamic compression of tokens for fast video large language models,

K. Tao, C. Qin, H. You, Y. Sui, and H. Wang, “DyCoke: Dynamic compression of tokens for fast video large language models,” in Proceedings of the Computer Vision and Pattern Recognition Confer- ence. Los Alamitos, CA, USA: IEEE Computer Society, 2025, pp. 18 992–19 001

2025

[55] [56]

Videonsa: Native sparse attention scales video un- derstanding,

E. Song, W. Chai, S. Yang, E. Armand, X. Shan, H. Xu, J. Xie, and Z. Tu, “Videonsa: Native sparse attention scales video un- derstanding,” arXiv preprint arXiv:2510.02295, 2025

arXiv 2025

[56] [57]

Vtimellm: Empower llm to grasp video moments,

B. Huang, X. Wang, H. Chen, Z. Song, and W. Zhu, “Vtimellm: Empower llm to grasp video moments,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA, USA: IEEE Computer Society, 2024

2024

[57] [58]

Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,

Y. Guo, J. Liu, M. Li, D. Cheng, X. Tang, D. Sui, Q. Liu, X. Chen, and K. Zhao, “Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,” inProceed- ings of the AAAI Conference on Artificial Intelligence. Palo Alto, CA, USA: AAAI Press, 2025

2025

[58] [59]

Distime: Distribution-based time representation for video large language models,

Y. Zeng, Z. Huang, Y. Zhong, C. Feng, J. Hu, L. Ma, and Y. Liu, “Distime: Distribution-based time representation for video large language models,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA, USA: IEEE Computer Society, 2025

2025

[59] [60]

Self-chained image- language model for video localization and question answering,

S. Yu, J. Cho, P . Yadav, and M. Bansal, “Self-chained image- language model for video localization and question answering,” inAdvances in Neural Information Processing Systems (NeurIPS). Red Hook, NY, USA: Curran Associates, Inc., 2023, affiliation: UNC Chapel Hill

2023

[60] [61]

Llava-mr: Large language-and-vision assistant for video moment retrieval,

W. Lu, J. Li, A. Yu, M.-C. Chang, S. Ji, and M. Xia, “Llava-mr: Large language-and-vision assistant for video moment retrieval,” arXiv preprint arXiv:2411.14505, 2024, affiliations: Peking Univer- sity; Tencent Youtu; University at Albany; Zhejiang University. (* indicates corresponding author in the paper.)

arXiv 2024

[61] [62]

Timesuite: Improving mllms for long video understanding via grounded tuning,

X. Zeng, K. Li, C. Wang, X. Li, T. Jiang, Z. Yan, S. Li, Y. Shi, Z. Yue, Y. Wang, Y. Wang, Y. Qiao, and L. Wang, “Timesuite: Improving mllms for long video understanding via grounded tuning,” inInternational Conference on Learning Representations (ICLR). Online: OpenReview.net, 2025

2025

[62] [63]

Scanning only once: An end-to-end framework for fast temporal grounding in long videos,

Y. Pan, X. He, B. Gong, Y. Lv, Y. Shen, Y. Peng, and D. Zhao, “Scanning only once: An end-to-end framework for fast temporal grounding in long videos,” inProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV). Los Alamitos, CA, USA: IEEE Computer Society, 2023, affiliations: Alibaba Group; Wangxuan Institute of Computer Technolog...

2023

[63] [64]

Trace: Temporal grounding video llm via causal event modeling,

Y. Guo, J. Liu, M. Li, Q. Liu, X. Chen, and X. Tang, “Trace: Temporal grounding video llm via causal event modeling,” in International Conference on Learning Representations (ICLR). On- line: OpenReview.net, 2025, affiliations: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; Tencent PCG; Shenzhen Institute of Artificial I...

2025

[64] [65]

Tar-tvg: Enhancing vlms with timestamp anchor-constrained reasoning for temporal video grounding,

C. Guo, X. Mo, Y. Nie, X. Xu, C. Xu, F. Yu, and C. Long, “Tar-tvg: Enhancing vlms with timestamp anchor-constrained reasoning for temporal video grounding,” arXiv preprint arXiv:2508.07683, 2025

arXiv 2025

[65] [66]

Grounded-videollm: Sharpening fine- grained temporal grounding in video large language models,

H. Wang, Z. Xu, Y. Cheng, S. Diao, Y. Zhou, Y. Cao, Q. Wang, W. Ge, and L. Huang, “Grounded-videollm: Sharpening fine- grained temporal grounding in video large language models,” arXiv preprint arXiv:2410.03290, 2024

arXiv 2024

[66] [67]

Momentor: Advancing video large language model with fine-grained temporal reasoning,

L. Qian, J. Li, Y. Wu, Y. Ye, H. Fei, T.-S. Chua, Y. Zhuang, and S. Tang, “Momentor: Advancing video large language model with fine-grained temporal reasoning,” inProceedings of the 41st International Conference on Machine Learning (ICML). Brookline, MA, USA: PMLR, 2024

2024

[67] [68]

Videoperceiver: Enhancing fine-grained temporal perception in video multimodal large language models,

F. Zhao, L. Zhang, D. Shi, Y. Gao, C. Ye, Y. Cai, J. Gao, and D. Yan, “Videoperceiver: Enhancing fine-grained temporal perception in video multimodal large language models,” arXiv preprint arXiv:2511.18823, 2025

arXiv 2025

[68] [69]

Time-r1: Post-training large vision language model for temporal video grounding,

Y. Wang, Z. Wang, B. Xu, Y. Du, K. Lin, Z. Xiao, Z. Yue, J. Ju, L. Zhang, D. Yanget al., “Time-r1: Post-training large vision language model for temporal video grounding,” arXiv preprint arXiv:2503.13377, 2025

Pith/arXiv arXiv 2025

[69] [70]

Video-opd: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation,

J. Li, H. Yin, H. Xu, B. Xu, W. Tan, Z. He, J. Ju, Z. Luo, and J. Luan, “Video-opd: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation,” arXiv preprint arXiv:2602.02994, 2026

Pith/arXiv arXiv 2026

[70] [71]

Videozoomer: Reinforcement-learned temporal focusing for long video reason- ing,

Y. Ding, Y. Zhang, X. Lai, R. Chu, and Y. Yang, “Videozoomer: Reinforcement-learned temporal focusing for long video reason- ing,” arXiv preprint arXiv:2512.22315, 2025

arXiv 2025

[71] [72]

Datasets and recipes for video temporal grounding via reinforcement learning,

R. Chen, T. Luo, Z. Fan, H. Zou, Z. Feng, G. Xie, H. Zhang, Z. Wang, Z. Liu, and Z. Huaijian, “Datasets and recipes for video temporal grounding via reinforcement learning,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. Stroudsburg, PA, USA: Association for Computational Linguistics, 2025, pp. 983–992

2025

[72] [73]

Museg: Reinforc- ing video temporal understanding via timestamp-aware multi- segment grounding,

F. Luo, S. Lou, C. Chen, Z. Wang, C. Li, W. Shen, J. Guo, P . Li, M. Yan, J. Zhang, F. Huang, and Y. Liu, “Museg: Reinforc- ing video temporal understanding via timestamp-aware multi- segment grounding,” arXiv preprint arXiv:2505.20715, 2025

Pith/arXiv arXiv 2025

[73] [74]

Detect anything via next point prediction,

Q. Jiang, J. Huo, X. Chen, Y. Xiong, Z. Zeng, Y. Chen, T. Ren, J. Yu, and L. Zhang, “Detect anything via next point prediction,” arXiv preprint arXiv:2510.12798, 2025

arXiv 2025

[74] [75]

Thinking with bounding boxes: Enhancing spatio-temporal video grounding via reinforcement fine-tuning,

X. Gu, H. Zhang, Q. Fan, J. Niu, Z. Zhang, L. Zhang, G. Chen, F. Chen, L. Wen, and S. Zhu, “Thinking with bounding boxes: Enhancing spatio-temporal video grounding via reinforcement fine-tuning,” arXiv preprint arXiv:2511.21375, 2025

arXiv 2025

[75] [76]

Universal instance perception as object discovery and retrieval,

B. Yan, Y. Jiang, J. Wu, D. Wang, Z. Yuan, P . Luo, and H. Lu, IEEE TRANSACTIONS ON PATTERN ANAL YSIS AND MACHINE INTELLIGENCE 26 “Universal instance perception as object discovery and retrieval,” inCVPR. Los Alamitos, CA, USA: IEEE Computer Society, 2023

2023

[76] [77]

Multimodal referring segmentation: A survey,

H. Ding, S. Tang, S. He, C. Liu, Z. Wu, and Y.-G. Jiang, “Multimodal referring segmentation: A survey,” arXiv preprint arXiv:2508.00265, 2025

arXiv 2025

[77] [78]

Deformable detr: Deformable transformers for end-to-end object detection,

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020

Pith/arXiv arXiv 2010

[78] [79]

Lisa: Reasoning segmentation via large language model,

X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia, “Lisa: Reasoning segmentation via large language model,” inCVPR. Los Alamitos, CA, USA: IEEE Computer Society, 2024

2024

[79] [80]

Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding,

T. Zhang, X. Li, H. Fei, H. Yuan, S. Wu, S. Ji, C. L. Chen, and S. Yan, “Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding,” inNeurIPS. Red Hook, NY, USA: Curran Associates, Inc., 2024

2024

[80] [81]

Generalizable entity grounding via assistance of large language model,

L. Qi, Y.-W. Chen, L. Yang, T. Shen, X. Li, W. Guo, Y. Xu, and M.- H. Yang, “Generalizable entity grounding via assistance of large language model,” arXiv preprint arXiv:2402.02555, 2024

Pith/arXiv arXiv 2024