
arXiv: 2606.07433 · v1 · pith:W7NYHLPEnew · submitted 2026-06-05 · 💻 cs.CV · cs.AI · cs.MM

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Pith reviewed 2026-06-27 21:59 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords: video understanding, multimodal large language models, watching remembering reasoning, long video processing, memory modeling, egocentric video, streaming understanding, faithful reasoning

The pith

Video MLLMs acquire evidence, preserve context, and produce outputs through watching, remembering, and reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a unified human-view framework for LLM-based video understanding built around three functional abilities: watching for evidence acquisition, remembering for context preservation, and reasoning for grounded outputs. Instead of isolated task benchmarks, this structure analyzes how models handle sparse evidence, long-range dependencies, and multimodal alignment under computational limits. It formulates video systems through perceptual representations, memory states, reasoning traces, and final predictions, then maps representative methods, challenges, application domains, datasets, and benchmarks onto these elements. The work reviews current approaches across perception, memory, and reasoning while highlighting open problems for scalable video intelligence.
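One way to picture evidence acquisition under a compute limit is query-aware frame selection, which the Figure 2 caption lists under efficient watching. The sketch below is illustrative only and is not taken from the paper: the function names `cosine` and `select_frames`, the toy feature vectors, and the use of cosine similarity as the relevance score are all assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(
    frame_features: Sequence[Sequence[float]],
    query_feature: Sequence[float],
    budget: int,
) -> List[int]:
    """Keep the `budget` frames most similar to the query (hypothetical helper).

    Returns the kept frame indices in temporal order, so downstream
    components still see the frames in the order they occurred.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(
        range(len(frame_features)),
        key=lambda i: cosine(frame_features[i], query_feature),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Toy usage: four 2-D frame features, a query aligned with the first axis.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1]]
kept = select_frames(frames, query_feature=[1.0, 0.0], budget=2)
```

In a real system the features would come from a vision encoder and the query from a text encoder; the point here is only that a fixed frame budget forces a choice about which sparse evidence survives.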

Core claim

Video understanding with MLLMs is best characterized by a formulation that decomposes systems into perceptual representations, memory states, reasoning traces, and final predictions, which in turn map onto the three abilities of watching, remembering, and reasoning; this decomposition supplies a single lens for organizing methods, identifying challenges in spatio-temporal perception and memory modeling, and covering domains from egocentric to narrative videos.
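The decomposition can be made concrete with a small sketch. This is not the paper's formal definition: every class and function name below (`PerceptualRepresentation`, `MemoryState`, `ReasoningTrace`, `Prediction`, `watch`, `remember`, `reason`, `answer_query`) is hypothetical, and simple string matching stands in for real perception and reasoning models.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PerceptualRepresentation:
    """Output of watching: evidence extracted from one frame (toy text here)."""
    frame_index: int
    description: str


@dataclass
class MemoryState:
    """Output of remembering: the context preserved so far."""
    entries: List[PerceptualRepresentation] = field(default_factory=list)


@dataclass
class ReasoningTrace:
    """Intermediate steps produced while reasoning."""
    steps: List[str] = field(default_factory=list)


@dataclass
class Prediction:
    """Final output, kept together with the trace that produced it."""
    answer: str
    trace: ReasoningTrace


def watch(frames: List[str]) -> List[PerceptualRepresentation]:
    # Stand-in for a visual encoder: normalize each frame's caption.
    return [PerceptualRepresentation(i, f.lower()) for i, f in enumerate(frames)]


def remember(memory: MemoryState, percepts: List[PerceptualRepresentation]) -> MemoryState:
    # Stand-in for memory modeling: append everything (no compression).
    return MemoryState(entries=memory.entries + percepts)


def reason(memory: MemoryState, query: str) -> Prediction:
    # Stand-in for an MLLM: find the first remembered frame mentioning the query.
    trace = ReasoningTrace()
    hits = [e for e in memory.entries if query.lower() in e.description]
    trace.steps.append(f"searched {len(memory.entries)} memory entries for '{query}'")
    trace.steps.append(f"found {len(hits)} matching frames")
    answer = f"frame {hits[0].frame_index}" if hits else "not found"
    return Prediction(answer=answer, trace=trace)


def answer_query(frames: List[str], query: str) -> Prediction:
    percepts = watch(frames)
    memory = remember(MemoryState(), percepts)
    return reason(memory, query)


result = answer_query(["A dog runs", "A cat sleeps", "The cat wakes"], "cat")
```

Keeping the four outputs as separate types means each stage can be inspected or replaced on its own, which is the kind of component-level analysis the framing is meant to support.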

What carries the argument

The three functional abilities—watching, remembering, and reasoning—that partition video MLLM behavior and link the four system components (perceptual representations, memory states, reasoning traces, final predictions) into a unified analysis structure.

Load-bearing premise

Every video understanding system can be usefully described by four fixed components and partitioned without overlap or remainder into the three abilities of watching, remembering, and reasoning.

What would settle it

A deployed video MLLM whose accuracy, efficiency, or failure modes on long videos cannot be improved or explained by separately measuring or modifying its watching, remembering, or reasoning components.

Figures

Figures reproduced from arXiv: 2606.07433 by Guangliang Cheng, Haochen Wang, Jason Li, Jiahao Meng, Jiangning Zhang, Kuan Gao, Lingdong Kong, Lu Qi, Minghsuan Yang, Qianyu Zhou, Qi Xu, Weisong Liu, Yanwei Li, Yue Tan, Yunhai Tong.

Figure 1: Overview of our survey. Left: the survey pipeline. Right: our Watch–Remember–Reason taxonomy for MLLM-based video understanding. Watch (Sec. 3.1) covers fine-grained grounding, captioning, audio-visual perception, and efficient processing. Remember (Sec. 3.2) includes offline and streaming memory. Reason (Sec. 3.3) covers text-only reasoning and thinking with videos, with both agentic and non-agent approac…
Figure 2: Overview of methods related to "How to Watch?". Fine-grained watching localizes task-relevant evidence in time and space. Comprehensive watching abstracts videos into summaries and segment-level or region-level descriptions. Audio-visual watching aligns visual and acoustic streams for omni-modal perception. Efficient watching reduces redundancy through frame selection, token compression, and efficient mod…
Figure 3: Overview of methods related to "How to Remember?". Agentic offline memory constructs and updates external memory through LLM/VLM agents. Non-agentic offline memory builds structured short-term and long-term memory via event extraction, frame selection, token compression, and event clustering. Streaming memory maintains and retrieves memory online through sliding windows, recent memory, and long-term memory…
Figure 4: Overview of methods related to "How to Reason?". Agentic text-only reasoning methods decompose reasoning into modular steps such as clip summarization, adaptive search, memory retrieval, reflection, and answer verification. Non-agent text-only reasoning methods perform a single MLLM forward pass and produce a textual chain-of-thought with the final answer. Agentic thinking with videos methods actively inter…
Original abstract

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.
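The streaming memory mentioned in the abstract, which the Figure 3 caption describes in terms of sliding windows, recent memory, and long-term memory, can be sketched roughly as follows. This is a simplified illustration and not the mechanism of any specific surveyed method: the `StreamingMemory` class, its parameters, and the rule of averaging the two oldest long-term entries when capacity is exceeded are all assumptions chosen for brevity.

```python
from collections import deque
from typing import Deque, List


class StreamingMemory:
    """Toy streaming memory: a sliding window of recent features plus a
    bounded long-term store that merges its oldest entries when full."""

    def __init__(self, window_size: int, long_term_capacity: int) -> None:
        if window_size < 1 or long_term_capacity < 1:
            raise ValueError("window_size and long_term_capacity must be >= 1")
        self.window_size = window_size
        self.long_term_capacity = long_term_capacity
        self.recent: Deque[List[float]] = deque()
        self.long_term: List[List[float]] = []

    def observe(self, feature: List[float]) -> None:
        """Add one frame feature; frames leaving the window are consolidated."""
        self.recent.append(list(feature))
        if len(self.recent) > self.window_size:
            self._consolidate(self.recent.popleft())

    def _consolidate(self, feature: List[float]) -> None:
        """Move an evicted feature into long-term memory, compressing if needed."""
        self.long_term.append(feature)
        if len(self.long_term) > self.long_term_capacity:
            # Assumed compression rule: average the two oldest entries into one.
            oldest = self.long_term.pop(0)
            second = self.long_term.pop(0)
            merged = [(x + y) / 2 for x, y in zip(oldest, second)]
            self.long_term.insert(0, merged)

    def context(self) -> List[List[float]]:
        """Memory handed to the reasoner: long-term entries, then recent ones."""
        return self.long_term + list(self.recent)


memory = StreamingMemory(window_size=2, long_term_capacity=2)
for value in [0.0, 2.0, 4.0, 6.0, 8.0]:
    memory.observe([value])
stream_context = memory.context()
```

The total context size stays bounded by `window_size + long_term_capacity` no matter how long the stream runs, at the cost of losing detail in older content; methods differ mainly in how they choose what to merge or discard.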

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. This survey paper proposes a human-view organizational framework for video understanding with multimodal large language models (MLLMs), structured around three functional abilities (watching, remembering, reasoning) and four system components (perceptual representations, memory states, reasoning traces, final predictions). It uses this framing to categorize methods for fine-grained perception, memory modeling (offline/streaming), and reasoning (text-only or video-grounded), while surveying challenges in spatio-temporal perception and long-video processing, application domains (egocentric, sports, medical, etc.), datasets, benchmarks, and open problems.

Significance. If the proposed partition proves useful for synthesis, the work could help researchers map existing methods onto a common structure for identifying gaps in memory-aware and evidence-grounded video intelligence, particularly as the field shifts toward long, multimodal scenarios. Its value is primarily in organization and coverage rather than new derivations or measurements.

minor comments (3)
  1. The formulation characterizing systems by perceptual representations, memory states, reasoning traces, and final predictions is introduced in the abstract and presumably detailed early in the manuscript; if this is presented only descriptively without a diagram or explicit mapping table to the three abilities, it risks remaining informal for readers attempting to apply the framework to new papers.
  2. The abstract states that representative methods are 'organized by their roles in video MLLM systems' under the three abilities, but without an explicit cross-reference table or section that lists which cited works map to which component, the organizational claim is harder to verify.
  3. The GitHub link for continuously traced related works is mentioned but not cited as a reference; adding a formal citation or footnote would improve traceability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our survey and the recommendation of minor revision. No specific major comments were provided in the report, so we have no individual points requiring point-by-point rebuttal. We will incorporate minor improvements for clarity and completeness in the revised version.

Circularity Check

0 steps flagged

No significant circularity; organizational survey framing is self-contained

full rationale

The paper is a literature survey that proposes an organizational perspective on video MLLMs structured around three functional abilities (watching, remembering, reasoning) and four system components (perceptual representations, memory states, reasoning traces, final predictions). The central claim is that this supplies a unified structure for analyzing existing methods, challenges, datasets, and benchmarks rather than introducing new empirical results or a falsifiable model. The partition is presented as a useful framing for the survey; no stronger claim of exhaustiveness, disjointness, or predictive power is required for the work to fulfill its stated purpose. No equations, fitted parameters, predictions, or self-citation chains appear in the derivation chain. All cited works are external and the framework does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a survey and introduces no new free parameters, mathematical axioms, or invented physical entities; it synthesizes prior literature under an organizational framing.

pith-pipeline@v0.9.1-grok · 5856 in / 1148 out tokens · 15229 ms · 2026-06-27T21:59:04.020071+00:00 · methodology


Reference graph

Works this paper leans on

298 extracted references · 39 linked inside Pith

  1. [1]

    Qwen3.5: Towards native multimodal agents,

    Qwen Team, “Qwen3.5: Towards native multimodal agents,” February 2026. [Online]. Available: https://qwen.ai/blog?id= qwen3.5

  2. [2]

    Qwen3- omni technical report,

    J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, Y. Lv, Y. Wang, D. Guo, H. Wang, L. Ma, P . Zhang, X. Zhang, H. Hao, Z. Guo, B. Yang, B. Zhang, Z. Ma, X. Wei, S. Bai, K. Chen, X. Liu, P . Wang, M. Yang, D. Liu, X. Ren, B. Zheng, R. Men, F. Zhou, B. Yu, J. Yang, L. Yu, J. Zhou, and J. Lin, “Qwen3- omni technical report,” arXiv...

  3. [3]

    Qwen3-vl technical report,

    S. Bai, Y. Cai, R. Chen, K. Chen, X. Chenet al., “Qwen3-vl technical report,” Nov. 2025. IEEE TRANSACTIONS ON PATTERN ANAL YSIS AND MACHINE INTELLIGENCE 24

  4. [4]

    Qwen2. 5-omni technical report,

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Danget al., “Qwen2. 5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  5. [5]

    Video-xl-2: Towards very long-video under- standing through task-aware kv sparsification,

    M. Qin, X. Liu, Z. Liang, Y. Shu, H. Yuan, J. Zhou, S. Xiao, B. Zhao, and Z. Liu, “Video-xl-2: Towards very long-video under- standing through task-aware kv sparsification,” arXiv preprint arXiv:2506.19225, 2025

  6. [6]

    Msr-vtt: A large video descrip- tion dataset for bridging video and language,

    J. Xu, T. Mei, T. Yao, and Y. Rui, “Msr-vtt: A large video descrip- tion dataset for bridging video and language,” inProceedings of the IEEE conference on computer vision and pattern recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2016, pp. 5288–5296

  7. [7]

    Video question answering via gradually refined attention over appearance and motion,

    D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang, “Video question answering via gradually refined attention over appearance and motion,” inProceedings of the 25th ACM interna- tional conference on Multimedia. New York, NY, USA: Association for Computing Machinery, 2017, pp. 1645–1653

  8. [8]

    Tgif-qa: Toward spatio-temporal reasoning in visual question answering,

    Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, “Tgif-qa: Toward spatio-temporal reasoning in visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2017, pp. 2758–2766

  9. [9]

    LongVU: Spatiotemporal adaptive compression for long video- language understanding,

    X. Shen, Y. Xiong, C. Zhao, L. Wu, J. Chen, C. Zhu, Z. Liu, F. Xiao, B. Varadarajan, F. Bordes, Z. Liu, H. Xu, H. J. Kim, B. Soran, R. Krishnamoorthi, M. Elhoseiny, and V . Chandra, “LongVU: Spatiotemporal adaptive compression for long video- language understanding,” inForty-second International Conference on Machine Learning, 2025

  10. [10]

    Longvideo-r1: Smart navigation for low-cost long video understanding,

    J. Qiu, L. Xie, X. Huo, Q. Tian, and Q. Ye, “Longvideo-r1: Smart navigation for low-cost long video understanding,” arXiv preprint arXiv:2602.20913, 2026

  11. [11]

    Video-o3: Native inter- leaved clue seeking for long video multi-hop reasoning,

    X. Zeng, Z. Zhang, Y. Zhu, X. Li, Z. Wang, C. Ma, Q. Zhang, Z. Huang, K. Ouyang, T. Jianget al., “Video-o3: Native inter- leaved clue seeking for long video multi-hop reasoning,” arXiv preprint arXiv:2601.23224, 2026

  12. [12]

    Ego4d: Around the world in 3,000 hours of egocentric video,

    K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liuet al., “Ego4d: Around the world in 3,000 hours of egocentric video,” inCVPR. Los Alamitos, CA, USA: IEEE Computer Society, 2022

  13. [14]

    A visually grounded language model for fetal ultrasound understanding,

    X. Guo, M. Alsharid, H. Zhao, Y. Wang, J. Lander, A. T. Pa- pageorghiou, and J. A. Noble, “A visually grounded language model for fetal ultrasound understanding,” Nature Biomedical Engineering, advance online publication, 2026

  14. [15]

    Streamingvlm: Real-time understanding for infinite video streams,

    R. Xu, G. Xiao, Y. Chen, L. He, K. Peng, Y. Lu, and S. Han, “Streamingvlm: Real-time understanding for infinite video streams,” 2025. [Online]. Available: https://arxiv.org/abs/2510. 09608

  15. [16]

    Timechat: A time- sensitive multimodal large language model for long video un- derstanding,

    S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time- sensitive multimodal large language model for long video un- derstanding,” arXiv preprint arXiv:2312.02051, 2023

  16. [17]

    Adaptive keyframe sampling for long video understanding,

    X. Tang, J. Qiu, L. Xie, Y. Tian, J. Jiao, and Q. Ye, “Adaptive keyframe sampling for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition. Los Alamitos, CA, USA: IEEE Computer Society, 2025, pp. 29 118–29 128

  17. [18]

    FrameFusion: Combining similarity and importance for video token reduction on large vision language models,

    T. Fu, T. Liu, Q. Han, G. Dai, S. Yan, H. Yang, X. Ning, and Y. Wang, “FrameFusion: Combining similarity and importance for video token reduction on large vision language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision. Los Alamitos, CA, USA: IEEE Computer Society, 2025, pp. 22 654–22 663

  18. [19]

    Moviechat: From dense token to sparse memory for long video understanding,

    E. Song, W. Chai, G. Wang, Y. Zhang, H. Zhou, F. Wu, H. Chi, X. Guo, T. Ye, Y. Zhanget al., “Moviechat: From dense token to sparse memory for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2024, pp. 18 221–18 232

  19. [20]

    Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,

    B. He, H. Li, Y. K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim, “Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 504–13 514

  20. [21]

    Memory-enhanced retrieval augmentation for long video understanding,

    H. Yuan, Z. Liu, M. Qin, H. Qian, Y. Shu, Z. Dou, J.-R. Wen, and N. Sebe, “Memory-enhanced retrieval augmentation for long video understanding,” 2025. [Online]. Available: https://arxiv.org/abs/2503.09149

  21. [22]

    Flash-vstream: Memory-based real-time understanding for long video streams,

    H. Zhang, Y. Wang, Y. Tang, Y. Liu, J. Feng, J. Dai, and X. Jin, “Flash-vstream: Memory-based real-time understanding for long video streams,” 2024. [Online]. Available: https: //arxiv.org/abs/2406.08085

  22. [23]

    Streammem: Query-agnostic kv cache memory for streaming video understanding,

    Y. Yang, Z. Zhao, S. N. Shukla, A. Singh, S. K. Mishra, L. Zhang, and M. Ren, “Streammem: Query-agnostic kv cache memory for streaming video understanding,” 2025. [Online]. Available: https://arxiv.org/abs/2508.15717

  23. [24]

    Video-r1: Reinforcing video reasoning in mllms,

    K. Feng, K. Gong, B. Li, Z. Guo, Y. Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue, “Video-r1: Reinforcing video reasoning in mllms,” arXiv preprint arXiv:2503.21776, 2025

  24. [25]

    Videorft: In- centivizing video reasoning capability in mllms via reinforced fine-tuning,

    Q. Wang, Y. Yu, Y. Yuan, R. Mao, and T. Zhou, “Videorft: In- centivizing video reasoning capability in mllms via reinforced fine-tuning,” arXiv preprint arXiv:2505.12434, 2025

  25. [26]

    Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence,

    J. Meng, X. Li, H. Wang, Y. Tan, T. Zhang, L. Kong, Y. Tong, A. Wang, Z. Teng, Y. Wanget al., “Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence,” arXiv preprint arXiv:2510.20579, 2025

  26. [27]

    Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning,

    H. Zhang, X. Gu, J. Li, C. Ma, S. Bai, C. Zhang, B. Zhang, Z. Zhou, D. He, and Y. Tang, “Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning,” arXiv preprint arXiv:2508.04416, 2025

  27. [28]

    Video-language understanding: A survey from model architecture, model training, and data per- spectives,

    T. Nguyen, Y. Bin, J. Xiao, L. Qu, Y. Li, J. Z. Wu, C.-D. Nguyen, S. K. Ng, and L. A. Tuan, “Video-language understanding: A survey from model architecture, model training, and data per- spectives,” inFindings of the Association for Computational Lin- guistics: ACL 2024. Stroudsburg, PA, USA: Association for Computational Linguistics, 2024, pp. 3636–3657

  28. [29]

    Video understanding with large language models: A survey,

    Y. Tang, J. Bi, S. Xu, L. Song, S. Liang, T. Wang, D. Zhang, J. An, J. Lin, R. Zhu, A. Vosoughi, C. Huang, Z. Zhang, P . Liu, M. Feng, F. Zheng, J.-L. Gaudiot, P . Luo, J. Luo, and C. Xu, “Video understanding with large language models: A survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 2, pp. 1355–1376, Feb. 2026

  29. [30]

    A survey on video temporal grounding with multimodal large lan- guage model,

    J. Wu, W. Liu, Y. Liu, M. Liu, L. Nie, Z. Lin, and C. W. Chen, “A survey on video temporal grounding with multimodal large lan- guage model,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 2, pp. 1521–1541, 2026

  30. [31]

    Video-LMM post-training: A deep dive into video reasoning with large multimodal models,

    Y. Tang, J. Bi, P . Liu, Z. Pan, Z. Tan, Q. Shen, J. Liu, H. Hua, J. Guo, Y. Xiaoet al., “Video-LMM post-training: A deep dive into video reasoning with large multimodal models,” arXiv preprint arXiv:2510.05034, 2025

  31. [32]

    A survey of reinforcement learning for large reasoning models,

    K. Zhang, Y. Zuo, B. He, Y. Sun, R. Liu, C. Jiang, Y. Fan, K. Tian, G. Jia, P . Liet al., “A survey of reinforcement learning for large reasoning models,” arXiv preprint arXiv:2509.08827, 2025

  32. [33]

    Memory in the age of ai agents,

    Y. Hu, S. Liu, Y. Yue, G. Zhang, B. Liu, F. Zhu, J. Lin, H. Guo, S. Dou, Z. Xiet al., “Memory in the age of ai agents,” arXiv preprint arXiv:2512.13564, 2025

  33. [34]

    Token reduction should go beyond efficiency in generative models–from vision, language to multimodality,

    Z. Kong, Y. Li, F. Zeng, L. Xin, S. Messica, X. Lin, P . Zhao, M. Kellis, H. Tang, and M. Zitnik, “Token reduction should go beyond efficiency in generative models–from vision, language to multimodality,” arXiv preprint arXiv:2505.18227, 2025

  34. [35]

    Perception, reason, think, and plan: A survey on large multimodal reasoning models,

    Y. Li, Z. Liu, Z. Li, X. Zhang, Z. Xu, X. Chen, H. Shi, S. Jiang, X. Wang, J. Wanget al., “Perception, reason, think, and plan: A survey on large multimodal reasoning models,” arXiv preprint arXiv:2505.04921, 2025

  35. [36]

    Lita: Language instructed temporal- localization assistant,

    D.-A. Huang, S. Liao, S. Radhakrishnan, H. Yin, P . Molchanov, Z. Yu, and J. Kautz, “Lita: Language instructed temporal- localization assistant,” inEuropean Conference on Computer Vision (ECCV). Cham, Switzerland: Springer, 2024

  36. [37]

    Universal video temporal grounding with generative multi-modal large language models,

    Z. Li, S. Di, Z. Zhai, W. Huang, Y. Wang, and W. Xie, “Universal video temporal grounding with generative multi-modal large language models,” inAdvances in Neural Information Processing Systems (NeurIPS). Red Hook, NY, USA: Curran Associates, Inc., 2025, affiliations: Shanghai Jiao Tong University; ByteDance Seed

  37. [38]

    Timelens: Rethinking video temporal grounding with multi- modal llms,

    J. Zhang, T. Wang, Y. Ge, Y. Ge, X. Li, Y. Shan, and L. Wang, “Timelens: Rethinking video temporal grounding with multi- modal llms,” arXiv preprint arXiv:2512.14698, 2025, affiliations: Nanjing University; ARC Lab, Tencent PCG; Shanghai AI Lab

  38. [39]

    Towards one-to-many temporal grounding,

    Q. Xu, T. Yue, S. Chen, J. Meng, A. Wang, S. Ji, H. Fei, and X. Li, “Towards one-to-many temporal grounding,” inProceedings of the 43rd International Conference on Machine Learning (ICML). Brookline, MA, USA: PMLR, 2026

  39. [40]

    Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos,

    H. Yuan, X. Li, T. Zhang, Y. Sun, Z. Huang, S. Xu, S. Ji, Y. Tong, L. Qi, J. Fenget al., “Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos,” arXiv preprint arXiv:2501.04001, 2025. IEEE TRANSACTIONS ON PATTERN ANAL YSIS AND MACHINE INTELLIGENCE 25

  40. [41]

    Sama: Towards multi-turn referential grounded video chat with large language models,

    Y. Sun, H. Zhang, H. Ding, T. Zhang, X. Ma, and Y.-G. Jiang, “Sama: Towards multi-turn referential grounded video chat with large language models,” inAdvances in Neural Information Pro- cessing Systems. Red Hook, NY, USA: Curran Associates, Inc., 2025

  41. [42]

    Streaming dense video captioning,

    G. Zhou, X. Xiong, A. Bhattacharyya, and J. J. Corso, “Streaming dense video captioning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2024, pp. 18 486–18 496

  42. [43]

    Do you remember? dense video captioning with cross-modal memory retrieval,

    M. Kim, H. B. Kim, J. Moon, J. Choi, and S. T. Kim, “Do you remember? dense video captioning with cross-modal memory retrieval,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2024, pp. 13 894–13 904

  43. [44]

    Dibs: Enhancing dense video captioning with unlabeled videos via pseudo boundary en- richment and online refinement,

    H. Wu, H. Liu, Y. Qiao, and X. Sun, “Dibs: Enhancing dense video captioning with unlabeled videos via pseudo boundary en- richment and online refinement,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition. Los Alamitos, CA, USA: IEEE Computer Society, 2024, pp. 18 699–18 708

  44. [45]

    PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning,

    L. Xu, Y. Huang, S. Xie, W. Wei, T. Li, B. Pan, Y. Zhao, and J. Yuan, “PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning,” arXiv preprint arXiv:2404.16994, 2024

  45. [46]

    AuroraCap: Efficient, performant video detailed captioning and a new benchmark,

    W. Chai, E. Song, Y. Du, C. Meng, V . Madhavan, O. Bar-Tal, J.- N. Hwang, S. Xie, and C. D. Manning, “AuroraCap: Efficient, performant video detailed captioning and a new benchmark,” arXiv preprint arXiv:2410.03051, 2024

  46. [47]

    Tarsier2: Advancing large vision-language models from detailed video de- scription to comprehensive video understanding,

    L. Yuan, J. Wang, H. Sun, Y. Zhang, and Y. Lin, “Tarsier2: Advancing large vision-language models from detailed video de- scription to comprehensive video understanding,” arXiv preprint arXiv:2501.07888, 2025

  47. [48]

    Baichuan-omni technical report,

    Y. Li, H. Sun, M. Lin, T. Li, G. Dong, T. Zhang, B. Ding, W. Song, Z. Cheng, Y. Huoet al., “Baichuan-omni technical report,” arXiv preprint arXiv:2410.08565, 2024

  48. [49]

    Ming-omni: A unified mul- timodal model for perception and generation,

    I. AI, B. Gong, C. Zou, C. Zheng, C. Zhou, C. Yan, C. Jin, C. Shen, D. Zheng, F. Wanget al., “Ming-omni: A unified mul- timodal model for perception and generation,” arXiv preprint arXiv:2506.09344, 2025

  49. [50]

    Llama- omni: Seamless speech interaction with large language models,

    Q. Fang, S. Guo, Y. Zhou, Z. Ma, S. Zhang, and Y. Feng, “Llama- omni: Seamless speech interaction with large language models,” arXiv preprint arXiv:2409.06666, 2024

  50. [51]

    Stream-omni: Simultaneous multimodal interactions with large language- vision-speech model,

    S. Zhang, S. Guo, Q. Fang, Y. Zhou, and Y. Feng, “Stream-omni: Simultaneous multimodal interactions with large language- vision-speech model,” arXiv preprint arXiv:2506.13642, 2025

  51. [52]

    Omnicaptioner: One captioner to rule them all,

    Y. Lu, J. Yuan, Z. Li, S. Zhao, Q. Qin, X. Li, L. Zhuo, L. Wen, D. Liu, Y. Caoet al., “Omnicaptioner: One captioner to rule them all,” arXiv preprint arXiv:2504.07089, 2025

  52. [53]

    Omnivinci: Enhancing ar- chitecture and data for omni-modal understanding llm,

    H. Ye, C.-H. H. Yang, A. Goel, W. Huang, L. Zhu, Y. Su, S. Lin, A.-C. Cheng, Z. Wan, J. Tianet al., “Omnivinci: Enhancing ar- chitecture and data for omni-modal understanding llm,” arXiv preprint arXiv:2510.15870, 2025

  53. [54]

    Q-Frame: Query- aware frame selection and multi-resolution adaptation for video- LLMs,

    S. Zhang, J. Yang, J. Yin, Z. Luo, and J. Luan, “Q-Frame: Query- aware frame selection and multi-resolution adaptation for video- LLMs,” arXiv preprint arXiv:2506.22139, 2025

  54. [55]

    DyCoke: Dynamic compression of tokens for fast video large language models,

    K. Tao, C. Qin, H. You, Y. Sui, and H. Wang, “DyCoke: Dynamic compression of tokens for fast video large language models,” in Proceedings of the Computer Vision and Pattern Recognition Confer- ence. Los Alamitos, CA, USA: IEEE Computer Society, 2025, pp. 18 992–19 001

  55. [56]

    Videonsa: Native sparse attention scales video un- derstanding,

    E. Song, W. Chai, S. Yang, E. Armand, X. Shan, H. Xu, J. Xie, and Z. Tu, “Videonsa: Native sparse attention scales video un- derstanding,” arXiv preprint arXiv:2510.02295, 2025

  56. [57]

    Vtimellm: Empower llm to grasp video moments,

    B. Huang, X. Wang, H. Chen, Z. Song, and W. Zhu, “Vtimellm: Empower llm to grasp video moments,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA, USA: IEEE Computer Society, 2024

  57. [58]

    Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,

    Y. Guo, J. Liu, M. Li, D. Cheng, X. Tang, D. Sui, Q. Liu, X. Chen, and K. Zhao, “Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,” inProceed- ings of the AAAI Conference on Artificial Intelligence. Palo Alto, CA, USA: AAAI Press, 2025

  58. [59]

    Distime: Distribution-based time representation for video large language models,

    Y. Zeng, Z. Huang, Y. Zhong, C. Feng, J. Hu, L. Ma, and Y. Liu, “Distime: Distribution-based time representation for video large language models,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA, USA: IEEE Computer Society, 2025

  59. [60]

    Self-chained image- language model for video localization and question answering,

    S. Yu, J. Cho, P . Yadav, and M. Bansal, “Self-chained image- language model for video localization and question answering,” inAdvances in Neural Information Processing Systems (NeurIPS). Red Hook, NY, USA: Curran Associates, Inc., 2023, affiliation: UNC Chapel Hill

  60. [61]

    Llava-mr: Large language-and-vision assistant for video moment retrieval,

    W. Lu, J. Li, A. Yu, M.-C. Chang, S. Ji, and M. Xia, “Llava-mr: Large language-and-vision assistant for video moment retrieval,” arXiv preprint arXiv:2411.14505, 2024, affiliations: Peking Univer- sity; Tencent Youtu; University at Albany; Zhejiang University. (* indicates corresponding author in the paper.)

  61. [62]

    Timesuite: Improving mllms for long video understanding via grounded tuning,

    X. Zeng, K. Li, C. Wang, X. Li, T. Jiang, Z. Yan, S. Li, Y. Shi, Z. Yue, Y. Wang, Y. Wang, Y. Qiao, and L. Wang, “Timesuite: Improving mllms for long video understanding via grounded tuning,” inInternational Conference on Learning Representations (ICLR). Online: OpenReview.net, 2025

  62. [63]

    Scanning only once: An end-to-end framework for fast temporal grounding in long videos,

    Y. Pan, X. He, B. Gong, Y. Lv, Y. Shen, Y. Peng, and D. Zhao, “Scanning only once: An end-to-end framework for fast temporal grounding in long videos,” inProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV). Los Alamitos, CA, USA: IEEE Computer Society, 2023, affiliations: Alibaba Group; Wangxuan Institute of Computer Technolog...

  63. [64]

    Trace: Temporal grounding video llm via causal event modeling,

    Y. Guo, J. Liu, M. Li, Q. Liu, X. Chen, and X. Tang, “Trace: Temporal grounding video llm via causal event modeling,” in International Conference on Learning Representations (ICLR). On- line: OpenReview.net, 2025, affiliations: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; Tencent PCG; Shenzhen Institute of Artificial I...

  64. [65]

    Tar-tvg: Enhancing vlms with timestamp anchor-constrained reasoning for temporal video grounding,

    C. Guo, X. Mo, Y. Nie, X. Xu, C. Xu, F. Yu, and C. Long, “Tar-tvg: Enhancing vlms with timestamp anchor-constrained reasoning for temporal video grounding,” arXiv preprint arXiv:2508.07683, 2025

  65. [66]

    Grounded-videollm: Sharpening fine-grained temporal grounding in video large language models,

    H. Wang, Z. Xu, Y. Cheng, S. Diao, Y. Zhou, Y. Cao, Q. Wang, W. Ge, and L. Huang, “Grounded-videollm: Sharpening fine-grained temporal grounding in video large language models,” arXiv preprint arXiv:2410.03290, 2024

  66. [67]

    Momentor: Advancing video large language model with fine-grained temporal reasoning,

    L. Qian, J. Li, Y. Wu, Y. Ye, H. Fei, T.-S. Chua, Y. Zhuang, and S. Tang, “Momentor: Advancing video large language model with fine-grained temporal reasoning,” in Proceedings of the 41st International Conference on Machine Learning (ICML). Brookline, MA, USA: PMLR, 2024

  67. [68]

    Videoperceiver: Enhancing fine-grained temporal perception in video multimodal large language models,

    F. Zhao, L. Zhang, D. Shi, Y. Gao, C. Ye, Y. Cai, J. Gao, and D. Yan, “Videoperceiver: Enhancing fine-grained temporal perception in video multimodal large language models,” arXiv preprint arXiv:2511.18823, 2025

  68. [69]

    Time-r1: Post-training large vision language model for temporal video grounding,

    Y. Wang, Z. Wang, B. Xu, Y. Du, K. Lin, Z. Xiao, Z. Yue, J. Ju, L. Zhang, D. Yang et al., “Time-r1: Post-training large vision language model for temporal video grounding,” arXiv preprint arXiv:2503.13377, 2025

  69. [70]

    Video-opd: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation,

    J. Li, H. Yin, H. Xu, B. Xu, W. Tan, Z. He, J. Ju, Z. Luo, and J. Luan, “Video-opd: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation,” arXiv preprint arXiv:2602.02994, 2026

  70. [71]

    Videozoomer: Reinforcement-learned temporal focusing for long video reasoning,

    Y. Ding, Y. Zhang, X. Lai, R. Chu, and Y. Yang, “Videozoomer: Reinforcement-learned temporal focusing for long video reasoning,” arXiv preprint arXiv:2512.22315, 2025

  71. [72]

    Datasets and recipes for video temporal grounding via reinforcement learning,

    R. Chen, T. Luo, Z. Fan, H. Zou, Z. Feng, G. Xie, H. Zhang, Z. Wang, Z. Liu, and Z. Huaijian, “Datasets and recipes for video temporal grounding via reinforcement learning,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. Stroudsburg, PA, USA: Association for Computational Linguistics, 2025, pp. 983–992

  72. [73]

    Museg: Reinforcing video temporal understanding via timestamp-aware multi-segment grounding,

    F. Luo, S. Lou, C. Chen, Z. Wang, C. Li, W. Shen, J. Guo, P. Li, M. Yan, J. Zhang, F. Huang, and Y. Liu, “Museg: Reinforcing video temporal understanding via timestamp-aware multi-segment grounding,” arXiv preprint arXiv:2505.20715, 2025

  73. [74]

    Detect anything via next point prediction,

    Q. Jiang, J. Huo, X. Chen, Y. Xiong, Z. Zeng, Y. Chen, T. Ren, J. Yu, and L. Zhang, “Detect anything via next point prediction,” arXiv preprint arXiv:2510.12798, 2025

  74. [75]

    Thinking with bounding boxes: Enhancing spatio-temporal video grounding via reinforcement fine-tuning,

    X. Gu, H. Zhang, Q. Fan, J. Niu, Z. Zhang, L. Zhang, G. Chen, F. Chen, L. Wen, and S. Zhu, “Thinking with bounding boxes: Enhancing spatio-temporal video grounding via reinforcement fine-tuning,” arXiv preprint arXiv:2511.21375, 2025

  75. [76]

    Universal instance perception as object discovery and retrieval,

    B. Yan, Y. Jiang, J. Wu, D. Wang, Z. Yuan, P. Luo, and H. Lu, “Universal instance perception as object discovery and retrieval,” in CVPR. Los Alamitos, CA, USA: IEEE Computer Society, 2023

  76. [77]

    Multimodal referring segmentation: A survey,

    H. Ding, S. Tang, S. He, C. Liu, Z. Wu, and Y.-G. Jiang, “Multimodal referring segmentation: A survey,” arXiv preprint arXiv:2508.00265, 2025

  77. [78]

    Deformable detr: Deformable transformers for end-to-end object detection,

    X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020

  78. [79]

    Lisa: Reasoning segmentation via large language model,

    X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia, “Lisa: Reasoning segmentation via large language model,” in CVPR. Los Alamitos, CA, USA: IEEE Computer Society, 2024

  79. [80]

    Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding,

    T. Zhang, X. Li, H. Fei, H. Yuan, S. Wu, S. Ji, C. L. Chen, and S. Yan, “Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding,” in NeurIPS. Red Hook, NY, USA: Curran Associates, Inc., 2024

  80. [81]

    Generalizable entity grounding via assistance of large language model,

    L. Qi, Y.-W. Chen, L. Yang, T. Shen, X. Li, W. Guo, Y. Xu, and M.-H. Yang, “Generalizable entity grounding via assistance of large language model,” arXiv preprint arXiv:2402.02555, 2024
