pith. machine review for the scientific record.

arxiv: 2604.05079 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: no theorem link

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 20:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords video question answering · multi-agent collaboration · storyline reasoning · long video understanding · cross-modal alignment · narrative construction · agent framework

The pith

SVAgent improves long video question answering by having agents collaboratively construct an evolving storyline instead of selecting isolated frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most current video question answering systems answer queries by locating relevant frames within a clip, but this often misses the coherent flow that humans use to interpret events over time. The paper introduces SVAgent, a framework in which specialized agents collaborate to build and follow a narrative representation of the video. A storyline agent assembles the story progressively while a refinement agent reviews past errors to propose better frames, and separate visual and textual agents generate predictions that a meta-agent then aligns for consistency. This design is intended to produce answers that are both more accurate and more interpretable for complex, extended video sequences. If the approach holds, it offers a path toward video understanding that treats time and context as central rather than optional.

Core claim

SVAgent is a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.

What carries the argument

A cross-modal multi-agent collaboration system in which a storyline agent builds a progressive narrative, a refinement suggestion agent uses historical failures to propose frames, cross-modal decision agents generate independent visual and textual predictions, and a meta-agent aligns those predictions.
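
Since the page describes this division of labor but not the agents' interfaces or message-passing schedule, a minimal Python sketch of one plausible reading follows. Every name here (Storyline, refinement_agent, visual_agent, textual_agent, meta_agent, svagent_answer), the failure-avoiding frame heuristic, and the disagreement-driven retry rule are illustrative assumptions, not the paper's implementation.

# Minimal sketch of an SVAgent-style collaboration loop. All interfaces
# below are assumed for illustration; the paper does not specify them.

from dataclasses import dataclass, field

@dataclass
class Storyline:
    events: list = field(default_factory=list)  # accumulated narrative beats

    def extend(self, frame_ids, question):
        # Placeholder for the storyline agent's narrative summarization.
        self.events.append(f"frames {frame_ids} summarized w.r.t. '{question}'")

def refinement_agent(failures, num_frames, k=4):
    # Stand-in for failure-driven frame suggestion: propose frames away
    # from regions that fed previously inconsistent rounds.
    banned = {f for fail in failures for f in fail}
    return [i for i in range(0, num_frames, num_frames // k) if i not in banned][:k]

def visual_agent(storyline, question):
    return "answer_from_pixels"        # modality-specific prediction (stub)

def textual_agent(storyline, question):
    return "answer_from_transcript"    # modality-specific prediction (stub)

def meta_agent(visual_ans, textual_ans):
    # Align cross-modal predictions: accept agreement, flag disagreement.
    if visual_ans == textual_ans:
        return visual_ans, True
    return visual_ans, False           # tie-break policy is assumed

def svagent_answer(question, num_frames=256, max_rounds=3):
    storyline, failures = Storyline(), []
    for _ in range(max_rounds):
        frames = refinement_agent(failures, num_frames)
        storyline.extend(frames, question)
        answer, consistent = meta_agent(
            visual_agent(storyline, question),
            textual_agent(storyline, question),
        )
        if consistent:
            return answer, storyline.events  # answer plus reasoning trace
        failures.append(frames)              # feed disagreement back in
    return answer, storyline.events

print(svagent_answer("What happens after the chase?"))

Note that in this reading the storyline's event list doubles as the explicit reasoning trace that the interpretability claim rests on.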

If this is right

  • More robust handling of temporal dynamics in long videos through explicit narrative tracking.
  • Increased interpretability because the evolving storyline provides an explicit reasoning trace.
  • Better answer consistency by reconciling independent visual and textual predictions via the meta-agent.
  • Improved performance on VideoQA tasks that require understanding of event sequences rather than isolated moments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agent structure could be tested on other sequential multimodal tasks such as audio narration or live event description.
  • Failure-driven refinement might be applied more broadly to iterative reasoning loops in agent systems beyond video.
  • If the storyline representation can be made explicit and editable, it could support interactive video analysis tools for users.

Load-bearing premise

That the progressive storyline construction by the storyline agent, informed by the refinement suggestion agent's analysis of historical failures, combined with cross-modal alignment by the meta-agent, will produce more robust and contextually grounded predictions than frame-locating methods.

What would settle it

A direct comparison on a long-video QA benchmark where SVAgent is run against standard frame-locating baselines and shows no measurable gain in answer accuracy, consistency, or human-judged interpretability.
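
As a sketch of that protocol, assuming hypothetical callables svagent(video, question) and frame_locating_baseline(video, question) plus a benchmark of (video, question, gold_answer) triples; the paired significance test and human interpretability rating the comparison would also need are noted but omitted:

# Protocol sketch only; not the paper's evaluation code.

def accuracy(system, benchmark):
    hits = sum(system(video, q) == gold for video, q, gold in benchmark)
    return hits / len(benchmark)

def refutes_the_claim(svagent, frame_locating_baseline, benchmark):
    gain = accuracy(svagent, benchmark) - accuracy(frame_locating_baseline, benchmark)
    # The claim fails if the gain is not measurably positive; in practice
    # this verdict needs a paired significance test and a matched human
    # rating of interpretability on the same items.
    return gain <= 0.0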

Figures

Figures reproduced from arXiv: 2604.05079 by Shuo Zhan, Tan Yue, Wei Pang, Yingfang Yuan, Zhongyu Yang, Zuhao Yang.

Figure 1
Figure 1. Comparison of existing video reasoning paradigms and our proposed framework. (a) Caption-based methods rely on text summaries, which often result in temporal collapse and insufficient visual grounding. (b) Keyframe retrieval methods focus on top-k evidence, often leading to fragmented reasoning. (c) Event/graph-based approaches depend on fixed structures, which limits adaptability. (d) We propose SVAgent, … view at source ↗
Figure 2
Figure 2. Overview of SVAgent, which consists of six interacting agents: (1) a Storyline Agent that summarizes sampled frames into a query-oriented narrative, (2) a Hypothesis Agent that proposes and evaluates hypotheses while selecting evidence frames via Determinantal Point Processes (DPPs), (3) two Cross-Modal Decision Agents that respectively conduct modality-specific reasoning to produce decisions, (4) a Meta-D… view at source ↗
Figure 3
Figure 3. Parameter and Latency Analysis on Video-MME. view at source ↗
Figure 5
Figure 5. Case study showing how SVAgent maintains coherent temporal reasoning and consistent cross-modal verification. By combining … view at source ↗
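
Figure 2 notes that evidence frames are selected via Determinantal Point Processes, and reference [4] points to fast greedy MAP inference for DPPs. As a rough illustration of how DPP selection trades frame relevance against diversity, here is a naive greedy MAP sketch in numpy; the paper's actual kernel construction and selection algorithm are not given in the excerpt, so everything below is assumed.

import numpy as np

def greedy_dpp_map(kernel, k):
    """Greedy MAP inference for a DPP: repeatedly add the frame whose
    inclusion maximizes the log-determinant of the selected kernel
    submatrix, trading off frame quality (diagonal entries) against
    redundancy with already-selected frames (off-diagonal entries)."""
    n = kernel.shape[0]
    selected = []
    for _ in range(k):
        best_j, best_gain = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            sub = kernel[np.ix_(selected + [j], selected + [j])]
            sign, logdet = np.linalg.slogdet(sub)
            if sign > 0 and logdet > best_gain:
                best_gain, best_j = logdet, j
        if best_j is None:
            break
        selected.append(best_j)
    return selected

# Toy example: kernel from random frame "embeddings"; near-duplicate
# frames are penalized, so the selection spreads out.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
L = emb @ emb.T + 1e-3 * np.eye(10)   # positive-definite similarity kernel
print(greedy_dpp_map(L, k=4))
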
read the original abstract

Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. A storyline agent progressively builds a narrative representation from frames suggested by a refinement suggestion agent that analyzes historical failures. Cross-modal decision agents generate independent predictions from visual and textual streams under storyline guidance, and a meta-agent aligns these outputs to improve robustness and consistency. The central claim is that this architecture emulates human-like storyline reasoning and thereby delivers superior performance and interpretability relative to frame-locating baselines.

Significance. If the experimental results hold, the work could meaningfully shift VideoQA research toward agent-based, storyline-centric reasoning rather than isolated frame retrieval. The multi-agent decomposition offers a concrete mechanism for incorporating failure-driven refinement and cross-modal alignment, which may improve robustness on long, narratively complex videos and provide a more interpretable reasoning trace than end-to-end models.

major comments (1)
  1. [Abstract] The claim that 'Experimental results demonstrate that SVAgent achieves superior performance and interpretability' is unsupported in the abstract by any metrics, baselines, datasets, error analysis, or implementation details. This absence is load-bearing for the central assertion of superiority over frame-locating methods.
minor comments (2)
  1. The roles and interaction protocol among the storyline agent, refinement suggestion agent, cross-modal decision agents, and meta-agent are described at a high level only; a diagram or pseudocode of the message-passing schedule would improve clarity and reproducibility.
  2. The abstract introduces several new agent names without referencing prior multi-agent VideoQA literature; a brief positioning paragraph would help readers assess novelty.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and outline the planned revisions.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'Experimental results demonstrate that SVAgent achieves superior performance and interpretability' is unsupported in the abstract by any metrics, baselines, datasets, error analysis, or implementation details. This absence is load-bearing for the central assertion of superiority over frame-locating methods.

    Authors: We agree that the abstract would be strengthened by including more specific supporting details for the performance claim. While the full manuscript provides quantitative results, dataset descriptions, baseline comparisons (including frame-locating methods), error analyses, and implementation details in the Experiments section, the abstract currently summarizes these at a high level. In the revised version, we will update the abstract to briefly reference the key datasets, the magnitude of improvements over baselines, and the interpretability aspects to better ground the claim of superiority without exceeding length constraints. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper proposes SVAgent as a multi-agent architectural framework for VideoQA, with agents for progressive storyline construction, failure analysis, cross-modal prediction, and meta-alignment. No equations, fitted parameters, self-referential definitions, or mathematical derivations are present in the provided text. Claims of superior performance rest on experimental demonstration rather than any reduction to prior inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are visible. The derivation chain is self-contained as a descriptive system design without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 4 invented entities

The central claim rests on the effectiveness of newly introduced agent components for narrative construction and cross-modal alignment, with no mathematical derivations, data fits, or external benchmarks provided in the abstract.

invented entities (4)
  • Storyline agent · no independent evidence
    purpose: Progressively constructs a narrative representation based on frames
    Core new component of the proposed framework.
  • Refinement suggestion agent · no independent evidence
    purpose: Analyzes historical failures to suggest frames
    New component introduced to improve storyline construction iteratively.
  • Cross-modal decision agents · no independent evidence
    purpose: Independently predict answers from visual and textual modalities under storyline guidance
    New agents for separate modality processing.
  • Meta-agent · no independent evidence
    purpose: Evaluates and aligns cross-modal predictions for consistency
    New component for final decision making.

pith-pipeline@v0.9.0 · 5487 in / 1400 out tokens · 39863 ms · 2026-05-10T20:16:15.517414+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.

Reference graph

Works this paper leans on

63 extracted references · 34 canonical work pages · cited by 1 Pith paper · 9 internal anchors

  1. [1]

    Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens

    Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413, 2024.

  2. [2]

    Qwen2.5-VL technical report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  3. [3]

    Minigpt-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.

  4. [4]

    Fast greedy map inference for determinantal point process to improve recommendation diversity

    Laming Chen, Guoxin Zhang, and Eric Zhou. Fast greedy map inference for determinantal point process to improve recommendation diversity. Advances in Neural Information Processing Systems, 31, 2018.

  5. [5]

    Bring reason to vision: Understanding perception and reasoning through model merging

    Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, and Junxian He. Bring reason to vision: Understanding perception and reasoning through model merging. In Forty-second International Conference on Machine Learning, 2025.

  6. [6]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.

  7. [7]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024.

  8. [8]

    Videoagent: A memory-augmented multimodal agent for video understanding

    Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory-augmented multimodal agent for video understanding. arXiv preprint arXiv:2403.11481, 2024.

  9. [9]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024.

  10. [10]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

  11. [11]

    M-llm based video frame selection for efficient video understanding

    Kai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin, et al. M-llm based video frame selection for efficient video understanding. In CVPR, 2025.

  12. [12]

    A comprehensive study on visual token redundancy for discrete diffusion-based multimodal large language models

    Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, and Shijian Lu. A comprehensive study on visual token redundancy for discrete diffusion-based multimodal large language models. arXiv preprint arXiv:2511.15098, 2025.

  13. [13]

    Todre: Effective visual token pruning via token diversity and task relevance

    Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, and Shijian Lu. Todre: Effective visual token pruning via token diversity and task relevance. arXiv preprint arXiv:2505.18757, 2025.

  14. [14]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.

  15. [15]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024.

  16. [16]

    Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025.

  17. [17]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.

  18. [18]

    Llava-next: Improved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024.

  19. [19]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.

  20. [20]

    Mixing importance with diversity: Joint optimization for kv cache compression in large vision-language models

    Xuyang Liu, Xiyan Gui, Yuchao Zhang, and Linfeng Zhang. Mixing importance with diversity: Joint optimization for kv cache compression in large vision-language models. arXiv preprint arXiv:2510.20707, 2025.

  21. [21]

    Video compression commander: Plug-and-play inference acceleration for video large language models

    Xuyang Liu, Yiyu Wang, Junpeng Ma, and Linfeng Zhang. Video compression commander: Plug-and-play inference acceleration for video large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1910–1924, 2025.

  22. [22]

    Shifting ai efficiency from model-centric to data-centric compression

    Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Tailai Chen, Xiangqi Jin, Chang Zou, Yiyu Wang, et al. Shifting ai efficiency from model-centric to data-centric compression. arXiv preprint arXiv:2505.19147, 2025.

  23. [23]

    Global compression commander: Plug-and-play inference acceleration for high-resolution large vision-language models

    Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Siteng Huang, and Honggang Chen. Global compression commander: Plug-and-play inference acceleration for high-resolution large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7350–7358, 2026.

  24. [24]

    Videomind: A chain-of-lora agent for long video reasoning

    Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. Videomind: A chain-of-lora agent for long video reasoning. arXiv preprint arXiv:2503.13444, 2025.

  25. [25]

    Valley: Video assistant with large language model enhanced ability

    Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207, 2023.

  26. [27]

    Video-rag: Visually-aligned retrieval-augmented long video comprehension

    Yongdong Luo, Xiawu Zheng, Xiao Yang, Guilin Li, Haojia Lin, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, and Rongrong Ji. Video-rag: Visually-aligned retrieval-augmented long video comprehension. arXiv preprint arXiv:2411.13093, 2024.

  27. [28]

    Gpt-4o system card, 2024

    OpenAI. Gpt-4o system card, 2024.

  28. [29]

    Actionart: Advancing multimodal large models for fine-grained human-centric video understanding

    Yi-Xing Peng, Qize Yang, Yu-Ming Tang, Shenghao Fu, Kun-Yu Lin, Xihan Wei, and Wei-Shi Zheng. Actionart: Advancing multimodal large models for fine-grained human-centric video understanding. arXiv preprint arXiv:2504.18152, 2025.

  29. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  30. [31]

    Videorag: Retrieval-augmented generation with extreme long-context videos, 2025

    Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. Videorag: Retrieval-augmented generation with extreme long-context videos. arXiv preprint arXiv:2502.01549, 2025.

  31. [32]

    Longvu: Spatiotemporal adaptive compression for long video-language understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024.

  32. [33]

    Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding

    Xiaoqian Shen, Wenxuan Zhang, Jun Chen, and Mohamed Elhoseiny. Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  33. [34]

    Video understanding with large language models: A survey

    Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 2025.

  34. [35]

    Qwen3-vl system card, 2025

    Qwen Team. Qwen3-vl system card, 2025.

  35. [36]

    Qwen3-vl

    Qwen3-VL Team. Qwen3-vl. https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list.

  36. [37]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860, 2024.

  37. [38]

    Videoitg: Multimodal video understanding with instructed temporal grounding

    Shihao Wang, Guo Chen, De-An Huang, Zhiqi Li, Minghan Li, Guilin Liu, Jose M. Alvarez, Lei Zhang, and Zhiding Yu. Videoitg: Multimodal video understanding with instructed temporal grounding. arXiv preprint arXiv:2507.13353, 2025.

  38. [39]

    Lvbench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, et al. Lvbench: An extreme long video understanding benchmark. In ICCV, 2025.

  39. [40]

    Videoagent: Long-form video understanding with large language model as agent

    Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. arXiv preprint arXiv:2403.10517, 2024.

  40. [41]

    Accelerating streaming video large language models via hierarchical token compression

    Yiyu Wang, Xuyang Liu, Xiyan Gui, Xinying Lin, Boxue Yang, Chenfei Liao, Tailai Chen, and Linfeng Zhang. Accelerating streaming video large language models via hierarchical token compression. arXiv preprint arXiv:2512.00891, 2025.

  41. [42]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2025.

  42. [43]

    Vca: Video curious agent for long video understanding

    Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, and Chuang Gan. Vca: Video curious agent for long video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20168–20179, 2025.

  43. [44]

    Wikiautogen: Towards multi-modal wikipedia-style article generation

    Zhongyu Yang, Jun Chen, Dannong Xu, Junjie Fei, Xiaoqian Shen, Liangbing Zhao, Chun-Mei Feng, and Mohamed Elhoseiny. Wikiautogen: Towards multi-modal wikipedia-style article generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15532–15541, 2025.

  44. [45]

    MERMAID: Multi-perspective self-reflective agents with generative augmentation for emotion recognition

    Zhongyu Yang, Junhao Song, Siyang Song, Wei Pang, and Yingfang Yuan. MERMAID: Multi-perspective self-reflective agents with generative augmentation for emotion recognition. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24650–24666, Suzhou, China, 2025. Association for Computational Linguistics.

  45. [46]

    Longvt: Incentivizing "thinking with long videos" via native tool calling

    Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, et al. Longvt: Incentivizing "thinking with long videos" via native tool calling. arXiv preprint arXiv:2511.20785, 2025.

  46. [47]

    Script: Graph-structured and query-conditioned semantic token pruning for multimodal large language models

    Zhongyu Yang, Dannong Xu, Wei Pang, and Yingfang Yuan. Script: Graph-structured and query-conditioned semantic token pruning for multimodal large language models. Transactions on Machine Learning Research, 2025.

  47. [48]

    Timeexpert: An expert-guided video llm for video temporal grounding

    Zuhao Yang, Yingchen Yu, Yunqing Zhao, Shijian Lu, and Song Bai. Timeexpert: An expert-guided video llm for video temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 24286–24296, 2025.

  48. [49]

    Xr: Cross-modal agents for composed image retrieval

    Zhongyu Yang, Wei Pang, and Yingfang Yuan. Xr: Cross-modal agents for composed image retrieval. arXiv preprint arXiv:2601.14245, 2026.

  49. [50]

    Inex: Hallucination mitigation via introspection and cross-modal multi-agent collaboration, 2026

    Zhongyu Yang, Yingfang Yuan, Xuanming Jiang, Baoyi An, and Wei Pang. Inex: Hallucination mitigation via introspection and cross-modal multi-agent collaboration, 2026.

  50. [51]

    Think with videos for agentic long-video understanding, 2025

    Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, and Zhicheng Dou. Think with videos for agentic long-video understanding, 2025.

  51. [52]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training.

  52. [53]

    Long-clip: Unlocking the long-text capability of clip

    Beichen Zhang, Pan Zhang, Xiaowen Dong, Yuhang Zang, and Jiaqi Wang. Long-clip: Unlocking the long-text capability of clip. In European Conference on Computer Vision.

  53. [54]

    Openmmreasoner: Pushing the frontiers for multimodal reasoning with an open and general recipe

    Kaichen Zhang, Keming Wu, Zuhao Yang, Bo Li, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, and Lidong Bing. Openmmreasoner: Pushing the frontiers for multimodal reasoning with an open and general recipe. arXiv preprint arXiv:2511.16334, 2025.

  54. [55]

    OmAgent: A multi-modal agent framework for complex video understanding with task divide-and-conquer

    Lu Zhang, Tiancheng Zhao, Heting Ying, Yibo Ma, and Kyusong Lee. OmAgent: A multi-modal agent framework for complex video understanding with task divide-and-conquer. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10031–10045, Miami, Florida, USA, 2024. Association for Computational Linguistics.

  55. [56]

    Q-frame: Query-aware frame selection and multi-resolution adaptation for video-llms

    Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, and Jian Luan. Q-frame: Query-aware frame selection and multi-resolution adaptation for video-llms. In ICCV, 2025.

  56. [57]

    Deep video discovery: Agentic search with tool use for long-form video understanding

    Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Deep video discovery: Agentic search with tool use for long-form video understanding. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  57. [58]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.

  58. [59]

    LLaVA-video: Video instruction tuning with synthetic data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-video: Video instruction tuning with synthetic data. Transactions on Machine Learning Research, 2025.

  59. [60]

    Humanomni: A large vision-speech language model for human-centric video understanding

    Jiaxing Zhao, Qize Yang, Yixing Peng, Detao Bai, Shimin Yao, Boyuan Sun, Xiang Chen, Shenghao Fu, Xihan Wei, Liefeng Bo, et al. Humanomni: A large vision-speech language model for human-centric video understanding. arXiv preprint arXiv:2501.15111, 2025.

  60. [61]

    Videoagent2: Enhancing the llm-based agent system for long-form video understanding by uncertainty-aware cot

    Zhuo Zhi, Qiangqiang Wu, Minghe Shen, Wenbo Li, Yinchuan Li, Kun Shao, and Kaiwen Zhou. Videoagent2: Enhancing the llm-based agent system for long-form video understanding by uncertainty-aware cot. arXiv preprint arXiv:2504.04471, 2025.

  61. [62]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024.

  62. [63]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

  63. [64]

    Videotree: Adaptive tree-based video representation for llm reasoning on long videos

    Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. Videotree: Adaptive tree-based video representation for llm reasoning on long videos. arXiv preprint arXiv:2405.19209, 2024.